Full Text Search using MongoDB

If you build a web service around content, it is only a matter of time before you need to add search functionality.

Depending on the shape of your data, you may take the easiest route: search using a regular expression against some field (title, name, etc.). But what if you have more than one field? What if the data is spread across several collections? You would probably also want a ‘search rank,’ i.e. a measure of how close the match is.

At that point the simple solution will not work anymore. Or it won’t be simple.

The ‘classic’ solution to this problem is to pick something like Elasticsearch or Apache Lucene. But if you are already running on MongoDB, there is no need to introduce another dependency into your stack. You can use MongoDB Full-Text Search and the MongoDB Aggregation Framework to build powerful search functionality.

Getting Started

I highly recommend creating a local database, importing the sample data, and following along step by step. I crafted a small data set. The data is normalized on purpose: I chose the “worst” case to reflect a real-world application.

Just connect to some sandbox database locally (you can pick any other name):

mongo "mongodb://localhost:27017/sandbox"

Then create a few books:

db.books.insertMany([{
  _id: "lord_of_the_rings",
  title: "The Lord of the Rings",
  country: "United Kingdom",
  language: "english",
  author_id: "tolkien",
  genres: "fantasy, adventure"
},
{
  _id: "hobbit",
  title: "The Hobbit",
  country: "United Kingdom",
  language: "english",
  author_id: "tolkien",
  genres: "juvenile fantasy, high fantasy"
},
{
  _id: "harry_potter",
  title: "Harry Potter and the Philosopher's Stone",
  country: "United Kingdom",
  language: "english",
  author_id: "rowling",
  genres: "fantasy"
},
{
  _id: "castle",
  title: "The Castle",
  country: "Czechoslovakia",
  language: "german",
  author_id: "kafka",
  genres: "philosophical fiction, dystopian novel, political fiction, comedy"
},
{
  _id: "atlas_shrugged",
  title: "Atlas Shrugged",
  country: "United States",
  language: "english",
  author_id: "rand",
  genres: "Philosophical fiction, Science fiction, Mystery fiction, Romance novel, Utopia"
}])

And authors:

db.authors.insertMany([{
  _id: "tolkien",
  name: "John Ronald Reuel Tolkien",
  nationality: "British",
  genres: "fantasy, high fantasy, translation, literary criticism"
},
{
  _id: "rowling",
  name: "Joanne Rowling",
  nationality: "British",
  genres: "fantasy, drama, young adult fiction, tragicomedy, crime fiction"
},
{
  _id: "kafka",
  name: "Franz Kafka",
  nationality: "Czech",
  genres: "philosophical novella, absurdist fiction"
},
{
  _id: "rand",
  name: "Ayn Rand",
  nationality: "American",
  genres: "philosophical novella, absurdist fiction"
}])

Now we are ready to go further.

Given this dataset, we can already look for books using a dead-simple regular expression. The ‘i’ at the end means that we want the search to be case-insensitive.

var searchTerm = /hobbit/i;
db.books.find({ $or : [
  { title : searchTerm },
  { genres : searchTerm }
]})

The same works just as well if we are interested in fantasy in general.

var searchTerm = /fantasy/i;
db.books.find({ $or : [
  { title : searchTerm },
  { genres : searchTerm }
]})
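If the search term comes from user input rather than a literal, it has to be escaped before it becomes a regular expression. Here is a minimal sketch in plain JavaScript; the helpers escapeRegex and toSearchRegex are hypothetical names for this example, not part of MongoDB:

```javascript
// Hypothetical helper: escape characters that have a special meaning in a regex.
function escapeRegex(input) {
  return input.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical helper: build a case-insensitive pattern from raw user input.
function toSearchRegex(input) {
  return new RegExp(escapeRegex(input.trim()), "i");
}

var searchTerm = toSearchRegex("philosopher's stone");
// searchTerm can now replace the literal /hobbit/i in the queries above.
```

Without the escaping step, input such as "c++" would be rejected as an invalid pattern.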

However, this approach might be slow, especially if we add a few more fields. Also, it probably makes sense to order results by some criterion, e.g. matches by title go first, followed by matches by genres.

To get that, we would need to run a separate search for each field and then combine the results manually.
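To see what combining results manually involves, here is a rough in-memory sketch in plain JavaScript. It assumes the documents have already been fetched into an array; rankByField and its ranking rule are invented for illustration:

```javascript
// Hypothetical helper: rank each document by the most important field that matches.
// `fields` is ordered from most to least important.
function rankByField(docs, regex, fields) {
  var ranked = [];
  docs.forEach(function (doc) {
    var position = fields.findIndex(function (field) {
      return regex.test(String(doc[field] || ""));
    });
    if (position !== -1) {
      ranked.push({ doc: doc, rank: fields.length - position });
    }
  });
  // Higher rank first.
  return ranked.sort(function (a, b) { return b.rank - a.rank; });
}

var sample = [
  { _id: "lord_of_the_rings", title: "The Lord of the Rings", genres: "fantasy, adventure" },
  { _id: "hobbit", title: "The Hobbit", genres: "juvenile fantasy, high fantasy" }
];
var manualResults = rankByField(sample, /hobbit|adventure/i, ["title", "genres"]);
// "hobbit" matches by title and ranks above "lord_of_the_rings", which matches by genres.
```

Even this toy version needs its own ranking rule and a full pass over every document, which is the bookkeeping a text index takes over.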

Fortunately, there is a much better approach.

MongoDB supports text indexes that enable Full Text Search. The great thing about text indexes: they provide textScore, which is controlled via weights. Here is an example:

db.books.createIndex(
{
  title : "text",
  genres: "text",
  language: "text"
},
{
  weights: {
    title: 10,
    genres: 5,
    language: 5
  },
  name: "TextIndex"
})

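Please notice the weights entry. The numbers are arbitrary; they only express that a match in the title is more relevant than a match in genres or language. One caveat: by default MongoDB also reads a field named language as the document's language for stemming and stop words, so its values must be language names that MongoDB supports.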

Now we can use the powerful $text operator. The result carries some ‘hidden’ metadata, from which we can read the textScore.

var searchTerm = "hobbit";
db.books.find(
{
  $text : { $search : searchTerm }
},
{
  searchRank : { $meta : "textScore" }
})

Output:

{
  "_id" : "hobbit",
  "title" : "The Hobbit",
  "country" : "United Kingdom",
  "language" : "english",
  "author_id" : "tolkien",
  "genres" : "juvenile fantasy, high fantasy",
  "searchRank" : 10
}

We can sort results by searchRank. A broader term such as “fantasy” matches several books, so the ordering becomes visible:

var searchTerm = "fantasy";
db.books.find(
{
  $text : { $search : searchTerm }
},
{
  searchRank : { $meta : "textScore" }
}).sort( { searchRank: { $meta: "textScore" } } )

We will get the following:

{
  "_id" : "hobbit",
  "title" : "The Hobbit",
  "country" : "United Kingdom",
  "language" : "english",
  "author_id" : "tolkien",
  "genres" : "juvenile fantasy, high fantasy",
  "searchRank" : 5.625
}
{
  "_id" : "harry_potter",
  "title" : "Harry Potter and the Philosopher's Stone",
  "country" : "United Kingdom",
  "language" : "english",
  "author_id" : "rowling",
  "genres" : "fantasy",
  "searchRank" : 5
}
{
  "_id" : "lord_of_the_rings",
  "title" : "The Lord of the Rings",
  "country" : "United Kingdom",
  "language" : "english",
  "author_id" : "tolkien",
  "genres" : "fantasy, adventure",
  "searchRank" : 3.75
}
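The scores in these listings can be reproduced by hand. The sketch below uses a scoring rule inferred from the numbers shown here. It is an assumption, not a documented MongoDB contract, and it ignores stemming. Tokens are supplied manually, and stop words such as "the" are assumed to be dropped before scoring:

```javascript
// Assumed rule (inferred from the outputs above, not an official API):
//   score = weight * frequency * (0.5 + 0.5 * matches / tokenCount)
// where frequency = 1 + 1/2 + 1/4 + ... with one term per occurrence.
function fieldScore(tokens, term, weight) {
  var matches = 0;
  var frequency = 0;
  tokens.forEach(function (token) {
    if (token === term) {
      frequency += 1 / Math.pow(2, matches);
      matches += 1;
    }
  });
  if (matches === 0) return 0;
  return weight * frequency * (0.5 + 0.5 * matches / tokens.length);
}

var hobbitTitle = fieldScore(["hobbit"], "hobbit", 10);                                   // 10
var hobbitGenres = fieldScore(["juvenile", "fantasy", "high", "fantasy"], "fantasy", 5);  // 5.625
var potterGenres = fieldScore(["fantasy"], "fantasy", 5);                                 // 5
var ringsGenres = fieldScore(["fantasy", "adventure"], "fantasy", 5);                     // 3.75
```

Under this rule a field scores higher when the term makes up a larger share of its tokens, which is why the single-genre "fantasy" entry outranks "fantasy, adventure".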

It looks awesome and is definitely much better than the naive regex-based search. But there is another problem:

var searchTerm = "tolkien";
db.books.find(
{
  $text : { $search : searchTerm }
},
{
  searchRank : { $meta : "textScore" }
})

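No results: the text index on books covers only title, genres, and language, while the author's name lives in the authors collection. Let’s fix that.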

Search Index

Here the Aggregation Framework comes into play.

To search through authors, we can build another collection that will serve as a dedicated search index. Its documents will have the following form:

{
  "book_id": "hobbit",
  "title": "The Hobbit",
  "genres": "juvenile fantasy, high fantasy",
  "author_name": "John Ronald Reuel Tolkien",
  "author_genres": "fantasy, high fantasy, translation, literary criticism"
}

To create this collection, we will use an aggregation pipeline. The algorithm is the following:

  • take all books;
  • extend each book with information about an author;
  • change representation to rename or remove some fields;
  • save results into another collection.

It can be expressed using ‘aggregation language’:

db.books.aggregate([
  {
    $lookup : {
      from : "authors",
      localField: "author_id",
      foreignField: "_id",
      as : "author"
    }
  },
  {
    $unwind : "$author"
  },
  {
    $project : {
      book_id : "$_id",
      title : "$title",
      genres : "$genres",
      author_name : "$author.name",
      author_genres : "$author.genres"
    }
  },
  {
    $out : "book_search_index"
  }
])

Let’s look at each stage in the pipeline.

$lookup joins two collections, using localField and foreignField as the join keys. Its result is an array; in our case, the array always contains exactly one element.

To get rid of the array we use $unwind, which creates a copy of a document for each entry.

Then we use $project to shape the data.

At the very end of the pipeline, we use the $out stage, which writes all results into another collection. If that collection already exists, $out replaces its contents; indexes that existed on it are kept.
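To make the stages concrete, here is roughly what they do to the data, written as plain JavaScript over in-memory arrays. This only illustrates the data flow, not how MongoDB executes a pipeline, and the function names are made up for the example:

```javascript
// $lookup: attach an array of matching authors to every book.
function lookup(books, authors) {
  return books.map(function (book) {
    var matches = authors.filter(function (a) { return a._id === book.author_id; });
    return Object.assign({}, book, { author: matches });
  });
}

// $unwind: emit one copy of the document per element of the `author` array.
function unwind(docs) {
  var out = [];
  docs.forEach(function (doc) {
    doc.author.forEach(function (author) {
      out.push(Object.assign({}, doc, { author: author }));
    });
  });
  return out;
}

// $project: keep _id and pick or rename the remaining fields.
function project(docs) {
  return docs.map(function (doc) {
    return {
      _id: doc._id,
      book_id: doc._id,
      title: doc.title,
      genres: doc.genres,
      author_name: doc.author.name,
      author_genres: doc.author.genres
    };
  });
}

var books = [{ _id: "hobbit", title: "The Hobbit", author_id: "tolkien",
               genres: "juvenile fantasy, high fantasy" }];
var authors = [{ _id: "tolkien", name: "John Ronald Reuel Tolkien",
                 genres: "fantasy, high fantasy, translation, literary criticism" }];
var indexDocs = project(unwind(lookup(books, authors)));
```

As with $unwind's default behavior, a book whose author cannot be found yields an empty array and drops out of the result.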

Now, given the new collection, we can create a text index on it as we did initially with books.

db.book_search_index.createIndex(
{
  title : "text",
  genres : "text",
  author_name : "text",
  author_genres : "text"
},
{
  weights: {
    title: 10,
    genres: 6,
    author_name: 4,
    author_genres: 1,
  },
  name: "TextIndex"
})

Again, the weights are arbitrary. Let’s re-run the previous searches.

var searchTerm = "hobbit";
db.book_search_index.find(
{
  $text : { $search : searchTerm }
},
{
  searchRank : { $meta : "textScore" }
})

Output:

{
  "_id" : "hobbit",
  "book_id" : "hobbit",
  "title" : "The Hobbit",
  "genres" : "juvenile fantasy, high fantasy",
  "author_name" : "John Ronald Reuel Tolkien",
  "author_genres" : "fantasy, high fantasy, translation, literary criticism",
  "searchRank" : 10
}

One more:

var searchTerm = "tolkien";
db.book_search_index.find(
{
  $text : { $search : searchTerm }
},
{
  searchRank : { $meta : "textScore" }
})

Output:

{
  "_id" : "lord_of_the_rings",
  "book_id" : "lord_of_the_rings",
  "title" : "The Lord of the Rings",
  "genres" : "fantasy, adventure",
  "author_name" : "John Ronald Reuel Tolkien",
  "author_genres" : "fantasy, high fantasy, translation, literary criticism",
  "searchRank" : 2.5
}
{
  "_id" : "hobbit",
  "book_id" : "hobbit",
  "title" : "The Hobbit",
  "genres" : "juvenile fantasy, high fantasy",
  "author_name" : "John Ronald Reuel Tolkien",
  "author_genres" : "fantasy, high fantasy, translation, literary criticism",
  "searchRank" : 2.5
}

Awesome. However, we are getting back documents from book_search_index, not from books. We can easily fix this using the Aggregation Framework:

var searchTerm = "tolkien";
db.book_search_index.aggregate([
  {
    $match : {
      $text : { $search : searchTerm }
    }
  },
  {
    $addFields : {
      searchRank : { $meta : "textScore" }
    }
  },
  {
    $lookup : {
      from : "books",
      localField : "book_id",
      foreignField : "_id",
      as : "book"
    }
  },
  {
    $unwind : "$book"
  },
  {
    $addFields : {
      "book.searchRank" : "$searchRank"
    }
  },
  {
    $replaceRoot : {
      newRoot : "$book"
    }
  }
])

Result:

{
  "_id" : "lord_of_the_rings",
  "title" : "The Lord of the Rings",
  "country" : "United Kingdom",
  "language" : "english",
  "author_id" : "tolkien",
  "genres" : "fantasy, adventure",
  "searchRank" : 2.5
}
{
  "_id" : "hobbit",
  "title" : "The Hobbit",
  "country" : "United Kingdom",
  "language" : "english",
  "author_id" : "tolkien",
  "genres" : "juvenile fantasy, high fantasy",
  "searchRank" : 2.5
}
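This pipeline does not sort its results explicitly. A possible refinement is to sort by searchRank and cap the number of results before joining with books. The sketch below is a plain JavaScript function that only builds the pipeline array; buildSearchPipeline and its limit parameter are names invented for this example:

```javascript
// Hypothetical helper: same stages as above, plus $sort and $limit.
function buildSearchPipeline(searchTerm, limit) {
  return [
    { $match: { $text: { $search: searchTerm } } },
    { $addFields: { searchRank: { $meta: "textScore" } } },
    // Best matches first, then keep only the top `limit` before the join.
    { $sort: { searchRank: -1 } },
    { $limit: limit },
    { $lookup: { from: "books", localField: "book_id", foreignField: "_id", as: "book" } },
    { $unwind: "$book" },
    { $addFields: { "book.searchRank": "$searchRank" } },
    { $replaceRoot: { newRoot: "$book" } }
  ];
}

var pipeline = buildSearchPipeline("tolkien", 10);
// In the shell: db.book_search_index.aggregate(pipeline)
```

Sorting and limiting before $lookup means the join only runs for the documents that will actually be returned.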

For this to work correctly, we just need to keep book_search_index up to date. The rebuild can be triggered by a cron task or after each change in the database.
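One way to refresh the index after a single change, without rebuilding everything, is to run the same pipeline for one book and finish with $merge instead of $out. This is a sketch under assumptions: $merge requires MongoDB 4.2 or newer, and buildIndexPipeline is a hypothetical helper rather than part of the setup above:

```javascript
// Hypothetical helper: build the indexing pipeline for one book or for all books.
function buildIndexPipeline(bookId) {
  var stages = [];
  // Restrict the refresh to one book when an id is given; otherwise process all.
  if (bookId !== undefined) {
    stages.push({ $match: { _id: bookId } });
  }
  stages.push(
    { $lookup: { from: "authors", localField: "author_id", foreignField: "_id", as: "author" } },
    { $unwind: "$author" },
    { $project: { book_id: "$_id", title: "$title", genres: "$genres",
                  author_name: "$author.name", author_genres: "$author.genres" } },
    // Upsert into the index collection instead of replacing it wholesale.
    { $merge: { into: "book_search_index", on: "_id",
                whenMatched: "replace", whenNotMatched: "insert" } }
  );
  return stages;
}

var refreshOne = buildIndexPipeline("hobbit");
var refreshAll = buildIndexPipeline();
// In the shell: db.books.aggregate(refreshOne)
```

Note that $merge never removes documents, so deleted books would still need to be removed from book_search_index separately, and a change to an author requires refreshing every book by that author.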

That’s it

Full-text search is a very common requirement in software development. The implementation itself is not straightforward, but nowadays one can integrate a working solution in a couple of hours.
