We recently produced an app (built in Rails 4 with a Postgres database) for a client who wanted an app-wide, full-text search. The search had to retrieve records of different classes, rank results by relevance to the search-term, and be fast.
There are a number of ways to approach this problem, not all of which will result in a 'fast' search. Here's how we did it using Postgres tsvector and tsquery.
What are we actually trying to do?
- When certain fields in our database are updated (or created), we want to create a list of words used in the content of those fields.
- We want to ignore words like 'and', 'a', 'it' etc.
- We want to make records containing words in certain fields, rank more highly/less high in search results than records with the same words contained in different fields
- We need to associate a particular record with our list of words used in its fields
- So that when a user searches for a particular word or words, we can return the relevant records
Deciding on the models/fields we wanted the search to return
Because the search function was site-wide, the results needed to include more than one type of record. For the purposes of this example, assume our app has two models*:
- Ship
- Operator
And let's say each of our two models has three attributes that we wish to make 'searchable':
- Title
- Description
- Booking info
If a user searches for 'American', we need ship and operator records containing that word in a title, description or booking info field to be returned in our list of results.
*It doesn't matter how these models are related for the moment.
Deciding on which fields are more important in search results.
Let's say we have three ship records:
- Ship 1
- Ship 2
- Ship 3
Now say each ship record contains the word 'American':
- Ship 1 - in its title field
- Ship 2 - in its description field
- Ship 3 - in its booking info field
If a user searches for 'American', we want Ship 1 to rank higher in the search results than 'Ship 3'. We've decided, for the purposes of this example, that an instance of a word in a title field is more relevant to a search term than an instance of a word in a booking info field.
We used a weighting system of A, B, C and D to assign importance to the fields we wanted to search. With A being the most and D being the least important.
- Title (weight = A)
- Description (weight = B)
- Booking info (weight = C)
Using the PostgreSQL 'to_tsvector()' function
Now we've decided on the models and the fields we want to make searchable, it's time to head to Postgres. We need to create a PostgreSQL to_tsvector function to process our fields on update.
"to_tsvector() parses a textual document into tokens, reduces the tokens to lexemes, and returns a tsvector which lists the lexemes together with their positions in the document."
to_tsvector('english', 'some text to process')
The to_tsvector function takes two arguments - a language (which defaults to 'english') and the text to process. Because our text will always be in English, we can omit the first argument.
to_tsvector('some text to process')
The to_tsvector() function will return NULL if the text passed to it is NULL. Because of this, we pass our text through the coalesce() function before passing the result of that to to_tsvector. If the text in our field is NULL, then the second argument we pass to coalesce() is returned. Coalesce will only return NULL if ALL of it's arguments are NULL.
to_tsvector(coalesce('some text to process', ''))
Finally, we want to assign the grade from our weighting system to the results of the t_tsvector function. We do this using the setweight function.
setweight(to_tsvector(coalesce('some text to process', '')), 'A')
Storing the results of the to_tsvector() function
Using our ship model as an example, we've added a 'tsv' column to ships to store the results of to_tsvector().
class AddTsvToShips < ActiveRecord::Migration
def change
add_column :ships, :tsv, :tsvector
end
end
Most simple Rails applications will include a schema dump in db/schema.rb. Because db/schema.rb cannot express database triggers or stored procedures, we need to set our schema format to :sql. This is set in config/application.rb by the config.active_record.schema_format setting. It can be set to either :ruby or :sql.
Now we create a migration that will create a function to update a ship's tsv field and a function that will trigger our update function when a record is created or updated. Here's what our migration looks like:
class AddShipTsvTrigger < ActiveRecord::Migration
def up
execute <
This migration creates an update_ships_tsv() function that will be called by another function, ships_tsv_trigger() which RETURNS a trigger - Postgres will call our trigger function on INSERT or UPDATE of a ship record.
We then repeated the process for our operator model
So let's say we have the following ship record:
Ship:
title: 'Initforthe'
description: 'A big red ship'
booking_info: 'Book now for a discount'
When the ship was created, our update_ships_tsv() function saves the following lexeme into the ships tsv field:
"'initforthe':1A 'big':3B 'red':4B 'ship':5B 'book':6C 'discount':10C"
This means any search terms containing these strings, will return this particular ship record. As you can see, each string has a 'weighting' attached which is used to rank this record in a list of search results.
When a user searches
In order to retrieve ship records from a search, we created a function on our ships model:
def self.search(term)
if term.blank?
none
else
query = sanitize_search(term)
return none if query.blank? # empty result set
where("ships.tsv @@ #{query}").order("ts_rank_cd(ships.tsv, #{query}) DESC")
end
end
Here, we use
PostgreSQL's ts_rank_cd() function which interprets the lexeme stored in out records tsv field and orders by importance/relevance.
So, by calling our search method on the Ship class and passing in the search term, we search all our records and return them ranked by importance/relevance.
Further information