, by  Tom Simnett

Creating a full text search with Rails 4 and Postgres

Super fast text searching without a search engine

We recently produced an app (built in Rails 4 with a Postgres database) for a client who wanted an app-wide, full-text search. The search had to retrieve records of different classes, rank results by relevance to the search-term, and be fast.

There are a number of ways to approach this problem, not all of which will result in a 'fast' search. Here's how we did it using Postgres tsvector and tsquery.

What are we actually trying to do?

  1. When certain fields in our database are updated (or created), we want to create a list of words used in the content of those fields.
  2. We want to ignore words like 'and', 'a', 'it' etc.
  3. We want to make records containing words in certain fields, rank more highly/less high in search results than records with the same words contained in different fields
  4. We need to associate a particular record with our list of words used in its fields
  5. So that when a user searches for a particular word or words, we can return the relevant records

Deciding on the models/fields we wanted the search to return

Because the search function was site-wide, the results needed to include more than one type of record. For the purposes of this example, assume our app has two models*:

  1. Ship
  2. Operator

And let's say each of our two models has three attributes that we wish to make 'searchable':

  1. Title
  2. Description
  3. Booking info

If a user searches for 'American', we need ship and operator records containing that word in a title, description or booking info field to be returned in our list of results.

*It doesn't matter how these models are related for the moment.

Deciding on which fields are more important in search results.

Let's say we have three ship records:

  1. Ship 1
  2. Ship 2
  3. Ship 3

Now say each ship record contains the word 'American':

  • Ship 1 - in its title field
  • Ship 2 - in its description field
  • Ship 3 - in its booking info field

If a user searches for 'American', we want Ship 1 to rank higher in the search results than 'Ship 3'. We've decided, for the purposes of this example, that an instance of a word in a title field is more relevant to a search term than an instance of a word in a booking info field.

We used a weighting system of A, B, C and D to assign importance to the fields we wanted to search. With A being the most and D being the least important.

  1. Title (weight = A)
  2. Description (weight = B)
  3. Booking info (weight = C)

Using the PostgreSQL 'to_tsvector()' function

Now we've decided on the models and the fields we want to make searchable, it's time to head to Postgres. We need to create a PostgreSQL to_tsvector function to process our fields on update.

From the PostegeSQL docs:

"to_tsvector() parses a textual document into tokens, reduces the tokens to lexemes, and returns a tsvector which lists the lexemes together with their positions in the document."
to_tsvector('english', 'some text to process')

The to_tsvector function takes two arguments - a language (which defaults to 'english') and the text to process. Because our text will always be in English, we can omit the first argument.

to_tsvector('some text to process')

The to_tsvector() function will return NULL if the text passed to it is NULL. Because of this, we pass our text through the coalesce() function before passing the result of that to to_tsvector. If the text in our field is NULL, then the second argument we pass to coalesce() is returned. Coalesce will only return NULL if ALL of it's arguments are NULL.

to_tsvector(coalesce('some text to process', ''))

Finally, we want to assign the grade from our weighting system to the results of the t_tsvector function. We do this using the setweight function.

setweight(to_tsvector(coalesce('some text to process', '')), 'A')

Storing the results of the to_tsvector() function

Using our ship model as an example, we've added a 'tsv' column to ships to store the results of to_tsvector().

class AddTsvToShips < ActiveRecord::Migration
  def change
    add_column :ships, :tsv, :tsvector
  end
end

Most simple Rails applications will include a schema dump in db/schema.rb. Because db/schema.rb cannot express database triggers or stored procedures, we need to set our schema format to :sql. This is set in config/application.rb by the config.active_record.schema_format setting. It can be set to either :ruby or :sql.

Now we create a migration that will create a function to update a ship's tsv field and a function that will trigger our update function when a record is created or updated. Here's what our migration looks like:

class AddShipTsvTrigger < ActiveRecord::Migration
 def up
   execute <<-DOC
     CREATE OR REPLACE FUNCTION update_ships_tsv(ship_id integer) RETURNS void
     LANGUAGE plpgsql
     AS $_$
     DECLARE
     BEGIN
       UPDATE ships
          SET tsv =
                setweight(to_tsvector('english', coalesce(title, '')), 'A')
                || setweight(to_tsvector('english', coalesce(description, '')), 'B')
                || setweight(to_tsvector('english', coalesce(booking_info, '')), 'C')
       WHERE ships.id = ship_id;
     END
     $_$;

     CREATE OR REPLACE FUNCTION ships_tsv_trigger() RETURNS trigger
       LANGUAGE plpgsql
       AS $$
     declare
     begin
       if (tg_op = 'INSERT') then
         perform update_ships_tsv(new.id);
       elsif (tg_op = 'UPDATE') then
         if row(new) is distinct from row(old) then
           perform update_ships_tsv(new.id);
         end if;
       end if;
       return null;
     end;
     $$;

     CREATE TRIGGER ships_tsv_tr
     AFTER INSERT OR UPDATE ON ships
     FOR EACH ROW
     EXECUTE PROCEDURE ships_tsv_trigger();
   DOC
 end

 def down
   execute <<-DOC
     DROP TRIGGER ships_tsv_tr ON ships;
     DROP FUNCTION ships_tsv_trigger();
     DROP FUNCTION update_ships_tsv(integer);
   DOC
 end
end

This migration creates an update_ships_tsv() function that will be called by another function, ships_tsv_trigger() which RETURNS a trigger - Postgres will call our trigger function on INSERT or UPDATE of a ship record.

We then repeated the process for our operator model

So let's say we have the following ship record:

Ship:
  title:        'Initforthe'
  description:  'A big red ship'
  booking_info: 'Book now for a discount'

When the ship was created, our update_ships_tsv() function saves the following lexeme into the ships tsv field:

"'initforthe':1A 'big':3B 'red':4B 'ship':5B 'book':6C 'discount':10C"

This means any search terms containing these strings, will return this particular ship record. As you can see, each string has a 'weighting' attached which is used to rank this record in a list of search results.

When a user searches

In order to retrieve ship records from a search, we created a function on our ships model:

def self.search(term)
 if term.blank?
   none
 else
   query = sanitize_search(term)
   return none if query.blank? # empty result set
   where("ships.tsv @@ #{query}").order("ts_rank_cd(ships.tsv, #{query}) DESC")
 end
end

Here, we use PostgreSQL's ts_rank_cd() function which interprets the lexeme stored in out records tsv field and orders by importance/relevance.

So, by calling our search method on the Ship class and passing in the search term, we search all our records and return them ranked by importance/relevance.

Further information