Using PostgreSQL for an Ajax Autocomplete Field

Using PostgreSQL for an Ajax Autocomplete Field

I recently implemented an Ajax based autocomplete field for a web application using a PostgreSQL database on the back-end. I had trouble finding appropriate PostgreSQL queries, so I’ve decided to share my approach in the hope that it will be useful to others.

Data Format

The application uses the following PostgreSQL table to store the author data:

CREATE TABLE authors (
authors_id serial PRIMARY KEY,

author_name character varying UNIQUE NOT NULL
);

The author_name field represents the full name of an author. E.g. ‘jane austen’, ‘john smith’, ‘john q. public’, ‘william shakespeare’, etc.

Client Side JavaScript

On the client side, the application uses the jQuery UI Autocomplete widget.(http://jqueryui.com/demos/autocomplete/) jQuery is a great library and if you’re creating dynamic web pages and not using it, you really should be. However, jQuery has been well documented so I’m not going to discuss the details of the client side implementation. Additionally, the back-end code is general enough that it should work with other JavaScript libraries with minimal modifications. Essentially, when the user begins typing, the jQuery UI Autocomplete widget makes a get request to the server with the text the user entered and expects the server to respond with a json object containing the results to show the user.

Initial Implementation

Creating the initial version of the autocomplete field was relatively straightforward. To handle the Ajax we simply added the following method to our controller:

use JSON;
sub json_author_search : Local
{
my ( $self, $c, $dashboards_id ) = @_;
my $term = $c->req->param( 'term' ) || 0;
$term = '%' . $term . '%';
my $terms =
$c->dbis->query( "select authors_id, author_name as label from authors where author_name ilike ? ", $term )->hashes;
$c->res->body( encode_json( $terms ) );
return;
}

This code was fully functional and worked with small data sets on a development machine. However, once we started testing on our production machine with a large dataset, it became clear that there were performance problems.

Using an explain showed that PostgreSQL was doing a sequential scan. I.e. it was reading through the entire database table for every query:

mediacloud=# explain select authors_id, author_name as label from authors where author_name ilike '%bla%';
QUERY PLAN
---------------------------------------------------------
Seq Scan on authors (cost=0.00..37.33 rows=1 width=22)
Filter: ((author_name)::text ~~* '%bla%'::text)
(2 rows)

Possible Solutions

I did some web searches looking for a solution. I found Nikolay Samokhvalov’s slides on Using PostgresSQL In Web 2.0 Applications. http://www.scribd.com/doc/4844027/Using-…)
These slides have useful information and are worth reading. However, their suggested approaches focus on key word search rather than autocomplete. The first approach that Samokhvalov gives is to use prefix search. This solution uses like ‘bla%’ queries and relies on text_pattern_ops indexes for quick performance. Samokhvalov’s other tip is to use lower instead of using ilike. So in our case we would add the following index:

CREATE INDEX author_name_prefix ON authors USING btree ( lower (author_name) text_pattern_ops);

and change our search query to:

$term = ‘%’ . $term;
$c->dbis->query( "select authors_id, author_name as label from authors where lower(author_name) like ? ", $term )->hashes;

This approach is fast but it doesn’t give us the behavior that we wanted. For example, if the user typed ‘sm’, we would want ‘john smith’ to be included in the search results. However, “john smith” would only be matched if the user instead typed ‘jo’.

Another solution that Samkhvalov suggests is using GIN indexes with full text search. This is an interesting approach but it didn’t seem like the right fit for an autocomplete service. As far as I was able to determine, PostgreSQL’s full text search is only designed to match entire words so ‘sm’ would not match ‘smith’. Finally Samkhvalov suggests an interesting hybrid approach that uses both prefix search and full text search with GIN indexes. This approach relies on having a separate table that contains all the distinct words in every string in the table that’s being searched. So in our case we would have an author words table that contained words such as ‘alice’,’bob’,’brown’,’david’,’smith’,etc. This approach might work well for relatively static data — creating the initial table of distinct words is easy. However, in our data set we are constantly adding new author strings. I wanted to avoid the additional complexity of managing and updating an author words table.

My Solution

I added the following indexes to the authors table:

create index authors_name_varchar_pattern on authors(lower(author_name) varchar_pattern_ops);
create index authors_name_varchar_pattern_1 on authors(lower(split_part(author_name, ' ', 1)) varchar_pattern_ops);
create index authors_name_varchar_pattern_2 on authors(lower(split_part(author_name, ' ', 2)) varchar_pattern_ops);
create index authors_name_varchar_pattern_3 on authors(lower(split_part(author_name, ' ', 3)) varchar_pattern_ops);

Then I changed the query to:

$term = $term . '%';
my $terms =
$c->dbis->query( "select authors_id, author_name as label from authors where lower(author_name) like lower(?) OR " .
" lower(split_part(author_name, ' ', 1)) like lower(?) OR " .
" lower(split_part(author_name, ' ', 2)) like lower(?) OR " .
" lower(split_part(author_name, ' ', 3)) like lower(?) LIMIT 100 ",
$term, $term, $term, $term )->hashes

The above query performs prefix searches on each of the first three words within the author name string. Additionally, it performs a prefix search on the entire author name string. We OR these prefix searches so that we return the results that match any of the prefix searches. The nice thing about this search is that the user can start typing either the first name or the last name (or the middle name). For example, when the user starts typing ‘joh’, they’ll get a list that includes the ‘john’ names. When the user starts typing ‘smi’, they’ll get a list that includes the ‘smith’ names. Because we also do a prefix search on the entire string, the user could start typing ‘john’ to get a list of names starting with ‘john’ and then expand that to ‘john sm’ to get a list of ‘john smith’s and similar names.

(Note: it might have been safe to omit lower(split_part(author_name, ' ', 1)) like lower(?) from the query since it will often be redundant with lower(author_name) like lower(?). We decided to leave it in place to enhance readability and out of concern that it may be necessary to handle improperly formatted author strings.)

Conclusion

This approach does have its weakness. It won’t work for names involving more than 3 words. Fortunately those names are rare. It would also have been nice to be able to do full wild card searches such as like ‘%bla%’. Unfortunately, PostgreSQL does not have any index to optimize those types of searches.

In an ideal world, the database would have a set of indexes and operators that would perfectly handle the search queries necessary for autocomplete and abstract them away from the programmer. However, I’ve presented a solution that has performed well for me and has added only a minimal amount of complexity to the code and the database schema. I’d love to hear comments or suggestions for alternate approaches.

Be Sociable, Share!

5 thoughts on “Using PostgreSQL for an Ajax Autocomplete Field

  1. You could also add a pg_trgm index to the author_name column and use that to return the results in a ‘most likely first’ kind of order:

    CREATE EXTENSION pg_trgm;
    CREATE INDEX trgm_idx_author_name ON authors USING GIST (author_name gist_trgm_ops)

    Then:

    $c->dbis->query( “select authors_id, author_name as label from authors where lower(author_name) like lower(?) OR ” .
    ” lower(split_part(author_name, ‘ ‘, 1)) like lower(?) OR ” .
    ” lower(split_part(author_name, ‘ ‘, 2)) like lower(?) OR ” .
    ” lower(split_part(author_name, ‘ ‘, 3)) like lower(?)
    ORDER BY author_name <-> lower(?) ASC
    LIMIT 100 “,
    $term, $term, $term, $term,$term )->hashes

  2. Dan,

    Thanks for the tip!

    When I wrote this post back in 2010, Postgresql’s trgm didn’t have a distance (<->) operator.

    I’m curious about the efficiency of the ORDER BY distance operation when there are many matches. I.e. if the user types ‘e’, would Postgresql need to perform a distance comparison for every author with a name beginning with ‘e’ or would the trgm index allow it to consider only the top ranking items?

  3. Also since 9.1, Postgresql’s trgm indexes allow efficient like queries with non-prefixed searches, however this only works if the search term is 3 or more characters.

    Thus, if author_name has a gin or gist index then:

    select authors_id, author_name as label from authors where author_name ilike ‘%bla%’

    would be able to use it the index and avoid a sequential scan. However, if the search term was less than 3 characters, e.g.:

    select authors_id, author_name as label from authors where author_name ilike ‘%bl%’

    then a sequential scan would still be performed.

Comments are closed.