Tuning Django ORM (part 1): Text Queries

For text queries, the Django ORM offers several helper lookups. However, some of them are not backed by a database index and will be slow on larger tables. Fortunately, on PostgreSQL these searches can be sped up quite easily with GIN or GiST indexes.

This article focuses on Django 1.6 with PostgreSQL as the backing database. It most likely applies to Django 1.7 as well.

The Django ORM layer offers some useful functionality when it comes to text search:

  1. contains and icontains (https://docs.djangoproject.com/en/1.6/ref/models/querysets/#std:fieldlookup-contains)
  2. startswith and istartswith (https://docs.djangoproject.com/en/1.6/ref/models/querysets/#startswith)
  3. endswith and iendswith (https://docs.djangoproject.com/en/1.6/ref/models/querysets/#endswith)

When querying the database, these result in wildcard SQL LIKE queries: ‘startswith’ uses a trailing wildcard (term%), ‘endswith’ a leading wildcard (%term), and ‘contains’ both a leading and a trailing wildcard (%term%).
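
For illustration, here is roughly how the two simpler lookups translate on PostgreSQL (a sketch assuming the same Article model used below; the exact quoting and escaping may differ slightly):

# python
models.Article.objects.filter(title__startswith='fancy')
# SQL
WHERE "blog_article"."title"::text LIKE 'fancy%'

# python
models.Article.objects.filter(title__endswith='fancy')
# SQL
WHERE "blog_article"."title"::text LIKE '%fancy'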

In the following, I will focus on ‘contains’ (and its case-insensitive variant ‘icontains’), as it covers both a leading and a trailing wildcard.

The Django ORM layer creates the following WHERE clause when an ‘icontains’ lookup is requested:

# python
models.Article.objects.filter(title__icontains='fancy')

# SQL
WHERE UPPER("blog_article"."title"::text) LIKE UPPER('%fancy%') 
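
If you want to check what SQL a queryset will produce, you can print the compiled query from Python (the output is close to, but not necessarily byte-for-byte identical with, what PostgreSQL receives, since parameters are interpolated for display):

# python
qs = models.Article.objects.filter(title__icontains='fancy')
print(qs.query)  # prints the SELECT statement including the LIKE UPPER('%fancy%') filter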

Asking PostgreSQL to explain a query that makes use of ‘icontains’ shows:

blog_db=# explain analyze SELECT COUNT(*) FROM "blog_article" WHERE UPPER("blog_article"."title"::text) LIKE UPPER('%fancy%');
                                                       QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=557.64..557.65 rows=1 width=0) (actual time=29.156..29.156 rows=1 loops=1)
   ->  Seq Scan on blog_article  (cost=0.00..557.64 rows=2 width=0) (actual time=0.236..29.126 rows=3 loops=1)
         Filter: (upper(title) ~~ '%FANCY%'::text)
         Rows Removed by Filter: 19373
 Total runtime: 29.280 ms
(5 rows) 

So, on a table with around 20,000 entries, a simple ‘count’ already takes around 30 milliseconds when the result is filtered with ‘icontains’. Can this be reduced?

It can, but it requires a GIN or GiST index, which Django does not create by default, since either one relies on an extension.

I decided to go with GIN, which seemed more appropriate in our case. Installing the required pg_trgm extension is fairly simple under Ubuntu:

$ sudo apt-get install postgresql-contrib-9.3
$ psql blog_db
blog_db=# CREATE EXTENSION pg_trgm;
CREATE EXTENSION 

To add the index to a specific column, run:

blog_db=# CREATE INDEX blog_article_title_trigram ON blog_article USING gin (UPPER(title) gin_trgm_ops);
CREATE INDEX 
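
If you prefer to keep this step in the code base rather than running it by hand, the same SQL can live in a migration. Here is a minimal sketch using Django 1.7’s built-in migrations (file name, app label and dependency are placeholders; on Django 1.6 you would use South or simply keep the manual psql approach):

# python  (e.g. blog/migrations/0002_title_trigram_index.py)
from django.db import migrations


class Migration(migrations.Migration):

    dependencies = [
        ('blog', '0001_initial'),  # adjust to your app's latest migration
    ]

    operations = [
        # Creating the extension usually requires superuser rights on the database.
        migrations.RunSQL("CREATE EXTENSION IF NOT EXISTS pg_trgm;"),
        migrations.RunSQL(
            "CREATE INDEX blog_article_title_trigram "
            "ON blog_article USING gin (UPPER(title) gin_trgm_ops);",
            reverse_sql="DROP INDEX IF EXISTS blog_article_title_trigram;",
        ),
    ]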

Now, check that the index is really being used in the query:

blog_db=# explain analyze SELECT COUNT(*) FROM "blog_article" WHERE UPPER("blog_article"."title"::text) LIKE UPPER('%fancy%');
                                                                    QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=51.53..51.54 rows=1 width=0) (actual time=0.323..0.323 rows=1 loops=1)
   ->  Bitmap Heap Scan on blog_article  (cost=44.02..51.53 rows=2 width=0) (actual time=0.307..0.319 rows=3 loops=1)
         Recheck Cond: (upper(title) ~~ '%FANCY%'::text)
         ->  Bitmap Index Scan on blog_article_title_trigram  (cost=0.00..44.02 rows=2 width=0) (actual time=0.287..0.287 rows=3 loops=1)
                Index Cond: (upper(title) ~~ '%FANCY%'::text)
 Total runtime: 0.411 ms
(6 rows) 

The query now takes around half a millisecond instead of nearly 30 milliseconds.

To find other queries that might profit from such an index, simply turn on PostgreSQL’s statement logging. You can tell PostgreSQL to log only queries that take longer than a configured threshold.
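
For example, in postgresql.conf (the threshold below is just an illustration; pick a value that fits your workload, then reload the configuration):

# postgresql.conf
log_min_duration_statement = 250   # log every statement running longer than 250 ms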

Further reading: Tuning Django ORM (part 2): Many-to-Many Queries