Index-only scans in Postgres

Index-only scans can be a really effective way to speed up table reads that hit an index. Of course, they’re not a silver bullet to all your performance problems, but they are a very welcome and useful part of the toolbox.

In order to understand index-only scans, why (and when) they’re valuable, let’s recap how a “normal” index scan works.

Index scans

An index is just references to the rows in a table, stored in a data structure (usually a B-tree) based on their values in the indexed columns.

An index scan reads through the index and uses it to quickly look up the rows that match your filter (something like WHERE x > 10), and return them in the order they’re stored in the index.

Postgres then goes to look up the data in these rows from the table, in the heap, where it would have found them if it had done a sequential scan.

It checks that they are visible to the current transaction — for example they haven’t been deleted or replaced by a newer version — and passes them on to the next operation.

Photo by  João Silas

Photo by João Silas

It’s a bit like using an index in a book. Instead of starting at page one and turning over the pages until you find the ones that deal with, say, soil erosion, you skip to “s” in the index, look up “soil erosion” and turn to the listed pages to read about where all the dirt is going.

Enter index-only scans

Index-only scans start off like index scans, but they get all their column information from the index, obviating the need to go back to the table to fetch the row data — the second step in the index scan process.

Returning to our book example, if we want to produce a list of topics in the book, ordered by the number of pages they appear on, then all that information is stored in the book’s index, so we can do so purely from reading the index without ever actually turning to the pages in question.

As you can imagine, under the right circumstances this can be an incredibly fast way for Postgres to access the table data. In pgMustard, we suggest considering an index-only scan to improve slow index scans, since cutting out the extra trips to the heap will often be worth adding to the size and maintenance cost of the index.

Making it possible for the query planner to use index-only scans on your data is actually really easy. There are two criteria you need to meet in order for Postgres to be able to perform an index-only scan.

Condition one: a compatible index

This is pretty simple — not every index type in Postgres supports index-only scans. Some, for example GIN indexes, don’t store all the underlying data that they index, so retrieving row data is impossible.

Most of the time, this won’t be a problem, since B-tree indexes — the Postgres default and the most common index by far — support index-only scans. However, if you’re using one of the fancier index types it’s worth checking whether or not it supports index-only scans before spending a lot of time wondering why the planner is refusing to use one.

Condition two: indexed data columns

Normally, you’d only index the columns that you want to filter or sort on — the ones you use in ORDER BY or WHERE clauses or, in certain circumstances, a JOIN condition. So if you’re creating an index for the statement

SELECT ingredients, name
FROM recipes
WHERE chef = 'rose elliot';

then you’d usually create an index on the chef column of the recipes table.

CREATE INDEX chefs_index
ON recipes (chef);

But if Postgres is going to return all the row data from the index, then the index is going to need the columns ingredients and name too. Remember that the columns you want to add constraints to (in this case chef) should always come first:

CREATE INDEX recipe_query_index
ON recipes (chef, ingredients, name);

Edit (thanks to charettes): From version 11, Postgres supports covering indexes. Covering indexes allow you to specify columns which are just “payload” you want to be able to read directly from the index, but do not need to search on. The payload columns are listed after the searchable columns, using the INCLUDE keyword. So the example above becomes:

CREATE INDEX cov_idx 
ON recipes (chef) 
INCLUDE (ingredients, name);

That’s all you need!

That’s enough for Postgres to be able to execute index-only scans. There are some caveats with partial indexes before Postgres v9.6, and some extra tweaks to make index-only scans work on expression indexes, both of which you can read about in the documentation, but in the vast majority of cases, all you need is a compatible index on all the columns you want to read

When is an index-only scan not an index-only scan?

Great! So now all my queries (except the ones that would be better as sequential scans) are going to only read information from the index? Well… not quite.

Remember when we talked about index scans, we said that one of the things Postgres needs to do was check to make sure the row was visible to the query’s transaction before allowing it to be passed on? Well, in an index-only scan Postgres still needs to be sure that the row is visible before it can return it, and that information is on the heap, not in the index.

Low visibility can be a problem — Photo by  Wojtek Witkowski

Low visibility can be a problem — Photo by Wojtek Witkowski

Postgres can often get around going to the heap for a particular row by checking the visibility map, which records whether or not each page has been changed recently. This represents each page as a single bit, so it’s very small and usually fits in memory, making reading it very fast.

If the page hasn’t changed, then Postgres can just return any rows it wants that are stored in it, and we’ve got away without having to read the row from the heap.

But if the page has changed, then that means a trip to the heap to read the page direct from the table, the same as if we were doing a normal index scan — in fact it’s slightly worse than an index scan, because we’ve added in an extra check of the visibility map!

Reducing heap fetches

You can tell how many times an index-only scan has had to fetch from the heap because it’s listed in the query plan output:

{
  "Node Type": "Index Only Scan",
  "Index Name": "recipe_query_index",
  "Heap Fetches": 599,
  ...
}

In some cases — where your data change very frequently, or your queries often happen soon after a write — you might just have to live with a few heap fetches. Otherwise, it can be a sign that your table is crufty and needs a good vacuum. Vacuuming the table will clean out any rows that shouldn’t be visible to queries any more, and update the visibility map accordingly.

If a vacuum leads to short-term improvements which then deteriorate over time, then consider tweaking your autovacuum settings to make sure your data (and the visibility of its pages) are kept more up-to-date.

So there it is — when, why and how to use index-only scans. They’re not always the right solution, but they’re a really cool performance feature that can make a big difference when used well.