Postgres Distinct On
By day, I am a mild mannered owner/programmer at Box Out Sports. By no means do you need to understand all that we do there or how we do it, but recently I fixed a tricky N+1 query and thought I should write it up.
The need arises when you have a parent child relationship and you need the first child matching some criteria. Simple enough. The tricky part is when you want to bulk load a single child for several parents to avoid an N+1 query.
The Background
On our graphics page, which lists all the available templates for a team, we show the templates along with a preview. An elegant touch we added is that the preview is greyscale if you have never used the template to create a graphic, but in color if you have.
In addition to showing the template preview in color when used, we show the latest graphic you created using that template as the preview image instead of the stock preview that comes in our template store.
The Problem
Seems like a nice touch and innocent enough, right? Yes and no. To determine which preview to show, we were doing a latest graphic query for each template. The latest_graphic
method looked something like this:
def latest_graphic
owner.graphics.where(template_id: id).order("created_at desc").first
end
If any graphics had been created for the template, we used the most recent one as the preview. If there were no latest graphics, we fell back to the store preview. To accomplish that, preview_url
looked something like this:
def preview_url
if latest_graphic.present?
latest_graphic.preview_url
else
# plain old template store preview
end
end
Many teams have many templates, which meant many queries for the latest graphic of each template each time the page loaded. The more templates our customers had, the slower this page was. You might think to eager load the association, but this is not a simple includes
in Rails.
My goal was to get rid of the N+1 in favor of a single query that would return the latest graphic for each template. From there, I could preload the latest graphic to avoid the N+1 lookup. The part I was stuck on was how to do a bulk query for all the latest graphics of multiple templates, while only returning the most recent graphic over the wire.
The Solution
In order to understand the solution, a little background on the schema would likely be useful, so let’s pretend it is as simple as this:
teams
hasid
andname
.templates
hasid
,team_id
andtitle
.graphics
hasid
,template_id
, andcreated_at
.
The SQL
From the Postgres documentation on the DISTINCT clause:
SELECT DISTINCT ON ( expression [, …] ) keeps only the first row of each set of rows where the given expressions evaluate to equal. The DISTINCT ON expressions are interpreted using the same rules as for ORDER BY (see above).
Boom. That is exactly what I needed — a distinct graphic based on template_id
ordered by created_at
. That led me to the following query:
SELECT DISTINCT on (graphics.template_id) graphics.*
FROM graphics
WHERE team_id = :team_id AND template_id IN (:template_ids)
ORDER BY template_id, created_at DESC
The Ruby
From there, I plopped it in Ruby land. Along with the query, I added memoization to the reader method (latest_graphic
) that supports memoizing nil values and a writer method (latest_graphic=
) that can be used to preload the instance variable used for memoization.
class Template
def self.preload_latest_graphics(team, templates)
binds = {
team_id: team.id,
template_ids: templates.map(&:id),
}
statement = <<-SQL
SELECT DISTINCT on (graphics.template_id) graphics.*
FROM graphics
WHERE team_id = :team_id AND template_id IN (:template_ids)
ORDER BY template_id, created_at DESC
SQL
graphics = Graphic.find_by_sql([statement, binds])
# Index by template_id for fast lookup.
by_template_id = graphics.index_by(&:template_id)
# Loop through templates and preload the latest_graphic ivar.
templates.each do |template|
template.latest_graphic = by_template_id.fetch(template.id, nil)
end
end
# Reader method that is memoized. defined? lets us memoize nil.
def latest_graphic
return @latest_graphic if defined?(@latest_graphic)
@latest_graphic = owner.graphics.where(template_id: id).
order("created_at desc").first
end
# Writer method used above to populate the latest_graphic ivar.
def latest_graphic=(graphic)
@latest_graphic = graphic
end
end
With that in place, all I needed to avoid the N+1 call was to use that in the controller.
class GraphicsController < TeamAdminController
def index
@team = Team.find(params[:team_id])
@templates = @team.templates
Template.preload_latest_graphics(@team, @templates)
end
end
Just like that my server logs went from N+1 queries to the graphics table to 1 query from the graphics table. This dramatically reduced the load time for this page, especially for some larger customers that have 100’s of templates.
Another place I used this pattern was on Speaker Deck. I used it for the cover slide. A deck has_many slides. Given an array of decks, I can use DISTINCT ON
to query the first slide for each deck and show the image for that as a preview for the deck. We call it the cover image.
The Downside
The only real downside of this approach is that postgres itself still has to filter through all the matching records to do the distinct part. If you really can’t afford that then your best bet is likely to denormalize some field that you can index and use as your criteria.
In the Speaker Deck example, I could add a column named slide_number
or something. Then, I could query all the slides WHERE slide_number = 0 and deck_id IN (:deck_ids)
in a quick, indexed fashion.
Super handy and super easy. Go forth and distinct on whatever your heart desires.