PostgreSQL's Group By

Last night I noticed a user on IRC complaint on two different channels (#heroku and #rubyonrails) claiming something along the lines of “PostgreSQL sucks: i have this code

1
2
3
named_scope :with_questions,
:joins => :questions,
:group => "categories.id, categories.name, categories.created_at, categories.updated_at"

because of the way postgresql handles group by. It should only be "categories.id".”

The user was surprised that this query works on MySQL. Surely, the user was getting the PostgreSQL message: ERROR: column "categories.name" must appear in the group by clause or be used in an aggregate function. It turns out that this is not a bug, and PostgreSQL does not suck as this user initially thought. Furthermore, I tried a similar query on MS SQL Server, and it rightfully complaints: Column 'categories.name' is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause.

Let’s look at solutions.

Alternative Queries

The first thing that’s wrong about this query is that what the user really wanted was a distinct list of categories that had questions. This is the requirement. To that end, the query should look something like the following two options.

  • Option 1: Drop the join and group by, and just use a condition checking whether a question exists for the category:
1
2
Category.all(:conditions =>
'exists (select 1 from questions where categories.id = questions.category_id)')

A variation of this can be achieved with the in operator:

1
2
Category.all(:conditions =>
'clients.id in (select client_id from questions)')
  • Option 2: Again, drop the group by, and use a distinct instead:
1
2
Category.all(:select => 'distinct items.*',
:joins => :questions)

Why PostgreSQL doesn’t like the original query

The group by clause is used to collect data from multiple records having common values in a select statement, and project the result based on some aggregate function. It really does not make any sense to add a group by to a query that does not have an aggregate such as sum(), avg(), min(), max(), count(). There is an exception, but we’ll talk about that later.

As an example, we could retrieve every item along with a count of categories per item:

1
2
3
4
5
select id, name, count(id)
from items
inner join categories
on items.id = categories.item_id
group by id, name

Note that every non-aggregated column on the select list must appear on the group by list. This is necessary for PostgreSQL to know which item’s to count on (or sum, or calculate the max on). Let’s walk through a simplified example of what happens if we don’t include one of these columns on the group by list.

Suppose the following table

code | city

0 | Cambridge 0 | Boston 1 | Foxboro

What happens if we run the following query:

1
2
3
select code, city
from table
group by code

What would you expect PostgreSQL to return for the row with a code equal to 0? Cambridge or Boston? When PostgreSQL is presented with an ambiguous query such as the above, it will stop and report an error. Some other databases may go on and make their own decision as to what to return. To me, this is a broken spec. Futhermore, the result set may be inconsistent and unpredictable across DBMSes, or even queries on the same DB.

Exception to the rule

On previous versions of PostgreSQL (pre 8.2), the query plan for a group by was much more efficient than a select distinct. In some older Rails apps, we wrote things like the following to optimize performance:

1
2
3
Question.find(:all
:group => Question.column_names.join(', '),
:conditions => '...')

Instead of the more natural:

1
2
3
Question.find(:all,
:select => 'distinct items.*',
:conditions => '...')

This was an optimization that was specific to our environment and helped us avoid the relatively poor query plan and expensive Seq Scan that was slowing our app down.

I hope that after reading this you realize that this error is helping you as a user write better SQL. Complaining that the example query doesn’t run on PostgreSQL is like complaining that your new Fender Strat sucks because when you play Here comes the Sun the very same way you played it on your Beatles Rock Band guitar, it doesn’t sound the same. /endrant