Last night I noticed a user on IRC complaint on two different channels (#heroku and #rubyonrails) claiming something along the lines of “PostgreSQL sucks: i have this code
1 2 3
because of the way postgresql handles group by. It should only be
The user was surprised that this query works on MySQL. Surely, the user was getting the PostgreSQL message:
ERROR: column "categories.name" must appear in the group by clause or be used in an aggregate function. It turns out that this is not a bug, and PostgreSQL does not suck as this user initially thought. Furthermore, I tried a similar query on MS SQL Server, and it rightfully complaints:
Column 'categories.name' is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause.
Let’s look at solutions.
The first thing that’s wrong about this query is that what the user really wanted was a distinct list of categories that had questions. This is the requirement. To that end, the query should look something like the following two options.
- Option 1: Drop the
group by, and just use a condition checking whether a question exists for the category:
A variation of this can be achieved with the
- Option 2: Again, drop the
group by, and use a
Why PostgreSQL doesn’t like the original query
group by clause is used to collect data from multiple records having common values in a select statement, and project the result based on some aggregate function. It really does not make any sense to add a
group by to a query that does not have an aggregate such as
count(). There is an exception, but we’ll talk about that later.
As an example, we could retrieve every item along with a count of categories per item:
1 2 3 4 5
Note that every non-aggregated column on the
select list must appear on the
group by list. This is necessary for PostgreSQL to know which item’s to
count on (or
sum, or calculate the
max on). Let’s walk through a simplified example of what happens if we don’t include one of these columns on the
group by list.
Suppose the following table
code | city
0 | Cambridge 0 | Boston 1 | Foxboro
What happens if we run the following query:
1 2 3
What would you expect PostgreSQL to return for the row with a code equal to 0? Cambridge or Boston? When PostgreSQL is presented with an ambiguous query such as the above, it will stop and report an error. Some other databases may go on and make their own decision as to what to return. To me, this is a broken spec. Futhermore, the result set may be inconsistent and unpredictable across DBMSes, or even queries on the same DB.
Exception to the rule
On previous versions of PostgreSQL (pre 8.2), the query plan for a
group by was much more efficient than a
select distinct. In some older Rails apps, we wrote things like the following to optimize performance:
1 2 3
Instead of the more natural:
1 2 3
This was an optimization that was specific to our environment and helped us avoid the relatively poor query plan and expensive
Seq Scan that was slowing our app down.
I hope that after reading this you realize that this error is helping you as a user write better SQL. Complaining that the example query doesn’t run on PostgreSQL is like complaining that your new Fender Strat sucks because when you play Here comes the Sun the very same way you played it on your Beatles Rock Band guitar, it doesn’t sound the same.