ONLY_FULL_GROUP_BY and you


I have developers come to me occasionally mystified by this error messages like this:

ERROR 1055 (42000): '<tablename>.<columnname>' isn't in GROUP BY

What does that mean, exactly, and why is the query getting kicked back?

To illustrate, let’s say we have a tabe, geodata, which has this data for all US cities–the table has just north of 33K rows.

MariaDB [scratch]> select * from geodata limit 5;
+------------+---------+------------+---------+------------+------------+
| state_fips | state   | state_abbr | zipcode | county     | city       |
+------------+---------+------------+---------+------------+------------+
|          1 | Alabama | AL         | 35004   | St. Clair  | Acmar      |
|          1 | Alabama | AL         | 35005   | Jefferson  | Adamsville |
|          1 | Alabama | AL         | 35006   | Jefferson  | Adger      |
|          1 | Alabama | AL         | 35007   | Shelby     | Keystone   |
|          1 | Alabama | AL         | 35010   | Tallapoosa | New site   |
+------------+---------+------------+---------+------------+------------+

Every ZIP will have exactly one county, city, and state.  However, every county may have multiple ZIPs and cities, and every city may have multiple ZIPs.

The developer wants to run this query:

SELECT
   county,
   state,
   city,
   zipcode,
   COUNT(*)
FROM
   geodata
GROUP BY state, 
   county
ORDER BY county;

What this query is asking for is the number of rows per state and county.  However, the selected columns include city and ZIP code, which aren’t part of the GROUP BY clause.  Since MySQL 5.7, the sql_mode variable ONLY_FULL_GROUP_BY defaults to “enabled.”

Why?

Because there is a one-to-many relationship between city and ZIP.  Therefore you’ll get an error like this (this message may vary between MySQL distributions:

ERROR 1055 (42000): 'scratch.geodata.city' isn't in GROUP BY

This throws an error because the query, as written, will return misleading data.  If I disable it by executing this at the session or global level:

SET sql_mode='STRICT_TRANS_TABLES,ERROR_FOR_DIVISION_BY_ZERO,NO_AUTO_CREATE_USER,NO_ENGINE_SUBSTITUTION';

Now the query will give me a resultset:

+---------------+----------------+-----------+---------+----------+
| county        | state          | city      | zipcode | COUNT(*) |
+---------------+----------------+-----------+---------+----------+
| Abbeville     | South carolina | Abbeville | 29620   |        6 |
| Acadia Parish | Louisiana      | Branch    | 70516   |       11 |
| Accomack      | Virginia       | Accomac   | 23301   |       34 |
| Ada           | Idaho          | Eagle     | 83616   |       18 |
| Adair         | Iowa           | Adair     | 50002   |        5 |
+---------------+----------------+-----------+---------+----------+

The county, state, and COUNT columns are correct.  However, this resultset gives the incorrect impression that there’s only one city and one ZIP for the county/state combination.  This is false. Adair county in Iowa has more than one city and zipcode:

+--------+-------------+---------+
| county | city        | zipcode |
+--------+-------------+---------+
| Adair  | Adair       | 50002   |
| Adair  | Bridgewater | 50837   |
| Adair  | Fontanelle  | 50846   |
| Adair  | Greenfield  | 50849   |
| Adair  | Orient      | 50858   |
+--------+-------------+---------+

Therefore the results in the first query which only show one city and zipcode are misleading and incorrect.

Since these columns are not germane to the results requested, they should be removed from the SELECT portion of the query.  Now the data returned is 100% correct without the extra columns which are incorrect and confusing.

+---------------+----------------+----------+
| county        | state          | COUNT(*) |
+---------------+----------------+----------+
| Abbeville     | South carolina |        6 |
| Acadia Parish | Louisiana      |       11 |
| Accomack      | Virginia       |       34 |
| Ada           | Idaho          |       18 |
| Adair         | Iowa           |        5 |
+---------------+----------------+----------+

Note: You can certainly disable ONLY_FULL_GROUP_BY, but as the manual states:

If ONLY_FULL_GROUP_BY is disabled, a MySQL extension to the standard SQL use of GROUP BY permits the select list, HAVING condition, or ORDER BY list to refer to nonaggregated columns even if the columns are not functionally dependent on GROUP BY columns. This causes MySQL to accept the preceding query. In this case, the server is free to choose any value from each group, so unless they are the same, the values chosen are nondeterministic, which is probably not what you want. Furthermore, the selection of values from each group cannot be influenced by adding an ORDER BY clause. Result set sorting occurs after values have been chosen, and ORDER BY does not affect which value within each group the server chooses. Disabling ONLY_FULL_GROUP_BY is useful primarily when you know that, due to some property of the data, all values in each nonaggregated column not named in the GROUP BY are the same for each group.

MySQL is doing its best to keep you from shooting yourself in the foot, but if you are bound and determined to do so, you can blow your whole leg off, but you have to do it in a premeditated manner by deliberately removing ONLY_FULL_GROUP_BY from the sql_mode.

ONLY_FULL_GROUP_BY is the default in 5.7 and 8.0 and will remain the default going forward. If you disable it, you do so at your own peril, as inaccurate returns may be the result.

 

https://dev.mysql.com/doc/refman/8.0/en/group-by-handling.html
https://dev.mysql.com/doc/refman/5.7/en/group-by-handling.html