GROUP BY
GROUP BY
and aggregation functionsGROUP BY
What is GROUP BY
?
The GROUP BY
clause divides the rows returned from the SELECT
statement into groups of non null values. For each group, you can apply an aggregate function (e.g. SUM()
) to get the aggregation calculation for that particular group. Each group, then, is represented by a single row in the output table.
DISTINCT versus GROUP BY
The barebones functionality of GROUP BY
is similar to DISTINCT
. Consider the following:
The DISTINCT
deduplicates the column title
, leaving it with only unique values.
SELECT DISTINCT
title
FROM titles
;
By specifying the column name in the SELECT
clause as well as in the GROUP BY
clause, GROUP BY
can achieve the same output.
SELECT
title
FROM titles
GROUP BY
title
;
The output from both:
title
--------------------
Engineer
Senior Engineer
Manager
Assistant Engineer
Staff
Senior Staff
Technique Leader
(7 rows)
Exercise
Re-write the following SQL using GROUP BY
instead of DISTINCT
SELECT DISTINCT
gender
FROM employees
;
Advantages of GROUP BY
However, what GROUP BY
can do, that DISTINCT
cannot, is to add additional aggregation calculations for the de-duplicated groupings.
For example:
SELECT
title,
COUNT(*) AS cnt_row
FROM titles
GROUP BY
title
;
gives us, not just the de-duplicated values in the title
column, but also the number of times this value occurs in the titles
table.
title | cnt_row
-------------------+---------
Assistant Engineer | 15128
Engineer | 115003
Manager | 24
Senior Engineer | 97750
Senior Staff | 92853
Staff | 107391
Technique Leader | 15159
(7 rows)
Exercise
Find the number of times each gender value occurs in the employees
table.
(e.g. can you re-produce the SQL output below?)
gender | cnt_row
-------+---------
F | 120051
M | 179973
(2 rows)
It is possible to use all aggregation functions in conjunction with GROUP BY
.
The most common aggregation functions are:
SUM()
: totalMIN()
: minimum valueMAX()
: maximum valueAVG()
: averageCOUNT(*)
: count the number of rowsCOUNT(DISTINCT <column name>)
: count the number of distinct values in the column specifiedThe following query deduplicates the values in the column title
and then uses the COUNT(*)
aggregation function to calculate how many rows in the table contains this value.
SELECT
title,
COUNT(*) AS cnt_rows
FROM titles
GROUP BY
title
;
Important: Note that only the column we are de-duplicating and grouping by (e.g. title
) needs to be in the GROUP BY
clause. Aggregation functions do NOT go in the GROUP BY
clause.
We can modify the SQL so that instead of counting the number of rows, we can count the number of distinct employees (emp_no) that have the same title.
SELECT
title,
COUNT(DISTINCT emp_no) AS cnt_emps
FROM titles
GROUP BY
title
;
We could also apply multiple aggregations to the same level of grouping. However, since we are only de-duplicating and aggregating by column title
, that’s the only column that goes into the GROUP BY
clause.
SELECT
title,
COUNT(*) AS cnt_rows,
COUNT(DISTINCT emp_no) AS cnt_emps
FROM titles
GROUP BY
title
;
Exercise
Using GROUP BY
, find the count of # distinct employee numbers for each gender
in the employees
table.
GROUP BY
All aggregation functions are fair game. However, sometimes it is hard to sanity check if the aggregation is performing as expected.
For example, consider the following aggregation query:
SELECT
emp_no,
MIN(salary) AS min_salary,
MAX(salary) AS max_salary
FROM salaries
GROUP BY
emp_no
;
outputs:
emp_no | min_salary | max_salary
-------+------------+------------
10001 | 60117 | 88958
10002 | 65828 | 72527
10003 | 40006 | 43699
10004 | 40054 | 74057
10005 | 78228 | 94692
10006 | 40000 | 60098
10007 | 56724 | 88070
10008 | 46671 | 52668
.... <truncated>
To spot check, it’s useful to filter down and spot check 1 data point. E.g. emp_no = 10001:
SELECT
emp_no,
MIN(salary) AS min_salary,
MAX(salary) AS max_salary
FROM salaries
WHERE emp_no = 10001
GROUP BY
emp_no
;
outputs:
emp_no | min_salary | max_salary
-------+------------+------------
10001 | 60117 | 88958
Then, we can compare against:
SELECT *
FROM salaries
WHERE emp_no = 10001
;
which allows us to visually scan and confirm the minimum and maximum salary values.
emp_no | salary | from_date | to_date
-------+--------+------------+------------
10001 | 60117 | 1986-06-26 | 1987-06-26
10001 | 62102 | 1987-06-26 | 1988-06-25
10001 | 66074 | 1988-06-25 | 1989-06-25
10001 | 66596 | 1989-06-25 | 1990-06-25
10001 | 66961 | 1990-06-25 | 1991-06-25
10001 | 71046 | 1991-06-25 | 1992-06-24
10001 | 74333 | 1992-06-24 | 1993-06-24
10001 | 75286 | 1993-06-24 | 1994-06-24
10001 | 75994 | 1994-06-24 | 1995-06-24
10001 | 76884 | 1995-06-24 | 1996-06-23
10001 | 80013 | 1996-06-23 | 1997-06-23
10001 | 81025 | 1997-06-23 | 1998-06-23
10001 | 81097 | 1998-06-23 | 1999-06-23
10001 | 84917 | 1999-06-23 | 2000-06-22
10001 | 85112 | 2000-06-22 | 2001-06-22
10001 | 85097 | 2001-06-22 | 2002-06-22
10001 | 88958 | 2002-06-22 | 9999-01-01
(17 rows)
Exercise
Spot check a few results from the following query to make sure it’s performing as expected:
SELECT
emp_no,
COUNT(*) AS num_occurrences,
COUNT(DISTINCT emp_no) AS num_employees,
MIN(salary) AS min_salary,
MAX(salary) AS max_salary,
AVG(salary) AS avg_salary
FROM salaries
GROUP BY
emp_no
;
Based on our existing vocabulary in SQL: here is the order of evaluation.
PostgreSQL evaluates the GROUP BY
clause after the FROM
and the WHERE
but before the SELECT
, DISTINCT
, and LIMIT
clause(s).
FROM
-> [WHERE
] -> [GROUP BY
] -> SELECT
-> [DISTINCT
] -> [LIMIT
]
Consider the following:
SELECT -- 4. calculates aggregations based on groups
emp_no,
COUNT(*) AS num_occurrences,
MIN(salary) AS min_salary,
MAX(salary) AS max_salary,
AVG(salary) AS avg_salary
FROM salaries -- 1. identifies the table
WHERE emp_no BETWEEN 10001 AND 10010 -- 2. filters row by criteria
GROUP BY
emp_no -- 3. groups by emp_no
LIMIT 5 -- 5. further limits output to 5 rows only
;
GROUP BY
manipulated columns or expressions
Just like DISTINCT
, GROUP BY
can aggregate manipulated columns or expressions just as well as raw data columns.
The following query extracts only the year from the birth date, and aggregates the number of employees that are born within each year.
SELECT
EXTRACT(year FROM birth_date) AS birth_year,
COUNT(DISTINCT emp_no) AS num_employees
FROM employees
GROUP BY
EXTRACT(year FROM birth_date)
;
Important: Note that in the GROUP BY
, we only included the manipulated SQL expression (e.g. EXTRACT(year FROM birth_date)
and NOT the column alias (e.g. AS birth_year
) portion.
Also note that while PostgreSQL also allows GROUP BY
using the column alias alone (see below), this is NOT best practice and not accepted in most other SQL variants.
SELECT
EXTRACT(year FROM birth_date) AS birth_year,
COUNT(DISTINCT emp_no) AS num_employees
FROM employees
GROUP BY
birth_year
;
GROUP BY multiple columns
It is also possible to aggregate by multiple columns at the same time. The following SQL query aggregates the number of employees that are born in the same birth year and same birth month:
SELECT
EXTRACT(year FROM birth_date) AS birth_year,
EXTRACT(month FROM birth_date) AS birth_month,
COUNT(DISTINCT emp_no) AS num_employees
FROM employees
GROUP BY
EXTRACT(year FROM birth_date),
EXTRACT(month FROM birth_date)
;
outputs:
birth_year | birth_month | num_employees
-----------+-------------+---------------
1952 | 2 | 1788
1952 | 3 | 2062
1952 | 4 | 1904
1952 | 5 | 1973
1952 | 6 | 1887
1952 | 7 | 1978
... <output truncated>
Shorthand for GROUP BY
Because of the repetitiveness of re-writing the same SQL expression in both the SELECT
and the GROUP BY
clauses, the columns can be referenced in the GROUP BY
using short hand numbers. The numbering is based on the order of appearance of the columns as referred in the SQL SELECT
clause.
This query:
SELECT
title,
SUM(1) AS cnt_rows
FROM titles
GROUP BY
title
;
can be re-written as:
SELECT
title,
SUM(1) AS cnt_rows
FROM titles
GROUP BY 1
;
since the 1 denotes the 1st column mentioned in the SELECT
clause, which is title
.
Likewise, this query:
SELECT
EXTRACT(year FROM birth_date) AS birth_year,
EXTRACT(month FROM birth_date) AS birth_month,
COUNT(DISTINCT emp_no) AS num_employees
FROM employees
GROUP BY
EXTRACT(year FROM birth_date),
EXTRACT(month FROM birth_date)
;
can be re-written as:
SELECT
EXTRACT(year FROM birth_date) AS birth_year,
EXTRACT(month FROM birth_date) AS birth_month,
COUNT(DISTINCT emp_no) AS num_employees
FROM employees
GROUP BY 1, 2
;
The styling convention is to write the groups (non-aggregate columns) first, and leave the aggregate functions later.
For example:
SELECT
column_name,
column_name,
column_name,
column_name,
AGGREGATION_FUNCTION(column_name),
AGGREGATION_FUNCTION(column_name)
FROM table_name
GROUP BY 1, 2, 3, 4
;
Exercise
SELECT
LEFT(first_name, 1) AS first_name_initial,
LEFT(last_name, 1) AS last_name_initial,
COUNT(DISTINCT emp_no) AS num_employees
FROM employees
GROUP BY 1, 2
;
GROUP BY
GROUP BY
altogether:
SELECT
title,
SUM(1) AS cnt_row
FROM titles
-- where is the GROUP BY?
;
SELECT
LEFT(first_name, 1) AS first_name_initial,
LEFT(last_name, 1) AS last_name_initial,
COUNT(DISTINCT emp_no) AS num_employees
FROM employees
GROUP BY 1 -- where is 2?
;
SELECT
clause so that the grouped columns come first and the aggregation functions come last. This threw off your GROUP BY
enumeration.
SELECT
LEFT(first_name, 1) AS first_name_initial,
COUNT(DISTINCT emp_no) AS num_employees, -- this should come last
LEFT(last_name, 1) AS last_name_initial
FROM employees
GROUP BY 1, 2 -- right now, "2" refers to COUNT(DISTINCT emp_no) which is wrong
;
Breakout exercises
from_date
in the dept_emp
table to denote the start date)to_date
in the dept_emp
table to denote the end date)employees
table data by birth year and by gender. How do I generate this output?
gender | birth_year | num_employees
-------+------------+---------------
F | 1952 | 8502
F | 1953 | 9104
F | 1954 | 9385
...
M | 1952 | 12707
M | 1953 | 13753
M | 1954 | 13843
...
(28 rows)
birth_year | gender | num_employees
-----------+--------+---------------
1952 | Female | 8502
1952 | Male | 12707
1953 | Female | 9104
1953 | Male | 13753
...
1965 | Female | 790
1965 | Male | 1150
(28 rows)
birth_year | num_female_employees | num_male_employees
-----------+----------------------+--------------------
1952 | 8502 | 12707
1953 | 9104 | 13753
1954 | 9385 | 13843
1955 | 9267 | 13837
1956 | 9229 | 13822
1957 | 9069 | 13781
1958 | 9312 | 13964
1959 | 9286 | 14025
1960 | 9383 | 13743
1961 | 9269 | 13796
1962 | 9216 | 13798
1963 | 9194 | 13886
1964 | 9045 | 13868
1965 | 790 | 1150
(14 rows)
GROUP BY
GROUP BY
syntax