Home

Lecture 10: GROUP BY

Learning objective


Introduction to GROUP BY

What is GROUP BY?

The GROUP BY clause divides the rows returned from the SELECT statement into groups of non null values. For each group, you can apply an aggregate function (e.g. SUM()) to get the aggregation calculation for that particular group. Each group, then, is represented by a single row in the output table.

DISTINCT versus GROUP BY

The barebones functionality of GROUP BY is similar to DISTINCT. Consider the following:

The DISTINCT deduplicates the column title, leaving it with only unique values.

SELECT DISTINCT
    title
FROM titles
;

By specifying the column name in the SELECT clause as well as in the GROUP BY clause, GROUP BY can achieve the same output.

SELECT
    title
FROM titles
GROUP BY
    title
;

The output from both:

title
--------------------
Engineer
Senior Engineer
Manager
Assistant Engineer
Staff
Senior Staff
Technique Leader
(7 rows)



Exercise

Re-write the following SQL using GROUP BY instead of DISTINCT

SELECT DISTINCT
    gender
FROM employees
;




Advantages of GROUP BY

However, what GROUP BY can do, that DISTINCT cannot, is to add additional aggregation calculations for the de-duplicated groupings.

For example:

SELECT
    title,
    COUNT(*) AS cnt_row
FROM titles
GROUP BY
    title
;

gives us, not just the de-duplicated values in the title column, but also the number of times this value occurs in the titles table.

title              | cnt_row
-------------------+---------
Assistant Engineer |   15128
Engineer           |  115003
Manager            |      24
Senior Engineer    |   97750
Senior Staff       |   92853
Staff              |  107391
Technique Leader   |   15159
(7 rows)



Exercise

Find the number of times each gender value occurs in the employees table. (e.g. can you re-produce the SQL output below?)

gender | cnt_row
-------+---------
F      |  120051
M      |  179973
(2 rows)





GROUP BY with aggregation functions

It is possible to use all aggregation functions in conjunction with GROUP BY.

The most common aggregation functions are:

The following query deduplicates the values in the column title and then uses the COUNT(*) aggregation function to calculate how many rows in the table contains this value.

SELECT
    title,
    COUNT(*) AS cnt_rows
FROM titles
GROUP BY
    title
;

Important: Note that only the column we are de-duplicating and grouping by (e.g. title) needs to be in the GROUP BY clause. Aggregation functions do NOT go in the GROUP BY clause.



We can modify the SQL so that instead of counting the number of rows, we can count the number of distinct employees (emp_no) that have the same title.

SELECT
    title,
    COUNT(DISTINCT emp_no) AS cnt_emps
FROM titles
GROUP BY
    title
;



We could also apply multiple aggregations to the same level of grouping. However, since we are only de-duplicating and aggregating by column title, that’s the only column that goes into the GROUP BY clause.

SELECT
    title,
    COUNT(*) AS cnt_rows,
    COUNT(DISTINCT emp_no) AS cnt_emps
FROM titles
GROUP BY
    title
;



Exercise

Using GROUP BY, find the count of # distinct employee numbers for each gender in the employees table.





Sanity check on GROUP BY

All aggregation functions are fair game. However, sometimes it is hard to sanity check if the aggregation is performing as expected.

For example, consider the following aggregation query:

SELECT
    emp_no,
    MIN(salary) AS min_salary,
    MAX(salary) AS max_salary
FROM salaries
GROUP BY
    emp_no
;    

outputs:

emp_no | min_salary | max_salary
-------+------------+------------
 10001 |      60117 |      88958
 10002 |      65828 |      72527
 10003 |      40006 |      43699
 10004 |      40054 |      74057
 10005 |      78228 |      94692
 10006 |      40000 |      60098
 10007 |      56724 |      88070
 10008 |      46671 |      52668
.... <truncated>

To spot check, it’s useful to filter down and spot check 1 data point. E.g. emp_no = 10001:

SELECT
    emp_no,
    MIN(salary) AS min_salary,
    MAX(salary) AS max_salary
FROM salaries
WHERE emp_no = 10001
GROUP BY
    emp_no
;    

outputs:

emp_no | min_salary | max_salary
-------+------------+------------
 10001 |      60117 |      88958

Then, we can compare against:

SELECT *
FROM salaries
WHERE emp_no = 10001
;

which allows us to visually scan and confirm the minimum and maximum salary values.

emp_no | salary | from_date  |  to_date
-------+--------+------------+------------
 10001 |  60117 | 1986-06-26 | 1987-06-26
 10001 |  62102 | 1987-06-26 | 1988-06-25
 10001 |  66074 | 1988-06-25 | 1989-06-25
 10001 |  66596 | 1989-06-25 | 1990-06-25
 10001 |  66961 | 1990-06-25 | 1991-06-25
 10001 |  71046 | 1991-06-25 | 1992-06-24
 10001 |  74333 | 1992-06-24 | 1993-06-24
 10001 |  75286 | 1993-06-24 | 1994-06-24
 10001 |  75994 | 1994-06-24 | 1995-06-24
 10001 |  76884 | 1995-06-24 | 1996-06-23
 10001 |  80013 | 1996-06-23 | 1997-06-23
 10001 |  81025 | 1997-06-23 | 1998-06-23
 10001 |  81097 | 1998-06-23 | 1999-06-23
 10001 |  84917 | 1999-06-23 | 2000-06-22
 10001 |  85112 | 2000-06-22 | 2001-06-22
 10001 |  85097 | 2001-06-22 | 2002-06-22
 10001 |  88958 | 2002-06-22 | 9999-01-01
(17 rows)



Exercise

Spot check a few results from the following query to make sure it’s performing as expected:

SELECT
    emp_no,
    COUNT(*) AS num_occurrences,
    COUNT(DISTINCT emp_no) AS num_employees,
    MIN(salary) AS min_salary,
    MAX(salary) AS max_salary,
    AVG(salary) AS avg_salary
FROM salaries
GROUP BY
    emp_no
;





Order of evaluation

Based on our existing vocabulary in SQL: here is the order of evaluation.

PostgreSQL evaluates the GROUP BY clause after the FROM and the WHERE but before the SELECT, DISTINCT, and LIMIT clause(s).

FROM -> [WHERE] -> [GROUP BY] -> SELECT -> [DISTINCT] -> [LIMIT]

Consider the following:

SELECT                                -- 4. calculates aggregations based on groups
    emp_no,
    COUNT(*) AS num_occurrences,
    MIN(salary) AS min_salary,
    MAX(salary) AS max_salary,
    AVG(salary) AS avg_salary
FROM salaries                         -- 1. identifies the table
WHERE emp_no BETWEEN 10001 AND 10010  -- 2. filters row by criteria
GROUP BY
    emp_no                            -- 3. groups by emp_no
LIMIT 5                               -- 5. further limits output to 5 rows only
;    





More advanced GROUP BYs

GROUP BY manipulated columns or expressions

Just like DISTINCT, GROUP BY can aggregate manipulated columns or expressions just as well as raw data columns.

The following query extracts only the year from the birth date, and aggregates the number of employees that are born within each year.

SELECT
    EXTRACT(year FROM birth_date) AS birth_year,
    COUNT(DISTINCT emp_no) AS num_employees
FROM employees
GROUP BY
    EXTRACT(year FROM birth_date)
;

Important: Note that in the GROUP BY, we only included the manipulated SQL expression (e.g. EXTRACT(year FROM birth_date) and NOT the column alias (e.g. AS birth_year) portion.


Also note that while PostgreSQL also allows GROUP BY using the column alias alone (see below), this is NOT best practice and not accepted in most other SQL variants.

SELECT
    EXTRACT(year FROM birth_date) AS birth_year,
    COUNT(DISTINCT emp_no) AS num_employees
FROM employees
GROUP BY
    birth_year
;




GROUP BY multiple columns

It is also possible to aggregate by multiple columns at the same time. The following SQL query aggregates the number of employees that are born in the same birth year and same birth month:

SELECT
    EXTRACT(year FROM birth_date) AS birth_year,
    EXTRACT(month FROM birth_date) AS birth_month,
    COUNT(DISTINCT emp_no) AS num_employees
FROM employees
GROUP BY
    EXTRACT(year FROM birth_date),
    EXTRACT(month FROM birth_date)
;

outputs:

birth_year | birth_month | num_employees
-----------+-------------+---------------
      1952 |           2 |          1788
      1952 |           3 |          2062
      1952 |           4 |          1904
      1952 |           5 |          1973
      1952 |           6 |          1887
      1952 |           7 |          1978

...  <output truncated>




Shorthand for GROUP BY

Because of the repetitiveness of re-writing the same SQL expression in both the SELECT and the GROUP BY clauses, the columns can be referenced in the GROUP BY using short hand numbers. The numbering is based on the order of appearance of the columns as referred in the SQL SELECT clause.

This query:

SELECT
    title,
    SUM(1) AS cnt_rows
FROM titles
GROUP BY
    title
;

can be re-written as:

SELECT
    title,
    SUM(1) AS cnt_rows
FROM titles
GROUP BY 1
;

since the 1 denotes the 1st column mentioned in the SELECT clause, which is title.



Likewise, this query:

SELECT
    EXTRACT(year FROM birth_date) AS birth_year,
    EXTRACT(month FROM birth_date) AS birth_month,
    COUNT(DISTINCT emp_no) AS num_employees
FROM employees
GROUP BY
    EXTRACT(year FROM birth_date),
    EXTRACT(month FROM birth_date)
;

can be re-written as:

SELECT
    EXTRACT(year FROM birth_date) AS birth_year,
    EXTRACT(month FROM birth_date) AS birth_month,
    COUNT(DISTINCT emp_no) AS num_employees
FROM employees
GROUP BY 1, 2
;



The styling convention is to write the groups (non-aggregate columns) first, and leave the aggregate functions later.

For example:

SELECT
  column_name,
  column_name,
  column_name,
  column_name,
  AGGREGATION_FUNCTION(column_name),
  AGGREGATION_FUNCTION(column_name)
FROM table_name
GROUP BY 1, 2, 3, 4
;



Exercise

  1. In layman’s terms, what is the following query doing?
      SELECT
       LEFT(first_name, 1) AS first_name_initial,
       LEFT(last_name, 1) AS last_name_initial,
       COUNT(DISTINCT emp_no) AS num_employees
      FROM employees
      GROUP BY 1, 2
      ;
    
  2. [Stretch question] Given that there are 26 letters in the alphabet. The expected number of rows in question 1 should have been 676 (26 x 26 = 676). Any hypotheses as to why we have more rows?





Common errors in using GROUP BY

  1. You forgot the GROUP BY altogether:
      SELECT
       title,
       SUM(1) AS cnt_row
      FROM titles
      -- where is the GROUP BY?
      ;
    
  2. You miscounted how many columns are aggregates and how many are non aggregates. (e.g. more often than note, off by one error)
      SELECT
       LEFT(first_name, 1) AS first_name_initial,
       LEFT(last_name, 1) AS last_name_initial,
       COUNT(DISTINCT emp_no) AS num_employees
      FROM employees
      GROUP BY 1 -- where is 2?
      ;
    
  3. You did not order the SELECT clause so that the grouped columns come first and the aggregation functions come last. This threw off your GROUP BY enumeration.
      SELECT
       LEFT(first_name, 1) AS first_name_initial,
       COUNT(DISTINCT emp_no) AS num_employees, -- this should come last
       LEFT(last_name, 1) AS last_name_initial      
      FROM employees
      GROUP BY 1, 2 -- right now, "2" refers to COUNT(DISTINCT emp_no) which is wrong
      ;
    





Breakout exercises

  1. How many employees started each year? (use from_date in the dept_emp table to denote the start date)
  2. How many employees quit their jobs each year? (use to_date in the dept_emp table to denote the end date)
  3. How many employees quit their jobs specifically in the years 1991, 1992, and 1993?
  4. I want to aggregate the employees table data by birth year and by gender. How do I generate this output?
      gender | birth_year | num_employees
      -------+------------+---------------
      F      |       1952 |          8502
      F      |       1953 |          9104
      F      |       1954 |          9385
      ...
      M      |       1952 |         12707
      M      |       1953 |         13753
      M      |       1954 |         13843
      ...
      (28 rows)
    
  5. I want to modify the output in Q4 slightly. How do I get this output?
      birth_year | gender | num_employees
      -----------+--------+---------------
         1952 | Female |          8502
         1952 | Male   |         12707
         1953 | Female |          9104
         1953 | Male   |         13753
         ...
         1965 | Female |           790
         1965 | Male   |          1150
      (28 rows)
    
  6. One last time, how do I get this final modified version?
      birth_year | num_female_employees | num_male_employees
      -----------+----------------------+--------------------
         1952 |                 8502 |              12707
         1953 |                 9104 |              13753
         1954 |                 9385 |              13843
         1955 |                 9267 |              13837
         1956 |                 9229 |              13822
         1957 |                 9069 |              13781
         1958 |                 9312 |              13964
         1959 |                 9286 |              14025
         1960 |                 9383 |              13743
         1961 |                 9269 |              13796
         1962 |                 9216 |              13798
         1963 |                 9194 |              13886
         1964 |                 9045 |              13868
         1965 |                  790 |               1150
      (14 rows)
    

Resources


Wrapping up