Home

Lecture 10: `GROUP BY`

Learning objective

Aggregating data using GROUP BY and aggregation functions

Introduction to `GROUP BY`

What is GROUP BY?

The GROUP BY clause divides the rows returned from the SELECT statement into groups of non null values. For each group, you can apply an aggregate function (e.g. SUM()) to get the aggregation calculation for that particular group. Each group, then, is represented by a single row in the output table.

DISTINCT versus GROUP BY

The barebones functionality of GROUP BY is similar to DISTINCT. Consider the following:

The DISTINCT deduplicates the column title, leaving it with only unique values.

SELECT DISTINCT
    title
FROM titles
;

By specifying the column name in the SELECT clause as well as in the GROUP BY clause, GROUP BY can achieve the same output.

SELECT
    title
FROM titles
GROUP BY
    title
;

The output from both:

title
--------------------
Engineer
Senior Engineer
Manager
Assistant Engineer
Staff
Senior Staff
Technique Leader
(7 rows)

Exercise

Re-write the following SQL using GROUP BY instead of DISTINCT

SELECT DISTINCT
    gender
FROM employees
;

Advantages of GROUP BY

However, what GROUP BY can do, that DISTINCT cannot, is to add additional aggregation calculations for the de-duplicated groupings.

For example:

SELECT
    title,
    COUNT(*) AS cnt_row
FROM titles
GROUP BY
    title
;

gives us, not just the de-duplicated values in the title column, but also the number of times this value occurs in the titles table.

title              | cnt_row
-------------------+---------
Assistant Engineer |   15128
Engineer           |  115003
Manager            |      24
Senior Engineer    |   97750
Senior Staff       |   92853
Staff              |  107391
Technique Leader   |   15159
(7 rows)

Exercise

Find the number of times each gender value occurs in the employees table. (e.g. can you re-produce the SQL output below?)

gender | cnt_row
-------+---------
F      |  120051
M      |  179973
(2 rows)

GROUP BY with aggregation functions

It is possible to use all aggregation functions in conjunction with GROUP BY.

The most common aggregation functions are:

SUM(): total
MIN(): minimum value
MAX(): maximum value
AVG(): average
COUNT(*): count the number of rows
COUNT(DISTINCT <column name>): count the number of distinct values in the column specified

The following query deduplicates the values in the column title and then uses the COUNT(*) aggregation function to calculate how many rows in the table contains this value.

SELECT
    title,
    COUNT(*) AS cnt_rows
FROM titles
GROUP BY
    title
;

Important: Note that only the column we are de-duplicating and grouping by (e.g. title) needs to be in the GROUP BY clause. Aggregation functions do NOT go in the GROUP BY clause.

We can modify the SQL so that instead of counting the number of rows, we can count the number of distinct employees (emp_no) that have the same title.

SELECT
    title,
    COUNT(DISTINCT emp_no) AS cnt_emps
FROM titles
GROUP BY
    title
;

We could also apply multiple aggregations to the same level of grouping. However, since we are only de-duplicating and aggregating by column title, that’s the only column that goes into the GROUP BY clause.

SELECT
    title,
    COUNT(*) AS cnt_rows,
    COUNT(DISTINCT emp_no) AS cnt_emps
FROM titles
GROUP BY
    title
;

Exercise

Using GROUP BY, find the count of # distinct employee numbers for each gender in the employees table.

Sanity check on `GROUP BY`

All aggregation functions are fair game. However, sometimes it is hard to sanity check if the aggregation is performing as expected.

For example, consider the following aggregation query:

SELECT
    emp_no,
    MIN(salary) AS min_salary,
    MAX(salary) AS max_salary
FROM salaries
GROUP BY
    emp_no
;    

outputs:

emp_no | min_salary | max_salary
-------+------------+------------
|      60117 |      88958
|      65828 |      72527
|      40006 |      43699
|      40054 |      74057
|      78228 |      94692
|      40000 |      60098
|      56724 |      88070
|      46671 |      52668
.... <truncated>

To spot check, it’s useful to filter down and spot check 1 data point. E.g. emp_no = 10001:

SELECT
    emp_no,
    MIN(salary) AS min_salary,
    MAX(salary) AS max_salary
FROM salaries
WHERE emp_no = 10001
GROUP BY
    emp_no
;    

outputs:

emp_no | min_salary | max_salary
-------+------------+------------
 10001 |      60117 |      88958

Then, we can compare against:

SELECT *
FROM salaries
WHERE emp_no = 10001
;

which allows us to visually scan and confirm the minimum and maximum salary values.

emp_no | salary | from_date  |  to_date
-------+--------+------------+------------
|  60117 | 1986-06-26 | 1987-06-26
|  62102 | 1987-06-26 | 1988-06-25
|  66074 | 1988-06-25 | 1989-06-25
|  66596 | 1989-06-25 | 1990-06-25
|  66961 | 1990-06-25 | 1991-06-25
|  71046 | 1991-06-25 | 1992-06-24
|  74333 | 1992-06-24 | 1993-06-24
|  75286 | 1993-06-24 | 1994-06-24
|  75994 | 1994-06-24 | 1995-06-24
|  76884 | 1995-06-24 | 1996-06-23
|  80013 | 1996-06-23 | 1997-06-23
|  81025 | 1997-06-23 | 1998-06-23
|  81097 | 1998-06-23 | 1999-06-23
|  84917 | 1999-06-23 | 2000-06-22
|  85112 | 2000-06-22 | 2001-06-22
|  85097 | 2001-06-22 | 2002-06-22
|  88958 | 2002-06-22 | 9999-01-01
(17 rows)

Exercise

Spot check a few results from the following query to make sure it’s performing as expected:

SELECT
    emp_no,
    COUNT(*) AS num_occurrences,
    COUNT(DISTINCT emp_no) AS num_employees,
    MIN(salary) AS min_salary,
    MAX(salary) AS max_salary,
    AVG(salary) AS avg_salary
FROM salaries
GROUP BY
    emp_no
;

Order of evaluation

Based on our existing vocabulary in SQL: here is the order of evaluation.

PostgreSQL evaluates the GROUP BY clause after the FROM and the WHERE but before the SELECT, DISTINCT, and LIMIT clause(s).

FROM -> [WHERE] -> [GROUP BY] -> SELECT -> [DISTINCT] -> [LIMIT]

Consider the following:

SELECT                                -- 4. calculates aggregations based on groups
    emp_no,
    COUNT(*) AS num_occurrences,
    MIN(salary) AS min_salary,
    MAX(salary) AS max_salary,
    AVG(salary) AS avg_salary
FROM salaries                         -- 1. identifies the table
WHERE emp_no BETWEEN 10001 AND 10010  -- 2. filters row by criteria
GROUP BY
    emp_no                            -- 3. groups by emp_no
LIMIT 5                               -- 5. further limits output to 5 rows only
;    

More advanced GROUP BYs

GROUP BY manipulated columns or expressions

Just like DISTINCT, GROUP BY can aggregate manipulated columns or expressions just as well as raw data columns.

The following query extracts only the year from the birth date, and aggregates the number of employees that are born within each year.

SELECT
    EXTRACT(year FROM birth_date) AS birth_year,
    COUNT(DISTINCT emp_no) AS num_employees
FROM employees
GROUP BY
    EXTRACT(year FROM birth_date)
;

Important: Note that in the GROUP BY, we only included the manipulated SQL expression (e.g. EXTRACT(year FROM birth_date) and NOT the column alias (e.g. AS birth_year) portion.

Also note that while PostgreSQL also allows GROUP BY using the column alias alone (see below), this is NOT best practice and not accepted in most other SQL variants.

SELECT
    EXTRACT(year FROM birth_date) AS birth_year,
    COUNT(DISTINCT emp_no) AS num_employees
FROM employees
GROUP BY
    birth_year
;

GROUP BY multiple columns

It is also possible to aggregate by multiple columns at the same time. The following SQL query aggregates the number of employees that are born in the same birth year and same birth month:

SELECT
    EXTRACT(year FROM birth_date) AS birth_year,
    EXTRACT(month FROM birth_date) AS birth_month,
    COUNT(DISTINCT emp_no) AS num_employees
FROM employees
GROUP BY
    EXTRACT(year FROM birth_date),
    EXTRACT(month FROM birth_date)
;

outputs:

birth_year | birth_month | num_employees
-----------+-------------+---------------
      1952 |           2 |          1788
      1952 |           3 |          2062
      1952 |           4 |          1904
      1952 |           5 |          1973
      1952 |           6 |          1887
      1952 |           7 |          1978

...  <output truncated>

Shorthand for GROUP BY

Because of the repetitiveness of re-writing the same SQL expression in both the SELECT and the GROUP BY clauses, the columns can be referenced in the GROUP BY using short hand numbers. The numbering is based on the order of appearance of the columns as referred in the SQL SELECT clause.

This query:

SELECT
    title,
    SUM(1) AS cnt_rows
FROM titles
GROUP BY
    title
;

can be re-written as:

SELECT
    title,
    SUM(1) AS cnt_rows
FROM titles
GROUP BY 1
;

since the 1 denotes the 1st column mentioned in the SELECT clause, which is title.

Likewise, this query:

SELECT
    EXTRACT(year FROM birth_date) AS birth_year,
    EXTRACT(month FROM birth_date) AS birth_month,
    COUNT(DISTINCT emp_no) AS num_employees
FROM employees
GROUP BY
    EXTRACT(year FROM birth_date),
    EXTRACT(month FROM birth_date)
;

can be re-written as:

SELECT
    EXTRACT(year FROM birth_date) AS birth_year,
    EXTRACT(month FROM birth_date) AS birth_month,
    COUNT(DISTINCT emp_no) AS num_employees
FROM employees
GROUP BY 1, 2
;

The styling convention is to write the groups (non-aggregate columns) first, and leave the aggregate functions later.

For example:

SELECT
  column_name,
  column_name,
  column_name,
  column_name,
  AGGREGATION_FUNCTION(column_name),
  AGGREGATION_FUNCTION(column_name)
FROM table_name
GROUP BY 1, 2, 3, 4
;

Exercise

In layman’s terms, what is the following query doing?

  SELECT
   LEFT(first_name, 1) AS first_name_initial,
   LEFT(last_name, 1) AS last_name_initial,
   COUNT(DISTINCT emp_no) AS num_employees
  FROM employees
  GROUP BY 1, 2
  ;

[Stretch question] Given that there are 26 letters in the alphabet. The expected number of rows in question 1 should have been 676 (26 x 26 = 676). Any hypotheses as to why we have more rows?

Common errors in using `GROUP BY`

You forgot the GROUP BY altogether:

  SELECT
   title,
   SUM(1) AS cnt_row
  FROM titles
  -- where is the GROUP BY?
  ;

You miscounted how many columns are aggregates and how many are non aggregates. (e.g. more often than note, off by one error)

  SELECT
   LEFT(first_name, 1) AS first_name_initial,
   LEFT(last_name, 1) AS last_name_initial,
   COUNT(DISTINCT emp_no) AS num_employees
  FROM employees
  GROUP BY 1 -- where is 2?
  ;

You did not order the SELECT clause so that the grouped columns come first and the aggregation functions come last. This threw off your GROUP BY enumeration.

  SELECT
   LEFT(first_name, 1) AS first_name_initial,
   COUNT(DISTINCT emp_no) AS num_employees, -- this should come last
   LEFT(last_name, 1) AS last_name_initial      
  FROM employees
  GROUP BY 1, 2 -- right now, "2" refers to COUNT(DISTINCT emp_no) which is wrong
  ;

Breakout exercises

How many employees started each year? (use from_date in the dept_emp table to denote the start date)
How many employees quit their jobs each year? (use to_date in the dept_emp table to denote the end date)
How many employees quit their jobs specifically in the years 1991, 1992, and 1993?

I want to aggregate the employees table data by birth year and by gender. How do I generate this output?

  gender | birth_year | num_employees
  -------+------------+---------------
  F      |       1952 |          8502
  F      |       1953 |          9104
  F      |       1954 |          9385
  ...
  M      |       1952 |         12707
  M      |       1953 |         13753
  M      |       1954 |         13843
  ...
  (28 rows)

I want to modify the output in Q4 slightly. How do I get this output?

  birth_year | gender | num_employees
  -----------+--------+---------------
     1952 | Female |          8502
     1952 | Male   |         12707
     1953 | Female |          9104
     1953 | Male   |         13753
     ...
     1965 | Female |           790
     1965 | Male   |          1150
  (28 rows)

One last time, how do I get this final modified version?

  birth_year | num_female_employees | num_male_employees
  -----------+----------------------+--------------------
|                 8502 |              12707
|                 9104 |              13753
|                 9385 |              13843
|                 9267 |              13837
|                 9229 |              13822
|                 9069 |              13781
|                 9312 |              13964
|                 9286 |              14025
|                 9383 |              13743
|                 9269 |              13796
|                 9216 |              13798
|                 9194 |              13886
|                 9045 |              13868
|                  790 |               1150
  (14 rows)

Resources

Postgresqltutorial on GROUP BY
Style guide has been updated to reflect our new knowledge on GROUP BY syntax

Wrapping up

Hackerrank invites will be coming to your Hunter email inbox over the next 2 weeks. Please keep an eye out for them.

Home

Lecture 10: GROUP BY

Learning objective

Introduction to GROUP BY

GROUP BY with aggregation functions

Sanity check on GROUP BY

Order of evaluation

More advanced GROUP BYs

Common errors in using GROUP BY

Resources

Wrapping up

Lecture 10: `GROUP BY`

Introduction to `GROUP BY`

Sanity check on `GROUP BY`

Common errors in using `GROUP BY`