GROUP BYGROUP BY and aggregation functionsGROUP BYWhat is GROUP BY?
The GROUP BY clause divides the rows returned from the SELECT statement into groups of non null values.  For each group, you can apply an aggregate function (e.g. SUM()) to get the aggregation calculation for that particular group.  Each group, then, is represented by a single row in the output table.
DISTINCT versus GROUP BY
The barebones functionality of GROUP BY is similar to DISTINCT.  Consider the following:
The DISTINCT deduplicates the column title, leaving it with only unique values.
SELECT DISTINCT
    title
FROM titles
;
By specifying the column name in the SELECT clause as well as in the GROUP BY clause, GROUP BY can achieve the same output.
SELECT
    title
FROM titles
GROUP BY
    title
;
The output from both:
title
--------------------
Engineer
Senior Engineer
Manager
Assistant Engineer
Staff
Senior Staff
Technique Leader
(7 rows)
Exercise
Re-write the following SQL using GROUP BY instead of DISTINCT
SELECT DISTINCT
    gender
FROM employees
;
Advantages of GROUP BY
However, what GROUP BY can do, that DISTINCT cannot, is to add additional aggregation calculations for the de-duplicated groupings.
For example:
SELECT
    title,
    COUNT(*) AS cnt_row
FROM titles
GROUP BY
    title
;
gives us, not just the de-duplicated values in the title column, but also the number of times this value occurs in the titles table.
title              | cnt_row
-------------------+---------
Assistant Engineer |   15128
Engineer           |  115003
Manager            |      24
Senior Engineer    |   97750
Senior Staff       |   92853
Staff              |  107391
Technique Leader   |   15159
(7 rows)
Exercise
Write the query that generates the SQL output below, using GROUP BY on the employees table.
gender | cnt_row
-------+---------
F      |  120051
M      |  179973
(2 rows)
It is possible to use all aggregation functions in conjunction with GROUP BY.
The most common aggregation functions are:
SUM(): totalMIN(): minimum valueMAX(): maximum valueAVG(): averageCOUNT(*): count the number of rowsCOUNT(DISTINCT <column name>): count the number of distinct values in the column specifiedThe following query deduplicates the values in the column title and then uses the COUNT(*) aggregation function to calculate how many rows in the table contains this value.
SELECT
    title,
    COUNT(*) AS cnt_rows
FROM titles
GROUP BY
    title
;
Important: Note that only the column we are de-duplicating and grouping by (e.g. title) needs to be in the GROUP BY clause.  Aggregation functions do NOT go in the GROUP BY clause.
We can modify the SQL so that instead of counting the number of rows, we can count the number of distinct employees (emp_no) that have the same title.
SELECT
    title,
    COUNT(DISTINCT emp_no) AS cnt_emps
FROM titles
GROUP BY
    title
;
We could also apply multiple aggregations to the same level of grouping.  However, since we are only de-duplicating and aggregating by column title, that’s the only column that goes into the GROUP BY clause.
SELECT
    title,
    COUNT(*) AS cnt_rows,
    COUNT(DISTINCT emp_no) AS cnt_emps
FROM titles
GROUP BY
    title
;
Exercise
Using GROUP BY, find the count of # distinct employee numbers for each gender in the employees table.
GROUP BYAll aggregation functions are fair game. However, sometimes it is hard to sanity check if the aggregation is performing as expected.
For example, consider the following aggregation query:
SELECT
    emp_no,
    MIN(salary) AS min_salary,
    MAX(salary) AS max_salary
FROM salaries
GROUP BY
    emp_no
;    
outputs:
emp_no | min_salary | max_salary
-------+------------+------------
 10001 |      60117 |      88958
 10002 |      65828 |      72527
 10003 |      40006 |      43699
 10004 |      40054 |      74057
 10005 |      78228 |      94692
 10006 |      40000 |      60098
 10007 |      56724 |      88070
 10008 |      46671 |      52668
.... <truncated>
To spot check, it’s useful to filter down and spot check 1 data point. E.g. emp_no = 10001:
SELECT
    emp_no,
    MIN(salary) AS min_salary,
    MAX(salary) AS max_salary
FROM salaries
WHERE emp_no = 10001
GROUP BY
    emp_no
;    
outputs:
emp_no | min_salary | max_salary
-------+------------+------------
 10001 |      60117 |      88958
Then, we can compare against:
SELECT *
FROM salaries
WHERE emp_no = 10001
;
which allows us to visually scan and confirm the minimum and maximum salary values.
emp_no | salary | from_date  |  to_date
-------+--------+------------+------------
 10001 |  60117 | 1986-06-26 | 1987-06-26
 10001 |  62102 | 1987-06-26 | 1988-06-25
 10001 |  66074 | 1988-06-25 | 1989-06-25
 10001 |  66596 | 1989-06-25 | 1990-06-25
 10001 |  66961 | 1990-06-25 | 1991-06-25
 10001 |  71046 | 1991-06-25 | 1992-06-24
 10001 |  74333 | 1992-06-24 | 1993-06-24
 10001 |  75286 | 1993-06-24 | 1994-06-24
 10001 |  75994 | 1994-06-24 | 1995-06-24
 10001 |  76884 | 1995-06-24 | 1996-06-23
 10001 |  80013 | 1996-06-23 | 1997-06-23
 10001 |  81025 | 1997-06-23 | 1998-06-23
 10001 |  81097 | 1998-06-23 | 1999-06-23
 10001 |  84917 | 1999-06-23 | 2000-06-22
 10001 |  85112 | 2000-06-22 | 2001-06-22
 10001 |  85097 | 2001-06-22 | 2002-06-22
 10001 |  88958 | 2002-06-22 | 9999-01-01
(17 rows)
Exercise
Spot check a few results from the following query to make sure it’s performing as expected:
SELECT
    emp_no,
    COUNT(*) AS num_occurrences,
    COUNT(DISTINCT emp_no) AS num_employees,
    MIN(salary) AS min_salary,
    MAX(salary) AS max_salary,
    AVG(salary) AS avg_salary
FROM salaries
GROUP BY
    emp_no
;
Based on our existing vocabulary in SQL: here is the order of evaluation.
PostgreSQL evaluates the GROUP BY clause after the FROM and the WHERE but before the SELECT, DISTINCT, and LIMIT clause(s).
FROM -> [WHERE] -> [GROUP BY] -> SELECT -> [DISTINCT] -> [LIMIT]
Consider the following:
SELECT                                -- 4. calculates aggregations based on groups
    emp_no,
    COUNT(*) AS num_occurrences,
    MIN(salary) AS min_salary,
    MAX(salary) AS max_salary,
    AVG(salary) AS avg_salary
FROM salaries                         -- 1. identifies the table
WHERE emp_no BETWEEN 10001 AND 10010  -- 2. filters row by criteria
GROUP BY
    emp_no                            -- 3. groups by emp_no
LIMIT 5                               -- 5. further limits output to 5 rows only
;    
GROUP BY manipulated columns or expressions
Just like DISTINCT, GROUP BY can aggregate manipulated columns or expressions just as well as raw data columns.
The following query extracts only the year from the birth date, and aggregates the number of employees that are born within each year.
SELECT
    EXTRACT(year FROM birth_date) AS birth_year,
    COUNT(DISTINCT emp_no) AS num_employees
FROM employees
GROUP BY
    EXTRACT(year FROM birth_date)
;
Important: Note that in the GROUP BY, we only included the manipulated SQL expression (e.g. EXTRACT(year FROM birth_date) and NOT the column alias (e.g. AS birth_year) portion.
Also note that while PostgreSQL also allows GROUP BY using the column alias alone (see below), this is NOT best practice and not accepted in most other SQL variants.
SELECT
    EXTRACT(year FROM birth_date) AS birth_year,
    COUNT(DISTINCT emp_no) AS num_employees
FROM employees
GROUP BY
    birth_year
;
GROUP BY multiple columns
It is also possible to aggregate by multiple columns at the same time. The following SQL query aggregates the number of employees that are born in the same birth year and same birth month:
SELECT
    EXTRACT(year FROM birth_date) AS birth_year,
    EXTRACT(month FROM birth_date) AS birth_month,
    COUNT(DISTINCT emp_no) AS num_employees
FROM employees
GROUP BY
    EXTRACT(year FROM birth_date),
    EXTRACT(month FROM birth_date)
;
outputs:
birth_year | birth_month | num_employees
-----------+-------------+---------------
      1952 |           2 |          1788
      1952 |           3 |          2062
      1952 |           4 |          1904
      1952 |           5 |          1973
      1952 |           6 |          1887
      1952 |           7 |          1978
...  <output truncated>
Shorthand for GROUP BY
Because of the repetitiveness of re-writing the same SQL expression in both the SELECT and the GROUP BY clauses, the columns can be referenced in the GROUP BY using short hand numbers.  The numbering is based on the order of appearance of the columns as referred in the SQL SELECT clause.
This query:
SELECT
    title,
    SUM(1) AS cnt_rows
FROM titles
GROUP BY
    title
;
can be re-written as:
SELECT
    title,
    SUM(1) AS cnt_rows
FROM titles
GROUP BY 1
;
since the 1 denotes the 1st column mentioned in the SELECT clause, which is title.
Likewise, this query:
SELECT
    EXTRACT(year FROM birth_date) AS birth_year,
    EXTRACT(month FROM birth_date) AS birth_month,
    COUNT(DISTINCT emp_no) AS num_employees
FROM employees
GROUP BY
    EXTRACT(year FROM birth_date),
    EXTRACT(month FROM birth_date)
;
can be re-written as:
SELECT
    EXTRACT(year FROM birth_date) AS birth_year,
    EXTRACT(month FROM birth_date) AS birth_month,
    COUNT(DISTINCT emp_no) AS num_employees
FROM employees
GROUP BY 1, 2
;
The styling convention is to write the groups (non-aggregate columns) first, and leave the aggregate functions later.
For example:
SELECT
  column_name,
  column_name,
  column_name,
  column_name,
  AGGREGATION_FUNCTION(column_name),
  AGGREGATION_FUNCTION(column_name)
FROM table_name
GROUP BY 1, 2, 3, 4
;
Exercise
  SELECT
   LEFT(first_name, 1) AS first_name_initial,
   LEFT(last_name, 1) AS last_name_initial,
   COUNT(DISTINCT emp_no) AS num_employees
  FROM employees
  GROUP BY 1, 2
  ;
GROUP BYSome common errors in using GROUP BY:
GROUP BY altogether:
      SELECT
   title,
   SUM(1) AS cnt_row
  FROM titles
  -- where is the GROUP BY?
  ;
  SELECT
   LEFT(first_name, 1) AS first_name_initial,
   LEFT(last_name, 1) AS last_name_initial,
   COUNT(DISTINCT emp_no) AS num_employees
  FROM employees
  GROUP BY 1 -- where is 2?
  ;
SELECT clause so that the grouped columns come first and the aggregation functions come last.  This threw off your GROUP BY enumeration.
      SELECT
   LEFT(first_name, 1) AS first_name_initial,
   COUNT(DISTINCT emp_no) AS num_employees, -- this should come last
   LEFT(last_name, 1) AS last_name_initial      
  FROM employees
  GROUP BY 1, 2 -- right now, "2" refers to COUNT(DISTINCT emp_no) which is wrong
  ;
from_date in the dept_emp table to denote the start date)to_date in the dept_emp table to denote the end date)employees table.  Write the SQL query that will produce the following data output:
      gender | birth_year | num_employees
  -------+------------+---------------
  F      |       1952 |          8502
  F      |       1953 |          9104
  F      |       1954 |          9385
  ...
  M      |       1952 |         12707
  M      |       1953 |         13753
  M      |       1954 |         13843
  ...
  (28 rows)
CASE .. WHEN):
      birth_year | gender | num_employees
  -----------+--------+---------------
     1952 | Female |          8502
     1952 | Male   |         12707
     1953 | Female |          9104
     1953 | Male   |         13753
     ...
     1965 | Female |           790
     1965 | Male   |          1150
  (28 rows)
  birth_year | num_female_employees | num_male_employees
  -----------+----------------------+--------------------
     1952 |                 8502 |              12707
     1953 |                 9104 |              13753
     1954 |                 9385 |              13843
     1955 |                 9267 |              13837
     1956 |                 9229 |              13822
     1957 |                 9069 |              13781
     1958 |                 9312 |              13964
     1959 |                 9286 |              14025
     1960 |                 9383 |              13743
     1961 |                 9269 |              13796
     1962 |                 9216 |              13798
     1963 |                 9194 |              13886
     1964 |                 9045 |              13868
     1965 |                  790 |               1150
  (14 rows)
GROUP BYGROUP BY syntax