DISTINCTDISTINCTDISTINCTWhat is DISTINCT?
DISTINCT is used in the SELECT statement to remove duplicate rows. DISTINCT keeps one row for each group of duplicates.
For example:
SELECT
title
FROM titles
;
outputs the raw data stored in the column title, which has numerous duplicates, since it is possible for multiple employees to hold the same title.
title
-----------------
Senior Engineer
Staff
Senior Engineer
Engineer
Senior Engineer
Senior Staff
Staff
Senior Engineer
Senior Staff
Staff
...
<output truncated to 10 rows for viewability>
If we are interested in obtaining only a unique list of titles, we can dedup using DISTINCT:
SELECT DISTINCT
title
FROM titles
;
which gives us the following output, by which we can deduce that there are 7 unique job titles.
title
--------------------
Engineer
Senior Engineer
Manager
Assistant Engineer
Staff
Senior Staff
Technique Leader
(7 rows)
Using the gender column in the employees table, write the SQL that creates the following dedupped output:
gender
--------
M
F
(2 rows)
Applying DISTINCT to an already dedupped column
Applying DISTINCT to an already dedupped column will not change the number of rows in the output.
The departments table contains one unique dept_no and dept_name per row, with no duplications.
SELECT *
FROM departments
;
dept_no | dept_name
--------+--------------------
d009 | Customer Service
d005 | Development
d002 | Finance
d003 | Human Resources
d001 | Marketing
d004 | Production
d006 | Quality Management
d008 | Research
d007 | Sales
(9 rows)
Thus, applying DISTINCT to dept_no:
SELECT DISTINCT
dept_no
FROM departments
;
gives us 9 rows, which allows us to deduce that there is no duplication of values in the dept_no column.
dept_no
---------
d009
d005
d002
d003
d001
d004
d006
d008
d007
(9 rows)
Likewise, applying DISTINCT to dept_name:
SELECT DISTINCT
dept_name
FROM departments
;
also gives us 9 rows, which allows us to deduce that there is also no duplication of values in the dept_name column.
dept_name
--------------------
Customer Service
Development
Finance
Human Resources
Marketing
Production
Quality Management
Research
Sales
(9 rows)
Apply DISTINCT to the column emp_no in the employees table.
emp_no column in the employees table?Using COUNT(DISTINCT <column_name>) to validate duplicates in column values
We can apply the COUNT() function over the DISTINCT <column_name> clause, to return a count of distinct values in the specified column. This, when compared with the value returned from COUNT(*) or SUM(1) is a quick sanity check on duplicated values.
For example, we suspect that the employees table contains 1 row per employee (as denoted by emp_no) with no duplication.
To verify this, we can get the # of rows with either aggregation function:
SELECT
SUM(1) AS cnt_rows_v1,
COUNT(*) AS cnt_rows_v2
FROM employees
; -- 300024
We can get the count of unique employees (emp_no) by applying COUNT(DISTINCT <column_name>):
SELECT
COUNT(DISTINCT emp_no) AS cnt_unique_employee
FROM employees
; -- 300024
Given that COUNT(*) and COUNT(DISTINCT emp_no) gives the same count, we can conclude that the table employees has no duplication for the column emp_no.
Note:
COUNT(<column_name>) without the DISTINCT only counts the number of non null values in the column specified.
Note the distinction in the following aggregations:
SELECT
COUNT(*) AS cnt_rows, -- counts # rows in the employees table
COUNT(emp_no) AS cnt_nonnull_employee, -- counts # non null values in emp_no column
COUNT(DISTINCT emp_no) AS cnt_unique_employee -- counts # unique values in emp_no column
FROM employees
;
gives us this output:
cnt_rows | cnt_nonnull_employee | cnt_unique_employee
---------+----------------------+---------------------
300024 | 300024 | 300024
The ouput tells us that:
emp_no column.emp_no column.employees contains 1 row per employee (as denoted by emp_no).### Exercise
Re: the table salaries:
emp_no?emp_no?Multiple columns with DISTINCT
DISTINCT can also be applied to more than one column in the SELECT statement.
For example:
SELECT DISTINCT
column_name_1,
column_name_2
FROM table_name
;
Functionally, the combination of values in column_name_1 and column_name_2 will be used to evaluate the de-duplication.
For example, the following dedups based on the employee’s first name alone:
SELECT DISTINCT
first_name
FROM employees
; -- 1,275 rows
Applying COUNT(DISTINCT first_name) lets us know that there are 1,275 unique values (distinct first names).
However, if we add in the last name as well, then we are de-duplicating based on both first and last names:
SELECT DISTINCT
first_name,
last_name
FROM employees
; -- 279,408 rows
This means there are 279,408 unique combinations of first and last names in this table. Given that the employees table contains 300,024 rows, there are still some employees that share the same first and last name!
### Exercise
Write a SQL query that de-duplicates by all three columns: first_name, last_name, gender in the employees table. Compare the number of rows in this new query output against the output above for dedupping against just first_name and last_name.
Why is it that adding in a third column (gender) to de-duplicate by, adds more rows to the output? Hint: Try spot checking this:
SELECT *
FROM employees
WHERE first_name = 'Aamer'
AND last_name = 'Gill'
;
Applying DISTINCT to manipulated columns
DISTINCT can also be applied to manipulated columns created via SQL functions.
Consider the following, we created a temporary new column birth_year:
SELECT
birth_date,
EXTRACT(year FROM birth_date) AS birth_year
FROM employees
;
the output:
birth_date | birth_year
------------+------------
1953-09-02 | 1953
1964-06-02 | 1964
1959-12-03 | 1959
1954-05-01 | 1954
1955-01-21 | 1955
1953-04-20 | 1953
1957-05-23 | 1957
1958-02-19 | 1958
1952-04-19 | 1952
1963-06-01 | 1963
...
<output truncated to 10 rows for viewability>
If we are interested in how many distinct (unique) birth years, we can de-dup using DISTINCT:
SELECT DISTINCT
EXTRACT(year FROM birth_date) AS birth_year
FROM employees
;
which gives us 14 unique values:
birth_year
------------
1958
1964
1962
1957
1961
1959
1952
1953
1956
1954
1965
1963
1960
1955
(14 rows)
How many distinct years are there in the column hire_date in the table employees?
Based on our existing vocabulary in SQL:
PostgreSQL evaluates and dedups via DISTINCT after the SELECT clause.
FROM -> [WHERE] -> SELECT -> [DISTINCT] -> [LIMIT]
SELECT COUNT(DISTINCT emp_no) AS cnt_employee
FROM employees
WHERE gender = 'F'
AND LEFT(first_name, 1) = 'A'
;
SELECT
COUNT(
DISTINCT
CASE
WHEN gender = 'F' THEN emp_no
WHEN gender = 'M' THEN NULL
ELSE NULL
END
) AS cnt_employee
FROM employees
WHERE hire_date BETWEEN '1989-01-01' AND '1989-12-31'
;
How many employees quit their jobs at this company in the year 1988? (use to_date in the dept_emp table to denote the end date)
How many employees started their jobs at this company in the year 1988? (use from_date in the dept_emp table to denote the start date)
9999-01-01’ is used to denote no end date in the to_date column, meaning the employee is still with the company.)DISTINCT