DISTINCT
DISTINCT
DISTINCT
What is DISTINCT
?
DISTINCT
is used in the SELECT statement to remove duplicate rows. DISTINCT
keeps one row for each group of duplicates.
For example:
SELECT
title
FROM titles
;
outputs the raw data stored in the column title
, which has numerous duplicates, since it is possible for multiple employees to hold the same title.
title
-----------------
Senior Engineer
Staff
Senior Engineer
Engineer
Senior Engineer
Senior Staff
Staff
Senior Engineer
Senior Staff
Staff
...
<output truncated to 10 rows for viewability>
If we are interested in obtaining only a unique list of titles, we can dedup using DISTINCT
:
SELECT DISTINCT
title
FROM titles
;
which gives us the following output, by which we can deduce that there are 7 unique job titles.
title
--------------------
Engineer
Senior Engineer
Manager
Assistant Engineer
Staff
Senior Staff
Technique Leader
(7 rows)
Using the gender
column in the employees
table, write the SQL that creates the following dedupped output:
gender
--------
M
F
(2 rows)
Applying DISTINCT
to an already dedupped column
Applying DISTINCT
to an already dedupped column will not change the number of rows in the output.
The departments
table contains one unique dept_no
and dept_name
per row, with no duplications.
SELECT *
FROM departments
;
dept_no | dept_name
--------+--------------------
d009 | Customer Service
d005 | Development
d002 | Finance
d003 | Human Resources
d001 | Marketing
d004 | Production
d006 | Quality Management
d008 | Research
d007 | Sales
(9 rows)
Thus, applying DISTINCT
to dept_no
:
SELECT DISTINCT
dept_no
FROM departments
;
gives us 9 rows, which allows us to deduce that there is no duplication of values in the dept_no
column.
dept_no
---------
d009
d005
d002
d003
d001
d004
d006
d008
d007
(9 rows)
Likewise, applying DISTINCT
to dept_name
:
SELECT DISTINCT
dept_name
FROM departments
;
also gives us 9 rows, which allows us to deduce that there is also no duplication of values in the dept_name
column.
dept_name
--------------------
Customer Service
Development
Finance
Human Resources
Marketing
Production
Quality Management
Research
Sales
(9 rows)
Apply DISTINCT
to the column emp_no
in the employees
table.
emp_no
column in the employees
table?Using COUNT(DISTINCT <column_name>)
to validate duplicates in column values
We can apply the COUNT()
function over the DISTINCT <column_name>
clause, to return a count of distinct values in the specified column. This, when compared with the value returned from COUNT(*)
or SUM(1)
is a quick sanity check on duplicated values.
For example, we suspect that the employees
table contains 1 row per employee (as denoted by emp_no
) with no duplication.
To verify this, we can get the # of rows with either aggregation function:
SELECT
SUM(1) AS cnt_rows_v1,
COUNT(*) AS cnt_rows_v2
FROM employees
; -- 300024
We can get the count of unique employees (emp_no
) by applying COUNT(DISTINCT <column_name>)
:
SELECT
COUNT(DISTINCT emp_no) AS cnt_unique_employee
FROM employees
; -- 300024
Given that COUNT(*)
and COUNT(DISTINCT emp_no)
gives the same count, we can conclude that the table employees
has no duplication for the column emp_no
.
Note:
COUNT(<column_name>)
without the DISTINCT
only counts the number of non null values in the column specified.
Note the distinction in the following aggregations:
SELECT
COUNT(*) AS cnt_rows, -- counts # rows in the employees table
COUNT(emp_no) AS cnt_nonnull_employee, -- counts # non null values in emp_no column
COUNT(DISTINCT emp_no) AS cnt_unique_employee -- counts # unique values in emp_no column
FROM employees
;
gives us this output:
cnt_rows | cnt_nonnull_employee | cnt_unique_employee
---------+----------------------+---------------------
300024 | 300024 | 300024
The ouput tells us that:
emp_no
column.emp_no
column.employees
contains 1 row per employee (as denoted by emp_no
).### Exercise
Re: the table salaries
:
emp_no
?emp_no
?Multiple columns with DISTINCT
DISTINCT
can also be applied to more than one column in the SELECT
statement.
For example:
SELECT DISTINCT
column_name_1,
column_name_2
FROM table_name
;
Functionally, the combination of values in column_name_1
and column_name_2
will be used to evaluate the de-duplication.
For example, the following dedups based on the employee’s first name alone:
SELECT DISTINCT
first_name
FROM employees
; -- 1,275 rows
Applying COUNT(DISTINCT first_name)
lets us know that there are 1,275 unique values (distinct first names).
However, if we add in the last name as well, then we are de-duplicating based on both first and last names:
SELECT DISTINCT
first_name,
last_name
FROM employees
; -- 279,408 rows
This means there are 279,408 unique combinations of first and last names in this table. Given that the employees table contains 300,024 rows, there are still some employees that share the same first and last name!
### Exercise
Write a SQL query that de-duplicates by all three columns: first_name
, last_name
, gender
in the employees
table. Compare the number of rows in this new query output against the output above for dedupping against just first_name
and last_name
.
Why is it that adding in a third column (gender
) to de-duplicate by, adds more rows to the output? Hint: Try spot checking this:
SELECT *
FROM employees
WHERE first_name = 'Aamer'
AND last_name = 'Gill'
;
Applying DISTINCT
to manipulated columns
DISTINCT
can also be applied to manipulated columns created via SQL functions.
Consider the following, we created a temporary new column birth_year
:
SELECT
birth_date,
EXTRACT(year FROM birth_date) AS birth_year
FROM employees
;
the output:
birth_date | birth_year
------------+------------
1953-09-02 | 1953
1964-06-02 | 1964
1959-12-03 | 1959
1954-05-01 | 1954
1955-01-21 | 1955
1953-04-20 | 1953
1957-05-23 | 1957
1958-02-19 | 1958
1952-04-19 | 1952
1963-06-01 | 1963
...
<output truncated to 10 rows for viewability>
If we are interested in how many distinct (unique) birth years, we can de-dup using DISTINCT
:
SELECT DISTINCT
EXTRACT(year FROM birth_date) AS birth_year
FROM employees
;
which gives us 14 unique values:
birth_year
------------
1958
1964
1962
1957
1961
1959
1952
1953
1956
1954
1965
1963
1960
1955
(14 rows)
How many distinct years are there in the column hire_date
in the table employees
?
Based on our existing vocabulary in SQL:
PostgreSQL evaluates and dedups via DISTINCT
after the SELECT
clause.
FROM
-> [WHERE
] -> SELECT
-> [DISTINCT
] -> [LIMIT
]
SELECT COUNT(DISTINCT emp_no) AS cnt_employee
FROM employees
WHERE gender = 'F'
AND LEFT(first_name, 1) = 'A'
;
SELECT
COUNT(
DISTINCT
CASE
WHEN gender = 'F' THEN emp_no
WHEN gender = 'M' THEN NULL
ELSE NULL
END
) AS cnt_employee
FROM employees
WHERE hire_date BETWEEN '1989-01-01' AND '1989-12-31'
;
How many employees quit their jobs at this company in the year 1988? (use to_date
in the dept_emp
table to denote the end date)
How many employees started their jobs at this company in the year 1988? (use from_date
in the dept_emp
table to denote the start date)
9999-01-01
’ is used to denote no end date in the to_date
column, meaning the employee is still with the company.)DISTINCT