Home

Topic: `DISTINCT`

Learning objective

De-duplicating data using DISTINCT

De-duplication using `DISTINCT`

What is DISTINCT?

DISTINCT is used in the SELECT statement to remove duplicate rows. DISTINCT keeps one row for each group of duplicates.

For example:

SELECT
  title
FROM titles
;

outputs the raw data stored in the column title, which has numerous duplicates, since it is possible for multiple employees to hold the same title.

title
-----------------
Senior Engineer
Staff
Senior Engineer
Engineer
Senior Engineer
Senior Staff
Staff
Senior Engineer
Senior Staff
Staff
...
<output truncated to 10 rows for viewability>

If we are interested in obtaining only a unique list of titles, we can dedup using DISTINCT:

SELECT DISTINCT
    title
FROM titles
;

which gives us the following output, by which we can deduce that there are 7 unique job titles.

title
--------------------
Engineer
Senior Engineer
Manager
Assistant Engineer
Staff
Senior Staff
Technique Leader
(7 rows)

Exercise

Using the gender column in the employees table, write the SQL that creates the following dedupped output:

gender
--------
M
F
(2 rows)

Applying DISTINCT to an already dedupped column

Applying DISTINCT to an already dedupped column will not change the number of rows in the output.

The departments table contains one unique dept_no and dept_name per row, with no duplications.

SELECT *
FROM departments
;

dept_no |     dept_name
--------+--------------------
d009    | Customer Service
d005    | Development
d002    | Finance
d003    | Human Resources
d001    | Marketing
d004    | Production
d006    | Quality Management
d008    | Research
d007    | Sales
(9 rows)

Thus, applying DISTINCT to dept_no:

SELECT DISTINCT
    dept_no
FROM departments
;

gives us 9 rows, which allows us to deduce that there is no duplication of values in the dept_no column.

dept_no
---------
d009
d005
d002
d003
d001
d004
d006
d008
d007
(9 rows)

Likewise, applying DISTINCT to dept_name:

SELECT DISTINCT
    dept_name
FROM departments
;

also gives us 9 rows, which allows us to deduce that there is also no duplication of values in the dept_name column.

dept_name
--------------------
Customer Service
Development
Finance
Human Resources
Marketing
Production
Quality Management
Research
Sales
(9 rows)

Exercise

Apply DISTINCT to the column emp_no in the employees table.

How many rows in the SQL output? Notice that scrolling to the bottom of the output will take a while.
Are there any duplications of values in the emp_no column in the employees table?

Using COUNT(DISTINCT <column_name>) to validate duplicates in column values

We can apply the COUNT() function over the DISTINCT <column_name> clause, to return a count of distinct values in the specified column. This, when compared with the value returned from COUNT(*) or SUM(1) is a quick sanity check on duplicated values.

For example, we suspect that the employees table contains 1 row per employee (as denoted by emp_no) with no duplication.

To verify this, we can get the # of rows with either aggregation function:

SELECT
    SUM(1) AS cnt_rows_v1,
    COUNT(*) AS cnt_rows_v2
FROM employees
; -- 300024

We can get the count of unique employees (emp_no) by applying COUNT(DISTINCT <column_name>):

SELECT
    COUNT(DISTINCT emp_no) AS cnt_unique_employee
FROM employees
; -- 300024   

Given that COUNT(*) and COUNT(DISTINCT emp_no) gives the same count, we can conclude that the table employees has no duplication for the column emp_no.

Note:

COUNT(<column_name>) without the DISTINCT only counts the number of non null values in the column specified.

Note the distinction in the following aggregations:

SELECT
    COUNT(*) AS cnt_rows, -- counts # rows in the employees table
    COUNT(emp_no) AS cnt_nonnull_employee, -- counts # non null values in emp_no column
    COUNT(DISTINCT emp_no) AS cnt_unique_employee -- counts # unique values in emp_no column
FROM employees
;    

gives us this output:

cnt_rows | cnt_nonnull_employee | cnt_unique_employee
---------+----------------------+---------------------
  300024 |               300024 |              300024

The ouput tells us that:

There are no null values in the emp_no column.
There are no duplicated values in the emp_no column.
This table employees contains 1 row per employee (as denoted by emp_no).

### Exercise

Re: the table salaries:

Are there any null values in the column emp_no?
Are there any duplicated values in the column emp_no?
How many rows are there in this table?

Multiple columns with DISTINCT

DISTINCT can also be applied to more than one column in the SELECT statement.

For example:

SELECT DISTINCT
    column_name_1,
    column_name_2
FROM table_name
;

Functionally, the combination of values in column_name_1 and column_name_2 will be used to evaluate the de-duplication.

For example, the following dedups based on the employee’s first name alone:

SELECT DISTINCT
    first_name
FROM employees
; -- 1,275 rows

Applying COUNT(DISTINCT first_name) lets us know that there are 1,275 unique values (distinct first names).

However, if we add in the last name as well, then we are de-duplicating based on both first and last names:

SELECT DISTINCT
    first_name,
    last_name
FROM employees
; -- 279,408 rows

This means there are 279,408 unique combinations of first and last names in this table. Given that the employees table contains 300,024 rows, there are still some employees that share the same first and last name!

### Exercise

Write a SQL query that de-duplicates by all three columns: first_name, last_name, gender in the employees table. Compare the number of rows in this new query output against the output above for dedupping against just first_name and last_name.
Why is it that adding in a third column (gender) to de-duplicate by, adds more rows to the output? Hint: Try spot checking this:
```
SELECT *
FROM employees
WHERE first_name = 'Aamer'
  AND last_name = 'Gill'
;
```

Applying DISTINCT to manipulated columns

DISTINCT can also be applied to manipulated columns created via SQL functions.

Consider the following, we created a temporary new column birth_year:

SELECT
    birth_date,
    EXTRACT(year FROM birth_date) AS birth_year
FROM employees
;

the output:

birth_date | birth_year
------------+------------
1953-09-02 |       1953
1964-06-02 |       1964
1959-12-03 |       1959
1954-05-01 |       1954
1955-01-21 |       1955
1953-04-20 |       1953
1957-05-23 |       1957
1958-02-19 |       1958
1952-04-19 |       1952
1963-06-01 |       1963
...
<output truncated to 10 rows for viewability>

If we are interested in how many distinct (unique) birth years, we can de-dup using DISTINCT:

SELECT DISTINCT
    EXTRACT(year FROM birth_date) AS birth_year
FROM employees
;

which gives us 14 unique values:

birth_year
------------
      1958
      1964
      1962
      1957
      1961
      1959
      1952
      1953
      1956
      1954
      1965
      1963
      1960
      1955
(14 rows)

Exercise

How many distinct years are there in the column hire_date in the table employees?

Order of evaluation

Based on our existing vocabulary in SQL:

PostgreSQL evaluates and dedups via DISTINCT after the SELECT clause.

FROM -> [WHERE] -> SELECT -> [DISTINCT] -> [LIMIT]

Exercise

Break down the following SQL query, step by step, based on order of evaluation. What is this SQL doing and what is the output, in layman’s terms?
```
  SELECT COUNT(DISTINCT emp_no) AS cnt_employee
  FROM employees
  WHERE gender = 'F'
   AND LEFT(first_name, 1) = 'A'
  ;
```

Break down the following SQL query, step by step, based on order of evaluation. What is this SQL doing and what is the output, in layman’s terms?

  SELECT
   COUNT(
     DISTINCT
       CASE
         WHEN gender = 'F' THEN emp_no
         WHEN gender = 'M' THEN NULL
         ELSE NULL
       END
       ) AS cnt_employee
  FROM employees
  WHERE hire_date BETWEEN '1989-01-01' AND '1989-12-31'
  ;

How many employees quit their jobs at this company in the year 1988? (use to_date in the dept_emp table to denote the end date)
How many employees started their jobs at this company in the year 1988? (use from_date in the dept_emp table to denote the start date)
How many employees are still employed at this company as of the time of the last database update? (Assuming that ‘9999-01-01’ is used to denote no end date in the to_date column, meaning the employee is still with the company.)

Resources

Postgresqltutorial on DISTINCT

Home

Topic: DISTINCT

Learning objective

De-duplication using DISTINCT

Exercise

Exercise

Exercise

Order of evaluation

Exercise

Resources

Questions?

Topic: `DISTINCT`

De-duplication using `DISTINCT`