DISTINCTDISTINCTApply aggregation functions to the salaries table (e.g. use the salary column where necessary) to reproduce the following output:
min_salary | max_salary |     avg_salary     | cnt_rows
-----------+------------+--------------------+----------
     38623 |     158220 | 63810.744836143706 |  2844047
DISTINCTWhat is DISTINCT?
DISTINCT is used in the SELECT statement to remove duplicate rows.  DISTINCT keeps one row for each group of duplicates.
For example:
SELECT
  title
FROM titles
;
outputs the raw data stored in the column title, which has numerous duplicates, since it is possible for multiple employees to hold the same title.
title
-----------------
Senior Engineer
Staff
Senior Engineer
Engineer
Senior Engineer
Senior Staff
Staff
Senior Engineer
Senior Staff
Staff
...
<output truncated to 10 rows for viewability>
If we are interested in obtaining only a unique list of titles, we can dedup using DISTINCT:
SELECT DISTINCT
    title
FROM titles
;
which gives us the following output, by which we can deduce that there are 7 unique job titles.
title
--------------------
Engineer
Senior Engineer
Manager
Assistant Engineer
Staff
Senior Staff
Technique Leader
(7 rows)
Exercise
Using the gender column in the employees table, write the SQL that creates the following dedupped output:
gender
--------
M
F
(2 rows)
Applying DISTINCT to an already dedupped column
Applying DISTINCT to an already dedupped column will not change the number of rows in the output.
The departments table contains one unique dept_no and dept_name per row, with no duplications.
SELECT *
FROM departments
;
dept_no |     dept_name
--------+--------------------
d009    | Customer Service
d005    | Development
d002    | Finance
d003    | Human Resources
d001    | Marketing
d004    | Production
d006    | Quality Management
d008    | Research
d007    | Sales
(9 rows)
Thus, applying DISTINCT to dept_no:
SELECT DISTINCT
    dept_no
FROM departments
;
gives us 9 rows, which allows us to deduce that there is no duplication of values in the dept_no column.
dept_no
---------
d009
d005
d002
d003
d001
d004
d006
d008
d007
(9 rows)
Likewise, applying DISTINCT to dept_name:
SELECT DISTINCT
    dept_name
FROM departments
;
also gives us 9 rows, which allows us to deduce that there is also no duplication of values in the dept_name column.
dept_name
--------------------
Customer Service
Development
Finance
Human Resources
Marketing
Production
Quality Management
Research
Sales
(9 rows)
Exercise
Apply DISTINCT to the column emp_no in the employees table.
Shift + G).  How many rows in the SQL output?emp_no column in the employees table?
Using COUNT(DISTINCT <column_name>) to validate duplicates in column values
We can apply the COUNT() function over the DISTINCT <column_name> clause, to return a count of distinct values in the specified column.  This, when compared with the value returned from COUNT(*) or SUM(1) is a quick sanity check on duplicated values.
For example, we suspect that the employees table contains 1 row per employee (as denoted by emp_no) with no duplication.
To verify this, we can get the # of rows with either aggregation function:
SELECT
    SUM(1) AS cnt_rows_v1,
    COUNT(*) AS cnt_rows_v2
FROM employees
; -- 300024
We can get the count of unique employees (emp_no) by applying COUNT(DISTINCT <column_name>):
SELECT
    COUNT(DISTINCT emp_no) AS cnt_unique_employee
FROM employees
; -- 300024   
Given that COUNT(*) and COUNT(DISTINCT emp_no) gives the same count, we can conclude that the table employees has no duplication for the column emp_no.
Note:
COUNT(<column_name>) without the DISTINCT only counts the number of non null values in the column specified.
Note the distinction in the following aggregations:
SELECT
    COUNT(*) AS cnt_rows, -- counts # rows in the employees table
    COUNT(emp_no) AS cnt_nonnull_employee, -- counts # non null values in emp_no column
    COUNT(DISTINCT emp_no) AS cnt_unique_employee -- counts # unique values in emp_no column
FROM employees
;    
gives us this output:
cnt_rows | cnt_nonnull_employee | cnt_unique_employee
---------+----------------------+---------------------
  300024 |               300024 |              300024
The ouput tells us that:
emp_no column.emp_no column.employees contains 1 row per employee (as denoted by emp_no).
Exercise
Re: the table salaries:
emp_no?emp_no?
Multiple columns with DISTINCT
DISTINCT can also be applied to more than one column in the SELECT statement.
For example:
SELECT DISTINCT
    column_name_1,
    column_name_2
FROM table_name
;
Functionally, the combination of values in column_name_1 and column_name_2 will be used to evaluate the de-duplication.
For example, the following dedups based on the employee’s first name alone:
SELECT DISTINCT
    first_name
FROM employees
; -- 1,275 rows
Either Shift+G or applying COUNT(DISTINCT first_name) lets us know that there are 1,275 unique values (distinct first names).
However, if we add in the last name as well, then we are de-duplicating based on both first and last names:
SELECT DISTINCT
    first_name,
    last_name
FROM employees
; -- 279,408 rows
This means there are 279,408 unique combinations of first and last names in this table. Given that the employees table contains 300,024 rows, there are still some employees that share the same first and last name!
Exercise
Write a SQL query that de-duplicates by all three columns: first_name, last_name, gender in the employees table.  Compare the number of rows in this new query output against the output above for dedupping against just first_name and last_name.  What can you infer about the data based on the results?
Hint: Stil unsure? Try spot checking this:
SELECT *
FROM employees
WHERE first_name = 'Aamer'
  AND last_name = 'Gill'
;
Applying DISTINCT to manipulated columns
DISTINCT can also be applied to manipulated columns created via SQL functions.
Consider the following, we created a temporary new column birth_year:
SELECT
    birth_date,
    EXTRACT(year FROM birth_date) AS birth_year
FROM employees
;
the output:
birth_date | birth_year
------------+------------
1953-09-02 |       1953
1964-06-02 |       1964
1959-12-03 |       1959
1954-05-01 |       1954
1955-01-21 |       1955
1953-04-20 |       1953
1957-05-23 |       1957
1958-02-19 |       1958
1952-04-19 |       1952
1963-06-01 |       1963
...
<output truncated to 10 rows for viewability>
If we are interested in how many distinct (unique) birth years, we can de-dup using DISTINCT:
SELECT DISTINCT
    EXTRACT(year FROM birth_date) AS birth_year
FROM employees
;
which gives us 14 unique values:
birth_year
------------
      1958
      1964
      1962
      1957
      1961
      1959
      1952
      1953
      1956
      1954
      1965
      1963
      1960
      1955
(14 rows)
Exercise
How many distinct years are there in the column hire_date in the table employees?
Based on our existing vocabulary in SQL:
PostgreSQL evaluates and dedups via DISTINCT after the SELECT clause.
FROM -> [WHERE] -> SELECT -> [DISTINCT] -> [LIMIT]
  SELECT COUNT(DISTINCT emp_no) AS cnt_employee
  FROM employees
  WHERE gender = 'F'
   AND LEFT(first_name, 1) = 'A'
  ;
  SELECT
   COUNT(
     DISTINCT
       CASE
         WHEN gender = 'F' THEN emp_no
         WHEN gender = 'M' THEN NULL
         ELSE NULL
       END
       ) AS cnt_employee
  FROM employees
  WHERE hire_date BETWEEN '1989-01-01' AND '1989-12-31'
  ;
How many employees quit their jobs at this company in the year 1988?  (use to_date in the dept_emp table to denote the end date)
How many employees started their jobs at this company in the year 1988? (use from_date in the dept_emp table to denote the start date)
9999-01-01’ is used to denote no end date in the to_date column, meaning the employee is still with the company.)DISTINCT