JOIN
sINNER
, FULL OUTER
, LEFT
, RIGHT
JOIN
sCROSS
) JOIN
JOIN
sEx 1. True or False. The salaries
table contains one row per employee ID (emp_no
)?
Ex 2. True or False. The employees
table contains one row per employee ID (emp_no
)?
Ex 3. Fill in the blanks to create an INNER JOIN
between the two tables:
SELECT *
FROM employees
INNER JOIN salaries
ON ??? = ???
;
Ex 4. Use the query below to answer the following questions:
INNER JOIN
)SELECT
SUM(1),
COUNT(DISTINCT employees.emp_no),
COUNT(DISTINCT salaries.emp_no)
FROM employees
INNER JOIN salaries
ON employees.emp_no = salaries.emp_no
;
In a one-to-one relationship, a row in one table is associated (joined) with one and only one record in another table.
In a one-to-many relationship, a row in one table is associated (joined) with multiple rows in another table.
In a many-to-many relationship, multiple rows in one table are associated (joined) with multiple rows in another table. (This is pretty rare.)
An example of a one-to-many relationship is demonstrated when we join employees
table to salaries
using emp_no
.
In the employees
table, we have one record per employee.
We can check this via count verification:
SELECT
SUM(1) AS num_rows,
COUNT(DISTINCT emp_no) AS num_employees
FROM employees
;
num_rows | num_employees
----------+---------------
300024 | 300024
We can also infer this by previewing a few rows:
SELECT *
FROM employees
ORDER BY 1
LIMIT 5
;
emp_no | birth_date | first_name | last_name | gender | hire_date
--------+------------+------------+-----------+--------+------------
10001 | 1953-09-02 | Georgi | Facello | M | 1986-06-26
10002 | 1964-06-02 | Bezalel | Simmel | F | 1985-11-21
10003 | 1959-12-03 | Parto | Bamford | M | 1986-08-28
10004 | 1954-05-01 | Chirstian | Koblick | M | 1986-12-01
10005 | 1955-01-21 | Kyoichi | Maliniak | M | 1989-09-12
...
In the salaries
table, we have multiple records per employee. The reason for this is because an employee might go through several salary raises over the years of being employed at the same (fictional) company.
We can check this via count verification:
SELECT
SUM(1) AS num_rows,
COUNT(DISTINCT emp_no) AS num_employees
FROM salaries
;
num_rows | num_employees
----------+---------------
2844047 | 300024
We can also infer this by spot checking an employee.
SELECT *
FROM salaries
WHERE emp_no = 10001
;
emp_no | salary | from_date | to_date
--------+--------+------------+------------
10001 | 60117 | 1986-06-26 | 1987-06-26
10001 | 62102 | 1987-06-26 | 1988-06-25
10001 | 66074 | 1988-06-25 | 1989-06-25
10001 | 66596 | 1989-06-25 | 1990-06-25
10001 | 66961 | 1990-06-25 | 1991-06-25
10001 | 71046 | 1991-06-25 | 1992-06-24
10001 | 74333 | 1992-06-24 | 1993-06-24
10001 | 75286 | 1993-06-24 | 1994-06-24
10001 | 75994 | 1994-06-24 | 1995-06-24
10001 | 76884 | 1995-06-24 | 1996-06-23
10001 | 80013 | 1996-06-23 | 1997-06-23
10001 | 81025 | 1997-06-23 | 1998-06-23
10001 | 81097 | 1998-06-23 | 1999-06-23
10001 | 84917 | 1999-06-23 | 2000-06-22
10001 | 85112 | 2000-06-22 | 2001-06-22
10001 | 85097 | 2001-06-22 | 2002-06-22
10001 | 88958 | 2002-06-22 | 9999-01-01
(17 rows)
To see how a one-to-many relationship join impacts the outcome, let’s narrow the scope down to one employee (emp_no = 10001
).
SELECT *
FROM employees
INNER JOIN salaries
ON employees.emp_no = salaries.emp_no
WHERE employees.emp_no = 10001
AND salaries.emp_no = 10001
;
emp_no | birth_date | first_name | last_name | gender | hire_date | emp_no | salary | from_date | to_date
--------+------------+------------+-----------+--------+------------+--------+--------+------------+------------
10001 | 1953-09-02 | Georgi | Facello | M | 1986-06-26 | 10001 | 60117 | 1986-06-26 | 1987-06-26
10001 | 1953-09-02 | Georgi | Facello | M | 1986-06-26 | 10001 | 62102 | 1987-06-26 | 1988-06-25
10001 | 1953-09-02 | Georgi | Facello | M | 1986-06-26 | 10001 | 66074 | 1988-06-25 | 1989-06-25
10001 | 1953-09-02 | Georgi | Facello | M | 1986-06-26 | 10001 | 66596 | 1989-06-25 | 1990-06-25
10001 | 1953-09-02 | Georgi | Facello | M | 1986-06-26 | 10001 | 66961 | 1990-06-25 | 1991-06-25
10001 | 1953-09-02 | Georgi | Facello | M | 1986-06-26 | 10001 | 71046 | 1991-06-25 | 1992-06-24
10001 | 1953-09-02 | Georgi | Facello | M | 1986-06-26 | 10001 | 74333 | 1992-06-24 | 1993-06-24
10001 | 1953-09-02 | Georgi | Facello | M | 1986-06-26 | 10001 | 75286 | 1993-06-24 | 1994-06-24
10001 | 1953-09-02 | Georgi | Facello | M | 1986-06-26 | 10001 | 75994 | 1994-06-24 | 1995-06-24
10001 | 1953-09-02 | Georgi | Facello | M | 1986-06-26 | 10001 | 76884 | 1995-06-24 | 1996-06-23
10001 | 1953-09-02 | Georgi | Facello | M | 1986-06-26 | 10001 | 80013 | 1996-06-23 | 1997-06-23
10001 | 1953-09-02 | Georgi | Facello | M | 1986-06-26 | 10001 | 81025 | 1997-06-23 | 1998-06-23
10001 | 1953-09-02 | Georgi | Facello | M | 1986-06-26 | 10001 | 81097 | 1998-06-23 | 1999-06-23
10001 | 1953-09-02 | Georgi | Facello | M | 1986-06-26 | 10001 | 84917 | 1999-06-23 | 2000-06-22
10001 | 1953-09-02 | Georgi | Facello | M | 1986-06-26 | 10001 | 85112 | 2000-06-22 | 2001-06-22
10001 | 1953-09-02 | Georgi | Facello | M | 1986-06-26 | 10001 | 85097 | 2001-06-22 | 2002-06-22
10001 | 1953-09-02 | Georgi | Facello | M | 1986-06-26 | 10001 | 88958 | 2002-06-22 | 9999-01-01
(17 rows)
Notice that, for the columns sourced from employees
table (from the 1st emp_no
column … to hire_date
), the row values are duplicated. However, for the columns sourced from the salaries
table (from the 2nd emp_no
column … to to_date
), the row values retain the same cardinality.
After constructing a JOIN
query, it’s often helpful to do a sanity check to see if any rows or IDs were dropped during the JOIN
. One way to check is by previewing a few rows in the output. Another way is to do a COUNT()
aggregation sanity check.
A sample COUNT()
aggregation sanity check:
SELECT
SUM(1) AS num_total_rows,
COUNT(DISTINCT employees.emp_no) AS num_emp_from_employees_table,
COUNT(DISTINCT salaries.emp_no) AS num_emp_from_salaries_table
FROM employees
INNER JOIN salaries
ON employees.emp_no = salaries.emp_no
;
num_total_rows | num_emp_from_employees_table | num_emp_from_salaries_table
----------------+------------------------------+-----------------------------
2844047 | 300024 | 300024
Based on the output of the query above, this means that for the employees
table and the salaries
table, we retained all 300,024 employees in both tables after completing the INNER JOIN
(no one got dropped due to the INNER JOIN
). Note also, that our end result of 2,844,047 rows is the same number of rows as the original salaries
table.
Exercise
Explore the tables employees
and titles
.
INNER JOIN
is created between these two tables, what is the JOIN
key column in each table?INNER JOIN
between these two tables. Return all columns.INNER JOIN
for a few emp_no
records (e.g. adding on WHERE employees.emp_no BETWEEN 10001 AND 10005
).COUNT
aggregation method.JOIN
sSo far, we have only focused on INNER JOIN
s. In reality, there are many different types of JOIN
s.
The most common ones are: INNER JOIN
, FULL OUTER JOIN
, LEFT JOIN
, RIGHT JOIN
, and Cartesian (CROSS
) JOIN
.
To illustrate the difference among the different type of JOIN
s, we will be using the following two dummy tables for illustrative purposes:
employees
table:
emp_no | first_name | last_name
-------+------------+-----------
10001 | Georgi | Facello
10002 | Bezalel | Simmel
10003 | Parto | Bamford
10004 | Chirstian | Koblick
titles
table:
emp_no | title
-------+-----------------
10002 | Staff
10003 | Senior Engineer
10004 | Senior Engineer
10005 | Senior Staff
The INNER JOIN
An INNER JOIN
only returns data if the value in the key column
in the LEFT
table also has a corresponding value in the key column
in the RIGHT
table.
If we apply the pseudo SQL:
SELECT *
FROM employees
INNER JOIN titles
ON employees.emp_no = titles.emp_no
;
Then, only the rows where the emp_no
values are shared across both dummy tables are returned. Thus, the expected output would be:
emp_no | first_name | last_name | emp_no | title
-------+------------+-----------+--------+-----------------
10002 | Bezalel | Simmel | 10002 | Staff
10003 | Parto | Bamford | 10003 | Senior Engineer
10004 | Chirstian | Koblick | 10004 | Senior Engineer
The FULL OUTER JOIN
A FULL OUTER JOIN
(sometimes referred to as just OUTER JOIN
) returns all rows from both the LEFT table as well as the RIGHT table, once the join has been established.
If the value in the key column
on one of the JOIN
tables has no corresponding match in the key column
of the other table, then the columns for the table with no match will be filled with NULL
values for that particular row.
If we apply the pseudo SQL:
SELECT *
FROM employees
FULL OUTER JOIN titles
ON employees.emp_no = titles.emp_no
;
The expected output would be:
emp_no | first_name | last_name | emp_no | title
-------+------------+-----------+--------+-----------------
10001 | Georgi | Facello | |
10002 | Bezalel | Simmel | 10002 | Staff
10003 | Parto | Bamford | 10003 | Senior Engineer
10004 | Chirstian | Koblick | 10004 | Senior Engineer
| | | 10005 | Senior Staff
Note here, NULL
values are indicated by empty.
Exercises
Answer questions 1 and 2 based on the following queries:
SELECT *
FROM titles
FULL OUTER JOIN employees
ON employees.emp_no = titles.emp_no
;
SELECT *
FROM employees
FULL OUTER JOIN titles
ON employees.emp_no = titles.emp_no
;
True or False. The above two queries will give you the same # of rows in the output.
True or False. The above two queries will give you the same ordering of columns in the output.
True or False. For both INNER JOIN
s and FULL OUTER JOIN
s, swapping the order of which is the LEFT table and which is the RIGHT table will NOT change the # of rows in the output.
The LEFT JOIN
A LEFT JOIN
(sometimes referred to as LEFT OUTER JOIN
) returns all rows from the LEFT table, once the join has been established.
Hint: Recall from previous lecture that the LEFT
table is the table to the left of the JOIN
keyword.
When the JOIN
key column values are equal for both the LEFT and the RIGHT table, then the LEFT JOIN
creates a row that populates column values for both tables. (see below: emp_no
10002
, 10003
, 10004
)
When the JOIN
key column value is found in the LEFT table but not in the RIGHT table, then the LEFT JOIN
creates a row that populates column values for the LEFT table, and fills in the column values for the right table with NULL
s. (see below: emp_no = 10001
)
When the JOIN
key column value is found in the RIGHT table but not in the LEFT table, this row is not returned in the result of the LEFT JOIN
at all. (see below: emp_no = 10005
)
If we apply the pseudo SQL:
SELECT *
FROM employees
LEFT JOIN titles
ON employees.emp_no = titles.emp_no
;
The expected output would be:
emp_no | first_name | last_name | emp_no | title
-------+------------+-----------+--------+-----------------
10001 | Georgi | Facello | |
10002 | Bezalel | Simmel | 10002 | Staff
10003 | Parto | Bamford | 10003 | Senior Engineer
10004 | Chirstian | Koblick | 10004 | Senior Engineer
The RIGHT JOIN
A RIGHT JOIN
(sometimes referred to as RIGHT OUTER JOIN
) returns all rows from the RIGHT table, once the join has been established.
Hint: Recall from previous lecture that the RIGHT
table is the table to the right of the JOIN
keyword.
When the JOIN
key column values are equal for both the LEFT and the RIGHT table, then the RIGHT JOIN
creates a row that populates column values for both tables. (see below: emp_no
10002
, 10003
, 10004
)
When the JOIN
key column value is found in the RIGHT table but not in the LEFT table, then the RIGHT JOIN
creates a row that populates column values for the RIGHT table, and fills in the column values for the LEFT table with NULL
s. (see below: emp_no = 10005
)
When the JOIN
key column value is found in the LEFT table but not in the RIGHT table, this row is not returned in the result of the RIGHT JOIN
. (see below: emp_no = 10001
)
If we apply the pseudo SQL:
SELECT *
FROM employees
RIGHT JOIN titles
ON employees.emp_no = titles.emp_no
;
The expected output would be:
emp_no | first_name | last_name | emp_no | title
-------+------------+-----------+--------+-----------------
10002 | Bezalel | Simmel | 10002 | Staff
10003 | Parto | Bamford | 10003 | Senior Engineer
10004 | Chirstian | Koblick | 10004 | Senior Engineer
| | | 10005 | Senior Staff
Common use cases and conventions
Case #1. Every RIGHT JOIN
could be re-written as a LEFT JOIN
. Because of this, the convention is to always write LEFT JOIN
s instead of RIGHT JOIN
s.
For example:
SELECT employees.*, titles.*
FROM employees
RIGHT JOIN titles
ON employees.emp_no = titles.emp_no
;
can be re-written as:
SELECT employees.*, titles.*
FROM titles
LEFT JOIN employees
ON employees.emp_no = titles.emp_no -- note: the key column order doesn't matter.
;
Case #2. LEFT JOIN
s are often used to supplement information (columns) from the RIGHT table to the LEFT table, without dropping any rows in the LEFT table.
For example:
The dept_emp
table contains information on the employee level, including the department number. However, it does not have the department name for each employee. To supplement that information, we LEFT JOIN
the departments
table to it, to pull in the department name (dept_name
).
The SQL:
SELECT *
FROM dept_emp
LEFT JOIN departments
ON dept_emp.dept_no = departments.dept_no
;
outputs:
emp_no | dept_no | from_date | to_date | dept_no | dept_name
--------+---------+------------+------------+---------+--------------------
10001 | d005 | 1986-06-26 | 9999-01-01 | d005 | Development
10002 | d007 | 1996-08-03 | 9999-01-01 | d007 | Sales
10003 | d004 | 1995-12-03 | 9999-01-01 | d004 | Production
10004 | d004 | 1986-12-01 | 9999-01-01 | d004 | Production
10005 | d003 | 1989-09-12 | 9999-01-01 | d003 | Human Resources
10006 | d005 | 1990-08-05 | 9999-01-01 | d005 | Development
10007 | d008 | 1989-02-10 | 9999-01-01 | d008 | Research
10008 | d005 | 1998-03-11 | 2000-07-31 | d005 | Development
10009 | d006 | 1985-02-18 | 9999-01-01 | d006 | Quality Management
10010 | d004 | 1996-11-24 | 2000-06-26 | d004 | Production
....
Here, we use LEFT JOIN
instead of INNER JOIN
to ensure that just in case a department number is missing from the departments
table, it does not then unintentionally filter out employees from the dept_emp
table that has that department number.
Exercise
Re-write the following SQL query with a LEFT JOIN
but make sure the output is still the same.
SELECT
EXTRACT(year FROM dept_emp.from_date) AS start_year,
COUNT(DISTINCT employees.emp_no) AS num_employees
FROM employees
RIGHT JOIN dept_emp
ON employees.emp_no = dept_emp.emp_no
WHERE EXTRACT(year FROM employees.birth_date) BETWEEN 1950 AND 1960
GROUP BY 1
ORDER BY 1
;
The Cartesian (a.k.a CROSS
) JOIN
A CROSS JOIN
creates a Cartesian Product of two tables and does not have a ON
clause.
If the LEFT table has m rows and the RIGHT table has n rows, the CROSS JOIN
result will have m x n rows.
If we apply the pseudo SQL to our dummy tables:
SELECT *
FROM employees
CROSS JOIN titles
;
We will create 4 x 4 = 16 rows of Cartesian Product.
The expected output would be:
emp_no | first_name | last_name | emp_no | title
-------+------------+-----------+--------+-----------------
10001 | Georgi | Facello | 10002 | Staff
10001 | Georgi | Facello | 10003 | Senior Engineer
10001 | Georgi | Facello | 10004 | Senior Engineer
10001 | Georgi | Facello | 10005 | Senior Staff
10002 | Bezalel | Simmel | 10002 | Staff
10002 | Bezalel | Simmel | 10003 | Senior Engineer
10002 | Bezalel | Simmel | 10004 | Senior Engineer
10002 | Bezalel | Simmel | 10005 | Senior Staff
...
(16 rows)
CROSS JOIN
s are not often used, except in special cases (e.g. needing a fast way to create a temporary table).
Exercises
How many rows would be in the output of the CROSS JOIN
of the employees
and salaries
table?
SELECT *
FROM employees
CROSS JOIN salaries
;
Hint: It’s better to calculate this using arithmetic instead of actually running the query, which is computationally very expensive and may take a while.
JOIN
sMultiple table JOIN
s are used when the information needed for the final output lies across more than 2 tables. These JOIN
s can be complex and are best done after some brainstorming.
For example, assume we want to create an output that looks like this:
first_name | last_name | dept_name
------------+-----------+--------------------
Georgi | Facello |Development
Bezalel | Simmel |Sales
Parto | Bamford |Production
Chirstian | Koblick |Production
Kyoichi | Maliniak |Human Resources
Anneke | Preusig |Development
Tzvetan | Zielinski |Research
Saniya | Kalloufi |Development
Sumant | Peac |Quality Management
Duangkaew | Piveteau |Production
...
The columns we need lie across multiple tables.
first_name
and last_name
can be found in the employees
table:
emp_no | birth_date | first_name | last_name | gender | hire_date
--------+------------+------------+-----------+--------+------------
10001 | 1953-09-02 | Georgi | Facello | M | 1986-06-26
10002 | 1964-06-02 | Bezalel | Simmel | F | 1985-11-21
10003 | 1959-12-03 | Parto | Bamford | M | 1986-08-28
10004 | 1954-05-01 | Chirstian | Koblick | M | 1986-12-01
10005 | 1955-01-21 | Kyoichi | Maliniak | M | 1989-09-12
10006 | 1953-04-20 | Anneke | Preusig | F | 1989-06-02
10007 | 1957-05-23 | Tzvetan | Zielinski | F | 1989-02-10
10008 | 1958-02-19 | Saniya | Kalloufi | M | 1994-09-15
10009 | 1952-04-19 | Sumant | Peac | F | 1985-02-18
10010 | 1963-06-01 | Duangkaew | Piveteau | F | 1989-08-24
...
(300024 rows)
dept_name
can be found in the departments
table:
dept_no | dept_name
---------+--------------------
d001 | Marketing
d002 | Finance
d003 | Human Resources
d004 | Production
d005 | Development
d006 | Quality Management
d007 | Sales
d008 | Research
d009 | Customer Service
(9 rows)
However, there is no common column shared between the employees
table and the departments
table, so we do not have our JOIN
key columns!
We use dept_emp
table as a bridging table since it contains both emp_no
and dept_no
:
emp_no | dept_no | from_date | to_date
--------+---------+------------+------------
10001 | d005 | 1986-06-26 | 9999-01-01
10002 | d007 | 1996-08-03 | 9999-01-01
10003 | d004 | 1995-12-03 | 9999-01-01
10004 | d004 | 1986-12-01 | 9999-01-01
10005 | d003 | 1989-09-12 | 9999-01-01
10006 | d005 | 1990-08-05 | 9999-01-01
10007 | d008 | 1989-02-10 | 9999-01-01
10008 | d005 | 1998-03-11 | 2000-07-31
10009 | d006 | 1985-02-18 | 9999-01-01
10010 | d004 | 1996-11-24 | 2000-06-26
...
(331603 rows)
The final query is:
SELECT
employees.first_name,
employees.last_name,
dept_emp.from_date,
dept_emp.to_date,
departments.dept_name
FROM employees
LEFT JOIN dept_emp
ON employees.emp_no = dept_emp.emp_no
LEFT JOIN departments
ON dept_emp.dept_no = departments.dept_no
ORDER BY employees.emp_no, dept_emp.from_date
;
Common use cases and conventions
Case #1. Most multi table JOIN
s are either all INNER JOIN
s or all LEFT JOIN
s. Mixing too many different types of JOIN
s in one single SELECT
query can be difficult to debug.
Case #2. Multi table JOIN
s follow two common structures.
Our example above, follows the following structure, where table b is used as a bridging table that connects table a with table c.
SELECT *
FROM table_a a
LEFT JOIN table_b b
ON a.join_key = b.join_key
LEFT JOIN table_c c
ON b.join_key = c.join_key
;
It is also possible to JOIN
all subsequent tables to the first table indicated (table a). Here, all subsequent tables (table b, table c) are JOIN
ed to table a.
SELECT *
FROM table_a a
LEFT JOIN table_b b
ON a.join_key = b.join_key
LEFT JOIN table_c c
ON a.join_key = c.join_key
;
Your choice for how to construct the JOIN
depends on what you need the output to be, so it helps to brainstorm first before writing the SQL.
Ex 1. The multi table JOIN
query we constructed in the lesson above (also replicated below) has duplications on the emp_no
level. Investigate what is causing the duplication.
SELECT
employees.first_name,
employees.last_name,
dept_emp.from_date,
dept_emp.to_date,
departments.dept_name
FROM employees
LEFT JOIN dept_emp
ON employees.emp_no = dept_emp.emp_no
LEFT JOIN departments
ON dept_emp.dept_no = departments.dept_no
ORDER BY employees.emp_no, dept_emp.from_date
;
Ex 2. Fill in the blanks to create a working SQL query.
SELECT
employees.first_name,
employees.last_name,
dept_emp.from_date,
dept_emp.to_date,
departments.dept_name
FROM departments
INNER JOIN dept_emp
ON ??? = ???
INNER JOIN employees
ON ??? = ???
ORDER BY employees.emp_no, dept_emp.from_date
;
Ex 3. Construct a SQL query that returns the following columns:
to_date = '9999-01-01'
)to_date = '9999-01-01'
)INNER JOIN
.FULL OUTER JOIN
.LEFT JOIN
.RIGHT JOIN
.