Home

Topic: Complex JOINs

Learning objective


Warm up

Ex 1. True or False. The salaries table contains one row per employee ID (emp_no)?

Ex 2. True or False. The employees table contains one row per employee ID (emp_no)?

Ex 3. Fill in the blanks to create an INNER JOIN between the two tables:

SELECT *
FROM employees
INNER JOIN salaries
  ON ??? = ???
;

Ex 4. Use the query below to answer the following questions:

SELECT
  SUM(1),
  COUNT(DISTINCT employees.emp_no),
  COUNT(DISTINCT salaries.emp_no)
FROM employees
INNER JOIN salaries
  ON employees.emp_no = salaries.emp_no
;

One-to-one, one-to-many relationships


In a one-to-one relationship, a row in one table is associated (joined) with one and only one record in another table.

In a one-to-many relationship, a row in one table is associated (joined) with multiple rows in another table.

In a many-to-many relationship, multiple rows in one table are associated (joined) with multiple rows in another table. (This is pretty rare.)



An example of a one-to-many relationship is demonstrated when we join employees table to salaries using emp_no.


In the employees table, we have one record per employee.

We can check this via count verification:

SELECT
  SUM(1) AS num_rows,
  COUNT(DISTINCT emp_no) AS num_employees
FROM employees
;
num_rows | num_employees
----------+---------------
  300024 |        300024

We can also infer this by previewing a few rows:

SELECT *
FROM employees
ORDER BY 1
LIMIT 5
;
emp_no | birth_date | first_name | last_name | gender | hire_date
--------+------------+------------+-----------+--------+------------
 10001 | 1953-09-02 | Georgi     | Facello   | M      | 1986-06-26
 10002 | 1964-06-02 | Bezalel    | Simmel    | F      | 1985-11-21
 10003 | 1959-12-03 | Parto      | Bamford   | M      | 1986-08-28
 10004 | 1954-05-01 | Chirstian  | Koblick   | M      | 1986-12-01
 10005 | 1955-01-21 | Kyoichi    | Maliniak  | M      | 1989-09-12
 ...



In the salaries table, we have multiple records per employee. The reason for this is because an employee might go through several salary raises over the years of being employed at the same (fictional) company.

We can check this via count verification:

 SELECT
   SUM(1) AS num_rows,
   COUNT(DISTINCT emp_no) AS num_employees
 FROM salaries
 ;
 num_rows | num_employees
----------+---------------
  2844047 |        300024

We can also infer this by spot checking an employee.

 SELECT *
 FROM salaries
 WHERE emp_no = 10001
 ;
 emp_no | salary | from_date  |  to_date
--------+--------+------------+------------
 10001 |  60117 | 1986-06-26 | 1987-06-26
 10001 |  62102 | 1987-06-26 | 1988-06-25
 10001 |  66074 | 1988-06-25 | 1989-06-25
 10001 |  66596 | 1989-06-25 | 1990-06-25
 10001 |  66961 | 1990-06-25 | 1991-06-25
 10001 |  71046 | 1991-06-25 | 1992-06-24
 10001 |  74333 | 1992-06-24 | 1993-06-24
 10001 |  75286 | 1993-06-24 | 1994-06-24
 10001 |  75994 | 1994-06-24 | 1995-06-24
 10001 |  76884 | 1995-06-24 | 1996-06-23
 10001 |  80013 | 1996-06-23 | 1997-06-23
 10001 |  81025 | 1997-06-23 | 1998-06-23
 10001 |  81097 | 1998-06-23 | 1999-06-23
 10001 |  84917 | 1999-06-23 | 2000-06-22
 10001 |  85112 | 2000-06-22 | 2001-06-22
 10001 |  85097 | 2001-06-22 | 2002-06-22
 10001 |  88958 | 2002-06-22 | 9999-01-01
(17 rows)



To see how a one-to-many relationship join impacts the outcome, let’s narrow the scope down to one employee (emp_no = 10001).

SELECT *
FROM employees
INNER JOIN salaries
  ON employees.emp_no = salaries.emp_no
WHERE employees.emp_no = 10001
  AND salaries.emp_no = 10001
;
emp_no | birth_date | first_name | last_name | gender | hire_date  | emp_no | salary | from_date  |  to_date
--------+------------+------------+-----------+--------+------------+--------+--------+------------+------------
 10001 | 1953-09-02 | Georgi     | Facello   | M      | 1986-06-26 |  10001 |  60117 | 1986-06-26 | 1987-06-26
 10001 | 1953-09-02 | Georgi     | Facello   | M      | 1986-06-26 |  10001 |  62102 | 1987-06-26 | 1988-06-25
 10001 | 1953-09-02 | Georgi     | Facello   | M      | 1986-06-26 |  10001 |  66074 | 1988-06-25 | 1989-06-25
 10001 | 1953-09-02 | Georgi     | Facello   | M      | 1986-06-26 |  10001 |  66596 | 1989-06-25 | 1990-06-25
 10001 | 1953-09-02 | Georgi     | Facello   | M      | 1986-06-26 |  10001 |  66961 | 1990-06-25 | 1991-06-25
 10001 | 1953-09-02 | Georgi     | Facello   | M      | 1986-06-26 |  10001 |  71046 | 1991-06-25 | 1992-06-24
 10001 | 1953-09-02 | Georgi     | Facello   | M      | 1986-06-26 |  10001 |  74333 | 1992-06-24 | 1993-06-24
 10001 | 1953-09-02 | Georgi     | Facello   | M      | 1986-06-26 |  10001 |  75286 | 1993-06-24 | 1994-06-24
 10001 | 1953-09-02 | Georgi     | Facello   | M      | 1986-06-26 |  10001 |  75994 | 1994-06-24 | 1995-06-24
 10001 | 1953-09-02 | Georgi     | Facello   | M      | 1986-06-26 |  10001 |  76884 | 1995-06-24 | 1996-06-23
 10001 | 1953-09-02 | Georgi     | Facello   | M      | 1986-06-26 |  10001 |  80013 | 1996-06-23 | 1997-06-23
 10001 | 1953-09-02 | Georgi     | Facello   | M      | 1986-06-26 |  10001 |  81025 | 1997-06-23 | 1998-06-23
 10001 | 1953-09-02 | Georgi     | Facello   | M      | 1986-06-26 |  10001 |  81097 | 1998-06-23 | 1999-06-23
 10001 | 1953-09-02 | Georgi     | Facello   | M      | 1986-06-26 |  10001 |  84917 | 1999-06-23 | 2000-06-22
 10001 | 1953-09-02 | Georgi     | Facello   | M      | 1986-06-26 |  10001 |  85112 | 2000-06-22 | 2001-06-22
 10001 | 1953-09-02 | Georgi     | Facello   | M      | 1986-06-26 |  10001 |  85097 | 2001-06-22 | 2002-06-22
 10001 | 1953-09-02 | Georgi     | Facello   | M      | 1986-06-26 |  10001 |  88958 | 2002-06-22 | 9999-01-01
 (17 rows)

Notice that, for the columns sourced from employees table (from the 1st emp_no column … to hire_date), the row values are duplicated. However, for the columns sourced from the salaries table (from the 2nd emp_no column … to to_date), the row values retain the same cardinality.



After constructing a JOIN query, it’s often helpful to do a sanity check to see if any rows or IDs were dropped during the JOIN. One way to check is by previewing a few rows in the output. Another way is to do a COUNT() aggregation sanity check.

A sample COUNT() aggregation sanity check:

SELECT
  SUM(1) AS num_total_rows,
  COUNT(DISTINCT employees.emp_no) AS num_emp_from_employees_table,
  COUNT(DISTINCT salaries.emp_no) AS num_emp_from_salaries_table
FROM employees
INNER JOIN salaries
  ON employees.emp_no = salaries.emp_no
;
num_total_rows | num_emp_from_employees_table | num_emp_from_salaries_table
----------------+------------------------------+-----------------------------
       2844047 |                       300024 |                      300024

Based on the output of the query above, this means that for the employees table and the salaries table, we retained all 300,024 employees in both tables after completing the INNER JOIN (no one got dropped due to the INNER JOIN). Note also, that our end result of 2,844,047 rows is the same number of rows as the original salaries table.



Exercise

Explore the tables employees and titles.


Different types of JOINs


So far, we have only focused on INNER JOINs. In reality, there are many different types of JOINs.

The most common ones are: INNER JOIN, FULL OUTER JOIN, LEFT JOIN, RIGHT JOIN, and Cartesian (CROSS) JOIN.

To illustrate the difference among the different type of JOINs, we will be using the following two dummy tables for illustrative purposes:

employees table:

emp_no | first_name | last_name
-------+------------+-----------
 10001 | Georgi     | Facello   
 10002 | Bezalel    | Simmel    
 10003 | Parto      | Bamford   
 10004 | Chirstian  | Koblick   

titles table:

emp_no |      title      
-------+-----------------
10002 | Staff           
10003 | Senior Engineer
10004 | Senior Engineer
10005 | Senior Staff




The INNER JOIN

An INNER JOIN only returns data if the value in the key column in the LEFT table also has a corresponding value in the key column in the RIGHT table.


If we apply the pseudo SQL:

SELECT *
FROM employees
INNER JOIN titles
  ON employees.emp_no = titles.emp_no
;  

Then, only the rows where the emp_no values are shared across both dummy tables are returned. Thus, the expected output would be:

emp_no | first_name | last_name | emp_no |      title
-------+------------+-----------+--------+-----------------
 10002 | Bezalel    | Simmel    | 10002  | Staff
 10003 | Parto      | Bamford   | 10003  | Senior Engineer
 10004 | Chirstian  | Koblick   | 10004  | Senior Engineer




The FULL OUTER JOIN

A FULL OUTER JOIN (sometimes referred to as just OUTER JOIN) returns all rows from both the LEFT table as well as the RIGHT table, once the join has been established.

If the value in the key column on one of the JOIN tables has no corresponding match in the key column of the other table, then the columns for the table with no match will be filled with NULL values for that particular row.


If we apply the pseudo SQL:

SELECT *
FROM employees
FULL OUTER JOIN titles
  ON employees.emp_no = titles.emp_no
;  

The expected output would be:

emp_no | first_name | last_name | emp_no |      title
-------+------------+-----------+--------+-----------------
 10001 | Georgi     | Facello   |        |
 10002 | Bezalel    | Simmel    | 10002  | Staff
 10003 | Parto      | Bamford   | 10003  | Senior Engineer
 10004 | Chirstian  | Koblick   | 10004  | Senior Engineer
       |            |           | 10005  | Senior Staff

Note here, NULL values are indicated by empty.




Exercises

Answer questions 1 and 2 based on the following queries:

SELECT *
FROM titles
FULL OUTER JOIN employees
  ON employees.emp_no = titles.emp_no
;  
SELECT *
FROM employees
FULL OUTER JOIN titles
  ON employees.emp_no = titles.emp_no
;  
  1. True or False. The above two queries will give you the same # of rows in the output.

  2. True or False. The above two queries will give you the same ordering of columns in the output.

  3. True or False. For both INNER JOINs and FULL OUTER JOINs, swapping the order of which is the LEFT table and which is the RIGHT table will NOT change the # of rows in the output.




The LEFT JOIN

A LEFT JOIN (sometimes referred to as LEFT OUTER JOIN) returns all rows from the LEFT table, once the join has been established.

Hint: Recall from previous lecture that the LEFT table is the table to the left of the JOIN keyword.

When the JOIN key column values are equal for both the LEFT and the RIGHT table, then the LEFT JOIN creates a row that populates column values for both tables. (see below: emp_no 10002, 10003, 10004)

When the JOIN key column value is found in the LEFT table but not in the RIGHT table, then the LEFT JOIN creates a row that populates column values for the LEFT table, and fills in the column values for the right table with NULLs. (see below: emp_no = 10001)

When the JOIN key column value is found in the RIGHT table but not in the LEFT table, this row is not returned in the result of the LEFT JOIN at all. (see below: emp_no = 10005)


If we apply the pseudo SQL:

SELECT *
FROM employees
LEFT JOIN titles
  ON employees.emp_no = titles.emp_no
;  

The expected output would be:

emp_no | first_name | last_name | emp_no |      title
-------+------------+-----------+--------+-----------------
 10001 | Georgi     | Facello   |        |
 10002 | Bezalel    | Simmel    | 10002  | Staff
 10003 | Parto      | Bamford   | 10003  | Senior Engineer
 10004 | Chirstian  | Koblick   | 10004  | Senior Engineer




The RIGHT JOIN

A RIGHT JOIN (sometimes referred to as RIGHT OUTER JOIN) returns all rows from the RIGHT table, once the join has been established.

Hint: Recall from previous lecture that the RIGHT table is the table to the right of the JOIN keyword.

When the JOIN key column values are equal for both the LEFT and the RIGHT table, then the RIGHT JOIN creates a row that populates column values for both tables. (see below: emp_no 10002, 10003, 10004)

When the JOIN key column value is found in the RIGHT table but not in the LEFT table, then the RIGHT JOIN creates a row that populates column values for the RIGHT table, and fills in the column values for the LEFT table with NULLs. (see below: emp_no = 10005)

When the JOIN key column value is found in the LEFT table but not in the RIGHT table, this row is not returned in the result of the RIGHT JOIN. (see below: emp_no = 10001)


If we apply the pseudo SQL:

SELECT *
FROM employees
RIGHT JOIN titles
  ON employees.emp_no = titles.emp_no
;  

The expected output would be:

emp_no | first_name | last_name | emp_no |      title
-------+------------+-----------+--------+-----------------
 10002 | Bezalel    | Simmel    | 10002  | Staff
 10003 | Parto      | Bamford   | 10003  | Senior Engineer
 10004 | Chirstian  | Koblick   | 10004  | Senior Engineer
       |            |           | 10005  | Senior Staff




Common use cases and conventions


Case #1. Every RIGHT JOIN could be re-written as a LEFT JOIN. Because of this, the convention is to always write LEFT JOINs instead of RIGHT JOINs.

For example:

SELECT employees.*, titles.*
FROM employees
RIGHT JOIN titles
  ON employees.emp_no = titles.emp_no
;  

can be re-written as:

SELECT employees.*, titles.*
FROM titles
LEFT JOIN employees
  ON employees.emp_no = titles.emp_no -- note: the key column order doesn't matter.
;  



Case #2. LEFT JOINs are often used to supplement information (columns) from the RIGHT table to the LEFT table, without dropping any rows in the LEFT table.

For example:

The dept_emp table contains information on the employee level, including the department number. However, it does not have the department name for each employee. To supplement that information, we LEFT JOIN the departments table to it, to pull in the department name (dept_name).

The SQL:

SELECT *
FROM dept_emp
LEFT JOIN departments
  ON dept_emp.dept_no = departments.dept_no
;  

outputs:

emp_no | dept_no | from_date  |  to_date   | dept_no |     dept_name
--------+---------+------------+------------+---------+--------------------
 10001 | d005    | 1986-06-26 | 9999-01-01 | d005    | Development
 10002 | d007    | 1996-08-03 | 9999-01-01 | d007    | Sales
 10003 | d004    | 1995-12-03 | 9999-01-01 | d004    | Production
 10004 | d004    | 1986-12-01 | 9999-01-01 | d004    | Production
 10005 | d003    | 1989-09-12 | 9999-01-01 | d003    | Human Resources
 10006 | d005    | 1990-08-05 | 9999-01-01 | d005    | Development
 10007 | d008    | 1989-02-10 | 9999-01-01 | d008    | Research
 10008 | d005    | 1998-03-11 | 2000-07-31 | d005    | Development
 10009 | d006    | 1985-02-18 | 9999-01-01 | d006    | Quality Management
 10010 | d004    | 1996-11-24 | 2000-06-26 | d004    | Production
 ....

Here, we use LEFT JOIN instead of INNER JOIN to ensure that just in case a department number is missing from the departments table, it does not then unintentionally filter out employees from the dept_emp table that has that department number.




Exercise

Re-write the following SQL query with a LEFT JOIN but make sure the output is still the same.

SELECT
  EXTRACT(year FROM dept_emp.from_date) AS start_year,
  COUNT(DISTINCT employees.emp_no) AS num_employees
FROM employees
RIGHT JOIN dept_emp
  ON employees.emp_no = dept_emp.emp_no
WHERE EXTRACT(year FROM employees.birth_date) BETWEEN 1950 AND 1960
GROUP BY 1
ORDER BY 1
;




The Cartesian (a.k.a CROSS) JOIN

A CROSS JOIN creates a Cartesian Product of two tables and does not have a ON clause.

If the LEFT table has m rows and the RIGHT table has n rows, the CROSS JOIN result will have m x n rows.

If we apply the pseudo SQL to our dummy tables:

SELECT *
FROM employees
CROSS JOIN titles
;  

We will create 4 x 4 = 16 rows of Cartesian Product.

The expected output would be:

emp_no | first_name | last_name | emp_no |      title
-------+------------+-----------+--------+-----------------
 10001 | Georgi     | Facello   | 10002  | Staff
 10001 | Georgi     | Facello   | 10003  | Senior Engineer
 10001 | Georgi     | Facello   | 10004  | Senior Engineer
 10001 | Georgi     | Facello   | 10005  | Senior Staff
 10002 | Bezalel    | Simmel    | 10002  | Staff
 10002 | Bezalel    | Simmel    | 10003  | Senior Engineer
 10002 | Bezalel    | Simmel    | 10004  | Senior Engineer
 10002 | Bezalel    | Simmel    | 10005  | Senior Staff
...
(16 rows)       

CROSS JOINs are not often used, except in special cases (e.g. needing a fast way to create a temporary table).




Exercises

How many rows would be in the output of the CROSS JOIN of the employees and salaries table?

SELECT *
FROM employees
CROSS JOIN salaries
;

Hint: It’s better to calculate this using arithmetic instead of actually running the query, which is computationally very expensive and may take a while.


Multiple table JOINs

Multiple table JOINs are used when the information needed for the final output lies across more than 2 tables. These JOINs can be complex and are best done after some brainstorming.



For example, assume we want to create an output that looks like this:

first_name | last_name |    dept_name
------------+-----------+--------------------
Georgi     | Facello   |Development
Bezalel    | Simmel    |Sales
Parto      | Bamford   |Production
Chirstian  | Koblick   |Production
Kyoichi    | Maliniak  |Human Resources
Anneke     | Preusig   |Development
Tzvetan    | Zielinski |Research
Saniya     | Kalloufi  |Development
Sumant     | Peac      |Quality Management
Duangkaew  | Piveteau  |Production
...


The columns we need lie across multiple tables.


first_name and last_name can be found in the employees table:

emp_no | birth_date | first_name | last_name | gender | hire_date
--------+------------+------------+-----------+--------+------------
 10001 | 1953-09-02 | Georgi     | Facello   | M      | 1986-06-26
 10002 | 1964-06-02 | Bezalel    | Simmel    | F      | 1985-11-21
 10003 | 1959-12-03 | Parto      | Bamford   | M      | 1986-08-28
 10004 | 1954-05-01 | Chirstian  | Koblick   | M      | 1986-12-01
 10005 | 1955-01-21 | Kyoichi    | Maliniak  | M      | 1989-09-12
 10006 | 1953-04-20 | Anneke     | Preusig   | F      | 1989-06-02
 10007 | 1957-05-23 | Tzvetan    | Zielinski | F      | 1989-02-10
 10008 | 1958-02-19 | Saniya     | Kalloufi  | M      | 1994-09-15
 10009 | 1952-04-19 | Sumant     | Peac      | F      | 1985-02-18
 10010 | 1963-06-01 | Duangkaew  | Piveteau  | F      | 1989-08-24
 ...
 (300024 rows)


dept_name can be found in the departments table:

dept_no |     dept_name
---------+--------------------
d001    | Marketing
d002    | Finance
d003    | Human Resources
d004    | Production
d005    | Development
d006    | Quality Management
d007    | Sales
d008    | Research
d009    | Customer Service
(9 rows)


However, there is no common column shared between the employees table and the departments table, so we do not have our JOIN key columns!


We use dept_emp table as a bridging table since it contains both emp_no and dept_no:

 emp_no | dept_no | from_date  |  to_date
--------+---------+------------+------------
 10001 | d005    | 1986-06-26 | 9999-01-01
 10002 | d007    | 1996-08-03 | 9999-01-01
 10003 | d004    | 1995-12-03 | 9999-01-01
 10004 | d004    | 1986-12-01 | 9999-01-01
 10005 | d003    | 1989-09-12 | 9999-01-01
 10006 | d005    | 1990-08-05 | 9999-01-01
 10007 | d008    | 1989-02-10 | 9999-01-01
 10008 | d005    | 1998-03-11 | 2000-07-31
 10009 | d006    | 1985-02-18 | 9999-01-01
 10010 | d004    | 1996-11-24 | 2000-06-26
 ...
 (331603 rows)


The final query is:

SELECT
  employees.first_name,
  employees.last_name,
  dept_emp.from_date,
  dept_emp.to_date,
  departments.dept_name
FROM employees
LEFT JOIN dept_emp
  ON employees.emp_no = dept_emp.emp_no
LEFT JOIN departments
  ON dept_emp.dept_no = departments.dept_no
ORDER BY employees.emp_no, dept_emp.from_date    
;




Common use cases and conventions


Case #1. Most multi table JOINs are either all INNER JOINs or all LEFT JOINs. Mixing too many different types of JOINs in one single SELECT query can be difficult to debug.



Case #2. Multi table JOINs follow two common structures.

Our example above, follows the following structure, where table b is used as a bridging table that connects table a with table c.

SELECT *
FROM table_a a
LEFT JOIN table_b b
  ON a.join_key = b.join_key
LEFT JOIN table_c c
  ON b.join_key = c.join_key
;

It is also possible to JOIN all subsequent tables to the first table indicated (table a). Here, all subsequent tables (table b, table c) are JOINed to table a.

SELECT *
FROM table_a a
LEFT JOIN table_b b
  ON a.join_key = b.join_key
LEFT JOIN table_c c
  ON a.join_key = c.join_key
;

Your choice for how to construct the JOIN depends on what you need the output to be, so it helps to brainstorm first before writing the SQL.


Breakout exercises


Ex 1. The multi table JOIN query we constructed in the lesson above (also replicated below) has duplications on the emp_no level. Investigate what is causing the duplication.

SELECT
  employees.first_name,
  employees.last_name,
  dept_emp.from_date,
  dept_emp.to_date,
  departments.dept_name
FROM employees
LEFT JOIN dept_emp
  ON employees.emp_no = dept_emp.emp_no
LEFT JOIN departments
  ON dept_emp.dept_no = departments.dept_no
ORDER BY employees.emp_no, dept_emp.from_date    
;


Ex 2. Fill in the blanks to create a working SQL query.

SELECT
  employees.first_name,
  employees.last_name,
  dept_emp.from_date,
  dept_emp.to_date,
  departments.dept_name
FROM departments
INNER JOIN dept_emp
  ON ??? = ???
INNER JOIN employees
  ON ??? = ???
ORDER BY employees.emp_no, dept_emp.from_date    
;


Ex 3. Construct a SQL query that returns the following columns:


Resources


Questions?