Home

Lectures 18 & 19: Subqueries and CTEs

Learning objective


What is a subquery?


A subquery is a query that appears inside another SQL query statement. SQL queries with subqueries are also known as nested queries or sub-select queries.

The nested subquery is also called the “inner query”, whereas the query on the outside is also referred to as the “outer query”/



A simple example

The following query finds the average salary across all employees’ salaries stored in the salaries table.

SELECT AVG(salary)
FROM salaries
;

which gives us the value: 63810.744836143706

Previously, if we had wanted to filter the salaries table so that we only retain rows where the salary is above the average salary, we would then have to solve this problem in two pieces.

Once, to calculate the value for average salary (see SQL above), and then again, to do the filtering (see SQL below).

SELECT *
FROM salaries
WHERE salary > 63810.744836143706
;

With subqueries, we could solve this in one step:

SELECT *
FROM salaries
WHERE salary > (SELECT AVG(salary) FROM salaries)
;


To note:

In addition to making the solution shorter and more elegant, subqueries like above are robust against data changes. With data addition or deletion, aggregate calculations such as averages will shift constantly. In order to upkeep the two step analysis, we would need to constantly change the hardcoded average salary value. However, by writing this using subqueries, we bypass that need.


Order of syntax and execution


Order of execution

  1. The subquery is executed first.
  2. The results of the subquery will be passed on to the outer query.
  3. The outer query is executed.

Both subquery and outer query follow the traditional syntax rules for order of execution.




Order of syntax

Depending on where the subquery is nested, the order of syntax differs, although the subquery is always wrapped in parentheses.

Below are a few common examples:


If the subquery is nested in the SELECT clause:

SELECT
  column_name,
  (SELECT column_name FROM table_name) AS alias_column_name
FROM table_name
;  



If the subquery is nested in the WHERE clause:

SELECT *
FROM table_name
WHERE column_name IN (SELECT column_name FROM table_name)
;



If the subquery is nested in the FROM clause:

SELECT *
FROM (
  SELECT
    column_name,
    column_name,
    ...
  FROM table_name    
  ) table_alias
WHERE ...  
GROUP BY ...
;



If the subquery is rewritten as a common table expression (CTE):

WITH cte_name AS (
  SELECT
    column_name,
    column_name,
    ...
  FROM table_name
  )

SELECT *
FROM cte_name
;




Style

Please see the SQL style guide for more details.


Common types of subquery


In this section, we will review some of the most common types of subqueries:

  1. Scalar subquery
  2. IN subquery
  3. Table subquery

1. The scalar subquery


A scalar subquery returns a single value. Scalar subqueries often appear in the SELECT or WHERE clause.




Scalar subquery in the WHERE clause

The following is a scalar subquery in the WHERE clause:

SELECT *
FROM salaries
WHERE salary > (SELECT AVG(salary) FROM salaries)
;

The subquery returns a single value: 63810.744836143706

SELECT AVG(salary)
FROM salaries
;

The calculated value 63810.744836143706 is then passed on to the outer query in the WHERE clause, where it is used to evaluate and filter against, line by line, the values stored in the column salary in the outer query.




Scalar subquery in the SELECT clause

The scalar subquery can also reside in SELECT clause.

Here is an example:

SELECT
  emp_no,
  salary,
  (SELECT AVG(salary) FROM salaries) AS average_salary
FROM salaries
;  

which outputs:

emp_no | salary |   average_salary
-------+--------+--------------------
 10001 |  76884 | 63810.744836143706
 10001 |  74333 | 63810.744836143706
 10001 |  85112 | 63810.744836143706
....
 10002 |  69366 | 63810.744836143706
 10002 |  71963 | 63810.744836143706
 10002 |  65828 | 63810.744836143706
....
 10003 |  43478 | 63810.744836143706
 10003 |  40006 | 63810.744836143706
 10003 |  43699 | 63810.744836143706
....
(2,844,047 rows)


Note:

  1. Because this scalar subquery is used to create a temporary column in the SELECT statement, a column alias (AS average_salary) is used.

  2. The subquery is not aggregating across emp_no, but rather operating independently of the outer query and returning the average salary universally across the entire table, thus the output retains the same number of rows (2,844,047 rows) as the original table salaries.




Using scalar subquery to calculate deviation

Scalar subqueries can also be used to answer the question that requires in-line calculations such as deviations from the norm.

For example, the following query answers the question “How much is a employee’s being over or under paid, as compared to the norm?”

SELECT
  emp_no,
  salary,
  (SELECT AVG(salary) FROM salaries WHERE to_date = '9999-01-01') AS universal_avg_salary,
  salary - (SELECT AVG(salary) FROM salaries WHERE to_date = '9999-01-01') AS diff_salary
FROM salaries
WHERE to_date = '9999-01-01'
;  

which outputs:

emp_no | salary | universal_avg_salary |     diff_salary
-------+--------+----------------------+---------------------
 10001 |  88958 |   72012.235857307058 |  16945.764142692942
 10002 |  72527 |   72012.235857307058 |    514.764142692942
 10003 |  43311 |   72012.235857307058 | -28701.235857307058
 10004 |  74057 |   72012.235857307058 |   2044.764142692942
 10005 |  94692 |   72012.235857307058 |  22679.764142692942
...

Let’s break this down step by step:

We know that the salaries table contains historical records on employee salary. In order to find a fair benchmark, we should calculate the average salary based on current salary data only. To filter this, we apply the filter to_date = '9999-01-01'.

Thus, our new average salary as earned by current employees is thus:

SELECT AVG(salary)
FROM salaries
WHERE to_date = '9999-01-01'
;

Then, we build the outer query, also taking care that we filter for present salary for each employee. Given that only the resulting value (72012.235857307058) and not the SQL logic gets transferred from the subquery to the outer query, we once again have to specify the filter to_date = '9999-01-01' in the outer query’s WHERE clause.

SELECT
  emp_no,
  salary
FROM salaries
WHERE to_date = '9999-01-01'
;  

We can add the scalar subquery to the outer query as an additional column, and call it universal_avg_salary.

SELECT
  emp_no,
  salary,
  (SELECT AVG(salary) FROM salaries WHERE to_date = '9999-01-01') AS universal_avg_salary
FROM salaries
WHERE to_date = '9999-01-01'
;  

Finally, we can take this one step further, and do in-line subtraction between the employee’s salary from the universal average salary.

SELECT
  emp_no,
  salary,
  (SELECT AVG(salary) FROM salaries WHERE to_date = '9999-01-01') AS universal_avg_salary,
  salary - (SELECT AVG(salary) FROM salaries WHERE to_date = '9999-01-01') AS diff_salary
FROM salaries
WHERE to_date = '9999-01-01'
;  

We could further simplify this by eliminating the original columns and just keep the differenced column, with some optional line breaks for readability and styling.

SELECT
  emp_no,
  salary - (
    SELECT AVG(salary)
    FROM salaries
    WHERE to_date = '9999-01-01'
    ) AS diff_salary
FROM salaries
WHERE to_date = '9999-01-01'
;  




Breakout exercises

Ex 1.

Fill in the rest of this query to find the employee with the maximum salary in the salaries table.

SELECT *
FROM salaries
WHERE salary = ()
;


Ex 2.

Fill in the rest of this query to create a new column that is the universally minimum salary across the entire company.

SELECT
  *,
  () AS universal_min_salary
FROM salaries
WHERE to_date = '9999-01-01'
;  


Ex 3.

Write a query (with subquery) that returns the employee(s) with the longest last name.


Ex 4. (Challenge)

Write a query that helps answer the following question:

“Find the employees whose salary is less than $5000 away from the highest salary in the company.”



2. IN subquery


The IN subquery returns a single column as an result.

The single column result of the subquery is passed to the outer query, most commonly in the WHERE clause, where the subquery result is compared against.

If the outer query’s value is in the set returned by the subquery, then return True, otherwise return False.




A simple example

The following IN subquery returns the salary information of the current top 3 highest earning employees.

SELECT *
FROM salaries
WHERE salary IN (
  SELECT salary  
  FROM salaries
  WHERE to_date = '9999-01-01'
  ORDER BY 1 DESC
  LIMIT 3  
  )
;

Breaking this down step by step:

This query gives us the top 3 salary using a combination of ORDER BY and LIMIT.

SELECT salary  
FROM salaries
WHERE to_date = '9999-01-01'
ORDER BY 1 DESC
LIMIT 3  
;

which outputs:

salary
--------
158220
156286
155709

Prior to learning about subquery, we could have hardcoded the values in and gotten the same results:

SELECT *
FROM salaries
WHERE salary IN (158220, 156286, 155709)
;

However, just like in the scalar subquery, hardcoding values is not reproducible when the database updates. Therefore, a dynamic query with the IN subquery is preferred for reproducible results.




Points of failure

There are two points of failure with the IN subquery type:

  1. The subquery returns more than a single column.
  2. The subquery’s return column data type does not match the data type of the column it is compared against in the outer query.

An example of point of failure # 1 is if we modified the previous example so that the subquery now returns 4 columns instead of 1.

SELECT *
FROM salaries
WHERE salary IN (
  SELECT *  -- returns 4 columns instead of 1
  FROM salaries
  WHERE to_date = '9999-01-01'
  ORDER BY salary DESC
  LIMIT 3  
  )
;

This will error out to:

ERROR:  subquery has too many columns

This is because SQL does not know which column is used to compare against the salary column in the outer query.




IN and NOT IN

The opposite of an IN subquery is the NOT IN subquery. It takes the inverse where, if the outer query’s value is in the set returned by the subquery, then return True, otherwise return False.

For example, the following query will return all salary information except the top 3 highest salary records.

SELECT *
FROM salaries
WHERE salary NOT IN (
  SELECT salary  
  FROM salaries
  WHERE to_date = '9999-01-01'
  ORDER BY 1 DESC
  LIMIT 3  
  )
;




Using the IN subquery for exploring duplications

The IN subquery structure can also be useful when exploring new tables and understanding the cardinality of table constructs.

For example, we might have noticed during a preview of the titles table that there seems to be some duplication on the emp_no level.

SELECT *
FROM titles
LIMIT 10
;

outputs:

emp_no |      title      | from_date  |  to_date
--------+-----------------+------------+------------
 10001 | Senior Engineer | 1986-06-26 | 9999-01-01
 10002 | Staff           | 1996-08-03 | 9999-01-01
 10003 | Senior Engineer | 1995-12-03 | 9999-01-01
 10004 | Engineer        | 1986-12-01 | 1995-12-01 -- dups on emp_no
 10004 | Senior Engineer | 1995-12-01 | 9999-01-01 -- dups
 10005 | Senior Staff    | 1996-09-12 | 9999-01-01 -- dups
 10005 | Staff           | 1989-09-12 | 1996-09-12 -- dups
 10006 | Senior Engineer | 1990-08-05 | 9999-01-01
 10007 | Senior Staff    | 1996-02-11 | 9999-01-01 -- dups
 10007 | Staff           | 1989-02-10 | 1996-02-11 -- dups

To isolate ALL of the records where duplication occurs, we can do so using the IN subquery below:

SELECT *
FROM titles
WHERE emp_no IN (
  SELECT emp_no
  FROM titles
  GROUP BY 1
  HAVING SUM(1) > 1
  )
;

Breaking this down step by step, the following query creates a single column / list of emp_no which occurs more than once in the titles table.

SELECT emp_no
FROM titles
GROUP BY 1
HAVING SUM(1) > 1
;

We pass the result of the subquery above into the outer query in order to filter for in the original titles table only emp_no that occur more than once:

emp_no |       title        | from_date  |  to_date
--------+--------------------+------------+------------
 10005 | Senior Staff       | 1996-09-12 | 9999-01-01
 10005 | Staff              | 1989-09-12 | 1996-09-12
 10017 | Senior Staff       | 2000-08-03 | 9999-01-01
 10017 | Staff              | 1993-08-03 | 2000-08-03
 10018 | Engineer           | 1987-04-03 | 1995-04-03
 10018 | Senior Engineer    | 1995-04-03 | 9999-01-01
... (283,554 rows)




Breakout exercises

Ex 1. Return the emp_no that have hold the 5 most common job titles in the company, using the titles table.

Starter code:

SELECT emp_no
FROM titles
WHERE title IN ()
;


Ex 2. Return the first and last name of the employees who have one of the 10 most common first names in the company, using the employees table.


Ex 3. (Challenge)

Using the salaries table, find the current average salary, after removing the 10 lowest salaries and the 10 highest salaries. (Aside: in statistical analysis, this method is also referred to as “capping”).


3. Table subquery


In table subqueries, the inner query returns a table. The outer query treats the result of the subquery as data source. Thus, table subqueries can only be used in the FROM clause of the outer query.




A simple example

The following is a query that contains a table subquery:

SELECT
  max_salary.emp_no,
  max_salary.salary
FROM (
  SELECT
    emp_no,
    MAX(salary) AS salary
  FROM salaries
  GROUP BY 1
  ) max_salary
WHERE max_salary.emp_no BETWEEN 10001 AND 10005  
;

which outputs:

emp_no | salary
--------+--------
 10001 |  88958
 10002 |  72527
 10003 |  43699
 10004 |  74057
 10005 |  94692
(5 rows)

Several things worth noting:

  1. As a table subquery, the output can have multiple columns. In this case, there are two columns, referenced by the FROM clause in the outer query.
  2. The table created by the subquery must have a table alias. Here, we give it the table alias max_salary. Once the table alias is established, the columns from the inner query can then be referenced using this table alias (e.g. max_salary.emp_no).




Reminder: For simple manipulations, sometimes table subqueries are not always necessary. For example, the SQL from our simple example:

SELECT
  max_salary.emp_no,
  max_salary.salary
FROM (
  SELECT
    emp_no,
    MAX(salary) AS salary
  FROM salaries
  GROUP BY 1
  ) max_salary
WHERE max_salary.emp_no BETWEEN 10001 AND 10005  
;

can be rewritten without subquery at all:

SELECT
  emp_no,
  MAX(salary) AS salary
FROM salaries
WHERE emp_no BETWEEN 10001 AND 10005  
GROUP BY 1
;

which still gives the same output:

emp_no | salary
--------+--------
 10001 |  88958
 10002 |  72527
 10003 |  43699
 10004 |  74057
 10005 |  94692
(5 rows)

Table subqueries are more useful when we need to do more advanced manipulations. For example, in the following query, we want to join two tables together, but one of the tables (salaries) requires more manipulation (e.g. aggregation) before it was ready for the join.

Here, it is simpler to use table subqueries to aggregate the salaries table first, before joining it to the employees table.

SELECT *

FROM employees

INNER JOIN (
  SELECT
    emp_no,
    MAX(salary) AS salary
  FROM salaries
  GROUP BY 1
  ) max_salary

  ON employees.emp_no = max_salary.emp_no
;

outputs:

emp_no | birth_date |   first_name   |    last_name     | gender | hire_date  | emp_no | salary
--------+------------+----------------+------------------+--------+------------+--------+--------
 10001 | 1953-09-02 | Georgi         | Facello          | M      | 1986-06-26 |  10001 |  88958
 10005 | 1955-01-21 | Kyoichi        | Maliniak         | M      | 1989-09-12 |  10005 |  94692
 10010 | 1963-06-01 | Duangkaew      | Piveteau         | F      | 1989-08-24 |  10010 |  80324
 10011 | 1953-11-07 | Mary           | Sluis            | F      | 1990-01-22 |  10011 |  56753
... (300024 rows)




Breakout exercises

Ex 1. In the following SQL query, the table subquery is actually not needed. Simplify and rewrite the query and eliminate the subquery, but retain the same output.

SELECT *
FROM (
  SELECT
    title,
    SUM(1) AS num_records
  FROM titles
  GROUP BY 1
  ) titles_count
WHERE titles_count.num_records > 1000
;


Ex 2. Fill in the blanks to get a working query.

SELECT
  employees_short.emp_no,
  ???.first_name,
  ???.last_name,
  ???.salary

FROM (
  SELECT
    emp_no,
    first_name,
    last_name
  FROM employees
  WHERE emp_no BETWEEN 10001 AND 10010
  ) employees_short

INNER JOIN (
  SELECT
    emp_no,
    MAX(salary) AS salary
  FROM salaries
  GROUP BY 1
  ) max_salary

  ON ???.emp_no = ???.emp_no

ORDER BY
  ???.first_name,
  ???.last_name
;


Ex 3. The table subquery Ex 2 is also not absolutely necessary. Simplify and rewrite the query and eliminate the subquery, but retain the same output.



Common table expressions (CTEs)

Common table expressions (CTEs) are temporary tables that exists only during the execution of the SQL query.

CTEs are created at the beginning of the query, starting with the keyword WITH followed by the CTE alias. Once this is established, the temporary table can be referenced like any other table in the same SQL query.

However, note that not all SQL server languages support CTEs. PostgreSQL does, and so does newer versions of MySQL. In the SQL variants that do support CTEs, there is generally a preference to write table subqueries as CTEs instead.

In general, CTEs are preferred over nested table subquery format because:

  1. CTEs are more readable.
  2. CTEs can be re-used multiple times in the same query.




A simple example

A SQL containing a table subquery:

SELECT *
FROM (
  SELECT
    emp_no,
    MAX(salary) AS salary
  FROM salaries
  GROUP BY 1
  ) max_salary
;

can be re-written in CTE format, as thus:

WITH max_salary AS (
  SELECT
    emp_no,
    MAX(salary) AS salary
  FROM salaries
  GROUP BY 1
  )

SELECT *
FROM max_salary
;  

To note:

  1. Naming the CTE as max_salary is the equivalent of giving it a table alias. The table alias can then be carried over into subsequent queries.
  2. By moving the table subquery up to the beginning, this improves readability.
  3. The CTE max_salary can be referred to more than once in the same query for other manipulations.




The more complex the query, the more useful CTEs can be.

Below is a SQL query that has two table subqueries:

SELECT
  employees_short.emp_no,
  employees_short.first_name,
  employees_short.last_name,
  max_salary.salary

FROM (
  SELECT
    emp_no,
    first_name,
    last_name
  FROM employees
  WHERE emp_no BETWEEN 10001 AND 10010
  ) employees_short

INNER JOIN (
  SELECT
    emp_no,
    MAX(salary) AS salary
  FROM salaries
  GROUP BY 1
  ) max_salary

  ON employees_short.emp_no = max_salary.emp_no

ORDER BY
  employees_short.first_name,
  employees_short.last_name
;

The above SQL query can be rewritten with CTEs like so:

WITH employees_short AS (
  SELECT
    emp_no,
    first_name,
    last_name
  FROM employees
  WHERE emp_no BETWEEN 10001 AND 10010  
  )

, max_salary AS (
  SELECT
    emp_no,
    MAX(salary) AS salary
  FROM salaries
  GROUP BY 1
  )

SELECT
  employees_short.emp_no,
  employees_short.first_name,
  employees_short.last_name,
  max_salary.salary
FROM employees_short
INNER JOIN max_salary
  ON employees_short.emp_no = max_salary.emp_no
ORDER BY
  employees_short.first_name,
  employees_short.last_name
;

To note: The 2nd CTE created (max_salary) does not need another WITH keyword. Rather, it is prefaced by a comma, to show that this is is one of many CTEs.




Style

Please see the SQL style guide for more details on how to format SQL queries with multiple CTEs.




Rewriting IN subqueries with CTEs

IN subqueries can also be rewritten as CTEs in conjunction with INNER JOINs. In some SQL variants, the CTE optin optimizes query performance and is preferred over using IN subqueries..

Consider our example from a previous breakout exercise:

Return the emp_no that have hold the 5 most common job titles in the company, using the titles table.

In IN subquery format, this can be written as:

SELECT titles.emp_no
FROM titles
WHERE title IN (
  SELECT title
  FROM titles
  GROUP BY 1
  ORDER BY COUNT(*) DESC
  LIMIT 5
  )
;

To rewrite in CTE format, we first create a CTE for the subquery. However, since CTEs create tables, not lists, we cannot pass the CTE into the IN. Instead, we use INNER JOIN to join the newly created CTE top_5_titles against the source table titles to mimic the filtering effect:

WITH top_5_titles AS (
  SELECT title
  FROM titles
  GROUP BY 1
  ORDER BY COUNT(*) DESC
  LIMIT 5
  )

SELECT titles.emp_no
FROM titles
INNER JOIN top_5_titles
  ON titles.title = top_5_titles.title
;




Breakout exercises

Ex 1. Rewrite the following query in CTE format.

SELECT *
FROM (
  SELECT
    title,
    SUM(1) AS num_records
  FROM titles
  GROUP BY 1
  ) titles_count
WHERE titles_count.num_records > 1000
;


Ex 2. Rewrite the following query in CTE format.

SELECT *

FROM (
  SELECT *
  FROM employees
  WHERE gender = 'F'  
  ) employees_female  

INNER JOIN (
  SELECT
    emp_no,
    salary
  FROM salaries
  WHERE to_date = '9999-01-01'
  ) salary_recent
  ON employees_female.emp_no = salary_recent.emp_no

INNER JOIN (
  SELECT
    emp_no,
    title
  FROM titles
  WHERE to_date = '9999-01-01'
  ) titles_recent
  ON employees_female.emp_no = titles_recent.emp_no
;


Ex 3. Return the first and last name of the employees who have one of the 10 most common first names in the company, using the employees table. Write this in CTE format, not inIN subquery format.


Correlated subqueries


A correlated subquery is a subquery that references a column from outside that subquery. (Source: Google BigQuery documentation).

Unlike most other subqueries, correlated subqueries do not operate independently of the outer query. As a result, the correlation prevents reusing of the subquery result.

Note: Correlated subqueries can be computationally very costly and should be generally avoided.




A simple example:

SELECT
  emp_no,
  (SELECT MAX(salary) FROM salaries WHERE employees.emp_no = salaries.emp_no) AS max_salary
FROM employees
;

outputs:

emp_no |  max_salary
--------+-------
 10001 | 88958
 10002 | 72527
 10003 | 43699
 10004 | 74057
 10005 | 94692
...

Note that what makes this a correlated subquery is the employees.emp_no = salaries.emp_no, which makes it impossible for the inner query to operate on its own.

However, in terms of readability and computation performance, this correlated subquery is better off been rewritten as a JOIN instead, like below:

SELECT
  employees.emp_no,
  MAX(salary) AS max_salary
FROM employees
INNER JOIN salaries
  ON employees.emp_no = salaries.emp_no
GROUP BY 1
;




Breakout exercises

Ex 1. Write a query that returns the first name, last name, and hire date of every female manager. For this problem, an empolyee is defined as a manager if the employee existing in the dept_manager table.

Write this query using CTEs.


Ex 2. Repeat Ex 1, but write the query using no subqueries, just JOINs.


Ex 3. Write a query that returns the first name, last name, and salary of every level of engineer (e.g. Engineer, Senior Engineer, etc) who was hired after 1970.

Write the query using CTEs.


Ex 4. Repeat Ex 3, but write the query using no subqueries, just JOINs.


Ex 5. Write a query that returns the first name, last name, and hire date of every employee who has worked in the Marketing department who makes more than $30k in salary.

Write this query using CTEs.


Ex 6. Repeat Ex 5, but write using no subqueries, just JOINs.


Resources


Wrapping up