A subquery is a query that appears inside another SQL query statement. SQL queries with subqueries are also known as nested queries or sub-select queries.
The nested subquery is also called the “inner query”, whereas the query on the outside is also referred to as the “outer query”/
A simple example
The following query finds the average salary across all employees’ salaries stored in the salaries
table.
SELECT AVG(salary)
FROM salaries
;
which gives us the value: 63810.744836143706
Previously, if we had wanted to filter the salaries
table so that we only retain rows where the salary is above the average salary, we would then have to solve this problem in two pieces.
Once, to calculate the value for average salary (see SQL above), and then again, to do the filtering (see SQL below).
SELECT *
FROM salaries
WHERE salary > 63810.744836143706
;
With subqueries, we could solve this in one step:
SELECT *
FROM salaries
WHERE salary > (SELECT AVG(salary) FROM salaries)
;
To note:
In addition to making the solution shorter and more elegant, subqueries like above are robust against data changes. With data addition or deletion, aggregate calculations such as averages will shift constantly. In order to upkeep the two step analysis, we would need to constantly change the hardcoded average salary value. However, by writing this using subqueries, we bypass that need.
Order of execution
Both subquery and outer query follow the traditional syntax rules for order of execution.
Order of syntax
Depending on where the subquery is nested, the order of syntax differs, although the subquery is always wrapped in parentheses.
Below are a few common examples:
If the subquery is nested in the SELECT
clause:
SELECT
column_name,
(SELECT column_name FROM table_name) AS alias_column_name
FROM table_name
;
If the subquery is nested in the WHERE
clause:
SELECT *
FROM table_name
WHERE column_name IN (SELECT column_name FROM table_name)
;
If the subquery is nested in the FROM
clause:
SELECT *
FROM (
SELECT
column_name,
column_name,
...
FROM table_name
) table_alias
WHERE ...
GROUP BY ...
;
If the subquery is rewritten as a common table expression (CTE):
WITH cte_name AS (
SELECT
column_name,
column_name,
...
FROM table_name
)
SELECT *
FROM cte_name
;
Style
Please see the SQL style guide for more details.
In this section, we will review some of the most common types of subqueries:
IN
subqueryA scalar subquery returns a single value. Scalar subqueries often appear in the SELECT
or WHERE
clause.
Scalar subquery in the WHERE
clause
The following is a scalar subquery in the WHERE
clause:
SELECT *
FROM salaries
WHERE salary > (SELECT AVG(salary) FROM salaries)
;
The subquery returns a single value: 63810.744836143706
SELECT AVG(salary)
FROM salaries
;
The calculated value 63810.744836143706
is then passed on to the outer query in the WHERE
clause, where it is used to evaluate and filter against, line by line, the values stored in the column salary
in the outer query.
Scalar subquery in the SELECT
clause
The scalar subquery can also reside in SELECT
clause.
Here is an example:
SELECT
emp_no,
salary,
(SELECT AVG(salary) FROM salaries) AS average_salary
FROM salaries
;
which outputs:
emp_no | salary | average_salary
-------+--------+--------------------
10001 | 76884 | 63810.744836143706
10001 | 74333 | 63810.744836143706
10001 | 85112 | 63810.744836143706
....
10002 | 69366 | 63810.744836143706
10002 | 71963 | 63810.744836143706
10002 | 65828 | 63810.744836143706
....
10003 | 43478 | 63810.744836143706
10003 | 40006 | 63810.744836143706
10003 | 43699 | 63810.744836143706
....
(2,844,047 rows)
Note:
Because this scalar subquery is used to create a temporary column in the SELECT
statement, a column alias (AS average_salary
) is used.
The subquery is not aggregating across emp_no
, but rather operating independently of the outer query and returning the average salary universally across the entire table, thus the output retains the same number of rows (2,844,047 rows) as the original table salaries
.
Using scalar subquery to calculate deviation
Scalar subqueries can also be used to answer the question that requires in-line calculations such as deviations from the norm.
For example, the following query answers the question “How much is a employee’s being over or under paid, as compared to the norm?”
SELECT
emp_no,
salary,
(SELECT AVG(salary) FROM salaries WHERE to_date = '9999-01-01') AS universal_avg_salary,
salary - (SELECT AVG(salary) FROM salaries WHERE to_date = '9999-01-01') AS diff_salary
FROM salaries
WHERE to_date = '9999-01-01'
;
which outputs:
emp_no | salary | universal_avg_salary | diff_salary
-------+--------+----------------------+---------------------
10001 | 88958 | 72012.235857307058 | 16945.764142692942
10002 | 72527 | 72012.235857307058 | 514.764142692942
10003 | 43311 | 72012.235857307058 | -28701.235857307058
10004 | 74057 | 72012.235857307058 | 2044.764142692942
10005 | 94692 | 72012.235857307058 | 22679.764142692942
...
Let’s break this down step by step:
We know that the salaries
table contains historical records on employee salary. In order to find a fair benchmark, we should calculate the average salary based on current salary data only. To filter this, we apply the filter to_date = '9999-01-01'
.
Thus, our new average salary as earned by current employees is thus:
SELECT AVG(salary)
FROM salaries
WHERE to_date = '9999-01-01'
;
Then, we build the outer query, also taking care that we filter for present salary for each employee. Given that only the resulting value (72012.235857307058
) and not the SQL logic gets transferred from the subquery to the outer query, we once again have to specify the filter to_date = '9999-01-01'
in the outer query’s WHERE
clause.
SELECT
emp_no,
salary
FROM salaries
WHERE to_date = '9999-01-01'
;
We can add the scalar subquery to the outer query as an additional column, and call it universal_avg_salary
.
SELECT
emp_no,
salary,
(SELECT AVG(salary) FROM salaries WHERE to_date = '9999-01-01') AS universal_avg_salary
FROM salaries
WHERE to_date = '9999-01-01'
;
Finally, we can take this one step further, and do in-line subtraction between the employee’s salary from the universal average salary.
SELECT
emp_no,
salary,
(SELECT AVG(salary) FROM salaries WHERE to_date = '9999-01-01') AS universal_avg_salary,
salary - (SELECT AVG(salary) FROM salaries WHERE to_date = '9999-01-01') AS diff_salary
FROM salaries
WHERE to_date = '9999-01-01'
;
We could further simplify this by eliminating the original columns and just keep the differenced column, with some optional line breaks for readability and styling.
SELECT
emp_no,
salary - (
SELECT AVG(salary)
FROM salaries
WHERE to_date = '9999-01-01'
) AS diff_salary
FROM salaries
WHERE to_date = '9999-01-01'
;
Breakout exercises
Ex 1.
Fill in the rest of this query to find the employee with the maximum salary in the salaries
table.
SELECT *
FROM salaries
WHERE salary = ()
;
Ex 2.
Fill in the rest of this query to create a new column that is the universally minimum salary across the entire company.
SELECT
*,
() AS universal_min_salary
FROM salaries
WHERE to_date = '9999-01-01'
;
Ex 3.
Write a query (with subquery) that returns the employee(s) with the longest last name.
Ex 4. (Challenge)
Write a query that helps answer the following question:
“Find the employees whose salary is less than $5000 away from the highest salary in the company.”
IN
subqueryThe IN
subquery returns a single column as an result.
The single column result of the subquery is passed to the outer query, most commonly in the WHERE
clause, where the subquery result is compared against.
If the outer query’s value is in the set returned by the subquery, then return True
, otherwise return False
.
A simple example
The following IN
subquery returns the salary information of the current top 3 highest earning employees.
SELECT *
FROM salaries
WHERE salary IN (
SELECT salary
FROM salaries
WHERE to_date = '9999-01-01'
ORDER BY 1 DESC
LIMIT 3
)
;
Breaking this down step by step:
This query gives us the top 3 salary using a combination of ORDER BY
and LIMIT
.
SELECT salary
FROM salaries
WHERE to_date = '9999-01-01'
ORDER BY 1 DESC
LIMIT 3
;
which outputs:
salary
--------
158220
156286
155709
Prior to learning about subquery, we could have hardcoded the values in and gotten the same results:
SELECT *
FROM salaries
WHERE salary IN (158220, 156286, 155709)
;
However, just like in the scalar subquery, hardcoding values is not reproducible when the database updates. Therefore, a dynamic query with the IN
subquery is preferred for reproducible results.
Points of failure
There are two points of failure with the IN
subquery type:
An example of point of failure # 1 is if we modified the previous example so that the subquery now returns 4 columns instead of 1.
SELECT *
FROM salaries
WHERE salary IN (
SELECT * -- returns 4 columns instead of 1
FROM salaries
WHERE to_date = '9999-01-01'
ORDER BY salary DESC
LIMIT 3
)
;
This will error out to:
ERROR: subquery has too many columns
This is because SQL does not know which column is used to compare against the salary
column in the outer query.
IN
and NOT IN
The opposite of an IN
subquery is the NOT IN
subquery. It takes the inverse
where, if the outer query’s value is in the set returned by the subquery, then return True
, otherwise return False
.
For example, the following query will return all salary information except the top 3 highest salary records.
SELECT *
FROM salaries
WHERE salary NOT IN (
SELECT salary
FROM salaries
WHERE to_date = '9999-01-01'
ORDER BY 1 DESC
LIMIT 3
)
;
Using the IN
subquery for exploring duplications
The IN
subquery structure can also be useful when exploring new tables and understanding the cardinality of table constructs.
For example, we might have noticed during a preview of the titles
table that there seems to be some duplication on the emp_no
level.
SELECT *
FROM titles
LIMIT 10
;
outputs:
emp_no | title | from_date | to_date
--------+-----------------+------------+------------
10001 | Senior Engineer | 1986-06-26 | 9999-01-01
10002 | Staff | 1996-08-03 | 9999-01-01
10003 | Senior Engineer | 1995-12-03 | 9999-01-01
10004 | Engineer | 1986-12-01 | 1995-12-01 -- dups on emp_no
10004 | Senior Engineer | 1995-12-01 | 9999-01-01 -- dups
10005 | Senior Staff | 1996-09-12 | 9999-01-01 -- dups
10005 | Staff | 1989-09-12 | 1996-09-12 -- dups
10006 | Senior Engineer | 1990-08-05 | 9999-01-01
10007 | Senior Staff | 1996-02-11 | 9999-01-01 -- dups
10007 | Staff | 1989-02-10 | 1996-02-11 -- dups
To isolate ALL of the records where duplication occurs, we can do so using the IN
subquery below:
SELECT *
FROM titles
WHERE emp_no IN (
SELECT emp_no
FROM titles
GROUP BY 1
HAVING SUM(1) > 1
)
;
Breaking this down step by step, the following query creates a single column / list of emp_no
which occurs more than once in the titles
table.
SELECT emp_no
FROM titles
GROUP BY 1
HAVING SUM(1) > 1
;
We pass the result of the subquery above into the outer query in order to filter for in the original titles
table only emp_no
that occur more than once:
emp_no | title | from_date | to_date
--------+--------------------+------------+------------
10005 | Senior Staff | 1996-09-12 | 9999-01-01
10005 | Staff | 1989-09-12 | 1996-09-12
10017 | Senior Staff | 2000-08-03 | 9999-01-01
10017 | Staff | 1993-08-03 | 2000-08-03
10018 | Engineer | 1987-04-03 | 1995-04-03
10018 | Senior Engineer | 1995-04-03 | 9999-01-01
... (283,554 rows)
Breakout exercises
Ex 1. Return the emp_no
that have hold the 5 most common job titles in the company, using the titles
table.
Starter code:
SELECT emp_no
FROM titles
WHERE title IN ()
;
Ex 2. Return the first and last name of the employees who have one of the 10 most common first names in the company, using the employees
table.
Ex 3. (Challenge)
Using the salaries
table, find the current average salary, after removing the 10 lowest salaries and the 10 highest salaries. (Aside: in statistical analysis, this method is also referred to as “capping”).
In table subqueries, the inner query returns a table. The outer query treats the result of the subquery as data source. Thus, table subqueries can only be used in the FROM
clause of the outer query.
A simple example
The following is a query that contains a table subquery:
SELECT
max_salary.emp_no,
max_salary.salary
FROM (
SELECT
emp_no,
MAX(salary) AS salary
FROM salaries
GROUP BY 1
) max_salary
WHERE max_salary.emp_no BETWEEN 10001 AND 10005
;
which outputs:
emp_no | salary
--------+--------
10001 | 88958
10002 | 72527
10003 | 43699
10004 | 74057
10005 | 94692
(5 rows)
Several things worth noting:
FROM
clause in the outer query.max_salary
. Once the table alias is established, the columns from the inner query can then be referenced using this table alias (e.g. max_salary.emp_no
).
Reminder: For simple manipulations, sometimes table subqueries are not always necessary. For example, the SQL from our simple example:
SELECT
max_salary.emp_no,
max_salary.salary
FROM (
SELECT
emp_no,
MAX(salary) AS salary
FROM salaries
GROUP BY 1
) max_salary
WHERE max_salary.emp_no BETWEEN 10001 AND 10005
;
can be rewritten without subquery at all:
SELECT
emp_no,
MAX(salary) AS salary
FROM salaries
WHERE emp_no BETWEEN 10001 AND 10005
GROUP BY 1
;
which still gives the same output:
emp_no | salary
--------+--------
10001 | 88958
10002 | 72527
10003 | 43699
10004 | 74057
10005 | 94692
(5 rows)
Table subqueries are more useful when we need to do more advanced manipulations. For example, in the following query, we want to join two tables together, but one of the tables (salaries
) requires more manipulation (e.g. aggregation) before it was ready for the join.
Here, it is simpler to use table subqueries to aggregate the salaries
table first, before joining it to the employees
table.
SELECT *
FROM employees
INNER JOIN (
SELECT
emp_no,
MAX(salary) AS salary
FROM salaries
GROUP BY 1
) max_salary
ON employees.emp_no = max_salary.emp_no
;
outputs:
emp_no | birth_date | first_name | last_name | gender | hire_date | emp_no | salary
--------+------------+----------------+------------------+--------+------------+--------+--------
10001 | 1953-09-02 | Georgi | Facello | M | 1986-06-26 | 10001 | 88958
10005 | 1955-01-21 | Kyoichi | Maliniak | M | 1989-09-12 | 10005 | 94692
10010 | 1963-06-01 | Duangkaew | Piveteau | F | 1989-08-24 | 10010 | 80324
10011 | 1953-11-07 | Mary | Sluis | F | 1990-01-22 | 10011 | 56753
... (300024 rows)
Breakout exercises
Ex 1. In the following SQL query, the table subquery is actually not needed. Simplify and rewrite the query and eliminate the subquery, but retain the same output.
SELECT *
FROM (
SELECT
title,
SUM(1) AS num_records
FROM titles
GROUP BY 1
) titles_count
WHERE titles_count.num_records > 1000
;
Ex 2. Fill in the blanks to get a working query.
SELECT
employees_short.emp_no,
???.first_name,
???.last_name,
???.salary
FROM (
SELECT
emp_no,
first_name,
last_name
FROM employees
WHERE emp_no BETWEEN 10001 AND 10010
) employees_short
INNER JOIN (
SELECT
emp_no,
MAX(salary) AS salary
FROM salaries
GROUP BY 1
) max_salary
ON ???.emp_no = ???.emp_no
ORDER BY
???.first_name,
???.last_name
;
Ex 3. The table subquery Ex 2 is also not absolutely necessary. Simplify and rewrite the query and eliminate the subquery, but retain the same output.
Common table expressions (CTEs) are temporary tables that exists only during the execution of the SQL query.
CTEs are created at the beginning of the query, starting with the keyword WITH
followed by the CTE alias. Once this is established, the temporary table can be referenced like any other table in the same SQL query.
However, note that not all SQL server languages support CTEs. PostgreSQL does, and so does newer versions of MySQL. In the SQL variants that do support CTEs, there is generally a preference to write table subqueries as CTEs instead.
In general, CTEs are preferred over nested table subquery format because:
A simple example
A SQL containing a table subquery:
SELECT *
FROM (
SELECT
emp_no,
MAX(salary) AS salary
FROM salaries
GROUP BY 1
) max_salary
;
can be re-written in CTE format, as thus:
WITH max_salary AS (
SELECT
emp_no,
MAX(salary) AS salary
FROM salaries
GROUP BY 1
)
SELECT *
FROM max_salary
;
To note:
max_salary
is the equivalent of giving it a table alias. The table alias can then be carried over into subsequent queries.max_salary
can be referred to more than once in the same query for other manipulations.
The more complex the query, the more useful CTEs can be.
Below is a SQL query that has two table subqueries:
SELECT
employees_short.emp_no,
employees_short.first_name,
employees_short.last_name,
max_salary.salary
FROM (
SELECT
emp_no,
first_name,
last_name
FROM employees
WHERE emp_no BETWEEN 10001 AND 10010
) employees_short
INNER JOIN (
SELECT
emp_no,
MAX(salary) AS salary
FROM salaries
GROUP BY 1
) max_salary
ON employees_short.emp_no = max_salary.emp_no
ORDER BY
employees_short.first_name,
employees_short.last_name
;
The above SQL query can be rewritten with CTEs like so:
WITH employees_short AS (
SELECT
emp_no,
first_name,
last_name
FROM employees
WHERE emp_no BETWEEN 10001 AND 10010
)
, max_salary AS (
SELECT
emp_no,
MAX(salary) AS salary
FROM salaries
GROUP BY 1
)
SELECT
employees_short.emp_no,
employees_short.first_name,
employees_short.last_name,
max_salary.salary
FROM employees_short
INNER JOIN max_salary
ON employees_short.emp_no = max_salary.emp_no
ORDER BY
employees_short.first_name,
employees_short.last_name
;
To note: The 2nd CTE created (max_salary
) does not need another WITH
keyword. Rather, it is prefaced by a comma, to show that this is is one of many CTEs.
Style
Please see the SQL style guide for more details on how to format SQL queries with multiple CTEs.
Rewriting IN
subqueries with CTEs
IN
subqueries can also be rewritten as CTEs in conjunction with INNER JOIN
s. In some SQL variants, the CTE optin optimizes query performance and is preferred over using IN
subqueries..
Consider our example from a previous breakout exercise:
Return the emp_no
that have hold the 5 most common job titles in the company, using the titles
table.
In IN
subquery format, this can be written as:
SELECT titles.emp_no
FROM titles
WHERE title IN (
SELECT title
FROM titles
GROUP BY 1
ORDER BY COUNT(*) DESC
LIMIT 5
)
;
To rewrite in CTE format, we first create a CTE for the subquery. However, since CTEs create tables, not lists, we cannot pass the CTE into the IN
. Instead, we use INNER JOIN
to join the newly created CTE top_5_titles
against the source table titles
to mimic the filtering effect:
WITH top_5_titles AS (
SELECT title
FROM titles
GROUP BY 1
ORDER BY COUNT(*) DESC
LIMIT 5
)
SELECT titles.emp_no
FROM titles
INNER JOIN top_5_titles
ON titles.title = top_5_titles.title
;
Breakout exercises
Ex 1. Rewrite the following query in CTE format.
SELECT *
FROM (
SELECT
title,
SUM(1) AS num_records
FROM titles
GROUP BY 1
) titles_count
WHERE titles_count.num_records > 1000
;
Ex 2. Rewrite the following query in CTE format.
SELECT *
FROM (
SELECT *
FROM employees
WHERE gender = 'F'
) employees_female
INNER JOIN (
SELECT
emp_no,
salary
FROM salaries
WHERE to_date = '9999-01-01'
) salary_recent
ON employees_female.emp_no = salary_recent.emp_no
INNER JOIN (
SELECT
emp_no,
title
FROM titles
WHERE to_date = '9999-01-01'
) titles_recent
ON employees_female.emp_no = titles_recent.emp_no
;
Ex 3. Return the first and last name of the employees who have one of the 10 most common first names in the company, using the employees
table. Write this in CTE format, not inIN
subquery format.
A correlated subquery is a subquery that references a column from outside that subquery. (Source: Google BigQuery documentation).
Unlike most other subqueries, correlated subqueries do not operate independently of the outer query. As a result, the correlation prevents reusing of the subquery result.
Note: Correlated subqueries can be computationally very costly and should be generally avoided.
A simple example:
SELECT
emp_no,
(SELECT MAX(salary) FROM salaries WHERE employees.emp_no = salaries.emp_no) AS max_salary
FROM employees
;
outputs:
emp_no | max_salary
--------+-------
10001 | 88958
10002 | 72527
10003 | 43699
10004 | 74057
10005 | 94692
...
Note that what makes this a correlated subquery is the employees.emp_no = salaries.emp_no
, which makes it impossible for the inner query to operate on its own.
However, in terms of readability and computation performance, this correlated subquery is better off been rewritten as a JOIN
instead, like below:
SELECT
employees.emp_no,
MAX(salary) AS max_salary
FROM employees
INNER JOIN salaries
ON employees.emp_no = salaries.emp_no
GROUP BY 1
;
Breakout exercises
Ex 1. Write a query that returns the first name, last name, and hire date of every female manager. For this problem, an empolyee is defined as a manager if the employee existing in the dept_manager
table.
Write this query using CTEs.
Ex 2. Repeat Ex 1, but write the query using no subqueries, just JOIN
s.
Ex 3. Write a query that returns the first name, last name, and salary of every level of engineer (e.g. Engineer, Senior Engineer, etc) who was hired after 1970.
Write the query using CTEs.
Ex 4. Repeat Ex 3, but write the query using no subqueries, just JOIN
s.
Ex 5. Write a query that returns the first name, last name, and hire date of every employee who has worked in the Marketing department who makes more than $30k in salary.
Write this query using CTEs.
Ex 6. Repeat Ex 5, but write using no subqueries, just JOIN
s.