SQL Interview Preparation
Tables
orders — Orders (status, amount, date)
order_items — Order items (qty, unit price)
products — Products (name, price, stock, brand)
categories — Categories (parent-child hierarchy)
customers — Customers (grade, points, channel)
reviews — Reviews (rating, content)
product_views — View log (customer, product, date)
calendar — Calendar (weekday, holiday)
staff — Staff (dept, role, manager)
Concepts
DENSE_RANK, Consecutive N Days, Median, Session Analysis, MoM Growth Rate, Percentile, Working Days, Organization Chart Recursion
1. Second highest selling product ★★★
Find the product with the second highest total sales among all products.
Use DENSE_RANK because there may be multiple products with the same sales as the product with the highest sales.
Presentation Frequency: ★★★ (Very High) — Amazon, LeetCode #176 variant
| product_name | category | total_revenue | revenue_rank |
|---|---|---|---|
| ... | ... | ... | 2 |
Hint 1: - DENSE_RANK() OVER (ORDER BY total_revenue DESC): Ranking where ties are allowed
- Sum up sales by product in CTE → Ranking → WHERE rank = 2
Answer
WITH product_revenue AS (
SELECT
p.id,
p.name AS product_name,
cat.name AS category,
SUM(oi.quantity * oi.unit_price) AS total_revenue
FROM order_items AS oi
JOIN orders AS o ON oi.order_id = o.id
JOIN products AS p ON oi.product_id = p.id
JOIN categories AS cat ON p.category_id = cat.id
WHERE o.status NOT IN ('cancelled')
GROUP BY p.id, p.name, cat.name
),
ranked AS (
SELECT *,
DENSE_RANK() OVER (ORDER BY total_revenue DESC) AS revenue_rank
FROM product_revenue
)
SELECT product_name, category,
CAST(total_revenue AS INTEGER) AS total_revenue,
revenue_rank
FROM ranked
WHERE revenue_rank = 2;
Result (1 rows)
| product_name | category | total_revenue | revenue_rank |
|---|---|---|---|
| Razer Blade 16 Silver | Gaming Laptop | 907,210,500 | 2 |
2. Running Total ★★★
Find the monthly sales and year-to-date (YTD) for 2024.
Question frequency: ★★★ (very high) — Google, Meta frequently appears
| month | monthly_revenue | ytd_revenue |
|---|---|---|
| 2024-01 | ... | ... |
Hint 1: - SUM(monthly_revenue) OVER (ORDER BY month) = running total
- The basic frame of the window function is UNBOUNDED PRECEDING ~ CURRENT ROW
Answer
SELECT
SUBSTR(ordered_at, 1, 7) AS month,
CAST(SUM(total_amount) AS INTEGER) AS monthly_revenue,
CAST(SUM(SUM(total_amount)) OVER (ORDER BY SUBSTR(ordered_at, 1, 7)) AS INTEGER) AS ytd_revenue
FROM orders
WHERE ordered_at >= '2024-01-01' AND ordered_at < '2025-01-01'
AND status NOT IN ('cancelled')
GROUP BY SUBSTR(ordered_at, 1, 7)
ORDER BY month;
Result (top 7 of 12 rows)
| month | monthly_revenue | ytd_revenue |
|---|---|---|
| 2024-01 | 301,075,320 | 301,075,320 |
| 2024-02 | 426,177,449 | 727,252,769 |
| 2024-03 | 536,322,767 | 1,263,575,536 |
| 2024-04 | 470,154,081 | 1,733,729,617 |
| 2024-05 | 459,724,596 | 2,193,454,213 |
| 2024-06 | 377,040,302 | 2,570,494,515 |
| 2024-07 | 363,944,597 | 2,934,439,112 |
3. Identifying duplicate data ★★★
**Find cases where the same customer ordered the same product on the same day (suspected duplicates). Only the most recent order is marked as valid and the rest are flagged as duplicates.
Question frequency: ★★★ (very high) — LeetCode #196 variant, Kakao
| order_id | customer_name | product_name | ordered_at | is_duplicate |
|---|---|---|---|---|
Hint 1: - ROW_NUMBER() OVER (PARTITION BY customer_id, product_id, DATE(ordered_at) ORDER BY ordered_at DESC)
- Valid if rn = 1, duplicate if rn > 1
Answer
WITH order_detail AS (
SELECT
o.id AS order_id,
o.customer_id,
oi.product_id,
o.ordered_at,
ROW_NUMBER() OVER (
PARTITION BY o.customer_id, oi.product_id, DATE(o.ordered_at)
ORDER BY o.ordered_at DESC
) AS rn
FROM orders AS o
JOIN order_items AS oi ON o.id = oi.order_id
WHERE o.status NOT IN ('cancelled')
)
SELECT
od.order_id,
c.name AS customer_name,
p.name AS product_name,
od.ordered_at,
CASE WHEN od.rn > 1 THEN 'Y' ELSE 'N' END AS is_duplicate
FROM order_detail AS od
JOIN customers AS c ON od.customer_id = c.id
JOIN products AS p ON od.product_id = p.id
WHERE od.rn <= 2
ORDER BY od.customer_id, od.product_id, DATE(od.ordered_at), od.rn;
Result (top 7 of 86,611 rows)
| order_id | customer_name | product_name | ordered_at | is_duplicate |
|---|---|---|---|---|
| 1049 | Danny Johnson | Razer Blade 18 Black | 2017-12-04 15:52:09 | N |
| 31,251 | Danny Johnson | Razer Blade 18 Black | 2025-01-02 18:41:57 | N |
| 243 | Danny Johnson | MSI GeForce RTX 4070 Ti Super GAMING X | 2016-08-17 23:29:34 | N |
| 17,814 | Danny Johnson | MSI GeForce RTX 4070 Ti Super GAMING X | 2022-07-18 12:29:51 | N |
| 21,134 | Danny Johnson | MSI GeForce RTX 4070 Ti Super GAMING X | 2023-03-04 08:54:35 | N |
| 5736 | Danny Johnson | Dell U2724D | 2020-03-09 16:09:46 | N |
| 236 | Danny Johnson | G.SKILL Trident Z5 DDR5 64GB 6000MHz ... | 2016-08-19 22:29:34 | N |
4. Finding the median ★★☆
Find the median of the order amount for each customer.
SQLite does not have a MEDIAN function, so it is implemented as a window function.
Question frequency: ★★☆ — Google, Amazon
| customer_name | order_count | median_amount |
|---|---|---|
| ... | ... | ... |
Hint 1: - Ranking as ROW_NUMBER(), total number of cases as COUNT(*) OVER()
- Median = (n+1)/2nd if the total number of cases is odd, average of n/2 and n/2+1 if the total number of cases is even.
- Extract center row with WHERE rn IN (cnt/2, cnt/2+1, (cnt+1)/2)
Answer
WITH numbered AS (
SELECT
customer_id,
total_amount,
ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY total_amount) AS rn,
COUNT(*) OVER (PARTITION BY customer_id) AS cnt
FROM orders
WHERE status NOT IN ('cancelled')
),
median_rows AS (
SELECT
customer_id,
ROUND(AVG(total_amount), 0) AS median_amount,
MAX(cnt) AS order_count
FROM numbered
WHERE rn IN ((cnt + 1) / 2, (cnt + 2) / 2)
GROUP BY customer_id
)
SELECT
c.name AS customer_name,
m.order_count,
m.median_amount
FROM median_rows AS m
JOIN customers AS c ON m.customer_id = c.id
WHERE m.order_count >= 5
ORDER BY m.median_amount DESC
LIMIT 20;
Result (top 7 of 20 rows)
| customer_name | order_count | median_amount |
|---|---|---|
| Johnny Gutierrez | 5 | 3,162,000.00 |
| John Brown | 5 | 2,741,600.00 |
| Abigail Richardson | 5 | 2,461,800.00 |
| Joanne Jones | 12 | 2,403,150.00 |
| Jeremy Ellis | 6 | 2,282,947.00 |
| Margaret Kirk | 5 | 2,281,800.00 |
| Corey Carroll | 9 | 2,279,500.00 |
5. Login for N consecutive days (Islands) ★★★
Consider product_views as your login log, and find customers active for more than 3 consecutive days.
Question frequency: ★★★ (very high) — Meta, Coupang, LeetCode #180 variant
| customer_name | streak_days | streak_start | streak_end |
|---|---|---|---|
Hint 1: - DATE(viewed_at) Detect groups of consecutive dates after removing duplicates
- DATE(viewed_at, '-' || (ROW_NUMBER()-1) || ' days') → Same value if in the same group
- HAVING COUNT(*) >= 3
Answer
WITH active_days AS (
SELECT DISTINCT
customer_id,
DATE(viewed_at) AS active_date
FROM product_views
),
grouped AS (
SELECT
customer_id,
active_date,
DATE(active_date, '-' || (ROW_NUMBER() OVER (
PARTITION BY customer_id ORDER BY active_date
) - 1) || ' days') AS grp
FROM active_days
),
streaks AS (
SELECT
customer_id,
COUNT(*) AS streak_days,
MIN(active_date) AS streak_start,
MAX(active_date) AS streak_end
FROM grouped
GROUP BY customer_id, grp
HAVING COUNT(*) >= 3
)
SELECT
c.name AS customer_name,
s.streak_days,
s.streak_start,
s.streak_end
FROM streaks AS s
JOIN customers AS c ON s.customer_id = c.id
ORDER BY s.streak_days DESC, s.streak_start
LIMIT 20;
Result (top 7 of 20 rows)
| customer_name | streak_days | streak_start | streak_end |
|---|---|---|---|
| Austin Townsend | 46 | 2020-02-28 | 2020-04-13 |
| Gabriel Walters | 40 | 2016-01-09 | 2016-02-17 |
| James Banks | 39 | 2020-05-02 | 2020-06-09 |
| Jason Rivera | 27 | 2020-02-28 | 2020-03-25 |
| John Maldonado | 26 | 2019-03-13 | 2019-04-07 |
| Adam Moore | 25 | 2016-03-05 | 2016-03-29 |
| Gabriel Walters | 25 | 2016-04-18 | 2016-05-12 |
6. Top-N by category (ranking within group) ★★★
Extract the two products with the highest review ratings from each category. Only products with 10 or more reviews are eligible. In case of a tie, the product with the most reviews will take precedence.
Question frequency: ★★★ (very high) — Amazon, Naver, Kakao
| category | product_name | avg_rating | review_count | rank |
|---|---|---|---|---|
Hint 1: - Filter number of reviews by HAVING COUNT(*) >= 10
- ROW_NUMBER() OVER (PARTITION BY category ORDER BY avg_rating DESC, review_count DESC)
- WHERE rn <= 2
Answer
WITH product_ratings AS (
SELECT
cat.name AS category,
p.name AS product_name,
ROUND(AVG(r.rating), 2) AS avg_rating,
COUNT(*) AS review_count
FROM reviews AS r
JOIN products AS p ON r.product_id = p.id
JOIN categories AS cat ON p.category_id = cat.id
GROUP BY cat.name, p.id, p.name
HAVING COUNT(*) >= 10
),
ranked AS (
SELECT *,
ROW_NUMBER() OVER (
PARTITION BY category
ORDER BY avg_rating DESC, review_count DESC
) AS rn
FROM product_ratings
)
SELECT category, product_name, avg_rating, review_count, rn AS rank
FROM ranked
WHERE rn <= 2
ORDER BY category, rn;
Result (top 7 of 70 rows)
| category | product_name | avg_rating | review_count | rank |
|---|---|---|---|---|
| 2-in-1 | HP Pavilion x360 14 Black | 3.79 | 28 | 1 |
| 2-in-1 | Lenovo IdeaPad Flex 5 | 3.75 | 12 | 2 |
| AMD | MSI Radeon RX 9070 VENTUS 3X White | 4.08 | 40 | 1 |
| AMD | AMD Ryzen 9 9900X | 4.08 | 13 | 2 |
| AMD Socket | MSI MAG X870E TOMAHAWK WIFI White | 4.06 | 32 | 1 |
| AMD Socket | ASRock B850M Pro RS Silver | 4.04 | 25 | 2 |
| Air Cooling | Arctic Freezer 36 A-RGB White | 4.02 | 59 | 1 |
7. Year-on-year (YoY) growth rate ★★★
Find the year-over-year growth rate (YoY %) of quarterly sales. It covers data from 2023 to 2025.
Frequency: ★★★ (Very High) -- Google, Meta, Coupang
| year | quarter | revenue | prev_year_revenue | yoy_growth_pct |
|---|---|---|---|---|
Hint 1: - Refer to sales in the same quarter of the previous year as LAG(revenue, 4) OVER (ORDER BY year, quarter) — since quarters are 1 to 4, 4 transitions are the same quarter of the previous year.
- or LAG(revenue, 1) OVER (PARTITION BY quarter ORDER BY year)
Answer
WITH quarterly AS (
SELECT
CAST(SUBSTR(ordered_at, 1, 4) AS INTEGER) AS year,
CASE
WHEN CAST(SUBSTR(ordered_at, 6, 2) AS INTEGER) <= 3 THEN 1
WHEN CAST(SUBSTR(ordered_at, 6, 2) AS INTEGER) <= 6 THEN 2
WHEN CAST(SUBSTR(ordered_at, 6, 2) AS INTEGER) <= 9 THEN 3
ELSE 4
END AS quarter,
SUM(total_amount) AS revenue
FROM orders
WHERE status NOT IN ('cancelled')
AND ordered_at >= '2023-01-01'
GROUP BY 1, 2
)
SELECT
year,
quarter,
CAST(revenue AS INTEGER) AS revenue,
CAST(LAG(revenue) OVER (PARTITION BY quarter ORDER BY year) AS INTEGER) AS prev_year_revenue,
ROUND(100.0 * (revenue - LAG(revenue) OVER (PARTITION BY quarter ORDER BY year))
/ LAG(revenue) OVER (PARTITION BY quarter ORDER BY year), 1) AS yoy_growth_pct
FROM quarterly
ORDER BY year, quarter;
Result (top 7 of 12 rows)
| year | quarter | revenue | prev_year_revenue | yoy_growth_pct |
|---|---|---|---|---|
| 2023 | 1 | 1,112,502,647 | NULL | NULL |
| 2023 | 2 | 1,075,865,258 | NULL | NULL |
| 2023 | 3 | 1,162,362,566 | NULL | NULL |
| 2023 | 4 | 1,464,300,253 | NULL | NULL |
| 2024 | 1 | 1,263,575,536 | 1,112,502,647 | 13.60 |
| 2024 | 2 | 1,306,918,979 | 1,075,865,258 | 21.50 |
| 2024 | 3 | 1,340,721,817 | 1,162,362,566 | 15.30 |
8. Moving Average ★★☆
Find the 7-day moving average of daily sales for 2024. The moving average is the average of the previous 7 days including the current day.
Question frequency: ★★☆ — Google, Naver
| order_date | daily_revenue | ma_7d |
|---|---|---|
Hint 1: - AVG(daily_revenue) OVER (ORDER BY order_date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW)
- LEFT JOIN with calendar to include days without orders as 0
Answer
WITH daily AS (
SELECT
cal.date_key AS order_date,
COALESCE(SUM(o.total_amount), 0) AS daily_revenue
FROM calendar AS cal
LEFT JOIN orders AS o
ON DATE(o.ordered_at) = cal.date_key
AND o.status NOT IN ('cancelled')
WHERE cal.date_key BETWEEN '2024-01-01' AND '2024-12-31'
GROUP BY cal.date_key
)
SELECT
order_date,
CAST(daily_revenue AS INTEGER) AS daily_revenue,
CAST(AVG(daily_revenue) OVER (
ORDER BY order_date
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
) AS INTEGER) AS ma_7d
FROM daily
ORDER BY order_date;
Result (top 7 of 366 rows)
| order_date | daily_revenue | ma_7d |
|---|---|---|
| 2024-01-01 | 7,732,372 | 7,732,372 |
| 2024-01-02 | 14,807,975 | 11,270,173 |
| 2024-01-03 | 2,825,828 | 8,455,391 |
| 2024-01-04 | 4,332,900 | 7,424,768 |
| 2024-01-05 | 8,083,504 | 7,556,515 |
| 2024-01-06 | 9,182,200 | 7,827,463 |
| 2024-01-07 | 23,522,036 | 10,069,545 |
9. Percentile ★★☆
Find the values where the total purchase amount for each customer falls within the top 10%, 25%, 50% (median), 75%, and 90% boundaries.
Question frequency: ★★☆ — Amazon, Google
| percentile | threshold_amount |
|---|---|
| 10 | ... |
| 25 | ... |
Hint 1: - Assign percentile group to NTILE(100) OVER (ORDER BY total_spent)
- Each percentile boundary value: MAX value in WHERE percentile_group IN (10, 25, 50, 75, 90)
Answer
WITH customer_spent AS (
SELECT
customer_id,
SUM(total_amount) AS total_spent
FROM orders
WHERE status NOT IN ('cancelled')
GROUP BY customer_id
),
with_percentile AS (
SELECT
total_spent,
NTILE(100) OVER (ORDER BY total_spent) AS pctl
FROM customer_spent
)
SELECT
pctl AS percentile,
CAST(MAX(total_spent) AS INTEGER) AS threshold_amount
FROM with_percentile
WHERE pctl IN (10, 25, 50, 75, 90)
GROUP BY pctl
ORDER BY pctl;
Result (5 rows)
| percentile | threshold_amount |
|---|---|
| 10 | 180,500 |
| 25 | 1,183,200 |
| 50 | 4,654,232 |
| 75 | 13,607,591 |
| 90 | 31,606,150 |
10. Change rate compared to previous day + 7-day moving change rate ★★☆
Percent change from previous day (DoD %) for the number of daily orders in December 2024 Find the change rate (WoW %) compared to 7 days ago at the same time.
Question frequency: ★★☆ — Meta, Coupang
| order_date | order_count | prev_day | dod_pct | prev_week | wow_pct |
|---|---|---|---|---|---|
Hint 1: - LAG(order_count, 1) = the previous day, LAG(order_count, 7) = 7 days ago
- Rate of change = (current_day - previous) / previous * 100
- calendar Includes days missed due to LEFT JOIN
Answer
WITH daily AS (
SELECT
cal.date_key AS order_date,
COALESCE(COUNT(o.id), 0) AS order_count
FROM calendar AS cal
LEFT JOIN orders AS o
ON DATE(o.ordered_at) = cal.date_key
AND o.status NOT IN ('cancelled')
WHERE cal.date_key BETWEEN '2024-12-01' AND '2024-12-31'
GROUP BY cal.date_key
)
SELECT
order_date,
order_count,
LAG(order_count, 1) OVER (ORDER BY order_date) AS prev_day,
ROUND(100.0 * (order_count - LAG(order_count, 1) OVER (ORDER BY order_date))
/ NULLIF(LAG(order_count, 1) OVER (ORDER BY order_date), 0), 1) AS dod_pct,
LAG(order_count, 7) OVER (ORDER BY order_date) AS prev_week,
ROUND(100.0 * (order_count - LAG(order_count, 7) OVER (ORDER BY order_date))
/ NULLIF(LAG(order_count, 7) OVER (ORDER BY order_date), 0), 1) AS wow_pct
FROM daily
ORDER BY order_date;
Result (top 7 of 31 rows)
| order_date | order_count | prev_day | dod_pct | prev_week | wow_pct |
|---|---|---|---|---|---|
| 2024-12-01 | 17 | NULL | NULL | NULL | NULL |
| 2024-12-02 | 14 | 17 | -17.60 | NULL | NULL |
| 2024-12-03 | 15 | 14 | 7.10 | NULL | NULL |
| 2024-12-04 | 14 | 15 | -6.70 | NULL | NULL |
| 2024-12-05 | 9 | 14 | -35.70 | NULL | NULL |
| 2024-12-06 | 14 | 9 | 55.60 | NULL | NULL |
| 2024-12-07 | 15 | 14 | 7.10 | NULL | NULL |
11. Recursive CTE: Organizational Tree ★★★
Print the full organizational hierarchy using manager_id in the staff table.
Include depth, supervisor name, and full path (CEO > ... > yourself).
Question frequency: ★★★ (very high) — Amazon, Kakao, Naver
| name | department | role | depth | manager_name | path |
|---|---|---|---|---|---|
| CEO Name | management | admin | 0 | NULL | CEO Name |
| ... | ... | ... | 1 | CEO Name | CEO Name > ... |
Hint 1: - Recursive CTE: WHERE manager_id IS NULL is the anchor (root)
- Join with s.manager_id = tree.id in the recursive part
- Accumulate paths with path || ' > ' || s.name
Answer
WITH RECURSIVE org_tree AS (
-- 앵커: 최상위 관리자 (manager_id IS NULL)
SELECT
s.id,
s.name,
s.department,
s.role,
0 AS depth,
CAST(NULL AS TEXT) AS manager_name,
s.name AS path
FROM staff AS s
WHERE s.manager_id IS NULL
UNION ALL
-- 재귀: 하위 직원
SELECT
s.id,
s.name,
s.department,
s.role,
t.depth + 1,
t.name AS manager_name,
t.path || ' > ' || s.name
FROM staff AS s
JOIN org_tree AS t ON s.manager_id = t.id
)
SELECT name, department, role, depth, manager_name, path
FROM org_tree
ORDER BY path;
Result (5 rows)
| name | department | role | depth | manager_name | path |
|---|---|---|---|---|---|
| Michael Thomas | Management | admin | 0 | NULL | Michael Thomas |
| Jaime Phelps | Sales | manager | 1 | Michael Thomas | Michael Thomas > Jaime Phelps |
| Jonathan Smith | Management | admin | 1 | Michael Thomas | Michael Thomas > Jonathan Smith |
| Nicole Hamilton | Marketing | manager | 2 | Jonathan Smith | Michael Thomas > Jonathan Smith > Nic... |
| Michael Mcguire | Management | admin | 1 | Michael Thomas | Michael Thomas > Michael Mcguire |
12. Recursive CTE: Creating date sequences ★★☆
Generate all dates in December 2024 as recursive CTEs, Find the number of orders and sales for each date. (Days without orders are also displayed as 0)
Question frequency: ★★☆ — Google (in an environment without a calendar table)
| dt | order_count | revenue |
|---|---|---|
| 2024-12-01 | ... | ... |
Hint 1: - Anchor: SELECT '2024-12-01' AS dt
- Recursion: SELECT DATE(dt, '+1 day') FROM dates WHERE dt < '2024-12-31'
- Combine order data with LEFT JOIN orders
Answer
WITH RECURSIVE dates AS (
SELECT '2024-12-01' AS dt
UNION ALL
SELECT DATE(dt, '+1 day')
FROM dates
WHERE dt < '2024-12-31'
)
SELECT
d.dt,
COUNT(o.id) AS order_count,
COALESCE(CAST(SUM(o.total_amount) AS INTEGER), 0) AS revenue
FROM dates AS d
LEFT JOIN orders AS o
ON DATE(o.ordered_at) = d.dt
AND o.status NOT IN ('cancelled')
GROUP BY d.dt
ORDER BY d.dt;
Result (top 7 of 31 rows)
| dt | order_count | revenue |
|---|---|---|
| 2024-12-01 | 17 | 12,081,245 |
| 2024-12-02 | 14 | 12,578,657 |
| 2024-12-03 | 15 | 11,867,860 |
| 2024-12-04 | 14 | 11,198,303 |
| 2024-12-05 | 9 | 5,489,585 |
| 2024-12-06 | 14 | 16,160,600 |
| 2024-12-07 | 15 | 16,802,502 |
13. Cohort analysis (repurchase by subscription month) ★★★
Based on the customer's subscription month (cohort), find the percentage of customers purchasing from 0 to 3 months after subscription. This applies to customers signing up in 2024.
Question frequency: ★★★ (very high) — Meta, Coupang, Naver
| cohort | size | m0_pct | m1_pct | m2_pct | m3_pct |
|---|---|---|---|---|---|
Hint 1: - Cohort: SUBSTR(created_at, 1, 7)
- Month Offset: Convert (julianday(order_month-01) - julianday(signup_month-01)) / 30 to integer
- COUNT(DISTINCT CASE WHEN offset = N THEN customer_id END) / cohort size
Answer
WITH cohort AS (
SELECT
id AS customer_id,
SUBSTR(created_at, 1, 7) AS cohort_month
FROM customers
WHERE created_at >= '2024-01-01' AND created_at < '2025-01-01'
),
order_month AS (
SELECT DISTINCT
co.customer_id,
co.cohort_month,
CAST(
(julianday(SUBSTR(o.ordered_at, 1, 7) || '-01')
- julianday(co.cohort_month || '-01')) / 30
AS INTEGER) AS month_offset
FROM cohort AS co
JOIN orders AS o ON co.customer_id = o.customer_id
WHERE o.status NOT IN ('cancelled')
)
SELECT
c.cohort_month AS cohort,
COUNT(DISTINCT c.customer_id) AS size,
ROUND(100.0 * COUNT(DISTINCT CASE WHEN om.month_offset = 0 THEN om.customer_id END)
/ COUNT(DISTINCT c.customer_id), 1) AS m0_pct,
ROUND(100.0 * COUNT(DISTINCT CASE WHEN om.month_offset = 1 THEN om.customer_id END)
/ COUNT(DISTINCT c.customer_id), 1) AS m1_pct,
ROUND(100.0 * COUNT(DISTINCT CASE WHEN om.month_offset = 2 THEN om.customer_id END)
/ COUNT(DISTINCT c.customer_id), 1) AS m2_pct,
ROUND(100.0 * COUNT(DISTINCT CASE WHEN om.month_offset = 3 THEN om.customer_id END)
/ COUNT(DISTINCT c.customer_id), 1) AS m3_pct
FROM cohort AS c
LEFT JOIN order_month AS om ON c.customer_id = om.customer_id
AND c.cohort_month = om.cohort_month
GROUP BY c.cohort_month
ORDER BY c.cohort_month;
Result (top 7 of 12 rows)
| cohort | size | m0_pct | m1_pct | m2_pct | m3_pct |
|---|---|---|---|---|---|
| 2024-01 | 52 | 11.50 | 5.80 | 9.60 | 7.70 |
| 2024-02 | 48 | 18.80 | 0.0 | 6.30 | 10.40 |
| 2024-03 | 71 | 16.90 | 14.10 | 15.50 | 2.80 |
| 2024-04 | 53 | 5.70 | 7.50 | 7.50 | 3.80 |
| 2024-05 | 43 | 9.30 | 9.30 | 16.30 | 4.70 |
| 2024-06 | 68 | 10.30 | 5.90 | 5.90 | 8.80 |
| 2024-07 | 62 | 8.10 | 9.70 | 11.30 | 11.30 |
14. Category hierarchy aggregation (Recursive + GROUP BY) ★★☆
Find the total sales by category from the categories tree.
Roll up all sales from the lower category (small/medium) to the upper category (large).
Question frequency: ★★☆ — Amazon, Kakao
| top_category | sub_category_count | product_count | total_revenue |
|---|---|---|---|
Hint 1: - Find the root (depth=0) ancestor of each category with recursive CTE
- After recursive search, base on root GROUP BY
Answer
WITH RECURSIVE cat_tree AS (
SELECT id, id AS root_id, name AS root_name, depth
FROM categories
WHERE parent_id IS NULL
UNION ALL
SELECT c.id, ct.root_id, ct.root_name, c.depth
FROM categories AS c
JOIN cat_tree AS ct ON c.parent_id = ct.id
)
SELECT
ct.root_name AS top_category,
COUNT(DISTINCT CASE WHEN ct.depth > 0 THEN ct.id END) AS sub_category_count,
COUNT(DISTINCT p.id) AS product_count,
COALESCE(CAST(SUM(oi.quantity * oi.unit_price) AS INTEGER), 0) AS total_revenue
FROM cat_tree AS ct
LEFT JOIN products AS p ON p.category_id = ct.id
LEFT JOIN order_items AS oi ON oi.product_id = p.id
LEFT JOIN orders AS o ON oi.order_id = o.id
AND o.status NOT IN ('cancelled', 'returned', 'return_requested')
WHERE ct.root_id = ct.root_id
GROUP BY ct.root_id, ct.root_name
ORDER BY total_revenue DESC;
Result (top 7 of 18 rows)
| top_category | sub_category_count | product_count | total_revenue |
|---|---|---|---|
| Laptop | 4 | 29 | 10,144,187,100 |
| Graphics Card | 2 | 15 | 5,608,961,100 |
| Monitor | 3 | 22 | 4,753,611,200 |
| Motherboard | 2 | 23 | 3,255,592,700 |
| CPU | 2 | 7 | 1,877,389,800 |
| Speakers/Headsets | 0 | 12 | 1,561,393,600 |
| Storage | 3 | 15 | 1,524,801,600 |
15. Self-Join: Employee with higher salary (number of sales processed) than boss ★★★
Find the number of orders processed by each employee, and find the employee who processed more orders than his/her boss. (based on orders.staff_id)
Question frequency: ★★★ (very high) — LeetCode #181 variant, Kakao
| staff_name | department | handled_orders | manager_name | manager_orders |
|---|---|---|---|---|
Hint 1: - staff AS s JOIN staff AS m ON s.manager_id = m.id (Self-Join)
- Number of orders for each employee: counted as orders.staff_id
- WHERE s_count > m_count
Answer
WITH staff_orders AS (
SELECT
s.id,
s.name,
s.department,
s.manager_id,
COUNT(o.id) AS handled_orders
FROM staff AS s
LEFT JOIN orders AS o ON s.id = o.staff_id
GROUP BY s.id, s.name, s.department, s.manager_id
)
SELECT
emp.name AS staff_name,
emp.department,
emp.handled_orders,
mgr.name AS manager_name,
mgr.handled_orders AS manager_orders
FROM staff_orders AS emp
JOIN staff_orders AS mgr ON emp.manager_id = mgr.id
WHERE emp.handled_orders > mgr.handled_orders
ORDER BY emp.handled_orders DESC;
16. Multi-level analysis: Measuring discount effect ★★☆
Compare customers who used the coupon with those who did not and find: (1) Average order amount by group, (2) Repurchase rate, (3) Average review rating.
Question frequency: ★★☆ — Coupang, Naver
| segment | customer_count | avg_order_value | repeat_rate_pct | avg_rating |
|---|---|---|---|---|
Hint 1: - Coupon usage: Classify segments by presence of coupon_usage
- Repurchase rate: Customers who ordered more than 2 items / All customers
- Aggregate each of the three tables (orders, coupon_usage, reviews) into CTE and then merge them.
Answer
WITH coupon_customers AS (
SELECT DISTINCT customer_id FROM coupon_usage
),
customer_segment AS (
SELECT
c.id AS customer_id,
CASE WHEN cc.customer_id IS NOT NULL THEN 'Coupon User' ELSE 'Non-Coupon' END AS segment
FROM customers AS c
LEFT JOIN coupon_customers AS cc ON c.id = cc.customer_id
WHERE c.is_active = 1
),
order_stats AS (
SELECT
cs.segment,
cs.customer_id,
COUNT(o.id) AS order_count,
AVG(o.total_amount) AS avg_order_value
FROM customer_segment AS cs
LEFT JOIN orders AS o ON cs.customer_id = o.customer_id
AND o.status NOT IN ('cancelled')
GROUP BY cs.segment, cs.customer_id
),
review_stats AS (
SELECT
cs.segment,
ROUND(AVG(r.rating), 2) AS avg_rating
FROM customer_segment AS cs
JOIN reviews AS r ON cs.customer_id = r.customer_id
GROUP BY cs.segment
)
SELECT
os.segment,
COUNT(DISTINCT os.customer_id) AS customer_count,
ROUND(AVG(os.avg_order_value), 0) AS avg_order_value,
ROUND(100.0 * SUM(CASE WHEN os.order_count >= 2 THEN 1 ELSE 0 END)
/ COUNT(DISTINCT os.customer_id), 1) AS repeat_rate_pct,
rs.avg_rating
FROM order_stats AS os
LEFT JOIN review_stats AS rs ON os.segment = rs.segment
GROUP BY os.segment, rs.avg_rating
ORDER BY os.segment;
Result (2 rows)
| segment | customer_count | avg_order_value | repeat_rate_pct | avg_rating |
|---|---|---|---|---|
| Coupon User | 888 | 1,030,689.00 | 97.20 | 3.89 |
| Non-Coupon | 2772 | 843,056.00 | 52.50 | 3.93 |
17. Data quality check: NULL/outlier detection ★★☆
Report the following data quality issues with one query: (1) Order amount is 0 or less, (2) Delivery date < Order date, (3) Outside review rating range, (4) Order on a future date.
Question frequency: ★★☆ — Amazon (Data Engineer), Kakao
| issue_type | table_name | record_count | sample_ids |
|---|---|---|---|
Hint 1: - Combine the results of each quality check with UNION ALL
- List sample ID as GROUP_CONCAT(id, ',')
- Future date: ordered_at > '2025-12-31'
Answer
SELECT 'Zero/Negative Amount' AS issue_type,
'orders' AS table_name,
COUNT(*) AS record_count,
GROUP_CONCAT(id, ',') AS sample_ids
FROM orders
WHERE total_amount <= 0
UNION ALL
SELECT 'Delivery Before Shipment',
'shipping',
COUNT(*),
GROUP_CONCAT(id, ',')
FROM shipping
WHERE delivered_at IS NOT NULL
AND shipped_at IS NOT NULL
AND delivered_at < shipped_at
UNION ALL
SELECT 'Rating Out of Range',
'reviews',
COUNT(*),
GROUP_CONCAT(id, ',')
FROM reviews
WHERE rating < 1 OR rating > 5
UNION ALL
SELECT 'Future Order Date',
'orders',
COUNT(*),
GROUP_CONCAT(id, ',')
FROM orders
WHERE ordered_at > DATE('now', '+1 day');
Result (4 rows)
| issue_type | table_name | record_count | sample_ids |
|---|---|---|---|
| Zero/Negative Amount | orders | 0 | NULL |
| Delivery Before Shipment | shipping | 0 | NULL |
| Rating Out of Range | reviews | 0 | NULL |
| Future Order Date | orders | 0 | NULL |
18. Time series anomaly detection: 3-sigma rule ★☆☆
Find outlier days that fall outside the average +/- 3 standard deviations of your daily sales.
Question frequency: ★☆☆ — Google (Data Science)
| order_date | daily_revenue | avg_revenue | stddev | z_score |
|---|---|---|---|---|
Hint 1: - Standard deviation: Manual calculation as it is not directly supported by SQLite
- SQRT(AVG(x*x) - AVG(x)*AVG(x)) = population standard deviation
- If Z-score = (value - mean) / stddev, ABS(z) > 3, it is an outlier.
Answer
WITH daily AS (
SELECT
DATE(ordered_at) AS order_date,
SUM(total_amount) AS daily_revenue
FROM orders
WHERE status NOT IN ('cancelled')
GROUP BY DATE(ordered_at)
),
stats AS (
SELECT
AVG(daily_revenue) AS avg_rev,
SQRT(AVG(daily_revenue * daily_revenue) - AVG(daily_revenue) * AVG(daily_revenue)) AS stddev_rev
FROM daily
)
SELECT
d.order_date,
CAST(d.daily_revenue AS INTEGER) AS daily_revenue,
CAST(s.avg_rev AS INTEGER) AS avg_revenue,
CAST(s.stddev_rev AS INTEGER) AS stddev,
ROUND((d.daily_revenue - s.avg_rev) / s.stddev_rev, 2) AS z_score
FROM daily AS d
CROSS JOIN stats AS s
WHERE ABS((d.daily_revenue - s.avg_rev) / s.stddev_rev) > 3
ORDER BY ABS((d.daily_revenue - s.avg_rev) / s.stddev_rev) DESC;
Result (top 7 of 48 rows)
| order_date | daily_revenue | avg_revenue | stddev | z_score |
|---|---|---|---|---|
| 2025-12-18 | 62,329,008 | 10,077,964 | 8,237,014 | 6.34 |
| 2025-03-05 | 62,266,624 | 10,077,964 | 8,237,014 | 6.34 |
| 2020-11-21 | 60,470,134 | 10,077,964 | 8,237,014 | 6.12 |
| 2025-12-21 | 56,392,485 | 10,077,964 | 8,237,014 | 5.62 |
| 2020-02-09 | 51,620,600 | 10,077,964 | 8,237,014 | 5.04 |
| 2025-05-06 | 51,138,521 | 10,077,964 | 8,237,014 | 4.98 |
| 2022-01-06 | 50,993,500 | 10,077,964 | 8,237,014 | 4.97 |
19. Complex analysis: RFM segmentation ★★★
Segment your customers by RFM(Recency, Frequency, Monetary). Divide each indicator into a scale of 1 to 5, and calculate the number of customers and average sales for each segment.
Question frequency: ★★★ (very high) — Coupang, Naver, Amazon
| rfm_segment | r_score | f_score | m_score | customer_count | avg_monetary |
|---|---|---|---|---|---|
| Champions | 5 | 5 | 5 | ... | ... |
Hint 1: - Recency: Days since last order → NTILE(5) (the more recent, the higher the score)
- Frequency: Number of orders → NTILE(5)
- Monetary: Total purchase amount → NTILE(5)
- Segment classification: R+F+M sum is 13~15=Champions, 10~12=Loyal, 7~9=Potential, 4~6=AtRisk, 3=Lost
Answer
WITH rfm_raw AS (
SELECT
customer_id,
CAST(julianday('2025-06-30') - julianday(MAX(ordered_at)) AS INTEGER) AS recency,
COUNT(*) AS frequency,
CAST(SUM(total_amount) AS INTEGER) AS monetary
FROM orders
WHERE status NOT IN ('cancelled')
GROUP BY customer_id
),
rfm_scored AS (
SELECT
customer_id,
recency, frequency, monetary,
NTILE(5) OVER (ORDER BY recency DESC) AS r_score,
NTILE(5) OVER (ORDER BY frequency ASC) AS f_score,
NTILE(5) OVER (ORDER BY monetary ASC) AS m_score
FROM rfm_raw
),
rfm_segmented AS (
SELECT *,
r_score + f_score + m_score AS rfm_total,
CASE
WHEN r_score + f_score + m_score >= 13 THEN 'Champions'
WHEN r_score + f_score + m_score >= 10 THEN 'Loyal'
WHEN r_score + f_score + m_score >= 7 THEN 'Potential'
WHEN r_score + f_score + m_score >= 4 THEN 'At Risk'
ELSE 'Lost'
END AS rfm_segment
FROM rfm_scored
)
SELECT
rfm_segment,
ROUND(AVG(r_score), 1) AS r_score,
ROUND(AVG(f_score), 1) AS f_score,
ROUND(AVG(m_score), 1) AS m_score,
COUNT(*) AS customer_count,
CAST(AVG(monetary) AS INTEGER) AS avg_monetary
FROM rfm_segmented
GROUP BY rfm_segment
ORDER BY AVG(rfm_total) DESC;
Result (5 rows)
| rfm_segment | r_score | f_score | m_score | customer_count | avg_monetary |
|---|---|---|---|---|---|
| Champions | 4.40 | 4.80 | 4.80 | 579 | 42,854,633 |
| Loyal | 3.30 | 3.80 | 3.80 | 655 | 11,437,042 |
| Potential | 2.80 | 2.60 | 2.60 | 788 | 4,077,164 |
| At Risk | 2.10 | 1.50 | 1.50 | 653 | 1,071,866 |
| Lost | 1.00 | 1.00 | 1.00 | 134 | 175,595 |
20. Comprehensive scenario: Marketing campaign effectiveness analysis ★★★
Break down the following by promotions in 2024: (1) Number of participating customers, (2) Sales during promotion period vs. sales during the previous same period (incremental effect), (3) Repurchase rate for 30 days after promotion ends; (4) Customer acquisition cost (total discount/number of new customers).
Question frequency: ★★★ (very high) — Coupang, Naver, Meta (comprehensive questions)
| promo_name | promo_type | participants | promo_revenue | pre_revenue | lift_pct | post_repurchase_pct | cac |
|---|---|---|---|---|---|---|---|
Hint 1: - Set period to started_at/ended_at in promotions
- Previous same period: DATE(started_at, '-' || (julianday(ended_at)-julianday(started_at)) || ' days')
- New Customer: Customer who placed their first order during the promotion period.
- Separated into CTE levels 4-5
Answer
WITH promo_periods AS (
SELECT
pr.id AS promo_id,
pr.name AS promo_name,
pr.type AS promo_type,
pr.started_at,
pr.ended_at,
CAST(julianday(pr.ended_at) - julianday(pr.started_at) AS INTEGER) AS duration_days
FROM promotions AS pr
WHERE pr.started_at >= '2024-01-01' AND pr.started_at < '2025-01-01'
),
promo_orders AS (
SELECT
pp.promo_id, pp.promo_name, pp.promo_type,
pp.started_at, pp.ended_at, pp.duration_days,
o.id AS order_id,
o.customer_id,
o.total_amount,
o.discount_amount
FROM promo_periods AS pp
JOIN orders AS o
ON o.ordered_at BETWEEN pp.started_at AND pp.ended_at
AND o.status NOT IN ('cancelled')
),
pre_period_revenue AS (
SELECT
pp.promo_id,
COALESCE(SUM(o.total_amount), 0) AS pre_revenue
FROM promo_periods AS pp
LEFT JOIN orders AS o
ON o.ordered_at BETWEEN DATE(pp.started_at, '-' || pp.duration_days || ' days')
AND DATE(pp.started_at, '-1 day')
AND o.status NOT IN ('cancelled')
GROUP BY pp.promo_id
),
post_repurchase AS (
SELECT
po.promo_id,
COUNT(DISTINCT CASE
WHEN EXISTS (
SELECT 1 FROM orders o2
WHERE o2.customer_id = po.customer_id
AND o2.ordered_at > po.ended_at
AND o2.ordered_at <= DATE(po.ended_at, '+30 days')
AND o2.status NOT IN ('cancelled')
)
THEN po.customer_id
END) AS repurchase_customers,
COUNT(DISTINCT po.customer_id) AS total_customers
FROM promo_orders AS po
GROUP BY po.promo_id
)
SELECT
po.promo_name,
po.promo_type,
COUNT(DISTINCT po.customer_id) AS participants,
CAST(SUM(po.total_amount) AS INTEGER) AS promo_revenue,
CAST(pr.pre_revenue AS INTEGER) AS pre_revenue,
ROUND(100.0 * (SUM(po.total_amount) - pr.pre_revenue)
/ NULLIF(pr.pre_revenue, 0), 1) AS lift_pct,
ROUND(100.0 * rp.repurchase_customers / NULLIF(rp.total_customers, 0), 1) AS post_repurchase_pct,
CAST(SUM(po.discount_amount) / NULLIF(
COUNT(DISTINCT CASE
WHEN NOT EXISTS (
SELECT 1 FROM orders o3
WHERE o3.customer_id = po.customer_id
AND o3.ordered_at < po.started_at
AND o3.status NOT IN ('cancelled')
)
THEN po.customer_id
END), 0) AS INTEGER) AS cac
FROM promo_orders AS po
JOIN pre_period_revenue AS pr ON po.promo_id = pr.promo_id
JOIN post_repurchase AS rp ON po.promo_id = rp.promo_id
GROUP BY po.promo_id, po.promo_name, po.promo_type,
pr.pre_revenue, rp.repurchase_customers, rp.total_customers
ORDER BY promo_revenue DESC;
Result (top 7 of 12 rows)
| promo_name | promo_type | participants | promo_revenue | pre_revenue | lift_pct | post_repurchase_pct | cac |
|---|---|---|---|---|---|---|---|
| Back to School Laptop Sale 2024 | category | 317 | 359,327,629 | 356,716,760 | 0.7 | 32.50 | 74,055 |
| Autumn Gift Sale 2024 | seasonal | 171 | 231,954,685 | 142,169,738 | 63.20 | 36.80 | 155,025 |
| Spring Sale 2024 | seasonal | 226 | 212,486,725 | 217,467,681 | -2.30 | 35.80 | 78,385 |
| Year-End Thank You Sale 2024 | seasonal | 186 | 192,873,636 | 188,843,699 | 2.10 | 31.20 | 121,744 |
| Summer Cooling Festival 2024 | category | 168 | 191,169,469 | 178,340,805 | 7.20 | 26.20 | 120,630 |
| Printer Special Deal 2024 | category | 121 | 147,372,860 | 158,556,326 | -7.10 | 30.60 | 120,300 |
| New Year Sale 2024 | seasonal | 68 | 70,486,815 | 54,386,890 | 29.60 | 25.00 | 148,566 |