Complete SQL & DBMS Roadmap
1. Basic DBMS Concepts
What is a DBMS?
A Database Management System (DBMS) is software that enables users to create, read, update, and delete data in a database efficiently. It acts as an interface between the database and end-users or application programs.
Key Functions of DBMS:
- Data Definition: Define database structure (CREATE, ALTER, DROP)
- Data Manipulation: Insert, update, delete, retrieve data (CRUD operations)
- Data Security: User authentication, authorization, access control
- Data Integrity: Maintain accuracy and consistency through constraints
- Concurrency Control: Handle simultaneous data access by multiple users
- Backup & Recovery: Protect against data loss and system failures
- Data Independence: Separate physical storage from logical view
Popular DBMS: MySQL, PostgreSQL, Oracle, MongoDB, SQL Server, SQLite, Cassandra, Redis
SQL vs NoSQL: When to use each?
π SQL (Relational)
- Structure: Tables with fixed schema (rows & columns)
- Query Language: Structured Query Language (SQL)
- Scalability: Vertical (scale up - add more CPU/RAM)
- ACID: Strong consistency, transactions
- Relationships: Foreign keys, JOINs
- Best for: Complex queries, financial data, e-commerce
- Examples: MySQL, PostgreSQL, Oracle, MS SQL
π NoSQL (Non-Relational)
- Structure: Flexible schema (JSON, key-value, graph)
- Query Language: Database-specific APIs
- Scalability: Horizontal (scale out - add more servers)
- BASE: Eventual consistency, high availability
- Relationships: Embedded documents, no JOINs
- Best for: Big data, real-time apps, IoT, social media
- Examples: MongoDB, Cassandra, Redis, DynamoDB
Decision Guide:
Choose SQL when:
- You need complex joins and relationships between tables
- Data integrity and ACID transactions are critical (banking, payments)
- Your schema is well-defined and stable
- You need strong consistency
Choose NoSQL when:
- You need to scale horizontally across multiple servers
- Your schema is flexible or evolves frequently
- You prioritize speed and availability over strict consistency
- You're handling huge volumes of unstructured data
Primary Key, Foreign Key, Candidate Key, Super Key
1. Primary Key π
Uniquely identifies each record in a table. Cannot be NULL. Only ONE per table.
CREATE TABLE students (
student_id INT PRIMARY KEY,
name VARCHAR(100),
email VARCHAR(100)
);
2. Foreign Key π
Links two tables together. References the PRIMARY KEY of another table. Ensures referential integrity.
CREATE TABLE enrollments (
enrollment_id INT PRIMARY KEY,
student_id INT,
course_id INT,
FOREIGN KEY (student_id) REFERENCES students(student_id)
);
Benefit: If you try to delete a student who has enrollments, the database will prevent it (maintaining data integrity).
3. Candidate Key π―
Any column (or combination) that can uniquely identify a record. There can be multiple candidate keys. The Primary Key is chosen from candidate keys.
Example: In a Students table:
student_idβ Candidate Key (unique)emailβ Candidate Key (unique)phoneβ Candidate Key (if unique)
We choose student_id as PRIMARY KEY from these candidates.
4. Super Key π
Any combination of columns that uniquely identifies a record. A Candidate Key is a minimal Super Key (no unnecessary columns).
Example Super Keys:
- {student_id} β Minimal β
- {email} β Minimal β
- {student_id, name} β Has redundancy β
- {student_id, name, email} β Has redundancy β
Schema vs Instance in a Database
Schema (Blueprint) π
The logical design/structure of the database. Defines table names, column names, data types, constraints. Rarely changes.
CREATE TABLE users (
id INT,
name VARCHAR(50),
age INT
);
Instance (Data) π
The actual data stored at a particular moment. Changes frequently as records are inserted/updated/deleted.
(1, 'Alice', 25)
(2, 'Bob', 30)
(3, 'Charlie', 28)
Perfect Analogy:
Schema = House Blueprint (structure, layout)
Instance = Furniture Inside (actual content, changes often)
What are Constraints?
Constraints are rules applied to table columns to ensure data accuracy, consistency, and integrity. They prevent invalid data from being entered.
CREATE TABLE employees (
id INT PRIMARY KEY, -- NOT NULL + UNIQUE
name VARCHAR(100) NOT NULL, -- Must have value
email VARCHAR(100) UNIQUE, -- No duplicates
age INT CHECK (age >= 18), -- Must be 18+
salary DECIMAL DEFAULT 50000, -- Default if not provided
dept_id INT,
FOREIGN KEY (dept_id) REFERENCES departments(id)
);
Types of Constraints:
- NOT NULL: Column cannot have NULL value
- UNIQUE: All values in column must be different
- PRIMARY KEY: NOT NULL + UNIQUE combined (one per table)
- FOREIGN KEY: Links tables, ensures referential integrity
- CHECK: Ensures values meet a specific condition
- DEFAULT: Sets default value if none is provided
2. Normalization & Database Design
What is Normalization? Explain 1NF, 2NF, 3NF, BCNF
Normalization is the process of organizing data to reduce redundancy and improve data integrity. It involves dividing large tables into smaller, related tables.
1. First Normal Form (1NF) β
Rules:
- Each column contains atomic (indivisible) values
- No repeating groups or arrays
- Each column has a unique name
- Order of rows doesn't matter
-- β NOT in 1NF (multiple values in one cell)
orders
order_id | customer | products
1 | Alice | "Laptop, Mouse, Keyboard"
-- β
In 1NF (atomic values)
orders
order_id | customer | product
1 | Alice | Laptop
1 | Alice | Mouse
1 | Alice | Keyboard
2. Second Normal Form (2NF) β
Rules:
- Must be in 1NF
- No partial dependencies (all non-key attributes must depend on the ENTIRE primary key)
-- β NOT in 2NF (course_name depends only on course_id, not full key)
enrollments
student_id | course_id | course_name | grade
1 | 101 | Math | A
2 | 101 | Math | B
-- β
In 2NF (split into two tables)
enrollments
student_id | course_id | grade
1 | 101 | A
2 | 101 | B
courses
course_id | course_name
101 | Math
3. Third Normal Form (3NF) β
Rules:
- Must be in 2NF
- No transitive dependencies (non-key attributes should not depend on other non-key attributes)
-- β NOT in 3NF (dept_location depends on dept_id, not employee_id)
employees
employee_id | name | dept_id | dept_location
1 | Alice | 10 | NY
2 | Bob | 10 | NY
-- β
In 3NF (split department info)
employees
employee_id | name | dept_id
1 | Alice | 10
2 | Bob | 10
departments
dept_id | dept_location
10 | NY
4. Boyce-Codd Normal Form (BCNF) β
Rules:
- Must be in 3NF
- For every functional dependency A β B, A must be a super key
- BCNF is stricter than 3NF
When 3NF β BCNF: When there are multiple candidate keys that overlap.
Benefits of Normalization:
- Eliminates data redundancy
- Ensures data consistency
- Easier to maintain and update
- Reduces storage space
- Prevents insertion, update, and deletion anomalies
What is Denormalization? When is it preferred?
Denormalization is the process of intentionally introducing redundancy by combining normalized tables. It's the opposite of normalization.
Why Denormalize? π
- Improve read performance (fewer JOINs)
- Reduce complex queries
- Optimize for read-heavy applications
- Better for analytics and reporting
Trade-offs β οΈ
- Data redundancy (uses more storage)
- Slower writes (must update multiple places)
- Risk of data inconsistency
- More complex to maintain
-- Normalized (3NF) - Requires JOIN
orders: order_id, customer_id, total
customers: customer_id, name, email
SELECT o.order_id, c.name, o.total
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id;
-- Denormalized - No JOIN needed (faster reads)
orders: order_id, customer_id, customer_name, total
SELECT order_id, customer_name, total
FROM orders;
When to Denormalize:
β
Read-heavy applications (analytics, dashboards)
β
When query performance is critical
β
Data warehouses and OLAP systems
β
Caching frequently accessed data
β Avoid in transactional systems (OLTP) where data consistency is critical
How would you design a database for YouTube/LicenseLane?
This tests your ability to design a real-world database schema. Let's design for YouTube as an example:
π Requirements Analysis:
- Users can upload videos
- Users can like, comment, subscribe
- Videos have views, likes, tags
- Users have channels, playlists
-- Users Table
CREATE TABLE users (
user_id INT PRIMARY KEY,
username VARCHAR(50) UNIQUE,
email VARCHAR(100) UNIQUE,
password_hash VARCHAR(255),
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
-- Videos Table
CREATE TABLE videos (
video_id INT PRIMARY KEY,
user_id INT,
title VARCHAR(200),
description TEXT,
video_url VARCHAR(500),
thumbnail_url VARCHAR(500),
views INT DEFAULT 0,
likes INT DEFAULT 0,
dislikes INT DEFAULT 0,
uploaded_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
FOREIGN KEY (user_id) REFERENCES users(user_id)
);
-- Comments Table
CREATE TABLE comments (
comment_id INT PRIMARY KEY,
video_id INT,
user_id INT,
comment_text TEXT,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
FOREIGN KEY (video_id) REFERENCES videos(video_id),
FOREIGN KEY (user_id) REFERENCES users(user_id)
);
-- Subscriptions Table (Many-to-Many)
CREATE TABLE subscriptions (
subscriber_id INT,
channel_id INT,
subscribed_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
PRIMARY KEY (subscriber_id, channel_id),
FOREIGN KEY (subscriber_id) REFERENCES users(user_id),
FOREIGN KEY (channel_id) REFERENCES users(user_id)
);
-- Tags Table
CREATE TABLE tags (
tag_id INT PRIMARY KEY,
tag_name VARCHAR(50) UNIQUE
);
-- Video_Tags Junction Table (Many-to-Many)
CREATE TABLE video_tags (
video_id INT,
tag_id INT,
PRIMARY KEY (video_id, tag_id),
FOREIGN KEY (video_id) REFERENCES videos(video_id),
FOREIGN KEY (tag_id) REFERENCES tags(tag_id)
);
Key Design Decisions:
- Normalization: 3NF to avoid redundancy
- Junction Tables: For many-to-many relationships (video_tags, subscriptions)
- Indexes: On user_id, video_id for faster queries
- Timestamps: Track when records were created
- Scalability: Can add sharding, caching, CDN for videos later
3. ACID Properties of Transactions
What are ACID Properties? Why are they important?
ACID ensures reliable transaction processing in databases. Critical for maintaining data integrity in financial systems, e-commerce, banking, etc.
Atomicity (All or Nothing)
Transaction is all-or-nothing. Either ALL operations succeed, or NONE do. No partial completion.
Example: Money transfer: Debit from Account A AND credit to Account Bβboth must happen or neither.
Consistency (Valid State)
Database moves from one valid state to another. All rules, constraints, and triggers are maintained.
Example: Total money in the system stays constant after transfer. No money is created or destroyed.
Isolation (No Interference)
Concurrent transactions don't interfere. Each transaction appears to execute in isolation from others.
Example: Two people booking the same flight seatβonly one succeeds. Prevents race conditions.
Durability (Permanent)
Once committed, changes are permanentβeven if system crashes immediately after. Data is written to non-volatile storage.
Example: Confirmed payment survives server restart or power failure.
-- Transaction Example: Bank Transfer
BEGIN TRANSACTION;
-- Step 1: Debit $500 from Account A
UPDATE accounts
SET balance = balance - 500
WHERE account_id = 123;
-- Step 2: Credit $500 to Account B
UPDATE accounts
SET balance = balance + 500
WHERE account_id = 456;
-- If BOTH succeed β COMMIT
COMMIT;
-- If ANY fails β ROLLBACK (undo all changes)
-- ROLLBACK;
Why ACID is Critical:
- Data Integrity: Prevents corruption and inconsistencies
- Reliability: Guarantees correct behavior even during failures
- Concurrency: Multiple users can safely access database simultaneously
- Trust: Essential for financial systems, healthcare, e-commerce
4. Advanced DBMS Concepts
What is Sharding and Replication in Databases?
Sharding (Horizontal Partitioning) π
Splitting a large database into smaller, faster pieces (shards) distributed across multiple servers. Each shard contains a subset of the data.
How it works:
- Range-based: Users 1-1000 β Server1, 1001-2000 β Server2
- Hash-based: Hash(user_id) % num_servers
- Geography: US users β US server, EU β EU server
Benefits: Horizontal scalability, faster queries
Challenges: Complex joins across shards, data rebalancing
Replication (Data Copies) π
Creating multiple copies of the same data across different servers for redundancy and availability.
Types:
- Master-Slave: One master (writes), multiple slaves (reads)
- Master-Master: All nodes can read/write
- Peer-to-Peer: All nodes equal
Benefits: High availability, load balancing, disaster recovery
Challenges: Data consistency, replication lag
Key Difference:
Sharding: Splits data (each server has different data)
Replication: Copies data (each server has same data)
CAP Theorem: Consistency, Availability, Partition Tolerance
The CAP Theorem states that in a distributed database system, you can only achieve TWO out of THREE properties simultaneously.
Consistency (C)
All nodes see the same data at the same time. Every read gets the most recent write.
Availability (A)
Every request receives a response (success or failure). System is always operational.
Partition Tolerance (P)
System continues to operate despite network partitions/failures between nodes.
You can only pick TWO:
- CP (Consistency + Partition Tolerance): MongoDB, HBase, Redis
Sacrifice availabilityβsystem may reject requests to maintain consistency during partition - AP (Availability + Partition Tolerance): Cassandra, DynamoDB, CouchDB
Sacrifice consistencyβall nodes respond but may return stale data - CA (Consistency + Availability): Traditional RDBMS (MySQL, PostgreSQL)
Not partition-tolerantβsingle-node or tightly-coupled systems
Interview Tip:
In reality, Partition Tolerance is mandatory for distributed systems (network failures will happen). So the real choice is between CP vs AP.
Difference between OLTP and OLAP Systems
OLTP (Online Transaction Processing) π³
Purpose: Day-to-day transactional operations
- Operations: INSERT, UPDATE, DELETE (write-heavy)
- Queries: Simple, frequent, affecting few records
- Response Time: Milliseconds
- Data: Current, detailed transactions
- Normalization: Highly normalized (3NF)
- Users: Many concurrent users
Examples: E-commerce checkout, ATM withdrawals, airline reservations, order processing
OLAP (Online Analytical Processing) π
Purpose: Data analysis and business intelligence
- Operations: SELECT (read-heavy), complex aggregations
- Queries: Complex, infrequent, scanning millions of rows
- Response Time: Seconds to minutes
- Data: Historical, aggregated, multi-dimensional
- Normalization: Denormalized (star/snowflake schema)
- Users: Few analysts/data scientists
Examples: Sales reports, trend analysis, forecasting, data warehouses
-- OLTP Query (simple, fast)
INSERT INTO orders (customer_id, product_id, quantity)
VALUES (123, 456, 2);
-- OLAP Query (complex, analytical)
SELECT
product_category,
YEAR(order_date) AS year,
SUM(revenue) AS total_revenue,
AVG(revenue) AS avg_revenue
FROM sales_fact
JOIN products ON sales_fact.product_id = products.id
WHERE order_date >= '2020-01-01'
GROUP BY product_category, year
ORDER BY total_revenue DESC;
What are Stored Procedures, Triggers, and Views?
1. Stored Procedures π
Precompiled SQL code stored in the database that can be executed multiple times. Like a function in programming.
DELIMITER //
CREATE PROCEDURE GetEmployeesByDept(IN dept_id INT)
BEGIN
SELECT * FROM employees
WHERE department_id = dept_id;
END//
DELIMITER ;
-- Call the procedure
CALL GetEmployeesByDept(10);
Benefits: Reusability, security, performance (precompiled), network efficiency
2. Triggers β‘
Automatic actions executed before or after INSERT, UPDATE, or DELETE operations. Like event listeners.
-- Auto-update 'updated_at' timestamp
CREATE TRIGGER before_employee_update
BEFORE UPDATE ON employees
FOR EACH ROW
BEGIN
SET NEW.updated_at = NOW();
END;
-- Maintain audit log
CREATE TRIGGER after_order_insert
AFTER INSERT ON orders
FOR EACH ROW
BEGIN
INSERT INTO audit_log (action, order_id, timestamp)
VALUES ('INSERT', NEW.order_id, NOW());
END;
Use Cases: Audit logs, data validation, cascading updates, maintaining derived data
3. Views ποΈ
Virtual tables based on a SELECT query. Don't store dataβjust save the query definition.
-- Create a view
CREATE VIEW high_salary_employees AS
SELECT name, salary, department
FROM employees
WHERE salary > 100000;
-- Query the view like a table
SELECT * FROM high_salary_employees;
Benefits: Simplify complex queries, security (hide columns), logical data independence
5. SQL Joins
What are SQL Joins?
SQL Joins are used to combine rows from two or more tables based on a related column between them. They allow you to retrieve data from multiple tables in a single query.
Key Concept:
Joins work by matching values in specified columns (usually foreign keys) to combine related data from different tables.
Sample Tables for Examples
-- Create tables
CREATE TABLE employees (
emp_id INT PRIMARY KEY,
emp_name VARCHAR(50),
dept_id INT,
salary DECIMAL(10,2)
);
CREATE TABLE departments (
dept_id INT PRIMARY KEY,
dept_name VARCHAR(50),
location VARCHAR(50)
);
-- Insert sample data
INSERT INTO employees VALUES
(1, 'Alice', 10, 75000),
(2, 'Bob', 20, 65000),
(3, 'Charlie', 10, 80000),
(4, 'David', 30, 70000),
(5, 'Eve', NULL, 55000);
INSERT INTO departments VALUES
(10, 'Engineering', 'New York'),
(20, 'Marketing', 'Boston'),
(40, 'Finance', 'Chicago');
Types of SQL Joins
1. INNER JOIN β©
Returns only the matching records from both tables.
SELECT
e.emp_name,
e.salary,
d.dept_name,
d.location
FROM employees e
INNER JOIN departments d
ON e.dept_id = d.dept_id;
Result: Alice, Bob, Charlie (matched)
Excluded: David (dept 30 not in departments), Eve (NULL dept_id), Finance dept (no employees)
2. LEFT JOIN (LEFT OUTER JOIN) β
Returns ALL records from left table + matching from right table.
SELECT
e.emp_name,
e.salary,
d.dept_name,
d.location
FROM employees e
LEFT JOIN departments d
ON e.dept_id = d.dept_id;
Result: All 5 employees
NULL values: David and Eve get NULL for dept_name and location
Excluded: Finance dept (no employees)
3. RIGHT JOIN (RIGHT OUTER JOIN) β
Returns ALL records from right table + matching from left table.
SELECT
e.emp_name,
e.salary,
d.dept_name,
d.location
FROM employees e
RIGHT JOIN departments d
ON e.dept_id = d.dept_id;
Result: All 3 departments
NULL values: Finance gets NULL for emp_name and salary
Excluded: David and Eve
4. FULL OUTER JOIN βͺ
Returns ALL records from both tables (matched + unmatched).
-- MySQL workaround using UNION
SELECT * FROM employees e
LEFT JOIN departments d
ON e.dept_id = d.dept_id
UNION
SELECT * FROM employees e
RIGHT JOIN departments d
ON e.dept_id = d.dept_id;
Result: All employees + all departments
Note: MySQL doesn't support FULL OUTER JOIN directly
5. CROSS JOIN Γ
Returns Cartesian product - all possible combinations of rows.
SELECT
e.emp_name,
d.dept_name
FROM employees e
CROSS JOIN departments d;
-- Alternative syntax
SELECT e.emp_name, d.dept_name
FROM employees e, departments d;
Result: 5 employees Γ 3 departments = 15 rows
Use Case: Generate all possible combinations, schedules
6. SELF JOIN β»
Table joins with itself to compare rows within the same table.
-- Find employees with same salary
SELECT
e1.emp_name AS employee1,
e2.emp_name AS employee2,
e1.salary
FROM employees e1
INNER JOIN employees e2
ON e1.salary = e2.salary
AND e1.emp_id < e2.emp_id;
Use Case: Employee hierarchy, comparing records within same table
Quick Reference:
INNER JOIN: Only matches (β©)
LEFT JOIN: All from left + matches (β)
RIGHT JOIN: All from right + matches (β)
FULL OUTER: All from both (βͺ)
CROSS JOIN: All combinations (Γ)
SELF JOIN: Table joins itself (β»)
Advanced Join Techniques
1. Multiple Table Joins π
Join more than two tables in a single query.
-- Create projects table
CREATE TABLE projects (
project_id INT PRIMARY KEY,
project_name VARCHAR(50),
dept_id INT
);
INSERT INTO projects VALUES
(1, 'Website Redesign', 10),
(2, 'Marketing Campaign', 20),
(3, 'Mobile App', 10);
-- Join 3 tables
SELECT
e.emp_name,
d.dept_name,
p.project_name
FROM employees e
INNER JOIN departments d ON e.dept_id = d.dept_id
INNER JOIN projects p ON d.dept_id = p.dept_id
ORDER BY e.emp_name;
2. Join with WHERE Conditions π―
Filter joined results with additional conditions.
-- Find high-salary employees in specific departments
SELECT
e.emp_name,
d.dept_name,
e.salary
FROM employees e
INNER JOIN departments d ON e.dept_id = d.dept_id
WHERE e.salary > 70000
AND d.location = 'New York';
3. Join with GROUP BY π
Aggregate data from joined tables.
-- Count employees and average salary per department
SELECT
d.dept_name,
COUNT(e.emp_id) AS employee_count,
AVG(e.salary) AS avg_salary
FROM departments d
LEFT JOIN employees e ON d.dept_id = e.dept_id
GROUP BY d.dept_name
ORDER BY employee_count DESC;
4. Join with Subquery π§©
Join with derived tables or inline views.
-- Find employees earning above their department average
SELECT
e.emp_name,
e.salary,
dept_avg.avg_salary
FROM employees e
INNER JOIN (
SELECT dept_id, AVG(salary) AS avg_salary
FROM employees
GROUP BY dept_id
) dept_avg ON e.dept_id = dept_avg.dept_id
WHERE e.salary > dept_avg.avg_salary;
5. Finding Unmatched Records π
Use LEFT/RIGHT JOIN with NULL checks to find non-matching records.
-- Find employees WITHOUT a department
SELECT e.emp_name
FROM employees e
LEFT JOIN departments d ON e.dept_id = d.dept_id
WHERE d.dept_id IS NULL;
-- Find departments WITHOUT employees
SELECT d.dept_name
FROM departments d
LEFT JOIN employees e ON d.dept_id = e.dept_id
WHERE e.emp_id IS NULL;
Join Performance Tips
- Index join columns: Create indexes on columns used in ON clauses for faster lookups
- Use INNER JOIN when possible: It's faster than OUTER JOINs as it processes fewer rows
- Filter early: Apply WHERE conditions before joining to reduce data processed
- Avoid SELECT *: Only select columns you need to reduce data transfer
- Use EXPLAIN: Analyze query execution plan to identify bottlenecks
- Join order matters: Start with the smallest table when joining multiple tables
-- Create indexes on frequently joined columns
CREATE INDEX idx_emp_dept ON employees(dept_id);
CREATE INDEX idx_dept_id ON departments(dept_id);
-- Analyze query performance
EXPLAIN SELECT *
FROM employees e
INNER JOIN departments d ON e.dept_id = d.dept_id;
Common Interview Questions
Q1: What's the difference between INNER JOIN and OUTER JOIN?
INNER JOIN: Returns only matching records from both tables.
OUTER JOIN: Returns all records from one or both tables, with NULLs for non-matching rows.
Q2: When would you use a SELF JOIN?
Use SELF JOIN when you need to compare rows within the same table, such as:
β’ Finding employee-manager relationships
β’ Comparing records with similar attributes (same salary, same city)
β’ Hierarchical data structures
Q3: What's the difference between JOIN and UNION?
JOIN: Combines columns from multiple tables horizontally (side by side).
UNION: Combines rows from multiple queries vertically (stacked on top).
Q4: Can you have multiple JOIN conditions?
Yes! Use AND/OR in the ON clause:
ON t1.id = t2.id AND t1.status = t2.status
Q5: What's the difference between WHERE and ON in joins?
ON: Specifies join condition (how tables relate).
WHERE: Filters results after the join is performed.
For INNER JOINs, they often produce the same result, but for OUTER JOINs, the difference is significant.
6. Indexing & Query Optimization
What is Database Indexing?
An index is a database structure that improves the speed of data retrieval operations. Think of it like a book's indexβinstead of reading every page, you can jump directly to the relevant section.
How Indexes Work:
Indexes create a separate data structure (usually B-Tree or Hash) that stores column values and pointers to the actual rows. This allows the database to find data without scanning every row.
-- Create single-column index
CREATE INDEX idx_employee_name ON employees(name);
-- Create composite index (multiple columns)
CREATE INDEX idx_dept_salary ON employees(dept_id, salary);
-- Create unique index
CREATE UNIQUE INDEX idx_email ON users(email);
-- Drop index
DROP INDEX idx_employee_name ON employees;
β Pros of Indexing:
- Dramatically faster SELECT queries
- Speeds up WHERE, ORDER BY, JOIN
- Enforces uniqueness (UNIQUE index)
- Improves sorting performance
β Cons of Indexing:
- Slower INSERT, UPDATE, DELETE
- Takes extra storage space
- Maintenance overhead
- Too many indexes can hurt performance
When to Use Indexes:
- Columns frequently used in WHERE clauses
- Foreign key columns used in JOINs
- Columns used in ORDER BY or GROUP BY
- Columns with high cardinality (many unique values)
- Don't index: Small tables, columns with low cardinality, frequently updated columns
7. Top 15 SQL Interview Queries π₯
Master these frequently asked SQL queries to ace your technical interviews. Each includes the problem, solution, and explanation.
1. Find the Second Highest Salary
Problem: Write a query to find the second highest salary from the employees table.
-- Using LIMIT and OFFSET
SELECT DISTINCT salary
FROM employees
ORDER BY salary DESC
LIMIT 1 OFFSET 1;
-- Using Subquery (works for Nth highest)
SELECT MAX(salary) AS second_highest
FROM employees
WHERE salary < (SELECT MAX(salary) FROM employees);
Explanation:
Method 1: DISTINCT removes duplicates, ORDER BY DESC sorts highest first, OFFSET 1 skips the first row.
Method 2: Subquery finds max salary, outer query finds max below that value.
2. Find Nth Highest Salary (Generic Solution)
Problem: Write a function/query to find the Nth highest salary.
-- Using LIMIT and OFFSET (N = 3 for third highest)
SELECT DISTINCT salary
FROM employees
ORDER BY salary DESC
LIMIT 1 OFFSET 2; -- N-1
-- Using DENSE_RANK (handles ties better)
WITH RankedSalaries AS (
SELECT
salary,
DENSE_RANK() OVER (ORDER BY salary DESC) AS rank_num
FROM employees
)
SELECT DISTINCT salary
FROM RankedSalaries
WHERE rank_num = 3; -- N
Why DENSE_RANK?
DENSE_RANK handles duplicate salaries correctly. If two employees have the highest salary, both get rank 1, and the next salary gets rank 2 (not rank 3 like ROW_NUMBER would give).
3. Find Duplicate Records
Problem: Find all duplicate email addresses in the users table.
-- Find duplicate emails
SELECT email, COUNT(*) AS count
FROM users
GROUP BY email
HAVING COUNT(*) > 1;
-- Show all details of duplicate records
SELECT u.*
FROM users u
INNER JOIN (
SELECT email
FROM users
GROUP BY email
HAVING COUNT(*) > 1
) duplicates ON u.email = duplicates.email;
Explanation:
GROUP BY groups rows by email, HAVING filters groups with count > 1. The second query joins back to get full record details of duplicates.
4. Delete Duplicate Records (Keep One)
Problem: Delete duplicate rows from a table, keeping only one instance of each duplicate.
-- Using ROW_NUMBER to identify duplicates
DELETE FROM users
WHERE id IN (
SELECT id
FROM (
SELECT
id,
ROW_NUMBER() OVER (
PARTITION BY email
ORDER BY id
) AS row_num
FROM users
) AS temp
WHERE row_num > 1
);
-- Alternative: Using self-join
DELETE u1
FROM users u1
INNER JOIN users u2
WHERE u1.id > u2.id
AND u1.email = u2.email;
Explanation:
ROW_NUMBER assigns sequential numbers within each email group. We keep row_num = 1 (first occurrence) and delete row_num > 1 (duplicates).
5. Employees Earning More Than Their Managers
Problem: Find employees who earn more than their managers.
-- Assume employees table has: id, name, salary, manager_id
SELECT
e.name AS employee_name,
e.salary AS employee_salary,
m.name AS manager_name,
m.salary AS manager_salary
FROM employees e
INNER JOIN employees m
ON e.manager_id = m.id
WHERE e.salary > m.salary;
Explanation:
Self-join the employees table where e represents employees and m represents managers. The join condition links employee's manager_id to manager's id.
6. Department with Highest Average Salary
Problem: Find the department with the highest average salary.
-- Method 1: Using LIMIT
SELECT
d.dept_name,
AVG(e.salary) AS avg_salary
FROM departments d
INNER JOIN employees e ON d.dept_id = e.dept_id
GROUP BY d.dept_name
ORDER BY avg_salary DESC
LIMIT 1;
-- Method 2: Using subquery with MAX
SELECT
dept_name,
avg_salary
FROM (
SELECT
d.dept_name,
AVG(e.salary) AS avg_salary
FROM departments d
INNER JOIN employees e ON d.dept_id = e.dept_id
GROUP BY d.dept_name
) AS dept_averages
WHERE avg_salary = (
SELECT MAX(avg_salary)
FROM (
SELECT AVG(salary) AS avg_salary
FROM employees
GROUP BY dept_id
) AS temp
);
Explanation:
Calculate average salary per department using AVG() and GROUP BY. Method 1 is simpler, Method 2 handles ties (multiple departments with same highest average).
7. Calculate Cumulative Sum (Running Total)
Problem: Calculate a running total of sales by date.
-- Using Window Function (Modern SQL)
SELECT
sale_date,
amount,
SUM(amount) OVER (
ORDER BY sale_date
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) AS cumulative_sum
FROM sales
ORDER BY sale_date;
-- Simplified version
SELECT
sale_date,
amount,
SUM(amount) OVER (ORDER BY sale_date) AS cumulative_sum
FROM sales;
Explanation:
Window functions (OVER clause) calculate running totals without GROUP BY. Each row sees all previous rows up to the current row when ordered by sale_date.
8. Find Missing Numbers in Sequence
Problem: Find missing IDs in a sequence (e.g., 1,2,4,5,7 - missing 3 and 6).
-- Using LEAD to find gaps
WITH numbered AS (
SELECT
id,
LEAD(id) OVER (ORDER BY id) AS next_id
FROM employees
)
SELECT
id + 1 AS missing_id_start,
next_id - 1 AS missing_id_end
FROM numbered
WHERE next_id - id > 1;
-- Alternative: Generate sequence and find missing
WITH RECURSIVE seq AS (
SELECT 1 AS num
UNION ALL
SELECT num + 1
FROM seq
WHERE num < (SELECT MAX(id) FROM employees)
)
SELECT seq.num AS missing_id
FROM seq
LEFT JOIN employees e ON seq.num = e.id
WHERE e.id IS NULL;
Explanation:
LEAD() gets the next row's value. If next_id - current_id > 1, there's a gap. The recursive CTE generates all numbers from 1 to MAX(id) and finds which don't exist.
9. Rank Students by Score (Handle Ties)
Problem: Rank students by their scores, handling ties appropriately.
-- ROW_NUMBER: 1, 2, 3, 4 (no ties)
SELECT
student_name,
score,
ROW_NUMBER() OVER (ORDER BY score DESC) AS row_num
FROM students;
-- RANK: 1, 2, 2, 4 (ties share rank, next rank skips)
SELECT
student_name,
score,
RANK() OVER (ORDER BY score DESC) AS rank_num
FROM students;
-- DENSE_RANK: 1, 2, 2, 3 (ties share rank, next rank consecutive)
SELECT
student_name,
score,
DENSE_RANK() OVER (ORDER BY score DESC) AS dense_rank
FROM students;
Comparison:
ROW_NUMBER: Unique rank for each row (1,2,3,4)
RANK: Ties get same rank, next rank skips (1,2,2,4)
DENSE_RANK: Ties get same rank, next rank is consecutive (1,2,2,3)
10. Find Pairs of Employees in Same City
Problem: Find all pairs of employees who live in the same city.
-- Self-join to find pairs
SELECT
e1.name AS employee1,
e2.name AS employee2,
e1.city
FROM employees e1
INNER JOIN employees e2
ON e1.city = e2.city
AND e1.id < e2.id -- Avoid duplicate pairs and self-pairing
ORDER BY e1.city, e1.name;
Explanation:
The condition e1.id < e2.id ensures we get each pair only once (prevents both "Alice-Bob" and "Bob-Alice") and prevents matching an employee with themselves.
11. Top 3 Highest Paid Employees per Department
Problem: Find the top 3 highest-paid employees in each department.
-- Using ROW_NUMBER with PARTITION BY
WITH RankedEmployees AS (
SELECT
dept_id,
name,
salary,
ROW_NUMBER() OVER (
PARTITION BY dept_id
ORDER BY salary DESC
) AS rank_in_dept
FROM employees
)
SELECT
d.dept_name,
r.name,
r.salary,
r.rank_in_dept
FROM RankedEmployees r
JOIN departments d ON r.dept_id = d.dept_id
WHERE r.rank_in_dept <= 3
ORDER BY d.dept_name, r.rank_in_dept;
Explanation:
PARTITION BY dept_id creates separate ranking windows for each department. ROW_NUMBER ranks employees within their department, then we filter for top 3.
12. Pivot Table - Convert Rows to Columns
Problem: Transform sales data from rows to columns (pivot by month).
-- Original data: product, month, sales
-- Goal: product, Jan_sales, Feb_sales, Mar_sales
SELECT
product,
SUM(CASE WHEN month = 'January' THEN sales ELSE 0 END) AS Jan_sales,
SUM(CASE WHEN month = 'February' THEN sales ELSE 0 END) AS Feb_sales,
SUM(CASE WHEN month = 'March' THEN sales ELSE 0 END) AS Mar_sales
FROM sales
GROUP BY product;
-- Using PIVOT (SQL Server syntax)
SELECT *
FROM sales
PIVOT (
SUM(sales)
FOR month IN ([January], [February], [March])
) AS pivot_table;
Explanation:
CASE WHEN creates conditional columns. Each CASE statement checks if the month matches and includes the sales value. GROUP BY aggregates across all months for each product.
13. Find Users Who Logged In for 3+ Consecutive Days
Problem: Identify users with consecutive login streaks of 3+ days.
-- Using ROW_NUMBER to detect consecutive sequences
WITH LoginGroups AS (
SELECT
user_id,
login_date,
DATE_SUB(login_date, INTERVAL
ROW_NUMBER() OVER (
PARTITION BY user_id
ORDER BY login_date
) DAY
) AS group_date
FROM logins
),
ConsecutiveCounts AS (
SELECT
user_id,
group_date,
COUNT(*) AS consecutive_days
FROM LoginGroups
GROUP BY user_id, group_date
HAVING COUNT(*) >= 3
)
SELECT DISTINCT user_id
FROM ConsecutiveCounts;
Explanation:
Subtracting ROW_NUMBER from date creates groups where consecutive dates share the same group_date value. Count records per group to find streaks of 3+ days.
14. Calculate 7-Day Moving Average
Problem: Calculate a 7-day moving average of daily sales.
-- Using window frame
SELECT
sale_date,
daily_sales,
AVG(daily_sales) OVER (
ORDER BY sale_date
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
) AS moving_avg_7days
FROM daily_sales
ORDER BY sale_date;
-- Alternative: Using date range
SELECT
sale_date,
daily_sales,
AVG(daily_sales) OVER (
ORDER BY sale_date
RANGE BETWEEN INTERVAL 6 DAY PRECEDING AND CURRENT ROW
) AS moving_avg_7days
FROM daily_sales;
Explanation:
Window frame "6 PRECEDING AND CURRENT ROW" creates a sliding window of 7 rows (current + 6 previous). AVG calculates the average within this window for each row.
15. Find Customers Who Never Placed an Order
Problem: Identify customers who have registered but never placed an order.
-- Using LEFT JOIN and NULL check
SELECT
c.customer_id,
c.customer_name,
c.email
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id
WHERE o.order_id IS NULL;
-- Using NOT IN subquery
SELECT
customer_id,
customer_name,
email
FROM customers
WHERE customer_id NOT IN (
SELECT DISTINCT customer_id
FROM orders
WHERE customer_id IS NOT NULL
);
-- Using NOT EXISTS (Most efficient)
SELECT
customer_id,
customer_name,
email
FROM customers c
WHERE NOT EXISTS (
SELECT 1
FROM orders o
WHERE o.customer_id = c.customer_id
);
Which Method to Use?
LEFT JOIN + NULL: Simple and readable
NOT IN: Easy to understand but watch for NULLs (use WHERE IS NOT NULL)
NOT EXISTS: Usually most efficient for large datasets, stops as soon as match is found
Interview Tips for SQL Queries:
- Clarify requirements: Ask about NULL handling, duplicate treatment, and performance expectations
- Start simple: Write a basic solution first, then optimize
- Explain your approach: Talk through your thought process as you write
- Consider edge cases: Empty tables, NULL values, duplicates, single-row tables
- Know time complexity: Understand which operations are expensive (full table scans, subqueries)
- Test mentally: Walk through sample data to verify your logic
- Optimize when asked: Discuss indexes, query execution plans, alternative approaches