SQL & DBMS Complete Roadmap

1. Basic DBMS Concepts

Done

What is a DBMS?

A Database Management System (DBMS) is software that enables users to create, read, update, and delete data in a database efficiently. It acts as an interface between the database and end-users or application programs.

Key Functions of DBMS:

Data Definition: Define database structure (CREATE, ALTER, DROP)
Data Manipulation: Insert, update, delete, retrieve data (CRUD operations)
Data Security: User authentication, authorization, access control
Data Integrity: Maintain accuracy and consistency through constraints
Concurrency Control: Handle simultaneous data access by multiple users
Backup & Recovery: Protect against data loss and system failures
Data Independence: Separate physical storage from logical view

Popular DBMS: MySQL, PostgreSQL, Oracle, MongoDB, SQL Server, SQLite, Cassandra, Redis

SQL vs NoSQL: When to use each?

📊 SQL (Relational)

Structure: Tables with fixed schema (rows & columns)
Query Language: Structured Query Language (SQL)
Scalability: Vertical (scale up - add more CPU/RAM)
ACID: Strong consistency, transactions
Relationships: Foreign keys, JOINs
Best for: Complex queries, financial data, e-commerce
Examples: MySQL, PostgreSQL, Oracle, MS SQL

🚀 NoSQL (Non-Relational)

Structure: Flexible schema (JSON, key-value, graph)
Query Language: Database-specific APIs
Scalability: Horizontal (scale out - add more servers)
BASE: Eventual consistency, high availability
Relationships: Embedded documents, no JOINs
Best for: Big data, real-time apps, IoT, social media
Examples: MongoDB, Cassandra, Redis, DynamoDB

Decision Guide:

Choose SQL when:

You need complex joins and relationships between tables
Data integrity and ACID transactions are critical (banking, payments)
Your schema is well-defined and stable
You need strong consistency

Choose NoSQL when:

You need to scale horizontally across multiple servers
Your schema is flexible or evolves frequently
You prioritize speed and availability over strict consistency
You're handling huge volumes of unstructured data

Primary Key, Foreign Key, Candidate Key, Super Key

1. Primary Key 🔑

Uniquely identifies each record in a table. Cannot be NULL. Only ONE per table.

SQL

CREATE TABLE students (
    student_id INT PRIMARY KEY,
    name VARCHAR(100),
    email VARCHAR(100)
);

2. Foreign Key 🔗

Links two tables together. References the PRIMARY KEY of another table. Ensures referential integrity.

SQL

CREATE TABLE enrollments (
    enrollment_id INT PRIMARY KEY,
    student_id INT,
    course_id INT,
    FOREIGN KEY (student_id) REFERENCES students(student_id)
);

Benefit: If you try to delete a student who has enrollments, the database will prevent it (maintaining data integrity).

3. Candidate Key 🎯

Any column (or combination) that can uniquely identify a record. There can be multiple candidate keys. The Primary Key is chosen from candidate keys.

Example: In a Students table:

student_id → Candidate Key (unique)
email → Candidate Key (unique)
phone → Candidate Key (if unique)

We choose student_id as PRIMARY KEY from these candidates.

4. Super Key 🔐

Any combination of columns that uniquely identifies a record. A Candidate Key is a minimal Super Key (no unnecessary columns).

Example Super Keys:

{student_id} → Minimal ✅
{email} → Minimal ✅
{student_id, name} → Has redundancy ❌
{student_id, name, email} → Has redundancy ❌

Schema vs Instance in a Database

Schema (Blueprint) 📐

The logical design/structure of the database. Defines table names, column names, data types, constraints. Rarely changes.

SQL

CREATE TABLE users (
    id INT,
    name VARCHAR(50),
    age INT
);

Instance (Data) 📊

The actual data stored at a particular moment. Changes frequently as records are inserted/updated/deleted.

Data

(1, 'Alice', 25)
(2, 'Bob', 30)
(3, 'Charlie', 28)

Perfect Analogy:

Schema = House Blueprint (structure, layout)
Instance = Furniture Inside (actual content, changes often)

What are Constraints?

Constraints are rules applied to table columns to ensure data accuracy, consistency, and integrity. They prevent invalid data from being entered.

SQL

CREATE TABLE employees (
    id INT PRIMARY KEY,                    -- NOT NULL + UNIQUE
    name VARCHAR(100) NOT NULL,          -- Must have value
    email VARCHAR(100) UNIQUE,          -- No duplicates
    age INT CHECK (age >= 18),              -- Must be 18+
    salary DECIMAL DEFAULT 50000,         -- Default if not provided
    dept_id INT,
    FOREIGN KEY (dept_id) REFERENCES departments(id)
);

Types of Constraints:

NOT NULL: Column cannot have NULL value
UNIQUE: All values in column must be different
PRIMARY KEY: NOT NULL + UNIQUE combined (one per table)
FOREIGN KEY: Links tables, ensures referential integrity
CHECK: Ensures values meet a specific condition
DEFAULT: Sets default value if none is provided

2. Normalization & Database Design

Done

What is Normalization? Explain 1NF, 2NF, 3NF, BCNF

Normalization is the process of organizing data to reduce redundancy and improve data integrity. It involves dividing large tables into smaller, related tables.

1. First Normal Form (1NF) ✅

Rules:

Each column contains atomic (indivisible) values
No repeating groups or arrays
Each column has a unique name
Order of rows doesn't matter

Example

-- ❌ NOT in 1NF (multiple values in one cell)
orders
order_id | customer | products
1        | Alice    | "Laptop, Mouse, Keyboard"

-- ✅ In 1NF (atomic values)
orders
order_id | customer | product
1        | Alice    | Laptop
1        | Alice    | Mouse
1        | Alice    | Keyboard

2. Second Normal Form (2NF) ✅

Rules:

Must be in 1NF
No partial dependencies (all non-key attributes must depend on the ENTIRE primary key)

Example

-- ❌ NOT in 2NF (course_name depends only on course_id, not full key)
enrollments
student_id | course_id | course_name  | grade
1          | 101       | Math         | A
2          | 101       | Math         | B

-- ✅ In 2NF (split into two tables)
enrollments
student_id | course_id | grade
1          | 101       | A
2          | 101       | B

courses
course_id | course_name
101       | Math

3. Third Normal Form (3NF) ✅

Rules:

Must be in 2NF
No transitive dependencies (non-key attributes should not depend on other non-key attributes)

Example

-- ❌ NOT in 3NF (dept_location depends on dept_id, not employee_id)
employees
employee_id | name  | dept_id | dept_location
1           | Alice | 10      | NY
2           | Bob   | 10      | NY

-- ✅ In 3NF (split department info)
employees
employee_id | name  | dept_id
1           | Alice | 10
2           | Bob   | 10

departments
dept_id | dept_location
10      | NY

4. Boyce-Codd Normal Form (BCNF) ✅

Rules:

Must be in 3NF
For every functional dependency A → B, A must be a super key
BCNF is stricter than 3NF

When 3NF ≠ BCNF: When there are multiple candidate keys that overlap.

Benefits of Normalization:

Eliminates data redundancy
Ensures data consistency
Easier to maintain and update
Reduces storage space
Prevents insertion, update, and deletion anomalies

What is Denormalization? When is it preferred?

Denormalization is the process of intentionally introducing redundancy by combining normalized tables. It's the opposite of normalization.

Why Denormalize? 🚀

Improve read performance (fewer JOINs)
Reduce complex queries
Optimize for read-heavy applications
Better for analytics and reporting

Trade-offs ⚠️

Data redundancy (uses more storage)
Slower writes (must update multiple places)
Risk of data inconsistency
More complex to maintain

Example

-- Normalized (3NF) - Requires JOIN
orders: order_id, customer_id, total
customers: customer_id, name, email

SELECT o.order_id, c.name, o.total
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id;

-- Denormalized - No JOIN needed (faster reads)
orders: order_id, customer_id, customer_name, total

SELECT order_id, customer_name, total
FROM orders;

When to Denormalize:

✅ Read-heavy applications (analytics, dashboards)
✅ When query performance is critical
✅ Data warehouses and OLAP systems
✅ Caching frequently accessed data
❌ Avoid in transactional systems (OLTP) where data consistency is critical

How would you design a database for YouTube/LicenseLane?

This tests your ability to design a real-world database schema. Let's design for YouTube as an example:

📋 Requirements Analysis:

Users can upload videos
Users can like, comment, subscribe
Videos have views, likes, tags
Users have channels, playlists

SQL Schema

-- Users Table
CREATE TABLE users (
    user_id INT PRIMARY KEY,
    username VARCHAR(50) UNIQUE,
    email VARCHAR(100) UNIQUE,
    password_hash VARCHAR(255),
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Videos Table
CREATE TABLE videos (
    video_id INT PRIMARY KEY,
    user_id INT,
    title VARCHAR(200),
    description TEXT,
    video_url VARCHAR(500),
    thumbnail_url VARCHAR(500),
    views INT DEFAULT 0,
    likes INT DEFAULT 0,
    dislikes INT DEFAULT 0,
    uploaded_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    FOREIGN KEY (user_id) REFERENCES users(user_id)
);

-- Comments Table
CREATE TABLE comments (
    comment_id INT PRIMARY KEY,
    video_id INT,
    user_id INT,
    comment_text TEXT,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    FOREIGN KEY (video_id) REFERENCES videos(video_id),
    FOREIGN KEY (user_id) REFERENCES users(user_id)
);

-- Subscriptions Table (Many-to-Many)
CREATE TABLE subscriptions (
    subscriber_id INT,
    channel_id INT,
    subscribed_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (subscriber_id, channel_id),
    FOREIGN KEY (subscriber_id) REFERENCES users(user_id),
    FOREIGN KEY (channel_id) REFERENCES users(user_id)
);

-- Tags Table
CREATE TABLE tags (
    tag_id INT PRIMARY KEY,
    tag_name VARCHAR(50) UNIQUE
);

-- Video_Tags Junction Table (Many-to-Many)
CREATE TABLE video_tags (
    video_id INT,
    tag_id INT,
    PRIMARY KEY (video_id, tag_id),
    FOREIGN KEY (video_id) REFERENCES videos(video_id),
    FOREIGN KEY (tag_id) REFERENCES tags(tag_id)
);

Key Design Decisions:

Normalization: 3NF to avoid redundancy
Junction Tables: For many-to-many relationships (video_tags, subscriptions)
Indexes: On user_id, video_id for faster queries
Timestamps: Track when records were created
Scalability: Can add sharding, caching, CDN for videos later

3. ACID Properties of Transactions

Done

What are ACID Properties? Why are they important?

ACID ensures reliable transaction processing in databases. Critical for maintaining data integrity in financial systems, e-commerce, banking, etc.

Atomicity (All or Nothing)

Transaction is all-or-nothing. Either ALL operations succeed, or NONE do. No partial completion.

Example: Money transfer: Debit from Account A AND credit to Account B—both must happen or neither.

Consistency (Valid State)

Database moves from one valid state to another. All rules, constraints, and triggers are maintained.

Example: Total money in the system stays constant after transfer. No money is created or destroyed.

Isolation (No Interference)

Concurrent transactions don't interfere. Each transaction appears to execute in isolation from others.

Example: Two people booking the same flight seat—only one succeeds. Prevents race conditions.

Durability (Permanent)

Once committed, changes are permanent—even if system crashes immediately after. Data is written to non-volatile storage.

Example: Confirmed payment survives server restart or power failure.

SQL Transaction

-- Transaction Example: Bank Transfer
BEGIN TRANSACTION;

-- Step 1: Debit $500 from Account A
UPDATE accounts 
SET balance = balance - 500 
WHERE account_id = 123;

-- Step 2: Credit $500 to Account B
UPDATE accounts 
SET balance = balance + 500 
WHERE account_id = 456;

-- If BOTH succeed → COMMIT
COMMIT;

-- If ANY fails → ROLLBACK (undo all changes)
-- ROLLBACK;

Why ACID is Critical:

Data Integrity: Prevents corruption and inconsistencies
Reliability: Guarantees correct behavior even during failures
Concurrency: Multiple users can safely access database simultaneously
Trust: Essential for financial systems, healthcare, e-commerce

4. Advanced DBMS Concepts

Done

What is Sharding and Replication in Databases?

Sharding (Horizontal Partitioning) 🔀

Splitting a large database into smaller, faster pieces (shards) distributed across multiple servers. Each shard contains a subset of the data.

How it works:

Range-based: Users 1-1000 → Server1, 1001-2000 → Server2
Hash-based: Hash(user_id) % num_servers
Geography: US users → US server, EU → EU server

Benefits: Horizontal scalability, faster queries
Challenges: Complex joins across shards, data rebalancing

Replication (Data Copies) 🔄

Creating multiple copies of the same data across different servers for redundancy and availability.

Types:

Master-Slave: One master (writes), multiple slaves (reads)
Master-Master: All nodes can read/write
Peer-to-Peer: All nodes equal

Benefits: High availability, load balancing, disaster recovery
Challenges: Data consistency, replication lag

Key Difference:

Sharding: Splits data (each server has different data)
Replication: Copies data (each server has same data)

CAP Theorem: Consistency, Availability, Partition Tolerance

The CAP Theorem states that in a distributed database system, you can only achieve TWO out of THREE properties simultaneously.

Consistency (C)

All nodes see the same data at the same time. Every read gets the most recent write.

Availability (A)

Every request receives a response (success or failure). System is always operational.

Partition Tolerance (P)

System continues to operate despite network partitions/failures between nodes.

You can only pick TWO:

CP (Consistency + Partition Tolerance): MongoDB, HBase, Redis
Sacrifice availability—system may reject requests to maintain consistency during partition
AP (Availability + Partition Tolerance): Cassandra, DynamoDB, CouchDB
Sacrifice consistency—all nodes respond but may return stale data
CA (Consistency + Availability): Traditional RDBMS (MySQL, PostgreSQL)
Not partition-tolerant—single-node or tightly-coupled systems

Interview Tip:

In reality, Partition Tolerance is mandatory for distributed systems (network failures will happen). So the real choice is between CP vs AP.

Difference between OLTP and OLAP Systems

OLTP (Online Transaction Processing) 💳

Purpose: Day-to-day transactional operations

Operations: INSERT, UPDATE, DELETE (write-heavy)
Queries: Simple, frequent, affecting few records
Response Time: Milliseconds
Data: Current, detailed transactions
Normalization: Highly normalized (3NF)
Users: Many concurrent users

Examples: E-commerce checkout, ATM withdrawals, airline reservations, order processing

OLAP (Online Analytical Processing) 📊

Purpose: Data analysis and business intelligence

Operations: SELECT (read-heavy), complex aggregations
Queries: Complex, infrequent, scanning millions of rows
Response Time: Seconds to minutes
Data: Historical, aggregated, multi-dimensional
Normalization: Denormalized (star/snowflake schema)
Users: Few analysts/data scientists

Examples: Sales reports, trend analysis, forecasting, data warehouses

Example Queries

-- OLTP Query (simple, fast)
INSERT INTO orders (customer_id, product_id, quantity) 
VALUES (123, 456, 2);

-- OLAP Query (complex, analytical)
SELECT 
    product_category,
    YEAR(order_date) AS year,
    SUM(revenue) AS total_revenue,
    AVG(revenue) AS avg_revenue
FROM sales_fact
JOIN products ON sales_fact.product_id = products.id
WHERE order_date >= '2020-01-01'
GROUP BY product_category, year
ORDER BY total_revenue DESC;

What are Stored Procedures, Triggers, and Views?

1. Stored Procedures 📝

Precompiled SQL code stored in the database that can be executed multiple times. Like a function in programming.

SQL

DELIMITER //
CREATE PROCEDURE GetEmployeesByDept(IN dept_id INT)
BEGIN
    SELECT * FROM employees 
    WHERE department_id = dept_id;
END//
DELIMITER ;

-- Call the procedure
CALL GetEmployeesByDept(10);

Benefits: Reusability, security, performance (precompiled), network efficiency

2. Triggers ⚡

Automatic actions executed before or after INSERT, UPDATE, or DELETE operations. Like event listeners.

SQL

-- Auto-update 'updated_at' timestamp
CREATE TRIGGER before_employee_update
BEFORE UPDATE ON employees
FOR EACH ROW
BEGIN
    SET NEW.updated_at = NOW();
END;

-- Maintain audit log
CREATE TRIGGER after_order_insert
AFTER INSERT ON orders
FOR EACH ROW
BEGIN
    INSERT INTO audit_log (action, order_id, timestamp)
    VALUES ('INSERT', NEW.order_id, NOW());
END;

Use Cases: Audit logs, data validation, cascading updates, maintaining derived data

3. Views 👁️

Virtual tables based on a SELECT query. Don't store data—just save the query definition.

SQL

-- Create a view
CREATE VIEW high_salary_employees AS
SELECT name, salary, department
FROM employees
WHERE salary > 100000;

-- Query the view like a table
SELECT * FROM high_salary_employees;

Benefits: Simplify complex queries, security (hide columns), logical data independence

5. SQL Joins

Done

What are SQL Joins?

SQL Joins are used to combine rows from two or more tables based on a related column between them. They allow you to retrieve data from multiple tables in a single query.

Key Concept:

Joins work by matching values in specified columns (usually foreign keys) to combine related data from different tables.

Sample Tables for Examples

SQL

-- Create tables
CREATE TABLE employees (
    emp_id INT PRIMARY KEY,
    emp_name VARCHAR(50),
    dept_id INT,
    salary DECIMAL(10,2)
);

CREATE TABLE departments (
    dept_id INT PRIMARY KEY,
    dept_name VARCHAR(50),
    location VARCHAR(50)
);

-- Insert sample data
INSERT INTO employees VALUES 
(1, 'Alice', 10, 75000),
(2, 'Bob', 20, 65000),
(3, 'Charlie', 10, 80000),
(4, 'David', 30, 70000),
(5, 'Eve', NULL, 55000);

INSERT INTO departments VALUES 
(10, 'Engineering', 'New York'),
(20, 'Marketing', 'Boston'),
(40, 'Finance', 'Chicago');

Types of SQL Joins

1. INNER JOIN ∩

Returns only the matching records from both tables.

SQL

SELECT 
    e.emp_name,
    e.salary,
    d.dept_name,
    d.location
FROM employees e
INNER JOIN departments d 
    ON e.dept_id = d.dept_id;

Result: Alice, Bob, Charlie (matched)
Excluded: David (dept 30 not in departments), Eve (NULL dept_id), Finance dept (no employees)

2. LEFT JOIN (LEFT OUTER JOIN) ⊃

Returns ALL records from left table + matching from right table.

SQL

SELECT 
    e.emp_name,
    e.salary,
    d.dept_name,
    d.location
FROM employees e
LEFT JOIN departments d 
    ON e.dept_id = d.dept_id;

Result: All 5 employees
NULL values: David and Eve get NULL for dept_name and location
Excluded: Finance dept (no employees)

3. RIGHT JOIN (RIGHT OUTER JOIN) ⊂

Returns ALL records from right table + matching from left table.

SQL

SELECT 
    e.emp_name,
    e.salary,
    d.dept_name,
    d.location
FROM employees e
RIGHT JOIN departments d 
    ON e.dept_id = d.dept_id;

Result: All 3 departments
NULL values: Finance gets NULL for emp_name and salary
Excluded: David and Eve

4. FULL OUTER JOIN ∪

Returns ALL records from both tables (matched + unmatched).

SQL

-- MySQL workaround using UNION
SELECT * FROM employees e
LEFT JOIN departments d 
    ON e.dept_id = d.dept_id
UNION
SELECT * FROM employees e
RIGHT JOIN departments d 
    ON e.dept_id = d.dept_id;

Result: All employees + all departments
Note: MySQL doesn't support FULL OUTER JOIN directly

5. CROSS JOIN ×

Returns Cartesian product - all possible combinations of rows.

SQL

SELECT 
    e.emp_name,
    d.dept_name
FROM employees e
CROSS JOIN departments d;

-- Alternative syntax
SELECT e.emp_name, d.dept_name
FROM employees e, departments d;

Result: 5 employees × 3 departments = 15 rows
Use Case: Generate all possible combinations, schedules

6. SELF JOIN ↻

Table joins with itself to compare rows within the same table.

SQL

-- Find employees with same salary
SELECT 
    e1.emp_name AS employee1,
    e2.emp_name AS employee2,
    e1.salary
FROM employees e1
INNER JOIN employees e2 
    ON e1.salary = e2.salary 
    AND e1.emp_id < e2.emp_id;

Use Case: Employee hierarchy, comparing records within same table

Quick Reference:

INNER JOIN: Only matches (∩)
LEFT JOIN: All from left + matches (⊃)
RIGHT JOIN: All from right + matches (⊂)
FULL OUTER: All from both (∪)
CROSS JOIN: All combinations (×)
SELF JOIN: Table joins itself (↻)

Advanced Join Techniques

1. Multiple Table Joins 🔗

Join more than two tables in a single query.

SQL

-- Create projects table
CREATE TABLE projects (
    project_id INT PRIMARY KEY,
    project_name VARCHAR(50),
    dept_id INT
);

INSERT INTO projects VALUES 
(1, 'Website Redesign', 10),
(2, 'Marketing Campaign', 20),
(3, 'Mobile App', 10);

-- Join 3 tables
SELECT 
    e.emp_name,
    d.dept_name,
    p.project_name
FROM employees e
INNER JOIN departments d ON e.dept_id = d.dept_id
INNER JOIN projects p ON d.dept_id = p.dept_id
ORDER BY e.emp_name;

2. Join with WHERE Conditions 🎯

Filter joined results with additional conditions.

SQL

-- Find high-salary employees in specific departments
SELECT 
    e.emp_name,
    d.dept_name,
    e.salary
FROM employees e
INNER JOIN departments d ON e.dept_id = d.dept_id
WHERE e.salary > 70000 
    AND d.location = 'New York';

3. Join with GROUP BY 📊

Aggregate data from joined tables.

SQL

-- Count employees and average salary per department
SELECT 
    d.dept_name,
    COUNT(e.emp_id) AS employee_count,
    AVG(e.salary) AS avg_salary
FROM departments d
LEFT JOIN employees e ON d.dept_id = e.dept_id
GROUP BY d.dept_name
ORDER BY employee_count DESC;

4. Join with Subquery 🧩

Join with derived tables or inline views.

SQL

-- Find employees earning above their department average
SELECT 
    e.emp_name,
    e.salary,
    dept_avg.avg_salary
FROM employees e
INNER JOIN (
    SELECT dept_id, AVG(salary) AS avg_salary
    FROM employees
    GROUP BY dept_id
) dept_avg ON e.dept_id = dept_avg.dept_id
WHERE e.salary > dept_avg.avg_salary;

5. Finding Unmatched Records 🔍

Use LEFT/RIGHT JOIN with NULL checks to find non-matching records.

SQL

-- Find employees WITHOUT a department
SELECT e.emp_name
FROM employees e
LEFT JOIN departments d ON e.dept_id = d.dept_id
WHERE d.dept_id IS NULL;

-- Find departments WITHOUT employees
SELECT d.dept_name
FROM departments d
LEFT JOIN employees e ON d.dept_id = e.dept_id
WHERE e.emp_id IS NULL;

Join Performance Tips

Index join columns: Create indexes on columns used in ON clauses for faster lookups
Use INNER JOIN when possible: It's faster than OUTER JOINs as it processes fewer rows
Filter early: Apply WHERE conditions before joining to reduce data processed
Avoid SELECT *: Only select columns you need to reduce data transfer
Use EXPLAIN: Analyze query execution plan to identify bottlenecks
Join order matters: Start with the smallest table when joining multiple tables

SQL

-- Create indexes on frequently joined columns
CREATE INDEX idx_emp_dept ON employees(dept_id);
CREATE INDEX idx_dept_id ON departments(dept_id);

-- Analyze query performance
EXPLAIN SELECT * 
FROM employees e
INNER JOIN departments d ON e.dept_id = d.dept_id;

Common Interview Questions

Q1: What's the difference between INNER JOIN and OUTER JOIN?

INNER JOIN: Returns only matching records from both tables.
OUTER JOIN: Returns all records from one or both tables, with NULLs for non-matching rows.

Q2: When would you use a SELF JOIN?

Use SELF JOIN when you need to compare rows within the same table, such as:
• Finding employee-manager relationships
• Comparing records with similar attributes (same salary, same city)
• Hierarchical data structures

Q3: What's the difference between JOIN and UNION?

JOIN: Combines columns from multiple tables horizontally (side by side).
UNION: Combines rows from multiple queries vertically (stacked on top).

Q4: Can you have multiple JOIN conditions?

Yes! Use AND/OR in the ON clause:
ON t1.id = t2.id AND t1.status = t2.status

Q5: What's the difference between WHERE and ON in joins?

ON: Specifies join condition (how tables relate).
WHERE: Filters results after the join is performed.
For INNER JOINs, they often produce the same result, but for OUTER JOINs, the difference is significant.

6. Indexing & Query Optimization

Done

What is Database Indexing?

An index is a database structure that improves the speed of data retrieval operations. Think of it like a book's index—instead of reading every page, you can jump directly to the relevant section.

How Indexes Work:

Indexes create a separate data structure (usually B-Tree or Hash) that stores column values and pointers to the actual rows. This allows the database to find data without scanning every row.

SQL

-- Create single-column index
CREATE INDEX idx_employee_name ON employees(name);

-- Create composite index (multiple columns)
CREATE INDEX idx_dept_salary ON employees(dept_id, salary);

-- Create unique index
CREATE UNIQUE INDEX idx_email ON users(email);

-- Drop index
DROP INDEX idx_employee_name ON employees;

✅ Pros of Indexing:

Dramatically faster SELECT queries
Speeds up WHERE, ORDER BY, JOIN
Enforces uniqueness (UNIQUE index)
Improves sorting performance

❌ Cons of Indexing:

Slower INSERT, UPDATE, DELETE
Takes extra storage space
Maintenance overhead
Too many indexes can hurt performance

When to Use Indexes:

Columns frequently used in WHERE clauses
Foreign key columns used in JOINs
Columns used in ORDER BY or GROUP BY
Columns with high cardinality (many unique values)
Don't index: Small tables, columns with low cardinality, frequently updated columns

7. Top 15 SQL Interview Queries 🔥

Done

Master these frequently asked SQL queries to ace your technical interviews. Each includes the problem, solution, and explanation.

1. Find the Second Highest Salary

Problem: Write a query to find the second highest salary from the employees table.

SQL - Method 1

-- Using LIMIT and OFFSET
SELECT DISTINCT salary
FROM employees
ORDER BY salary DESC
LIMIT 1 OFFSET 1;

SQL - Method 2

-- Using Subquery (works for Nth highest)
SELECT MAX(salary) AS second_highest
FROM employees
WHERE salary < (SELECT MAX(salary) FROM employees);

Explanation:

Method 1: DISTINCT removes duplicates, ORDER BY DESC sorts highest first, OFFSET 1 skips the first row.
Method 2: Subquery finds max salary, outer query finds max below that value.

2. Find Nth Highest Salary (Generic Solution)

Problem: Write a function/query to find the Nth highest salary.

SQL

-- Using LIMIT and OFFSET (N = 3 for third highest)
SELECT DISTINCT salary
FROM employees
ORDER BY salary DESC
LIMIT 1 OFFSET 2; -- N-1

-- Using DENSE_RANK (handles ties better)
WITH RankedSalaries AS (
    SELECT 
        salary,
        DENSE_RANK() OVER (ORDER BY salary DESC) AS rank_num
    FROM employees
)
SELECT DISTINCT salary
FROM RankedSalaries
WHERE rank_num = 3; -- N

Why DENSE_RANK?

DENSE_RANK handles duplicate salaries correctly. If two employees have the highest salary, both get rank 1, and the next salary gets rank 2 (not rank 3 like ROW_NUMBER would give).

3. Find Duplicate Records

Problem: Find all duplicate email addresses in the users table.

SQL

-- Find duplicate emails
SELECT email, COUNT(*) AS count
FROM users
GROUP BY email
HAVING COUNT(*) > 1;

-- Show all details of duplicate records
SELECT u.*
FROM users u
INNER JOIN (
    SELECT email
    FROM users
    GROUP BY email
    HAVING COUNT(*) > 1
) duplicates ON u.email = duplicates.email;

Explanation:

GROUP BY groups rows by email, HAVING filters groups with count > 1. The second query joins back to get full record details of duplicates.

4. Delete Duplicate Records (Keep One)

Problem: Delete duplicate rows from a table, keeping only one instance of each duplicate.

SQL

-- Using ROW_NUMBER to identify duplicates
DELETE FROM users
WHERE id IN (
    SELECT id
    FROM (
        SELECT 
            id,
            ROW_NUMBER() OVER (
                PARTITION BY email 
                ORDER BY id
            ) AS row_num
        FROM users
    ) AS temp
    WHERE row_num > 1
);

-- Alternative: Using self-join
DELETE u1
FROM users u1
INNER JOIN users u2
WHERE u1.id > u2.id 
    AND u1.email = u2.email;

Explanation:

ROW_NUMBER assigns sequential numbers within each email group. We keep row_num = 1 (first occurrence) and delete row_num > 1 (duplicates).

5. Employees Earning More Than Their Managers

Problem: Find employees who earn more than their managers.

SQL

-- Assume employees table has: id, name, salary, manager_id
SELECT 
    e.name AS employee_name,
    e.salary AS employee_salary,
    m.name AS manager_name,
    m.salary AS manager_salary
FROM employees e
INNER JOIN employees m 
    ON e.manager_id = m.id
WHERE e.salary > m.salary;

Explanation:

Self-join the employees table where e represents employees and m represents managers. The join condition links employee's manager_id to manager's id.

6. Department with Highest Average Salary

Problem: Find the department with the highest average salary.

SQL

-- Method 1: Using LIMIT
SELECT 
    d.dept_name,
    AVG(e.salary) AS avg_salary
FROM departments d
INNER JOIN employees e ON d.dept_id = e.dept_id
GROUP BY d.dept_name
ORDER BY avg_salary DESC
LIMIT 1;

-- Method 2: Using subquery with MAX
SELECT 
    dept_name,
    avg_salary
FROM (
    SELECT 
        d.dept_name,
        AVG(e.salary) AS avg_salary
    FROM departments d
    INNER JOIN employees e ON d.dept_id = e.dept_id
    GROUP BY d.dept_name
) AS dept_averages
WHERE avg_salary = (
    SELECT MAX(avg_salary) 
    FROM (
        SELECT AVG(salary) AS avg_salary
        FROM employees
        GROUP BY dept_id
    ) AS temp
);

Explanation:

Calculate average salary per department using AVG() and GROUP BY. Method 1 is simpler, Method 2 handles ties (multiple departments with same highest average).

7. Calculate Cumulative Sum (Running Total)

Problem: Calculate a running total of sales by date.

SQL

-- Using Window Function (Modern SQL)
SELECT 
    sale_date,
    amount,
    SUM(amount) OVER (
        ORDER BY sale_date
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ) AS cumulative_sum
FROM sales
ORDER BY sale_date;

-- Simplified version
SELECT 
    sale_date,
    amount,
    SUM(amount) OVER (ORDER BY sale_date) AS cumulative_sum
FROM sales;

Explanation:

Window functions (OVER clause) calculate running totals without GROUP BY. Each row sees all previous rows up to the current row when ordered by sale_date.

8. Find Missing Numbers in Sequence

Problem: Find missing IDs in a sequence (e.g., 1,2,4,5,7 - missing 3 and 6).

SQL

-- Using LEAD to find gaps
WITH numbered AS (
    SELECT 
        id,
        LEAD(id) OVER (ORDER BY id) AS next_id
    FROM employees
)
SELECT 
    id + 1 AS missing_id_start,
    next_id - 1 AS missing_id_end
FROM numbered
WHERE next_id - id > 1;

-- Alternative: Generate sequence and find missing
WITH RECURSIVE seq AS (
    SELECT 1 AS num
    UNION ALL
    SELECT num + 1 
    FROM seq 
    WHERE num < (SELECT MAX(id) FROM employees)
)
SELECT seq.num AS missing_id
FROM seq
LEFT JOIN employees e ON seq.num = e.id
WHERE e.id IS NULL;

Explanation:

LEAD() gets the next row's value. If next_id - current_id > 1, there's a gap. The recursive CTE generates all numbers from 1 to MAX(id) and finds which don't exist.

9. Rank Students by Score (Handle Ties)

Problem: Rank students by their scores, handling ties appropriately.

SQL

-- ROW_NUMBER: 1, 2, 3, 4 (no ties)
SELECT 
    student_name,
    score,
    ROW_NUMBER() OVER (ORDER BY score DESC) AS row_num
FROM students;

-- RANK: 1, 2, 2, 4 (ties share rank, next rank skips)
SELECT 
    student_name,
    score,
    RANK() OVER (ORDER BY score DESC) AS rank_num
FROM students;

-- DENSE_RANK: 1, 2, 2, 3 (ties share rank, next rank consecutive)
SELECT 
    student_name,
    score,
    DENSE_RANK() OVER (ORDER BY score DESC) AS dense_rank
FROM students;

Comparison:

ROW_NUMBER: Unique rank for each row (1,2,3,4)
RANK: Ties get same rank, next rank skips (1,2,2,4)
DENSE_RANK: Ties get same rank, next rank is consecutive (1,2,2,3)

10. Find Pairs of Employees in Same City

Problem: Find all pairs of employees who live in the same city.

SQL

-- Self-join to find pairs
SELECT 
    e1.name AS employee1,
    e2.name AS employee2,
    e1.city
FROM employees e1
INNER JOIN employees e2 
    ON e1.city = e2.city 
    AND e1.id < e2.id  -- Avoid duplicate pairs and self-pairing
ORDER BY e1.city, e1.name;

Explanation:

The condition e1.id < e2.id ensures we get each pair only once (prevents both "Alice-Bob" and "Bob-Alice") and prevents matching an employee with themselves.

11. Top 3 Highest Paid Employees per Department

Problem: Find the top 3 highest-paid employees in each department.

SQL

-- Using ROW_NUMBER with PARTITION BY
WITH RankedEmployees AS (
    SELECT 
        dept_id,
        name,
        salary,
        ROW_NUMBER() OVER (
            PARTITION BY dept_id 
            ORDER BY salary DESC
        ) AS rank_in_dept
    FROM employees
)
SELECT 
    d.dept_name,
    r.name,
    r.salary,
    r.rank_in_dept
FROM RankedEmployees r
JOIN departments d ON r.dept_id = d.dept_id
WHERE r.rank_in_dept <= 3
ORDER BY d.dept_name, r.rank_in_dept;

Explanation:

PARTITION BY dept_id creates separate ranking windows for each department. ROW_NUMBER ranks employees within their department, then we filter for top 3.

12. Pivot Table - Convert Rows to Columns

Problem: Transform sales data from rows to columns (pivot by month).

SQL

-- Original data: product, month, sales
-- Goal: product, Jan_sales, Feb_sales, Mar_sales

SELECT 
    product,
    SUM(CASE WHEN month = 'January' THEN sales ELSE 0 END) AS Jan_sales,
    SUM(CASE WHEN month = 'February' THEN sales ELSE 0 END) AS Feb_sales,
    SUM(CASE WHEN month = 'March' THEN sales ELSE 0 END) AS Mar_sales
FROM sales
GROUP BY product;

-- Using PIVOT (SQL Server syntax)
SELECT *
FROM sales
PIVOT (
    SUM(sales)
    FOR month IN ([January], [February], [March])
) AS pivot_table;

Explanation:

CASE WHEN creates conditional columns. Each CASE statement checks if the month matches and includes the sales value. GROUP BY aggregates across all months for each product.

13. Find Users Who Logged In for 3+ Consecutive Days

Problem: Identify users with consecutive login streaks of 3+ days.

SQL

-- Using ROW_NUMBER to detect consecutive sequences
WITH LoginGroups AS (
    SELECT 
        user_id,
        login_date,
        DATE_SUB(login_date, INTERVAL 
            ROW_NUMBER() OVER (
                PARTITION BY user_id 
                ORDER BY login_date
            ) DAY
        ) AS group_date
    FROM logins
),
ConsecutiveCounts AS (
    SELECT 
        user_id,
        group_date,
        COUNT(*) AS consecutive_days
    FROM LoginGroups
    GROUP BY user_id, group_date
    HAVING COUNT(*) >= 3
)
SELECT DISTINCT user_id
FROM ConsecutiveCounts;

Explanation:

Subtracting ROW_NUMBER from date creates groups where consecutive dates share the same group_date value. Count records per group to find streaks of 3+ days.

14. Calculate 7-Day Moving Average

Problem: Calculate a 7-day moving average of daily sales.

SQL

-- Using window frame
SELECT 
    sale_date,
    daily_sales,
    AVG(daily_sales) OVER (
        ORDER BY sale_date
        ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
    ) AS moving_avg_7days
FROM daily_sales
ORDER BY sale_date;

-- Alternative: Using date range
SELECT 
    sale_date,
    daily_sales,
    AVG(daily_sales) OVER (
        ORDER BY sale_date
        RANGE BETWEEN INTERVAL 6 DAY PRECEDING AND CURRENT ROW
    ) AS moving_avg_7days
FROM daily_sales;

Explanation:

Window frame "6 PRECEDING AND CURRENT ROW" creates a sliding window of 7 rows (current + 6 previous). AVG calculates the average within this window for each row.

15. Find Customers Who Never Placed an Order

Problem: Identify customers who have registered but never placed an order.

SQL - Method 1

-- Using LEFT JOIN and NULL check
SELECT 
    c.customer_id,
    c.customer_name,
    c.email
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id
WHERE o.order_id IS NULL;

SQL - Method 2

-- Using NOT IN subquery
SELECT 
    customer_id,
    customer_name,
    email
FROM customers
WHERE customer_id NOT IN (
    SELECT DISTINCT customer_id 
    FROM orders
    WHERE customer_id IS NOT NULL
);

SQL - Method 3

-- Using NOT EXISTS (Most efficient)
SELECT 
    customer_id,
    customer_name,
    email
FROM customers c
WHERE NOT EXISTS (
    SELECT 1 
    FROM orders o 
    WHERE o.customer_id = c.customer_id
);

Which Method to Use?

LEFT JOIN + NULL: Simple and readable
NOT IN: Easy to understand but watch for NULLs (use WHERE IS NOT NULL)
NOT EXISTS: Usually most efficient for large datasets, stops as soon as match is found

Interview Tips for SQL Queries:

Clarify requirements: Ask about NULL handling, duplicate treatment, and performance expectations
Start simple: Write a basic solution first, then optimize
Explain your approach: Talk through your thought process as you write
Consider edge cases: Empty tables, NULL values, duplicates, single-row tables
Know time complexity: Understand which operations are expensive (full table scans, subqueries)
Test mentally: Walk through sample data to verify your logic
Optimize when asked: Discuss indexes, query execution plans, alternative approaches