What is a DBMS?

A Database Management System (DBMS) is software that enables users to create, read, update, and delete data in a database efficiently. It acts as an interface between the database and end-users or application programs.

Key Functions of DBMS:
  • Data Definition: Define database structure (CREATE, ALTER, DROP)
  • Data Manipulation: Insert, update, delete, retrieve data (CRUD operations)
  • Data Security: User authentication, authorization, access control
  • Data Integrity: Maintain accuracy and consistency through constraints
  • Concurrency Control: Handle simultaneous data access by multiple users
  • Backup & Recovery: Protect against data loss and system failures
  • Data Independence: Separate physical storage from logical view

Popular DBMS: MySQL, PostgreSQL, Oracle, MongoDB, SQL Server, SQLite, Cassandra, Redis

SQL vs NoSQL: When to use each?
πŸ“Š SQL (Relational)
  • Structure: Tables with fixed schema (rows & columns)
  • Query Language: Structured Query Language (SQL)
  • Scalability: Vertical (scale up - add more CPU/RAM)
  • ACID: Strong consistency, transactions
  • Relationships: Foreign keys, JOINs
  • Best for: Complex queries, financial data, e-commerce
  • Examples: MySQL, PostgreSQL, Oracle, MS SQL
πŸš€ NoSQL (Non-Relational)
  • Structure: Flexible schema (JSON, key-value, graph)
  • Query Language: Database-specific APIs
  • Scalability: Horizontal (scale out - add more servers)
  • BASE: Eventual consistency, high availability
  • Relationships: Embedded documents, no JOINs
  • Best for: Big data, real-time apps, IoT, social media
  • Examples: MongoDB, Cassandra, Redis, DynamoDB
Decision Guide:

Choose SQL when:

  • You need complex joins and relationships between tables
  • Data integrity and ACID transactions are critical (banking, payments)
  • Your schema is well-defined and stable
  • You need strong consistency

Choose NoSQL when:

  • You need to scale horizontally across multiple servers
  • Your schema is flexible or evolves frequently
  • You prioritize speed and availability over strict consistency
  • You're handling huge volumes of unstructured data
Primary Key, Foreign Key, Candidate Key, Super Key
1. Primary Key πŸ”‘

Uniquely identifies each record in a table. Cannot be NULL. Only ONE per table.

SQL
CREATE TABLE students (
    student_id INT PRIMARY KEY,
    name VARCHAR(100),
    email VARCHAR(100)
);
2. Foreign Key πŸ”—

Links two tables together. References the PRIMARY KEY of another table. Ensures referential integrity.

SQL
CREATE TABLE enrollments (
    enrollment_id INT PRIMARY KEY,
    student_id INT,
    course_id INT,
    FOREIGN KEY (student_id) REFERENCES students(student_id)
);

Benefit: If you try to delete a student who has enrollments, the database will prevent it (maintaining data integrity).

3. Candidate Key 🎯

Any column (or combination) that can uniquely identify a record. There can be multiple candidate keys. The Primary Key is chosen from candidate keys.

Example: In a Students table:

  • student_id β†’ Candidate Key (unique)
  • email β†’ Candidate Key (unique)
  • phone β†’ Candidate Key (if unique)

We choose student_id as PRIMARY KEY from these candidates.

4. Super Key πŸ”

Any combination of columns that uniquely identifies a record. A Candidate Key is a minimal Super Key (no unnecessary columns).

Example Super Keys:

  • {student_id} β†’ Minimal βœ…
  • {email} β†’ Minimal βœ…
  • {student_id, name} β†’ Has redundancy ❌
  • {student_id, name, email} β†’ Has redundancy ❌
Schema vs Instance in a Database
Schema (Blueprint) πŸ“

The logical design/structure of the database. Defines table names, column names, data types, constraints. Rarely changes.

SQL
CREATE TABLE users (
    id INT,
    name VARCHAR(50),
    age INT
);
Instance (Data) πŸ“Š

The actual data stored at a particular moment. Changes frequently as records are inserted/updated/deleted.

Data
(1, 'Alice', 25)
(2, 'Bob', 30)
(3, 'Charlie', 28)
Perfect Analogy:

Schema = House Blueprint (structure, layout)
Instance = Furniture Inside (actual content, changes often)

What are Constraints?

Constraints are rules applied to table columns to ensure data accuracy, consistency, and integrity. They prevent invalid data from being entered.

SQL
CREATE TABLE employees (
    id INT PRIMARY KEY,                    -- NOT NULL + UNIQUE
    name VARCHAR(100) NOT NULL,          -- Must have value
    email VARCHAR(100) UNIQUE,          -- No duplicates
    age INT CHECK (age >= 18),              -- Must be 18+
    salary DECIMAL DEFAULT 50000,         -- Default if not provided
    dept_id INT,
    FOREIGN KEY (dept_id) REFERENCES departments(id)
);
Types of Constraints:
  • NOT NULL: Column cannot have NULL value
  • UNIQUE: All values in column must be different
  • PRIMARY KEY: NOT NULL + UNIQUE combined (one per table)
  • FOREIGN KEY: Links tables, ensures referential integrity
  • CHECK: Ensures values meet a specific condition
  • DEFAULT: Sets default value if none is provided
What is Normalization? Explain 1NF, 2NF, 3NF, BCNF

Normalization is the process of organizing data to reduce redundancy and improve data integrity. It involves dividing large tables into smaller, related tables.

1. First Normal Form (1NF) βœ…

Rules:

  • Each column contains atomic (indivisible) values
  • No repeating groups or arrays
  • Each column has a unique name
  • Order of rows doesn't matter
Example
-- ❌ NOT in 1NF (multiple values in one cell)
orders
order_id | customer | products
1        | Alice    | "Laptop, Mouse, Keyboard"

-- βœ… In 1NF (atomic values)
orders
order_id | customer | product
1        | Alice    | Laptop
1        | Alice    | Mouse
1        | Alice    | Keyboard
2. Second Normal Form (2NF) βœ…

Rules:

  • Must be in 1NF
  • No partial dependencies (all non-key attributes must depend on the ENTIRE primary key)
Example
-- ❌ NOT in 2NF (course_name depends only on course_id, not full key)
enrollments
student_id | course_id | course_name  | grade
1          | 101       | Math         | A
2          | 101       | Math         | B

-- βœ… In 2NF (split into two tables)
enrollments
student_id | course_id | grade
1          | 101       | A
2          | 101       | B

courses
course_id | course_name
101       | Math
3. Third Normal Form (3NF) βœ…

Rules:

  • Must be in 2NF
  • No transitive dependencies (non-key attributes should not depend on other non-key attributes)
Example
-- ❌ NOT in 3NF (dept_location depends on dept_id, not employee_id)
employees
employee_id | name  | dept_id | dept_location
1           | Alice | 10      | NY
2           | Bob   | 10      | NY

-- βœ… In 3NF (split department info)
employees
employee_id | name  | dept_id
1           | Alice | 10
2           | Bob   | 10

departments
dept_id | dept_location
10      | NY
4. Boyce-Codd Normal Form (BCNF) βœ…

Rules:

  • Must be in 3NF
  • For every functional dependency A β†’ B, A must be a super key
  • BCNF is stricter than 3NF

When 3NF β‰  BCNF: When there are multiple candidate keys that overlap.

Benefits of Normalization:
  • Eliminates data redundancy
  • Ensures data consistency
  • Easier to maintain and update
  • Reduces storage space
  • Prevents insertion, update, and deletion anomalies
What is Denormalization? When is it preferred?

Denormalization is the process of intentionally introducing redundancy by combining normalized tables. It's the opposite of normalization.

Why Denormalize? πŸš€
  • Improve read performance (fewer JOINs)
  • Reduce complex queries
  • Optimize for read-heavy applications
  • Better for analytics and reporting
Trade-offs ⚠️
  • Data redundancy (uses more storage)
  • Slower writes (must update multiple places)
  • Risk of data inconsistency
  • More complex to maintain
Example
-- Normalized (3NF) - Requires JOIN
orders: order_id, customer_id, total
customers: customer_id, name, email

SELECT o.order_id, c.name, o.total
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id;

-- Denormalized - No JOIN needed (faster reads)
orders: order_id, customer_id, customer_name, total

SELECT order_id, customer_name, total
FROM orders;
When to Denormalize:

βœ… Read-heavy applications (analytics, dashboards)
βœ… When query performance is critical
βœ… Data warehouses and OLAP systems
βœ… Caching frequently accessed data
❌ Avoid in transactional systems (OLTP) where data consistency is critical

How would you design a database for YouTube/LicenseLane?

This tests your ability to design a real-world database schema. Let's design for YouTube as an example:

πŸ“‹ Requirements Analysis:
  • Users can upload videos
  • Users can like, comment, subscribe
  • Videos have views, likes, tags
  • Users have channels, playlists
SQL Schema
-- Users Table
CREATE TABLE users (
    user_id INT PRIMARY KEY,
    username VARCHAR(50) UNIQUE,
    email VARCHAR(100) UNIQUE,
    password_hash VARCHAR(255),
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Videos Table
CREATE TABLE videos (
    video_id INT PRIMARY KEY,
    user_id INT,
    title VARCHAR(200),
    description TEXT,
    video_url VARCHAR(500),
    thumbnail_url VARCHAR(500),
    views INT DEFAULT 0,
    likes INT DEFAULT 0,
    dislikes INT DEFAULT 0,
    uploaded_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    FOREIGN KEY (user_id) REFERENCES users(user_id)
);

-- Comments Table
CREATE TABLE comments (
    comment_id INT PRIMARY KEY,
    video_id INT,
    user_id INT,
    comment_text TEXT,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    FOREIGN KEY (video_id) REFERENCES videos(video_id),
    FOREIGN KEY (user_id) REFERENCES users(user_id)
);

-- Subscriptions Table (Many-to-Many)
CREATE TABLE subscriptions (
    subscriber_id INT,
    channel_id INT,
    subscribed_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (subscriber_id, channel_id),
    FOREIGN KEY (subscriber_id) REFERENCES users(user_id),
    FOREIGN KEY (channel_id) REFERENCES users(user_id)
);

-- Tags Table
CREATE TABLE tags (
    tag_id INT PRIMARY KEY,
    tag_name VARCHAR(50) UNIQUE
);

-- Video_Tags Junction Table (Many-to-Many)
CREATE TABLE video_tags (
    video_id INT,
    tag_id INT,
    PRIMARY KEY (video_id, tag_id),
    FOREIGN KEY (video_id) REFERENCES videos(video_id),
    FOREIGN KEY (tag_id) REFERENCES tags(tag_id)
);
Key Design Decisions:
  • Normalization: 3NF to avoid redundancy
  • Junction Tables: For many-to-many relationships (video_tags, subscriptions)
  • Indexes: On user_id, video_id for faster queries
  • Timestamps: Track when records were created
  • Scalability: Can add sharding, caching, CDN for videos later
What are ACID Properties? Why are they important?

ACID ensures reliable transaction processing in databases. Critical for maintaining data integrity in financial systems, e-commerce, banking, etc.

Atomicity (All or Nothing)

Transaction is all-or-nothing. Either ALL operations succeed, or NONE do. No partial completion.

Example: Money transfer: Debit from Account A AND credit to Account Bβ€”both must happen or neither.

Consistency (Valid State)

Database moves from one valid state to another. All rules, constraints, and triggers are maintained.

Example: Total money in the system stays constant after transfer. No money is created or destroyed.

Isolation (No Interference)

Concurrent transactions don't interfere. Each transaction appears to execute in isolation from others.

Example: Two people booking the same flight seatβ€”only one succeeds. Prevents race conditions.

Durability (Permanent)

Once committed, changes are permanentβ€”even if system crashes immediately after. Data is written to non-volatile storage.

Example: Confirmed payment survives server restart or power failure.

SQL Transaction
-- Transaction Example: Bank Transfer
BEGIN TRANSACTION;

-- Step 1: Debit $500 from Account A
UPDATE accounts 
SET balance = balance - 500 
WHERE account_id = 123;

-- Step 2: Credit $500 to Account B
UPDATE accounts 
SET balance = balance + 500 
WHERE account_id = 456;

-- If BOTH succeed β†’ COMMIT
COMMIT;

-- If ANY fails β†’ ROLLBACK (undo all changes)
-- ROLLBACK;
Why ACID is Critical:
  • Data Integrity: Prevents corruption and inconsistencies
  • Reliability: Guarantees correct behavior even during failures
  • Concurrency: Multiple users can safely access database simultaneously
  • Trust: Essential for financial systems, healthcare, e-commerce
What is Sharding and Replication in Databases?
Sharding (Horizontal Partitioning) πŸ”€

Splitting a large database into smaller, faster pieces (shards) distributed across multiple servers. Each shard contains a subset of the data.

How it works:

  • Range-based: Users 1-1000 β†’ Server1, 1001-2000 β†’ Server2
  • Hash-based: Hash(user_id) % num_servers
  • Geography: US users β†’ US server, EU β†’ EU server

Benefits: Horizontal scalability, faster queries
Challenges: Complex joins across shards, data rebalancing

Replication (Data Copies) πŸ”„

Creating multiple copies of the same data across different servers for redundancy and availability.

Types:

  • Master-Slave: One master (writes), multiple slaves (reads)
  • Master-Master: All nodes can read/write
  • Peer-to-Peer: All nodes equal

Benefits: High availability, load balancing, disaster recovery
Challenges: Data consistency, replication lag

Key Difference:

Sharding: Splits data (each server has different data)
Replication: Copies data (each server has same data)

CAP Theorem: Consistency, Availability, Partition Tolerance

The CAP Theorem states that in a distributed database system, you can only achieve TWO out of THREE properties simultaneously.

Consistency (C)

All nodes see the same data at the same time. Every read gets the most recent write.

Availability (A)

Every request receives a response (success or failure). System is always operational.

Partition Tolerance (P)

System continues to operate despite network partitions/failures between nodes.

You can only pick TWO:
  • CP (Consistency + Partition Tolerance): MongoDB, HBase, Redis
    Sacrifice availabilityβ€”system may reject requests to maintain consistency during partition
  • AP (Availability + Partition Tolerance): Cassandra, DynamoDB, CouchDB
    Sacrifice consistencyβ€”all nodes respond but may return stale data
  • CA (Consistency + Availability): Traditional RDBMS (MySQL, PostgreSQL)
    Not partition-tolerantβ€”single-node or tightly-coupled systems
Interview Tip:

In reality, Partition Tolerance is mandatory for distributed systems (network failures will happen). So the real choice is between CP vs AP.

Difference between OLTP and OLAP Systems
OLTP (Online Transaction Processing) πŸ’³

Purpose: Day-to-day transactional operations

  • Operations: INSERT, UPDATE, DELETE (write-heavy)
  • Queries: Simple, frequent, affecting few records
  • Response Time: Milliseconds
  • Data: Current, detailed transactions
  • Normalization: Highly normalized (3NF)
  • Users: Many concurrent users

Examples: E-commerce checkout, ATM withdrawals, airline reservations, order processing

OLAP (Online Analytical Processing) πŸ“Š

Purpose: Data analysis and business intelligence

  • Operations: SELECT (read-heavy), complex aggregations
  • Queries: Complex, infrequent, scanning millions of rows
  • Response Time: Seconds to minutes
  • Data: Historical, aggregated, multi-dimensional
  • Normalization: Denormalized (star/snowflake schema)
  • Users: Few analysts/data scientists

Examples: Sales reports, trend analysis, forecasting, data warehouses

Example Queries
-- OLTP Query (simple, fast)
INSERT INTO orders (customer_id, product_id, quantity) 
VALUES (123, 456, 2);

-- OLAP Query (complex, analytical)
SELECT 
    product_category,
    YEAR(order_date) AS year,
    SUM(revenue) AS total_revenue,
    AVG(revenue) AS avg_revenue
FROM sales_fact
JOIN products ON sales_fact.product_id = products.id
WHERE order_date >= '2020-01-01'
GROUP BY product_category, year
ORDER BY total_revenue DESC;
What are Stored Procedures, Triggers, and Views?
1. Stored Procedures πŸ“

Precompiled SQL code stored in the database that can be executed multiple times. Like a function in programming.

SQL
DELIMITER //
CREATE PROCEDURE GetEmployeesByDept(IN dept_id INT)
BEGIN
    SELECT * FROM employees 
    WHERE department_id = dept_id;
END//
DELIMITER ;

-- Call the procedure
CALL GetEmployeesByDept(10);

Benefits: Reusability, security, performance (precompiled), network efficiency

2. Triggers ⚑

Automatic actions executed before or after INSERT, UPDATE, or DELETE operations. Like event listeners.

SQL
-- Auto-update 'updated_at' timestamp
CREATE TRIGGER before_employee_update
BEFORE UPDATE ON employees
FOR EACH ROW
BEGIN
    SET NEW.updated_at = NOW();
END;

-- Maintain audit log
CREATE TRIGGER after_order_insert
AFTER INSERT ON orders
FOR EACH ROW
BEGIN
    INSERT INTO audit_log (action, order_id, timestamp)
    VALUES ('INSERT', NEW.order_id, NOW());
END;

Use Cases: Audit logs, data validation, cascading updates, maintaining derived data

3. Views πŸ‘οΈ

Virtual tables based on a SELECT query. Don't store dataβ€”just save the query definition.

SQL
-- Create a view
CREATE VIEW high_salary_employees AS
SELECT name, salary, department
FROM employees
WHERE salary > 100000;

-- Query the view like a table
SELECT * FROM high_salary_employees;

Benefits: Simplify complex queries, security (hide columns), logical data independence

What are SQL Joins?

SQL Joins are used to combine rows from two or more tables based on a related column between them. They allow you to retrieve data from multiple tables in a single query.

Key Concept:

Joins work by matching values in specified columns (usually foreign keys) to combine related data from different tables.

Sample Tables for Examples
SQL
-- Create tables
CREATE TABLE employees (
    emp_id INT PRIMARY KEY,
    emp_name VARCHAR(50),
    dept_id INT,
    salary DECIMAL(10,2)
);

CREATE TABLE departments (
    dept_id INT PRIMARY KEY,
    dept_name VARCHAR(50),
    location VARCHAR(50)
);

-- Insert sample data
INSERT INTO employees VALUES 
(1, 'Alice', 10, 75000),
(2, 'Bob', 20, 65000),
(3, 'Charlie', 10, 80000),
(4, 'David', 30, 70000),
(5, 'Eve', NULL, 55000);

INSERT INTO departments VALUES 
(10, 'Engineering', 'New York'),
(20, 'Marketing', 'Boston'),
(40, 'Finance', 'Chicago');
Types of SQL Joins
1. INNER JOIN ∩

Returns only the matching records from both tables.

SQL
SELECT 
    e.emp_name,
    e.salary,
    d.dept_name,
    d.location
FROM employees e
INNER JOIN departments d 
    ON e.dept_id = d.dept_id;

Result: Alice, Bob, Charlie (matched)
Excluded: David (dept 30 not in departments), Eve (NULL dept_id), Finance dept (no employees)

2. LEFT JOIN (LEFT OUTER JOIN) βŠƒ

Returns ALL records from left table + matching from right table.

SQL
SELECT 
    e.emp_name,
    e.salary,
    d.dept_name,
    d.location
FROM employees e
LEFT JOIN departments d 
    ON e.dept_id = d.dept_id;

Result: All 5 employees
NULL values: David and Eve get NULL for dept_name and location
Excluded: Finance dept (no employees)

3. RIGHT JOIN (RIGHT OUTER JOIN) βŠ‚

Returns ALL records from right table + matching from left table.

SQL
SELECT 
    e.emp_name,
    e.salary,
    d.dept_name,
    d.location
FROM employees e
RIGHT JOIN departments d 
    ON e.dept_id = d.dept_id;

Result: All 3 departments
NULL values: Finance gets NULL for emp_name and salary
Excluded: David and Eve

4. FULL OUTER JOIN βˆͺ

Returns ALL records from both tables (matched + unmatched).

SQL
-- MySQL workaround using UNION
SELECT * FROM employees e
LEFT JOIN departments d 
    ON e.dept_id = d.dept_id
UNION
SELECT * FROM employees e
RIGHT JOIN departments d 
    ON e.dept_id = d.dept_id;

Result: All employees + all departments
Note: MySQL doesn't support FULL OUTER JOIN directly

5. CROSS JOIN Γ—

Returns Cartesian product - all possible combinations of rows.

SQL
SELECT 
    e.emp_name,
    d.dept_name
FROM employees e
CROSS JOIN departments d;

-- Alternative syntax
SELECT e.emp_name, d.dept_name
FROM employees e, departments d;

Result: 5 employees Γ— 3 departments = 15 rows
Use Case: Generate all possible combinations, schedules

6. SELF JOIN ↻

Table joins with itself to compare rows within the same table.

SQL
-- Find employees with same salary
SELECT 
    e1.emp_name AS employee1,
    e2.emp_name AS employee2,
    e1.salary
FROM employees e1
INNER JOIN employees e2 
    ON e1.salary = e2.salary 
    AND e1.emp_id < e2.emp_id;

Use Case: Employee hierarchy, comparing records within same table

Quick Reference:

INNER JOIN: Only matches (∩)
LEFT JOIN: All from left + matches (βŠƒ)
RIGHT JOIN: All from right + matches (βŠ‚)
FULL OUTER: All from both (βˆͺ)
CROSS JOIN: All combinations (Γ—)
SELF JOIN: Table joins itself (↻)

Advanced Join Techniques
1. Multiple Table Joins πŸ”—

Join more than two tables in a single query.

SQL
-- Create projects table
CREATE TABLE projects (
    project_id INT PRIMARY KEY,
    project_name VARCHAR(50),
    dept_id INT
);

INSERT INTO projects VALUES 
(1, 'Website Redesign', 10),
(2, 'Marketing Campaign', 20),
(3, 'Mobile App', 10);

-- Join 3 tables
SELECT 
    e.emp_name,
    d.dept_name,
    p.project_name
FROM employees e
INNER JOIN departments d ON e.dept_id = d.dept_id
INNER JOIN projects p ON d.dept_id = p.dept_id
ORDER BY e.emp_name;
2. Join with WHERE Conditions 🎯

Filter joined results with additional conditions.

SQL
-- Find high-salary employees in specific departments
SELECT 
    e.emp_name,
    d.dept_name,
    e.salary
FROM employees e
INNER JOIN departments d ON e.dept_id = d.dept_id
WHERE e.salary > 70000 
    AND d.location = 'New York';
3. Join with GROUP BY πŸ“Š

Aggregate data from joined tables.

SQL
-- Count employees and average salary per department
SELECT 
    d.dept_name,
    COUNT(e.emp_id) AS employee_count,
    AVG(e.salary) AS avg_salary
FROM departments d
LEFT JOIN employees e ON d.dept_id = e.dept_id
GROUP BY d.dept_name
ORDER BY employee_count DESC;
4. Join with Subquery 🧩

Join with derived tables or inline views.

SQL
-- Find employees earning above their department average
SELECT 
    e.emp_name,
    e.salary,
    dept_avg.avg_salary
FROM employees e
INNER JOIN (
    SELECT dept_id, AVG(salary) AS avg_salary
    FROM employees
    GROUP BY dept_id
) dept_avg ON e.dept_id = dept_avg.dept_id
WHERE e.salary > dept_avg.avg_salary;
5. Finding Unmatched Records πŸ”

Use LEFT/RIGHT JOIN with NULL checks to find non-matching records.

SQL
-- Find employees WITHOUT a department
SELECT e.emp_name
FROM employees e
LEFT JOIN departments d ON e.dept_id = d.dept_id
WHERE d.dept_id IS NULL;

-- Find departments WITHOUT employees
SELECT d.dept_name
FROM departments d
LEFT JOIN employees e ON d.dept_id = e.dept_id
WHERE e.emp_id IS NULL;
Join Performance Tips
  • Index join columns: Create indexes on columns used in ON clauses for faster lookups
  • Use INNER JOIN when possible: It's faster than OUTER JOINs as it processes fewer rows
  • Filter early: Apply WHERE conditions before joining to reduce data processed
  • Avoid SELECT *: Only select columns you need to reduce data transfer
  • Use EXPLAIN: Analyze query execution plan to identify bottlenecks
  • Join order matters: Start with the smallest table when joining multiple tables
SQL
-- Create indexes on frequently joined columns
CREATE INDEX idx_emp_dept ON employees(dept_id);
CREATE INDEX idx_dept_id ON departments(dept_id);

-- Analyze query performance
EXPLAIN SELECT * 
FROM employees e
INNER JOIN departments d ON e.dept_id = d.dept_id;
Common Interview Questions
Q1: What's the difference between INNER JOIN and OUTER JOIN?

INNER JOIN: Returns only matching records from both tables.
OUTER JOIN: Returns all records from one or both tables, with NULLs for non-matching rows.

Q2: When would you use a SELF JOIN?

Use SELF JOIN when you need to compare rows within the same table, such as:
β€’ Finding employee-manager relationships
β€’ Comparing records with similar attributes (same salary, same city)
β€’ Hierarchical data structures

Q3: What's the difference between JOIN and UNION?

JOIN: Combines columns from multiple tables horizontally (side by side).
UNION: Combines rows from multiple queries vertically (stacked on top).

Q4: Can you have multiple JOIN conditions?

Yes! Use AND/OR in the ON clause:
ON t1.id = t2.id AND t1.status = t2.status

Q5: What's the difference between WHERE and ON in joins?

ON: Specifies join condition (how tables relate).
WHERE: Filters results after the join is performed.
For INNER JOINs, they often produce the same result, but for OUTER JOINs, the difference is significant.

What is Database Indexing?

An index is a database structure that improves the speed of data retrieval operations. Think of it like a book's indexβ€”instead of reading every page, you can jump directly to the relevant section.

How Indexes Work:

Indexes create a separate data structure (usually B-Tree or Hash) that stores column values and pointers to the actual rows. This allows the database to find data without scanning every row.

SQL
-- Create single-column index
CREATE INDEX idx_employee_name ON employees(name);

-- Create composite index (multiple columns)
CREATE INDEX idx_dept_salary ON employees(dept_id, salary);

-- Create unique index
CREATE UNIQUE INDEX idx_email ON users(email);

-- Drop index
DROP INDEX idx_employee_name ON employees;
βœ… Pros of Indexing:
  • Dramatically faster SELECT queries
  • Speeds up WHERE, ORDER BY, JOIN
  • Enforces uniqueness (UNIQUE index)
  • Improves sorting performance
❌ Cons of Indexing:
  • Slower INSERT, UPDATE, DELETE
  • Takes extra storage space
  • Maintenance overhead
  • Too many indexes can hurt performance
When to Use Indexes:
  • Columns frequently used in WHERE clauses
  • Foreign key columns used in JOINs
  • Columns used in ORDER BY or GROUP BY
  • Columns with high cardinality (many unique values)
  • Don't index: Small tables, columns with low cardinality, frequently updated columns

Master these frequently asked SQL queries to ace your technical interviews. Each includes the problem, solution, and explanation.

Problem: Write a query to find the second highest salary from the employees table.

SQL - Method 1
-- Using LIMIT and OFFSET
SELECT DISTINCT salary
FROM employees
ORDER BY salary DESC
LIMIT 1 OFFSET 1;
SQL - Method 2
-- Using Subquery (works for Nth highest)
SELECT MAX(salary) AS second_highest
FROM employees
WHERE salary < (SELECT MAX(salary) FROM employees);
Explanation:

Method 1: DISTINCT removes duplicates, ORDER BY DESC sorts highest first, OFFSET 1 skips the first row.
Method 2: Subquery finds max salary, outer query finds max below that value.

Problem: Write a function/query to find the Nth highest salary.

SQL
-- Using LIMIT and OFFSET (N = 3 for third highest)
SELECT DISTINCT salary
FROM employees
ORDER BY salary DESC
LIMIT 1 OFFSET 2; -- N-1

-- Using DENSE_RANK (handles ties better)
WITH RankedSalaries AS (
    SELECT 
        salary,
        DENSE_RANK() OVER (ORDER BY salary DESC) AS rank_num
    FROM employees
)
SELECT DISTINCT salary
FROM RankedSalaries
WHERE rank_num = 3; -- N
Why DENSE_RANK?

DENSE_RANK handles duplicate salaries correctly. If two employees have the highest salary, both get rank 1, and the next salary gets rank 2 (not rank 3 like ROW_NUMBER would give).

Problem: Find all duplicate email addresses in the users table.

SQL
-- Find duplicate emails
SELECT email, COUNT(*) AS count
FROM users
GROUP BY email
HAVING COUNT(*) > 1;

-- Show all details of duplicate records
SELECT u.*
FROM users u
INNER JOIN (
    SELECT email
    FROM users
    GROUP BY email
    HAVING COUNT(*) > 1
) duplicates ON u.email = duplicates.email;
Explanation:

GROUP BY groups rows by email, HAVING filters groups with count > 1. The second query joins back to get full record details of duplicates.

Problem: Delete duplicate rows from a table, keeping only one instance of each duplicate.

SQL
-- Using ROW_NUMBER to identify duplicates
DELETE FROM users
WHERE id IN (
    SELECT id
    FROM (
        SELECT 
            id,
            ROW_NUMBER() OVER (
                PARTITION BY email 
                ORDER BY id
            ) AS row_num
        FROM users
    ) AS temp
    WHERE row_num > 1
);

-- Alternative: Using self-join
DELETE u1
FROM users u1
INNER JOIN users u2
WHERE u1.id > u2.id 
    AND u1.email = u2.email;
Explanation:

ROW_NUMBER assigns sequential numbers within each email group. We keep row_num = 1 (first occurrence) and delete row_num > 1 (duplicates).

Problem: Find employees who earn more than their managers.

SQL
-- Assume employees table has: id, name, salary, manager_id
SELECT 
    e.name AS employee_name,
    e.salary AS employee_salary,
    m.name AS manager_name,
    m.salary AS manager_salary
FROM employees e
INNER JOIN employees m 
    ON e.manager_id = m.id
WHERE e.salary > m.salary;
Explanation:

Self-join the employees table where e represents employees and m represents managers. The join condition links employee's manager_id to manager's id.

Problem: Find the department with the highest average salary.

SQL
-- Method 1: Using LIMIT
SELECT 
    d.dept_name,
    AVG(e.salary) AS avg_salary
FROM departments d
INNER JOIN employees e ON d.dept_id = e.dept_id
GROUP BY d.dept_name
ORDER BY avg_salary DESC
LIMIT 1;

-- Method 2: Using subquery with MAX
SELECT 
    dept_name,
    avg_salary
FROM (
    SELECT 
        d.dept_name,
        AVG(e.salary) AS avg_salary
    FROM departments d
    INNER JOIN employees e ON d.dept_id = e.dept_id
    GROUP BY d.dept_name
) AS dept_averages
WHERE avg_salary = (
    SELECT MAX(avg_salary) 
    FROM (
        SELECT AVG(salary) AS avg_salary
        FROM employees
        GROUP BY dept_id
    ) AS temp
);
Explanation:

Calculate average salary per department using AVG() and GROUP BY. Method 1 is simpler, Method 2 handles ties (multiple departments with same highest average).

Problem: Calculate a running total of sales by date.

SQL
-- Using Window Function (Modern SQL)
SELECT 
    sale_date,
    amount,
    SUM(amount) OVER (
        ORDER BY sale_date
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ) AS cumulative_sum
FROM sales
ORDER BY sale_date;

-- Simplified version
SELECT 
    sale_date,
    amount,
    SUM(amount) OVER (ORDER BY sale_date) AS cumulative_sum
FROM sales;
Explanation:

Window functions (OVER clause) calculate running totals without GROUP BY. Each row sees all previous rows up to the current row when ordered by sale_date.

Problem: Find missing IDs in a sequence (e.g., 1,2,4,5,7 - missing 3 and 6).

SQL
-- Using LEAD to find gaps
WITH numbered AS (
    SELECT 
        id,
        LEAD(id) OVER (ORDER BY id) AS next_id
    FROM employees
)
SELECT 
    id + 1 AS missing_id_start,
    next_id - 1 AS missing_id_end
FROM numbered
WHERE next_id - id > 1;

-- Alternative: Generate sequence and find missing
WITH RECURSIVE seq AS (
    SELECT 1 AS num
    UNION ALL
    SELECT num + 1 
    FROM seq 
    WHERE num < (SELECT MAX(id) FROM employees)
)
SELECT seq.num AS missing_id
FROM seq
LEFT JOIN employees e ON seq.num = e.id
WHERE e.id IS NULL;
Explanation:

LEAD() gets the next row's value. If next_id - current_id > 1, there's a gap. The recursive CTE generates all numbers from 1 to MAX(id) and finds which don't exist.

Problem: Rank students by their scores, handling ties appropriately.

SQL
-- ROW_NUMBER: 1, 2, 3, 4 (no ties)
SELECT 
    student_name,
    score,
    ROW_NUMBER() OVER (ORDER BY score DESC) AS row_num
FROM students;

-- RANK: 1, 2, 2, 4 (ties share rank, next rank skips)
SELECT 
    student_name,
    score,
    RANK() OVER (ORDER BY score DESC) AS rank_num
FROM students;

-- DENSE_RANK: 1, 2, 2, 3 (ties share rank, next rank consecutive)
SELECT 
    student_name,
    score,
    DENSE_RANK() OVER (ORDER BY score DESC) AS dense_rank
FROM students;
Comparison:

ROW_NUMBER: Unique rank for each row (1,2,3,4)
RANK: Ties get same rank, next rank skips (1,2,2,4)
DENSE_RANK: Ties get same rank, next rank is consecutive (1,2,2,3)

Problem: Find all pairs of employees who live in the same city.

SQL
-- Self-join to find pairs
SELECT 
    e1.name AS employee1,
    e2.name AS employee2,
    e1.city
FROM employees e1
INNER JOIN employees e2 
    ON e1.city = e2.city 
    AND e1.id < e2.id  -- Avoid duplicate pairs and self-pairing
ORDER BY e1.city, e1.name;
Explanation:

The condition e1.id < e2.id ensures we get each pair only once (prevents both "Alice-Bob" and "Bob-Alice") and prevents matching an employee with themselves.

Problem: Find the top 3 highest-paid employees in each department.

SQL
-- Using ROW_NUMBER with PARTITION BY
WITH RankedEmployees AS (
    SELECT 
        dept_id,
        name,
        salary,
        ROW_NUMBER() OVER (
            PARTITION BY dept_id 
            ORDER BY salary DESC
        ) AS rank_in_dept
    FROM employees
)
SELECT 
    d.dept_name,
    r.name,
    r.salary,
    r.rank_in_dept
FROM RankedEmployees r
JOIN departments d ON r.dept_id = d.dept_id
WHERE r.rank_in_dept <= 3
ORDER BY d.dept_name, r.rank_in_dept;
Explanation:

PARTITION BY dept_id creates separate ranking windows for each department. ROW_NUMBER ranks employees within their department, then we filter for top 3.

Problem: Transform sales data from rows to columns (pivot by month).

SQL
-- Original data: product, month, sales
-- Goal: product, Jan_sales, Feb_sales, Mar_sales

SELECT 
    product,
    SUM(CASE WHEN month = 'January' THEN sales ELSE 0 END) AS Jan_sales,
    SUM(CASE WHEN month = 'February' THEN sales ELSE 0 END) AS Feb_sales,
    SUM(CASE WHEN month = 'March' THEN sales ELSE 0 END) AS Mar_sales
FROM sales
GROUP BY product;

-- Using PIVOT (SQL Server syntax)
SELECT *
FROM sales
PIVOT (
    SUM(sales)
    FOR month IN ([January], [February], [March])
) AS pivot_table;
Explanation:

CASE WHEN creates conditional columns. Each CASE statement checks if the month matches and includes the sales value. GROUP BY aggregates across all months for each product.

Problem: Identify users with consecutive login streaks of 3+ days.

SQL
-- Using ROW_NUMBER to detect consecutive sequences
WITH LoginGroups AS (
    SELECT 
        user_id,
        login_date,
        DATE_SUB(login_date, INTERVAL 
            ROW_NUMBER() OVER (
                PARTITION BY user_id 
                ORDER BY login_date
            ) DAY
        ) AS group_date
    FROM logins
),
ConsecutiveCounts AS (
    SELECT 
        user_id,
        group_date,
        COUNT(*) AS consecutive_days
    FROM LoginGroups
    GROUP BY user_id, group_date
    HAVING COUNT(*) >= 3
)
SELECT DISTINCT user_id
FROM ConsecutiveCounts;
Explanation:

Subtracting ROW_NUMBER from date creates groups where consecutive dates share the same group_date value. Count records per group to find streaks of 3+ days.

Problem: Calculate a 7-day moving average of daily sales.

SQL
-- Using window frame
SELECT 
    sale_date,
    daily_sales,
    AVG(daily_sales) OVER (
        ORDER BY sale_date
        ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
    ) AS moving_avg_7days
FROM daily_sales
ORDER BY sale_date;

-- Alternative: Using date range
SELECT 
    sale_date,
    daily_sales,
    AVG(daily_sales) OVER (
        ORDER BY sale_date
        RANGE BETWEEN INTERVAL 6 DAY PRECEDING AND CURRENT ROW
    ) AS moving_avg_7days
FROM daily_sales;
Explanation:

Window frame "6 PRECEDING AND CURRENT ROW" creates a sliding window of 7 rows (current + 6 previous). AVG calculates the average within this window for each row.

Problem: Identify customers who have registered but never placed an order.

SQL - Method 1
-- Using LEFT JOIN and NULL check
SELECT 
    c.customer_id,
    c.customer_name,
    c.email
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id
WHERE o.order_id IS NULL;
SQL - Method 2
-- Using NOT IN subquery
SELECT 
    customer_id,
    customer_name,
    email
FROM customers
WHERE customer_id NOT IN (
    SELECT DISTINCT customer_id 
    FROM orders
    WHERE customer_id IS NOT NULL
);
SQL - Method 3
-- Using NOT EXISTS (Most efficient)
SELECT 
    customer_id,
    customer_name,
    email
FROM customers c
WHERE NOT EXISTS (
    SELECT 1 
    FROM orders o 
    WHERE o.customer_id = c.customer_id
);
Which Method to Use?

LEFT JOIN + NULL: Simple and readable
NOT IN: Easy to understand but watch for NULLs (use WHERE IS NOT NULL)
NOT EXISTS: Usually most efficient for large datasets, stops as soon as match is found

Interview Tips for SQL Queries:
  • Clarify requirements: Ask about NULL handling, duplicate treatment, and performance expectations
  • Start simple: Write a basic solution first, then optimize
  • Explain your approach: Talk through your thought process as you write
  • Consider edge cases: Empty tables, NULL values, duplicates, single-row tables
  • Know time complexity: Understand which operations are expensive (full table scans, subqueries)
  • Test mentally: Walk through sample data to verify your logic
  • Optimize when asked: Discuss indexes, query execution plans, alternative approaches