Databases

Core Concepts

Database vs. File System: Extra Features Offered by Databases

When compared to using a direct file system for storing and retrieving data, databases provide numerous advanced features that improve data management, efficiency, and integrity.

| Feature | Database | File System |
|---|---|---|
| Indexes / fast search | Yes | No |
| ACID transactions | Yes | No |
| Redundancy control | Yes | No |
| Query language (SQL/MQL) | Yes | No |
| Relationships / joins | Yes | No |
| Schema / data typing | Yes | No |
| Access control / auth | Fine-grained (users, roles, privileges) | Basic (weak) |
| Backup / restore tools | Built-in | Manual |
| Data validation | Yes | No |
| Multi-user concurrency | Full concurrency control | Basic / file locks |
| Crash recovery | Built-in (logs/WAL) | Minimal |

How Databases Store Data Persistently on Disk

Databases store data persistently on disk using specialized components called storage engines which manage the physical layout, retrieval, and durability of data.

Storage Engine

| Database/System | Storage Engine Approach | Persistence Details |
|---|---|---|
| MySQL | InnoDB: tablespaces with page-oriented storage | Uses tables, indexes, and logs with WAL for crash recovery; supports row-level locking and transactions |
| PostgreSQL | Single unified storage engine with heap files | Stores data and indexes as files; uses WAL for durability and MVCC for concurrency |
| MongoDB | Document storage in BSON format on disk (WiredTiger is the default engine) | Uses data files and journal logs; supports replica sets and sharding for durability and scale |

Additional Mechanisms

Row-Oriented Storage

Pros

Cons

Use Cases

Column-Oriented Storage

Pros

Cons

Use Cases

Comparison Table

| Feature/Aspect | Row-Oriented Storage | Column-Oriented Storage |
|---|---|---|
| Data layout | Row by row | Column by column |
| Best for | OLTP (transactions, CRUD) | OLAP (analytics, aggregates, scans) |
| Full row read/write | Fast | Slow (needs to stitch columns together) |
| Single column read | Slow (loads all columns per row) | Very fast (scans only the relevant column) |
| Compression | Moderate | Very high (similar values stored together) |
| Example DBs | MySQL, PostgreSQL, Oracle | Apache Parquet, Amazon Redshift, ClickHouse, Google BigQuery, Apache Cassandra (hybrid) |

Mixed/Hybrid Approaches

Summary:
- Row-oriented storage = best for transactional workloads with frequent full-row accesses.
- Column-oriented storage = best for analytical workloads with queries on selective columns across many rows.

Choose the storage model that matches your primary query and update patterns for optimal performance.
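
The difference is easy to see with a toy in-memory layout. The sketch below (table, column names, and values are invented) contrasts a row-oriented list of records with a column-oriented dict of arrays for a single-column aggregate.

# Toy illustration of row- vs column-oriented layouts (hypothetical data).

rows = [  # row-oriented: each record is stored together
    {"id": 1, "name": "Alice", "amount": 120.0},
    {"id": 2, "name": "Bob",   "amount": 80.0},
    {"id": 3, "name": "Cara",  "amount": 45.5},
]

columns = {  # column-oriented: each attribute is stored together
    "id":     [1, 2, 3],
    "name":   ["Alice", "Bob", "Cara"],
    "amount": [120.0, 80.0, 45.5],
}

# OLTP-style access: fetch one full record, natural in the row layout.
order = rows[1]

# OLAP-style access: aggregate one column over all records.
# The row layout touches every field of every record; the column layout scans one contiguous list.
total_row_layout = sum(r["amount"] for r in rows)
total_col_layout = sum(columns["amount"])
assert total_row_layout == total_col_layout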

Relational Databases

1. Keys in RDBMS

Keys uniquely identify rows and establish relationships.

Primary Key

Foreign Key

Candidate Key

Alternate Key

Composite Key

Super Key

2. Integrity Constraints

Ensure data is accurate and consistent.

3. ON DELETE modes

  1. ON DELETE CASCADE:
     - What it does: Deletes all dependent rows in the child table when the parent row is deleted.
     - Use case: Ensure referential integrity by removing related data automatically.
     - Example: In an orders (child) and customers (parent) table, deleting a customer removes all their orders.
     - SQL: FOREIGN KEY (customer_id) REFERENCES customers(id) ON DELETE CASCADE
     - Pros: Maintains consistency; no orphaned records.
     - Cons: Can unintentionally delete large amounts of data.

  2. ON DELETE SET NULL:
     - What it does: Sets the foreign key column in the child table to NULL when the parent row is deleted.
     - Use case: Preserve child records but remove the parent reference if it is optional.
     - Example: Deleting a customer sets customer_id to NULL in orders (if customer_id allows NULL).
     - SQL: FOREIGN KEY (customer_id) REFERENCES customers(id) ON DELETE SET NULL
     - Pros: Keeps child data; useful for optional relationships.
     - Cons: Requires a nullable foreign key; may complicate queries.

  3. ON DELETE SET DEFAULT:
     - What it does: Sets the foreign key column in the child table to its default value when the parent row is deleted.
     - Use case: Assign a fallback reference (e.g., a default parent ID).
     - Example: Deleting a customer sets customer_id in orders to a default customer ID (e.g., a generic "unknown" customer).
     - SQL: FOREIGN KEY (customer_id) REFERENCES customers(id) ON DELETE SET DEFAULT
     - Pros: Maintains reference to a valid parent; preserves data.
     - Cons: Requires a valid default value; less common.

  4. ON DELETE RESTRICT (or NO ACTION):
     - What it does: Prevents deletion of the parent row if dependent rows exist in the child table.
     - Use case: Enforce strict referential integrity by blocking deletes.
     - Example: Cannot delete a customer if they have existing orders.
     - SQL: FOREIGN KEY (customer_id) REFERENCES customers(id) ON DELETE RESTRICT
     - Pros: Protects data integrity; prevents accidental deletions.
     - Cons: Requires manual cleanup of child rows before deleting the parent.

Additional Notes

4. Types of Attributes

Simple Attribute

Composite Attribute

Single-valued Attribute

Multi-valued Attribute

Derived Attribute

Prime Attribute

Non-Prime Attribute

5. Relationships in Databases

Define associations between entities.

6. Participation & Mapping

| Concept | Description | Diagram Notation | Impact on Table Design |
|---|---|---|---|
| Total Participation | Every entity instance participates in the relationship | Double line | Enforce NOT NULL on the foreign key |
| Partial Participation | Some instances participate, others may not | Single line | Foreign key may be nullable |
| Entity Mapping | Entity → Table | | Table with attributes, primary key |
| 1:N Relationship | "One" side's PK appears as an FK in the "Many" table | | Foreign key in the "Many" table |
| M:N Relationship | New table for the relationship, includes both PKs as FKs | | Join table with composite PK |
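
As a concrete illustration of the last two rows, here is a minimal sqlite3 sketch (table and column names are invented): a 1:N relationship mapped with a foreign key, and an M:N relationship mapped with a join table whose composite primary key is built from both foreign keys.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- 1:N - the "one" side's PK appears as a FK in the "many" table.
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE orders (
    id          INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customers(id)   -- NOT NULL models total participation
);

-- M:N - a separate join table holds both PKs as FKs, forming a composite PK.
CREATE TABLE products (id INTEGER PRIMARY KEY, title TEXT NOT NULL);
CREATE TABLE order_items (
    order_id   INTEGER NOT NULL REFERENCES orders(id),
    product_id INTEGER NOT NULL REFERENCES products(id),
    quantity   INTEGER NOT NULL DEFAULT 1,
    PRIMARY KEY (order_id, product_id)
);
""")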

Normalization in Databases

What is Normalization?

Why is it Required?


Normal Forms (Short Explanations with Examples)

1NF (First Normal Form)

2NF (Second Normal Form)

3NF (Third Normal Form)

BCNF (Boyce-Codd Normal Form)

4NF (Fourth Normal Form)


Summary

| Normal Form | Main Rule | Problem Solved |
|---|---|---|
| 1NF | Atomic values, no repeating groups | No arrays/multiple values in a cell |
| 2NF | 1NF + no non-prime attribute depends on only part of a candidate key | No partial dependency |
| 3NF | 2NF + no non-prime attribute depends on another non-prime attribute | No indirect (transitive) dependency |
| BCNF | Every determinant is a super key | Stronger than 3NF |
| 4NF | BCNF + no multi-valued dependencies | No overlapping multi-valued facts |
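
A small worked example may help. Imagine one wide order table keyed by (order_id, product_id) that also stores the customer's name (which depends only on order_id: a partial dependency, violating 2NF) and the customer's city (which depends on the customer, not the key: a transitive dependency, violating 3NF). A hedged decomposition sketch with invented names:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- After normalization (illustrative schema):
-- 1NF: atomic values, one row per order line (no repeating groups).
-- 2NF: facts about the whole order (its customer) move to orders, keyed by order_id alone.
-- 3NF: facts about the customer (its city) move to customers, keyed by customer_id.
CREATE TABLE customers   (customer_id INTEGER PRIMARY KEY, name TEXT, city TEXT);
CREATE TABLE orders      (order_id INTEGER PRIMARY KEY,
                          customer_id INTEGER REFERENCES customers(customer_id));
CREATE TABLE order_lines (order_id INTEGER REFERENCES orders(order_id),
                          product_id INTEGER,
                          quantity INTEGER,
                          PRIMARY KEY (order_id, product_id));
""")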

Transactions and Concurrency Control in Databases

Transaction

Example:
Transferring money from Account A to Account B requires: 1. Deduct amount from Account A. 2. Add amount to Account B.

Both actions must succeed together, or neither should occur.


ACID Properties

ACID ensures database transactions are processed reliably.

1. Atomicity

2. Consistency

3. Isolation

4. Durability

Visual Table: ACID Properties

| Property | Purpose | Guarantee | Example |
|---|---|---|---|
| Atomicity | All-or-nothing | No partial results saved | Debit and credit of a transfer both happen, or neither does |
| Consistency | Preserves rules/constraints | Only valid data states | Balance never goes negative |
| Isolation | Transactions don't interfere | No dirty or partial reads | Two transfers don't corrupt each other |
| Durability | Once done, always persisted | Survives crashes/failures | A committed transfer is never lost |
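
A minimal sketch of atomicity and the consistency rule from the table, using Python's built-in sqlite3 (the accounts and amounts are invented): both legs of a transfer commit together or are rolled back together.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance REAL)")
conn.execute("INSERT INTO accounts VALUES ('A', 100), ('B', 50)")
conn.commit()

def transfer(amount):
    try:
        with conn:  # opens a transaction; commits on success, rolls back on exception
            conn.execute("UPDATE accounts SET balance = balance - ? WHERE name = 'A'", (amount,))
            (balance,) = conn.execute("SELECT balance FROM accounts WHERE name = 'A'").fetchone()
            if balance < 0:
                raise ValueError("insufficient funds")  # consistency rule: no negative balance
            conn.execute("UPDATE accounts SET balance = balance + ? WHERE name = 'B'", (amount,))
    except ValueError:
        pass  # atomicity: the debit above was rolled back, nothing partial persists

transfer(30)    # both rows change
transfer(1000)  # neither row changes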

Transaction Isolation Levels in Databases

Isolation levels control how and when the effects of one transaction become visible to others, balancing data consistency and concurrency.

Key Phenomena in Concurrency

Standard SQL Isolation Levels

1. Read Uncommitted

Description: - Lowest isolation. - Transactions can read uncommitted changes from others (dirty reads allowed).

Problems/Anomalies: - Dirty reads: Possible. - Non-repeatable reads: Possible. - Phantom reads: Possible.

Example: - T1 updates a row but hasn't committed. T2 reads this uncommitted value. T1 rolls back, so T2 used a value that never existed.

Advantages: - Maximum concurrency and performance. - Minimal locking and wait times.

Disadvantages: - Data can be inconsistent and unreliable. - Should be used only when accuracy is non-critical (e.g., big analytic scans).

2. Read Committed

Description: - A transaction can only read committed data. - Default in many DBMS (e.g., SQL Server, Oracle).

Problems/Anomalies: - Dirty reads: Prevented. - Non-repeatable reads: Possible. - Phantom reads: Possible.

Example: - T1 reads a row. T2 updates and commits the row. If T1 reads again, it sees a new value (non-repeatable read).

Advantages: - Prevents viewing uncommitted ("dirty") data. - Good balance between concurrency and data integrity.

Disadvantages: - Value of repeated reads can change. - Phantom rows can appear/disappear between queries.

3. Repeatable Read

Description: - Ensures that all rows read by a transaction remain stable and cannot be changed by other concurrent transactions until the transaction completes.

Problems/Anomalies: - Dirty reads: Prevented. - Non-repeatable reads: Prevented. - Phantom reads: Possible (unless further mechanisms like next-key locking are used).

Example: - T1 reads row X. T2 attempts to update or delete row X and must wait (blocked) until T1 ends. - T1 still may see new rows matching a WHERE clause (phantom reads) if another transaction inserts them.

Advantages: - Ensures repeated reads of the same row always return the same data. - Good for workflows needing strong consistency per row.

Disadvantages: - Higher locking overhead. - Phantom reads remain possible.

4. What is Snapshot Isolation?

5. Serializable

Description: - Highest (strictest) standard isolation. - Serializes access: transactions are executed with the effect as if run one after another, never concurrently.

Problems/Anomalies: - Dirty reads: Prevented. - Non-repeatable reads: Prevented. - Phantom reads: Prevented.

Example: - T1 reads all accounts with balance >₹1000. T2 cannot insert/delete qualifying rows until T1 completes.

Advantages: - Complete isolation, no concurrency anomalies at all. - Ideal for critical financial/accounting systems.

Disadvantages: - Lowest concurrency—may block or serialize many transactions, reducing performance. - Deadlocks or transaction retries can increase.

Comparison Table

| Isolation Level | Dirty Reads | Non-Repeatable Reads | Phantom Reads | Performance | Typical Use Case |
|---|---|---|---|---|---|
| Read Uncommitted | Possible | Possible | Possible | Fastest | Low-accuracy analytics |
| Read Committed | Prevented | Possible | Possible | High | General OLTP; moderate business apps |
| Repeatable Read | Prevented | Prevented | Possible | Moderate | Financial reads, inventory |
| Snapshot Isolation | Prevented | Prevented | Prevented | Moderate | Financial reads, inventory |
| Serializable | Prevented | Prevented | Prevented | Slowest | Critical integrity/finance apps |
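
How the level is selected is driver- and database-specific. A minimal sketch, assuming a PostgreSQL server and the psycopg2 driver are available (the connection string, table, and ids are placeholders):

import psycopg2

# Placeholder connection settings; assumes a reachable PostgreSQL server and an accounts table.
conn = psycopg2.connect("dbname=shop user=app password=secret host=localhost")
conn.set_session(isolation_level="REPEATABLE READ")   # applies to the following transactions

cur = conn.cursor()
cur.execute("SELECT balance FROM accounts WHERE id = %s", (123,))
first_read = cur.fetchone()

# ...another session may update and commit the same row here...

cur.execute("SELECT balance FROM accounts WHERE id = %s", (123,))
second_read = cur.fetchone()    # equals first_read at REPEATABLE READ or stricter
conn.commit()                   # end the transaction; a new snapshot starts afterwards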

6. Multi-Version Concurrency Control (MVCC)

How MVCC Works: Core Principles

  1. Versioning of Records:
     - Every row (record) in the database has multiple versions, each tagged with metadata (like a transaction ID, timestamp, etc.).
     - When a transaction changes a row, the DBMS creates a new version rather than modifying the row in place.

  2. Snapshots for Transactions:
     - When a transaction begins, it sees a snapshot of the database reflecting the committed state at that start moment.
     - Reads are always performed on the snapshot, ensuring a consistent view even as other transactions are modifying data.

  3. Non-Blocking Reads and Writes:
     - Readers never block writers, and writers never block readers.
     - While a write is in progress, readers continue to see the old version of data; once the transaction commits, new transactions see the new version.
     - Only lightweight locks or internal synchronization may be used for operations like index updates (not for the row data itself).

  4. Version Clean-up (Garbage Collection):
     - As old versions of data become unneeded (when no active transaction could see them), a background process removes them to reclaim space (e.g., VACUUM in PostgreSQL).

Benefits of MVCC

MVCC Types and Details

Write Conflicts

Trade-offs & Limitations

Real-World Example

Suppose two bank employees are working: - Alice starts a transaction, reads Bob's balance ($100), and prepares to withdraw $50. - Meanwhile, Bob deposits $40; his transaction creates a new version ($140) and commits. - Alice's snapshot still shows $100, but because her withdrawal writes to the same row, the engine detects the write-write conflict: depending on the isolation level, her update either waits and is re-applied against the latest committed version (yielding $90) or her transaction is aborted and retried. - Either way, readers were never blocked, and the committed deposit is not silently lost.
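
A toy Python sketch of the versioning idea (not any specific engine's implementation): every write appends a new version stamped with a transaction ID, and a reader only sees versions created before its snapshot.

# Toy MVCC store: one append-only version chain per key, stamped with the creating transaction ID.
class MVCCStore:
    def __init__(self):
        self.versions = {}    # key -> list of (txid, value)
        self.next_txid = 1

    def begin(self):
        snapshot = self.next_txid      # the reader sees only versions with txid < snapshot
        self.next_txid += 1
        return snapshot

    def write(self, key, value):
        txid = self.next_txid          # simplified: every write commits immediately
        self.next_txid += 1
        self.versions.setdefault(key, []).append((txid, value))

    def read(self, key, snapshot):
        visible = [v for txid, v in self.versions.get(key, []) if txid < snapshot]
        return visible[-1] if visible else None

store = MVCCStore()
store.write("bob", 100)
t1 = store.begin()          # reader starts: snapshot taken here
store.write("bob", 140)     # a concurrent deposit commits a new version
print(store.read("bob", t1))              # 100: the reader keeps its consistent snapshot
print(store.read("bob", store.begin()))   # 140: a new transaction sees the new version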

MVCC vs. Traditional Locking

| Aspect | MVCC | Pessimistic Locking |
|---|---|---|
| Reads | Non-blocking, consistent snapshot | May block on write locks |
| Writes | Create new versions; may abort on conflict | Must acquire locks; blocks others |
| Concurrency | High | Lower (due to lock contention) |
| Deadlocks | Rare (except for metadata/index ops) | More frequent with many writers/readers |

MVCC in Practice

Additional Notes

Recoverability and Serializability of Schedules in DBMS

Database schedule management balances concurrent execution (performance) with data consistency and recoverability.

1. Recoverability of Schedules

Recoverability ensures that if a transaction fails, no other transaction commits based on the failed (and rolled back) transaction’s output; thus, the database can be safely recovered to a consistent state.

Types of Schedules Based on Recoverability

a. Recoverable Schedule

  T1: W(x) ... C1
  T2: R(x) ... C2

Here, T2 reads x written by T1 and C2 happens after C1. If T1 fails and is rolled back, T2 can be rolled back safely.

b. Cascading Rollbacks

c. Cascadeless/Avoid Cascading Abort Schedule

d. Strict Schedule

2. Serializability

Serializability is the gold standard for determining if a concurrent schedule maintains database consistency by yielding the same effect as some serial (one-at-a-time) execution.

a. Serial and Non-serial Schedules

b. Conflict Serializability

c. View Serializability

Example:

Suppose in both the actual and serial schedule, T2 reads X as written by T1, and final writes match. The schedule is view-serializable even if not conflict-serializable.


| Concept | What It Ensures / Description | Advantages | Disadvantages / Issues |
|---|---|---|---|
| Recoverable Schedule | A transaction commits only after all transactions it read from have committed | Prevents unrecoverable states | May allow cascading rollbacks |
| Cascading Rollback | Rollbacks propagate through dependencies | | Data loss, complex recovery |
| Cascadeless Schedule | Transactions only read committed data | No cascading aborts | May require stricter protocols |
| Conflict Serializability | Conflicting operations can be reordered into a serial schedule | Easy to check | Doesn't capture all "good" schedules |
| View Serializability | Same reads/writes as some serial schedule | More comprehensive | Harder to verify |
| Non-Conflicting Operations | Same transaction, or different data items, or both reads | Allow flexible reordering | |

3. Practical Guidance

Lock-Based Concurrency Control

Lock-based protocols manage access to data in concurrent transactions to ensure consistency and isolation.

1. Lock Types

Shared Lock (S-Lock)

Exclusive Lock (X-Lock)

| Lock Type | Allows Read? | Allows Write? | Multiple Holders? |
|---|---|---|---|
| Shared | Yes | No | Yes |
| Exclusive | Yes | Yes | No |
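
A minimal, single-process sketch of these compatibility rules (a toy lock table; real systems add queueing, blocking, deadlock detection, and fairness):

# Toy lock table: many shared holders OR exactly one exclusive holder per item.
class LockTable:
    def __init__(self):
        self.shared = {}      # item -> set of transaction ids holding S-locks
        self.exclusive = {}   # item -> transaction id holding the X-lock

    def acquire_shared(self, txn, item):
        holder = self.exclusive.get(item)
        if holder is not None and holder != txn:
            return False                      # blocked by another writer
        self.shared.setdefault(item, set()).add(txn)
        return True

    def acquire_exclusive(self, txn, item):
        readers = self.shared.get(item, set()) - {txn}
        holder = self.exclusive.get(item)
        if readers or (holder is not None and holder != txn):
            return False                      # blocked by other readers or another writer
        self.exclusive[item] = txn            # an upgrade succeeds here if txn was the only reader
        return True

locks = LockTable()
assert locks.acquire_shared("T1", "x")
assert locks.acquire_shared("T2", "x")          # shared locks coexist
assert not locks.acquire_exclusive("T2", "x")   # T1 still reads: the upgrade must wait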

2. Two-Phase Locking Protocol (2PL)

Protocol Overview

3. Strict Two-Phase Locking

4. Lock Upgradation (Lock Conversion)

5. Deadlock & Starvation

Deadlock

Starvation

6. Timestamp-Based Protocols

7. Summary Table

| Concept | Purpose/Behavior | Pros | Cons |
|---|---|---|---|
| Shared Lock | Read-only by many; no writes allowed | High concurrency | May block writers |
| Exclusive Lock | Single holder; allows read & write | Data integrity | More locking waits |
| 2PL | Split into "growing" and "shrinking" phases | Ensures serializability | Deadlock possible |
| Strict 2PL | Hold X-locks until commit/abort | No cascading aborts | Lower concurrency |
| Lock Upgradation | Convert S-lock to X-lock without unlock/relock | Efficient for read→write | Potential deadlock |
| Deadlock | Cyclic wait for locks; nobody can proceed | | Must detect/handle |
| Starvation | Some transactions never proceed | | Unfairness |
| Timestamp Protocol | Timestamps decide execution order (no locks) | No deadlock | More aborts possible |

Files and Indexing Concepts in DBMS

Efficient file organization and indexing are critical for fast data access in databases.

1. File Organization

a. Heap (Unordered) File

b. Sequential (Ordered) File

c. Hash File

2. Indexing

Indexes improve query speed by allowing direct access to records.

a. Dense Index

b. Sparse Index

Dense vs. Sparse Index Comparison

| Aspect | Dense Index | Sparse Index |
|---|---|---|
| Entries | Every search key | Only some keys (e.g., one per page) |
| Space | Larger | Smaller |
| Search Time | Fastest | Slightly slower |
| Maintenance | More updates needed | Fewer updates needed |
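
A minimal sketch of a sparse index over sorted, fixed-size pages (page contents and sizes are invented): the index stores only the first key of each page, so a lookup binary-searches the index and then scans a single page.

import bisect

# Sorted records packed into fixed-size "pages" (toy data).
pages = [
    [(1, "a"), (3, "b"), (5, "c")],
    [(7, "d"), (9, "e"), (12, "f")],
    [(15, "g"), (18, "h"), (20, "i")],
]

# Sparse index: one entry per page, the smallest key stored on that page.
index = [page[0][0] for page in pages]              # [1, 7, 15]

def lookup(key):
    page_no = bisect.bisect_right(index, key) - 1   # last page whose first key <= key
    if page_no < 0:
        return None
    for k, value in pages[page_no]:                 # scan only within that page
        if k == key:
            return value
    return None

assert lookup(9) == "e"
assert lookup(10) is None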

3. Multi-way Trees for Indexing (m-way Search Tree)

4. B-Tree

5. B+ Tree

B-Tree vs B+ Tree Comparison

| Feature | B-Tree | B+ Tree |
|---|---|---|
| Record Storage | In all nodes | Only in leaf nodes |
| Internal Nodes | Keys + pointers + data | Only keys + pointers (no actual data) |
| Range Query | Slower (data spread throughout the tree) | Faster (all data in sequentially linked leaf nodes) |
| Leaf Node Link | No | Yes (singly or doubly linked) |

LSM Indexing and SSTables

LSM Tree Architecture Components

LSM Operations

SSTable (Sorted String Table)

3. Compaction in LSM Trees

4. LSM vs. Traditional B-trees

| Aspect | LSM Tree | B-tree |
|---|---|---|
| Writes | Fast, sequential, batched | Slower, random, in-place |
| Reads | Potentially slower, multiple files | Fast, single structure |
| Best For | Write-heavy, logging, analytics | Read-heavy, OLTP |
| Structure | Multiple levels of SSTables | Single balanced tree |

5. Summary Table: LSM Components

| Component | Where? | Role |
|---|---|---|
| Memtable | In-memory | Buffer for fast writes/updates |
| WAL | Disk | Durability for in-memory data |
| SSTable | Disk | Immutable, sorted, on-disk data files |
| Bloom Filter | In-memory | Fast check for key non-existence in SSTables |
| Compaction | Disk | Merge/optimize SSTables, drop obsolete entries |
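
A toy sketch of the LSM write path (a dict stands in for the memtable and sorted lists stand in for on-disk SSTables; WAL, bloom filters, and compaction are omitted):

class ToyLSM:
    def __init__(self, memtable_limit=3):
        self.memtable = {}          # recent writes, mutable, in memory
        self.sstables = []          # immutable sorted runs, newest last
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value                  # fast in-memory write
        if len(self.memtable) >= self.memtable_limit:
            self.sstables.append(sorted(self.memtable.items()))   # flush as a sorted run
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:                    # newest data first
            return self.memtable[key]
        for run in reversed(self.sstables):         # then newer SSTables before older ones
            for k, v in run:                        # real SSTables binary-search and use bloom filters
                if k == key:
                    return v
        return None

db = ToyLSM()
for i, k in enumerate("abcdef"):
    db.put(k, i)
assert db.get("a") == 0 and db.get("f") == 5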

In Practice

Geo Indexing

Goal: Efficiently answer "Find all points near X" in 2D space (latitude/longitude).

Why it’s hard: - 2D data doesn’t fit well in B-trees - Naïve scan = O(n) → too slow for millions of points - Need logarithmic query time

Two dominant solutions: 1. Quad Tree → Tree-based, dynamic 2. Geohash → String-based, fixed-grid

Quad Tree – The Tree That Splits the World

Concept - Recursively divides 2D space into 4 quadrants - Each node represents a bounding box - Stops when ≤ N points per node (usually 4)

                [Earth]
           /     /     \      \
         NW     NE     SW     SE
        /|\    /|\    /|\    /|\

Types

| Type | Use Case |
|-------------------|-----------------------------------|
| Point Quad Tree | Store individual points (cities, users) |
| Region Quad Tree | Raster images, terrain data |
| PR Quad Tree | Fixed subdivision rules |

How Insertion Works?

if node has < 4 points:
    add point
else:
    subdivide into 4 children
    re-insert all 5 points into correct child

Pros

Cons - High memory usage (pointers everywhere) - Hard to persist in databases - Poor cache performance - Not distributed-friendly - Imbalanced trees possible

Quad Tree Query

def find_in_radius(self, center, radius):
    # Prune: skip this node entirely if its bounding box misses the query circle.
    if not self.boundary.intersects_circle(center, radius):
        return []
    # Collect points stored at this node that fall inside the radius.
    results = [p for p in self.points if p.distance(center) <= radius]
    # Recurse into the four children if this node has been subdivided.
    if self.divided:
        for child in [self.nw, self.ne, self.sw, self.se]:
            results += child.find_in_radius(center, radius)
    return results

Geohash – Turn Location into a String

Concept

Converts (lat, lng) → short string using base-32 encoding

40.7484, -73.9857  →  dr5ru6jz1
(Empire State Building)

How It Works (Bit Interleaving)

Latitude:  01001011... (32 bits)
Longitude: 10110100... (32 bits)
Interleave:  01100101... → base32 → dr5ru6jz1
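
The loop below is a hedged sketch of this encoding (the standard algorithm interleaves starting with the longitude bit; the diagram above starts with latitude purely for illustration). Each step halves the remaining interval for either longitude or latitude and emits one bit, and every 5 bits become one base-32 character.

BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"      # geohash alphabet (a, i, l, o are excluded)

def geohash_encode(lat, lng, length=9):
    lat_range, lng_range = [-90.0, 90.0], [-180.0, 180.0]
    chars, current = [], 0
    for i in range(length * 5):
        rng, value = (lng_range, lng) if i % 2 == 0 else (lat_range, lat)  # longitude first, then alternate
        mid = (rng[0] + rng[1]) / 2.0
        if value >= mid:
            current = (current << 1) | 1          # emit bit 1, keep the upper half
            rng[0] = mid
        else:
            current = current << 1                # emit bit 0, keep the lower half
            rng[1] = mid
        if i % 5 == 4:                            # every 5 bits -> one base-32 character
            chars.append(BASE32[current])
            current = 0
    return "".join(chars)

print(geohash_encode(40.7484, -73.9857))          # starts with "dr5ru", the example hash above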

Precision Table

| Length | Cell Size | Accuracy | Example |
|---|---|---|---|
| 5 | 4.9 km × 4.9 km | ~2.4 km | dr5ru |
| 7 | 153 m × 153 m | ~76 m | dr5ru6j |
| 9 | 4.8 m × 4.8 m | ~2.4 m | dr5ru6jz1 |
| 11 | 15 cm × 15 cm | ~7 cm | dr5ru6jz1j3 |

The 9-Cell Trick

One geohash = one box
But user is in center → need 8 neighbors

[7][0][1]
[6][C][2]   ← C = center geohash
[5][4][3]

Query all 9 geohash cells (center + 8 neighbors) → covers a ~500 m search radius, provided the geohash length is chosen so each cell is at least as wide as the radius (about 6 characters here)

Why Redis Loves Geohash

| Redis Feature | Geohash Provides |
|---|---|
| Fast string keys | 9-char strings |
| Sorted Sets (ZSET) | Store user IDs with a timestamp score |
| Prefix search | Same prefix = nearby |
| Memory efficient | 9 bytes per location |

ZADD geo:dr5ru6jz1 1700000000 user123
ZADD geo:dr5ru6jz3 1700000001 user456

Geohash + Redis (Python)

import redis, geohash, time

r = redis.Redis()

def add_user(user_id, lat, lng):
    gh = geohash.encode(lat, lng, 9)
    r.zadd(f"geo:{gh}", {user_id: int(time.time())})

def nearby(lat, lng):
    center = geohash.encode(lat, lng, 9)
    cells = [center] + geohash.neighbors(center)
    users = []
    for cell in cells:
        users.extend(r.zrange(f"geo:{cell}", 0, -1))
    return users

Comparison

| Feature | Quad Tree | Geohash |
|---|---|---|
| Data Structure | Tree (dynamic) | String (fixed grid) |
| Query Speed | O(log n) | O(1) with Redis |
| Memory Usage | High (pointers) | Tiny (9 bytes) |
| Database Indexing | Hard | Easy (B-tree on string) |
| Distributed Systems | Hard to shard | Perfect sharding |
| Nearest Neighbor | Excellent | Good (with 9-cell trick) |
| Circle Queries | Perfect | Approximate (square grid) |
| Implementation | Complex | ~10 lines of code |
| Production Scalability | Medium | 100M+ users (Uber, Tinder) |

Real Production Systems

Geohash Winners

| Company | Implementation |
|---|---|
| Uber | Redis GEO + GEORADIUS for driver matching |
| Instagram | Nearby posts → geohash in Cassandra |
| Tinder | Match discovery → Redis GEO |
| Snapchat | Geo-filters → Redis geospatial |
| Foursquare | Venue search → Redis ZSETs per geohash |

Quad Tree Winners

| Company | Use Case |
|---|---|
| Google Maps | Frontend marker clustering |
| Pokémon GO | Creature spawning density |
| Mapbox | Vector tile clustering |

When to Use Which?

| Use Case | Winner | Reason |
|---|---|---|
| Find drivers within 1 km | Geohash + Redis | ~1 ms response |
| Show 1M map markers without lag | Quad Tree | Zoom clustering |
| Store user locations in a DB | Geohash | Indexable string |
| 2D game collision detection | Quad Tree | Real-time |
| Shard data across 100 servers | Geohash | Natural partitions |
| Precise circular range queries | Quad Tree | True circles |
| Mobile app with 10M users | Geohash + Redis | Scalability |

Best Practice: Use Both!

Modern geo apps do exactly this:

Frontend (Map)     → Quad Tree     → Beautiful clustering
Backend (API)      → Geohash + Redis → Lightning-fast queries
Database (Persist) → Geohash column → Easy indexing

Example: Uber - Drivers stored with Geohash in Redis - Map shows Quad Tree clustered markers - Database has geohash VARCHAR(12) INDEX

The winners use both — Geohash for backend, Quad Tree for frontend.

This is exactly how Uber, Instagram, Tinder, and Google Maps work in 2025.

Common Database Systems

MySQL (Relational, SQL-based)

Key Storage Engines:

| Engine | Features | Use Cases |
|---|---|---|
| InnoDB | Default engine. Supports transactions, row-level locking, foreign keys, crash recovery. | Transactional apps (e.g., e-commerce, banking) |
| MyISAM | Fast reads, table-level locking, no transaction support, smaller disk use. | Read-heavy, analytics, logging |
| Memory | Data in RAM (non-persistent), ultra-fast. | Temp tables, caching |
| Archive | Compressed storage for archival, no indexing. | Historical data, infrequent queries |
| CSV | Stores data as .csv files, no indexes. | Data exchange, interoperability with other tools |

Index Types:

PostgreSQL (Relational, Object-Relational, SQL-based)

Architecture:

Index Types:

Example: - Ideal for complex transactional, analytical, geospatial, and semi-structured data workloads.

Steps for Updating a Row in a PostgreSQL Table

Sequential Steps for Updating a Row

  1. Query Parsing and Planning:
     - The UPDATE query (e.g., UPDATE users SET name = 'Alice' WHERE id = 123;) is received via a client (e.g., psql, JDBC).
     - PostgreSQL's parser validates the query syntax and creates an abstract syntax tree (AST).
     - The query planner/optimizer generates an execution plan, choosing the most efficient path (e.g., index scan or sequential scan) based on table statistics and indexes.

  2. Transaction Initiation:
     - A transaction is started (explicitly with BEGIN or implicitly for a single statement).
     - PostgreSQL assigns a transaction ID (XID) to track changes and ensure MVCC.

  3. Row Lookup:
     - The executor uses the query plan to locate the target row(s) based on the WHERE condition.
     - If an index exists (e.g., B-tree on id), it's used for faster lookup; otherwise, a sequential scan is performed.
     - The row's tuple identifier (TID) is retrieved from the heap table.

  4. MVCC Check:
     - PostgreSQL checks the row's visibility using MVCC:
       • Verifies if the row is visible to the current transaction (based on Xmin/Xmax fields in the tuple header).
       • Ensures no conflicting transactions are modifying the same row (e.g., using locks if needed).

  5. Row Update (New Tuple Creation):
     - Instead of overwriting the existing row, PostgreSQL creates a new tuple with updated values (e.g., name = 'Alice').
     - The new tuple is marked with the current transaction's XID (Xmin), and the old tuple's Xmax is set to mark it as "dead" for MVCC.
     - The new tuple is written to the heap table, typically in the same or a new page (8KB default).

  6. Index Updates:
     - If the table has indexes, PostgreSQL updates them to reflect the new tuple:
       • Removes the old tuple's index entry (or marks it invalid).
       • Adds a new index entry pointing to the new tuple's TID.
     - This applies to all index types (e.g., B-tree, GIN, GiST).

  7. Write-Ahead Logging (WAL):
     - The update operation (new tuple and index changes) is logged to the WAL before being committed to disk.
     - WAL ensures durability by recording changes in a sequential log, allowing recovery in case of crashes.

  8. Locking (if Needed):
     - If concurrent transactions access the same row, PostgreSQL may acquire row-level locks (e.g., FOR UPDATE) to prevent conflicts.
     - Table-level locks are rare unless explicitly requested (e.g., LOCK TABLE).

  9. Commit or Rollback:
     - On COMMIT, the transaction is finalized, and the new tuple becomes visible to subsequent transactions (based on MVCC rules).
     - WAL records are marked as committed, ensuring durability.
     - On ROLLBACK, changes are discarded, and the new tuple is ignored.

  10. Storage and Maintenance:
      - The old tuple remains in the heap as a "dead tuple" for MVCC (visible to older transactions).
      - PostgreSQL's autovacuum process later runs VACUUM to mark dead tuples for reuse and reclaim space.
      - ANALYZE may update table statistics for future query optimization.

Example

For UPDATE users SET name = 'Alice' WHERE id = 123;:

1. Parse the query and plan (use the index on id).
2. Start a transaction, assign an XID.
3. Find the row with id = 123 via the index.
4. Check MVCC visibility.
5. Create a new tuple with name = 'Alice'; mark the old tuple as dead.
6. Update indexes to point to the new tuple.
7. Log the changes to the WAL.
8. Apply any necessary locks.
9. Commit the transaction, making the new tuple visible.
10. Autovacuum later cleans up the dead tuple.
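
Several of these steps can be observed from a client through PostgreSQL's hidden system columns (ctid for the tuple's physical location, xmin for the creating XID). A hedged sketch, assuming a running server, the psycopg2 driver, and a users table like the one above:

import psycopg2

conn = psycopg2.connect("dbname=app user=app host=localhost")   # placeholder connection settings
cur = conn.cursor()

# ctid = physical tuple location, xmin = XID of the transaction that created this tuple version.
cur.execute("SELECT ctid, xmin, name FROM users WHERE id = %s", (123,))
print("before:", cur.fetchone())

cur.execute("UPDATE users SET name = %s WHERE id = %s", ("Alice", 123))
conn.commit()                        # WAL is flushed and the new tuple becomes visible

cur.execute("SELECT ctid, xmin, name FROM users WHERE id = %s", (123,))
print("after: ", cur.fetchone())     # new ctid and new xmin: a fresh tuple version was written

conn.autocommit = True               # VACUUM cannot run inside a transaction block
cur.execute("VACUUM users")          # what autovacuum does later: reclaim the dead tuple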

MongoDB (NoSQL, Document-Oriented)

Architecture Highlights:

Index Types:

Example Use Cases: Real-time analytics, IoT, content management, apps with fast-evolving schemas.

Additional Important Points

RDBMS vs. NoSQL:

| System | Data Model | Best For | Scaling | Transactions | Schema |
|---|---|---|---|---|---|
| MySQL | Tables/Rows | Traditional, structured data | Vertical/Horizontal | Strong/ACID | Rigid |
| PostgreSQL | Tables/Rows | Complex & extensible workloads | Vertical/Horizontal (via extensions) | Strong/ACID | Rigid |
| MongoDB | Documents | Unstructured, dynamic schemas | Native horizontal | Eventual | Flexible |

Distributed Systems

Database Sharding

Types of Sharding Architectures

1. Range-Based Sharding

2. Key-Based (Hash-Based) Sharding

3. Directory-Based Sharding

Horizontal Partitioning vs. Sharding

Advantages

Disadvantages

Hierarchical Sharding

Master-Slave Architecture for High Availability

Consistency

Consistency measures how up-to-date and synchronized distributed data copies are.

1. Linearizable Consistency

2. Eventual Consistency

3. Causal Consistency

4. Quorum Consistency

Data Consistency Levels Tradeoff Table

| Level | Consistency Level | Efficiency |
|---|---|---|
| Linearizable | High | Low |
| Eventual | Lowest | Highest |
| Causal | Between eventual & linearizable | Medium |
| Quorum | Configurable | Configurable |
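
The quorum rule is easy to state: with N replicas, W write acknowledgements, and R read responses, every read overlaps the latest write when R + W > N. A tiny sketch (replica values and version numbers are invented):

def is_strongly_consistent(n, w, r):
    # Read and write sets must intersect so every read touches at least one up-to-date replica.
    return r + w > n

print(is_strongly_consistent(n=3, w=2, r=2))   # True: classic quorum, reads overlap the last write
print(is_strongly_consistent(n=3, w=1, r=1))   # False: fast, but only eventually consistent

def quorum_read(replica_values, r):
    # Pick the newest of the first r replies, e.g. [(version, value), ...] (toy data).
    replies = sorted(replica_values[:r], reverse=True)
    return replies[0][1]

print(quorum_read([(17, "new"), (12, "old"), (9, "old")], r=2))   # "new"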

Linearizability in Databases

What is Linearizability?

Linearizability is a strong consistency guarantee for distributed systems and databases. It requires that all operations (reads and writes) on a data item appear to execute atomically and in some real-time, total order—that is, every operation takes effect instantly between its invocation and its response.

Key Properties

Visual Example

Suppose two clients interact with a key-value store:

Linearizability guarantees:

Client 3’s read will definitely return 20, since the write of 20 happened before the read started.

Time axis →  |---W(X=10)---|---W(X=20)---|---R(X)---|
                C1 writes      C2 writes    C3 reads

Comparison: Linearizability vs Other Guarantees

| Guarantee | Real-Time Order? | Atomic Visibility | Possible Stale Reads? |
|---|---|---|---|
| Linearizability | Yes | Yes | No |
| Sequential Consistency | No | Yes | Yes (within session) |
| Eventual Consistency | No | No | Yes |

Where is Linearizability Useful?

Trade-Offs

Replication Topologies

Replication topologies define how data is copied and synchronized across nodes in a distributed database system.

1. Single-Leader Replication (Master-Slave / Primary-Secondary)

Pros: - Strong consistency: All writes go through one node, simplifying state management and eliminating write conflicts. - Simpler conflict resolution: Easy to reason about ordering of updates. - High read scalability: Followers can serve read-only traffic, easing load on the leader. - Operational simplicity: More straightforward to implement and manage.

Cons: - Single point of failure: If the leader fails, writes are unavailable until failover. - Write bottleneck: Write throughput is limited to leader’s capacity. - Potential lag: Followers may be somewhat behind the leader (especially with asynchronous replication). - Failover complexity: Promoting a new leader adds operational challenges.

How is a New Leader Elected When the Leader Fails?

Single-leader replication relies on exactly one node (the leader/master/primary) for all writes. If this node fails, the system uses a standardized "failover" process to restore availability:

1. Detecting Leader Failure

2. Leader Election

3. Reconfiguration

Main Challenges in Leader Election and Failover

a. Detection Delay and False Positives

b. Consensus Complexity

c. Data Consistency

d. Split-Brain Scenario

e. Client Reconfiguration

f. Service Downtime

2. Multi-Leader Replication (Multi-Master / Active-Active)

Pros: - High availability: System remains writable even if some leaders fail. - Low latency for writes: Writes can be accepted at the nearest leader (useful for geo-distributed systems). - Geographic distribution: Clients connect to the closest leader, benefiting global applications. - No single write bottleneck: Write load is distributed

Cons: - Conflict resolution required: Concurrent updates to the same data from different leaders must be resolved (e.g., last-writer-wins, custom rules). - Implementation complexity: Conflict detection and resolution logic is required. - Increased chance of inconsistencies: Temporary data divergence is possible before conflicts are reconciled. - Topology management: Certain setups (circular/star) may have single points of failure in the data propagation path if not well-designed.

Common Conflict Resolution Strategies

When multiple leader nodes (servers) accept simultaneous writes to the same data in different locations, conflicts can arise when updates are exchanged and merged. Handling these conflicts efficiently and predictably is a key challenge:

CRDTs: Conflict-free Replicated Data Types

CRDTs are special data structures that enable multiple nodes to update data independently and concurrently. They are designed so that all copies converge to the same, correct state automatically, with no need for central coordination or manual conflict resolution.

Lamport Clocks

What is a Lamport Clock?

A Lamport clock is a simple, monotonic counter that helps order events in a distributed system. How it works: - Every node keeps a local counter. - The counter is incremented before every local event, and its current value is attached to every message sent. - On receiving a message, a node sets its counter to max(local, received) + 1. - As a result, if event A causally happened before event B, A's timestamp is smaller than B's.
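
A minimal sketch of those rules, with two in-process clocks standing in for two nodes:

class LamportClock:
    def __init__(self):
        self.time = 0

    def tick(self):                 # local event: increment the counter
        self.time += 1
        return self.time

    def send(self):                 # stamp an outgoing message with the current counter
        return self.tick()

    def receive(self, msg_time):    # on receive: jump past the sender's clock, then tick
        self.time = max(self.time, msg_time)
        return self.tick()

a, b = LamportClock(), LamportClock()
t_send = a.send()            # A: 1
b.tick(); b.tick()           # B: 2
t_recv = b.receive(t_send)   # B: max(2, 1) + 1 = 3, so the send orders before the receive
assert t_send < t_recv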

Why using real timestamp is not reliable?

Merkle Trees

Merkle Trees are hash-based tree data structures used for efficient content verification and synchronization in distributed systems, blockchains, and databases.
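
A minimal sketch of building a Merkle root and using it to detect that two replicas have diverged (the data blocks are invented; real systems then compare subtree hashes to narrow down exactly which ranges differ):

import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(blocks):
    level = [h(b) for b in blocks]              # leaf hashes
    while len(level) > 1:
        if len(level) % 2:                      # duplicate the last hash on odd-sized levels
            level.append(level[-1])
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

replica_a = [b"row1", b"row2", b"row3", b"row4"]
replica_b = [b"row1", b"row2", b"rowX", b"row4"]   # one divergent block

# Comparing a single root hash tells whether the replicas differ at all.
print(merkle_root(replica_a) == merkle_root(replica_a))   # True
print(merkle_root(replica_a) == merkle_root(replica_b))   # False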

Summary Table

| Topic | Purpose / Role | Key Advantage |
|---|---|---|
| Conflict Handling in Multi-Leader | Ensures consistency when updates collide | Multiple strategies suit various needs |
| CRDTs | Merge concurrent updates automatically, with no user conflict resolution | Guaranteed eventual convergence, high availability |
| Merkle Trees | Verify/synchronize large datasets efficiently and securely | Fast change detection, minimal data movement |

Note: For applications that absolutely require linearizability, multi-leader replication is generally not used, or extra coordination (like distributed locking or consensus) is added at the cost of performance and availability.

3. Leaderless Replication

Pros: - Highest availability: Any node can handle requests even if others are down. - No single point of failure: True resilience (as long as quorum is preserved). - Horizontal scalability: Both reads and writes can be scaled by adding more nodes. - Geographic flexibility: Well-suited for global architectures.

Cons: - Eventual consistency (not strong): Reads may return stale data until nodes converge. - Conflict management complexity: Requires version vectors, Merkle trees, or similar to resolve conflicts and reconcile divergent histories. - Operational complexity: More client logic needed for things like read repair, handling conflicting writes, and node resurrection. - Stronger consistency costs: Achieving strong consistency (by using strict quorums) can add latency, lower availability, or both.

Read Repair

How Read Repair Works

  1. Read Request: A client reads data with a given consistency level (e.g., quorum or majority) from multiple replicas.
  2. Data Comparison: The database compares the data (or digests/hashes) returned from each replica.
  3. Detect Inconsistency: If some replicas return different versions, an inconsistency is detected.
  4. Repair: The system identifies the correct (latest) version (often based on version/timestamp, "last write wins", or quorum rules).
    • The up-to-date value is written back to the stale replicas.
    • Repairs can be blocking (client waits for the repair before getting a response) or asynchronous (repair continues after returning the response).
  5. Return Data: The client receives the correct, most recent value.
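
A toy sketch of this flow (replicas are plain dicts and "latest" is decided by a version number; real systems use timestamps or vector clocks and may repair asynchronously):

# Each replica maps key -> (version, value); toy data with one stale replica.
replicas = [
    {"user:1": (3, "alice@new.example")},
    {"user:1": (3, "alice@new.example")},
    {"user:1": (2, "alice@old.example")},   # stale
]

def read_with_repair(key, r=3):
    contacted = replicas[:r]                               # replicas consulted for this read
    replies = [(rep[key], rep) for rep in contacted]
    latest = max(versioned for versioned, _ in replies)    # highest (version, value) wins
    for versioned, rep in replies:
        if versioned != latest:
            rep[key] = latest                              # repair: write the newest version back
    return latest[1]

print(read_with_repair("user:1"))     # "alice@new.example"
print(replicas[2]["user:1"])          # the stale replica has been repaired in place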

When and Why Is Read Repair Used?

Types of Read Repair

Pros

Cons

Visual Comparison Table

| Topology | Who Writes? | Scalability | Consistency | Fault Tolerance | Complexity | Example Databases |
|---|---|---|---|---|---|---|
| Single-Leader | Only the leader node | High (reads only) | Strong | OK (leader failover) | Low | MySQL, PostgreSQL, MongoDB, SQL Server |
| Multi-Leader | Any leader node | High (reads/writes) | Eventual / Custom | High | Medium-High | MySQL (Group Replication), MariaDB Galera, CouchDB, AWS Aurora (multi-master) |
| Leaderless | Any node | Highest (reads/writes) | Eventual* | Highest | High | Cassandra, DynamoDB, Riak |

*Note: Consistency in leaderless setups can be customized; strong consistency can be achieved at the cost of read/write performance or availability.

Quick Usage Guidance

Synchronization Between Leader and Follower Nodes in Database Replication

1. Change Propagation

2. Replication Methods

a. Synchronous Replication

b. Asynchronous Replication

c. Semi-Synchronous Replication

3. Application of Replicated Changes

4. Setting Up New Followers or Catching Up Lagging Followers

5. Special Cases: Log Shipping

6. Failure & Resynchronization

7. Heartbeat and Monitoring

Summary Table

| Process | Description |
|---|---|
| Change log/stream | Leader records all writes in an ordered log |
| Log shipping/replication | Followers fetch (or receive) change logs and apply them in order |
| Synchronous replication | Leader waits for follower confirmation before acknowledging the client |
| Asynchronous replication | Leader does NOT wait; followers catch up in the background |
| Semi-synchronous | At least one follower is synchronous; others are asynchronous |
| Recovery/catch-up | Followers replay missed logs on rejoin to become consistent |

Distributed System Protocols

Two-Phase Commit (2PC) Explained

Two-Phase Commit (2PC) is a distributed transaction protocol used to ensure that a transaction spanning multiple databases or nodes is either committed everywhere or rolled back everywhere, so all nodes remain consistent.

1. Coordinator and Participants

2. Phases

Phase 1: Prepare (Voting Phase)

Phase 2: Commit (Decision Phase)

Properties

Limitations

Summary

Two-Phase Commit ensures that distributed transactions are all-or-nothing and all involved nodes end up in a consistent state. However, it is prone to blocking under certain failure conditions.
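
A toy sketch of the two phases (participants are in-process objects; a real coordinator also durably logs its decision so it can finish the protocol after a crash):

class Participant:
    def __init__(self, name, can_commit=True):
        self.name, self.can_commit, self.state = name, can_commit, "working"

    def prepare(self):                    # phase 1: make the work durable, then vote
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self):                     # phase 2 outcome: commit
        self.state = "committed"

    def abort(self):                      # phase 2 outcome: abort
        self.state = "aborted"

def two_phase_commit(participants):
    votes = [p.prepare() for p in participants]   # phase 1: collect votes
    if all(votes):                                # phase 2: unanimous yes -> commit everywhere
        for p in participants:
            p.commit()
        return "committed"
    for p in participants:                        # any no -> abort everywhere
        p.abort()
    return "aborted"

print(two_phase_commit([Participant("orders-db"), Participant("billing-db")]))                     # committed
print(two_phase_commit([Participant("orders-db"), Participant("billing-db", can_commit=False)]))   # aborted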

Distributed Systems Consensus Algorithms

Raft Consensus Algorithm Explained

Key Ideas and Workflow

1. Roles

Each server in the cluster can be in one of three roles at any time: - Leader: Handles all client requests that modify data and manages log replication to followers. - Follower: Passive; responds to requests from the leader or candidates. - Candidate: Seeks election as leader if it doesn't hear from one.

2. Terms

3. Leader Election

4. Log Replication

5. Safety

6. Membership Changes

Advantages

Adoption

Summary

Raft brings reliability and strong consistency to distributed systems through clear leader election, log replication, and robust safety mechanisms, making it a de-facto standard for consensus in modern distributed architectures.