Database?
- An organized collection of structured data, typically stored electronically in computer systems.
- Most common databases use a table-based model (rows & columns) for efficient processing, querying, and management.
- Diagram:

| ID | Name  | Email     |
|---|---|---|
| 1  | Alice | alice@... |
| 2  | Bob   | bob@...   |

- SQL (Structured Query Language): The dominant language for querying, manipulating, and defining data in relational databases.
- NoSQL databases (e.g., MongoDB) emerged to handle large volumes of unstructured/semi-structured data with high speed, flexibility, and scalability.
Core Concepts
Database vs. File System: Extra Features Offered by Databases
When compared to using a direct file system for storing and retrieving data, databases provide numerous advanced features that improve data management, efficiency, and integrity.
- Data Redundancy Control and Integrity
  - Support for unique constraints and primary keys ensures no duplicate data.
  - Normalization techniques minimize redundant data storage.
  - Referential integrity via foreign keys prevents "orphan" or inconsistent data entries.
- Structured Query Language (SQL) / Query Support
  - Powerful and flexible query languages (like SQL/MQL) allow complex searching, filtering, aggregation, and joining data across tables—something not possible with basic file-system operations.
- Indexing for Faster Querying
  - Automatic/managed indexes (e.g., B-Tree, hash, full-text) drastically speed up data searches compared to brute-force file scans.
- Transaction Support and Atomicity
  - ACID properties (Atomicity, Consistency, Isolation, Durability) guarantee reliable multi-step updates (transactions) and prevent partial changes due to errors.
  - Rollback and commit mechanisms for safe data modification.
- Concurrency and Multi-user Support
  - Mechanisms like locking, MVCC (Multi-Version Concurrency Control), and transaction isolation enable safe access for multiple simultaneous users/processes, preventing overwrites and conflicts.
- Automatic Crash Recovery and Durability
  - Write-Ahead Logging (WAL) and transaction logs enable recovery to the last consistent state after system failure.
- Data Abstraction and Logical Views
  - Schemas, views, and data abstraction layers allow users to see data in logical formats without concern for how it’s physically stored.
- Data Validation and Access Controls
  - Type validation, default values, and custom triggers enforce data quality.
  - Robust user authentication, authorization, and role-based permissions restrict who can do what.
- Backup, Restore, and Replication
  - Built-in tools for automatic backup and restore.
  - Replication and failover features for high availability and disaster recovery.
- Joins and Relationships
  - Ability to create, manage, and query relationships across multiple data sets (tables)—unlike independent files.
- Data Consistency Guarantees
  - Built-in mechanisms for consistency (e.g., constraints, triggers, transactions) ensure that all data follows rules and business logic.
| Feature | Database | File System |
|---|---|---|
| Indexes/fast search | ✅ | ❌ |
| ACID transactions | ✅ | ❌ |
| Redundancy control | ✅ | ❌ |
| Query language (SQL/MQL) | ✅ | ❌ |
| Relationships/joins | ✅ | ❌ |
| Schema/data typing | ✅ | ❌ |
| Access control/auth | ✅ | Basic (weak) |
| Backup/restore tools | ✅ | Manual |
| Data validation | ✅ | ❌ |
| Multi-user concurrency | ✅ | Basic/file locks |
| Crash recovery | ✅ | Minimal |
How Databases Store Data Persistently on Disk
Databases store data persistently on disk using specialized components called storage engines which manage the physical layout, retrieval, and durability of data.
- Storage Engine:
  - Core component responsible for managing how data is stored, retrieved, and indexed on disk or in memory.
  - Each storage engine uses specific optimizations based on workload type (e.g., transactional, read-heavy).
  - Modified (dirty) pages are flushed asynchronously to disk files by the storage engine.
- Data Organization:
  - Data is stored in files on disk, which the operating system treats as opaque byte streams.
  - Inside these files, data is organized into fixed-size units called pages or blocks (commonly around 4 KB).
  - Pages hold rows/records in a format specific to the DBMS.
  - Files on disk represent tablespaces, indexes, and internal system metadata.
- Indexes (like B-trees) are also stored as disk structures, organized into pages for efficient querying and updating.
- Buffer Pool: An in-memory cache in the DBMS that holds pages read from disk. When the database engine needs data, it loads the corresponding pages into memory from disk via the buffer pool, performs operations, then eventually writes modified pages back to disk (a minimal cache sketch follows this list).
- Write-Ahead Logging (WAL) / Transaction Logs:
  - To guarantee durability and crash recovery, changes are first recorded in a log before being applied to the main data files.
  - This ensures that if a crash occurs, the DBMS can replay the logs to restore consistency.
  - Upon startup or recovery, the database engine reads from data files and transaction logs to restore its in-memory state.
- Atomic Writes and Page Management:
  - The hardware guarantees atomic writes at the page/block level (e.g., 4 KB).
  - If a database page spans multiple hardware pages, the DBMS handles partial writes carefully to avoid corruption.
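The buffer pool described above behaves like a bounded page cache with an eviction policy. Below is a minimal, illustrative Python sketch of an LRU-style buffer pool; the page size, the `read_page_from_disk` stub, and the class name are invented for this example and do not mirror any particular DBMS.

```python
from collections import OrderedDict

PAGE_SIZE = 4096  # illustrative fixed page size in bytes

def read_page_from_disk(page_id: int) -> bytes:
    # Stand-in for a real disk read of one fixed-size page.
    return bytes(PAGE_SIZE)

class BufferPool:
    """Toy LRU page cache: keeps at most `capacity` pages in memory."""
    def __init__(self, capacity: int = 3):
        self.capacity = capacity
        self.pages = OrderedDict()   # page_id -> page bytes, in LRU order
        self.dirty = set()           # page_ids modified in memory

    def get_page(self, page_id: int) -> bytes:
        if page_id in self.pages:
            self.pages.move_to_end(page_id)        # mark as most recently used
            return self.pages[page_id]
        page = read_page_from_disk(page_id)        # cache miss: load from disk
        self.pages[page_id] = page
        if len(self.pages) > self.capacity:        # evict the least recently used page
            victim, _ = self.pages.popitem(last=False)
            if victim in self.dirty:
                self.dirty.discard(victim)         # a real engine would flush it first
        return page

pool = BufferPool(capacity=2)
for pid in [1, 2, 1, 3]:          # page 2 is least recently used and gets evicted
    pool.get_page(pid)
print(list(pool.pages))           # [1, 3]
```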
Storage Engine
- Definition: Pluggable component that manages how data is stored, retrieved, and indexed on disk.
- Provides trade-offs in performance, features, and reliability (e.g., transactional vs. read-heavy workloads).
- Example: In MySQL, you can specify `ENGINE=InnoDB` or `ENGINE=MyISAM` when creating tables.
| Database/System | Storage Engine Approach | Persistence Details |
|---|---|---|
| MySQL InnoDB | Uses tablespaces with page-oriented storage. | Uses tables, indexes, and logs with WAL for crash recovery. Supports row-level locking and transactions. |
| PostgreSQL | Single unified storage engine with heap files. | Stores data and indexes as files; uses WAL for durability and MVCC for concurrency. |
| MongoDB | Document storage in BSON format on disk. | Uses data files and journal logs; supports replica sets and sharding for durability and scale. |
Additional Mechanisms
- Checkpointing: Periodically writes all in-memory changes to disk to reduce recovery time.
- Crash Recovery: Uses WAL and checkpoints to restore data to a consistent state after system failures.
- Memory Management: Storage engines manage caching and eviction policies to balance performance and durability.
Row-Oriented Storage
- Data is stored row by row: All attribute values for a single record are stored together on disk.
- Storage: `Alice,23,Mumbai | Bob,28,Delhi | ...`
Pros
- Efficient for OLTP (Online Transaction Processing) workloads:
- Reading/writing entire records is fast (e.g., retrieving or updating a customer profile).
- Easy to insert/update/delete full records.
- Great for scenarios where queries touch most or all columns of a row.
Cons
- Inefficient for analytics: Scanning a single column requires loading all column data (wastes I/O bandwidth).
- Compression less effective: Since different data types and ranges are mixed together.
Use Cases
- Transaction-heavy applications:
- Banking systems
- E-commerce order management
- User profile management (retrieve full user records)
- Any CRUD (Create, Read, Update, Delete) heavy workload
Column-Oriented Storage
- Data is stored column by column: All values for a given column are stored together on disk, separate from other columns.
- Storage:
  - `name: Alice, Bob, ...`
  - `age: 23, 28, ...`
  - `city: Mumbai, Delhi, ...`
Pros
- Efficient for OLAP (Online Analytical Processing) workloads:
- Fast aggregation (e.g., sum, avg) on single columns — no need to load unnecessary data.
- Can scan only the needed columns, skipping others, reducing disk I/O.
- Highly compressible: Similar data types in a column compress extremely well (e.g., bitmaps, run-length encoding).
- Faster columnar analytics and reports.
Cons
- Slower for row-level operations: Reading/updating a full record requires multiple disk reads (one per column).
- Not ideal for heavy transactional or highly selective single-record updates/inserts.
Use Cases
- Analytical and reporting applications:
- Data warehouses, business intelligence
- Real-time analytics and dashboards
- Large-scale time series engines
- Log analytics (e.g., clickstream, telemetry)
- Any workload where queries read few columns but many rows
Comparison Table
| Feature/Aspect | Row-Oriented Storage | Column-Oriented Storage |
|---|---|---|
| Data layout | Row by row | Column by column |
| Best for | OLTP (transactions, CRUD) | OLAP (analytics, aggregates, scan) |
| Full row read/write | Fast | Slow (needs to stitch columns) |
| Single column read | Slow (load all columns per row) | Very fast (scan only relevant column) |
| Compression | Moderate | Very high (similar values) |
| Example DBs | MySQL, PostgreSQL, Oracle | Apache Parquet, Amazon Redshift, ClickHouse, Google BigQuery, Apache Cassandra (hybrid) |
Mixed/Hybrid Approaches
- Some modern systems (e.g., SQL Server with columnstore indexes, Cassandra) allow both row and column-oriented storage or indexes to optimize mixed-use workloads.
Summary:
- Row-oriented storage = best for transactional workloads with frequent full-row accesses.
- Column-oriented storage = best for analytical workloads with queries on selective columns across many rows.
Choose the storage model that matches your primary query and update patterns for optimal performance.
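To make the layout difference concrete, here is a small illustrative Python sketch that stores the same records row-wise and column-wise; the sample data simply extends the Alice/Bob examples above.

```python
# Row-oriented: each record's values are kept together (good for full-row reads/writes).
rows = [
    ("Alice", 23, "Mumbai"),
    ("Bob",   28, "Delhi"),
    ("Carol", 31, "Pune"),
]

# Column-oriented: all values of one attribute are kept together (good for scans/aggregates).
columns = {
    "name": ["Alice", "Bob", "Carol"],
    "age":  [23, 28, 31],
    "city": ["Mumbai", "Delhi", "Pune"],
}

# OLTP-style access: fetch one full record -> one contiguous row read.
print(rows[1])                                     # ('Bob', 28, 'Delhi')

# OLAP-style access: average age -> touches only the 'age' column, not names/cities.
print(sum(columns["age"]) / len(columns["age"]))   # 27.33...
```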
Relational Databases
1. Keys in RDBMS
Keys uniquely identify rows and establish relationships.
Primary Key
- Definition: Uniquely identifies each record in a table.
- Properties:
- Must be unique and NOT NULL.
- Only one per table.
Foreign Key
- Definition: A field (or set) in one table that refers to the primary key of another table.
- Purpose: Maintains referential integrity between tables.
Candidate Key
- Definition: Any field or minimal set of fields that can uniquely identify a record; a table can have multiple.
Alternate Key
- Definition: Candidate keys not chosen as the primary key.
Composite Key
- Definition: Combines two or more columns to uniquely identify a record.
Super Key
- Definition: Any set of attributes that uniquely identifies a record; typically a candidate key plus zero or more additional attributes.
2. Integrity Constraints
Ensure data is accurate and consistent.
- Entity Integrity: No primary key can have NULL values.
- Referential Integrity: Foreign key values must match primary key values in the referenced table or be NULL.
- Unique Constraint: Ensures all values in a column or set of columns are unique.
- Not Null Constraint: Field must have a value (cannot be NULL).
- Check Constraint: Restricts values based on a condition (e.g., `age >= 18`).
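As a concrete illustration of these constraints, the sketch below declares them on a toy SQLite schema (SQLite is used only because it ships with Python; the table and column names are made up).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")              # enable FK enforcement in SQLite

conn.execute("""
    CREATE TABLE departments (
        id   INTEGER PRIMARY KEY,                     -- entity integrity: unique, NOT NULL
        name TEXT NOT NULL UNIQUE                     -- NOT NULL + UNIQUE constraints
    )
""")
conn.execute("""
    CREATE TABLE employees (
        id      INTEGER PRIMARY KEY,
        email   TEXT NOT NULL UNIQUE,
        age     INTEGER CHECK (age >= 18),            -- CHECK constraint
        dept_id INTEGER REFERENCES departments(id)    -- referential integrity
    )
""")

conn.execute("INSERT INTO departments VALUES (1, 'Engineering')")
conn.execute("INSERT INTO employees VALUES (1, 'a@x.com', 30, 1)")

try:
    conn.execute("INSERT INTO employees VALUES (2, 'b@x.com', 15, 1)")  # violates CHECK
except sqlite3.IntegrityError as e:
    print("rejected:", e)
```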
3. ON DELETE modes
- ON DELETE CASCADE:
  - What it does: Deletes all dependent rows in the child table when the parent row is deleted.
  - Use case: Ensure referential integrity by removing related data automatically.
  - Example: In an `orders` (child) and `customers` (parent) table, deleting a customer removes all their orders.
  - SQL: `FOREIGN KEY (customer_id) REFERENCES customers(id) ON DELETE CASCADE`
  - Pros: Maintains consistency; no orphaned records.
  - Cons: Can unintentionally delete large amounts of data.
- ON DELETE SET NULL:
  - What it does: Sets the foreign key column in the child table to `NULL` when the parent row is deleted.
  - Use case: Preserve child records but remove the parent reference if optional.
  - Example: Deleting a customer sets `customer_id` to `NULL` in `orders` (if `customer_id` allows NULL).
  - SQL: `FOREIGN KEY (customer_id) REFERENCES customers(id) ON DELETE SET NULL`
  - Pros: Keeps child data; useful for optional relationships.
  - Cons: Requires a nullable foreign key; may complicate queries.
- ON DELETE SET DEFAULT:
  - What it does: Sets the foreign key column in the child table to its default value when the parent row is deleted.
  - Use case: Assign a fallback reference (e.g., a default parent ID).
  - Example: Deleting a customer sets `customer_id` in `orders` to a default customer ID (e.g., a generic "unknown" customer).
  - SQL: `FOREIGN KEY (customer_id) REFERENCES customers(id) ON DELETE SET DEFAULT`
  - Pros: Maintains a reference to a valid parent; preserves data.
  - Cons: Requires a valid default value; less common.
- ON DELETE RESTRICT (or NO ACTION):
  - What it does: Prevents deletion of the parent row if dependent rows exist in the child table.
  - Use case: Enforce strict referential integrity by blocking deletes.
  - Example: Cannot delete a customer if they have existing orders.
  - SQL: `FOREIGN KEY (customer_id) REFERENCES customers(id) ON DELETE RESTRICT`
  - Pros: Protects data integrity; prevents accidental deletions.
  - Cons: Requires manual cleanup of child rows before deleting the parent.
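For a runnable illustration of the CASCADE mode, here is a minimal SQLite-based sketch (SQLite needs `PRAGMA foreign_keys = ON`; the schema is the hypothetical customers/orders pair from the examples above).

```python
import sqlite3

# Sketch of ON DELETE CASCADE on the hypothetical customers/orders schema.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")          # SQLite enforces FKs only when enabled
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("""
    CREATE TABLE orders (
        id          INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(id) ON DELETE CASCADE,
        item        TEXT
    )
""")
conn.execute("INSERT INTO customers VALUES (1, 'Alice')")
conn.execute("INSERT INTO orders VALUES (10, 1, 'Book')")

conn.execute("DELETE FROM customers WHERE id = 1")                  # delete the parent row
print(conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0])    # 0: child rows cascaded
```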
Additional Notes
- Default Behavior: If no `ON DELETE` clause is specified, databases like MySQL default to `RESTRICT`/`NO ACTION`, while others may require explicit declaration.
- Database Support: Not all databases support all modes (e.g., `SET DEFAULT` is not supported by MySQL's InnoDB).
- Performance: `CASCADE` can be slower for large datasets; `RESTRICT`/`NO ACTION` adds checks.
- Choosing a Mode: Depends on business logic (e.g., `CASCADE` for tightly coupled data, `SET NULL` for optional links).
4. Types of Attributes
Simple Attribute
- Cannot be divided further. Example: `age`
Composite Attribute
- Can be divided. Example: `address` → `{street, city, state}`
Single-valued Attribute
- One value per entity. Example: `date_of_birth`
Multi-valued Attribute
- Multiple values per entity. Example: `phone_numbers`
Derived Attribute
- Computed from other attributes. Example: `age` (derived from `date_of_birth`)
Prime Attribute
- An attribute that is part of at least one candidate key of a relation (table).
Non-Prime Attribute
- An attribute that is not part of any candidate key.
5. Relationships in Databases
Define associations between entities.
- One-to-One (1:1): Each entity in A relates to a single entity in B. Example: Each person has one passport.
- One-to-Many (1:N): One entity in A relates to multiple entities in B. Example: One teacher teaches many students.
- Many-to-Many (M:N): Entities in A relate to multiple entities in B and vice versa. Example: Students and courses.
6. Participation & Mapping
| Concept | Description | Diagram Notation | Impact on Table Design |
|---|---|---|---|
| Total Participation | Every entity instance participates in relationship | Double line | Enforce NOT NULL on foreign key |
| Partial Participation | Some instances participate, others may not | Single line | Foreign key may be nullable |
| Entity Mapping | Entity → Table | — | Table with attributes, primary key |
| 1:N Relationship | "One" PK as FK in "Many" table | — | Foreign key in "Many" table |
| M:N Relationship | New table for relationship, includes both PKs as FKs | — | Join table with composite PK |
Normalization in Databases
What is Normalization?
- Normalization is the process of organizing data in a database to:
  - Minimize redundancy (duplicate data).
  - Remove undesirable dependencies (partial, transitive, multi-valued).
  - Improve data integrity.
- Achieved by dividing large tables into smaller, related tables and defining relationships among them.
Why is it Required?
- Prevents anomalies (insertion, update, deletion problems).
- Saves storage by removing duplicate data.
- Ensures data is logically stored and consistent.
- Simplifies maintaining, updating, and querying databases.
Normal Forms (Short Explanations with Examples)
1NF (First Normal Form)
- Rule: Each column contains only atomic (indivisible) values; no repeating groups or arrays.
- Why: Ensures each field holds single values—no sets or lists.
- Example (violates 1NF):

| Student | Subjects      |
|---|---|
| Alice   | Math, Physics |

- Fix: Split into separate rows, one subject per row.
2NF (Second Normal Form)
- Rule: Table is in 1NF AND all non-key attributes are fully functionally dependent on the entire primary key (no partial dependency).
- Partial Dependency: A non-prime attribute depends on only part (a proper subset) of a composite candidate key.
- Why: Removes partial dependency (especially in tables with composite keys).
- Example: In `Enrollment(student_id, course_id, student_name, grade)`, `student_name` depends only on `student_id`, not on the full key (`student_id`, `course_id`).
- Fix: Move `student_name` to a separate Student table.
3NF (Third Normal Form)
- Rule: Table is in 2NF AND all columns are directly dependent ONLY on the primary key (no transitive dependency).
- Why: Removes columns dependent on other non-key columns.
- Example: In `Employee(emp_id, dept_id, dept_name, emp_name)`, `dept_name` depends on `dept_id`, not directly on `emp_id`.
- Fix: Create a separate `Department(dept_id, dept_name)` table.
BCNF (Boyce-Codd Normal Form)
- Rule: Table is in 3NF AND for every (non-trivial) functional dependency X → Y, X is a super key.
- Why: Handles situations where 3NF still leaves certain types of anomalies due to composite or overlapping candidate keys.
- Example:
A table with columns (course, instructor, room) where instructors and rooms each uniquely determine the course, but neither is a full primary key. - Fix: Restructure tables so that only super keys determine other attributes.
4NF (Fourth Normal Form)
- Rule: Table is in BCNF AND no multi-valued dependencies (MVDs) exist other than a candidate key.
- Why: Prevents two or more independent multi-valued facts about an entity from coexisting in one table.
- Example:

| Student | Hobby    | Language |
|---|---|---|
| Alice   | Painting | English  |
| Alice   | Painting | French   |
| Alice   | Music    | English  |

- Hobby and Language are independent and should be split into separate tables.
Summary
| Normal Form | Main Rule | Problem Solved |
|---|---|---|
| 1NF | Atomic values, no repeating groups | No arrays/multiple values |
| 2NF | 1NF + no non-prime attribute depends on part of a candidate key | No partial dependency |
| 3NF | 2NF + no non-prime attribute depends on another non-prime attribute | No transitive (indirect) dependency |
| BCNF | Every determinant is a super key | Stronger than 3NF |
| 4NF | BCNF + No multi-valued dependencies | No multi-valued facts overlap |
Transactions and Concurrency Control in Databases
Transaction
- Definition:
A transaction is a sequence of one or more database operations (like insert, update, delete) executed as a single logical unit of work.
Example:
Transferring money from Account A to Account B requires:
1. Deduct amount from Account A.
2. Add amount to Account B.
Both actions must succeed together, or neither should occur.
ACID Properties
ACID ensures database transactions are processed reliably.
1. Atomicity
- All-or-nothing execution:
- All operations in a transaction succeed, or none do.
- If failure occurs, all interim changes are rolled back.
- Example: If failure happens between deducting from A and crediting B, the deduction is undone.
2. Consistency
- The database must always move from one valid state to another.
- All integrity constraints (e.g., unique, foreign key) must be preserved.
- If a transaction violates a constraint, it is rolled back.
3. Isolation
- Concurrent transactions do not affect each other’s intermediate state.
- Ensures the final DB state is the same as if the transactions had executed one at a time, sequentially, without interleaving.
- Prevents problems like "dirty reads," "non-repeatable reads," and "phantom reads."
- Isolation Levels:
- Read Uncommitted
- Read Committed
- Repeatable Read
- Serializable (most strict)
- Example: Two simultaneous transfers don't see each other's partial changes.
4. Durability
- Once committed, changes are permanent—even in the event of system crashes or power failures.
- Achieved via transaction logs (WAL) and secure disk writes.
- Example: After receiving a success confirmation, bank transfers can't be "lost" due to crash.
Visual Table: ACID Properties
| Property | Purpose | Guarantee | Example |
|---|---|---|---|
| Atomicity | All-or-nothing | No partial results saved | Debit & credit in a transfer happen together or not at all |
| Consistency | Preserves rules/constraints | Only valid data states | Balance never goes negative |
| Isolation | Transactions don't interfere | No dirty or partial reads | Two transfers don’t corrupt each other |
| Durability | Once done, always persisted | Survives crashes/failures | Money transfer is never lost |
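As a small runnable illustration of atomicity, the sketch below uses SQLite from Python: either both legs of a transfer commit, or the whole transaction rolls back (account names, balances, and the overdraft rule are made up for the example).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER CHECK (balance >= 0))")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("A", 100), ("B", 50)])
conn.commit()

def transfer(amount: int) -> None:
    try:
        with conn:  # context manager commits on success, rolls back on exception
            conn.execute("UPDATE accounts SET balance = balance + ? WHERE name = 'B'", (amount,))
            conn.execute("UPDATE accounts SET balance = balance - ? WHERE name = 'A'", (amount,))
    except sqlite3.IntegrityError:
        print("transfer rolled back")   # overdraft on A violates the CHECK constraint

transfer(30)    # succeeds: A=70, B=80
transfer(500)   # debit on A fails, so the already-applied credit to B is rolled back too
print(conn.execute("SELECT * FROM accounts ORDER BY name").fetchall())  # [('A', 70), ('B', 80)]
```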
Transaction Isolation Levels in Databases
Isolation levels control how and when the effects of one transaction become visible to others, balancing data consistency and concurrency.
Key Phenomena in Concurrency
- Dirty Read: Reading data modified by another transaction but not yet committed.
- Non-Repeatable Read: Reading the same row twice yields different results due to another transaction’s modifications.
- Phantom Read: A transaction re-executes a query and finds new rows (matching the criteria) added or deleted by another transaction.
Standard SQL Isolation Levels
1. Read Uncommitted
Description: - Lowest isolation. - Transactions can read uncommitted changes from others (dirty reads allowed).
Problems/Anomalies: - Dirty reads: Possible. - Non-repeatable reads: Possible. - Phantom reads: Possible.
Example: - T1 updates a row but hasn't committed. T2 reads this uncommitted value. T1 rolls back, so T2 used a value that never existed.
Advantages: - Maximum concurrency and performance. - Minimal locking and wait times.
Disadvantages: - Data can be inconsistent and unreliable. - Should be used only when accuracy is non-critical (e.g., big analytic scans).
2. Read Committed
Description: - A transaction can only read committed data. - Default in many DBMS (e.g., SQL Server, Oracle).
Problems/Anomalies: - Dirty reads: Prevented. - Non-repeatable reads: Possible. - Phantom reads: Possible.
Example: - T1 reads a row. T2 updates and commits the row. If T1 reads again, it sees a new value (non-repeatable read).
Advantages: - Prevents viewing uncommitted ("dirty") data. - Good balance between concurrency and data integrity.
Disadvantages: - Value of repeated reads can change. - Phantom rows can appear/disappear between queries.
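A minimal sketch of a non-repeatable read at this level, assuming a locally running PostgreSQL with a pre-existing `accounts(id, balance)` table, a `test` database, and the `psycopg2` package (all of these are assumptions for illustration):

```python
# Sketch only: assumes a local PostgreSQL, a database named "test", an existing table
# accounts(id int, balance int) with a row id=1, and the psycopg2 package installed.
import psycopg2

conn1 = psycopg2.connect("dbname=test")   # session running transaction T1
conn2 = psycopg2.connect("dbname=test")   # session running transaction T2
cur1, cur2 = conn1.cursor(), conn2.cursor()

cur1.execute("SET TRANSACTION ISOLATION LEVEL READ COMMITTED")  # also the default level
cur1.execute("SELECT balance FROM accounts WHERE id = 1")
first_read = cur1.fetchone()[0]           # T1's first read of the row

cur2.execute("UPDATE accounts SET balance = balance + 100 WHERE id = 1")
conn2.commit()                            # T2 commits a change to the same row

cur1.execute("SELECT balance FROM accounts WHERE id = 1")
second_read = cur1.fetchone()[0]          # T1 sees the new value: a non-repeatable read
print(first_read, second_read)            # values differ within the same transaction
conn1.commit()
```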
3. Repeatable Read
Description:
- Ensures that all rows read by a transaction remain stable and cannot be changed by other concurrent transactions until the transaction completes.
- This means every read of the same row within the transaction returns the same data (i.e., reads are repeatable).
- However, depending on the database implementation, Repeatable Read may or may not provide a full consistent snapshot of the entire database state throughout the transaction.
- In some systems, it guarantees repeatable reads only for rows already read (leaving the possibility of phantom reads), while in others (especially MVCC-based databases), it behaves more like full snapshot isolation.
- Default in MySQL InnoDB.
Problems/Anomalies: - Dirty reads: Prevented. - Non-repeatable reads: Prevented. - Phantom reads: Possible (unless further mechanisms like next-key locking are used).
Example:
- T1 reads row X. T2 attempts to update or delete row X and must wait (blocked) until T1 ends.
- T1 still may see new rows matching a WHERE clause (phantom reads) if another transaction inserts them.
Advantages: - Ensures repeated reads of the same row always return the same data. - Good for workflows needing strong consistency per row.
Disadvantages: - Higher locking overhead. - Phantom reads remain possible.
4. What is Snapshot Isolation?
- Definition:
  Snapshot Isolation is an advanced isolation level where each transaction operates on a "snapshot" (a consistent view) of the database as it existed at the start of the transaction.
- Key characteristics:
  - All reads within a transaction see only committed data as it was at the transaction’s start time — later changes by other transactions are hidden until commit.
  - Transactions do not block each other for reads/writes using heavy locks; instead, the DB engine maintains multiple versions of rows (“row versioning”) to support snapshots.
  - Write–write conflicts: If two concurrent transactions try to update the same row, only the first committer “wins” and the other is rolled back (to prevent lost updates).
  - Eliminates dirty reads, non-repeatable reads, and most phantom reads, but is not fully serializable (write skew anomalies are possible).
- Write skew is a concurrency anomaly that can occur when:
  - Two or more concurrent transactions read overlapping data (i.e., some shared rows).
  - Each transaction makes decisions based on the data it read.
  - Then, they update different (non-overlapping) parts of the data.
  - Because they work on snapshots taken at the transaction start, neither transaction sees the other's concurrent update.
  - When both commit, the overall database state can become inconsistent or violate integrity constraints that should be enforced.
- Example: Suppose
  - T1 and T2 start at the same time (their snapshot: DB state at t0).
  - T1 updates `row A`; T2 updates `row B`.
  - Both commit — allowed, as they're updating different rows.
  - But if both try to update `row A`, one will be forced to roll back.
5. Serializable
Description: - Highest (strictest) standard isolation. - Serializes access: transactions are executed with the effect as if run one after another, never concurrently.
Problems/Anomalies: - Dirty reads: Prevented. - Non-repeatable reads: Prevented. - Phantom reads: Prevented.
Example: - T1 reads all accounts with balance >₹1000. T2 cannot insert/delete qualifying rows until T1 completes.
Advantages: - Complete isolation, no concurrency anomalies at all. - Ideal for critical financial/accounting systems.
Disadvantages: - Lowest concurrency—may block or serialize many transactions, reducing performance. - Deadlocks or transaction retries can increase.
Comparison Table
| Isolation Level | Dirty Reads | Non-Repeatable Reads | Phantom Reads | Performance | Typical Use Case |
|---|---|---|---|---|---|
| Read Uncommitted | Possible | Possible | Possible | Fastest | Low-accuracy analytics |
| Read Committed | Prevented | Possible | Possible | High | General OLTP; moderate business |
| Repeatable Read | Prevented | Prevented | Possible | Moderate | Financial reads, inventory |
| Snapshot Isolation | Prevented | Prevented | Mostly prevented (write skew possible) | Moderate | High-concurrency reads without blocking writers |
| Serializable | Prevented | Prevented | Prevented | Slowest | Critical integrity/finance apps |
6. Multi-Version Concurrency Control (MVCC)
- MVCC (Multi-Version Concurrency Control) is a database concurrency control method that allows multiple transactions to access data simultaneously without interfering with each other by maintaining several versions of each data record.
- It is widely used in modern relational (e.g., PostgreSQL, MySQL/InnoDB) and NoSQL databases to maximize concurrent access and system throughput.
- It is foundational for modern high-traffic, multi-user systems.
How MVCC Works: Core Principles
- Versioning of Records:
  - Every row (record) in the database has multiple versions, each tagged with metadata (like a transaction ID, timestamp, etc.).
  - When a transaction changes a row, the DBMS creates a new version rather than modifying the row in place.
- Snapshots for Transactions:
  - When a transaction begins, it sees a snapshot of the database reflecting the committed state at that start moment.
  - Reads are always performed on the snapshot, ensuring a consistent view even as other transactions are modifying data.
- Non-Blocking Reads and Writes:
  - Readers never block writers and writers never block readers.
  - While a write is in progress, readers continue to see the old version of data; once the transaction commits, new transactions see the new version.
  - Only lightweight locks or internal synchronization may be used for operations like index updates (not for the row data itself).
- Version Clean-up (Garbage Collection):
  - As old versions of data become unneeded (when no active transaction could see them), a background process removes them to reclaim space (e.g., VACUUM in PostgreSQL).
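To illustrate version chains and snapshot visibility, here is a deliberately tiny toy sketch in Python; transaction IDs are just increasing integers, and locking, conflict detection, and garbage collection are all omitted.

```python
import itertools

class ToyMVCC:
    """Each key maps to a list of (commit_xid, value) versions; readers use a snapshot xid."""
    def __init__(self):
        self.versions = {}                 # key -> [(commit_xid, value), ...]
        self._xids = itertools.count(1)

    def begin(self) -> int:
        return next(self._xids)            # snapshot point for the new transaction

    def write(self, key, value) -> None:
        # A committed write appends a new version instead of overwriting in place.
        xid = next(self._xids)
        self.versions.setdefault(key, []).append((xid, value))

    def read(self, key, snapshot_xid):
        # Return the newest version committed at or before the reader's snapshot.
        visible = [(x, v) for x, v in self.versions.get(key, []) if x <= snapshot_xid]
        return visible[-1][1] if visible else None

db = ToyMVCC()
db.write("balance", 100)                 # committed before the reader starts
reader = db.begin()                      # reader's snapshot taken here
db.write("balance", 140)                 # a later committed write is invisible to the reader
print(db.read("balance", reader))        # 100 -> consistent snapshot
print(db.read("balance", db.begin()))    # 140 -> a new transaction sees the new version
```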
Benefits of MVCC
- High concurrency: Readers don't wait for writers and vice versa.
- Reduced lock contention: Fewer deadlocks and waiting compared to strict locking.
- Consistent, repeatable reads: Each transaction sees a stable snapshot, making logic simpler for developers.
- Improved user experience: Fast, non-blocking reads in heavy workloads.
MVCC Types and Details
- Timestamp-based MVCC: Each transaction/version is timestamped; what you read depends on the timestamp.
- Snapshot-based MVCC: Each transaction sees a snapshot of the database as it was at its start time.
- History-based MVCC: Every modification is recorded, simplifying rollbacks.
- Hybrid MVCC: Combines approaches for flexibility and performance.
Write Conflicts
- When two concurrent transactions update the same row, one might be aborted or asked to retry to avoid lost updates. MVCC handles these via version/timestamp checks at commit.
Trade-offs & Limitations
- Storage Overhead: Extra disk/memory required to store multiple versions of rows.
- Cleanup Required: Old/dead row versions accumulate and must be periodically cleaned.
- Not Always Strictly Serializable: MVCC commonly provides snapshot isolation, which is strong but not fully serializable (write skew is possible).
Real-World Example
Suppose two bank employees are working on Bob's account:
- Alice reads Bob's balance ($100) and starts a transaction to withdraw $50.
- Meanwhile, Bob deposits $40; his transaction creates a new version ($140) and commits.
- Alice's snapshot still shows $100, so her withdrawal would write a version based on stale data ($50) and silently lose Bob's deposit.
- With MVCC's commit-time version/timestamp check (first committer wins), Alice's transaction is detected as conflicting with Bob's committed update of the same row and is aborted or retried; on retry she sees $140, and the balance correctly ends at $90.
MVCC vs. Traditional Locking
| Aspect | MVCC | Pessimistic Locking |
|---|---|---|
| Reads | Non-blocking, consistent snapshot | May block on write locks |
| Writes | Create new versions; may abort on conflict | Must acquire locks; blocks others |
| Concurrency | High | Lower (due to lock contention) |
| Deadlocks | Rare (except for metadata/index ops) | More frequent with many writers/readers |
MVCC in Practice
- PostgreSQL: Requires vacuuming to reclaim dead tuples.
- MySQL/InnoDB: Uses undo logs to store old versions.
- NoSQL (Cassandra, etc.): Uses similar multi-version logic for distributed consistency and high-throughput.
Additional Notes
- Isolation vs. Performance Tradeoff: Higher isolation = fewer anomalies, but lower concurrency and higher resource usage.
- MVCC (Multi-Version Concurrency Control): Many DBs (Postgres, MySQL) use MVCC so read locks are minimized, allowing high concurrency at higher isolation levels.
- Practical Use: Use the lowest isolation level that fits your data consistency needs, to maximize performance.
Recoverability and Serializability of Schedules in DBMS
Database schedule management balances concurrent execution (performance) with data consistency and recoverability.
1. Recoverability of Schedules
Recoverability ensures that if a transaction fails, no other transaction commits based on the failed (and rolled back) transaction’s output; thus, the database can be safely recovered to a consistent state.
Types of Schedules Based on Recoverability
a. Recoverable Schedule
- A schedule is recoverable if every transaction that reads data written by another transaction commits only after the writer transaction commits.
- Example:
  - T1: W(x) ... C1
  - T2: R(x) ... C2
  - Here, T2 reads x written by T1 and C2 happens after C1. If T1 fails and is rolled back, T2 can be rolled back safely.
- Why Needed: To prevent scenarios where a committed transaction depends on uncommitted (possibly rolled-back) data.
b. Cascading Rollbacks
- If a transaction is rolled back, all dependent transactions (which read its uncommitted changes) must also be rolled back, potentially causing a cascade of aborts.
- Example:
  - T1 writes x.
  - T2 reads x and writes y.
  - T3 reads y.
  - If T1 aborts, T2 and T3 must also abort.
- Problem: Leads to complex recovery and data loss.
c. Cascadeless/Avoid Cascading Abort Schedule
- A schedule avoids cascading aborts if transactions only read values written by committed transactions.
- This ensures that rollbacks do not propagate, making recovery easier.
- How to Avoid: Use strict two-phase locking or delay reads until commit.
d. Strict Schedule
- A stricter form where a transaction can neither read nor write an item until the previous transaction that last wrote to it has committed or aborted.
2. Serializability
Serializability is the gold standard for determining if a concurrent schedule maintains database consistency by yielding the same effect as some serial (one-at-a-time) execution.
a. Serial and Non-serial Schedules
- Serial Schedule: Transactions are executed one after the other; always safe but inefficient.
- Non-serial Schedule: Transactions' operations are interleaved for efficiency; needs to be serializable for correctness.
b. Conflict Serializability
- Definition: A schedule is conflict-serializable if, by swapping non-conflicting operations, it can be transformed into a serial schedule.
- Conflicting Operations (two operations conflict if):
  - They belong to different transactions.
  - They operate on the same data item.
  - At least one of them is a write.
  - Examples: R–W, W–R, W–W.
- Non-Conflicting Pairs: Operations of the same transaction, operations on different data items, or two reads of the same item.
- Testing Conflict Serializability:
  - For each conflict, create a precedence graph (nodes: transactions; edges: Ti → Tj if Ti’s operation precedes and conflicts with Tj’s).
  - If the graph has no cycles, the schedule is conflict-serializable.
  - Benefit: Fast to check and avoids most anomalies, but some correct schedules that are not conflict-serializable are missed.
- Example:
  - T1: W(A); T2: R(A), W(A)
  - W1(A) and R2(A): conflict (write–read, different transactions, same item).
  - W1(A) and W2(A): conflict (write–write).
  - If all conflicts can be reordered to match a serial execution, the schedule is conflict-serializable.
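A small Python sketch of this test: derive Ti → Tj edges from conflicting operation pairs and check the precedence graph for a cycle (the schedule encoding and helper names are invented for illustration).

```python
from itertools import combinations

# A schedule as a list of (transaction, operation, data_item) in execution order.
schedule = [("T1", "R", "A"), ("T2", "W", "A"), ("T1", "W", "A")]

def precedence_edges(sched):
    edges = set()
    for (t1, op1, x1), (t2, op2, x2) in combinations(sched, 2):
        conflicting = t1 != t2 and x1 == x2 and "W" in (op1, op2)
        if conflicting:
            edges.add((t1, t2))            # earlier conflicting op -> later one
    return edges

def has_cycle(edges):
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
    def reachable(start, target, seen=()):
        return any(n == target or (n not in seen and reachable(n, target, seen + (n,)))
                   for n in graph.get(start, ()))
    return any(reachable(node, node) for node in graph)

edges = precedence_edges(schedule)
print(edges)                                             # edges T1->T2 and T2->T1: a cycle
print("conflict-serializable:", not has_cycle(edges))    # False
```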
c. View Serializability
- Definition: A schedule is view-serializable if it is view-equivalent to some serial schedule, i.e., every transaction "sees" database changes as in some serial execution.
- Conditions for View Equivalence:
- Initial Reads: For each data item, both schedules must have the same transaction performing the first read.
- Final Writes: For each data item, both must have the same transaction perform the last write.
- Update Reads: If Tj reads data written by Ti, both schedules must preserve this read-from relationship.
- Advantage: Captures all correct schedules; more powerful than conflict-serializability but harder to check algorithmically.
Example:
Suppose in both the actual and serial schedule, T2 reads X as written by T1, and final writes match. The schedule is view-serializable even if not conflict-serializable.
| Concept | What It Ensures / Description | Advantages | Disadvantages / Issues |
|---|---|---|---|
| Recoverable Schedule | Commits only after all dependent commits | Prevents unrecoverable states | May allow cascading rollbacks |
| Cascading Rollback | Rollbacks propagate through dependencies | — | Data loss, complex recovery |
| Cascadeless Schedule | Only reads committed data | No cascading aborts | May require stricter protocols |
| Conflict Serializability | Can reorder conflicts to serial schedule | Easy to check | Doesn’t capture all 'good' schedules |
| View Serializability | Same reads/writes as serial schedule | More comprehensive | Harder to verify |
| Non-Conflicting Operations | Same transaction, or different data items, or both reads | Allow flexible reordering | — |
3. Practical Guidance
- Always prefer cascadeless and strict schedules in production systems for easier recovery and integrity.
- Aim for conflict-serializable schedules via two-phase locking or logical timestamp protocols for high concurrency but correct outcomes.
- For advanced correctness and optimization (especially in distributed systems), check for view-serializability.
Lock-Based Concurrency Control
Lock-based protocols manage access to data in concurrent transactions to ensure consistency and isolation.
1. Lock Types
Shared Lock (S-Lock)
- Purpose: Allows read-only access to a data item.
- Behavior: Multiple transactions can hold shared locks on the same item simultaneously, but cannot write.
- Example: If T1 and T2 both hold shared locks on row A, both can read but neither can write until all shared locks are released.
Exclusive Lock (X-Lock)
- Purpose: Allows read & write operations.
- Behavior: Only one transaction can hold an exclusive lock on a data item; no other shared or exclusive locks are allowed on the item during this time.
- Example: If T1 holds an exclusive lock on row B, no other transaction can read or write B until T1 releases the lock.
| Lock Type | Allows Read? | Allows Write? | Multiple Holders? |
|---|---|---|---|
| Shared | Yes | No | Yes |
| Exclusive | Yes | Yes | No |
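The compatibility rules in this table can be sketched as a tiny lock table in Python (simplified: one lock entry per item, no queueing, blocking, or deadlock handling).

```python
class LockTable:
    """Tracks which transactions hold S or X locks on each data item."""
    def __init__(self):
        self.locks = {}   # item -> {"mode": "S" or "X", "holders": set of txn ids}

    def acquire(self, txn, item, mode) -> bool:
        entry = self.locks.get(item)
        if entry is None:
            self.locks[item] = {"mode": mode, "holders": {txn}}
            return True
        if entry["mode"] == "S" and mode == "S":
            entry["holders"].add(txn)                # shared locks are compatible
            return True
        if entry["holders"] == {txn}:
            if mode == "X":
                entry["mode"] = "X"                  # S -> X upgrade by the sole holder
            return True
        return False                                 # any other combination must wait

    def release(self, txn, item) -> None:
        entry = self.locks.get(item)
        if entry and txn in entry["holders"]:
            entry["holders"].discard(txn)
            if not entry["holders"]:
                del self.locks[item]

lt = LockTable()
print(lt.acquire("T1", "A", "S"))   # True  (first shared lock)
print(lt.acquire("T2", "A", "S"))   # True  (S is compatible with S)
print(lt.acquire("T2", "A", "X"))   # False (T1 still holds S, so the upgrade must wait)
print(lt.acquire("T3", "B", "X"))   # True  (exclusive lock on a free item)
```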
2. Two-Phase Locking Protocol (2PL)
Protocol Overview
- Goal: Guarantees serializability (correctness) of schedules.
- Phases:
- Growing Phase: Transaction can acquire new locks but cannot release any.
- Shrinking Phase: Transaction releases locks but cannot acquire new ones.
- Lock Point: The moment where the last lock is acquired—afterward, only unlocks are allowed. All lock operations must precede all unlocks.
3. Strict Two-Phase Locking
- Stricter variant: Transaction holds all exclusive locks until it commits or aborts.
- Advantage: Ensures no cascading rollbacks—if a transaction fails, no other committed transaction has seen its uncommitted data, making recovery simpler and safer.
- Characteristic: Shared locks may be released earlier, but exclusive locks are held until commit.
4. Lock Upgradation (Lock Conversion)
- Definition: The process of converting a shared lock to an exclusive lock (S → X) when a transaction that was reading decides it now needs to write.
- Use Case: A transaction reads a value, then wants to update it without releasing and reacquiring the lock (which could allow other transactions to slip in).
- Constraint: Upgradation is only possible if no other transaction holds a shared lock on the same item.
5. Deadlock & Starvation
Deadlock
- Occurs when: Two or more transactions are waiting indefinitely for resources locked by each other, forming a cycle.
- Classic Example: T1 locks A, requests B; T2 locks B, requests A — both wait forever.
- Necessary conditions: Mutual exclusion, hold and wait, no preemption, circular wait.
- Management: Deadlock detection & victim selection for rollback are required.
Starvation
- Occurs when: A transaction waits indefinitely because other transactions repeatedly preempt its required resources.
- Causes: Higher-priority or frequent transactions keep grabbing locks before the waiting one can proceed.
- Solution: Fair scheduling, resource allocation strategies, or queue-based locking to ensure fairness.
6. Timestamp-Based Protocols
- Principle: Assigns a unique timestamp to each transaction upon arrival.
- Ordering: Transactions are ordered and executed logically in timestamp order, using timestamps for validation (older → newer).
- If a younger transaction tries to overwrite a value read/written by an older, it may be delayed or aborted.
- Advantages: Avoids deadlocks (no waiting for locks), suitable for high-concurrency environments.
- Drawbacks: May cause more transaction restarts/rollbacks if conflicts occur.
7. Summary Table
| Concept | Purpose/Behavior | Pros | Cons |
|---|---|---|---|
| Shared Lock | Read-only by many; no writes allowed | High concurrency | May block writers |
| Exclusive Lock | Single holder; allows read & write | Data integrity | More locking waits |
| 2PL | Split into 'growing' and 'shrinking' phases | Ensures serializability | Deadlock possible |
| Strict 2PL | Hold X-locks until commit/abort | No cascading aborts | Lower concurrency |
| Lock Upgradation | Convert S-lock to X-lock without unlock/lock | Efficient for read→write | Potential deadlock |
| Deadlock | Cyclic wait for locks, nobody can proceed | – | Must detect/handle |
| Starvation | Some transactions never proceed | – | Unfairness |
| Timestamp Protocol | Timestamps decide execution order (no locks) | No deadlock | More aborts possible |
Files and Indexing Concepts in DBMS
Efficient file organization and indexing are critical for fast data access in databases.
1. File Organization
a. Heap (Unordered) File
- Records are placed wherever space is available.
- Fast inserts, slow for searches unless scanned entirely.
b. Sequential (Ordered) File
- Records are sorted by key.
- Fast for range queries and sorted (sequential) access; slower inserts and updates.
c. Hash File
- Records are placed based on a hash of the search key.
- Very fast for equality lookups but not suited for range queries.
2. Indexing
Indexes improve query speed by allowing direct access to records.
a. Dense Index
- Definition: Index entry exists for every search key value in the data file.
- Feature: Every record is pointed to; fast searches; larger index size.
- Example: If a table has 1,000 records, the dense index will have 1,000 entries.
b. Sparse Index
- Definition: Index entries exist only for some search key values (typically, one per block/page of the data file).
- Feature: Smaller index; entries point to data blocks; if the exact key is not in the index, locate the nearest lower key and scan within that block.
- Example: If each page holds 50 records and there are 1,000 records, the sparse index will have 20 entries.
Dense vs. Sparse Index Comparison
| Aspect | Dense Index | Sparse Index |
|---|---|---|
| Entries | Every search key | Only some keys (per page) |
| Space | Larger | Smaller |
| Search Time | Fastest | Slightly slower |
| Maintenance | More updates needed | Fewer updates needed |
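A tiny Python sketch of the sparse-index idea: one index entry per fixed-size block (the block's lowest key), located by binary search and followed by a scan inside the block; block size and data are illustrative.

```python
from bisect import bisect_right

BLOCK_SIZE = 4
# Sorted records laid out in fixed-size blocks of (key, payload).
records = [(k, f"row-{k}") for k in range(0, 40, 2)]          # keys 0, 2, ..., 38
blocks = [records[i:i + BLOCK_SIZE] for i in range(0, len(records), BLOCK_SIZE)]

# Sparse index: one entry per block -> the first (lowest) key stored in that block.
sparse_index = [block[0][0] for block in blocks]

def lookup(key):
    # Find the rightmost block whose first key <= search key, then scan that block.
    block_no = bisect_right(sparse_index, key) - 1
    if block_no < 0:
        return None
    for k, payload in blocks[block_no]:
        if k == key:
            return payload
    return None

print(sparse_index)     # [0, 8, 16, 24, 32]
print(lookup(18))       # 'row-18'
print(lookup(19))       # None (key not present)
```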
3. Multi-way Trees for Indexing (m-way Search Tree)
- Each node can have up to m children and (m–1) keys.
- Generalization of binary search trees; used in DBMS for efficient indexing.
4. B-Tree
- A balanced m-way search tree (usually, m is large: 50–200).
- Properties:
- All leaves at the same level (balanced).
- Each internal node (except root) has between ⌈m/2⌉ and m children.
- Fast search, insertion, and deletion (O(log n)).
- Each node can have multiple keys and pointers.
- Keys are kept sorted inside nodes.
- Used for indexing on disk-based large data (minimizes disk reads by maximizing data in each node).
5. B+ Tree
- A variant of B-tree used in most practical DBMS implementations.
- Differences from B-Tree:
- Only leaf nodes contain actual data record pointers; internal nodes store only keys for navigation.
- Leaf nodes are linked in a linked-list fashion for fast range queries.
- More keys per node—shallow tree—fewer disk accesses.
B-Tree vs B+ Tree Comparison
| Feature | B-Tree | B+ Tree |
|---|---|---|
| Record Storage | In all nodes | Only in leaf nodes |
| Internal Nodes | Keys + pointers + data | Only keys + pointers (no actual data) |
| Range Query | Slower (data spread in tree) | Faster (all data in sequential leaf nodes) |
| Leaf Node Link | No | Yes (doubly or singly linked) |
LSM Indexing and SSTables
- LSM (Log-Structured Merge) tree is a disk-friendly indexing structure optimized for very high write throughput, used in many modern NoSQL and distributed key-value databases (e.g., Cassandra, ScyllaDB, LevelDB, RocksDB).
- Core Idea:
  - Writes are first buffered and stored sequentially in memory, then periodically flushed to disk in batches.
  - Unlike B-tree indexes (which use in-place modifications), LSM trees use append-only and merge-based strategies to minimize random writes and maximize disk throughput.
LSM Tree Architecture Components
- Memtable (in-memory):
  - An in-memory, mutable, usually sorted data structure (like a Skip List or Red-Black Tree).
  - Fast for inserts/updates/deletes.
  - All new writes go here first.
- Write-Ahead Log (WAL):
  - A sequential log to ensure durability before the Memtable is flushed to disk.
- SSTables (Sorted String Tables, on disk):
  - When the Memtable is full, its contents are flushed to disk as a new immutable, sorted file (SSTable).
  - Multiple SSTables exist on disk, often organized into levels.
- Compaction:
  - Periodically, SSTables are merged/compacted to remove obsolete or duplicate entries and optimize disk usage.
  - Compaction strategies (like leveled or tiered) affect performance trade-offs.
LSM Operations
- Write:
- Add new data to Memtable.
- Log to WAL for durability.
- When Memtable full, flush as new SSTable (sorted, immutable).
- Read:
- Check Memtable.
- If not found, check SSTables (newest to oldest).
- Use Bloom filters and indexes to speed up reads.
- Delete:
- Insert a "tombstone" indicating data was deleted; real deletion occurs during compaction.
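The write/read path above can be illustrated with a toy Python sketch (memtable, flush to immutable sorted runs, newest-first reads, tombstone deletes); the flush threshold is invented, and WAL, Bloom filters, and compaction are omitted for brevity.

```python
TOMBSTONE = object()          # marker written on delete

class ToyLSM:
    def __init__(self, memtable_limit=2):
        self.memtable = {}                    # in-memory, mutable write buffer
        self.sstables = []                    # immutable sorted runs, oldest first
        self.memtable_limit = memtable_limit

    def _flush(self):
        # Freeze the memtable as a sorted, immutable run on "disk".
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def delete(self, key):
        self.put(key, TOMBSTONE)              # deletes are just tombstone writes

    def get(self, key):
        if key in self.memtable:
            value = self.memtable[key]
            return None if value is TOMBSTONE else value
        for sstable in reversed(self.sstables):        # newest SSTable first
            for k, value in sstable:
                if k == key:
                    return None if value is TOMBSTONE else value
        return None

db = ToyLSM()
db.put("a", 1)
db.put("b", 2)        # memtable full -> flushed as SSTable #1
db.put("a", 10)       # newer version of "a" shadows the old one
db.delete("b")        # flushed together as SSTable #2
print(db.get("a"), db.get("b"))   # 10 None
```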
SSTable (Sorted String Table)
- SSTable: A disk file storing immutable, sorted (by key) key-value pairs—used as the basic building block of an LSM tree's on-disk structure.
- Immutable: Once written, SSTables are never updated in place.
- Sorted: Allows for fast binary search, range queries, and efficient merge (compaction) operations.
- Structure:
  - Index: Contains pointers to the location of keys in data blocks, allowing for binary search and fast lookups.
  - Data Blocks:
    - The actual key-value entries, sorted by key.
    - May store metadata, timestamps, or tombstones (for marking deletes).
3. Compaction in LSM Trees
- Compaction is the process of merging multiple SSTables, dropping obsolete keys (including old values and tombstones), and reducing disk space and read amplification.
- Benefits: Keeps read performance stable, minimizes space used, and ensures deletes/updates propagate.
4. LSM vs. Traditional B-trees
| Aspect | LSM Tree | B-tree |
|---|---|---|
| Writes | Fast, sequential, batched | Slow, random in-place |
| Reads | Potentially slower, multiple files | Fast, single structure |
| Best For | Write-heavy, log, analytics | Read-heavy, OLTP |
| Structure | Multiple levels of SSTables | Single balanced tree |
5. Summary Table: LSM Components
| Component | Where? | Role |
|---|---|---|
| Memtable | In-memory | Buffer, fast writes/updates |
| WAL | Disk | Durability of in-memory data |
| SSTable | Disk | Immutable, on-disk, sorted data files |
| Bloom Filter | In-memory | Fast check for key non-existence in SSTables |
| Compaction | Disk | Merge/optimize SSTables, drop obsolete |
In Practice
- Cassandra, LevelDB, RocksDB, and HBase all use LSM indexing and SSTable files.
- Advantages: Handles massive write loads and supports high-throughput ingestion.
- Trade-off: Reads may check several SSTables (mitigated with Bloom filters and frequent compaction).
Geo Indexing
Goal: Efficiently answer "Find all points near X" in 2D space (latitude/longitude).
Why it’s hard:
- 2D data doesn’t fit well in B-trees
- Naïve scan = O(n) → too slow for millions of points
- Need logarithmic query time
Two dominant solutions:
1. Quad Tree → Tree-based, dynamic
2. Geohash → String-based, fixed-grid
Quad Tree – The Tree That Splits the World
Concept
- Recursively divides 2D space into 4 quadrants
- Each node represents a bounding box
- Stops when ≤ N points per node (usually 4)
[Earth]
/ | \ \
NW NE SW SE
/| /| |\ |\
Types

| Type | Use Case |
|---|---|
| Point Quad Tree | Store individual points (cities, users) |
| Region Quad Tree | Raster images, terrain data |
| PR Quad Tree | Fixed subdivision rules |
How Insertion Works?
if node has < 4 points:
add point
else:
subdivide into 4 children
re-insert all 5 points into correct child
Pros
- Perfect for nearest neighbor and range queries
- Adapts to data density (dense cities → deep tree)
- Natural clustering (points in same node = close)
- Works with circles, polygons
Cons
- High memory usage (pointers everywhere)
- Hard to persist in databases
- Poor cache performance
- Not distributed-friendly
- Imbalanced trees possible
Quad Tree Query
def find_in_radius(self, center, radius):
if not self.boundary.intersects_circle(center, radius):
return []
results = [p for p in self.points if p.distance(center) <= radius]
if self.divided:
for child in [self.nw, self.ne, self.sw, self.se]:
results += child.find_in_radius(center, radius)
return results
Geohash – Turn Location into a String
Concept
Converts (lat, lng) → short string using base-32 encoding
40.7484, -73.9857 → dr5ru6jz1
(Empire State Building)
How It Works (Bit Interleaving)
Latitude: 01001011... (32 bits)
Longitude: 10110100... (32 bits)
Interleave: 01100101... → base32 → dr5ru6jz1
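A from-scratch sketch of this interleaving in Python, kept close to the standard algorithm (longitude bit first, 5 bits per base-32 character); real applications would normally rely on a geohash library instead.

```python
# Geohash base-32 alphabet (digits plus letters, excluding a, i, l, o).
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"

def geohash_encode(lat, lng, length=9):
    lat_range, lng_range = [-90.0, 90.0], [-180.0, 180.0]
    bits, hashcode = [], []
    even = True  # bits are interleaved starting with longitude
    while len(hashcode) < length:
        rng = lng_range if even else lat_range
        val = lng if even else lat
        mid = (rng[0] + rng[1]) / 2
        if val >= mid:          # 1 bit -> keep the upper half of the range
            bits.append(1)
            rng[0] = mid
        else:                   # 0 bit -> keep the lower half of the range
            bits.append(0)
            rng[1] = mid
        even = not even
        if len(bits) == 5:      # every 5 bits become one base-32 character
            hashcode.append(BASE32[int("".join(map(str, bits)), 2)])
            bits = []
    return "".join(hashcode)

print(geohash_encode(40.7484, -73.9857))  # starts with 'dr5ru', consistent with the example above
```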
Precision Table
| Length | Area Size | Accuracy | Example |
|---|---|---|---|
| 5 | 4.9km × 4.9km | ~2.4km | dr5ru |
| 7 | 610m × 610m | ~300m | dr5ru6j |
| 9 | 76m × 76m | ~38m | dr5ru6jz1 |
| 11 | 9.5m × 9.5m | ~5m | dr5ru6jz1j3 |
The 9-Cell Trick
One geohash = one box
But user is in center → need 8 neighbors
[7][0][1]
[6][C][2] ← C = center geohash
[5][4][3]
Query 9 geohash cells → perfect 500m radius coverage
Why Redis Loves Geohash
| Redis Feature | Geohash Provides |
|---|---|
| Fast string keys | 9-char strings |
| Sorted Sets (ZSET) | Store user IDs with timestamp |
| Prefix search | Same prefix = nearby |
| Memory efficient | 9 bytes per location |
ZADD geo:dr5ru6jz1 1700000000 user123
ZADD geo:dr5ru6jz3 1700000001 user456
Geohash + Redis (Python)
import redis, geohash, time
r = redis.Redis()
def add_user(user_id, lat, lng):
gh = geohash.encode(lat, lng, 9)
r.zadd(f"geo:{gh}", {user_id: int(time.time())})
def nearby(lat, lng):
center = geohash.encode(lat, lng, 9)
cells = [center] + geohash.neighbors(center)
users = []
for cell in cells:
users.extend(r.zrange(f"geo:{cell}", 0, -1))
return users
Comparison
| Feature | Quad Tree | Geohash |
|---|---|---|
| Data Structure | Tree (dynamic) | String (fixed) |
| Query Speed | O(log n) | O(1) with Redis |
| Memory Usage | High (pointers) | Tiny (9 bytes) |
| Database Indexing | Hard | Easy (B-tree on string) |
| Distributed Systems | Hard to shard | Perfect sharding |
| Nearest Neighbor | Excellent | Good (with 9-cell) |
| Circle Queries | Perfect | Approximate (square grid) |
| Implementation | Complex | 10 lines of code |
| Production Scalability | Medium | 100M+ users (Uber, Tinder) |
Real Production Systems
Geohash Winners
| Company | Implementation |
|---|---|
| Uber | Redis GEO + GEORADIUS for driver matching |
| Instagram | Nearby posts → geohash in Cassandra |
| Tinder | Match discovery → Redis GEO |
| Snapchat | Geo-filters → Redis geospatial |
| Foursquare | Venue search → Redis ZSETs per geohash |
Quad Tree Winners
| Company | Use Case |
|---|---|
| Google Maps | Frontend marker clustering |
| Pokémon GO | Creature spawning density |
| Mapbox | Vector tile clustering |
When to Use Which?
| Use Case | Winner | Reason |
|---|---|---|
| Find drivers within 1km | Geohash + Redis | 1ms response |
| Show 1M map markers without lag | Quad Tree | Zoom clustering |
| Store user locations in DB | Geohash | Indexable string |
| 2D game collision detection | Quad Tree | Real-time |
| Shard data across 100 servers | Geohash | Natural partitions |
| Precise circular range queries | Quad Tree | True circles |
| Mobile app with 10M users | Geohash + Redis | Scalability |
Best Practice: Use Both!
Modern geo apps do exactly this:
Frontend (Map) → Quad Tree → Beautiful clustering
Backend (API) → Geohash + Redis → Lightning-fast queries
Database (Persist) → Geohash column → Easy indexing
Example: Uber
- Drivers stored with Geohash in Redis
- Map shows Quad Tree clustered markers
- Database has geohash VARCHAR(12) INDEX
The winners use both — Geohash for backend, Quad Tree for frontend.
This is exactly how Uber, Instagram, Tinder, and Google Maps work in 2025.
Common Database Systems
MySQL (Relational, SQL-based)
- Type: RDBMS (Relational Database Management System)
- ACID compliant: Guarantees reliable transactions (Atomicity, Consistency, Isolation, Durability).
- Multiple Storage Engines: Choose per-table (e.g., InnoDB, MyISAM), each with different features.
- Scalability: Vertical scaling natively; horizontal scaling via replication & sharding (MySQL Cluster).
- Performance: Fast for read-heavy workloads with indexes.
- Replication: Asynchronous or semi-synchronous, enabling high availability.
- JSON: Supports limited JSON fields (since 5.7), semi-structured data.
- Protocol: Native TCP (port 3306).
- Works better in case of more conflicting writes.
Key Storage Engines:
| Engine | Features | Use Cases |
|---|---|---|
| InnoDB | Default engine. Supports transactions, row-level locking, foreign keys, crash recovery. | Transactional apps (e.g., e-commerce, banking) |
| MyISAM | Fast reads, table-level locking, no transaction support, smaller disk use. | Read-heavy, analytics, logging |
| Memory | Data in RAM (non-persistent), ultra-fast. | Temp tables, caching |
| Archive | Compressed storage for archival, no indexing | Historical data, infrequent queries |
| CSV | Stores data as .csv files, no indexes | Data exchange, interoperability with other tools |
- Mix-and-match: Tables in the same DB can use different engines.
- Engine affects: Transaction support, concurrency, performance, features.
Index Types:
- B-Tree Index: Default, for quick lookups, range queries, sorts.
- Full-Text Index: For keyword searches in text data.
- Hash Index: Fast equality checks, not for ranges (mainly Memory tables).
- Spatial Index: For geographic/spatial data queries.
- Unique/Composite Index: Enforces uniqueness or indexes multiple columns.
PostgreSQL (Relational, Object-Relational, SQL-based)
- Type: RDBMS, object-relational, ACID compliant.
- Advanced Data Types: Arrays, JSON/JSONB, hstore, XML, custom types.
- Extensible: User-defined functions, custom operators, extensions (e.g., PostGIS for geography).
- Concurrency: MVCC (Multi-Version Concurrency Control) avoids reader locks, supports high concurrency.
- SQL: Full ANSI SQL + advanced features (window functions, CTEs, etc.).
- Replication: Streaming (sync/async), logical replication.
- Full-text Search: Built-in, indexed.
- Scalability: Vertical and—via tools/extensions—horizontal (e.g., Citus).
- Protocol: Custom (port 5432).
- Works better in case of more non-conflicting writes.
Architecture:
- Single Storage Engine (Heap + Extensions): Data is stored in "heap" tables; extensible via pluggable table access methods (since v12+).
- MVCC: Supports concurrent reads/writes by storing row versions.
- WAL (Write-Ahead Log): Ensures crash recovery, replication.
Index Types:
- B-Tree: Default, widely used.
- GiST, GIN, BRIN: For full-text search, arrays, JSON, spatial data.
- Hash, Unique, Composite: For specific use cases, complex queries.
Example: - Ideal for complex transactional, analytical, geospatial, and semi-structured data workloads.
Steps for Updating a Row in a PostgreSQL Table
Sequential Steps for Updating a Row
- Query Parsing and Planning:
- The
UPDATEquery (e.g.,UPDATE users SET name = 'Alice' WHERE id = 123;) is received via a client (e.g., psql, JDBC). - PostgreSQL’s parser validates the query syntax and creates an abstract syntax tree (AST).
-
The query planner/optimizer generates an execution plan, choosing the most efficient path (e.g., index scan or sequential scan) based on table statistics and indexes.
-
Transaction Initiation:
- A transaction is started (explicitly with
BEGINor implicitly for a single statement). -
PostgreSQL assigns a transaction ID (XID) to track changes and ensure MVCC.
-
Row Lookup:
- The executor uses the query plan to locate the target row(s) based on the
WHEREcondition. - If an index exists (e.g., B-tree on
id), it’s used for faster lookup; otherwise, a sequential scan is performed. -
The row’s tuple identifier (TID) is retrieved from the heap table.
-
MVCC Check:
-
PostgreSQL checks the row’s visibility using MVCC:
- Verifies if the row is visible to the current transaction (based on Xmin/Xmax fields in the tuple header).
- Ensures no conflicting transactions are modifying the same row (e.g., using locks if needed).
-
Row Update (New Tuple Creation):
- Instead of overwriting the existing row, PostgreSQL creates a new tuple with updated values (e.g.,
name = 'Alice'). - The new tuple is marked with the current transaction’s XID (Xmin) and the old tuple’s XID is updated (Xmax) to mark it as “dead” for MVCC.
-
The new tuple is written to the heap table, typically in the same or a new page (8KB default).
-
Index Updates:
- If the table has indexes, PostgreSQL updates them to reflect the new tuple:
- Removes the old tuple’s index entry (or marks it invalid).
- Adds a new index entry pointing to the new tuple’s TID.
-
This applies to all index types (e.g., B-tree, GIN, GiST).
-
Write-Ahead Logging (WAL):
- The update operation (new tuple and index changes) is logged to the WAL before committing to disk.
-
WAL ensures durability by recording changes in a sequential log, allowing recovery in case of crashes.
-
Locking (if Needed):
- If concurrent transactions access the same row, PostgreSQL may acquire row-level locks (e.g.,
FOR UPDATE) to prevent conflicts. -
Table-level locks are rare unless explicitly requested (e.g.,
LOCK TABLE). -
Commit or Rollback:
- On
COMMIT, the transaction is finalized, and the new tuple becomes visible to subsequent transactions (based on MVCC rules). - WAL records are marked as committed, ensuring durability.
-
On
ROLLBACK, changes are discarded, and the new tuple is ignored. -
Storage and Maintenance:
- The old tuple remains in the heap as a “dead tuple” for MVCC (visible to older transactions).
- PostgreSQL’s autovacuum process later runs
VACUUMto mark dead tuples for reuse and reclaim space. ANALYZEmay update table statistics for future query optimization.
Example
For `UPDATE users SET name = 'Alice' WHERE id = 123;`:
1. Parse query and plan (use index on id).
2. Start transaction, assign XID.
3. Find row with id = 123 via index.
4. Check MVCC visibility.
5. Create new tuple with name = 'Alice', mark old tuple as dead.
6. Update index to point to new tuple.
7. Log changes to WAL.
8. Apply any necessary locks.
9. Commit transaction, making new tuple visible.
10. Autovacuum later cleans up dead tuple.
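As a complement to these steps, here is a minimal sketch of issuing the same update from application code, assuming a reachable PostgreSQL instance with a `users` table and the `psycopg2` driver installed; the connection string is a placeholder.

```python
# Minimal sketch (assumes psycopg2 is installed and a `users` table exists).
import psycopg2

conn = psycopg2.connect("dbname=app user=app password=secret host=localhost")
try:
    with conn:                      # opens a transaction; commits on success, rolls back on error
        with conn.cursor() as cur:
            # The server parses, plans, and executes this, creating a new tuple version (MVCC).
            cur.execute("UPDATE users SET name = %s WHERE id = %s", ("Alice", 123))
            print("rows updated:", cur.rowcount)
finally:
    conn.close()
```

Using the connection as a context manager keeps the commit/rollback decision explicit, mirroring the commit-or-rollback step above.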
MongoDB (NoSQL, Document-Oriented)
- Type: NoSQL, document-based database.
- Document Model: Data stored as BSON documents (flexible JSON-like structure).
- Example: `{ "name": "Alice", "age": 30, "address": { "city": "Delhi" } }`
- Schema-less: Data structure can change over time with no strict schema enforcement.
- Scalability: Built-in sharding for horizontal scale, replica sets for availability.
- Query Language: MQL (Mongo Query Language), powerful aggregation framework.
- ACID Transactions: Supported (since v4.0), but designed for eventual consistency.
- Performance: Very fast for write-heavy and dynamic-schema apps.
- Protocol: Custom TCP protocol (port 27017).
Architecture Highlights:
- Sharding: Splits data based on a shard key, distributes across multiple servers.
- Replica Sets: Primary node for writes, secondary nodes for redundancy & failover.
Index Types:
- B-Tree: Default for most queries.
- Compound: Multiple fields.
- Text: Full-text search across text fields.
- Geospatial: For location queries.
- Hashed: For uniform sharding.
- Unique: Ensures uniqueness.
- TTL (Time-To-Live): Auto-delete documents after specified interval.
Example Use Cases: Real-time analytics, IoT, content management, apps with fast-evolving schemas.
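A short sketch of declaring several of these index types with PyMongo, assuming a local `mongod` on the default port; the database, collection, and field names are illustrative, not from the original notes.

```python
# Illustrative sketch using PyMongo (collection and field names are made up).
from pymongo import ASCENDING, GEOSPHERE, HASHED, TEXT, MongoClient

client = MongoClient("mongodb://localhost:27017")
events = client["demo"]["events"]

events.create_index([("userId", ASCENDING), ("createdAt", ASCENDING)])  # compound B-tree
events.create_index([("description", TEXT)])                            # full-text search
events.create_index([("location", GEOSPHERE)])                          # geospatial (2dsphere)
events.create_index([("userId", HASHED)])                               # hashed (useful for sharding)
events.create_index("createdAt", expireAfterSeconds=3600)               # TTL: auto-delete after 1 hour
events.create_index("externalId", unique=True)                          # unique constraint
```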
Additional Important Points
- Consistency, scalability, and performance needs determine database and storage engine choice.
- SQL vs. NoSQL: Choose SQL for strong consistency, transactions, structured data; NoSQL for flexible, large-scale, and dynamic use cases.
- All major systems offer powerful indexing, but index choice and design directly affect performance.
- Backups, failover, replication, and associated tooling are essential for production deployments in any stack.
RDBMS vs. NoSQL:
| System | Data Model | Best For | Scaling | Transactions | Schema |
|---|---|---|---|---|---|
| MySQL | Table/Rows | Traditional, structured data | Vert/Horiz | Strong/ACID | Rigid |
| PostgreSQL | Table/Rows | Complex & extensible workloads | Vert/Horiz (ext) | Strong/ACID | Rigid |
| MongoDB | Document | Unstructured, dynamic schemas | Native Horiz | Eventual | Flexible |
Distributed Systems
Database Sharding
- Database sharding splits large datasets into smaller, manageable pieces called shards, stored on separate database instances.
- Each shard holds a subset of the total data, determined by a shard key (an attribute used for partitioning).
- Sharding improves scalability, performance, and availability for large, high-traffic applications.
Types of Sharding Architectures
1. Range-Based Sharding
- How it works: Data is divided based on value ranges of the shard key.
- Pros: Simple to implement.
- Cons: Can lead to uneven data distribution (hot spots).
- Example:
  - userID 000 - 199 ➔ database 1
  - userID 200 - 399 ➔ database 2
  - userID 400 - 599 ➔ database 3
  - userID 600 - 799 ➔ database 4
  - userID 800 - 999 ➔ database 5
2. Key-Based (Hash-Based) Sharding
- How it works: Applies a hash function to the key (e.g., userID), and assigns the result to a shard.
- Pros: Tends to distribute data more evenly than range sharding.
- Cons: Simple hash functions can still create imbalances; consistent hashing can address this.
- Example:
  - h(x) = x % 3, where x = userID
  - Distributes userIDs 1–6 across 3 shards
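A toy sketch contrasting the two routing rules above; the shard counts, ranges, and database names mirror the examples and are not how a production router is implemented.

```python
# Toy shard routers mirroring the range-based and hash-based examples above.

RANGES = [(0, 199, "database 1"), (200, 399, "database 2"), (400, 599, "database 3"),
          (600, 799, "database 4"), (800, 999, "database 5")]

def range_shard(user_id: int) -> str:
    """Range-based: pick the shard whose userID range contains the key."""
    for lo, hi, db in RANGES:
        if lo <= user_id <= hi:
            return db
    raise ValueError("userID out of range")

def hash_shard(user_id: int, num_shards: int = 3) -> str:
    """Key/hash-based: h(x) = x % num_shards spreads keys more evenly."""
    return f"database {user_id % num_shards + 1}"

print(range_shard(423))   # database 3  (falls in 400 - 599)
print(hash_shard(423))    # database 1  (423 % 3 == 0)
```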
3. Directory-Based Sharding
- How it works: Maintains a lookup table that maps each key to its shard.
- Pros: Highly flexible, as any key can be mapped to any shard.
- Cons: The lookup table is a single point of failure and can be a bottleneck.
Horizontal Partitioning vs. Sharding
- Horizontal Partitioning: Splits a table into many tables within the same database instance (partitions must have unique table names).
- Sharding: Distributes data across multiple database instances (each can use the same table names).
Advantages
- High Availability: Other shards stay online even if one fails.
- Security: Access can be controlled at the shard level.
- Faster Queries: Each shard manages less data, making indexes and queries more efficient.
- Increased Throughput: Reads and writes are isolated per shard, increasing performance.
- High Scalability: Additional shards can be added to handle growing data and traffic.
Disadvantages
- Increased Complexity: Query routing, data distribution, and cross-shard coordination add complexity.
- Transactions/Rollbacks across Shards: Difficult or impossible without distributed transaction protocols.
- Expensive Joins: Joining data across shards requires costly, cross-network operations.
- Higher Infrastructure Cost: Requires multiple servers, raising costs compared to single-instance solutions.
Hierarchical Sharding
- Problem: Fixed number of shards may lead to unbalanced data distribution (hot spots).
- Solution: Hierarchical sharding shards a large shard further into smaller mini-shards, managed by a shard manager.
- Helps manage growth and balance data distribution within problematic shards.
Master-Slave Architecture for High Availability
- Approach: Each shard has a master instance (handles writes) and one or more slave replicas (handle reads).
- Failover: On master failure, a slave can be promoted to master.
- Benefit: Supports continuous reads and maintains data redundancy for disaster recovery.
Consistency
Consistency measures how up-to-date and synchronized distributed data copies are.
1. Linearizable Consistency
- Definition: Every read returns the latest written value; all operations appear instantaneous and in order.
- How: Typically achieved by funneling all operations through a single serialization point (e.g., a single-threaded server or a consensus-backed leader); easy to reason about, but limits throughput.
2. Eventual Consistency
- Definition: All copies will converge to the same value eventually, though reads may return stale data in the short term.
- How: Reads and writes can occur in parallel across replicas.
- Trade-off: Higher efficiency at the expense of short-term consistency.
3. Causal Consistency
- Definition: Ensures that causally related operations are seen in the same order by all processes, but unrelated operations may arrive out of order.
- Benefit: Stronger than eventual consistency but weaker than linearizable; suitable for operations that are mostly independent.
- Limitation: Fails for aggregation queries, since these require knowledge of the complete (global) order.
4. Quorum Consistency
- Definition: Reads and writes require responses from a minimum number of replicas, following the rule `R + W > N`, where:
  - R = number of replicas to read from
  - W = number of replicas to write to
  - N = total number of replicas
- Configurability: Can tune for stronger or weaker consistency as needed.
- Drawbacks: Requires more servers, and even numbers can lead to split-brain scenarios.
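Since the quorum rule is just an arithmetic condition on the chosen replica counts, a tiny sketch makes the tradeoff concrete (the values below are illustrative):

```python
# Quorum rule: overlapping read and write sets guarantee a read sees the latest write.
def is_strongly_consistent(n: int, r: int, w: int) -> bool:
    return r + w > n

print(is_strongly_consistent(n=3, r=2, w=2))  # True  -> read and write quorums must overlap
print(is_strongly_consistent(n=3, r=1, w=1))  # False -> a read may miss the latest write
```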
Data Consistency Levels Tradeoff Table
| Level | Consistency Level | Efficiency |
|---|---|---|
| Linearizable | High | Low |
| Eventual | Lowest | Highest |
| Causal | Between Eventual & Linearizable | Medium |
| Quorum | Configurable | Configurable |
Linearizability in Databases
What is Linearizability?
Linearizability is a strong consistency guarantee for distributed systems and databases. It requires that all operations (reads and writes) on a data item appear to execute atomically and in some real-time, total order—that is, every operation takes effect instantly between its invocation and its response.
Key Properties
- Real-Time Ordering: If operation A completes before operation B starts, then A must be visible to B and all subsequent operations.
- Atomicity: Every operation appears to take effect instantly at a single, unique point within its begin/end interval.
- Single Global Timeline: All clients see changes in the same order; there are no “stale” reads or causal gaps.
- User Intuition: Each read always returns the most recent written value (as observed in real time), which is what most users expect in strong consistency systems.
Visual Example
Suppose two clients interact with a key-value store:
- At 08:00, Client 1 writes `X = 10`.
- At 08:01, Client 2 writes `X = 20`.
- At 08:02, Client 3 reads `X`.
Linearizability guarantees:
Client 3’s read will definitely return 20, since the write of 20 happened before the read started.
Time axis → |---W(X=10)---|---W(X=20)---|---R(X)---|
C1 writes C2 writes C3 reads
Comparison: Linearizability vs Other Guarantees
| Guarantee | Real-Time Order? | Atomic Visibility | Possible Stale Reads? |
|---|---|---|---|
| Linearizability | Yes | Yes | No |
| Sequential Consistency | No | Yes | Yes (within session) |
| Eventual Consistency | No | No | Yes |
- Linearizability is the strictest; all reads/writes appear instantly at a single point on the timeline.
Where is Linearizability Useful?
- Financial transactions (bank transfers)
- Distributed locking and coordination services
- User-facing systems where stale reads cannot be tolerated
- Leader election and consensus algorithms (e.g., Raft, Paxos)
Trade-Offs
- Linearizability often comes with higher latency and lower availability during partitions or node failures (see CAP theorem).
- Many cloud or distributed databases provide linearizable consistency as an option, but may default to weaker consistency for higher performance/scalability.
Replication Topologies
Replication topologies define how data is copied and synchronized across nodes in a distributed database system.
1. Single-Leader Replication (Master-Slave / Primary-Secondary)
- One node is designated as the leader (or master/primary) and handles all write operations.
- Multiple follower nodes (slaves/secondaries) replicate data from the leader and can serve read requests.
- In the event of leader failure, a follower is promoted to leader.
Pros:
- Strong consistency: All writes go through one node, simplifying state management and eliminating write conflicts.
- Simpler conflict resolution: Easy to reason about ordering of updates.
- High read scalability: Followers can serve read-only traffic, easing load on the leader.
- Operational simplicity: More straightforward to implement and manage.
Cons:
- Single point of failure: If the leader fails, writes are unavailable until failover.
- Write bottleneck: Write throughput is limited to the leader’s capacity.
- Potential lag: Followers may be somewhat behind the leader (especially with asynchronous replication).
- Failover complexity: Promoting a new leader adds operational challenges.
How is a New Leader Elected When the Leader Fails?
Single-leader replication relies on exactly one node (the leader/master/primary) for all writes. If this node fails, the system uses a standardized "failover" process to restore availability:
1. Detecting Leader Failure
- Followers or external monitors (like health checks or heartbeats) detect that the leader is unresponsive.
- A time-out is used: if the leader doesn’t respond within a set period (e.g., 30 sec), it is presumed dead.
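A simplified sketch of this timeout-based detection; the 30-second threshold and the detector interface are illustrative assumptions (real systems usually run heartbeats over the replication channel or an external monitor).

```python
import time

HEARTBEAT_TIMEOUT = 30.0   # seconds without a heartbeat before the leader is presumed dead

class FailureDetector:
    def __init__(self):
        self.last_heartbeat = time.monotonic()

    def on_heartbeat(self) -> None:
        """Called whenever a heartbeat from the leader arrives."""
        self.last_heartbeat = time.monotonic()

    def leader_presumed_dead(self) -> bool:
        """True once no heartbeat has been seen within the timeout window."""
        return time.monotonic() - self.last_heartbeat > HEARTBEAT_TIMEOUT
```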
2. Leader Election
- Once failure is detected, remaining followers coordinate to elect a new leader.
- The most common algorithm is to select the follower with the most up-to-date data log, to minimize data loss.
- This ensures that committed but not-yet-replicated data is preserved as much as possible.
- Election can be handled via:
- Simple coordination (e.g., by priority/ID)
- Consensus protocols like Raft or Paxos for distributed agreement
- External systems (e.g., ZooKeeper) for coordination and atomic reassignment
3. Reconfiguration
- The cluster is updated:
- Clients are notified to send new writes to the elected leader.
- Followers "follow" the new leader for replication.
- If the old leader comes back, it rejoins as a follower to prevent "split-brain" (two leaders at once).
Main Challenges in Leader Election and Failover
a. Detection Delay and False Positives
- Accurate, fast failure detection is difficult (network delays, slow nodes can be mistaken for failure).
- Too-short timeouts may cause unnecessary failovers ("flapping"), triggering repeated elections and instability.
b. Consensus Complexity
- All (or most) nodes must agree on the new leader; this is a "consensus" problem, especially tricky during network partitions or message loss.
- Algorithms such as Raft/Paxos or coordination services (ZooKeeper) are established solutions, but add complexity and require careful tuning.
c. Data Consistency
- Choosing a follower that is out-of-date risks data loss (writes confirmed by old leader but not yet replicated are lost).
- System must choose as leader the follower with the most complete data log, and must resolve any partial transactions safely.
d. Split-Brain Scenario
- If two nodes believe they’re both leaders (because of a network partition, etc.), both can accept writes, causing data divergence and conflict. Robust leader election algorithms must prevent this.
e. Client Reconfiguration
- Clients must reliably and quickly discover and start sending writes to the new leader; lag can cause failed writes or inconsistent system state until all clients update their configurations.
f. Service Downtime
- Time to detect failure, run consensus, and promote a new leader = a period where no writes are possible.
- Goal is to minimize downtime, but cannot eliminate it completely.
2. Multi-Leader Replication (Multi-Master / Active-Active)
- Multiple nodes act as leaders; each can independently accept write operations.
- These leaders replicate data among themselves.
- Network topology can be circular, star, or all-to-all. Topology impacts fault tolerance and replication overhead.
Pros:
- High availability: System remains writable even if some leaders fail.
- Low latency for writes: Writes can be accepted at the nearest leader (useful for geo-distributed systems).
- Geographic distribution: Clients connect to the closest leader, benefiting global applications.
- No single write bottleneck: Write load is distributed.
Cons:
- Conflict resolution required: Concurrent updates to the same data from different leaders must be resolved (e.g., last-writer-wins, custom rules).
- Implementation complexity: Conflict detection and resolution logic is required.
- Increased chance of inconsistencies: Temporary data divergence is possible before conflicts are reconciled.
- Topology management: Certain setups (circular/star) may have single points of failure in the data propagation path if not well designed.
Common Conflict Resolution Strategies
When multiple leader nodes (servers) accept simultaneous writes to the same data in different locations, conflicts can arise when updates are exchanged and merged. Handling these conflicts efficiently and predictably is a key challenge:
- Last Write Wins (LWW):
  - Each write is assigned a unique identifier (e.g., timestamp); the version with the highest ID is kept.
  - Pros: Simple to implement.
  - Cons: Risks losing valuable updates as earlier writes are discarded, potentially leading to data loss.
- Multi-Value Conflict Resolution:
  - Both conflicting versions are stored.
  - The application must later merge or choose between them, possibly involving user input or custom merging logic.
- Custom/Application-Level Logic:
  - Developers provide conflict resolution code, either:
    - On Write: The conflict handler resolves conflicts during replication.
    - On Read: All versions are returned, and the application or user merges data.
- Avoidance:
  - Partition the data space so that all writes for a given record go to one specific leader, reducing conflict risk.
CRDTs: Conflict-free Replicated Data Types
CRDTs are special data structures that enable multiple nodes to update data independently and concurrently. They are designed so that all copies converge to the same, correct state automatically, with no need for central coordination or manual conflict resolution.
- Mergeability: When two replicas diverge (make different changes), merging their states always results in the same eventual value, regardless of the order of operations.
- Operations: All operations are commutative (order doesn't matter), associative, and idempotent (applying multiple times makes no difference).
- Types:
- Counters: Increments can be merged from multiple places (e.g., distributed "likes" count).
- Sets: Add, remove, or update elements, ensuring all valid changes appear in the merged result.
- Maps, Lists, Registers: More complex types for real-world apps (e.g., collaborative documents).
- Use Cases: Distributed collaborative editing, messaging, notes apps, cloud-based multiplayer games, and scalable NoSQL databases.
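A minimal sketch of one of the simplest CRDTs, a grow-only counter (G-Counter): each node increments only its own slot, and merging takes the element-wise maximum, which makes merges commutative, associative, and idempotent. Node IDs here are illustrative.

```python
class GCounter:
    """Grow-only counter CRDT: per-node counts, merged by element-wise max."""
    def __init__(self, node_id: str):
        self.node_id = node_id
        self.counts: dict[str, int] = {}

    def increment(self, n: int = 1) -> None:
        self.counts[self.node_id] = self.counts.get(self.node_id, 0) + n

    def value(self) -> int:
        return sum(self.counts.values())

    def merge(self, other: "GCounter") -> None:
        for node, count in other.counts.items():
            self.counts[node] = max(self.counts.get(node, 0), count)

# Two replicas diverge, then converge after merging in either order.
a, b = GCounter("A"), GCounter("B")
a.increment(3); b.increment(2)
a.merge(b); b.merge(a)
assert a.value() == b.value() == 5
```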
Lamport Clocks
What is a Lamport Clock?
A Lamport clock is a simple, monotonic counter that helps order events in a distributed system.
How it works:
- Every node keeps a local counter.
- When a node performs an operation (local event), it increments its counter.
- When a node sends a message (e.g., a replicated write), it includes its counter value.
- When a node receives a message, it updates its own counter to be one higher than the maximum of its own and the incoming clock.
- This allows the system to order all events in a way that's consistent with possible causality, even if events happen concurrently on different nodes.
Example:
- Node A (counter 5) writes and sends to Node B.
- Node B (counter 3) receives the message and sets its counter to max(3, 5) + 1 = 6.
Notes:
- With every write carrying a Lamport timestamp, conflicts can be resolved by comparing timestamps; the higher timestamp "wins" (sometimes called last-write-wins or LWW).
- This mechanism ensures all nodes converge to the same state, but it does not guarantee true linearizability, because Lamport clocks provide only logical (not real-time) ordering. Physical or hybrid clocks are needed for real-time order.
- Lamport clocks alone do not capture real-time (wall-clock) order; two concurrent writes with different values may need additional tie-breakers (e.g., node ID).
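A minimal sketch of the counter rules described above; the node setup at the bottom reproduces the A/B example and is illustrative only.

```python
class LamportClock:
    def __init__(self):
        self.time = 0

    def local_event(self) -> int:
        self.time += 1
        return self.time

    def send(self) -> int:
        """Increment and return the timestamp to attach to an outgoing message."""
        self.time += 1
        return self.time

    def receive(self, msg_time: int) -> int:
        """On receive, jump past both the local clock and the incoming clock."""
        self.time = max(self.time, msg_time) + 1
        return self.time

# Node A (counter 5) sends to Node B (counter 3), as in the example above.
a, b = LamportClock(), LamportClock()
a.time, b.time = 4, 3
ts = a.send()          # A increments to 5 and includes it in the message
print(b.receive(ts))   # B sets max(3, 5) + 1 = 6
```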
Why are real (physical) timestamps not reliable?
- Physical clocks on separate machines almost never remain perfectly synchronized, even with protocols like NTP. Small skews (milliseconds) can result in incorrect ordering of updates—especially during network partitions, heavy loads, or following system clock corrections.
- Relying solely on physical clocks can create “time gaps” where a write that has happened is not yet visible to other nodes due to network delays, allowing later writes (with higher timestamps) to be accepted and possibly overwriting genuine updates.
Merkle Trees
Merkle Trees are hash-based tree data structures used for efficient content verification and synchronization in distributed systems, blockchains, and databases.
- Structure:
- Leaf nodes: Contain hashes of individual data blocks (e.g., database rows, transactions).
- Internal nodes: Each is a hash of its children's hashes.
- Root node (Merkle root): Contains a single hash summarizing all the data beneath it.
- Properties:
- Any change to a single data block changes its hash, which affects all parent hashes, up to the root.
- To compare/synchronize two datasets, only the minimal set of inconsistent blocks (where hashes differ) need to be checked, rather than the whole dataset.
- Applications:
- Used to efficiently detect and sync only changed data in replication or backup scenarios.
- Blockchains use Merkle roots to verify the integrity of entire sets of transactions within each block.
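A compact sketch of building a Merkle root and using it to detect divergence between two replicas' data blocks; the hashing and pairing scheme is simplified for illustration (odd levels duplicate their last hash).

```python
import hashlib

def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(blocks: list[bytes]) -> bytes:
    """Hash the leaves, then repeatedly hash pairs of child hashes up to a single root."""
    level = [_h(b) for b in blocks] or [_h(b"")]
    while len(level) > 1:
        if len(level) % 2:                 # duplicate the last hash if the level is odd
            level.append(level[-1])
        level = [_h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

replica_a = [b"row1", b"row2", b"row3", b"row4"]
replica_b = [b"row1", b"row2", b"row3-changed", b"row4"]

# Differing roots mean the replicas diverged; walking down the tree localizes the changed block.
print(merkle_root(replica_a) != merkle_root(replica_b))  # True
```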
Summary Table
| Topic | Purpose / Role | Key Advantage |
|---|---|---|
| Conflict Handling in Multi-Leader | Ensures consistency when updates collide | Multiple strategies suit various needs |
| CRDTs | Merges concurrent updates automatically, no user conflict resolution | Guaranteed eventual convergence, high availability |
| Merkle Trees | Verifies/synchronizes large datasets efficiently and securely | Fast change detection, minimal data movement |
Note: For applications that absolutely require linearizability, multi-leader replication is generally not used, or extra coordination (like distributed locking or consensus) is added at the cost of performance and availability.
3. Leaderless Replication
- No single leader exists; all nodes are equal—any node can accept reads or writes.
- Writes are sent to several replicas, governed by quorum (N, R, W) consistency (e.g., “a write must succeed on W out of N nodes”).
- Quorum reads ensure clients get the most recent or majority-backed values.
Pros:
- Highest availability: Any node can handle requests even if others are down.
- No single point of failure: True resilience (as long as quorum is preserved).
- Horizontal scalability: Both reads and writes can be scaled by adding more nodes.
- Geographic flexibility: Well-suited for global architectures.
Cons:
- Eventual consistency (not strong): Reads may return stale data until nodes converge.
- Conflict management complexity: Requires version vectors, Merkle trees, or similar to resolve conflicts and reconcile divergent histories.
- Operational complexity: More client logic needed for things like read repair, handling conflicting writes, and node resurrection.
- Stronger consistency costs: Achieving strong consistency (by using strict quorums) can add latency, lower availability, or both.
Read Repair
- Read Repair is a mechanism in distributed databases and leaderless replication systems (like Cassandra, ScyllaDB, DynamoDB) used to detect and correct data inconsistencies among replicas during read operations.
- When a client issues a read, the system checks data from multiple replicas. If inconsistencies (stale or divergent values) are detected, the system sends updates to out-of-date replicas, ensuring they converge toward the latest, correct value.
- Common in leaderless/ eventually consistent databases: Cassandra, ScyllaDB, DynamoDB, etc.
- Used together with quorum reads/writes and background anti-entropy processes.
- Relies on version vectors, timestamps, or digests to track and merge divergent data
How Read Repair Works
- Read Request: A client reads data with a given consistency level (e.g., quorum or majority) from multiple replicas.
- Data Comparison: The database compares the data (or digests/hashes) returned from each replica.
- Detect Inconsistency: If some replicas return different versions, an inconsistency is detected.
- Repair: The system identifies the correct (latest) version (often based on version/timestamp, "last write wins", or quorum rules).
- The up-to-date value is written back to the stale replicas.
- Repairs can be blocking (client waits for the repair before getting a response) or asynchronous (repair continues after returning the response).
- Return Data: The client receives the correct, most recent value.
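A simplified sketch of this flow using last-write-wins timestamps; the `Replica` class, its `get`/`put` methods, and the version scheme are assumptions standing in for real replica clients.

```python
from dataclasses import dataclass

@dataclass
class Versioned:
    value: str
    timestamp: int          # e.g., a Lamport or hybrid timestamp

class Replica:
    """Stand-in for a replica client with a simple key-value store."""
    def __init__(self):
        self.store: dict[str, Versioned] = {}
    def get(self, key: str) -> Versioned | None:
        return self.store.get(key)
    def put(self, key: str, v: Versioned) -> None:
        self.store[key] = v

def quorum_read_with_repair(key: str, replicas: list[Replica]) -> Versioned | None:
    """Read from all given replicas, pick the newest version, and repair stale copies."""
    responses = [(r, r.get(key)) for r in replicas]
    known = [v for _, v in responses if v is not None]
    if not known:
        return None
    latest = max(known, key=lambda v: v.timestamp)           # "last write wins"
    for replica, v in responses:
        if v is None or v.timestamp < latest.timestamp:
            replica.put(key, latest)                          # write back to the stale replica
    return latest
```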
When and Why Is Read Repair Used?
- Triggered during reads: Only replicas involved in a specific read operation are repaired (incrementally; the rest are left for later).
- Complements background anti-entropy repairs: Read repair does not scan all data but is opportunistically triggered by real reads, reducing the need for heavy background repair jobs.
- Improves eventual consistency: Since leaderless databases can diverge due to network partitions or delayed replication, read repair helps replicas converge on the correct state over time.
Types of Read Repair
- Blocking (Foreground) Read Repair: The response to the client is delayed until repairs are completed for the data involved in the read.
- Background (Asynchronous) Read Repair: The client quickly receives data, and repairs to out-of-date replicas are done in the background to avoid extra read latency.
Pros
- Enhances data consistency incrementally, without needing heavy system-wide repair.
- Reduces time windows of inconsistency after network partitions or failed writes.
- Improves reliability and user experience by repairing as data is accessed.
Cons
- Only fixes data touched by reads; rarely/never-read data may remain inconsistent unless proactively repaired.
- Can add latency to reads if repairs are blocking/synchronous.
- Not a substitute for full (anti-entropy) background repair when many divergent replicas exist.
Visual Comparison Table
| Topology | Who Writes? | Scalability | Consistency | Fault Tolerance | Complexity | Example Databases |
|---|---|---|---|---|---|---|
| Single-Leader | Only the leader node | High (reads only) | Strong | OK (leader failover) | Low | MySQL, PostgreSQL, MongoDB, SQL Server |
| Multi-Leader | Any leader node | High (reads/writes) | Eventual / Custom | High | Medium-High | MySQL (Group Replication), MariaDB Galera, CouchDB, AWS Aurora (multi-master) |
| Leaderless | Any node | Highest (R/W) | Eventual* | Highest | High | Cassandra, DynamoDB, Riak |
*Note: Consistency in leaderless setups can be customized; strong consistency can be achieved at the cost of read/write performance or availability.
Quick Usage Guidance
- Single-Leader: Use when you need simple, strong consistency, manageable failovers, and high read scalability.
- Multi-Leader: Best for global low-latency writes, office branch synchronization, collaborative/real-time apps, but requires careful conflict handling.
- Leaderless: For ultra-high availability, no single point of failure, and massive scale (e.g., IoT, CDN, big data apps)—but with eventual consistency and extra complexity.
Synchronization Between Leader and Follower Nodes in Database Replication
1. Change Propagation
- Writes happen on the leader: All client writes (inserts, updates, deletes) are sent to the leader node only.
- Change log: The leader records these changes in a replication log (also called write-ahead log, binlog, or change stream) as soon as it commits the change to its own data store.
- Sending changes: After committing, the leader sends the same changes (or the log entries) to all followers, so they can apply the exact sequence of operations.
2. Replication Methods
a. Synchronous Replication
- Process: Leader waits for followers to acknowledge/confirm that they’ve applied the change before confirming success to the client.
- Pros: Followers are always up-to-date; system is strongly consistent.
- Cons: Increased write latency (system stalls if followers are slow or offline); rarely are all followers set to synchronous, usually just one “synchronous” follower for high durability.
- Typical setup: Used for mission-critical data needing minimal lag.
b. Asynchronous Replication
- Process: Leader sends changes to followers but does NOT wait for acknowledgments; confirms success to the client once change is written locally.
- Pros: Fast writes, higher throughput.
- Cons: Followers can lag behind, leading to replication lag and brief inconsistency if leader fails and a lagging follower is promoted.
- Typical setup: Default for most web-scale databases; most followers are asynchronous for scalability.
c. Semi-Synchronous Replication
- Hybrid: One (or more) follower(s) is synchronous, the rest are asynchronous. Guarantees data is in at least two places before success is reported.
- Benefit: Balances safety and performance.
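A simplified sketch showing that the three modes differ mainly in how many follower acknowledgments the leader waits for before confirming a write; the `Follower` stub and its methods are assumptions, and real systems wait concurrently rather than in a loop.

```python
class Follower:
    """Stand-in for a replica connection; a real follower applies the log entry, then acks."""
    def send(self, entry: dict) -> None:
        self._last = entry
    def wait_for_ack(self, timeout: float) -> bool:
        return True   # stub: pretend the follower applied the entry within the timeout

def replicate(entry: dict, followers: list[Follower], required_acks: int) -> bool:
    """Leader has already written locally; confirm to the client once enough followers ack.
    required_acks = 0 -> asynchronous, 1 -> semi-synchronous, len(followers) -> synchronous."""
    for f in followers:
        f.send(entry)                              # ship the change (WAL/binlog record)
    acks = 0
    for f in followers:
        if acks >= required_acks:
            break
        if f.wait_for_ack(timeout=1.0):
            acks += 1
    return acks >= required_acks

print(replicate({"op": "UPDATE", "id": 123}, [Follower(), Follower()], required_acks=1))  # True
```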
3. Application of Replicated Changes
- Log Positioning: Each follower knows the log sequence number or offset up to which it has processed changes (e.g., LSN in PostgreSQL, binlog coordinates in MySQL).
- Pull-based Synchronization: Followers typically pull new log entries from the leader starting from their last processed point.
- Exact Order: Followers apply changes in the same order as processed on the leader, ensuring consistency.
4. Setting Up New Followers or Catching Up Lagging Followers
- Snapshot: When a new follower is added, it receives a consistent snapshot of the leader's data.
- Replay Logs: After the snapshot, the follower requests and applies all changes that happened since the snapshot to catch up fully.
- Continuous Sync: Once synced, the follower starts applying new real-time changes from the replication log.
5. Special Cases: Log Shipping
- Some systems (e.g., SQL Server, PostgreSQL) use log shipping: the leader ships transaction log files to followers, who replay these logs to apply data modifications, often with a delay or schedule for disaster recovery scenarios.
6. Failure & Resynchronization
- Follower failure: If a follower goes offline, on recovery, it requests all missed changes since its last sync position, replays them, and then rejoins real-time replication.
- Leader failure: Remaining followers with the most up-to-date data can be elected as the new leader, using the latest log position to minimize data loss.
7. Heartbeat and Monitoring
- Leader and followers often exchange periodic heartbeat messages (e.g., Redis, MySQL) to confirm each other’s availability and track replication health.
- Followers report their replication position; the leader can identify lagging nodes and handle failover when needed.
Summary Table
| Process | Description |
|---|---|
| Change log/stream | Leader records all writes in an ordered log |
| Log shipping/replication | Followers fetch (or receive) change logs and apply in order |
| Synchronous replication | Leader waits for follower confirmation before acknowledging client |
| Asynchronous replication | Leader does NOT wait; followers catch up in background |
| Semi-synchronous | At least one follower is sync; others async |
| Recovery/catch-up | Followers replay missed logs on rejoin to become consistent |
Distributed System Protocols
Two-Phase Commit (2PC) Explained
Two-Phase Commit (2PC) is a distributed transaction protocol used to ensure that a transaction spanning multiple databases or nodes is either committed everywhere or rolled back everywhere, so all nodes remain consistent.
1. Coordinator and Participants
- Coordinator: A special node (often called the transaction manager) that orchestrates the commit process.
- Participants: The databases or resource managers involved in the transaction.
2. Phases
Phase 1: Prepare (Voting Phase)
- The coordinator sends a “prepare to commit?” message to all participants after executing all transaction operations.
- Each participant tries to prepare to commit (by writing to a log, acquiring locks, etc).
- Each participant replies with:
- “Yes” (ready to commit), or
- “No” (cannot commit).
Phase 2: Commit (Decision Phase)
- If all participants replied “Yes”:
- The coordinator sends a “commit” message to all participants, instructing them to finalize the transaction and release locks.
- If any participant replied “No”:
- The coordinator sends a “rollback” (abort) message to all, instructing them to undo any changes.
- After committing or rolling back, each participant sends an acknowledgment to the coordinator.
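A condensed sketch of the coordinator's decision logic across the two phases; the participant interface and method names are assumptions, and a real coordinator also persists its decision to a log before starting phase 2.

```python
class Participant:
    """Stand-in for a resource manager; real ones write a prepare record and hold locks."""
    def __init__(self, can_commit: bool = True):
        self._can_commit = can_commit
    def prepare(self) -> bool:
        return self._can_commit      # the "Yes"/"No" vote
    def commit(self) -> None:
        pass                         # finalize the transaction and release locks
    def rollback(self) -> None:
        pass                         # undo changes and release locks

def two_phase_commit(participants: list[Participant]) -> bool:
    # Phase 1 (voting): ask every participant to prepare.
    votes = [p.prepare() for p in participants]
    # Phase 2 (decision): commit only if every vote was "Yes"; otherwise roll back everywhere.
    if all(votes):
        for p in participants:
            p.commit()
        return True
    for p in participants:
        p.rollback()
    return False

print(two_phase_commit([Participant(), Participant()]))        # True  (committed everywhere)
print(two_phase_commit([Participant(), Participant(False)]))   # False (rolled back everywhere)
```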
Properties
- Atomicity: The transaction either completes everywhere or nowhere—no partial commits.
- Consistency: Keeps distributed databases consistent.
- Durability: Uses logs (often Write-Ahead Log), so the decision survives crashes and can be recovered.
Limitations
- If the coordinator fails after collecting votes and before sending the commit/abort message, participants may be left waiting indefinitely (“blocking”).
- Not optimal for scalability or fault-tolerance in large or high-latency systems. Some modern systems use alternatives or enhancements (e.g., Three-Phase Commit, Saga patterns).
Summary
Two-Phase Commit ensures that distributed transactions are all-or-nothing and all involved nodes end up in a consistent state. However, it is prone to blocking under certain failure conditions.
Distributed Systems Consensus Algorithms
Raft Consensus Algorithm Explained
- Raft is a leader-based consensus algorithm widely used in distributed systems to ensure that all nodes agree on the same sequence of operations, even in the presence of failures.
- It is used for building fault-tolerant, strongly consistent distributed systems, such as replicated databases.
Key Ideas and Workflow
1. Roles
Each server in the cluster can be in one of three roles at any time:
- Leader: Handles all client requests that modify data and manages log replication to followers.
- Follower: Passive; responds to requests from the leader or candidates.
- Candidate: Seeks election as leader if it doesn't hear from one.
2. Terms
- Time is divided into terms (numbered intervals).
- Each term starts with a leader election.
- If a leader is chosen, that term continues until the next election is needed.
3. Leader Election
- If a follower doesn't hear from a leader within an election timeout, it becomes a candidate, increments its term, and asks for votes.
- Servers vote for the first candidate they hear from in the term.
- If a candidate gets votes from a majority, it becomes the new leader.
- If there’s a tie (split vote), a new term is started and the process repeats (randomized timeouts help avoid repeated ties).
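A highly simplified sketch of the majority-vote rule and randomized timeout used in leader election; it ignores the log up-to-dateness check, RPCs, and term bookkeeping that real Raft requires, and all names are illustrative.

```python
import random

def election_timeout() -> float:
    """Randomized timeout (e.g., 150-300 ms) reduces the chance of repeated split votes."""
    return random.uniform(0.150, 0.300)

def wins_election(candidate_id: int, cluster: list[int], votes_granted: set[int]) -> bool:
    """A candidate wins the term if it receives votes from a strict majority of the cluster."""
    votes = {candidate_id} | votes_granted        # a candidate always votes for itself
    return len(votes) > len(cluster) // 2

cluster = [1, 2, 3, 4, 5]
print(wins_election(candidate_id=1, cluster=cluster, votes_granted={2, 3}))  # True  (3 of 5)
print(wins_election(candidate_id=1, cluster=cluster, votes_granted={2}))     # False (2 of 5)
```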
4. Log Replication
- Only the leader can accept writes. It appends the operation to its log and sends AppendEntries (log updates) to followers.
- All log entries are stored locally on every server in the cluster—including leaders and followers. Each Raft server maintains its own persistent log (usually on disk).
- When a client requests to perform an operation (such as a write), the current leader appends a new entry to its own log and then sends AppendEntries RPCs to followers, asking them to append the same entries to their logs.
- Once a log entry is confirmed by a majority, it’s considered committed. All followers then apply it to their local state.
- Followers copy the leader’s log to stay up to date, ensuring strong consistency.
- If the leader crashes before successfully replicating the new entry to a majority, the entry may not be considered committed and might be overwritten by a new leader during the recovery/election process.
- But the uncommitted entry remains available on the leader’s local persistent log. As part of leader election and log reconciliation, Raft ensures that any "dangling" or uncommitted entries that are not part of the committed majority will be overwritten or rolled back as clusters converge on a consistent log after a new leader is elected.
5. Safety
- Once a log entry is committed, Raft guarantees it will not be lost, even if the leader crashes and another leader is elected.
- Only a node with all committed entries can become leader, ensuring new leaders can’t “forget” already-committed operations.
6. Membership Changes
- Raft safely adds/removes nodes using a two-step process (joint consensus), so there’s no ambiguity or data loss during reconfiguration.
Advantages
- Simplicity: Easier to understand and implement than Paxos.
- Linearizability: Every operation appears atomic and totally ordered (single-key strong consistency).
- Fault Tolerance: A cluster of N servers can tolerate up to (N-1)/2 failures. For example, a 5-node cluster can remain available if 2 servers crash.
Adoption
- Used in: Consul, etcd, TiDB, YugabyteDB, and many other distributed databases and key-value stores.
Summary
Raft brings reliability and strong consistency to distributed systems through clear leader election, log replication, and robust safety mechanisms, making it a de-facto standard for consensus in modern distributed architectures.