Database?
- An organized collection of structured data, typically stored electronically in computer systems.
- Most common databases use a table-based model (rows & columns) for efficient processing, querying, and management.
- Diagram:

| ID | Name  | Email     |
|---|---|---|
| 1  | Alice | alice@... |
| 2  | Bob   | bob@...   |

- SQL (Structured Query Language): The dominant language for querying, manipulating, and defining data in relational databases.
- NoSQL databases (e.g., MongoDB) emerged to handle large volumes of unstructured/semi-structured data with high speed, flexibility, and scalability.
Core Concepts
Database vs. File System: Extra Features Offered by Databases
When compared to using a direct file system for storing and retrieving data, databases provide numerous advanced features that improve data management, efficiency, and integrity.
- Data Redundancy Control and Integrity
  - Support for unique constraints and primary keys ensures no duplicate data.
  - Normalization techniques minimize redundant data storage.
  - Referential integrity via foreign keys prevents "orphan" or inconsistent data entries.
- Structured Query Language (SQL) / Query Support
  - Powerful and flexible query languages (like SQL/MQL) allow complex searching, filtering, aggregation, and joining data across tables—something not possible with basic file-system operations.
- Indexing for Faster Querying
  - Automatic/managed indexes (e.g., B-Tree, hash, full-text) drastically speed up data searches compared to brute-force file scans.
- Transaction Support and Atomicity
  - ACID properties (Atomicity, Consistency, Isolation, Durability) guarantee reliable multi-step updates (transactions) and prevent partial changes due to errors.
  - Rollback and commit mechanisms for safe data modification.
- Concurrency and Multi-user Support
  - Mechanisms like locking, MVCC (Multi-Version Concurrency Control), and transaction isolation enable safe access for multiple simultaneous users/processes, preventing overwrites and conflicts.
- Automatic Crash Recovery and Durability
  - Write-Ahead Logging (WAL) and transaction logs enable recovery to the last consistent state after system failure.
- Data Abstraction and Logical Views
  - Schemas, views, and data abstraction layers allow users to see data in logical formats without concern for how it’s physically stored.
- Data Validation and Access Controls
  - Type validation, default values, and custom triggers enforce data quality.
  - Robust user authentication, authorization, and role-based permissions restrict who can do what.
- Backup, Restore, and Replication
  - Built-in tools for automatic backup and restore.
  - Replication and failover features for high availability and disaster recovery.
- Joins and Relationships
  - Ability to create, manage, and query relationships across multiple data sets (tables)—unlike independent files.
- Data Consistency Guarantees
  - Built-in mechanisms for consistency (e.g., constraints, triggers, transactions) ensure that all data follows rules and business logic.
| Feature | Database | File System |
|---|---|---|
| Indexes/fast search | ✅ | ❌ |
| ACID transactions | ✅ | ❌ |
| Redundancy control | ✅ | ❌ |
| Query language (SQL/MQL) | ✅ | ❌ |
| Relationships/joins | ✅ | ❌ |
| Schema/data typing | ✅ | ❌ |
| Access control/auth | ✅ | Basic (weak) |
| Backup/restore tools | ✅ | Manual |
| Data validation | ✅ | ❌ |
| Multi-user concurrency | ✅ | Basic/file locks |
| Crash recovery | ✅ | Minimal |
How Databases Store Data Persistently on Disk
Databases store data persistently on disk using specialized components called storage engines which manage the physical layout, retrieval, and durability of data.
- Storage Engine:
  - Core component responsible for managing how data is stored, retrieved, and indexed on disk or in memory.
  - Each storage engine uses specific optimizations based on workload type (e.g., transactional, read-heavy).
  - Modified (dirty) pages are flushed asynchronously to disk files by the storage engine.
- Data Organization:
  - Data is stored in files on disk, which the operating system treats as opaque byte streams.
  - Inside these files, data is organized into fixed-size units called pages or blocks (commonly around 4 KB).
  - Pages hold rows/records in a format specific to the DBMS.
  - Files on disk represent tablespaces, indexes, and internal system metadata.
- Indexes (like B-trees) are also stored as disk structures, organized into pages for efficient querying and updating.
- Buffer Pool: An in-memory cache in the DBMS that holds pages read from disk. When the database engine needs data, it loads the corresponding pages into memory from disk via the buffer pool, performs operations, then eventually writes modified pages back to disk (a minimal cache sketch follows this list).
- Write-Ahead Logging (WAL) / Transaction Logs:
  - To guarantee durability and crash recovery, changes are first recorded in a log before being applied to the main data files.
  - This ensures that if a crash occurs, the DBMS can replay the logs to restore consistency.
  - Upon startup or recovery, the database engine reads from data files and transaction logs to restore its in-memory state.
- Atomic Writes and Page Management:
  - The hardware guarantees atomic writes at the page/block level (e.g., 4 KB).
  - If a database page spans multiple hardware pages, the DBMS handles partial writes carefully to avoid corruption.
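The buffer pool described above behaves like a bounded page cache with an eviction policy. Below is a minimal, illustrative Python sketch of an LRU-style buffer pool; the page size, the `read_page_from_disk` stub, and the class name are invented for this example and do not mirror any particular DBMS.

```python
from collections import OrderedDict

PAGE_SIZE = 4096  # illustrative fixed page size in bytes

def read_page_from_disk(page_id: int) -> bytes:
    # Stand-in for a real disk read of one fixed-size page.
    return bytes(PAGE_SIZE)

class BufferPool:
    """Toy LRU page cache: keeps at most `capacity` pages in memory."""
    def __init__(self, capacity: int = 3):
        self.capacity = capacity
        self.pages = OrderedDict()   # page_id -> page bytes, in LRU order
        self.dirty = set()           # page_ids modified in memory

    def get_page(self, page_id: int) -> bytes:
        if page_id in self.pages:
            self.pages.move_to_end(page_id)        # mark as most recently used
            return self.pages[page_id]
        page = read_page_from_disk(page_id)        # cache miss: load from disk
        self.pages[page_id] = page
        if len(self.pages) > self.capacity:        # evict the least recently used page
            victim, _ = self.pages.popitem(last=False)
            if victim in self.dirty:
                self.dirty.discard(victim)         # a real engine would flush it first
        return page

pool = BufferPool(capacity=2)
for pid in [1, 2, 1, 3]:          # page 2 is least recently used and gets evicted
    pool.get_page(pid)
print(list(pool.pages))           # [1, 3]
```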
Storage Engine
- Definition: Pluggable component that manages how data is stored, retrieved, and indexed on disk.
- Provides trade-offs in performance, features, and reliability (e.g., transactional vs. read-heavy workloads).
- Example: In MySQL, you can specify `ENGINE=InnoDB` or `ENGINE=MyISAM` when creating tables.
| Database/System | Storage Engine Approach | Persistence Details |
|---|---|---|
| MySQL InnoDB | Uses tablespaces with page-oriented storage. | Uses tables, indexes, and logs with WAL for crash recovery. Supports row-level locking and transactions. |
| PostgreSQL | Single unified storage engine with heap files. | Stores data and indexes as files; uses WAL for durability and MVCC for concurrency. |
| MongoDB | Document storage in BSON format on disk. | Uses data files and journal logs; supports replica sets and sharding for durability and scale. |
Additional Mechanisms
- Checkpointing: Periodically writes all in-memory changes to disk to reduce recovery time.
- Crash Recovery: Uses WAL and checkpoints to restore data to a consistent state after system failures.
- Memory Management: Storage engines manage caching and eviction policies to balance performance and durability.
Row-Oriented Storage
- Data is stored row by row: All attribute values for a single record are stored together on disk.
- Storage: `Alice,23,Mumbai | Bob,28,Delhi | ...`
Pros
- Efficient for OLTP (Online Transaction Processing) workloads:
- Reading/writing entire records is fast (e.g., retrieving or updating a customer profile).
- Easy to insert/update/delete full records.
- Great for scenarios where queries touch most or all columns of a row.
Cons
- Inefficient for analytics: Scanning a single column requires loading all column data (wastes I/O bandwidth).
- Compression less effective: Since different data types and ranges are mixed together.
Use Cases
- Transaction-heavy applications:
- Banking systems
- E-commerce order management
- User profile management (retrieve full user records)
- Any CRUD (Create, Read, Update, Delete) heavy workload
Column-Oriented Storage
- Data is stored column by column: All values for a given column are stored together on disk, separate from other columns.
- Storage:
  - `name: Alice, Bob, ...`
  - `age: 23, 28, ...`
  - `city: Mumbai, Delhi, ...`
Pros
- Efficient for OLAP (Online Analytical Processing) workloads:
- Fast aggregation (e.g., sum, avg) on single columns — no need to load unnecessary data.
- Can scan only the needed columns, skipping others, reducing disk I/O.
- Highly compressible: Similar data types in a column compress extremely well (e.g., bitmaps, run-length encoding).
- Faster columnar analytics and reports.
Cons
- Slower for row-level operations: Reading/updating a full record requires multiple disk reads (one per column).
- Not ideal for heavy transactional or highly selective single-record updates/inserts.
Use Cases
- Analytical and reporting applications:
- Data warehouses, business intelligence
- Real-time analytics and dashboards
- Large-scale time series engines
- Log analytics (e.g., clickstream, telemetry)
- Any workload where queries read few columns but many rows
Comparison Table
| Feature/Aspect | Row-Oriented Storage | Column-Oriented Storage |
|---|---|---|
| Data layout | Row by row | Column by column |
| Best for | OLTP (transactions, CRUD) | OLAP (analytics, aggregates, scan) |
| Full row read/write | Fast | Slow (needs to stitch columns) |
| Single column read | Slow (load all columns per row) | Very fast (scan only relevant column) |
| Compression | Moderate | Very high (similar values) |
| Example DBs | MySQL, PostgreSQL, Oracle | Apache Parquet, Amazon Redshift, ClickHouse, Google BigQuery, Apache Cassandra (hybrid) |
Mixed/Hybrid Approaches
- Some modern systems (e.g., SQL Server with columnstore indexes, Cassandra) allow both row and column-oriented storage or indexes to optimize mixed-use workloads.
Summary:
- Row-oriented storage = best for transactional workloads with frequent full-row accesses.
- Column-oriented storage = best for analytical workloads with queries on selective columns across many rows.
Choose the storage model that matches your primary query and update patterns for optimal performance.
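To make the layout difference concrete, here is a small illustrative Python sketch that stores the same records row-wise and column-wise; the sample data simply extends the Alice/Bob examples above.

```python
# Row-oriented: each record's values are kept together (good for full-row reads/writes).
rows = [
    ("Alice", 23, "Mumbai"),
    ("Bob",   28, "Delhi"),
    ("Carol", 31, "Pune"),
]

# Column-oriented: all values of one attribute are kept together (good for scans/aggregates).
columns = {
    "name": ["Alice", "Bob", "Carol"],
    "age":  [23, 28, 31],
    "city": ["Mumbai", "Delhi", "Pune"],
}

# OLTP-style access: fetch one full record -> one contiguous row read.
print(rows[1])                                     # ('Bob', 28, 'Delhi')

# OLAP-style access: average age -> touches only the 'age' column, not names/cities.
print(sum(columns["age"]) / len(columns["age"]))   # 27.33...
```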
Relational Databases
1. Keys in RDBMS
Keys uniquely identify rows and establish relationships.
Primary Key
- Definition: Uniquely identifies each record in a table.
- Properties:
- Must be unique and NOT NULL.
- Only one per table.
Foreign Key
- Definition: A field (or set) in one table that refers to the primary key of another table.
- Purpose: Maintains referential integrity between tables.
Candidate Key
- Definition: Any field or minimal set of fields that can uniquely identify a record; a table can have multiple.
Alternate Key
- Definition: Candidate keys not chosen as the primary key.
Composite Key
- Definition: Combines two or more columns to uniquely identify a record.
Super Key
- Definition: Any set of attributes that uniquely identifies a record; typically a candidate key plus zero or more additional attributes.
2. Integrity Constraints
Ensure data is accurate and consistent.
- Entity Integrity: No primary key can have NULL values.
- Referential Integrity: Foreign key values must match primary key values in the referenced table or be NULL.
- Unique Constraint: Ensures all values in a column or set of columns are unique.
- Not Null Constraint: Field must have a value (cannot be NULL).
- Check Constraint: Restricts values based on a condition (e.g., `age >= 18`).
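As a concrete illustration of these constraints, the sketch below declares them on a toy SQLite schema (SQLite is used only because it ships with Python; the table and column names are made up).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")              # enable FK enforcement in SQLite

conn.execute("""
    CREATE TABLE departments (
        id   INTEGER PRIMARY KEY,                     -- entity integrity: unique, NOT NULL
        name TEXT NOT NULL UNIQUE                     -- NOT NULL + UNIQUE constraints
    )
""")
conn.execute("""
    CREATE TABLE employees (
        id      INTEGER PRIMARY KEY,
        email   TEXT NOT NULL UNIQUE,
        age     INTEGER CHECK (age >= 18),            -- CHECK constraint
        dept_id INTEGER REFERENCES departments(id)    -- referential integrity
    )
""")

conn.execute("INSERT INTO departments VALUES (1, 'Engineering')")
conn.execute("INSERT INTO employees VALUES (1, 'a@x.com', 30, 1)")

try:
    conn.execute("INSERT INTO employees VALUES (2, 'b@x.com', 15, 1)")  # violates CHECK
except sqlite3.IntegrityError as e:
    print("rejected:", e)
```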
3. ON DELETE modes
- ON DELETE CASCADE:
  - What it does: Deletes all dependent rows in the child table when the parent row is deleted.
  - Use case: Ensure referential integrity by removing related data automatically.
  - Example: In an `orders` (child) and `customers` (parent) table, deleting a customer removes all their orders.
  - SQL: `FOREIGN KEY (customer_id) REFERENCES customers(id) ON DELETE CASCADE`
  - Pros: Maintains consistency; no orphaned records.
  - Cons: Can unintentionally delete large amounts of data.
- ON DELETE SET NULL:
  - What it does: Sets the foreign key column in the child table to `NULL` when the parent row is deleted.
  - Use case: Preserve child records but remove the parent reference if optional.
  - Example: Deleting a customer sets `customer_id` to `NULL` in `orders` (if `customer_id` allows NULL).
  - SQL: `FOREIGN KEY (customer_id) REFERENCES customers(id) ON DELETE SET NULL`
  - Pros: Keeps child data; useful for optional relationships.
  - Cons: Requires a nullable foreign key; may complicate queries.
- ON DELETE SET DEFAULT:
  - What it does: Sets the foreign key column in the child table to its default value when the parent row is deleted.
  - Use case: Assign a fallback reference (e.g., a default parent ID).
  - Example: Deleting a customer sets `customer_id` in `orders` to a default customer ID (e.g., a generic "unknown" customer).
  - SQL: `FOREIGN KEY (customer_id) REFERENCES customers(id) ON DELETE SET DEFAULT`
  - Pros: Maintains a reference to a valid parent; preserves data.
  - Cons: Requires a valid default value; less common.
- ON DELETE RESTRICT (or NO ACTION):
  - What it does: Prevents deletion of the parent row if dependent rows exist in the child table.
  - Use case: Enforce strict referential integrity by blocking deletes.
  - Example: Cannot delete a customer if they have existing orders.
  - SQL: `FOREIGN KEY (customer_id) REFERENCES customers(id) ON DELETE RESTRICT`
  - Pros: Protects data integrity; prevents accidental deletions.
  - Cons: Requires manual cleanup of child rows before deleting the parent.
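For a runnable illustration of the CASCADE mode, here is a minimal SQLite-based sketch (SQLite needs `PRAGMA foreign_keys = ON`; the schema is the hypothetical customers/orders pair from the examples above).

```python
import sqlite3

# Sketch of ON DELETE CASCADE on the hypothetical customers/orders schema.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")          # SQLite enforces FKs only when enabled
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("""
    CREATE TABLE orders (
        id          INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(id) ON DELETE CASCADE,
        item        TEXT
    )
""")
conn.execute("INSERT INTO customers VALUES (1, 'Alice')")
conn.execute("INSERT INTO orders VALUES (10, 1, 'Book')")

conn.execute("DELETE FROM customers WHERE id = 1")                  # delete the parent row
print(conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0])    # 0: child rows cascaded
```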
Additional Notes
- Default Behavior: If no `ON DELETE` clause is specified, databases like MySQL default to `RESTRICT`/`NO ACTION`, while others may require explicit declaration.
- Database Support: Not all databases support all modes (e.g., `SET DEFAULT` is not supported by MySQL's InnoDB).
- Performance: `CASCADE` can be slower for large datasets; `RESTRICT`/`NO ACTION` adds checks.
- Choosing a Mode: Depends on business logic (e.g., `CASCADE` for tightly coupled data, `SET NULL` for optional links).
4. Types of Attributes
Simple Attribute
- Cannot be divided further. Example: `age`
Composite Attribute
- Can be divided. Example: `address` → `{street, city, state}`
Single-valued Attribute
- One value per entity. Example: `date_of_birth`
Multi-valued Attribute
- Multiple values per entity. Example: `phone_numbers`
Derived Attribute
- Computed from other attributes. Example: `age` (derived from `date_of_birth`)
Prime Attribute
- An attribute that is part of at least one candidate key of a relation (table).
Non-Prime Attribute
- An attribute that is not part of any candidate key.
5. Relationships in Databases
Define associations between entities.
- One-to-One (1:1): Each entity in A relates to a single entity in B. Example: Each person has one passport.
- One-to-Many (1:N): One entity in A relates to multiple entities in B. Example: One teacher teaches many students.
- Many-to-Many (M:N): Entities in A relate to multiple entities in B and vice versa. Example: Students and courses.
6. Participation & Mapping
| Concept | Description | Diagram Notation | Impact on Table Design |
|---|---|---|---|
| Total Participation | Every entity instance participates in relationship | Double line | Enforce NOT NULL on foreign key |
| Partial Participation | Some instances participate, others may not | Single line | Foreign key may be nullable |
| Entity Mapping | Entity → Table | — | Table with attributes, primary key |
| 1:N Relationship | "One" PK as FK in "Many" table | — | Foreign key in "Many" table |
| M:N Relationship | New table for relationship, includes both PKs as FKs | — | Join table with composite PK |
Normalization in Databases
What is Normalization?
- Normalization is the process of organizing data in a database to:
  - Minimize redundancy (duplicate data).
  - Remove undesirable dependencies (partial, transitive, multi-valued).
  - Improve data integrity.
- Achieved by dividing large tables into smaller, related tables and defining relationships among them.
Why is it Required?
- Prevents anomalies (insertion, update, deletion problems).
- Saves storage by removing duplicate data.
- Ensures data is logically stored and consistent.
- Simplifies maintaining, updating, and querying databases.
Normal Forms (Short Explanations with Examples)
1NF (First Normal Form)
- Rule: Each column contains only atomic (indivisible) values; no repeating groups or arrays.
- Why: Ensures each field holds single values—no sets or lists.
- Example (violates 1NF):

| Student | Subjects      |
|---|---|
| Alice   | Math, Physics |

- Fix: Split into separate rows, one subject per row.
2NF (Second Normal Form)
- Rule: Table is in 1NF AND all non-key attributes are fully functionally dependent on the entire primary key (no partial dependency).
- Partial Dependency: A non-prime attribute depends on only part (a proper subset) of a composite candidate key.
- Why: Removes partial dependency (especially in tables with composite keys).
- Example: In `Enrollment(student_id, course_id, student_name, grade)`, `student_name` depends only on `student_id`, not on the full key (`student_id`, `course_id`).
- Fix: Move `student_name` to a separate Student table.
3NF (Third Normal Form)
- Rule: Table is in 2NF AND all columns are directly dependent ONLY on the primary key (no transitive dependency).
- Why: Removes columns dependent on other non-key columns.
- Example: In `Employee(emp_id, dept_id, dept_name, emp_name)`, `dept_name` depends on `dept_id`, not directly on `emp_id`.
- Fix: Create a separate `Department(dept_id, dept_name)` table.
BCNF (Boyce-Codd Normal Form)
- Rule: Table is in 3NF AND for every (non-trivial) functional dependency X → Y, X is a super key.
- Why: Handles situations where 3NF still leaves certain types of anomalies due to composite or overlapping candidate keys.
- Example:
A table with columns (course, instructor, room) where instructors and rooms each uniquely determine the course, but neither is a full primary key. - Fix: Restructure tables so that only super keys determine other attributes.
4NF (Fourth Normal Form)
- Rule: Table is in BCNF AND no multi-valued dependencies (MVDs) exist other than a candidate key.
- Why: Prevents two or more independent multi-valued facts about an entity from coexisting in one table.
- Example:

| Student | Hobby    | Language |
|---|---|---|
| Alice   | Painting | English  |
| Alice   | Painting | French   |
| Alice   | Music    | English  |

- Hobby and Language are independent and should be split into separate tables.
Summary
| Normal Form | Main Rule | Problem Solved |
|---|---|---|
| 1NF | Atomic values, no repeating groups | No arrays/multiple values |
| 2NF | 1NF + no non-prime attribute depends on part of a candidate key | No partial dependency |
| 3NF | 2NF + no non-prime attribute depends on another non-prime attribute | No transitive (indirect) dependency |
| BCNF | Every determinant is a super key | Stronger than 3NF |
| 4NF | BCNF + No multi-valued dependencies | No multi-valued facts overlap |
Transactions and Concurrency Control in Databases
Transaction
- Definition:
A transaction is a sequence of one or more database operations (like insert, update, delete) executed as a single logical unit of work.
Example:
Transferring money from Account A to Account B requires:
1. Deduct amount from Account A.
2. Add amount to Account B.
Both actions must succeed together, or neither should occur.
ACID Properties
ACID ensures database transactions are processed reliably.
1. Atomicity
- All-or-nothing execution:
- All operations in a transaction succeed, or none do.
- If failure occurs, all interim changes are rolled back.
- Example: If failure happens between deducting from A and crediting B, the deduction is undone.
2. Consistency
- The database must always move from one valid state to another.
- All integrity constraints (e.g., unique, foreign key) must be preserved.
- If a transaction violates a constraint, it is rolled back.
3. Isolation
- Concurrent transactions do not affect each other’s intermediate state.
- Ensures the final DB state is the same as if the transactions had executed one at a time, sequentially, without interleaving.
- Prevents problems like "dirty reads," "non-repeatable reads," and "phantom reads."
- Isolation Levels:
- Read Uncommitted
- Read Committed
- Repeatable Read
- Serializable (most strict)
- Example: Two simultaneous transfers don't see each other's partial changes.
4. Durability
- Once committed, changes are permanent—even in the event of system crashes or power failures.
- Achieved via transaction logs (WAL) and secure disk writes.
- Example: After receiving a success confirmation, bank transfers can't be "lost" due to crash.
Visual Table: ACID Properties
| Property | Purpose | Guarantee | Example |
|---|---|---|---|
| Atomicity | All-or-nothing | No partial results saved | Debit & credit in a transfer happen together or not at all |
| Consistency | Preserves rules/constraints | Only valid data states | Balance never goes negative |
| Isolation | Transactions don't interfere | No dirty or partial reads | Two transfers don’t corrupt each other |
| Durability | Once done, always persisted | Survives crashes/failures | Money transfer is never lost |
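As a small runnable illustration of atomicity, the sketch below uses SQLite from Python: either both legs of a transfer commit, or the whole transaction rolls back (account names, balances, and the overdraft rule are made up for the example).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER CHECK (balance >= 0))")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("A", 100), ("B", 50)])
conn.commit()

def transfer(amount: int) -> None:
    try:
        with conn:  # context manager commits on success, rolls back on exception
            conn.execute("UPDATE accounts SET balance = balance + ? WHERE name = 'B'", (amount,))
            conn.execute("UPDATE accounts SET balance = balance - ? WHERE name = 'A'", (amount,))
    except sqlite3.IntegrityError:
        print("transfer rolled back")   # overdraft on A violates the CHECK constraint

transfer(30)    # succeeds: A=70, B=80
transfer(500)   # debit on A fails, so the already-applied credit to B is rolled back too
print(conn.execute("SELECT * FROM accounts ORDER BY name").fetchall())  # [('A', 70), ('B', 80)]
```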
Transaction Isolation Levels in Databases
Isolation levels control how and when the effects of one transaction become visible to others, balancing data consistency and concurrency.
Key Phenomena in Concurrency
- Dirty Read: Reading data modified by another transaction but not yet committed.
- Non-Repeatable Read: Reading the same row twice yields different results due to another transaction’s modifications.
- Phantom Read: A transaction re-executes a query and finds new rows (matching the criteria) added or deleted by another transaction.
Standard SQL Isolation Levels
1. Read Uncommitted
Description: - Lowest isolation. - Transactions can read uncommitted changes from others (dirty reads allowed).
Problems/Anomalies: - Dirty reads: Possible. - Non-repeatable reads: Possible. - Phantom reads: Possible.
Example: - T1 updates a row but hasn't committed. T2 reads this uncommitted value. T1 rolls back, so T2 used a value that never existed.
Advantages: - Maximum concurrency and performance. - Minimal locking and wait times.
Disadvantages: - Data can be inconsistent and unreliable. - Should be used only when accuracy is non-critical (e.g., big analytic scans).
2. Read Committed
Description: - A transaction can only read committed data. - Default in many DBMS (e.g., SQL Server, Oracle).
Problems/Anomalies: - Dirty reads: Prevented. - Non-repeatable reads: Possible. - Phantom reads: Possible.
Example: - T1 reads a row. T2 updates and commits the row. If T1 reads again, it sees a new value (non-repeatable read).
Advantages: - Prevents viewing uncommitted ("dirty") data. - Good balance between concurrency and data integrity.
Disadvantages: - Value of repeated reads can change. - Phantom rows can appear/disappear between queries.
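A minimal sketch of a non-repeatable read at this level, assuming a locally running PostgreSQL with a pre-existing `accounts(id, balance)` table, a `test` database, and the `psycopg2` package (all of these are assumptions for illustration):

```python
# Sketch only: assumes a local PostgreSQL, a database named "test", an existing table
# accounts(id int, balance int) with a row id=1, and the psycopg2 package installed.
import psycopg2

conn1 = psycopg2.connect("dbname=test")   # session running transaction T1
conn2 = psycopg2.connect("dbname=test")   # session running transaction T2
cur1, cur2 = conn1.cursor(), conn2.cursor()

cur1.execute("SET TRANSACTION ISOLATION LEVEL READ COMMITTED")  # also the default level
cur1.execute("SELECT balance FROM accounts WHERE id = 1")
first_read = cur1.fetchone()[0]           # T1's first read of the row

cur2.execute("UPDATE accounts SET balance = balance + 100 WHERE id = 1")
conn2.commit()                            # T2 commits a change to the same row

cur1.execute("SELECT balance FROM accounts WHERE id = 1")
second_read = cur1.fetchone()[0]          # T1 sees the new value: a non-repeatable read
print(first_read, second_read)            # values differ within the same transaction
conn1.commit()
```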
3. Repeatable Read
Description:
- Ensures that all rows read by a transaction remain stable and cannot be changed by other concurrent transactions until the transaction completes.
- This means every read of the same row within the transaction returns the same data (i.e., reads are repeatable).
- However, depending on the database implementation, Repeatable Read may or may not provide a full consistent snapshot of the entire database state throughout the transaction.
- In some systems, it guarantees repeatable reads only for rows already read (leaving the possibility of phantom reads), while in others (especially MVCC-based databases), it behaves more like full snapshot isolation.
- Default in MySQL InnoDB.
Problems/Anomalies: - Dirty reads: Prevented. - Non-repeatable reads: Prevented. - Phantom reads: Possible (unless further mechanisms like next-key locking are used).
Example:
- T1 reads row X. T2 attempts to update or delete row X and must wait (blocked) until T1 ends.
- T1 still may see new rows matching a WHERE clause (phantom reads) if another transaction inserts them.
Advantages: - Ensures repeated reads of the same row always return the same data. - Good for workflows needing strong consistency per row.
Disadvantages: - Higher locking overhead. - Phantom reads remain possible.
4. What is Snapshot Isolation?
- Definition:
  Snapshot Isolation is an advanced isolation level where each transaction operates on a "snapshot" (a consistent view) of the database as it existed at the start of the transaction.
- Key characteristics:
  - All reads within a transaction see only committed data as it was at the transaction’s start time — later changes by other transactions are hidden until commit.
  - Transactions do not block each other for reads/writes using heavy locks; instead, the DB engine maintains multiple versions of rows (“row versioning”) to support snapshots.
  - Write–write conflicts: If two concurrent transactions try to update the same row, only the first committer “wins” and the other is rolled back (to prevent lost updates).
  - Eliminates dirty reads, non-repeatable reads, and most phantom reads, but is not fully serializable (write skew anomalies are possible).
- Write skew is a concurrency anomaly that can occur when:
  - Two or more concurrent transactions read overlapping data (i.e., some shared rows).
  - Each transaction makes decisions based on the data it read.
  - Then, they update different (non-overlapping) parts of the data.
  - Because they work on snapshots taken at the transaction start, neither transaction sees the other's concurrent update.
  - When both commit, the overall database state can become inconsistent or violate integrity constraints that should be enforced.
- Example: Suppose
  - T1 and T2 start at the same time (their snapshot: DB state at t0).
  - T1 updates `row A`; T2 updates `row B`.
  - Both commit — allowed, as they're updating different rows.
  - But if both try to update `row A`, one will be forced to roll back.
5. Serializable
Description: - Highest (strictest) standard isolation. - Serializes access: transactions are executed with the effect as if run one after another, never concurrently.
Problems/Anomalies: - Dirty reads: Prevented. - Non-repeatable reads: Prevented. - Phantom reads: Prevented.
Example: - T1 reads all accounts with balance >₹1000. T2 cannot insert/delete qualifying rows until T1 completes.
Advantages: - Complete isolation, no concurrency anomalies at all. - Ideal for critical financial/accounting systems.
Disadvantages: - Lowest concurrency—may block or serialize many transactions, reducing performance. - Deadlocks or transaction retries can increase.
Comparison Table
| Isolation Level | Dirty Reads | Non-Repeatable Reads | Phantom Reads | Performance | Typical Use Case |
|---|---|---|---|---|---|
| Read Uncommitted | Possible | Possible | Possible | Fastest | Low-accuracy analytics |
| Read Committed | Prevented | Possible | Possible | High | General OLTP; moderate business |
| Repeatable Read | Prevented | Prevented | Possible | Moderate | Financial reads, inventory |
| Snapshot Isolation | Prevented | Prevented | Mostly prevented (write skew possible) | Moderate | High-concurrency reads without blocking writers |
| Serializable | Prevented | Prevented | Prevented | Slowest | Critical integrity/finance apps |
6. Multi-Version Concurrency Control (MVCC)
- MVCC (Multi-Version Concurrency Control) is a database concurrency control method that allows multiple transactions to access data simultaneously without interfering with each other by maintaining several versions of each data record.
- It is widely used in modern relational (e.g., PostgreSQL, MySQL/InnoDB) and NoSQL databases to maximize concurrent access and system throughput.
- It is foundational for modern high-traffic, multi-user systems.
How MVCC Works: Core Principles
- Versioning of Records:
  - Every row (record) in the database has multiple versions, each tagged with metadata (like a transaction ID, timestamp, etc.).
  - When a transaction changes a row, the DBMS creates a new version rather than modifying the row in place.
- Snapshots for Transactions:
  - When a transaction begins, it sees a snapshot of the database reflecting the committed state at that start moment.
  - Reads are always performed on the snapshot, ensuring a consistent view even as other transactions are modifying data.
- Non-Blocking Reads and Writes:
  - Readers never block writers and writers never block readers.
  - While a write is in progress, readers continue to see the old version of data; once the transaction commits, new transactions see the new version.
  - Only lightweight locks or internal synchronization may be used for operations like index updates (not for the row data itself).
- Version Clean-up (Garbage Collection):
  - As old versions of data become unneeded (when no active transaction could see them), a background process removes them to reclaim space (e.g., VACUUM in PostgreSQL).
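To illustrate version chains and snapshot visibility, here is a deliberately tiny toy sketch in Python; transaction IDs are just increasing integers, and locking, conflict detection, and garbage collection are all omitted.

```python
import itertools

class ToyMVCC:
    """Each key maps to a list of (commit_xid, value) versions; readers use a snapshot xid."""
    def __init__(self):
        self.versions = {}                 # key -> [(commit_xid, value), ...]
        self._xids = itertools.count(1)

    def begin(self) -> int:
        return next(self._xids)            # snapshot point for the new transaction

    def write(self, key, value) -> None:
        # A committed write appends a new version instead of overwriting in place.
        xid = next(self._xids)
        self.versions.setdefault(key, []).append((xid, value))

    def read(self, key, snapshot_xid):
        # Return the newest version committed at or before the reader's snapshot.
        visible = [(x, v) for x, v in self.versions.get(key, []) if x <= snapshot_xid]
        return visible[-1][1] if visible else None

db = ToyMVCC()
db.write("balance", 100)                 # committed before the reader starts
reader = db.begin()                      # reader's snapshot taken here
db.write("balance", 140)                 # a later committed write is invisible to the reader
print(db.read("balance", reader))        # 100 -> consistent snapshot
print(db.read("balance", db.begin()))    # 140 -> a new transaction sees the new version
```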
Benefits of MVCC
- High concurrency: Readers don't wait for writers and vice versa.
- Reduced lock contention: Fewer deadlocks and waiting compared to strict locking.
- Consistent, repeatable reads: Each transaction sees a stable snapshot, making logic simpler for developers.
- Improved user experience: Fast, non-blocking reads in heavy workloads.
MVCC Types and Details
- Timestamp-based MVCC: Each transaction/version is timestamped; what you read depends on the timestamp.
- Snapshot-based MVCC: Each transaction sees a snapshot of the database as it was at its start time.
- History-based MVCC: Every modification is recorded, simplifying rollbacks.
- Hybrid MVCC: Combines approaches for flexibility and performance.
Write Conflicts
- When two concurrent transactions update the same row, one might be aborted or asked to retry to avoid lost updates. MVCC handles these via version/timestamp checks at commit.
Trade-offs & Limitations
- Storage Overhead: Extra disk/memory required to store multiple versions of rows.
- Cleanup Required: Old/dead row versions accumulate and must be periodically cleaned.
- Not Always Strictly Serializable: MVCC commonly provides snapshot isolation, which is strong but not fully serializable (write skew is possible).
Real-World Example
Suppose two bank employees are working on Bob's account:
- Alice reads Bob's balance ($100) and starts a transaction to withdraw $50.
- Meanwhile, Bob deposits $40; his transaction creates a new version ($140) and commits.
- Alice's snapshot still shows $100, so her withdrawal would write a version based on stale data ($50) and silently lose Bob's deposit.
- With MVCC's commit-time version/timestamp check (first committer wins), Alice's transaction is detected as conflicting with Bob's committed update of the same row and is aborted or retried; on retry she sees $140, and the balance correctly ends at $90.
MVCC vs. Traditional Locking
| Aspect | MVCC | Pessimistic Locking |
|---|---|---|
| Reads | Non-blocking, consistent snapshot | May block on write locks |
| Writes | Create new versions; may abort on conflict | Must acquire locks; blocks others |
| Concurrency | High | Lower (due to lock contention) |
| Deadlocks | Rare (except for metadata/index ops) | More frequent with many writers/readers |
MVCC in Practice
- PostgreSQL: Requires vacuuming to reclaim dead tuples.
- MySQL/InnoDB: Uses undo logs to store old versions.
- NoSQL (Cassandra, etc.): Uses similar multi-version logic for distributed consistency and high-throughput.
Additional Notes
- Isolation vs. Performance Tradeoff: Higher isolation = fewer anomalies, but lower concurrency and higher resource usage.
- MVCC (Multi-Version Concurrency Control): Many DBs (Postgres, MySQL) use MVCC so read locks are minimized, allowing high concurrency at higher isolation levels.
- Practical Use: Use the lowest isolation level that fits your data consistency needs, to maximize performance.
Recoverability and Serializability of Schedules in DBMS
Database schedule management balances concurrent execution (performance) with data consistency and recoverability.
1. Recoverability of Schedules
Recoverability ensures that if a transaction fails, no other transaction commits based on the failed (and rolled back) transaction’s output; thus, the database can be safely recovered to a consistent state.
Types of Schedules Based on Recoverability
a. Recoverable Schedule
- A schedule is recoverable if every transaction that reads data written by another transaction commits only after the writer transaction commits.
- Example:
  - T1: W(x) ... C1
  - T2: R(x) ... C2
  - Here, T2 reads x written by T1 and C2 happens after C1. If T1 fails and is rolled back, T2 can be rolled back safely.
- Why Needed: To prevent scenarios where a committed transaction depends on uncommitted (possibly rolled-back) data.
b. Cascading Rollbacks
- If a transaction is rolled back, all dependent transactions (which read its uncommitted changes) must also be rolled back, potentially causing a cascade of aborts.
- Example:
  - T1 writes x.
  - T2 reads x and writes y.
  - T3 reads y.
  - If T1 aborts, T2 and T3 must also abort.
- Problem: Leads to complex recovery and data loss.
c. Cascadeless/Avoid Cascading Abort Schedule
- A schedule avoids cascading aborts if transactions only read values written by committed transactions.
- This ensures that rollbacks do not propagate, making recovery easier.
- How to Avoid: Use strict two-phase locking or delay reads until commit.
d. Strict Schedule
- A stricter form where a transaction can neither read nor write an item until the previous transaction that last wrote to it has committed or aborted.
2. Serializability
Serializability is the gold standard for determining if a concurrent schedule maintains database consistency by yielding the same effect as some serial (one-at-a-time) execution.
a. Serial and Non-serial Schedules
- Serial Schedule: Transactions are executed one after the other; always safe but inefficient.
- Non-serial Schedule: Transactions' operations are interleaved for efficiency; needs to be serializable for correctness.
b. Conflict Serializability
- Definition: A schedule is conflict-serializable if, by swapping non-conflicting operations, it can be transformed into a serial schedule.
- Conflicting Operations (two operations conflict if):
  - They belong to different transactions.
  - They operate on the same data item.
  - At least one of them is a write.
  - Examples: R–W, W–R, W–W.
- Non-Conflicting Pairs: Operations of the same transaction, operations on different data items, or two reads of the same item.
- Testing Conflict Serializability:
  - For each conflict, create a precedence graph (nodes: transactions; edges: Ti → Tj if Ti’s operation precedes and conflicts with Tj’s).
  - If the graph has no cycles, the schedule is conflict-serializable.
  - Benefit: Fast to check and avoids most anomalies, but some correct schedules that are not conflict-serializable are missed.
- Example:
  - T1: W(A); T2: R(A), W(A)
  - W1(A) and R2(A): conflict (write–read, different transactions, same item).
  - W1(A) and W2(A): conflict (write–write).
  - If all conflicts can be reordered to match a serial execution, the schedule is conflict-serializable.
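A small Python sketch of this test: derive Ti → Tj edges from conflicting operation pairs and check the precedence graph for a cycle (the schedule encoding and helper names are invented for illustration).

```python
from itertools import combinations

# A schedule as a list of (transaction, operation, data_item) in execution order.
schedule = [("T1", "R", "A"), ("T2", "W", "A"), ("T1", "W", "A")]

def precedence_edges(sched):
    edges = set()
    for (t1, op1, x1), (t2, op2, x2) in combinations(sched, 2):
        conflicting = t1 != t2 and x1 == x2 and "W" in (op1, op2)
        if conflicting:
            edges.add((t1, t2))            # earlier conflicting op -> later one
    return edges

def has_cycle(edges):
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
    def reachable(start, target, seen=()):
        return any(n == target or (n not in seen and reachable(n, target, seen + (n,)))
                   for n in graph.get(start, ()))
    return any(reachable(node, node) for node in graph)

edges = precedence_edges(schedule)
print(edges)                                             # edges T1->T2 and T2->T1: a cycle
print("conflict-serializable:", not has_cycle(edges))    # False
```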
c. View Serializability
- Definition: A schedule is view-serializable if it is view-equivalent to some serial schedule, i.e., every transaction "sees" database changes as in some serial execution.
- Conditions for View Equivalence:
- Initial Reads: For each data item, both schedules must have the same transaction performing the first read.
- Final Writes: For each data item, both must have the same transaction perform the last write.
- Update Reads: If Tj reads data written by Ti, both schedules must preserve this read-from relationship.
- Advantage: Captures all correct schedules; more powerful than conflict-serializability but harder to check algorithmically.
Example:
Suppose in both the actual and serial schedule, T2 reads X as written by T1, and final writes match. The schedule is view-serializable even if not conflict-serializable.
| Concept | What It Ensures / Description | Advantages | Disadvantages / Issues |
|---|---|---|---|
| Recoverable Schedule | Commits only after all dependent commits | Prevents unrecoverable states | May allow cascading rollbacks |
| Cascading Rollback | Rollbacks propagate through dependencies | — | Data loss, complex recovery |
| Cascadeless Schedule | Only reads committed data | No cascading aborts | May require stricter protocols |
| Conflict Serializability | Can reorder conflicts to serial schedule | Easy to check | Doesn’t capture all 'good' schedules |
| View Serializability | Same reads/writes as serial schedule | More comprehensive | Harder to verify |
| Non-Conflicting Operations | Same transaction, or different data items, or both reads | Allow flexible reordering | — |
3. Practical Guidance
- Always prefer cascadeless and strict schedules in production systems for easier recovery and integrity.
- Aim for conflict-serializable schedules via two-phase locking or logical timestamp protocols for high concurrency but correct outcomes.
- For advanced correctness and optimization (especially in distributed systems), check for view-serializability.
Lock-Based Concurrency Control
Lock-based protocols manage access to data in concurrent transactions to ensure consistency and isolation.
1. Lock Types
Shared Lock (S-Lock)
- Purpose: Allows read-only access to a data item.
- Behavior: Multiple transactions can hold shared locks on the same item simultaneously, but cannot write.
- Example: If T1 and T2 both hold shared locks on row A, both can read but neither can write until all shared locks are released.
Exclusive Lock (X-Lock)
- Purpose: Allows read & write operations.
- Behavior: Only one transaction can hold an exclusive lock on a data item; no other shared or exclusive locks are allowed on the item during this time.
- Example: If T1 holds an exclusive lock on row B, no other transaction can read or write B until T1 releases the lock.
| Lock Type | Allows Read? | Allows Write? | Multiple Holders? |
|---|---|---|---|
| Shared | Yes | No | Yes |
| Exclusive | Yes | Yes | No |
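The compatibility rules in this table can be sketched as a tiny lock table in Python (simplified: one lock entry per item, no queueing, blocking, or deadlock handling).

```python
class LockTable:
    """Tracks which transactions hold S or X locks on each data item."""
    def __init__(self):
        self.locks = {}   # item -> {"mode": "S" or "X", "holders": set of txn ids}

    def acquire(self, txn, item, mode) -> bool:
        entry = self.locks.get(item)
        if entry is None:
            self.locks[item] = {"mode": mode, "holders": {txn}}
            return True
        if entry["mode"] == "S" and mode == "S":
            entry["holders"].add(txn)                # shared locks are compatible
            return True
        if entry["holders"] == {txn}:
            if mode == "X":
                entry["mode"] = "X"                  # S -> X upgrade by the sole holder
            return True
        return False                                 # any other combination must wait

    def release(self, txn, item) -> None:
        entry = self.locks.get(item)
        if entry and txn in entry["holders"]:
            entry["holders"].discard(txn)
            if not entry["holders"]:
                del self.locks[item]

lt = LockTable()
print(lt.acquire("T1", "A", "S"))   # True  (first shared lock)
print(lt.acquire("T2", "A", "S"))   # True  (S is compatible with S)
print(lt.acquire("T2", "A", "X"))   # False (T1 still holds S, so the upgrade must wait)
print(lt.acquire("T3", "B", "X"))   # True  (exclusive lock on a free item)
```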
2. Two-Phase Locking Protocol (2PL)
Protocol Overview
- Goal: Guarantees serializability (correctness) of schedules.
- Phases:
- Growing Phase: Transaction can acquire new locks but cannot release any.
- Shrinking Phase: Transaction releases locks but cannot acquire new ones.
- Lock Point: The moment where the last lock is acquired—afterward, only unlocks are allowed. All lock operations must precede all unlocks.
3. Strict Two-Phase Locking
- Stricter variant: Transaction holds all exclusive locks until it commits or aborts.
- Advantage: Ensures no cascading rollbacks—if a transaction fails, no other committed transaction has seen its uncommitted data, making recovery simpler and safer.
- Characteristic: Shared locks may be released earlier, but exclusive locks are held until commit.
4. Lock Upgradation (Lock Conversion)
- Definition: The process of converting a shared lock to an exclusive lock (S → X) when a transaction that was reading decides it now needs to write.
- Use Case: A transaction reads a value, then wants to update it without releasing and reacquiring the lock (which could allow other transactions to slip in).
- Constraint: Upgradation is only possible if no other transaction holds a shared lock on the same item.
5. Deadlock & Starvation
Deadlock
- Occurs when: Two or more transactions are waiting indefinitely for resources locked by each other, forming a cycle.
- Classic Example: T1 locks A, requests B; T2 locks B, requests A — both wait forever.
- Necessary conditions: Mutual exclusion, hold and wait, no preemption, circular wait.
- Management: Deadlock detection & victim selection for rollback are required.
Starvation
- Occurs when: A transaction waits indefinitely because other transactions repeatedly preempt its required resources.
- Causes: Higher-priority or frequent transactions keep grabbing locks before the waiting one can proceed.
- Solution: Fair scheduling, resource allocation strategies, or queue-based locking to ensure fairness.
6. Timestamp-Based Protocols
- Principle: Assigns a unique timestamp to each transaction upon arrival.
- Ordering: Transactions are ordered and executed logically in timestamp order, using timestamps for validation (older → newer).
- If a younger transaction tries to overwrite a value read/written by an older, it may be delayed or aborted.
- Advantages: Avoids deadlocks (no waiting for locks), suitable for high-concurrency environments.
- Drawbacks: May cause more transaction restarts/rollbacks if conflicts occur.
7. Summary Table
| Concept | Purpose/Behavior | Pros | Cons |
|---|---|---|---|
| Shared Lock | Read-only by many; no writes allowed | High concurrency | May block writers |
| Exclusive Lock | Single holder; allows read & write | Data integrity | More locking waits |
| 2PL | Split into 'growing' and 'shrinking' phases | Ensures serializability | Deadlock possible |
| Strict 2PL | Hold X-locks until commit/abort | No cascading aborts | Lower concurrency |
| Lock Upgradation | Convert S-lock to X-lock without unlock/lock | Efficient for read→write | Potential deadlock |
| Deadlock | Cyclic wait for locks, nobody can proceed | – | Must detect/handle |
| Starvation | Some transactions never proceed | – | Unfairness |
| Timestamp Protocol | Timestamps decide execution order (no locks) | No deadlock | More aborts possible |
Files and Indexing Concepts in DBMS
Efficient file organization and indexing are critical for fast data access in databases.
1. File Organization
a. Heap (Unordered) File
- Records are placed wherever space is available.
- Fast inserts, slow for searches unless scanned entirely.
b. Sequential (Ordered) File
- Records are sorted by key.
- Fast for range queries and sorted (sequential) access; slower inserts and updates.
c. Hash File
- Records are placed based on a hash of the search key.
- Very fast for equality lookups but not suited for range queries.
2. Indexing
Indexes improve query speed by allowing direct access to records.
a. Dense Index
- Definition: Index entry exists for every search key value in the data file.
- Feature: Every record is pointed to; fast searches; larger index size.
- Example: If a table has 1,000 records, the dense index will have 1,000 entries.
b. Sparse Index
- Definition: Index entries exist only for some search key values (typically, one per block/page of the data file).
- Feature: Smaller index; entries point to data blocks; if the exact key is not in the index, locate the nearest lower key and scan within that block.
- Example: If each page holds 50 records and there are 1,000 records, the sparse index will have 20 entries.
Dense vs. Sparse Index Comparison
| Aspect | Dense Index | Sparse Index |
|---|---|---|
| Entries | Every search key | Only some keys (per page) |
| Space | Larger | Smaller |
| Search Time | Fastest | Slightly slower |
| Maintenance | More updates needed | Fewer updates needed |
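A tiny Python sketch of the sparse-index idea: one index entry per fixed-size block (the block's lowest key), located by binary search and followed by a scan inside the block; block size and data are illustrative.

```python
from bisect import bisect_right

BLOCK_SIZE = 4
# Sorted records laid out in fixed-size blocks of (key, payload).
records = [(k, f"row-{k}") for k in range(0, 40, 2)]          # keys 0, 2, ..., 38
blocks = [records[i:i + BLOCK_SIZE] for i in range(0, len(records), BLOCK_SIZE)]

# Sparse index: one entry per block -> the first (lowest) key stored in that block.
sparse_index = [block[0][0] for block in blocks]

def lookup(key):
    # Find the rightmost block whose first key <= search key, then scan that block.
    block_no = bisect_right(sparse_index, key) - 1
    if block_no < 0:
        return None
    for k, payload in blocks[block_no]:
        if k == key:
            return payload
    return None

print(sparse_index)     # [0, 8, 16, 24, 32]
print(lookup(18))       # 'row-18'
print(lookup(19))       # None (key not present)
```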
3. Multi-way Trees for Indexing (m-way Search Tree)
- Each node can have up to m children and (m–1) keys.
- Generalization of binary search trees; used in DBMS for efficient indexing.
4. B-Tree
- A balanced m-way search tree (usually, m is large: 50–200).
- Properties:
- All leaves at the same level (balanced).
- Each internal node (except root) has between ⌈m/2⌉ and m children.
- Fast search, insertion, and deletion (O(log n)).
- Each node can have multiple keys and pointers.
- Keys are kept sorted inside nodes.
- Used for indexing on disk-based large data (minimizes disk reads by maximizing data in each node).
5. B+ Tree
- A variant of B-tree used in most practical DBMS implementations.
- Differences from B-Tree:
- Only leaf nodes contain actual data record pointers; internal nodes store only keys for navigation.
- Leaf nodes are linked in a linked-list fashion for fast range queries.
- More keys per node—shallow tree—fewer disk accesses.
B-Tree vs B+ Tree Comparison
| Feature | B-Tree | B+ Tree |
|---|---|---|
| Record Storage | In all nodes | Only in leaf nodes |
| Internal Nodes | Keys + pointers + data | Only keys + pointers (no actual data) |
| Range Query | Slower (data spread in tree) | Faster (all data in sequential leaf nodes) |
| Leaf Node Link | No | Yes (doubly or singly linked) |
LSM Indexing and SSTables
- LSM (Log-Structured Merge) tree is a disk-friendly indexing structure optimized for very high write throughput, used in many modern NoSQL and distributed key-value databases (e.g., Cassandra, ScyllaDB, LevelDB, RocksDB).
- Core Idea:
  - Writes are first buffered and stored sequentially in memory, then periodically flushed to disk in batches.
  - Unlike B-tree indexes (which use in-place modifications), LSM trees use append-only and merge-based strategies to minimize random writes and maximize disk throughput.
LSM Tree Architecture Components
- Memtable (in-memory):
  - An in-memory, mutable, usually sorted data structure (like a Skip List or Red-Black Tree).
  - Fast for inserts/updates/deletes.
  - All new writes go here first.
- Write-Ahead Log (WAL):
  - A sequential log to ensure durability before the Memtable is flushed to disk.
- SSTables (Sorted String Tables, on disk):
  - When the Memtable is full, its contents are flushed to disk as a new immutable, sorted file (SSTable).
  - Multiple SSTables exist on disk, often organized into levels.
- Compaction:
  - Periodically, SSTables are merged/compacted to remove obsolete or duplicate entries and optimize disk usage.
  - Compaction strategies (like leveled or tiered) affect performance trade-offs.
LSM Operations
- Write:
- Add new data to Memtable.
- Log to WAL for durability.
- When Memtable full, flush as new SSTable (sorted, immutable).
- Read:
- Check Memtable.
- If not found, check SSTables (newest to oldest).
- Use Bloom filters and indexes to speed up reads.
- Delete:
- Insert a "tombstone" indicating data was deleted; real deletion occurs during compaction.
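The write/read path above can be illustrated with a toy Python sketch (memtable, flush to immutable sorted runs, newest-first reads, tombstone deletes); the flush threshold is invented, and WAL, Bloom filters, and compaction are omitted for brevity.

```python
TOMBSTONE = object()          # marker written on delete

class ToyLSM:
    def __init__(self, memtable_limit=2):
        self.memtable = {}                    # in-memory, mutable write buffer
        self.sstables = []                    # immutable sorted runs, oldest first
        self.memtable_limit = memtable_limit

    def _flush(self):
        # Freeze the memtable as a sorted, immutable run on "disk".
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def delete(self, key):
        self.put(key, TOMBSTONE)              # deletes are just tombstone writes

    def get(self, key):
        if key in self.memtable:
            value = self.memtable[key]
            return None if value is TOMBSTONE else value
        for sstable in reversed(self.sstables):        # newest SSTable first
            for k, value in sstable:
                if k == key:
                    return None if value is TOMBSTONE else value
        return None

db = ToyLSM()
db.put("a", 1)
db.put("b", 2)        # memtable full -> flushed as SSTable #1
db.put("a", 10)       # newer version of "a" shadows the old one
db.delete("b")        # flushed together as SSTable #2
print(db.get("a"), db.get("b"))   # 10 None
```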
SSTable (Sorted String Table)
- SSTable: A disk file storing immutable, sorted (by key) key-value pairs—used as the basic building block of an LSM tree's on-disk structure.
- Immutable: Once written, SSTables are never updated in place.
- Sorted: Allows for fast binary search, range queries, and efficient merge (compaction) operations.
- Structure:
  - Index: Contains pointers to the location of keys in data blocks, allowing for binary search and fast lookups.
  - Data Blocks:
    - The actual key-value entries, sorted by key.
    - May store metadata, timestamps, or tombstones (for marking deletes).
3. Compaction in LSM Trees
- Compaction is the process of merging multiple SSTables, dropping obsolete keys (including old values and tombstones), and reducing disk space and read amplification.
- Benefits: Keeps read performance stable, minimizes space used, and ensures deletes/updates propagate.
4. LSM vs. Traditional B-trees
| Aspect | LSM Tree | B-tree |
|---|---|---|
| Writes | Fast, sequential, batched | Slow, random in-place |
| Reads | Potentially slower, multiple files | Fast, single structure |
| Best For | Write-heavy, log, analytics | Read-heavy, OLTP |
| Structure | Multiple levels of SSTables | Single balanced tree |
5. Summary Table: LSM Components
| Component | Where? | Role |
|---|---|---|
| Memtable | In-memory | Buffer, fast writes/updates |
| WAL | Disk | Durability of in-memory data |
| SSTable | Disk | Immutable, on-disk, sorted data files |
| Bloom Filter | In-memory | Fast check for key non-existence in SSTables |
| Compaction | Disk | Merge/optimize SSTables, drop obsolete |
In Practice
- Cassandra, LevelDB, RocksDB, and HBase all use LSM indexing and SSTable files.
- Advantages: Handles massive write loads and supports high-throughput ingestion.
- Trade-off: Reads may check several SSTables (mitigated with Bloom filters and frequent compaction).
Geo Indexing
Goal: Efficiently answer "Find all points near X" in 2D space (latitude/longitude).
Why it’s hard:
- 2D data doesn’t fit well in B-trees
- Naïve scan = O(n) → too slow for millions of points
- Need logarithmic query time
Two dominant solutions:
1. Quad Tree → Tree-based, dynamic
2. Geohash → String-based, fixed-grid
Quad Tree – The Tree That Splits the World
Concept
- Recursively divides 2D space into 4 quadrants
- Each node represents a bounding box
- Stops when ≤ N points per node (usually 4)
[Earth]
/ | \ \
NW NE SW SE
/| /| |\ |\
Types

| Type | Use Case |
|---|---|
| Point Quad Tree | Store individual points (cities, users) |
| Region Quad Tree | Raster images, terrain data |
| PR Quad Tree | Fixed subdivision rules |
How Insertion Works?
if node has < 4 points:
add point
else:
subdivide into 4 children
re-insert all 5 points into correct child
Pros
- Perfect for nearest neighbor and range queries
- Adapts to data density (dense cities → deep tree)
- Natural clustering (points in same node = close)
- Works with circles, polygons
Cons
- High memory usage (pointers everywhere)
- Hard to persist in databases
- Poor cache performance
- Not distributed-friendly
- Imbalanced trees possible
Quad Tree Query
def find_in_radius(self, center, radius):
if not self.boundary.intersects_circle(center, radius):
return []
results = [p for p in self.points if p.distance(center) <= radius]
if self.divided:
for child in [self.nw, self.ne, self.sw, self.se]:
results += child.find_in_radius(center, radius)
return results
Geohash – Turn Location into a String
Concept
Converts (lat, lng) → short string using base-32 encoding
40.7484, -73.9857 → dr5ru6jz1
(Empire State Building)
How It Works (Bit Interleaving)
Latitude: 01001011... (32 bits)
Longitude: 10110100... (32 bits)
Interleave: 01100101... → base32 → dr5ru6jz1
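A from-scratch sketch of this interleaving in Python, kept close to the standard algorithm (longitude bit first, 5 bits per base-32 character); real applications would normally rely on a geohash library instead.

```python
# Geohash base-32 alphabet (digits plus letters, excluding a, i, l, o).
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"

def geohash_encode(lat, lng, length=9):
    lat_range, lng_range = [-90.0, 90.0], [-180.0, 180.0]
    bits, hashcode = [], []
    even = True  # bits are interleaved starting with longitude
    while len(hashcode) < length:
        rng = lng_range if even else lat_range
        val = lng if even else lat
        mid = (rng[0] + rng[1]) / 2
        if val >= mid:          # 1 bit -> keep the upper half of the range
            bits.append(1)
            rng[0] = mid
        else:                   # 0 bit -> keep the lower half of the range
            bits.append(0)
            rng[1] = mid
        even = not even
        if len(bits) == 5:      # every 5 bits become one base-32 character
            hashcode.append(BASE32[int("".join(map(str, bits)), 2)])
            bits = []
    return "".join(hashcode)

print(geohash_encode(40.7484, -73.9857))  # starts with 'dr5ru', consistent with the example above
```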
Precision Table
| Length | Area Size | Accuracy | Example |
|---|---|---|---|
| 5 | 4.9km × 4.9km | ~2.4km | dr5ru |
| 7 | 610m × 610m | ~300m | dr5ru6j |
| 9 | 76m × 76m | ~38m | dr5ru6jz1 |
| 11 | 9.5m × 9.5m | ~5m | dr5ru6jz1j3 |
The 9-Cell Trick
One geohash = one box
But user is in center → need 8 neighbors
[7][0][1]
[6][C][2] ← C = center geohash
[5][4][3]
Query 9 geohash cells → perfect 500m radius coverage
Why Redis Loves Geohash
| Redis Feature | Geohash Provides |
|---|---|
| Fast string keys | 9-char strings |
| Sorted Sets (ZSET) | Store user IDs with timestamp |
| Prefix search | Same prefix = nearby |
| Memory efficient | 9 bytes per location |
ZADD geo:dr5ru6jz1 1700000000 user123
ZADD geo:dr5ru6jz3 1700000001 user456
Geohash + Redis (Python)
import redis, geohash, time
r = redis.Redis()
def add_user(user_id, lat, lng):
gh = geohash.encode(lat, lng, 9)
r.zadd(f"geo:{gh}", {user_id: int(time.time())})
def nearby(lat, lng):
center = geohash.encode(lat, lng, 9)
cells = [center] + geohash.neighbors(center)
users = []
for cell in cells:
users.extend(r.zrange(f"geo:{cell}", 0, -1))
return users
Comparison
| Feature | Quad Tree | Geohash |
|---|---|---|
| Data Structure | Tree (dynamic) | String (fixed) |
| Query Speed | O(log n) | O(1) with Redis |
| Memory Usage | High (pointers) | Tiny (9 bytes) |
| Database Indexing | Hard | Easy (B-tree on string) |
| Distributed Systems | Hard to shard | Perfect sharding |
| Nearest Neighbor | Excellent | Good (with 9-cell) |
| Circle Queries | Perfect | Approximate (square grid) |
| Implementation | Complex | 10 lines of code |
| Production Scalability | Medium | 100M+ users (Uber, Tinder) |
Real Production Systems
Geohash Winners
| Company | Implementation |
|---|---|
| Uber | Redis GEO + GEORADIUS for driver matching |
| Instagram | Nearby posts → geohash in Cassandra |
| Tinder | Match discovery → Redis GEO |
| Snapchat | Geo-filters → Redis geospatial |
| Foursquare | Venue search → Redis ZSETs per geohash |
Quad Tree Winners
| Company | Use Case |
|---|---|
| Google Maps | Frontend marker clustering |
| Pokémon GO | Creature spawning density |
| Mapbox | Vector tile clustering |
When to Use Which?
| Use Case | Winner | Reason |
|---|---|---|
| Find drivers within 1km | Geohash + Redis | 1ms response |
| Show 1M map markers without lag | Quad Tree | Zoom clustering |
| Store user locations in DB | Geohash | Indexable string |
| 2D game collision detection | Quad Tree | Real-time |
| Shard data across 100 servers | Geohash | Natural partitions |
| Precise circular range queries | Quad Tree | True circles |
| Mobile app with 10M users | Geohash + Redis | Scalability |
Best Practice: Use Both!
Modern geo apps do exactly this:
Frontend (Map) → Quad Tree → Beautiful clustering
Backend (API) → Geohash + Redis → Lightning-fast queries
Database (Persist) → Geohash column → Easy indexing
Example: Uber
- Drivers stored with Geohash in Redis
- Map shows Quad Tree clustered markers
- Database has geohash VARCHAR(12) INDEX
The winners use both — Geohash for backend, Quad Tree for frontend.
This is exactly how Uber, Instagram, Tinder, and Google Maps work in 2025.
Common Database Systems
MySQL (Relational, SQL-based)
- Type: RDBMS (Relational Database Management System)
- ACID compliant: Guarantees reliable transactions (Atomicity, Consistency, Isolation, Durability).
- Multiple Storage Engines: Choose per-table (e.g., InnoDB, MyISAM), each with different features.
- Scalability: Vertical scaling natively; horizontal scaling via replication & sharding (MySQL Cluster).
- Performance: Fast for read-heavy workloads with indexes.
- Replication: Asynchronous or semi-synchronous, enabling high availability.
- JSON: Supports limited JSON fields (since 5.7), semi-structured data.
- Protocol: Native TCP (port 3306).
- Works better in case of more conflicting writes.
Key Storage Engines:
| Engine | Features | Use Cases |
|---|---|---|
| InnoDB | Default engine. Supports transactions, row-level locking, foreign keys, crash recovery. | Transactional apps (e.g., e-commerce, banking) |
| MyISAM | Fast reads, table-level locking, no transaction support, smaller disk use. | Read-heavy, analytics, logging |
| Memory | Data in RAM (non-persistent), ultra-fast. | Temp tables, caching |
| Archive | Compressed storage for archival, no indexing | Historical data, infrequent queries |
| CSV | Stores data as .csv files, no indexes | Data exchange, interoperability with other tools |
- Mix-and-match: Tables in the same DB can use different engines.
- Engine affects: Transaction support, concurrency, performance, features.
Index Types:
- B-Tree Index: Default, for quick lookups, range queries, sorts.
- Full-Text Index: For keyword searches in text data.
- Hash Index: Fast equality checks, not for ranges (mainly Memory tables).
- Spatial Index: For geographic/spatial data queries.
- Unique/Composite Index: Enforces uniqueness or indexes multiple columns.
PostgreSQL (Relational, Object-Relational, SQL-based)
- Type: RDBMS, object-relational, ACID compliant.
- Advanced Data Types: Arrays, JSON/JSONB, hstore, XML, custom types.
- Extensible: User-defined functions, custom operators, extensions (e.g., PostGIS for geography).
- Concurrency: MVCC (Multi-Version Concurrency Control) avoids reader locks, supports high concurrency.
- SQL: Full ANSI SQL + advanced features (window functions, CTEs, etc.).
- Replication: Streaming (sync/async), logical replication.
- Full-text Search: Built-in, indexed.
- Scalability: Vertical and—via tools/extensions—horizontal (e.g., Citus).
- Protocol: Custom (port 5432).
- Works better in case of more non-conflicting writes.
Architecture:
- Single Storage Engine (Heap + Extensions): Data is stored in "heap" tables; extensible via pluggable table access methods (since v12+).
- MVCC: Supports concurrent reads/writes by storing row versions.
- WAL (Write-Ahead Log): Ensures crash recovery, replication.
Index Types:
- B-Tree: Default, widely used.
- GiST, GIN, BRIN: For full-text search, arrays, JSON, spatial data.
- Hash, Unique, Composite: For specific use cases, complex queries.
Example: - Ideal for complex transactional, analytical, geospatial, and semi-structured data workloads.
Steps for Updating a Row in a PostgreSQL Table
Sequential Steps for Updating a Row
- Query Parsing and Planning:
- The
UPDATEquery (e.g.,UPDATE users SET name = 'Alice' WHERE id = 123;) is received via a client (e.g., psql, JDBC). - PostgreSQL’s parser validates the query syntax and creates an abstract syntax tree (AST).
-
The query planner/optimizer generates an execution plan, choosing the most efficient path (e.g., index scan or sequential scan) based on table statistics and indexes.
-
Transaction Initiation:
- A transaction is started (explicitly with
BEGINor implicitly for a single statement). -
PostgreSQL assigns a transaction ID (XID) to track changes and ensure MVCC.
-
Row Lookup:
- The executor uses the query plan to locate the target row(s) based on the
WHEREcondition. - If an index exists (e.g., B-tree on
id), it’s used for faster lookup; otherwise, a sequential scan is performed. -
The row’s tuple identifier (TID) is retrieved from the heap table.
-
MVCC Check:
-
PostgreSQL checks the row’s visibility using MVCC:
- Verifies if the row is visible to the current transaction (based on Xmin/Xmax fields in the tuple header).
- Ensures no conflicting transactions are modifying the same row (e.g., using locks if needed).
-
Row Update (New Tuple Creation):
- Instead of overwriting the existing row, PostgreSQL creates a new tuple with updated values (e.g.,
name = 'Alice'). - The new tuple is marked with the current transaction’s XID (Xmin) and the old tuple’s XID is updated (Xmax) to mark it as “dead” for MVCC.
-
The new tuple is written to the heap table, typically in the same or a new page (8KB default).
-
Index Updates:
- If the table has indexes, PostgreSQL updates them to reflect the new tuple:
- Removes the old tuple’s index entry (or marks it invalid).
- Adds a new index entry pointing to the new tuple’s TID.
-
This applies to all index types (e.g., B-tree, GIN, GiST).
-
Write-Ahead Logging (WAL):
- The update operation (new tuple and index changes) is logged to the WAL before committing to disk.
-
WAL ensures durability by recording changes in a sequential log, allowing recovery in case of crashes.
-
Locking (if Needed):
- If concurrent transactions access the same row, PostgreSQL may acquire row-level locks (e.g.,
FOR UPDATE) to prevent conflicts. -
Table-level locks are rare unless explicitly requested (e.g.,
LOCK TABLE). -
Commit or Rollback:
- On
COMMIT, the transaction is finalized, and the new tuple becomes visible to subsequent transactions (based on MVCC rules). - WAL records are marked as committed, ensuring durability.
-
On
ROLLBACK, changes are discarded, and the new tuple is ignored. -
Storage and Maintenance:
- The old tuple remains in the heap as a “dead tuple” for MVCC (visible to older transactions).
- PostgreSQL’s autovacuum process later runs
VACUUMto mark dead tuples for reuse and reclaim space. ANALYZEmay update table statistics for future query optimization.
Example
For `UPDATE users SET name = 'Alice' WHERE id = 123;`:
1. Parse query and plan (use index on id).
2. Start transaction, assign XID.
3. Find row with id = 123 via index.
4. Check MVCC visibility.
5. Create new tuple with name = 'Alice', mark old tuple as dead.
6. Update index to point to new tuple.
7. Log changes to WAL.
8. Apply any necessary locks.
9. Commit transaction, making new tuple visible.
10. Autovacuum later cleans up dead tuple.
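As a complement to these steps, here is a minimal sketch of issuing the same update from application code, assuming a reachable PostgreSQL instance with a `users` table and the `psycopg2` driver installed; the connection string is a placeholder.

```python
# Minimal sketch (assumes psycopg2 is installed and a `users` table exists).
import psycopg2

conn = psycopg2.connect("dbname=app user=app password=secret host=localhost")
try:
    with conn:                      # opens a transaction; commits on success, rolls back on error
        with conn.cursor() as cur:
            # The server parses, plans, and executes this, creating a new tuple version (MVCC).
            cur.execute("UPDATE users SET name = %s WHERE id = %s", ("Alice", 123))
            print("rows updated:", cur.rowcount)
finally:
    conn.close()
```

Using the connection as a context manager keeps the commit/rollback decision explicit, mirroring the commit-or-rollback step above.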
MongoDB (NoSQL, Document-Oriented)
- Type: NoSQL, document-based database.
- Document Model: Data stored as BSON documents (flexible JSON-like structure).
- Example: `{ "name": "Alice", "age": 30, "address": { "city": "Delhi" } }`
- Schema-less: Data structure can change over time with no strict schema enforcement.
- Scalability: Built-in sharding for horizontal scale, replica sets for availability.
- Query Language: MQL (Mongo Query Language), powerful aggregation framework.
- ACID Transactions: Supported (since v4.0), but designed for eventual consistency.
- Performance: Very fast for write-heavy and dynamic-schema apps.
- Protocol: Custom TCP protocol (port 27017).
Architecture Highlights:
- Sharding: Splits data based on a shard key, distributes across multiple servers.
- Replica Sets: Primary node for writes, secondary nodes for redundancy & failover.
Index Types:
- B-Tree: Default for most queries.
- Compound: Multiple fields.
- Text: Full-text search across text fields.
- Geospatial: For location queries.
- Hashed: For uniform sharding.
- Unique: Ensures uniqueness.
- TTL (Time-To-Live): Auto-delete documents after specified interval.
Example Use Cases: Real-time analytics, IoT, content management, apps with fast-evolving schemas.
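A short sketch of declaring several of these index types with PyMongo, assuming a local `mongod` on the default port; the database, collection, and field names are illustrative, not from the original notes.

```python
# Illustrative sketch using PyMongo (collection and field names are made up).
from pymongo import ASCENDING, GEOSPHERE, HASHED, TEXT, MongoClient

client = MongoClient("mongodb://localhost:27017")
events = client["demo"]["events"]

events.create_index([("userId", ASCENDING), ("createdAt", ASCENDING)])  # compound B-tree
events.create_index([("description", TEXT)])                            # full-text search
events.create_index([("location", GEOSPHERE)])                          # geospatial (2dsphere)
events.create_index([("userId", HASHED)])                               # hashed (useful for sharding)
events.create_index("createdAt", expireAfterSeconds=3600)               # TTL: auto-delete after 1 hour
events.create_index("externalId", unique=True)                          # unique constraint
```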
Additional Important Points
- Consistency, scalability, and performance needs determine database and storage engine choice.
- SQL vs. NoSQL: Choose SQL for strong consistency, transactions, structured data; NoSQL for flexible, large-scale, and dynamic use cases.
- All major systems offer powerful indexing, but index choice and design directly affect performance.
- Backups, failover, replication, and associated tooling are essential for production deployments in any stack.
RDBMS vs. NoSQL:
| System | Data Model | Best For | Scaling | Transactions | Schema |
|---|---|---|---|---|---|
| MySQL | Table/Rows | Traditional, structured data | Vert/Horiz | Strong/ACID | Rigid |
| PostgreSQL | Table/Rows | Complex & extensible workloads | Vert/Horiz (ext) | Strong/ACID | Rigid |
| MongoDB | Document | Unstructured, dynamic schemas | Native Horiz | Eventual | Flexible |
Distributed Systems
Database Sharding
- Database sharding splits large datasets into smaller, manageable pieces called shards, stored on separate database instances.
- Each shard holds a subset of the total data, determined by a shard key (an attribute used for partitioning).
- Sharding improves scalability, performance, and availability for large, high-traffic applications.
Types of Sharding Architectures
1. Range-Based Sharding
- How it works: Data is divided based on value ranges of the shard key.
- Pros: Simple to implement.
- Cons: Can lead to uneven data distribution (hot spots).
- Example:
  - userID 000 - 199 ➔ database 1
  - userID 200 - 399 ➔ database 2
  - userID 400 - 599 ➔ database 3
  - userID 600 - 799 ➔ database 4
  - userID 800 - 999 ➔ database 5
2. Key-Based (Hash-Based) Sharding
- How it works: Applies a hash function to the key (e.g., userID), and assigns the result to a shard.
- Pros: Tends to distribute data more evenly than range sharding.
- Cons: Simple hash functions can still create imbalances; consistent hashing can address this.
- Example:
  - h(x) = x % 3, where x = userID
  - Distributes userIDs 1–6 across 3 shards
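A toy sketch contrasting the two routing rules above; the shard counts, ranges, and database names mirror the examples and are not how a production router is implemented.

```python
# Toy shard routers mirroring the range-based and hash-based examples above.

RANGES = [(0, 199, "database 1"), (200, 399, "database 2"), (400, 599, "database 3"),
          (600, 799, "database 4"), (800, 999, "database 5")]

def range_shard(user_id: int) -> str:
    """Range-based: pick the shard whose userID range contains the key."""
    for lo, hi, db in RANGES:
        if lo <= user_id <= hi:
            return db
    raise ValueError("userID out of range")

def hash_shard(user_id: int, num_shards: int = 3) -> str:
    """Key/hash-based: h(x) = x % num_shards spreads keys more evenly."""
    return f"database {user_id % num_shards + 1}"

print(range_shard(423))   # database 3  (falls in 400 - 599)
print(hash_shard(423))    # database 1  (423 % 3 == 0)
```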
3. Directory-Based Sharding
- How it works: Maintains a lookup table that maps each key to its shard.
- Pros: Highly flexible, as any key can be mapped to any shard.
- Cons: The lookup table is a single point of failure and can be a bottleneck.
Horizontal Partitioning vs. Sharding
- Horizontal Partitioning: Splits a table into many tables within the same database instance (partitions must have unique table names).
- Sharding: Distributes data across multiple database instances (each can use the same table names).
Advantages
- High Availability: Other shards stay online even if one fails.
- Security: Access can be controlled at the shard level.
- Faster Queries: Each shard manages less data, making indexes and queries more efficient.
- Increased Throughput: Reads and writes are isolated per shard, increasing performance.
- High Scalability: Additional shards can be added to handle growing data and traffic.
Disadvantages
- Increased Complexity: Query routing, data distribution, and cross-shard coordination add complexity.
- Transactions/Rollbacks across Shards: Difficult or impossible without distributed transaction protocols.
- Expensive Joins: Joining data across shards requires costly, cross-network operations.
- Higher Infrastructure Cost: Requires multiple servers, raising costs compared to single-instance solutions.
Hierarchical Sharding
- Problem: Fixed number of shards may lead to unbalanced data distribution (hot spots).
- Solution: Hierarchical sharding shards a large shard further into smaller mini-shards, managed by a shard manager.
- Helps manage growth and balance data distribution within problematic shards.
Master-Slave Architecture for High Availability
- Approach: Each shard has a master instance (handles writes) and one or more slave replicas (handle reads).
- Failover: On master failure, a slave can be promoted to master.
- Benefit: Supports continuous reads and maintains data redundancy for disaster recovery.
Consistency
Consistency measures how up-to-date and synchronized distributed data copies are.
1. Linearizable Consistency
- Definition: Every read returns the latest written value; all operations appear instantaneous and in order.
- How: Typically achieved by funneling all operations through a single serialization point (e.g., a single-threaded server or a consensus-backed leader); easy to reason about, but limits throughput.
2. Eventual Consistency
- Definition: All copies will converge to the same value eventually, though reads may return stale data in the short term.
- How: Reads and writes can occur in parallel across replicas.
- Trade-off: Higher efficiency at the expense of short-term consistency.
3. Causal Consistency
- Definition: Ensures that causally related operations are seen in the same order by all processes, but unrelated operations may arrive out of order.
- Benefit: Stronger than eventual consistency but weaker than linearizable; suitable for operations that are mostly independent.
- Limitation: Fails for aggregation queries, since these require knowledge of the complete (global) order.
4. Quorum Consistency
- Definition: Reads and writes require responses from a minimum number of replicas, following the rule `R + W > N`, where:
  - R = number of replicas to read from
  - W = number of replicas to write to
  - N = total number of replicas
- Configurability: Can tune for stronger or weaker consistency as needed.
- Drawbacks: Requires more servers, and even numbers can lead to split-brain scenarios.
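Since the quorum rule is just an arithmetic condition on the chosen replica counts, a tiny sketch makes the tradeoff concrete (the values below are illustrative):

```python
# Quorum rule: overlapping read and write sets guarantee a read sees the latest write.
def is_strongly_consistent(n: int, r: int, w: int) -> bool:
    return r + w > n

print(is_strongly_consistent(n=3, r=2, w=2))  # True  -> read and write quorums must overlap
print(is_strongly_consistent(n=3, r=1, w=1))  # False -> a read may miss the latest write
```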
Data Consistency Levels Tradeoff Table
| Level | Consistency Level | Efficiency |
|---|---|---|
| Linearizable | High | Low |
| Eventual | Lowest | Highest |
| Causal | Between Eventual & Linearizable | Medium |
| Quorum | Configurable | Configurable |
Linearizability in Databases
What is Linearizability?
Linearizability is a strong consistency guarantee for distributed systems and databases. It requires that all operations (reads and writes) on a data item appear to execute atomically and in some real-time, total order—that is, every operation takes effect instantly between its invocation and its response.
Key Properties
- Real-Time Ordering: If operation A completes before operation B starts, then A must be visible to B and all subsequent operations.
- Atomicity: Every operation appears to take effect instantly at a single, unique point within its begin/end interval.
- Single Global Timeline: All clients see changes in the same order; there are no “stale” reads or causal gaps.
- User Intuition: Each read always returns the most recent written value (as observed in real time), which is what most users expect in strong consistency systems.
Visual Example
Suppose two clients interact with a key-value store:
- At 08:00, Client 1 writes `X = 10`.
- At 08:01, Client 2 writes `X = 20`.
- At 08:02, Client 3 reads `X`.
Linearizability guarantees:
Client 3’s read will definitely return 20, since the write of 20 happened before the read started.
Time axis → |---W(X=10)---|---W(X=20)---|---R(X)---|
C1 writes C2 writes C3 reads
Comparison: Linearizability vs Other Guarantees
| Guarantee | Real-Time Order? | Atomic Visibility | Possible Stale Reads? |
|---|---|---|---|
| Linearizability | Yes | Yes | No |
| Sequential Consistency | No | Yes | Yes (within session) |
| Eventual Consistency | No | No | Yes |
- Linearizability is the strictest; all reads/writes appear instantly at a single point on the timeline.
Where is Linearizability Useful?
- Financial transactions (bank transfers)
- Distributed locking and coordination services
- User-facing systems where stale reads cannot be tolerated
- Leader election and consensus algorithms (e.g., Raft, Paxos)
Trade-Offs
- Linearizability often comes with higher latency and lower availability during partitions or node failures (see CAP theorem).
- Many cloud or distributed databases provide linearizable consistency as an option, but may default to weaker consistency for higher performance/scalability.
Replication Topologies
Replication topologies define how data is copied and synchronized across nodes in a distributed database system.
1. Single-Leader Replication (Master-Slave / Primary-Secondary)
- One node is designated as the leader (or master/primary) and handles all write operations.
- Multiple follower nodes (slaves/secondaries) replicate data from the leader and can serve read requests.
- In the event of leader failure, a follower is promoted to leader.
Pros:
- Strong consistency: All writes go through one node, simplifying state management and eliminating write conflicts.
- Simpler conflict resolution: Easy to reason about ordering of updates.
- High read scalability: Followers can serve read-only traffic, easing load on the leader.
- Operational simplicity: More straightforward to implement and manage.
Cons:
- Single point of failure: If the leader fails, writes are unavailable until failover.
- Write bottleneck: Write throughput is limited to the leader’s capacity.
- Potential lag: Followers may be somewhat behind the leader (especially with asynchronous replication).
- Failover complexity: Promoting a new leader adds operational challenges.
How is a New Leader Elected When the Leader Fails?
Single-leader replication relies on exactly one node (the leader/master/primary) for all writes. If this node fails, the system uses a standardized "failover" process to restore availability:
1. Detecting Leader Failure
- Followers or external monitors (like health checks or heartbeats) detect that the leader is unresponsive.
- A time-out is used: if the leader doesn’t respond within a set period (e.g., 30 sec), it is presumed dead.
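A simplified sketch of this timeout-based detection; the 30-second threshold and the detector interface are illustrative assumptions (real systems usually run heartbeats over the replication channel or an external monitor).

```python
import time

HEARTBEAT_TIMEOUT = 30.0   # seconds without a heartbeat before the leader is presumed dead

class FailureDetector:
    def __init__(self):
        self.last_heartbeat = time.monotonic()

    def on_heartbeat(self) -> None:
        """Called whenever a heartbeat from the leader arrives."""
        self.last_heartbeat = time.monotonic()

    def leader_presumed_dead(self) -> bool:
        """True once no heartbeat has been seen within the timeout window."""
        return time.monotonic() - self.last_heartbeat > HEARTBEAT_TIMEOUT
```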
2. Leader Election
- Once failure is detected, remaining followers coordinate to elect a new leader.
- The most common algorithm is to select the follower with the most up-to-date data log, to minimize data loss.
- This ensures that committed but not-yet-replicated data is preserved as much as possible.
- Election can be handled via:
- Simple coordination (e.g., by priority/ID)
- Consensus protocols like Raft or Paxos for distributed agreement
- External systems (e.g., ZooKeeper) for coordination and atomic reassignment
3. Reconfiguration
- The cluster is updated:
- Clients are notified to send new writes to the elected leader.
- Followers "follow" the new leader for replication.
- If the old leader comes back, it rejoins as a follower to prevent "split-brain" (two leaders at once).
Main Challenges in Leader Election and Failover
a. Detection Delay and False Positives
- Accurate, fast failure detection is difficult (network delays, slow nodes can be mistaken for failure).
- Too-short timeouts may cause unnecessary failovers ("flapping"), triggering repeated elections and instability.
b. Consensus Complexity
- All (or most) nodes must agree on the new leader; this is a "consensus" problem, especially tricky during network partitions or message loss.
- Algorithms such as Raft/Paxos or coordination services (ZooKeeper) are established solutions, but add complexity and require careful tuning.
c. Data Consistency
- Choosing a follower that is out-of-date risks data loss (writes confirmed by old leader but not yet replicated are lost).
- System must choose as leader the follower with the most complete data log, and must resolve any partial transactions safely.
d. Split-Brain Scenario
- If two nodes believe they’re both leaders (because of a network partition, etc.), both can accept writes, causing data divergence and conflict. Robust leader election algorithms must prevent this.
e. Client Reconfiguration
- Clients must reliably and quickly discover and start sending writes to the new leader; lag can cause failed writes or inconsistent system state until all clients update their configurations.
f. Service Downtime
- Time to detect failure, run consensus, and promote a new leader = a period where no writes are possible.
- Goal is to minimize downtime, but cannot eliminate it completely.
2. Multi-Leader Replication (Multi-Master / Active-Active)
- Multiple nodes act as leaders; each can independently accept write operations.
- These leaders replicate data among themselves.
- Network topology can be circular, star, or all-to-all. Topology impacts fault tolerance and replication overhead.
Pros:
- High availability: System remains writable even if some leaders fail.
- Low latency for writes: Writes can be accepted at the nearest leader (useful for geo-distributed systems).
- Geographic distribution: Clients connect to the closest leader, benefiting global applications.
- No single write bottleneck: Write load is distributed.
Cons:
- Conflict resolution required: Concurrent updates to the same data from different leaders must be resolved (e.g., last-writer-wins, custom rules).
- Implementation complexity: Conflict detection and resolution logic is required.
- Increased chance of inconsistencies: Temporary data divergence is possible before conflicts are reconciled.
- Topology management: Certain setups (circular/star) may have single points of failure in the data propagation path if not well designed.
Common Conflict Resolution Strategies
When multiple leader nodes (servers) accept simultaneous writes to the same data in different locations, conflicts can arise when updates are exchanged and merged. Handling these conflicts efficiently and predictably is a key challenge:
- Last Write Wins (LWW):
  - Each write is assigned a unique identifier (e.g., timestamp); the version with the highest ID is kept.
  - Pros: Simple to implement.
  - Cons: Risks losing valuable updates as earlier writes are discarded, potentially leading to data loss.
- Multi-Value Conflict Resolution:
  - Both conflicting versions are stored.
  - The application must later merge or choose between them, possibly involving user input or custom merging logic.
- Custom/Application-Level Logic:
  - Developers provide conflict resolution code, either:
    - On Write: The conflict handler resolves conflicts during replication.
    - On Read: All versions are returned, and the application or user merges data.
- Avoidance:
  - Partition the data space so that all writes for a given record go to one specific leader, reducing conflict risk.
CRDTs: Conflict-free Replicated Data Types
CRDTs are special data structures that enable multiple nodes to update data independently and concurrently. They are designed so that all copies converge to the same, correct state automatically, with no need for central coordination or manual conflict resolution.
- Mergeability: When two replicas diverge (make different changes), merging their states always results in the same eventual value, regardless of the order of operations.
- Operations: All operations are commutative (order doesn't matter), associative, and idempotent (applying multiple times makes no difference).
- Types:
- Counters: Increments can be merged from multiple places (e.g., distributed "likes" count).
- Sets: Add, remove, or update elements, ensuring all valid changes appear in the merged result.
- Maps, Lists, Registers: More complex types for real-world apps (e.g., collaborative documents).
- Use Cases: Distributed collaborative editing, messaging, notes apps, cloud-based multiplayer games, and scalable NoSQL databases.
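A minimal sketch of one of the simplest CRDTs, a grow-only counter (G-Counter): each node increments only its own slot, and merging takes the element-wise maximum, which makes merges commutative, associative, and idempotent. Node IDs here are illustrative.

```python
class GCounter:
    """Grow-only counter CRDT: per-node counts, merged by element-wise max."""
    def __init__(self, node_id: str):
        self.node_id = node_id
        self.counts: dict[str, int] = {}

    def increment(self, n: int = 1) -> None:
        self.counts[self.node_id] = self.counts.get(self.node_id, 0) + n

    def value(self) -> int:
        return sum(self.counts.values())

    def merge(self, other: "GCounter") -> None:
        for node, count in other.counts.items():
            self.counts[node] = max(self.counts.get(node, 0), count)

# Two replicas diverge, then converge after merging in either order.
a, b = GCounter("A"), GCounter("B")
a.increment(3); b.increment(2)
a.merge(b); b.merge(a)
assert a.value() == b.value() == 5
```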
Lamport Clocks
What is a Lamport Clock?
A Lamport clock is a simple, monotonic counter that helps order events in a distributed system.
How it works:
- Every node keeps a local counter.
- When a node performs an operation (local event), it increments its counter.
- When a node sends a message (e.g., a replicated write), it includes its counter value.
- When a node receives a message, it updates its own counter to be one higher than the maximum of its own and the incoming clock.
- This allows the system to order all events in a way that's consistent with possible causality, even if events happen concurrently on different nodes.
Example:
- Node A (counter 5) writes and sends to Node B.
- Node B (counter 3) receives the message and sets its counter to max(3, 5) + 1 = 6.
Notes:
- With every write carrying a Lamport timestamp, conflicts can be resolved by comparing timestamps; the higher timestamp "wins" (sometimes called last-write-wins or LWW).
- This mechanism ensures all nodes converge to the same state, but it does not guarantee true linearizability, because Lamport clocks provide only logical (not real-time) ordering. Physical or hybrid clocks are needed for real-time order.
- Lamport clocks alone do not capture real-time (wall-clock) order; two concurrent writes with different values may need additional tie-breakers (e.g., node ID).
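A minimal sketch of the counter rules described above; the node setup at the bottom reproduces the A/B example and is illustrative only.

```python
class LamportClock:
    def __init__(self):
        self.time = 0

    def local_event(self) -> int:
        self.time += 1
        return self.time

    def send(self) -> int:
        """Increment and return the timestamp to attach to an outgoing message."""
        self.time += 1
        return self.time

    def receive(self, msg_time: int) -> int:
        """On receive, jump past both the local clock and the incoming clock."""
        self.time = max(self.time, msg_time) + 1
        return self.time

# Node A (counter 5) sends to Node B (counter 3), as in the example above.
a, b = LamportClock(), LamportClock()
a.time, b.time = 4, 3
ts = a.send()          # A increments to 5 and includes it in the message
print(b.receive(ts))   # B sets max(3, 5) + 1 = 6
```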
Why are real (physical) timestamps not reliable?
- Physical clocks on separate machines almost never remain perfectly synchronized, even with protocols like NTP. Small skews (milliseconds) can result in incorrect ordering of updates—especially during network partitions, heavy loads, or following system clock corrections.
- Relying solely on physical clocks can create “time gaps” where a write that has happened is not yet visible to other nodes due to network delays, allowing later writes (with higher timestamps) to be accepted and possibly overwriting genuine updates.
Merkle Trees
Merkle Trees are hash-based tree data structures used for efficient content verification and synchronization in distributed systems, blockchains, and databases.
- Structure:
- Leaf nodes: Contain hashes of individual data blocks (e.g., database rows, transactions).
- Internal nodes: Each is a hash of its children's hashes.
- Root node (Merkle root): Contains a single hash summarizing all the data beneath it.
- Properties:
- Any change to a single data block changes its hash, which affects all parent hashes, up to the root.
- To compare/synchronize two datasets, only the minimal set of inconsistent blocks (where hashes differ) need to be checked, rather than the whole dataset.
- Applications:
- Used to efficiently detect and sync only changed data in replication or backup scenarios.
- Blockchains use Merkle roots to verify the integrity of entire sets of transactions within each block.
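A compact sketch of building a Merkle root and using it to detect divergence between two replicas' data blocks; the hashing and pairing scheme is simplified for illustration (odd levels duplicate their last hash).

```python
import hashlib

def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(blocks: list[bytes]) -> bytes:
    """Hash the leaves, then repeatedly hash pairs of child hashes up to a single root."""
    level = [_h(b) for b in blocks] or [_h(b"")]
    while len(level) > 1:
        if len(level) % 2:                 # duplicate the last hash if the level is odd
            level.append(level[-1])
        level = [_h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

replica_a = [b"row1", b"row2", b"row3", b"row4"]
replica_b = [b"row1", b"row2", b"row3-changed", b"row4"]

# Differing roots mean the replicas diverged; walking down the tree localizes the changed block.
print(merkle_root(replica_a) != merkle_root(replica_b))  # True
```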
Summary Table
| Topic | Purpose / Role | Key Advantage |
|---|---|---|
| Conflict Handling in Multi-Leader | Ensures consistency when updates collide | Multiple strategies suit various needs |
| CRDTs | Merges concurrent updates automatically, no user conflict resolution | Guaranteed eventual convergence, high availability |
| Merkle Trees | Verifies/synchronizes large datasets efficiently and securely | Fast change detection, minimal data movement |
Note: For applications that absolutely require linearizability, multi-leader replication is generally not used, or extra coordination (like distributed locking or consensus) is added at the cost of performance and availability.
3. Leaderless Replication
- No single leader exists; all nodes are equal—any node can accept reads or writes.
- Writes are sent to several replicas, governed by quorum (N, R, W) consistency (e.g., “a write must succeed on W out of N nodes”).
- Quorum reads ensure clients get the most recent or majority-backed values.
Pros:
- Highest availability: Any node can handle requests even if others are down.
- No single point of failure: True resilience (as long as quorum is preserved).
- Horizontal scalability: Both reads and writes can be scaled by adding more nodes.
- Geographic flexibility: Well-suited for global architectures.
Cons:
- Eventual consistency (not strong): Reads may return stale data until nodes converge.
- Conflict management complexity: Requires version vectors, Merkle trees, or similar to resolve conflicts and reconcile divergent histories.
- Operational complexity: More client logic needed for things like read repair, handling conflicting writes, and node resurrection.
- Stronger consistency costs: Achieving strong consistency (by using strict quorums) can add latency, lower availability, or both.
Read Repair
- Read Repair is a mechanism in distributed databases and leaderless replication systems (like Cassandra, ScyllaDB, DynamoDB) used to detect and correct data inconsistencies among replicas during read operations.
- When a client issues a read, the system checks data from multiple replicas. If inconsistencies (stale or divergent values) are detected, the system sends updates to out-of-date replicas, ensuring they converge toward the latest, correct value.
- Common in leaderless/ eventually consistent databases: Cassandra, ScyllaDB, DynamoDB, etc.
- Used together with quorum reads/writes and background anti-entropy processes.
- Relies on version vectors, timestamps, or digests to track and merge divergent data
How Read Repair Works
- Read Request: A client reads data with a given consistency level (e.g., quorum or majority) from multiple replicas.
- Data Comparison: The database compares the data (or digests/hashes) returned from each replica.
- Detect Inconsistency: If some replicas return different versions, an inconsistency is detected.
- Repair: The system identifies the correct (latest) version (often based on version/timestamp, "last write wins", or quorum rules).
- The up-to-date value is written back to the stale replicas.
- Repairs can be blocking (client waits for the repair before getting a response) or asynchronous (repair continues after returning the response).
- Return Data: The client receives the correct, most recent value.
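A simplified sketch of this flow using last-write-wins timestamps; the `Replica` class, its `get`/`put` methods, and the version scheme are assumptions standing in for real replica clients.

```python
from dataclasses import dataclass

@dataclass
class Versioned:
    value: str
    timestamp: int          # e.g., a Lamport or hybrid timestamp

class Replica:
    """Stand-in for a replica client with a simple key-value store."""
    def __init__(self):
        self.store: dict[str, Versioned] = {}
    def get(self, key: str) -> Versioned | None:
        return self.store.get(key)
    def put(self, key: str, v: Versioned) -> None:
        self.store[key] = v

def quorum_read_with_repair(key: str, replicas: list[Replica]) -> Versioned | None:
    """Read from all given replicas, pick the newest version, and repair stale copies."""
    responses = [(r, r.get(key)) for r in replicas]
    known = [v for _, v in responses if v is not None]
    if not known:
        return None
    latest = max(known, key=lambda v: v.timestamp)           # "last write wins"
    for replica, v in responses:
        if v is None or v.timestamp < latest.timestamp:
            replica.put(key, latest)                          # write back to the stale replica
    return latest
```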
When and Why Is Read Repair Used?
- Triggered during reads: Only replicas involved in a specific read operation are repaired (incrementally; the rest are left for later).
- Complements background anti-entropy repairs: Read repair does not scan all data but is opportunistically triggered by real reads, reducing the need for heavy background repair jobs.
- Improves eventual consistency: Since leaderless databases can diverge due to network partitions or delayed replication, read repair helps replicas converge on the correct state over time.
Types of Read Repair
- Blocking (Foreground) Read Repair: The response to the client is delayed until repairs are completed for the data involved in the read.
- Background (Asynchronous) Read Repair: The client quickly receives data, and repairs to out-of-date replicas are done in the background to avoid extra read latency.
Pros
- Enhances data consistency incrementally, without needing heavy system-wide repair.
- Reduces time windows of inconsistency after network partitions or failed writes.
- Improves reliability and user experience by repairing as data is accessed.
Cons
- Only fixes data touched by reads; rarely/never-read data may remain inconsistent unless proactively repaired.
- Can add latency to reads if repairs are blocking/synchronous.
- Not a substitute for full (anti-entropy) background repair when many divergent replicas exist.
Visual Comparison Table
| Topology | Who Writes? | Scalability | Consistency | Fault Tolerance | Complexity | Example Databases |
|---|---|---|---|---|---|---|
| Single-Leader | Only the leader node | High (reads only) | Strong | OK (leader failover) | Low | MySQL, PostgreSQL, MongoDB, SQL Server |
| Multi-Leader | Any leader node | High (reads/writes) | Eventual / Custom | High | Medium-High | MySQL (Group Replication), MariaDB Galera, CouchDB, AWS Aurora (multi-master) |
| Leaderless | Any node | Highest (R/W) | Eventual* | Highest | High | Cassandra, DynamoDB, Riak |
*Note: Consistency in leaderless setups can be customized; strong consistency can be achieved at the cost of read/write performance or availability.
Quick Usage Guidance
- Single-Leader: Use when you need simple, strong consistency, manageable failovers, and high read scalability.
- Multi-Leader: Best for global low-latency writes, office branch synchronization, collaborative/real-time apps, but requires careful conflict handling.
- Leaderless: For ultra-high availability, no single point of failure, and massive scale (e.g., IoT, CDN, big data apps)—but with eventual consistency and extra complexity.
Synchronization Between Leader and Follower Nodes in Database Replication
1. Change Propagation
- Writes happen on the leader: All client writes (inserts, updates, deletes) are sent to the leader node only.
- Change log: The leader records these changes in a replication log (also called write-ahead log, binlog, or change stream) as soon as it commits the change to its own data store.
- Sending changes: After committing, the leader sends the same changes (or the log entries) to all followers, so they can apply the exact sequence of operations.
2. Replication Methods
a. Synchronous Replication
- Process: Leader waits for followers to acknowledge/confirm that they’ve applied the change before confirming success to the client.
- Pros: Followers are always up-to-date; system is strongly consistent.
- Cons: Increased write latency (system stalls if followers are slow or offline); rarely are all followers set to synchronous, usually just one “synchronous” follower for high durability.
- Typical setup: Used for mission-critical data needing minimal lag.
b. Asynchronous Replication
- Process: Leader sends changes to followers but does NOT wait for acknowledgments; confirms success to the client once change is written locally.
- Pros: Fast writes, higher throughput.
- Cons: Followers can lag behind, leading to replication lag and brief inconsistency if leader fails and a lagging follower is promoted.
- Typical setup: Default for most web-scale databases; most followers are asynchronous for scalability.
c. Semi-Synchronous Replication
- Hybrid: One (or more) follower(s) is synchronous, the rest are asynchronous. Guarantees data is in at least two places before success is reported.
- Benefit: Balances safety and performance.
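A simplified sketch showing that the three modes differ mainly in how many follower acknowledgments the leader waits for before confirming a write; the `Follower` stub and its methods are assumptions, and real systems wait concurrently rather than in a loop.

```python
class Follower:
    """Stand-in for a replica connection; a real follower applies the log entry, then acks."""
    def send(self, entry: dict) -> None:
        self._last = entry
    def wait_for_ack(self, timeout: float) -> bool:
        return True   # stub: pretend the follower applied the entry within the timeout

def replicate(entry: dict, followers: list[Follower], required_acks: int) -> bool:
    """Leader has already written locally; confirm to the client once enough followers ack.
    required_acks = 0 -> asynchronous, 1 -> semi-synchronous, len(followers) -> synchronous."""
    for f in followers:
        f.send(entry)                              # ship the change (WAL/binlog record)
    acks = 0
    for f in followers:
        if acks >= required_acks:
            break
        if f.wait_for_ack(timeout=1.0):
            acks += 1
    return acks >= required_acks

print(replicate({"op": "UPDATE", "id": 123}, [Follower(), Follower()], required_acks=1))  # True
```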
3. Application of Replicated Changes
- Log Positioning: Each follower knows the log sequence number or offset up to which it has processed changes (e.g., LSN in PostgreSQL, binlog coordinates in MySQL).
- Pull-based Synchronization: Followers typically pull new log entries from the leader starting from their last processed point.
- Exact Order: Followers apply changes in the same order as processed on the leader, ensuring consistency.
4. Setting Up New Followers or Catching Up Lagging Followers
- Snapshot: When a new follower is added, it receives a consistent snapshot of the leader's data.
- Replay Logs: After the snapshot, the follower requests and applies all changes that happened since the snapshot to catch up fully.
- Continuous Sync: Once synced, the follower starts applying new real-time changes from the replication log.
5. Special Cases: Log Shipping
- Some systems (e.g., SQL Server, PostgreSQL) use log shipping: the leader ships transaction log files to followers, who replay these logs to apply data modifications, often with a delay or schedule for disaster recovery scenarios.
6. Failure & Resynchronization
- Follower failure: If a follower goes offline, on recovery, it requests all missed changes since its last sync position, replays them, and then rejoins real-time replication.
- Leader failure: Remaining followers with the most up-to-date data can be elected as the new leader, using the latest log position to minimize data loss.
7. Heartbeat and Monitoring
- Leader and followers often exchange periodic heartbeat messages (e.g., Redis, MySQL) to confirm each other’s availability and track replication health.
- Followers report their replication position; the leader can identify lagging nodes and handle failover when needed.
Summary Table
| Process | Description |
|---|---|
| Change log/stream | Leader records all writes in an ordered log |
| Log shipping/replication | Followers fetch (or receive) change logs and apply in order |
| Synchronous replication | Leader waits for follower confirmation before acknowledging client |
| Asynchronous replication | Leader does NOT wait; followers catch up in background |
| Semi-synchronous | At least one follower is sync; others async |
| Recovery/catch-up | Followers replay missed logs on rejoin to become consistent |
Distributed System Protocols
Two-Phase Commit (2PC) Explained
Two-Phase Commit (2PC) is a distributed transaction protocol used to ensure that a transaction spanning multiple databases or nodes is either committed everywhere or rolled back everywhere, so all nodes remain consistent.
1. Coordinator and Participants
- Coordinator: A special node (often called the transaction manager) that orchestrates the commit process.
- Participants: The databases or resource managers involved in the transaction.
2. Phases
Phase 1: Prepare (Voting Phase)
- The coordinator sends a “prepare to commit?” message to all participants after executing all transaction operations.
- Each participant tries to prepare to commit (by writing to a log, acquiring locks, etc).
- Each participant replies with:
- “Yes” (ready to commit), or
- “No” (cannot commit).
Phase 2: Commit (Decision Phase)
- If all participants replied “Yes”:
- The coordinator sends a “commit” message to all participants, instructing them to finalize the transaction and release locks.
- If any participant replied “No”:
- The coordinator sends a “rollback” (abort) message to all, instructing them to undo any changes.
- After committing or rolling back, each participant sends an acknowledgment to the coordinator.
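A condensed sketch of the coordinator's decision logic across the two phases; the participant interface and method names are assumptions, and a real coordinator also persists its decision to a log before starting phase 2.

```python
class Participant:
    """Stand-in for a resource manager; real ones write a prepare record and hold locks."""
    def __init__(self, can_commit: bool = True):
        self._can_commit = can_commit
    def prepare(self) -> bool:
        return self._can_commit      # the "Yes"/"No" vote
    def commit(self) -> None:
        pass                         # finalize the transaction and release locks
    def rollback(self) -> None:
        pass                         # undo changes and release locks

def two_phase_commit(participants: list[Participant]) -> bool:
    # Phase 1 (voting): ask every participant to prepare.
    votes = [p.prepare() for p in participants]
    # Phase 2 (decision): commit only if every vote was "Yes"; otherwise roll back everywhere.
    if all(votes):
        for p in participants:
            p.commit()
        return True
    for p in participants:
        p.rollback()
    return False

print(two_phase_commit([Participant(), Participant()]))        # True  (committed everywhere)
print(two_phase_commit([Participant(), Participant(False)]))   # False (rolled back everywhere)
```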
Properties
- Atomicity: The transaction either completes everywhere or nowhere—no partial commits.
- Consistency: Keeps distributed databases consistent.
- Durability: Uses logs (often Write-Ahead Log), so the decision survives crashes and can be recovered.
Limitations
- If the coordinator fails after collecting votes and before sending the commit/abort message, participants may be left waiting indefinitely (“blocking”).
- Not optimal for scalability or fault-tolerance in large or high-latency systems. Some modern systems use alternatives or enhancements (e.g., Three-Phase Commit, Saga patterns).
Summary
Two-Phase Commit ensures that distributed transactions are all-or-nothing and all involved nodes end up in a consistent state. However, it is prone to blocking under certain failure conditions.
Distributed Systems Consensus Algorithms
Raft Consensus Algorithm Explained
- Raft is a leader-based consensus algorithm widely used in distributed systems to ensure that all nodes agree on the same sequence of operations, even in the presence of failures.
- It is used for building fault-tolerant, strongly consistent distributed systems, such as replicated databases.
Key Ideas and Workflow
1. Roles
Each server in the cluster can be in one of three roles at any time:
- Leader: Handles all client requests that modify data and manages log replication to followers.
- Follower: Passive; responds to requests from the leader or candidates.
- Candidate: Seeks election as leader if it doesn't hear from one.
2. Terms
- Time is divided into terms (numbered intervals).
- Each term starts with a leader election.
- If a leader is chosen, that term continues until the next election is needed.
3. Leader Election
- If a follower doesn't hear from a leader within an election timeout, it becomes a candidate, increments its term, and asks for votes.
- Servers vote for the first candidate they hear from in the term.
- If a candidate gets votes from a majority, it becomes the new leader.
- If there’s a tie (split vote), a new term is started and the process repeats (randomized timeouts help avoid repeated ties).
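A highly simplified sketch of the majority-vote rule and randomized timeout used in leader election; it ignores the log up-to-dateness check, RPCs, and term bookkeeping that real Raft requires, and all names are illustrative.

```python
import random

def election_timeout() -> float:
    """Randomized timeout (e.g., 150-300 ms) reduces the chance of repeated split votes."""
    return random.uniform(0.150, 0.300)

def wins_election(candidate_id: int, cluster: list[int], votes_granted: set[int]) -> bool:
    """A candidate wins the term if it receives votes from a strict majority of the cluster."""
    votes = {candidate_id} | votes_granted        # a candidate always votes for itself
    return len(votes) > len(cluster) // 2

cluster = [1, 2, 3, 4, 5]
print(wins_election(candidate_id=1, cluster=cluster, votes_granted={2, 3}))  # True  (3 of 5)
print(wins_election(candidate_id=1, cluster=cluster, votes_granted={2}))     # False (2 of 5)
```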
4. Log Replication
- Only the leader can accept writes. It appends the operation to its log and sends AppendEntries (log updates) to followers.
- All log entries are stored locally on every server in the cluster—including leaders and followers. Each Raft server maintains its own persistent log (usually on disk).
- When a client requests to perform an operation (such as a write), the current leader appends a new entry to its own log and then sends AppendEntries RPCs to followers, asking them to append the same entries to their logs.
- Once a log entry is confirmed by a majority, it’s considered committed. All followers then apply it to their local state.
- Followers copy the leader’s log to stay up to date, ensuring strong consistency.
- If the leader crashes before successfully replicating the new entry to a majority, the entry may not be considered committed and might be overwritten by a new leader during the recovery/election process.
- But the uncommitted entry remains available on the leader’s local persistent log. As part of leader election and log reconciliation, Raft ensures that any "dangling" or uncommitted entries that are not part of the committed majority will be overwritten or rolled back as clusters converge on a consistent log after a new leader is elected.
5. Safety
- Once a log entry is committed, Raft guarantees it will not be lost, even if the leader crashes and another leader is elected.
- Only a node with all committed entries can become leader, ensuring new leaders can’t “forget” already-committed operations.
6. Membership Changes
- Raft safely adds/removes nodes using a two-step process (joint consensus), so there’s no ambiguity or data loss during reconfiguration.
Advantages
- Simplicity: Easier to understand and implement than Paxos.
- Linearizability: Every operation appears atomic and totally ordered (single-key strong consistency).
- Fault Tolerance: A cluster of N servers can tolerate up to (N-1)/2 failures. For example, a 5-node cluster can remain available if 2 servers crash.
Adoption
- Used in: Consul, etcd, TiDB, YugabyteDB, and many other distributed databases and key-value stores.
Summary
Raft brings reliability and strong consistency to distributed systems through clear leader election, log replication, and robust safety mechanisms, making it a de-facto standard for consensus in modern distributed architectures.