HDFS (Hadoop Distributed File System)

HDFS is a distributed file system designed to store and manage large volumes of data across commodity hardware clusters. It provides high-throughput access to application data, fault tolerance through replication, and scalability for petabyte-scale storage.

HDFS Architecture: Ensuring Fault Tolerance & High Availability

Core Components

- NameNode: Master node that manages filesystem metadata (file-to-block mappings, permissions). Historically a single point of failure; modern deployments make it highly available with a standby NameNode.
- DataNodes: Worker nodes that store the actual data blocks and report to the NameNode via heartbeats and block reports.
- Block Storage: Files are split into large blocks (128 MB by default), each replicated across multiple DataNodes.
- Secondary NameNode: Periodically checkpoints the NameNode's metadata (merging the edit log into the fsimage) to speed recovery; it is not a failover node.
- Federation & HA: Federation runs multiple independent NameNodes to scale the namespace; HA pairs an active NameNode with a standby, with ZooKeeper coordinating automatic failover.
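
As a concrete illustration, here is a minimal sketch of writing and then reading a file through the HDFS Java client API. The NameNode address and file path are hypothetical placeholders; the point is that the client asks the NameNode only for metadata, while the block bytes stream directly to and from DataNodes.

```java
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical NameNode address; in practice this comes from core-site.xml.
        conf.set("fs.defaultFS", "hdfs://namenode:8020");

        try (FileSystem fs = FileSystem.get(conf)) {
            Path path = new Path("/data/logs/events.txt");

            // Write: the NameNode allocates blocks; the bytes stream to DataNodes.
            try (FSDataOutputStream out = fs.create(path, true)) {
                out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8));
            }

            // Read: the NameNode returns block locations; data comes from DataNodes.
            try (FSDataInputStream in = fs.open(path)) {
                byte[] buf = new byte[64];
                int n = in.read(buf);
                System.out.println(new String(buf, 0, n, StandardCharsets.UTF_8));
            }
        }
    }
}
```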

Practical Use Cases

- Data Lakes: Storing raw logs, sensor data, or user interactions for later analysis.
- Batch Analytics: Serving as the backend for Hive/Spark queries on historical data (e.g., e-commerce sales trends).
- Machine Learning: Storing training datasets for distributed ML pipelines.
- Backup/Archiving: Cost-effective long-term storage for compliance (e.g., financial records).
- Streaming Ingestion: Ingesting real-time data into HDFS with tools like Flume.

Fault Tolerance

Each block is replicated across multiple DataNodes (a replication factor of 3 by default), and rack awareness places replicas on different racks. If a DataNode fails, the NameNode detects the missed heartbeats and re-replicates the affected blocks from the surviving copies.

High Availability

Modern deployments run an active NameNode alongside a hot standby that stays in sync through a shared edit log. ZooKeeper-based failover promotes the standby automatically if the active NameNode goes down, removing the single point of failure.

What is HBase?

HBase is a distributed, scalable, NoSQL column-family database built on top of HDFS, modeled after Google’s Bigtable.

HBase is a key-value store for fast, random access to sparse, large-scale data; it is not a tool for browsing files. It uses HDFS purely as its persistence layer, and it does not index or query existing HDFS files.

Use Cases

Typical HBase workloads need fast random reads and writes at scale: serving user profiles or session state, messaging and activity feeds, time-series and sensor data, and real-time lookups over results that batch jobs have landed in HDFS.

How HBase Improves on HDFS

| Aspect | HDFS Strengths | HBase Strengths (built on HDFS) |
|---|---|---|
| Bulk Storage | Efficient for large, sequential data | Adds random read/write capability |
| File Updates | Append-only | In-memory buffering → no in-place modification |
| Query Type | Full scans | Instant key-based get/put/scan |
| Latency | Seconds to minutes | Milliseconds |

In short, HDFS alone is great for batch and streaming, but not for low-latency access. HBase layers a database engine on top to support random writes and queries, while still leveraging HDFS’s scalability and durability.
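
To make the latency difference concrete, here is a minimal sketch using the HBase Java client API. The table name ("users"), column family ("info"), and ZooKeeper quorum address are hypothetical, and the table is assumed to already exist.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseRandomAccessSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum", "zk1,zk2,zk3"); // hypothetical quorum

        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("users"))) {

            // Random write: lands in the WAL and MemStore, acknowledged in milliseconds.
            Put put = new Put(Bytes.toBytes("user#42"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("email"),
                          Bytes.toBytes("alice@example.com"));
            table.put(put);

            // Random read: a single row fetched by key, no full scan required.
            Result result = table.get(new Get(Bytes.toBytes("user#42")));
            byte[] email = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("email"));
            System.out.println(Bytes.toString(email));
        }
    }
}
```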

How HBase Achieves Frequent Random Writes with Millisecond Latency

HDFS files are immutable, so in-place editing for small updates is impossible. HBase solves this with a write-path architecture:

1. Write Path

  1. Write-Ahead Log (WAL):
     - Every write is first appended to the WAL for durability.
     - The WAL is itself stored in HDFS (so it inherits replication) and records every update before it is applied to memory, which means no acknowledged write is lost on a crash.

  2. MemStore (In-Memory Buffer):
     - The data is then immediately added to the MemStore for the target column family.
     - Being in RAM makes the write extremely fast.
     - The client gets an acknowledgment as soon as the WAL and MemStore are updated; no HDFS file is rewritten on the write path.
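
On the client side, this durability/latency trade-off is exposed through the Put API's Durability setting. A hedged sketch, reusing the hypothetical "users" table from the previous example:

```java
import java.io.IOException;

import org.apache.hadoop.hbase.client.Durability;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class DurabilitySketch {
    // Writes one cell, choosing how eagerly the WAL append is synced.
    static void writeWithDurability(Table table, Durability durability) throws IOException {
        Put put = new Put(Bytes.toBytes("user#42"));
        put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("last_login"),
                      Bytes.toBytes("2024-01-01T00:00:00Z"));
        // SYNC_WAL (the usual default): acknowledge after the WAL edit reaches HDFS.
        // ASYNC_WAL: lower latency, with a small window of data loss on a crash.
        // SKIP_WAL: fastest, but unflushed MemStore data is lost if the server dies.
        put.setDurability(durability);
        table.put(put);
    }
}
```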

2. Flushing and HFiles

When a MemStore reaches its configured threshold (128 MB by default), its contents are sorted and flushed to a new, immutable HFile in HDFS. This turns many small random writes into one large sequential write, exactly the access pattern HDFS handles well, and the corresponding WAL entries can then be discarded.

3. Reads and Compactions

A read merges the MemStore with the relevant HFiles, using block indexes and optional Bloom filters to skip files that cannot contain the requested row. Because every flush creates a new HFile, background compactions periodically merge smaller HFiles into larger ones and drop deleted or expired cells, keeping the number of files each read must touch bounded.
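
Flushes and compactions normally run automatically, but the HBase Admin API can also trigger them on demand, which is useful for testing or maintenance. A hedged sketch against the same hypothetical "users" table:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class FlushCompactSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Admin admin = conn.getAdmin()) {
            TableName users = TableName.valueOf("users");

            // Force the table's MemStores to flush into new HFiles.
            admin.flush(users);

            // Request a major compaction: merge all HFiles per store and
            // physically drop deleted/expired cells. Runs asynchronously.
            admin.majorCompact(users);
        }
    }
}
```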

4. Why This Works

Everything HBase writes to HDFS (WAL segments and HFiles) is append-only or immutable, so HBase never needs the in-place file updates HDFS cannot provide. Random writes are absorbed in memory, durability comes from sequential WAL appends, and random reads are served from sorted, indexed files; that is how a batch-oriented file system ends up supporting millisecond latency.

Indexing in HBase

HBase has one primary index: the row key. Rows are stored sorted by key, so point gets and range scans over key prefixes are fast, while per-HFile block indexes and optional Bloom filters cut down the files touched per read. There are no built-in secondary indexes; queries on other columns require a full scan or an external layer such as Apache Phoenix.
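
A short sketch of a row-key range scan (HBase 2.x client API; the table and key scheme are hypothetical). Because rows are sorted by key, only the half-open range ["user#100", "user#200") is read rather than the whole table:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class RowKeyScanSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("users"))) {

            // Sorted storage makes this a bounded seek, not a full table scan.
            Scan scan = new Scan()
                .withStartRow(Bytes.toBytes("user#100"))
                .withStopRow(Bytes.toBytes("user#200"));

            try (ResultScanner scanner = table.getScanner(scan)) {
                for (Result row : scanner) {
                    System.out.println(Bytes.toString(row.getRow()));
                }
            }
        }
    }
}
```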

Summary Table

| Feature | HDFS | HBase |
|---|---|---|
| Purpose | Distributed file storage | NoSQL store for real-time access on HDFS data |
| Data Model | Files → Blocks | Tables → Rows → Column Families → Columns |
| Fault Tolerance | Block replication, rack awareness | Inherits from HDFS |
| Updates | Append-only | Buffered (MemStore) + immutable file creation |
| Random Reads/Writes | Inefficient | Optimized, low latency |
| Indexing | Directory tree | Primary row key, Bloom filters, optional indexes |
| Best For | Batch analytics, archival storage | Real-time lookup/update of structured data |

In essence:

- HDFS provides robust, fault-tolerant distributed storage.
- HBase builds on HDFS to offer a low-latency, high-throughput database for random access, using the WAL + MemStore to make writes fast and never rewriting HDFS files in place.

This combination allows organizations to handle both batch and real-time big data workloads efficiently.

Hive

Apache Hive: A Detailed Explanation

Apache Hive is an open-source data warehouse infrastructure built on top of the Hadoop ecosystem, designed to enable efficient querying, analysis, and management of large-scale datasets stored in distributed file systems like Hadoop Distributed File System (HDFS). Hive provides a SQL-like query language called HiveQL (or HQL), allowing users familiar with SQL to interact with big data without needing to write complex MapReduce jobs.

Hive translates HiveQL queries into MapReduce, Tez, or Spark jobs for execution, making it a bridge between traditional SQL workflows and Hadoop's distributed processing. It supports data summarization, ad-hoc querying, and ETL (Extract, Transform, Load) operations, while handling structured and semi-structured data in formats like Parquet, ORC, and Avro.

What is Hive Used For?

Hive is primarily used for:

- Data Warehousing and Analytics: Storing, querying, and analyzing petabyte-scale datasets in a distributed environment.
- SQL-Like Querying: Enabling non-programmers (e.g., analysts) to perform complex joins, aggregations, and filtering on big data using HiveQL, which compiles to Hadoop jobs.
- ETL Processes: Transforming raw data into structured formats for reporting and machine learning.
- Metadata Management: Maintaining a metastore (e.g., backed by MySQL or PostgreSQL) that catalogs tables, schemas, and partitions, abstracting the underlying HDFS structure.
- Integration with Ecosystems: Working with tools like Spark, Presto, or Hive on Tez for faster execution, and with cloud services for scalability.

Hive excels in batch processing for historical data analysis but has evolved to support interactive querying via engines like Hive LLAP (Live Long and Process). Without Hive, big data analytics would remain siloed to developers, limiting adoption in organizations with diverse skill sets. In 2025, Hive is often augmented with ACID tables and vectorized execution for better performance, but for low-latency or transactional workloads, alternatives like Trino or Snowflake are preferred.

Architecture Overview

Hive's architecture includes:

- Query Parser and Optimizer: Converts HiveQL into an abstract syntax tree, applies optimizations (e.g., predicate pushdown), and generates execution plans.
- Metastore: Centralized repository for metadata (tables, partitions, schemas), stored in a relational DB such as Derby or PostgreSQL.
- Execution Engine: Compiles queries into MapReduce, Tez (DAG-based), or Spark jobs for distributed execution on Hadoop clusters.
- Storage Layer: Integrates with HDFS/S3 for data, supporting formats like ORC (optimized for compression) and Parquet (columnar, suited to analytics).
- SerDe (Serializer/Deserializer): Handles the on-disk data formats when reading and writing rows.

This layered design ensures fault tolerance and scalability, with HiveServer2 providing JDBC/ODBC access for clients.
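
Since HiveServer2 speaks JDBC, a Java client can submit HiveQL directly. A minimal sketch: the host, credentials, table, and the Tez engine setting are hypothetical examples, and the hive-jdbc driver is assumed to be on the classpath.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcSketch {
    public static void main(String[] args) throws Exception {
        // HiveServer2 endpoint (hypothetical host); 10000 is the default port.
        String url = "jdbc:hive2://hive-server:10000/default";

        try (Connection conn = DriverManager.getConnection(url, "analyst", "");
             Statement stmt = conn.createStatement()) {

            // Optionally choose the engine the query compiles to.
            stmt.execute("SET hive.execution.engine=tez");

            // A date-partitioned, ORC-backed table: partition pruning lets
            // queries touch only the HDFS directories for matching dates.
            stmt.execute(
                "CREATE TABLE IF NOT EXISTS sales (" +
                "  order_id BIGINT, amount DOUBLE, region STRING)" +
                " PARTITIONED BY (dt STRING) STORED AS ORC");

            // HiveQL compiles to a distributed job; the filter on the
            // partition column prunes the scan to a single partition.
            try (ResultSet rs = stmt.executeQuery(
                    "SELECT region, SUM(amount) AS total " +
                    "FROM sales WHERE dt = '2024-01-01' " +
                    "GROUP BY region")) {
                while (rs.next()) {
                    System.out.println(rs.getString("region") + " " + rs.getDouble("total"));
                }
            }
        }
    }
}
```

This is the same partitioning-by-date pattern mentioned in the use cases below, where pruning keeps queries over large historical tables efficient.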

Practical Use Cases

  1. Log Analysis and ETL: Processing web server logs in HDFS to extract user behavior insights, using HiveQL for aggregations (e.g., daily active users).
  2. Business Intelligence Reporting: Querying terabytes of sales data for dashboards, integrating with tools like Tableau for visualizations.
  3. Data Lake Analytics: In cloud setups (e.g., AWS Athena-like), querying semi-structured JSON/CSV in S3 for ad-hoc analysis.
  4. Machine Learning Feature Engineering: Transforming raw data into features for models, e.g., aggregating customer transactions by region.
  5. Fraud Detection: Analyzing transaction logs for patterns, with Hive partitioning by date for efficient queries.

Hive is widely used in industries like finance (e.g., transaction analysis), e-commerce (recommendation data prep), and telecom (call detail records).

Advantages

- SQL-like interface (HiveQL) that lets analysts work with big data without writing MapReduce code.
- Scales to petabyte-sized datasets on commodity Hadoop clusters, inheriting HDFS fault tolerance.
- Flexible storage: works over HDFS or S3 with columnar formats like ORC and Parquet.
- Pluggable execution engines (MapReduce, Tez, Spark) and a shared metastore that other tools such as Spark and Presto/Trino can reuse.

Limitations and When Not to Use It

Hive's main limitation is latency: queries compile to distributed batch jobs, so even simple statements can take seconds to minutes, and despite ACID table support it is not designed for row-level transactional workloads.

When Not to Use:

- Small-scale data (<1 TB), where a relational DB like PostgreSQL suffices.
- Real-time streaming (use Kafka + Flink instead).
- High-velocity OLTP (use a NoSQL store like Cassandra).
- Simple scripting tasks (use Pig or direct MapReduce).