Kubernetes (K8s)

1. Why Kubernetes? The Problem Docker Solves… and Doesn’t

| Problem | Docker | Kubernetes |
|---|---|---|
| Run app in container | Yes | Yes |
| Run 100 containers | Manual | Automated |
| Auto-restart failed container | No | Yes |
| Scale to 1000 containers | No | Yes |
| Rolling updates | No | Yes |
| Self-healing | No | Yes |
| Multi-host deployment | No | Yes |

Docker = "Run one container"
Kubernetes = "Orchestrate 10,000 containers across 100 machines"

2. What Kubernetes Offers on Top of Docker

| Feature | What It Does |
|---|---|
| Orchestration | Manages 1000s of containers across nodes |
| Self-healing | Auto-restart, reschedule failed pods |
| Auto-scaling | Scale up/down based on CPU/load |
| Rolling Updates | Zero-downtime deployments |
| Service Discovery | api.service → auto DNS |
| Load Balancing | Spread traffic across pods |
| Secret/Config Management | Inject env vars, files securely |
| Multi-cloud | Run same app on AWS, GCP, Azure, on-prem |

3. Kubernetes Architecture – Master vs Worker Nodes

+------------------+     gRPC/HTTP     +------------------+
|   MASTER NODE    | ◄───────────────► |   WORKER NODE    |
| (Control Plane)  |                   | (Runs Pods)      |
+------------------+                   +------------------+

MASTER NODE (Control Plane) – The Brain of K8s

Runs on 1 or 3+ nodes (HA)
Does not run user workloads by default (control-plane nodes carry a NoSchedule taint)
All components talk via kube-apiserver

+------------------+
|   MASTER NODE    |
|                  |
|  ┌─────────────┐ |
|  │ API Server  │ ← All communication
|  └─────▲───────┘ |
|        │         |
|  ┌─────▼───────┐ |
|  │ etcd        │ ← Single source of truth
|  └─────▲───────┘ |
|        │         |
|  ┌─────▼───────┐ |
|  │ Scheduler   │ ← "Where to run?"
|  └─────▲───────┘ |
|        │         |
|  ┌─────▼───────┐ |
|  │ Controller  │ ← "Make it match desired state"
|  │ Manager     │
|  └─────────────┘ |
+------------------+

1. kube-apiserver – The Front Door

| Role | Details |
|---|---|
| Central API | All kubectl, controllers, kubelet → talk to this |
| REST API | GET /api/v1/pods, POST /api/v1/namespaces |
| Authentication | JWT, certificates, OIDC, webhook |
| Authorization | RBAC, ABAC, Node, Webhook |
| Validation | Rejects invalid objects (schema and admission checks) |
| Scaling | Horizontal (multiple replicas behind LB) |

# You talk to this
kubectl get pods --server=https://master:6443

2. etcd – The Database (Single Source of Truth)

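| Role | Details |
|---|---|
| Key-value store | Only stores cluster state (pods, services, secrets) |
| Consistent & HA | Uses Raft consensus |
| Watched by all | Controllers react to changes (watches go through the API server) |
| Backup critical | etcdctl snapshot save |

# See raw data (assumes a kubeadm-style cluster; etcdctl needs the etcd client certs)
kubectl exec -n kube-system etcd-master -- etcdctl \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  get /registry/pods/default/myapp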

If etcd dies → cluster is brain-dead (running pods keep running, but nothing can be changed)
Use an odd number of etcd members in production (3 or 5)
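
Why an odd member count? Raft needs a majority (quorum) to commit writes. The sketch below is plain arithmetic, not etcd code; the function names are made up for illustration.

```python
def quorum(members: int) -> int:
    """Votes needed for a Raft majority."""
    return members // 2 + 1


def fault_tolerance(members: int) -> int:
    """Members that can fail while a majority still exists."""
    return members - quorum(members)


for n in range(1, 6):
    print(f"{n} members: quorum={quorum(n)}, tolerates {fault_tolerance(n)} failure(s)")
```

Note that 4 members tolerate no more failures than 3, which is why even sizes buy nothing.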

3. kube-scheduler – The Matchmaker

| Role | Details |
|---|---|
| Watches | Unscheduled pods (nodeName: null) |
| Scores nodes | CPU, memory, taints, affinity, topology |
| Assigns | Sets pod.spec.nodeName |

Filtering Example (nodeSelector is a hard constraint, applied before scoring)

# Pod wants SSD
nodeSelector:
  disktype: ssd

→ Scheduler picks node with label disktype=ssd
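
A toy version of the filter-then-score idea, assuming a made-up scoring rule (most free CPU wins). The real scheduler runs many filter and score plugins; `pick_node`, the node dicts, and `free_cpu_m` are all hypothetical.

```python
from typing import Optional


def pick_node(node_selector: dict, nodes: list) -> Optional[str]:
    # Filter: keep nodes whose labels contain every nodeSelector pair.
    feasible = [
        n for n in nodes
        if all(n["labels"].get(k) == v for k, v in node_selector.items())
    ]
    if not feasible:
        return None  # no feasible node: the pod would stay Pending
    # Score: hypothetical rule, the node with the most free CPU wins.
    return max(feasible, key=lambda n: n["free_cpu_m"])["name"]


nodes = [
    {"name": "node-a", "labels": {"disktype": "hdd"}, "free_cpu_m": 4000},
    {"name": "node-b", "labels": {"disktype": "ssd"}, "free_cpu_m": 500},
    {"name": "node-c", "labels": {"disktype": "ssd"}, "free_cpu_m": 1500},
]
print(pick_node({"disktype": "ssd"}, nodes))
```

node-a has the most free CPU but is filtered out before scoring because it lacks the label.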

4. kube-controller-manager – The Robot Army

Runs multiple controllers in one process:

| Controller | Job |
|---|---|
| ReplicaSet | Ensure 3 pods → if 2, create 1 |
| Deployment | Manage rollouts, rollback |
| StatefulSet | Ordered pods (db-0, db-1) |
| DaemonSet | Run on every node (logging, monitoring) |
| Job/CronJob | Run to completion |
| Node | Mark node NotReady if kubelet stops |
| Endpoint | Update Service → Pod IP mapping |

# See controllers in action
kubectl get rs,deployments,statefulsets -A
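
Every controller above follows the same pattern: compare desired state with observed state and act on the difference. A minimal sketch of that loop for a ReplicaSet-style controller; the function and the "delete last-listed pods first" policy are illustrative assumptions, not the real controller's logic.

```python
def reconcile(desired: int, running: list) -> list:
    """Return the actions needed to move from the observed to the desired state."""
    diff = desired - len(running)
    if diff > 0:
        return [f"create pod #{i + 1}" for i in range(diff)]
    if diff < 0:
        # Hypothetical policy: remove the last-listed pods first.
        return [f"delete {name}" for name in running[diff:]]
    return []  # already converged, nothing to do


print(reconcile(3, ["web-a", "web-b"]))           # one pod missing
print(reconcile(2, ["web-a", "web-b", "web-c"]))  # one pod too many
```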

5. cloud-controller-manager

| Role | Cloud Integration |
|---|---|
| Node | Sync cloud node metadata |
| LoadBalancer | Create AWS ELB, GCP LB |
| Route | Cloud network routes |
| Service | Manage cloud-specific services |

Only runs in cloud environments

WORKER NODE – The Muscle

Runs user workloads (pods)
Multiple per cluster

+------------------+
|   WORKER NODE    |
|                  |
|  ┌─────────────┐ |
|  │ kubelet     │ ← Talks to API server
|  └─────▲───────┘ |
|        │         |
|  ┌─────▼───────┐ |
|  │ kube-proxy  │ ← Load balances
|  └─────▲───────┘ |
|        │         |
|  ┌─────▼───────┐ |
|  │ containerd  │ ← Runs containers
|  └─────▲───────┘ |
|        │         |
|  ┌─────▼───────┐ |
|  │ Pods        │ ← Your apps
|  └─────────────┘ |
+------------------+

1. kubelet – The Node Agent

| Role | Details |
|---|---|
| Watches API server | Gets assigned pods |
| Talks to container runtime | Starts/stops containers |
| Reports status | CPU, memory, pod phase |
| Exec, logs, port-forward | kubectl exec, logs |
| cAdvisor | Built-in metrics |

# See what kubelet sees
journalctl -u kubelet

2. kube-proxy – The Network Cop

| Role | Details |
|---|---|
| Watches Services & Endpoints | When pod IP changes |
| Programs iptables / IPVS | Routes traffic |
| Load balances | Random pick per connection (iptables mode) or round-robin by default (IPVS mode) |
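
How does a chain of iptables rules spread traffic evenly? In iptables mode, kube-proxy is generally described as writing one rule per endpoint with a random-match probability of 1/n, 1/(n-1), ..., 1. The sketch below only checks that arithmetic; it does not read real iptables rules, and the function names are made up.

```python
from fractions import Fraction


def rule_probabilities(endpoints: int) -> list:
    """Match probability of each rule, evaluated top to bottom."""
    return [Fraction(1, endpoints - i) for i in range(endpoints)]


def effective_share(endpoints: int) -> list:
    """Fraction of all connections that ends up at each endpoint."""
    shares, remaining = [], Fraction(1)
    for p in rule_probabilities(endpoints):
        shares.append(remaining * p)  # traffic that reached this rule and matched
        remaining *= 1 - p            # traffic falling through to the next rule
    return shares


print(rule_probabilities(3))
print(effective_share(3))
```

Each endpoint ends up with an equal share even though the per-rule probabilities differ.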

Service Types Handled

type: ClusterIP  → 10.96.0.1 → iptables DNAT
type: NodePort   → 30080 → iptables
type: LoadBalancer → cloud LB

3. Container Runtime – The Engine

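| Runtime | Status |
|---|---|
| containerd | De facto default since dockershim was removed in K8s 1.24 |
| CRI-O | Red Hat, lightweight |
| Docker Engine | dockershim removed in K8s 1.24; needs the cri-dockerd adapter |

Images built with Docker still run on containerd and CRI-O (they are standard OCI images); only using Docker Engine itself as the node runtime requires cri-dockerd.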

# Check runtime
kubectl get nodes -o wide
# → container-runtime: containerd://1.7.0

Real-World Flow

graph TD
    A[User: kubectl apply] --> B[API Server]
    B --> C[etcd: store desired state]
    C --> D[Scheduler: pick node]
    D --> E[kubelet on node]
    E --> F[containerd: pull image]
    F --> G[Start containers]
    G --> H[kube-proxy: update iptables]
    H --> I[Service ready]

High Availability (HA) Setup

| Component | HA Strategy |
|---|---|
| API Server | 3+ replicas → LB (keepalived, cloud LB) |
| etcd | 3-node cluster (Raft) |
| Scheduler / Controller | Run on all masters (leader election) |
| Worker Nodes | 3+ for redundancy |

Summary Table

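| Node | Component | Job |
|---|---|---|
| Master | kube-apiserver | API gateway |
| Master | etcd | Cluster database |
| Master | scheduler | Assign pods to nodes |
| Master | controller-manager | Run control loops |
| Worker | kubelet | Run pods on node |
| Worker | kube-proxy | Network proxy |
| Worker | containerd | Run containers |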

Golden Rule:

Master = Think, Store, Schedule
Worker = Run, Report, Route

Now you understand how Kubernetes turns 100 machines into one logical supercomputer.
Try:

kubectl get componentstatuses   # deprecated since v1.19; output may be incomplete
kubectl get --raw='/readyz?verbose'
kubectl -n kube-system get pods

And see the control plane in action!

Kubernetes Resources

Pods, Deployments, Services, DaemonSets, Secrets, ConfigMaps, StatefulSets & More

Kubernetes (K8s) resources are the declarative building blocks of your cluster. You define desired state in YAML/JSON, and K8s makes it reality through controllers and reconciliation loops.

Key Principle:

Imperative (kubectl run) → quick, but not reproducible or version-controlled
Declarative (YAML) → persistent, version-controlled

1. Pod – The Atomic Unit

Definition

  • Smallest deployable unit in K8s.
  • Runs 1+ containers that share network, storage, and lifecycle.
  • Ephemeral – pods die; don't manage directly (use Deployment).

Key Features

  • Shared Resources: Containers in one pod communicate via localhost.
  • Lifecycle: Scheduled to nodes, runs until completion/failure.
  • Probes: Readiness/liveness to control traffic/health.

Kubernetes Probes

Kubernetes probes are health checks that determine pod behavior. They use HTTP, TCP, or command-based tests with configurable thresholds (e.g., initial delay, period, timeout, success/failure counts).

  1. Liveness Probe
    • Purpose: Detects whether the container is alive and healthy. If it fails, the kubelet restarts the container (self-healing).
    • When Used: For apps that can deadlock or hang while the process stays up (e.g., memory leaks that leave the app unresponsive).
    • Behavior: Failure → container restarts; doesn't affect traffic routing.
    • YAML Example:

      livenessProbe:
        httpGet:
          path: /healthz
          port: 8080
        initialDelaySeconds: 30
        periodSeconds: 10
        timeoutSeconds: 5
        failureThreshold: 3  # Restart after 3 failures

  2. Readiness Probe
    • Purpose: Detects if the pod is ready to serve traffic. If it fails, Kubernetes removes it from Service endpoints (no traffic sent) but doesn't restart it.
    • When Used: For apps that need warmup time or become temporarily unhealthy (e.g., during DB connection).
    • Behavior: Failure → pod excluded from load balancing; restarts only if liveness fails.
    • YAML Example:

      readinessProbe:
        tcpSocket:
          port: 8080
        initialDelaySeconds: 5
        periodSeconds: 10
        timeoutSeconds: 1
        successThreshold: 1
        failureThreshold: 3

Key Differences

| Aspect | Liveness Probe | Readiness Probe |
|---|---|---|
| Failure Action | Restart container | Exclude from traffic |
| Impact on Service | No (traffic continues to healthy pods) | Yes (removes from endpoints) |
| Default | None | None |
| Use Case | Crash recovery | Traffic routing |
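
How long can a hung container go unnoticed? A rough upper bound, assuming the app breaks right after a successful probe: failureThreshold more probes must fail, and the last one may run until its timeout. This is back-of-the-envelope arithmetic only (it ignores kubelet jitter and the termination grace period); the helper name is made up.

```python
def max_seconds_until_action(period_seconds: int, failure_threshold: int,
                             timeout_seconds: int) -> int:
    """Approximate worst-case delay between a failure and the probe reacting."""
    return failure_threshold * period_seconds + timeout_seconds


# Liveness settings used in these notes: period 10s, threshold 3, timeout 5s
print(max_seconds_until_action(10, 3, 5))
# Readiness settings used in these notes: period 10s, threshold 3, timeout 1s
print(max_seconds_until_action(10, 3, 1))
```

Lowering periodSeconds or failureThreshold reacts faster but risks restarting on brief hiccups.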

YAML Example

apiVersion: v1
kind: Pod
metadata:
  name: nginx-pod
  labels:
    app: web
spec:
  containers:
  - name: nginx
    image: nginx:1.25
    ports:
    - containerPort: 80
    resources:
      limits:
        cpu: "100m"
        memory: "128Mi"
    volumeMounts:  # Share the log directory with the sidecar
    - name: logs
      mountPath: /var/log/nginx
  - name: sidecar-logger
    image: fluentd:v1.14
    volumeMounts:
    - name: logs
      mountPath: /var/log/nginx
  volumes:
  - name: logs
    emptyDir: {}

Use Cases

  • Simple apps (single container).
  • Sidecar pattern (app + logger/monitor).

Pros/Cons

  • Pros: Fine-grained control.
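  • Cons: A bare pod is not recreated if it is deleted or its node fails (the kubelet only restarts its containers in place); manage pods through a Deployment.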

2. Deployment – The Workhorse for Stateless Apps

Definition

  • Manages ReplicaSets to ensure desired pod replicas.
  • Handles rolling updates, rollbacks, and scaling.
  • Stateless – assumes pods are interchangeable.

Key Features

  • Strategy: RollingUpdate (default) or Recreate.
  • Selectors: Matches pods via labels.
  • Revision History: Tracks changes for rollback.

YAML Example

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-deployment
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 25%        # Extra pods allowed during update
      maxUnavailable: 25%  # Pods allowed to be unavailable during update
  template:  # Pod template
    metadata:
      labels:
        app: web
    spec:
      containers:
      - name: nginx
        image: nginx:1.25
        ports:
        - containerPort: 80
        livenessProbe:
          httpGet:
            path: /
            port: 80
          initialDelaySeconds: 30
          periodSeconds: 10

Use Cases

  • Web servers, APIs, microservices.
  • Scaling: kubectl scale deployment web-deployment --replicas=5.

Pros/Cons

  • Pros: Zero-downtime updates, self-healing.
  • Cons: Not for stateful apps (use StatefulSet).

Rolling Updates

Rolling update is a zero-downtime deployment strategy in Kubernetes that gradually replaces old pods with new ones in a Deployment or StatefulSet. It ensures service availability by maintaining the desired number of replicas during updates (e.g., image version change), avoiding full outages.

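  • Why Used?: Minimizes disruption and limits the blast radius of a bad release: a failing rollout stalls instead of replacing every pod. Kubernetes does not roll back automatically; use kubectl rollout undo (canary and blue-green need extra tooling or multiple Deployments).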
  • When Applied?: Triggered by changes in Deployment spec (e.g., image: v1.0 → v1.1).

Strategies

Deployments support 2 strategies in .spec.strategy.type (StatefulSets use .spec.updateStrategy.type instead, with RollingUpdate or OnDelete):

| Strategy | Description | Use Case |
|---|---|---|
| RollingUpdate (default) | Gradually scales down old pods while scaling up new ones, maintaining availability. | Production apps needing zero-downtime. |
| Recreate | Kills all old pods first, then creates new ones. | Simple apps where downtime is acceptable (e.g., batch jobs). |

RollingUpdate Parameters

  • maxSurge: Max extra pods allowed above the desired count during an update (e.g., 25% or 1; percentages round up).
  • maxUnavailable: Max pods that can be unavailable during an update (e.g., 25% or 1; percentages round down).
  • Default: maxSurge: 25%, maxUnavailable: 25%.

Monitor: kubectl rollout status deployment/web
Rollback: kubectl rollout undo deployment/web
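
To see what the percentages mean for a concrete replica count, here is a small calculation. It assumes the commonly documented rounding (maxSurge rounds up, maxUnavailable rounds down); the helper names are made up, and this is arithmetic only, not the Deployment controller.

```python
def resolve(value, replicas: int, round_up: bool) -> int:
    """Turn "25%" or an absolute number into a pod count."""
    if isinstance(value, str) and value.endswith("%"):
        scaled = replicas * int(value[:-1])
        # Integer ceiling or floor of scaled / 100
        return -(-scaled // 100) if round_up else scaled // 100
    return int(value)


def rollout_bounds(replicas: int, max_surge="25%", max_unavailable="25%") -> dict:
    surge = resolve(max_surge, replicas, round_up=True)
    unavailable = resolve(max_unavailable, replicas, round_up=False)
    return {
        "max_total_pods": replicas + surge,
        "min_available_pods": replicas - unavailable,
    }


print(rollout_bounds(3))   # small Deployment with default settings
print(rollout_bounds(10))  # larger Deployment with default settings
```

With 3 replicas and the defaults, the rollout may add one extra pod but may not take any existing pod down before its replacement is ready.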

3. Service – Load Balancer & Service Discovery

Definition

  • Stable endpoint for pods (abstracts pod IPs).
  • Load balances traffic across matching pods.
  • DNS-based discovery (e.g., web.default.svc.cluster.local).
  • To reach a Deployment's pods, create a Service whose selector matches the pod labels. The internal URL is then {serviceName}.{namespace}.svc.cluster.local:{servicePort}.

Types

| Type | Description | Use Case |
|---|---|---|
| ClusterIP (default) | Internal IP (10.96.x.x) | Internal services |
| NodePort | Exposes on node port (30000-32767) | Basic external access |
| LoadBalancer | Cloud LB (AWS ELB) | Production external |
| ExternalName | CNAME to external service | Integrate with legacy |

YAML Example

apiVersion: v1
kind: Service
metadata:
  name: web-service
spec:
  selector:
    app: web  # Matches deployment labels
  ports:
    - protocol: TCP
      port: 80      # Service port
      targetPort: 80  # Pod port
  type: LoadBalancer

Use Cases

  • Expose Deployment: kubectl get svc → external IP.
  • Internal: Pods call web-service:80.

Pros/Cons

  • Pros: Automatic load balancing; only pods passing their readiness probe receive traffic.
  • Cons: ClusterIP not external-facing.

How Services Identify Pods/Deployments

Services discover and route traffic to Pods using label selectors in the Service spec. They don't directly reference Deployments but match Pods created by Deployments/StatefulSets via shared labels.

  • Mechanism:
  • Pods (from Deployment) get labels (e.g., app: web).
  • Service .spec.selector matches these labels.
  • Kubernetes watches Pods and updates Service endpoints (Pod IPs) dynamically.
  • YAML Example:

```yaml
# Deployment (creates Pods with labels)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web  # Pod label
    spec:
      containers:
      - name: nginx
        image: nginx
---
# Service (matches via selector)
apiVersion: v1
kind: Service
metadata:
  name: web-service
spec:
  selector:
    app: web  # Matches Pod labels
  ports:
  - port: 80
```

  • Discovery: Pods reach it at web-service:80 (DNS: web-service.default.svc.cluster.local).
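
The matching rule itself is simple: a pod is selected when its labels contain every key/value pair in the selector, and only ready pods are listed as endpoints. A sketch with hypothetical pod records (the dict layout and function names are assumptions for illustration, not the Kubernetes API):

```python
def matches(selector: dict, labels: dict) -> bool:
    """True when the labels contain every selector pair (extra labels are fine)."""
    return all(labels.get(key) == value for key, value in selector.items())


def endpoints(selector: dict, pods: list) -> list:
    """IPs of ready pods selected by the Service."""
    return [p["ip"] for p in pods if p["ready"] and matches(selector, p["labels"])]


pods = [
    {"ip": "10.244.1.5", "ready": True, "labels": {"app": "web", "tier": "frontend"}},
    {"ip": "10.244.2.7", "ready": False, "labels": {"app": "web"}},
    {"ip": "10.244.3.9", "ready": True, "labels": {"app": "api"}},
]
print(endpoints({"app": "web"}, pods))
print(endpoints({"app": "wbe"}, pods))  # typo in selector: no endpoints
```

A selector typo silently yields an empty endpoint list, which is the usual cause of a Service that resolves but never answers.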

containerPort vs targetPort

  • containerPort (Pod spec): The port the container listens on (documentation only; doesn't publish traffic).
  • targetPort (Service spec): The port on the Pod that receives Service traffic (maps to containerPort; defaults to Service port if omitted).

| Field | Location | Purpose | Example |
|---|---|---|---|
| containerPort | Pod template (Deployment) | Container's listening port (info only) | containerPort: 8080 (app binds to 8080) |
| targetPort | Service spec | Pod port for incoming traffic | targetPort: 8080 (Service sends to Pod:8080) |

  • YAML Example:

```yaml
# In Deployment Pod spec
containers:
- name: app
  ports:
  - containerPort: 8080  # App listens here

# In Service
spec:
  ports:
  - port: 80          # Service port (e.g., DNS:80)
    targetPort: 8080  # Routes to Pod:8080
```

  • Flow: Client → Service:80 → Pod:8080 (containerPort).

Other Critical Fields for Service Discovery & Health Checks

Ensure seamless discovery (stable endpoints) and health (traffic routing) with these fields:

| Field | Location | Purpose | Best Practice |
|---|---|---|---|
| selector | Service spec | Matches Pod labels for discovery | Use unique labels (e.g., app: web, tier: frontend). |
| labels | Pod template (Deployment) | Enables selector matching | Consistent across Deployment/Service (e.g., app: web). |
| readinessProbe | Pod template | Checks if Pod is ready for traffic; failure removes from endpoints | HTTP/TCP/exec probe; e.g., initialDelaySeconds: 30 for warmup. |
| livenessProbe | Pod template | Checks if container is alive; failure restarts the container (affects discovery indirectly) | Less frequent than readiness; e.g., periodSeconds: 60. |
| port | Service spec | Service's listening port (e.g., for DNS) | Match app needs; use name: http for multiple ports. |
| protocol | Service/Port spec | Traffic protocol (TCP/UDP/SCTP) | TCP default; UDP for streaming. |
| sessionAffinity | Service spec | Sticky sessions (client IP-based) | ClientIP for stateful apps; timeout configurable. |
| publishNotReadyAddresses | Service spec | Include unready Pods in endpoints | true for pre-warmup traffic (rare). |
| annotations | Service metadata | Metadata (e.g., for Ingress controllers) | e.g., nginx.ingress.kubernetes.io/rewrite-target: /. |

YAML Snippet (Full Example)

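apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
spec:
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp  # Must match the Service selector below
    spec:
      containers:
      - name: app
        image: myapp  # Placeholder image name
        ports:
        - containerPort: 8080
        readinessProbe:
          httpGet:
            path: /ready
            port: 8080
          initialDelaySeconds: 10
        livenessProbe:
          httpGet:
            path: /health
            port: 8080
          periodSeconds: 30
---
apiVersion: v1
kind: Service
metadata:
  name: myapp
spec:
  selector:
    app: myapp
  ports:
  - name: http
    protocol: TCP
    port: 80
    targetPort: 8080
  sessionAffinity: ClientIP
  publishNotReadyAddresses: false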

Summary: Selectors enable discovery; probes ensure health; tune ports/probes for reliability. Misconfigured selectors cause "no endpoints" errors.

4. DaemonSet – Run on Every Node

Definition

  • Ensures one pod per node (or selected nodes).
  • Node-specific – ideal for agents.

Key Features

  • Scheduling: Runs on all nodes, or a subset chosen with nodeSelector/affinity; tolerations let it run on tainted nodes too.
  • Rolling Updates: Similar to Deployment.

YAML Example

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluentd-logging
spec:
  selector:
    matchLabels:
      name: fluentd
  template:
    metadata:
      labels:
        name: fluentd
    spec:
      containers:
      - name: fluentd
        image: fluent/fluentd:v1.14
        volumeMounts:
        - name: varlog
          mountPath: /var/log
      volumes:
      - name: varlog
        hostPath:
          path: /var/log
      tolerations:  # Run on tainted nodes
      - operator: Exists

Use Cases

  • Logging (Fluentd), monitoring (Prometheus Node Exporter), storage (CSI drivers).

Pros/Cons

  • Pros: Automatic per-node deployment.
  • Cons: Scales with nodes; resource-heavy.

5. Secrets – Secure Data Management

Definition

  • Stores sensitive data (passwords, tokens, keys) as base64-encoded strings.
  • Mounts as volumes or env vars (stored unencrypted in etcd unless encryption at rest is enabled).

Key Features

  • Base64 Encoding: Not encryption (use external tools for strong secrets).
  • Access Control: RBAC for reading.

Security Aspects of Secrets

Kubernetes Secrets store sensitive data (e.g., passwords, API keys, tokens) as base64-encoded strings in etcd (the cluster's key-value store). Key security features:

  • Access Control: Protected by RBAC (Role-Based Access Control) policies; only authorized users and service accounts can read them through the API.
  • Encryption at Rest: Off by default. Enable it by passing an EncryptionConfiguration file to the API server (--encryption-provider-config), optionally backed by a KMS provider; external managers such as Vault are an alternative.
  • Transmission: Data is transmitted over TLS (via API server) and is not logged in plaintext.
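
Base64 is an encoding, not a lock. The snippet below uses only the Python standard library and the sample values from these notes; anyone who can read the Secret object can do the same.

```python
import base64

# Encoding: what you put under .data in a Secret manifest
encoded = base64.b64encode(b"SuperSecret123").decode()
print(encoded)

# Decoding: no key needed, so this offers no confidentiality at all
decoded = base64.b64decode("YXBwdXNlcg==").decode()
print(decoded)
```

This is why RBAC and encryption at rest matter: the encoding itself protects nothing.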

Mounts

Mounting injects data into Pods as environment variables (env) or volumes (volumeMounts). Volumes are preferred for files; env vars for simple values. Defined in Deployment's Pod template (.spec.template.spec).

  • As Environment Variables: Injects keys as vars (e.g., DB_PASSWORD).
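  • As Volumes: Mounts as files in a directory (e.g., /etc/secrets/); updates reach the files without a Pod restart.

NOTE: When mounting Secrets and ConfigMaps as volumes (volumeMounts in Pod spec), updates propagate automatically without Pod restart, usually after a short delay (kubelet sync period plus cache refresh). The mounted files (e.g., /etc/secrets/token) are symlinks into a kubelet-managed directory that is swapped atomically, so the new content appears in place. Exceptions: subPath mounts and values injected as env vars do not change until the Pod is recreated.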

YAML Example

apiVersion: v1
kind: Secret
metadata:
  name: db-secret
type: Opaque
data:
  username: YXBwdXNlcg==  # base64: "appuser"
  password: U3VwZXJTZWNyZXQxMjM=  # base64: "SuperSecret123"
---
apiVersion: v1
kind: Pod
metadata:
  name: secret-pod
spec:
  containers:
  - name: app
    image: myapp
    env:
    - name: DB_USER
      valueFrom:
        secretKeyRef:
          name: db-secret
          key: username
    volumeMounts:
    - name: secret-volume
      mountPath: /etc/secrets
  volumes:
  - name: secret-volume
    secret:
      secretName: db-secret

Use Cases

  • DB credentials, API keys, TLS certs.

Pros/Cons

  • Pros: Avoids hardcoding secrets.
  • Cons: Base64 is reversible; use Vault for advanced.

6. ConfigMap – Non-Sensitive Configuration

Definition

  • Stores config data (env vars, files) as key-value pairs.
  • Mounts dynamically without rebuilding images.

Key Features

  • Data Sources: Key-value, files, or literals.
  • Updates: Volume-mounted ConfigMaps refresh without a pod restart, but the app must re-read the files; values injected as env vars need a restart.

Mounts

  • Same as secrets

YAML Example

apiVersion: v1
kind: ConfigMap
metadata:
  name: app-config
data:
  database_url: "postgres://localhost:5432/myapp"
  log_level: "INFO"
  app_name: "MyApp v1.0"
---
apiVersion: v1
kind: Pod
metadata:
  name: config-pod
spec:
  containers:
  - name: app
    image: myapp
    env:
    - name: DB_URL
      valueFrom:
        configMapKeyRef:
          name: app-config
          key: database_url
    volumeMounts:
    - name: config-volume
      mountPath: /etc/config
  volumes:
  - name: config-volume
    configMap:
      name: app-config

Use Cases

  • App configs, feature flags, env-specific settings.

Pros/Cons

  • Pros: Decouples config from code.
  • Cons: Not encrypted (use Secrets for sensitive).

7. StatefulSet – For Stateful Apps

Definition

  • Manages stateful workloads (e.g., databases) with stable identities.
  • Ordered deployment/scaling, persistent storage.

Stateful Workload

A stateful workload is an application or service that maintains persistent state (data, configuration, or identity) across restarts, updates, or failures. It requires stable, ordered, and persistent storage to function correctly, unlike stateless workloads where instances are interchangeable.

Key Characteristics

  • Persistent Data: Relies on durable storage (e.g., databases with user records).
  • Stable Identity: Needs consistent naming/ordering (e.g., db-0, db-1).
  • Ordered Operations: Scaling/updates must follow sequence (e.g., primary replica before secondary).

Examples

  • Stateful: Databases (MySQL, MongoDB), message queues (Kafka), clustered apps (ZooKeeper).
  • Stateless: Web servers (Nginx), APIs (FastAPI), simple microservices (no local data).

Why It Matters in Kubernetes

  • Deployment: Use StatefulSet for stable Pods, headless Services, and PersistentVolumes (PVs).
  • Challenges: Scaling requires coordination; failures need data migration.
  • vs. Stateless: Deployments handle stateless apps easily (interchangeable replicas).

Summary: Stateful = "remembers who it is and what it knows" (e.g., your bank account balance). Use for data-heavy apps; stateless = "doesn't care" (e.g., a calculator).

Key Features

  • Stable Names: Pods named db-0, db-1 (not random).
  • Headless Service: Direct pod access via DNS.
    • A Headless Service is a Kubernetes Service with clusterIP: None.
    • It does NOT get a single virtual IP; instead, DNS returns an A record for each Pod.
  • Persistent Volumes: Binds storage to pod identity.
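
The stable names follow a predictable pattern, sketched below: Pod DNS is `<statefulset>-<ordinal>.<headless-service>.<namespace>.svc.<cluster-domain>` and PVCs are `<claim-template>-<statefulset>-<ordinal>`. The helper functions are made up for illustration, and `cluster.local` is assumed as the cluster domain.

```python
def pod_dns_names(statefulset: str, service: str, namespace: str,
                  replicas: int, domain: str = "cluster.local") -> list:
    """Per-Pod DNS names published through the headless Service."""
    return [
        f"{statefulset}-{i}.{service}.{namespace}.svc.{domain}"
        for i in range(replicas)
    ]


def pvc_names(claim_template: str, statefulset: str, replicas: int) -> list:
    """PVC names generated from volumeClaimTemplates."""
    return [f"{claim_template}-{statefulset}-{i}" for i in range(replicas)]


for name in pod_dns_names("mysql", "mysql-headless", "default", 3):
    print(name)
print(pvc_names("mysql-storage", "mysql", 3))
```

Because the ordinal is part of both names, a recreated mysql-1 finds the same volume it had before.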

YAML Example

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: mysql
spec:
  serviceName: "mysql-headless"
  replicas: 3
  selector:
    matchLabels:
      app: mysql
  template:
    metadata:
      labels:
        app: mysql
    spec:
      containers:
      - name: mysql
        image: mysql:8.0
        env:
        - name: MYSQL_ROOT_PASSWORD
          valueFrom:
            secretKeyRef:
              name: db-secret  # Secret from section 5; avoids a hardcoded password
              key: password
        volumeMounts:
        - name: mysql-storage
          mountPath: /var/lib/mysql
  volumeClaimTemplates:
  - metadata:
      name: mysql-storage
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 10Gi
---
apiVersion: v1
kind: Service
metadata:
  name: mysql-headless
spec:
  clusterIP: None  # Headless
  selector:
    app: mysql
  ports:
  - port: 3306

Use Cases

  • Databases (MySQL, MongoDB), message queues (Kafka), clustered apps.

Pros/Cons

  • Pros: Ordered scaling, stable storage.
  • Cons: Slower scaling than Deployment.

8. Other Key Resources

ReplicaSet

  • Ensures exact replica count (used by Deployment).
  • YAML: Similar to Deployment but no strategy.

Job & CronJob

  • Job: Run to completion (e.g., batch processing).
  • CronJob: Scheduled jobs (e.g., daily backups).
  • Example:

    apiVersion: batch/v1
    kind: Job
    metadata:
      name: backup-job
    spec:
      template:
        spec:
          containers:
          - name: backup
            image: backup-tool
          restartPolicy: Never

Resource Relationships

User (YAML) → API Server → etcd
                    ↓
            Controller Loop
                    ↓
Deployment → ReplicaSet → Pod → Container
                    ↓
                  Service → Load Balance

Summary Table

| Resource | Use Case | Key Feature |
|---|---|---|
| Pod | Basic unit | 1+ containers |
| Deployment | Stateless apps | Rolling updates |
| Service | Exposure | Load balancing |
| DaemonSet | Node agents | Per-node pods |
| Secrets | Sensitive data | Base64-encoded env/files (encryption at rest is opt-in) |
| ConfigMap | Config | Dynamic injection |
| StatefulSet | Databases | Ordered, stable |

Golden Rule:

Declarative YAML + Controllers = Self-healing cluster
Define desired state → K8s makes it real.

Now deploy a Deployment + Service and watch K8s orchestrate!

Kubernetes Autoscalers: HPA vs VPA

Kubernetes autoscalers dynamically adjust resources based on workload demands. HPA scales horizontally (more/fewer pods), while VPA scales vertically (CPU/memory requests per pod). Neither attaches to Services (Services only route traffic to existing pods); both target workload controllers such as Deployments, StatefulSets, or ReplicaSets, and VPA then adjusts the Pods those controllers create.

Horizontal Pod Autoscaler (HPA)

HPA automatically scales the number of pods in a target resource (e.g., Deployment) based on observed metrics like CPU utilization, memory, or custom metrics (via Metrics Server or Prometheus Adapter).

  • Monitors metrics (if no metrics are specified, the default is 80% average CPU utilization).
  • Scales up/down to maintain target: desiredReplicas = ceil(currentReplicas × currentMetricValue / targetMetricValue).
  • Min/max replicas configurable.

Attachment

  • Targets: Deployment, StatefulSet, ReplicaSet.
  • YAML: Reference via .spec.scaleTargetRef (e.g., kind: Deployment, name: web).

YAML Example

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web  # Attaches to Deployment "web"
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50  # Scale at 50% CPU
  - type: Resource
    resource:
      name: memory
      target:
        type: AverageValue
        averageValue: 500Mi  # Scale at 500Mi memory

Commands

  • Apply: kubectl apply -f hpa.yaml.
  • Monitor: kubectl get hpa, kubectl describe hpa web-hpa.
  • Delete: kubectl delete hpa web-hpa.

When to Use

  • High-traffic apps (e.g., web servers) with variable load.
  • Cost optimization: Scale down during low traffic.
  • Not for: Stateful apps (use StatefulSet HPA cautiously) or fixed-size workloads.

Important Details

  • Requires Metrics Server (kubectl top pods works).
  • Stabilization: Scale-down waits for a 5min stabilization window by default; scale-up has none. Tune both via .spec.behavior.
  • Pros: Simple, reactive scaling.
  • Cons: Doesn't predict spikes; may overprovision.

How HPA Works with StatefulSets

  1. Scaling Up:
    • HPA increases replicas (e.g., from 3 to 5).
    • StatefulSet controller creates new Pods in order (e.g., web-3, then web-4).
    • Each new Pod gets a stable hostname (web-3.<headless-service-name>.<namespace>.svc.cluster.local) and attaches to its corresponding PersistentVolumeClaim (PVC) (e.g., data-web-3).
    • Pods join the cluster (e.g., as replicas in a database like etcd).
  2. Scaling Down:
    • HPA decreases replicas (e.g., from 5 to 3).
    • StatefulSet controller deletes highest-indexed Pods first (e.g., web-4, then web-3).
    • Deleted Pods are terminated gracefully (with termination grace period, default 30s).
  3. Data Persistence on Downscale:
    • Yes, data persists: StatefulSets bind PVCs to Pod identities (ordinal index).
    • When scaling down, only the Pod is deleted by default; the PVC (and its bound PersistentVolume/PV) remains.
    • Example: Scaling from 3 to 2 deletes web-2; data-web-2 PVC persists.
    • Re-attach on Scale-Up: If scaled back to 3, web-2 is re-created and re-mounts the data-web-2 PVC, preserving data.
    • No Data Loss: Unlike Deployments (ephemeral storage), StatefulSets ensure ordered persistence.
  4. Metrics & Triggers:
    • Same as Deployments: Monitors CPU/memory/custom metrics.
    • HPA calculates: desiredReplicas = ceil[currentReplicas × (currentMetricValue / desiredMetricValue)].
    • Stabilization: Scale-down waits for a 5min window by default; scale-up has none.
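
The replica formula is easy to try by hand. The sketch below adds two things the controller is documented to do: skip scaling when the ratio is within a tolerance (0.1 is the default of the controller-manager flag) and clamp to min/max replicas. It is a simplified illustration that ignores unready pods and missing metrics; the function name is made up.

```python
import math


def desired_replicas(current: int, metric: float, target: float,
                     min_replicas: int, max_replicas: int,
                     tolerance: float = 0.1) -> int:
    ratio = metric / target
    if abs(ratio - 1.0) <= tolerance:
        return current  # close enough to target: no scaling
    wanted = math.ceil(current * ratio)
    return max(min_replicas, min(max_replicas, wanted))


print(desired_replicas(3, metric=90, target=50, min_replicas=2, max_replicas=10))
print(desired_replicas(4, metric=52, target=50, min_replicas=2, max_replicas=10))
print(desired_replicas(8, metric=100, target=50, min_replicas=2, max_replicas=10))
```

The first call scales 3 pods at 90% CPU against a 50% target up to 6; the second stays put; the third wants 16 but is capped at maxReplicas.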

Nuances & Considerations for StatefulSets

  • Ordered Scaling: Unlike Deployments (no ordering guarantee on Pod deletion), StatefulSets scale down from the highest ordinal first. For updates, .spec.updateStrategy.rollingUpdate.partition gives canary-like control.
  • Headless Service: Required for StatefulSet discovery (DNS: web-2.web-headless.default.svc.cluster.local); HPA doesn't affect it.
  • Storage Coordination: Ensure PVs are zone-aware (topology keys) for multi-zone clusters to avoid data locality issues.
  • Metrics Challenges: Stateful Pods may have uneven load (e.g., primary replica); use custom metrics (e.g., via Prometheus Adapter) for accurate scaling.
  • Downtime Risk: Downscaling may disrupt state (e.g., lose quorum in 3-node etcd); keep minReplicas at or above quorum. PodDisruptionBudgets (PDBs) only limit voluntary evictions such as node drains, not HPA scale-down.
  • Not for All: HPA works but test thoroughly; for databases, prefer Vertical Scaling (VPA) or manual control.
  • Limits: Max replicas capped by cluster capacity; HPA ignores PVC provisioning.

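Best Practice: Combine with PDBs (kubectl create pdb web-pdb --selector=app=web --min-available=2) so node drains cannot evict too many Pods at once, and limit scale-down speed with the HPA's .spec.behavior.scaleDown policies.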

Summary: HPA scales StatefulSets like Deployments but preserves data via PVCs and ordered identities. Use for elastic stateful apps (e.g., Kafka replicas); monitor for state consistency.

Vertical Pod Autoscaler (VPA)

VPA automatically adjusts Pod resource requests/limits (CPU/memory) based on historical usage, recommending or enforcing changes. It performs vertical scaling (resizing existing pods).

  • Analyzes metrics (via Metrics Server/Prometheus).
  • Recommends (mode: Off) or updates (mode: Auto) resources.
  • Evicts/recreates pods for changes (downtime risk).

Attachment

  • Targets: Deployments, StatefulSets, and other workload controllers (VPA adjusts the Pods they create).
  • YAML: No direct "attachment"; VPA watches via .spec.targetRef (e.g., Deployment name).

YAML Example

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: web-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web  # Targets Deployment "web"
  updatePolicy:
    updateMode: "Auto"  # "Off" for recommendations only
  resourcePolicy:
    containerPolicies:
    - containerName: "*"
      minAllowed:
        cpu: 100m
        memory: 50Mi
      maxAllowed:
        cpu: 1
        memory: 500Mi

Commands

  • Apply: kubectl apply -f vpa.yaml.
  • View Recommendations: kubectl get vpa web-vpa -o yaml (under .status.recommendation).
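  • Apply an Update: In Auto mode the VPA updater evicts Pods itself; kubectl has no evict subcommand, so to force a refresh manually use kubectl delete pod <pod-name>.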

When to Use

  • Resource-inefficient apps (e.g., over/under-provisioned pods).
  • Cost savings: Right-size based on actual usage.
  • Not for: Apps with bursty loads (use HPA) or strict limits (manual tuning better).

Important Details

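  • Not built in: install the VPA components (recommender, updater, admission controller) plus Metrics Server.
  • Modes: Off (recommendations), Initial (set on create), Recreate (evict to apply), Auto (enforce, currently via eviction).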
  • Pros: Optimizes resources; learns from usage.
  • Cons: Causes restarts; not for all apps (e.g., databases).

Key Differences & Best Practices

| Aspect | HPA | VPA |
|---|---|---|
| Scaling Type | Horizontal (# pods) | Vertical (CPU/memory) |
| Target | Deployment/StatefulSet | Deployment/StatefulSet (applied to their Pods) |
| Downtime | Minimal (rolling) | Potential (eviction/recreate) |
| Metrics | CPU/memory/custom | Historical usage |

  • Combine: Use HPA for traffic spikes, VPA for baseline optimization; don't drive both from the same CPU/memory metric on one workload, or they will work against each other.
  • Monitor: kubectl top nodes and kubectl top pods for metrics.
  • When: HPA for dynamic load; VPA for steady workloads.
  • Caution: VPA in Auto mode can disrupt; start with Off.

Kubernetes Pod Scheduling

Pod scheduling in Kubernetes involves the Scheduler deciding which node runs a Pod based on resource availability, constraints, and preferences. Key mechanisms ensure Pods land on suitable nodes while avoiding unsuitable ones. Below is a concise explanation of the core concepts.

1. Node Taints

  • Definition: Taints are repellent labels applied to nodes (via kubectl taint nodes) that prevent Pods from scheduling unless they tolerate the taint. They act as "do not disturb" signals.
  • Purpose: Reserve nodes for specific workloads (e.g., dedicated DB nodes) or mark unhealthy nodes.
  • Types:
  • NoSchedule: Prevents new Pods from scheduling.
  • PreferNoSchedule: Soft repellent (scheduler prefers avoidance).
  • NoExecute: Evicts existing Pods + prevents new ones.
  • Command (Apply Taint): kubectl taint nodes worker-1 key=value:NoSchedule
  • Effect: Untolerated Pods are rejected; e.g., taint dedicated=db:NoSchedule reserves for DB Pods only.

2. Tolerations

  • Definition: Tolerations are Pod-level settings (in .spec.tolerations) that allow Pods to ignore specific taints and schedule on tainted nodes.
  • Purpose: Enables Pods to run on reserved/tainted nodes (e.g., high-CPU nodes).
  • Matching: Toleration must match taint's key, value, and effect (operator: Exists for any value, Equal for exact).
  • YAML Example (in Pod/Deployment spec):
    spec:
      tolerations:
      - key: "key"
        operator: "Equal"
        value: "value"
        effect: "NoSchedule"
      - key: "dedicated"
        operator: "Exists"
        effect: "NoExecute"  # Tolerates any value
  • Nuance: Tolerations don't prefer tainted nodes; they just allow scheduling.
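
The matching rule between a toleration and a taint can be sketched in a few lines. This is a simplified model using plain dicts, not the scheduler's actual code; the function names are hypothetical:

```python
def tolerates(toleration: dict, taint: dict) -> bool:
    """Return True if one toleration matches one taint."""
    # An empty or missing effect on the toleration matches every effect.
    effect = toleration.get("effect")
    if effect and effect != taint["effect"]:
        return False
    operator = toleration.get("operator", "Equal")
    key = toleration.get("key", "")
    if operator == "Exists":
        # Exists ignores the value; an empty key tolerates every taint.
        return key == "" or key == taint["key"]
    # Equal: both key and value must match.
    return key == taint["key"] and toleration.get("value", "") == taint.get("value", "")


def can_schedule(tolerations: list, taints: list) -> bool:
    """A Pod passes only if every hard taint on the node is tolerated."""
    hard = [t for t in taints if t["effect"] in ("NoSchedule", "NoExecute")]
    return all(any(tolerates(tol, taint) for tol in tolerations) for taint in hard)
```

PreferNoSchedule taints are skipped here because they only influence scoring, not filtering.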

3. Pod Affinity

  • Definition: Pod affinity rules in the Pod spec (.spec.affinity.podAffinity) prefer or require that a Pod schedules into a topology domain (node, zone) that already runs Pods matching a label selector.
  • Purpose: Co-locate Pods for performance (e.g., app near its DB).
  • Types:
  • RequiredDuringSchedulingIgnoredDuringExecution: Hard requirement (fail if no match).
  • PreferredDuringSchedulingIgnoredDuringExecution: Soft preference (score-based).
  • Pod Affinity: Co-locate with other Pods (e.g., topologyKey: kubernetes.io/hostname for same node).
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: app
            operator: In
            values: ["cache"]
        topologyKey: kubernetes.io/hostname

4. Pod Anti-Affinity

  • Definition: Opposite of affinity; avoids scheduling Pods on nodes with matching conditions.
  • Purpose: Spread Pods for high availability (e.g., replicas on different nodes/zones).
  • Types: Required (hard) or Preferred (soft).
  • YAML Example:
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100  # Higher = stronger preference
            podAffinityTerm:
              labelSelector:
                matchLabels:
                  app: web
              topologyKey: kubernetes.io/hostname  # Avoid same node
  • Nuance: Use topologyKey: topology.kubernetes.io/zone for zone spreading (the older failure-domain.beta.kubernetes.io/zone label is deprecated).

5. Pod Disruption Budget (PDB)

  • Definition: PDBs (via kubectl create pdb) limit voluntary disruptions (e.g., node drains, cluster autoscaler scale-down) to ensure minimum available Pods.
  • Purpose: Prevents too many Pods from going down simultaneously (e.g., during upgrades).
  • Fields:
  • minAvailable: Min Pods that must be available (e.g., 2 or 50%).
  • maxUnavailable: Max Pods that can be unavailable (e.g., 1 or 25%).
  • YAML Example:
    apiVersion: policy/v1
    kind: PodDisruptionBudget
    metadata:
      name: web-pdb
    spec:
      minAvailable: 2  # At least 2 Pods up
      selector:
        matchLabels:
          app: web
  • Nuance: Selects Pods by label, so it covers Pods from Deployments/StatefulSets; it cannot prevent involuntary disruptions (e.g., node failure), although those still count against the budget.
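
How minAvailable and maxUnavailable translate into "how many Pods may be evicted right now" can be sketched as below. This is a simplified model with hypothetical function names; it assumes percentages are rounded up, which is how the Kubernetes documentation describes the behavior:

```python
def resolve(value, total: int) -> int:
    """Turn an int or a percentage string such as '50%' into a Pod count (rounded up)."""
    if isinstance(value, str) and value.endswith("%"):
        percent = int(value[:-1])
        return -(-total * percent // 100)  # integer ceiling division
    return int(value)


def disruptions_allowed(expected: int, healthy: int, min_available=None, max_unavailable=None) -> int:
    """Number of voluntary evictions the budget currently permits."""
    if (min_available is None) == (max_unavailable is None):
        raise ValueError("set exactly one of min_available or max_unavailable")
    if min_available is not None:
        desired_healthy = resolve(min_available, expected)
    else:
        desired_healthy = expected - resolve(max_unavailable, expected)
    return max(0, healthy - desired_healthy)
```

For example, with 3 expected and 3 healthy Pods, minAvailable: 2 leaves room for one eviction; if one Pod is already unhealthy, a drain has to wait.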

6. Node Selectors

  • Definition: Simple, declarative way to constrain Pod scheduling to nodes matching specific labels (key-value pairs on nodes). It's a hard filter—Pods only schedule on matching nodes.
  • Purpose: Basic node affinity without complex expressions (e.g., target high-CPU nodes).
  • How It Works: Defined in Pod spec (.spec.nodeSelector); Scheduler filters nodes where all key-value pairs match.
  • YAML Example:
    apiVersion: v1
    kind: Pod
    metadata:
      name: high-cpu-pod
    spec:
      nodeSelector:
        cpu-type: high-performance  # Matches nodes labeled 'cpu-type=high-performance'
      containers:
      - name: app
        image: myapp:1.0
  • Apply Label to Node: kubectl label nodes worker-1 cpu-type=high-performance.
  • Nuances:
  • Does not bypass taints; a Pod still needs matching tolerations to land on a tainted node.
  • Simple but limited (no OR logic; use nodeAffinity for advanced).
  • When to Use: Simple zoning (e.g., dev/prod nodes); not for dynamic rules.

7. Topology Spread Constraints

  • Definition: Ensures Pods are evenly distributed across topology domains (e.g., zones, nodes, regions) to improve availability and resource utilization.
  • Purpose: Prevents all Pods from landing on one node/zone (e.g., for HA).
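  • How It Works: maxSkew sets the largest allowed difference in matching Pod count between domains identified by topologyKey (e.g., topology.kubernetes.io/zone). whenUnsatisfiable decides what happens when that limit would be exceeded: DoNotSchedule filters the node out, ScheduleAnyway only lowers its score.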
  • YAML Example:
    apiVersion: apps/v1
    kind: Deployment
    spec:
      template:
        spec:
          topologySpreadConstraints:
          - maxSkew: 1  # Max 1 Pod difference per zone
            topologyKey: topology.kubernetes.io/zone
            whenUnsatisfiable: DoNotSchedule  # Hard constraint
            labelSelector:
              matchLabels:
                app: web
  • Nuances:
  • Applies to Pods matching the selector.
  • Combines with affinity (e.g., spread replicas across AZs).
  • When to Use: Multi-zone clusters for fault tolerance; avoids single points of failure.
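
The maxSkew check can be illustrated with a small calculation: the skew of a candidate domain is its matching Pod count after placement minus the current minimum across all domains. A sketch under that assumption, with hypothetical function names and every eligible domain listed in the input (including empty ones):

```python
def skew_if_placed(pods_per_domain: dict, candidate: str) -> int:
    """Skew that would result from placing one more matching Pod in candidate."""
    global_minimum = min(pods_per_domain.values())
    return pods_per_domain[candidate] + 1 - global_minimum


def allowed_domains(pods_per_domain: dict, max_skew: int) -> list:
    """Domains that stay within max_skew (the DoNotSchedule case)."""
    return [
        domain
        for domain in sorted(pods_per_domain)
        if skew_if_placed(pods_per_domain, domain) <= max_skew
    ]
```

With two Pods in zone a and one in zone b, maxSkew: 1 leaves only zone b for the next replica.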

8. Priority and Preemption

  • Definition: Assigns priority levels to Pods via PriorityClasses, enabling preemption (eviction of lower-priority Pods when resources are scarce).
  • Purpose: Ensures critical workloads (e.g., system Pods) run first by evicting non-critical ones.
  • How It Works:
  • PriorityClass: Global resource defining priority (e.g., 1000 for high, -1 for low).
  • Preemption: Scheduler evicts lower-priority Pods if a higher one can't schedule.
  • YAML: Reference in Pod spec (.spec.priorityClassName).
  • YAML Example:
    # PriorityClass (cluster-wide)
    apiVersion: scheduling.k8s.io/v1
    kind: PriorityClass
    metadata:
      name: high-priority
    value: 1000  # Higher = more important
    globalDefault: false
    description: "Critical workloads"
    ---
    # Pod using it
    apiVersion: v1
    kind: Pod
    spec:
      priorityClassName: high-priority
      containers:
      - name: critical-app
        image: critical:1.0
  • Nuances:
  • Preemption tries to respect PDBs, but only on a best-effort basis; it may violate them if no other victims are available.
  • System Pods (e.g., kube-system) have high defaults.
  • When to Use: Resource-constrained clusters; prioritize monitoring over dev workloads.

9. Scheduler Plugins

  • Definition: Extensible components in the kube-scheduler that perform filtering (eliminate unfit nodes) and scoring (rank remaining nodes).
  • Purpose: Customizes scheduling logic (e.g., for GPU affinity or cost optimization).
  • How It Works:
  • Filter Plugins: Hard checks (e.g., NodeAffinity, TaintToleration).
  • Score Plugins: Weighted scoring (e.g., ImageLocality for faster pulls).
  • Configured via a KubeSchedulerConfiguration file passed to kube-scheduler with --config.
  • YAML Example (Custom Config Snippet):
    apiVersion: kubescheduler.config.k8s.io/v1
    kind: KubeSchedulerConfiguration
    profiles:
    - schedulerName: my-scheduler
      plugins:
        filter:
          enabled:
          - name: NodeAffinity
          - name: TaintToleration
        score:
          enabled:
          - name: ImageLocality  # Prefer nodes with cached images
            weight: 10
  • Nuances:
  • Default scheduler has ~20 plugins; extend via custom scheduler (e.g., Volcano for batch).
  • Order matters (early filters prune faster).
  • When to Use: Advanced needs (e.g., gang scheduling for ML jobs); default suffices for most.

10. Node Affinity

Node Affinity is a scheduling constraint that allows Pods to prefer or require specific nodes based on node labels (key-value pairs). It's part of the broader Affinity mechanism (.spec.affinity.nodeAffinity in Pod spec) and extends simple Node Selectors with more flexible expressions (e.g., OR logic, operators).

Definition & Purpose

  • Hard Requirement: Ensures Pods only schedule on matching nodes (e.g., nodes with SSDs).
  • Soft Preference: Scores nodes for better placement (e.g., prefer low-latency zones).
  • Use Case: Resource optimization (e.g., GPU nodes for ML), zoning (dev/prod separation), or performance (local storage nodes).

Types

Type Description Enforcement
RequiredDuringSchedulingIgnoredDuringExecution Hard rule: Fail if no match. Must satisfy for scheduling.
PreferredDuringSchedulingIgnoredDuringExecution Soft rule: Each matching term adds its weight (1-100) to the node's score; schedule anywhere if no match. Weighted preference.

Key Fields

  • nodeSelectorTerms: Array of terms (OR logic across terms; AND within expressions).
  • matchExpressions: Operators like In, NotIn, Exists, DoesNotExist, Gt, Lt.
  • matchFields: Matches node fields rather than labels (e.g., metadata.name); less common.
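
The OR-across-terms, AND-within-a-term logic can be sketched as follows. This is a simplified model of matchExpressions against a node's label dict; the function names are hypothetical and matchFields is left out:

```python
def _as_int(text):
    """Parse an integer label value; return None if it is missing or not numeric."""
    try:
        return int(text)
    except (TypeError, ValueError):
        return None


def expression_matches(expr: dict, labels: dict) -> bool:
    key, op, values = expr["key"], expr["operator"], expr.get("values", [])
    if op == "In":
        return key in labels and labels[key] in values
    if op == "NotIn":
        return labels.get(key) not in values
    if op == "Exists":
        return key in labels
    if op == "DoesNotExist":
        return key not in labels
    if op in ("Gt", "Lt"):
        have = _as_int(labels.get(key))
        want = _as_int(values[0]) if values else None
        if have is None or want is None:
            return False
        return have > want if op == "Gt" else have < want
    raise ValueError(f"unknown operator: {op}")


def term_matches(term: dict, labels: dict) -> bool:
    """All expressions in one term must match (AND); an empty term matches nothing."""
    expressions = term.get("matchExpressions", [])
    return bool(expressions) and all(expression_matches(e, labels) for e in expressions)


def node_matches(node_selector_terms: list, labels: dict) -> bool:
    """Any single term is enough (OR)."""
    return any(term_matches(term, labels) for term in node_selector_terms)
```

Listing two terms therefore means "either is fine", while two expressions inside one term means "both are required".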

YAML Example

apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:  # Hard: Must have GPU
        nodeSelectorTerms:
        - matchExpressions:
          - key: gpu-type
            operator: In
            values: ["nvidia-a100"]
      preferredDuringSchedulingIgnoredDuringExecution:  # Soft: Prefer zone
      - weight: 80  # Higher = stronger preference
        preference:
          matchExpressions:
          - key: topology.kubernetes.io/zone
            operator: In
            values: ["us-west-2a"]
  containers:
  - name: ml-app
    image: ml-app:1.0

Nuances

  • Labels: Apply to nodes via kubectl label nodes worker-1 gpu-type=nvidia-a100.
  • vs Node Selectors: Affinity is more expressive (multiple terms, operators); Selectors are simple equality.
  • Dynamic: Labels can change post-scheduling (no re-evaluation).
  • Performance: Soft rules add scoring overhead; use sparingly.

When to Use

  • Hard: Critical hardware (e.g., GPUs).
  • Soft: Optimization (e.g., zone preference for latency).
  • Avoid: Overly restrictive rules causing scheduling failures.

Summary: Node Affinity refines node selection with flexible matching—hard for requirements, soft for preferences. Tune with labels for targeted scheduling.

Overall Scheduling Flow

  1. Filtering: Apply selectors, taints/tolerations, resources, affinity (hard rules).
  2. Scoring: Rank survivors (affinity weights, spread, plugins).
  3. Binding: Assign Pod to best node.
  4. Preemption: If no node fits, evict lower-priority Pods (PDBs are respected on a best-effort basis).
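
The filter, score, bind sequence can be sketched as a toy scheduler. The node and Pod fields (free_cpu_m, cpu_m, node_selector) and the scoring heuristic are assumptions made for this illustration, not the real scheduler's data model:

```python
def fits(pod: dict, node: dict) -> bool:
    """Filtering: hard rules only (CPU request and nodeSelector labels)."""
    enough_cpu = node["free_cpu_m"] >= pod["cpu_m"]
    labels_ok = all(
        node["labels"].get(key) == value
        for key, value in pod.get("node_selector", {}).items()
    )
    return enough_cpu and labels_ok


def score(pod: dict, node: dict) -> int:
    """Scoring: hypothetical heuristic favoring the node left with the most free CPU."""
    return node["free_cpu_m"] - pod["cpu_m"]


def schedule(pod: dict, nodes: list):
    """Return the chosen node name, or None if no node passes filtering."""
    feasible = [node for node in nodes if fits(pod, node)]
    if not feasible:
        return None  # the real scheduler would consider preemption here
    best = max(feasible, key=lambda node: score(pod, node))
    best["free_cpu_m"] -= pod["cpu_m"]  # binding reserves the resources
    return best["name"]
```

Filtering is strict (a node either qualifies or not), while scoring only ranks the survivors, which is why soft preferences never block scheduling.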

Summary: Taints repel, tolerations allow, affinity attracts/repels, PDB protects availability. Tune for HA, performance, and cost. Selectors filter basically, topology spreads evenly, priority preempts, plugins customize. Use for balanced, resilient clusters.

Kubernetes Storage

Kubernetes storage lets Pods keep data across container restarts and rescheduling to other nodes. A container's own filesystem is ephemeral; Kubernetes adds ephemeral volumes (tied to the Pod) and persistent storage (durable beyond the Pod).

1. Volumes

  • Definition: A directory accessible to Pods, providing storage inside containers. Volumes outlive container lifecycle but tie to Pod lifecycle (deleted when Pod dies).
  • Purpose: Share data between containers in a Pod or persist temporary data.
  • Types (Ephemeral):
Type Description Use Case
emptyDir Temporary, node-local (deleted on Pod eviction). Scratch space, logs.
hostPath Mounts host directory (e.g., /var/log). Access host files (insecure).
configMap/Secret Mounts ConfigMap/Secret as files. Config injection.
  • YAML Example (in Pod spec):
    spec:
      volumes:
      - name: temp-storage
        emptyDir: {}
      - name: host-logs
        hostPath:
          path: /var/log
          type: DirectoryOrCreate
  • Nuances: Ephemeral; for persistence, use PV/PVC.

2. VolumeMounts

  • Definition: Specifies how a Volume is mounted into a container (path and read-only flag).
  • Purpose: Injects storage into specific containers within a Pod.
  • YAML Example:
    spec:
      containers:
      - name: app
        volumeMounts:
        - name: temp-storage  # References volume
          mountPath: /app/tmp  # Inside container
          readOnly: false
  • Nuances: Multiple mounts per volume; subPath for selective files (e.g., subPath: config.yaml).

3. PersistentVolume (PV)

  • Definition: A cluster-wide storage resource representing physical storage (e.g., AWS EBS volume, NFS share). It's a "piece of storage in the cluster."
  • Purpose: Abstracts backend storage; provisioned manually or dynamically.
  • Key Fields:
  • Capacity: Size (e.g., storage: 10Gi).
  • AccessModes: How it's mounted (e.g., ReadWriteOnce (RWO): single node; ReadWriteMany (RWX): multi-node; ReadOnlyMany (ROX)).
  • Reclaim Policy: What happens on PVC deletion (Retain: keep PV; Delete: destroy; Recycle: scrub).
  • YAML Example (Static PV):
    apiVersion: v1
    kind: PersistentVolume
    metadata:
      name: my-pv
    spec:
      capacity:
        storage: 10Gi
      accessModes:
      - ReadWriteOnce
      persistentVolumeReclaimPolicy: Retain
      storageClassName: standard
      hostPath:
        path: /data
  • Nuances: Static (manual) vs dynamic (StorageClass provisions); bound to one PVC at a time.

4. PersistentVolumeClaim (PVC)

  • Definition: A Pod's request for storage, like a "storage ticket." It binds to a matching PV and is used in Pod specs.
  • Purpose: Decouples Pods from storage details; Pods request "10Gi RWO" without knowing the backend.
  • Key Fields:
  • Requests: Desired capacity/access modes.
  • StorageClassName: Matches PV's class for dynamic provisioning.
  • YAML Example:
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: my-pvc
    spec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 10Gi
      storageClassName: standard
  • Usage in Pod/Deployment:
    spec:
      volumes:
      - name: persistent-storage
        persistentVolumeClaim:
          claimName: my-pvc  # References PVC
      containers:
      - name: app
        volumeMounts:
        - name: persistent-storage
          mountPath: /data
  • Nuances: Namespace-scoped; unbound PVCs wait for PV; dynamic provisioning creates PV if no match.

5. StorageClasses

  • Definition: Defines storage "classes" (e.g., fast SSD vs cheap HDD) for dynamic provisioning. Acts as a template for PV creation.
  • Purpose: Abstracts storage backends (e.g., AWS EBS, GCE PD); enables policy-based provisioning.
  • Key Fields:
  • Provisioner: Backend driver (e.g., ebs.csi.aws.com).
  • Parameters: Options (e.g., volume type: gp3).
  • AllowVolumeExpansion: Resize PVCs.
  • Default: Marked for auto-use if unspecified.
  • YAML Example:
    apiVersion: storage.k8s.io/v1
    kind: StorageClass
    metadata:
      name: fast-ssd
    provisioner: ebs.csi.aws.com
    parameters:
      type: gp3
    allowVolumeExpansion: true
    reclaimPolicy: Delete
    volumeBindingMode: WaitForFirstConsumer  # Delay binding until Pod schedules
  • Nuances: CSI (Container Storage Interface) drivers for modern backends; multiple classes for tiered storage.

Other Fundamental Storage Concepts

  • Access Modes:
Mode Description Example
RWO Read/Write by single node EBS volumes
RWX Read/Write by multiple nodes NFS, CephFS
ROX Read-only by multiple nodes CD-ROM images
  • Reclaim Policies (PV spec):
Policy Effect on PVC Delete
Retain PV persists; manual cleanup needed
Delete PV and storage destroyed
Recycle PV scrubbed and reused (deprecated)
  • Dynamic Provisioning: StorageClass + provisioner auto-creates PVs when PVC requests match (e.g., unbound PVC triggers EBS volume creation).
  • Volume Expansion: Resize PVCs online (if StorageClass allows); e.g., kubectl edit pvc my-pvc → increase requests.storage: 20Gi.
  • CSI Drivers: Modern standard for storage plugins (e.g., AWS EBS CSI); replaces in-tree drivers.
  • Storage Ephemerality: Without PV/PVC, data is lost when the Pod is deleted or rescheduled (an emptyDir survives container restarts only); use emptyDir for caches and PVs for databases.

Flow

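  1. Create StorageClass (template).
  2. Create a PV manually (static provisioning), or let the StorageClass provisioner create one on demand (dynamic provisioning).
  3. Create PVC (request) → binds to PV (storage).
  4. Pod/Deployment references PVC via volumes/volumeMounts.
  5. Data persists across Pod restarts and rescheduling; RWX is needed only when Pods on several nodes mount the volume at the same time.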
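
The binding step (PVC → PV) can be sketched as a matching function. The dict fields and function name are hypothetical, and picking the smallest sufficient volume is an assumption about how the controller chooses among several candidates:

```python
def find_matching_pv(pvc: dict, pvs: list):
    """Return the name of the best available PV for a claim, or None."""
    candidates = [
        pv
        for pv in pvs
        if pv["phase"] == "Available"
        and pv["storage_class"] == pvc["storage_class"]
        and pv["capacity_gi"] >= pvc["request_gi"]
        and set(pvc["access_modes"]) <= set(pv["access_modes"])
    ]
    if not candidates:
        return None  # the claim stays Pending, or dynamic provisioning kicks in
    # Prefer the smallest volume that still satisfies the request.
    return min(candidates, key=lambda pv: pv["capacity_gi"])["name"]
```

A claim for 10Gi RWO in class standard would bind to a 10Gi volume before a 50Gi one, and would ignore volumes that are already Bound.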

Summary: PV = storage supply, PVC = demand, StorageClass = provisioning rules. Use for stateful apps; ephemeral volumes for temp data.

Custom Resources (CRs) in Kubernetes

Custom Resources (CRs) are user-defined extensions to the Kubernetes API that allow you to create your own objects (like Pod, Deployment) with custom behavior. They are the foundation of Kubernetes Operators and extensibility. Without CRs, you'd be limited to generic resources — forcing complex logic into ConfigMaps, annotations, or external systems.

What is a Custom Resource?

A Custom Resource is a user-defined object stored in Kubernetes etcd that extends the Kubernetes API.

  • Example: Instead of only managing Pod, you can define Database, Backup, GameServer, etc.
  • Analogy:

    Built-in resources = int, string
    Custom Resources = class Database { ... }

apiVersion: mycompany.com/v1
kind: Database
metadata:
  name: prod-db
spec:
  size: 100
  engine: postgres

Core Concepts of Custom Resources

Concept Explanation
1. CRD (Custom Resource Definition) The schema that defines your new object type (like a database table schema).
2. Custom Resource (CR) An instance of the CRD (like a row in the table).
3. API Group & Version CRs live in custom API groups (e.g., stable.example.com/v1, databases.mycompany.com/v1alpha1).
4. Controller A reconciler (usually in an Operator) that watches CRs and makes the world match the desired state.
5. Validation OpenAPI v3 schema in CRD to enforce structure (e.g., size > 0).
6. Storage CRs are stored in etcd just like built-in objects.
7. Namespacing Can be namespaced or cluster-scoped.

1. CRD – The Blueprint

CRD YAML Structure

apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: databases.mycompany.com  # <plural>.<group>
spec:
  group: mycompany.com
  versions:
    - name: v1
      served: true
      storage: true
      schema:
        openAPIV3Schema:
          type: object
          properties:
            spec:
              type: object
              properties:
                size:
                  type: integer
                  minimum: 1
                engine:
                  type: string
                  enum: [postgres, mysql]
  scope: Namespaced  # or Cluster
  names:
    plural: databases
    singular: database
    kind: Database
    shortNames: [db]

Key Fields

Field Purpose
group Your API group, a DNS-style domain you control (e.g., mycompany.com)
versions Supports multiple (like v1, v1beta1)
storage: true Only one version stores data
scope Namespaced or Cluster
names.kind The object type in YAML (kind: Database)
shortNames CLI shortcuts (kubectl get db)

2. Custom Resource (CR) – The Instance

apiVersion: mycompany.com/v1
kind: Database
metadata:
  name: prod-postgres
  namespace: production
spec:
  size: 100
  engine: postgres
  backupPolicy: daily  # not declared in the CRD schema above, so the API server prunes it unless the schema adds it

Apply:

kubectl apply -f database.yaml

View:

kubectl get databases
kubectl get db prod-postgres -o yaml

3. Controller – The Brain (Reconciliation Loop)

A controller watches CRs and makes the actual state match the desired state.

Reconciliation Loop

1. Watch CR events (create/update/delete)
2. Read current state (from cluster)
3. Read desired state (from CR spec)
4. Compare
5. Take action (create PVC, deploy StatefulSet, etc.)
6. Update status
7. Repeat
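
The compare-and-act core of this loop can be sketched in plain Python. State is modeled as dicts keyed by object name, and the function names are hypothetical; a real controller would call the Kubernetes API instead of mutating a dict:

```python
def reconcile(desired: dict, actual: dict) -> list:
    """Return the actions needed to move actual state to desired state."""
    actions = []
    for name, spec in desired.items():
        if name not in actual:
            actions.append(("create", name))
        elif actual[name] != spec:
            actions.append(("update", name))
    for name in actual:
        if name not in desired:
            actions.append(("delete", name))
    return actions


def apply_actions(actions: list, desired: dict, actual: dict) -> dict:
    """Carry out the actions against the (in-memory) actual state."""
    for verb, name in actions:
        if verb == "delete":
            actual.pop(name)
        else:
            actual[name] = desired[name]
    return actual
```

The important property is idempotence: once the actions are applied, running reconcile again yields no further actions, so the loop can repeat safely.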

4. Status Subresource

CRs have two parts:

  • .spec → desired state (input)
  • .status → observed state (output)
status:
  phase: Running
  replicas: 3
  conditions:
  - type: Ready
    status: "True"
    lastUpdate: "2025-04-05T10:00:00Z"

Controller owns .status, user owns .spec.

5. Validation & Defaulting

OpenAPI v3 Schema in CRD

schema:
  openAPIV3Schema:
    type: object
    required: [spec]
    properties:
      spec:
        type: object
        required: [size, engine]
        properties:
          size:
            type: integer
            minimum: 1
            maximum: 1000
          engine:
            type: string
            enum: [postgres, mysql]
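
What the API server enforces with this schema can be mimicked by a hand-written check. This is a sketch of the same rules in plain Python (required fields, integer range, enum); the function name and error strings are made up for the illustration:

```python
ALLOWED_ENGINES = ("postgres", "mysql")


def validate_database(obj: dict) -> list:
    """Return a list of validation errors; an empty list means the object is valid."""
    spec = obj.get("spec")
    if not isinstance(spec, dict):
        return ["spec: required"]
    errors = []
    for field in ("size", "engine"):
        if field not in spec:
            errors.append(f"spec.{field}: required")
    size = spec.get("size")
    if size is not None:
        # bool is a subclass of int in Python, so exclude it explicitly.
        if not isinstance(size, int) or isinstance(size, bool):
            errors.append("spec.size: must be an integer")
        elif not 1 <= size <= 1000:
            errors.append("spec.size: must be between 1 and 1000")
    engine = spec.get("engine")
    if engine is not None and engine not in ALLOWED_ENGINES:
        errors.append("spec.engine: must be one of postgres, mysql")
    return errors
```

A spec with size: 100 and engine: postgres passes, while size: "100Gi" is rejected because the schema declares an integer.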

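Default Values: Static defaults can be declared with the default keyword in the schema; use a mutating admission webhook when a default depends on other fields or external data.

spec:
  size: 50  # filled in by the default (or webhook) when size is omitted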

Real-World Examples

Project CR Purpose
Cert-Manager Certificate Auto TLS
ArgoCD Application GitOps sync
Prometheus Operator ServiceMonitor Auto scraping
Istio VirtualService Traffic routing
Crossplane PostgreSQLInstance Cloud DB provisioning

Kubernetes Ingress & Ingress Controllers

Ingress is a built-in API resource (networking.k8s.io/v1) that defines HTTP(S) routing rules, but Kubernetes ships no component that acts on it.
An Ingress Controller is the actual software (NGINX, Traefik, HAProxy, etc.) that reads Ingress objects and configures a reverse proxy.

Ingress = "Take HTTP/S traffic from outside the cluster and route it to the correct Service (and Pod) based on URL, host, path, and other rules — all declaratively."

# ingress.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: web-ingress
spec:
  rules:
  - host: app.example.com
    http:
      paths:
      - path: /api
        pathType: Prefix
        backend:
          service:
            name: api-service
            port:
              number: 80
      - path: /
        pathType: Prefix
        backend:
          service:
            name: web-service
            port:
              number: 80
  tls:
  - hosts:
    - app.example.com
    secretName: app-tls-secret

2. What is an Ingress Controller?

Component Role
Ingress Resource Declarative rules (YAML)
Ingress Controller Reconciles rules → configures reverse proxy

Without a controller, Ingress does nothing.

3. Popular Ingress Controllers (2025)

Controller Type Key Features
NGINX Ingress L7 High perf, rewrite, auth
Traefik L7 Auto service discovery, middleware
HAProxy L7/L4 TCP/UDP, enterprise
Istio Gateway L7 mTLS, traffic splitting
Contour (Envoy) L7 gRPC, observability
Gloo L7 Function-level routing

4. How It Works – Step by Step

graph TD
    A[User: app.example.com/api] --> B[Load Balancer]
    B --> C[Ingress Controller Pod]
    C --> D[Reads Ingress YAML]
    D --> E[Configures NGINX/Traefik]
    E --> F[Routes to Service]
    F --> G[Pod]
  1. User → app.example.com
  2. Cloud LB → forwards to Ingress Controller
  3. Controller watches Ingress objects
  4. Generates config → reloads proxy
  5. Routes to correct Service → Pod

5. Key Ingress Fields

Field Purpose
spec.rules[].host Virtual host (e.g., api.example.com)
spec.rules[].http.paths[].path URL path (/api)
pathType Prefix, Exact, ImplementationSpecific
backend.service Target Service + port
spec.tls[] TLS termination (secret with cert/key)
metadata.annotations Controller-specific config

6. Path Types (Critical!)

Type Behavior
Prefix /api → /api, /api/users
Exact /api only
ImplementationSpecific Controller decides (NGINX: regex, Traefik: regex)
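
The difference between Prefix and Exact is easy to get wrong, because Prefix matches whole path elements rather than raw string prefixes. A minimal sketch of both rules, with hypothetical function names and ImplementationSpecific left out since it depends on the controller:

```python
def path_matches(rule_path: str, path_type: str, request_path: str) -> bool:
    if path_type == "Exact":
        return request_path == rule_path
    if path_type == "Prefix":
        # Compare element by element, ignoring empty segments (trailing slashes).
        rule_parts = [part for part in rule_path.split("/") if part]
        request_parts = [part for part in request_path.split("/") if part]
        return request_parts[: len(rule_parts)] == rule_parts
    raise ValueError(f"unsupported pathType in this sketch: {path_type}")


def route(rules: list, request_path: str):
    """rules: list of (path, path_type, service). Longest match wins; Exact beats Prefix on a tie."""
    matches = [rule for rule in rules if path_matches(rule[0], rule[1], request_path)]
    if not matches:
        return None
    best = max(matches, key=lambda rule: (len(rule[0]), rule[1] == "Exact"))
    return best[2]
```

Note that /apiv2 does not match the Prefix rule /api, because apiv2 is a different path element; it falls through to the / rule instead.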

7. Real-World Example (NGINX)

# 1. Services
apiVersion: v1
kind: Service
metadata:
  name: web
spec:
  selector:
    app: web
  ports:
  - port: 80

--- 

# 2. Ingress
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: main-ingress
  annotations:
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
    cert-manager.io/cluster-issuer: "letsencrypt"
spec:
  ingressClassName: nginx  # Points to controller
  tls:
  - hosts: [app.example.com]
    secretName: app-tls
  rules:
  - host: app.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: web
            port:
              number: 80

8. IngressClass – Avoid Conflicts

  1. An IngressClass is a resource that names a specific Ingress controller to handle Ingress resources, so different Ingresses can use controllers with different capabilities and configurations.
  2. It enables the use of multiple Ingress controllers (such as NGINX, Traefik, or HAProxy) within the same cluster by associating each Ingress with a designated controller through the ingressClassName field in the Ingress manifest.
apiVersion: networking.k8s.io/v1
kind: IngressClass
metadata:
  name: nginx
spec:
  controller: k8s.io/ingress-nginx

# In Ingress
spec:
  ingressClassName: nginx

Multiple controllers? Use ingressClassName to route.

9. TLS Termination

  1. Create Secret:
     apiVersion: v1
     kind: Secret
     metadata:
       name: app-tls
     type: kubernetes.io/tls
     data:
       tls.crt: base64(cert)
       tls.key: base64(key)
  2. Auto-TLS with cert-manager:
     annotations:
       cert-manager.io/cluster-issuer: "letsencrypt"

10. Advanced Features (Controller-Specific)

Feature Annotation Controller
Rate limiting nginx.ingress.kubernetes.io/limit-rps: "10" NGINX
Auth nginx.ingress.kubernetes.io/auth-url: ... NGINX
Canary nginx.ingress.kubernetes.io/canary-weight: "20" NGINX
Middleware traefik.ingress.kubernetes.io/router.middlewares: ... Traefik

11. Architecture Diagram

[Figure: Ingress architecture diagram]

Summary Table

Component Role
Ingress YAML rules
Ingress Controller Proxy (NGINX/Traefik)
IngressClass Route to correct controller
Service Backend target
Secret TLS certs

Golden Rule:

Ingress = Rules
Ingress Controller = Engine
No controller = No routing

Kubernetes RBAC

Role-Based Access Control (RBAC) is Kubernetes’ default authorization system that controls who (user/service) can do what (verbs) on which resources in which namespace.

RBAC = "Who → Can do → What → Where"

2. Core RBAC Resources

Resource Purpose
Role / ClusterRole Define permissions (verbs on resources)
RoleBinding / ClusterRoleBinding Bind permissions to users/groups/service accounts
Subject Who gets access: User, Group, ServiceAccount

3. Role vs ClusterRole

Aspect Role ClusterRole
Scope Namespaced Cluster-wide
Use One namespace only All namespaces + cluster resources
Example Edit Pods in dev View Nodes cluster-wide

4. RoleBinding vs ClusterRoleBinding

Aspect RoleBinding ClusterRoleBinding
Binds Role or ClusterRole → subject in one namespace ClusterRole → subject cluster-wide
Can reference Role or ClusterRole (permissions limited to the binding's namespace) ClusterRole only

5. Verbs (Actions)

Verb Meaning
get Read one resource
list Read many
watch Stream changes
create Make new
update / patch Modify
delete Remove
deletecollection Bulk delete

6. Resources & API Groups

Resource API Group
pods, services "" (core)
deployments, ingresses apps, networking.k8s.io
nodes, persistentvolumes "" (core); cluster-scoped
* All resources
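
How a request is checked against rules built from verbs, resources, and API groups can be sketched as below. RBAC is purely additive (there are no deny rules), so one matching rule is enough. The function names are hypothetical, and the sample rules mirror a Role that can fully manage Pods but only read Deployments:

```python
def _matches(allowed: list, value: str) -> bool:
    """A list entry of '*' acts as a wildcard."""
    return "*" in allowed or value in allowed


def rule_allows(rule: dict, api_group: str, resource: str, verb: str) -> bool:
    return (
        _matches(rule["apiGroups"], api_group)
        and _matches(rule["resources"], resource)
        and _matches(rule["verbs"], verb)
    )


def is_allowed(rules: list, api_group: str, resource: str, verb: str) -> bool:
    """Allowed if any rule grants the request; otherwise denied by default."""
    return any(rule_allows(rule, api_group, resource, verb) for rule in rules)


POD_EDITOR_RULES = [
    {
        "apiGroups": [""],
        "resources": ["pods"],
        "verbs": ["get", "list", "watch", "create", "update", "patch", "delete"],
    },
    {"apiGroups": ["apps"], "resources": ["deployments"], "verbs": ["get", "list"]},
]
```

The core API group is the empty string, which is why Pods use apiGroups: [""] while Deployments use "apps".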

7. Full Example: Dev Can Edit Pods in dev Namespace

# 1. Role: What can be done
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: dev
  name: pod-editor
rules:
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
- apiGroups: ["apps"]
  resources: ["deployments"]
  verbs: ["get", "list"]
---
# 2. RoleBinding: Who gets the role
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: dev-pod-access
  namespace: dev
subjects:
- kind: User
  name: alice
  apiGroup: rbac.authorization.k8s.io
- kind: ServiceAccount
  name: deployer-sa
  namespace: dev
roleRef:
  kind: Role
  name: pod-editor
  apiGroup: rbac.authorization.k8s.io

8. Cluster-Wide: View All Nodes

# ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: node-viewer
rules:
- apiGroups: [""]
  resources: ["nodes"]
  verbs: ["get", "list", "watch"]
---
# ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: node-viewer-global
subjects:
- kind: User
  name: monitoring-bot
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: node-viewer
  apiGroup: rbac.authorization.k8s.io

9. Built-in ClusterRoles (Use These!)

ClusterRole Permissions
cluster-admin Everything
admin Most in a namespace
edit Create/update most resources
view Read-only
# Give admin in namespace
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: namespace-admin
  namespace: staging
subjects:
- kind: User
  name: bob
roleRef:
  kind: ClusterRole
  name: admin
  apiGroup: rbac.authorization.k8s.io

10. Service Accounts & RBAC

  • A Service Account (SA) is a Kubernetes identity that lets non-human workloads (applications, Pods, processes) authenticate and be authorized in the cluster.
  • A Service Account is scoped at namespace level.
# SA
apiVersion: v1
kind: ServiceAccount
metadata:
  name: backup-sa
  namespace: tools
---
# Bind to ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: backup-access
subjects:
- kind: ServiceAccount
  name: backup-sa
  namespace: tools
roleRef:
  kind: ClusterRole
  name: view
  apiGroup: rbac.authorization.k8s.io

Use in Pod:

spec:
  serviceAccountName: backup-sa

11. Testing RBAC

# Impersonate user
kubectl auth can-i create pods --as=alice -n dev
# → yes

kubectl auth can-i delete nodes --as=alice
# → no

12. Common Patterns

Goal Use
Dev team edits in namespace Role + RoleBinding
CI/CD deploys ServiceAccount + RoleBinding
Monitoring reads all ClusterRole(view) + ClusterRoleBinding
Admin per namespace ClusterRole(admin) + RoleBinding

13. Best Practices

Practice Why
Least privilege Only needed verbs/resources
Use groups Bind to groups (e.g., developers) instead of individual users; the system: prefix is reserved for Kubernetes itself
Avoid cluster-admin Except for admins
Use ServiceAccounts For apps, not users
Audit regularly kubectl get rolebindings -A

Golden Rule:

Never give cluster-admin unless absolutely needed.
Always bind ClusterRole with RoleBinding for namespace isolation.

Kubernetes Monitoring

1. Why Monitor Kubernetes?

Need What You Track
Reliability Pod restarts, OOM kills
Performance CPU, memory, latency
Capacity Node saturation
Security Anomalies, failed logins
SLOs 99.9% uptime

2. Core Monitoring Stack (2025 Standard)

Kubernetes
   ↓
cAdvisor (built-in) → Metrics Server → kube-state-metrics → Prometheus
   ↓
Grafana (dashboards) + Alertmanager + Kiali (Istio)

3. In-Built Components

Component Role Built-in?
cAdvisor Collects container metrics (CPU, memory, disk, network) Yes (in kubelet)
Metrics Server Aggregates kubelet/cAdvisor metrics → kubectl top No (add-on, install separately)
kube-state-metrics Exposes cluster state (Pods, Deployments, Nodes) No (install)

4. Metrics Server – kubectl top

What It Does

  • Lightweight, in-memory aggregator of metrics from all cAdvisors
  • Enables:
    kubectl top nodes
    kubectl top pods -n prod

Limits

  • No long-term storage
  • No alerting
  • No custom metrics

5. Prometheus – The Gold Standard

Feature Details
Pull-based Scrapes /metrics endpoints from various sources
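Time-series DB Local storage with a 15-day default retention; use remote storage for long-term data
PromQL Powerful query language
Service Discovery Finds scrape targets through the Kubernetes API (kubernetes_sd_configs)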

Key Targets

Target Endpoint Metrics
kubelet /metrics, /metrics/cadvisor Container CPU/memory
API server /metrics Request latency
Nodes (node-exporter) :9100/metrics System stats
kube-state-metrics /metrics Pod count, phase
Your app /metrics (expose via client lib) HTTP requests, errors

6. Grafana – Visualization

  • Dashboards for Prometheus
  • Pre-built: Node Exporter, Kubernetes Cluster, Apps
  • Alerting via Prometheus rules
# Example Panel
sum(rate(container_cpu_usage_seconds_total{namespace="prod"}[5m])) by (pod)

7. Kiali – Service Mesh Observability (Istio)

Feature Use
Service Graph Visual traffic flow
Metrics Golden signals per service
Traces Distributed tracing
Config Validation Istio config errors

Only with Istio

8. Expose Application Metrics

Go Example

import "github.com/prometheus/client_golang/prometheus/promhttp"
http.Handle("/metrics", promhttp.Handler())

Python

from prometheus_client import start_http_server
start_http_server(8000)

Annotation (auto-scrape, if the Prometheus scrape config is set up to honor these annotations)

metadata:
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8080"
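
These annotations are a convention, not something Prometheus reads natively: a scrape config has to translate them into targets through relabeling. The effect can be sketched as a function that turns a Pod's IP and annotations into a scrape URL. The function name is hypothetical, and prometheus.io/path is assumed as the commonly used companion annotation:

```python
def scrape_url(pod_ip: str, annotations: dict, default_path: str = "/metrics"):
    """Return the URL to scrape, or None if the Pod has not opted in."""
    if annotations.get("prometheus.io/scrape") != "true":
        return None
    port = annotations.get("prometheus.io/port")
    if port is None:
        return None  # this sketch requires an explicit port
    path = annotations.get("prometheus.io/path", default_path)
    return f"http://{pod_ip}:{port}{path}"
```

Note that the annotation value must be the string "true"; annotation values are always strings, which is why the YAML quotes them.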

9. Alertmanager – Handle Alerts

# alert.rules
groups:
- name: node-alerts
  rules:
  - alert: NodeDown
    expr: up{job="node"} == 0
    for: 5m
    labels:
      severity: critical

Routes to Slack, PagerDuty, email.

10. Full Stack Overview

[Figure: full monitoring stack overview]

11. Summary Table

Tool Type Must-Have?
cAdvisor Container metrics Yes (built-in)
Metrics Server kubectl top Yes
Prometheus Storage + query Yes
Grafana Dashboards Yes
Kiali Service mesh Yes (with Istio)
Alertmanager Alerts Yes

Golden Rule:

"If it’s not in Prometheus, it doesn’t exist."
Instrument everything. Alert on SLOs. Visualize trends.

Other Concepts

Annotations

Annotations are arbitrary key-value metadata attached to any Kubernetes object (Pod, Service, Deployment, etc.) — but they are NOT used for selecting or filtering.

metadata:
  annotations:
    app.kubernetes.io/version: "v1.2.3"
    prometheus.io/scrape: "true"
    backup.velero.io/backup-at: "2025-04-05T02:00:00Z"

Feature Labels Annotations
Purpose Identify & select objects Attach non-identifying metadata
Used by kubectl get pod -l app=web Not used in selectors
Size Values up to 63 characters, indexed Up to 256 KiB total per object
Example app: web, env: prod description, contact, backup-policy
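
The difference is easiest to see in a selector. The Python sketch below mimics an equality-based selector such as -l app=web: only labels are consulted, so an annotation with the same key and value never matches. The pod data and the helper names matches and select are made up for illustration.

```python
def matches(selector, labels):
    """Equality-based selector: every key=value pair must be present in labels."""
    return all(labels.get(k) == v for k, v in selector.items())

def select(objects, selector):
    """Mimic `kubectl get pod -l ...`: only the labels are consulted."""
    return [o["name"] for o in objects if matches(selector, o.get("labels", {}))]

pods = [
    {"name": "web-1", "labels": {"app": "web", "env": "prod"},
     "annotations": {"owner": "team-data@company.com"}},
    {"name": "db-1", "labels": {"app": "db", "env": "prod"},
     "annotations": {"app": "web"}},  # an annotation, so selectors ignore it
]
print(select(pods, {"app": "web"}))  # ['web-1']
```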

Why Use Annotations?

Use Case Example
Tooling Integration prometheus.io/scrape: "true" → Prometheus auto-scrapes
Operators & Controllers helm.sh/hook: pre-install → Helm runs job
Backup & Restore backup.velero.io/backup-volumes: <volume-name> → Velero backs up that pod volume
Ingress Rules nginx.ingress.kubernetes.io/rewrite-target: /$1
CI/CD Metadata build-id: 12345, git-commit: abc123
Documentation owner: team-data@company.com
Custom Automation reloader.stakater.com/auto: "true" → ConfigMap reload

Real-World Examples

# 1. Prometheus
annotations:
  prometheus.io/scrape: "true"
  prometheus.io/port: "8080"

# 2. Helm
annotations:
  meta.helm.sh/release-name: my-app
  meta.helm.sh/release-namespace: prod

# 3. Cert-Manager
annotations:
  cert-manager.io/cluster-issuer: "letsencrypt"

# 4. Custom Operator
annotations:
  database.mycompany.com/backup-policy: daily

Best Practices

Do Don’t
Use structured prefixes (prometheus.io/, app.example.com/) Use random keys
Store non-identifying data Use for selectors
Keep under 256KB Store large logs
Use for automation hooks Hardcode tool settings in application code
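
The first and third rows can be checked mechanically. Below is a minimal Python sketch of such a lint, assuming the 256 KiB limit applies to all keys and values of one object combined. The function check_annotations is hypothetical, and the prefix check is a style rule from the table above: Kubernetes itself accepts unprefixed keys.

```python
import re

MAX_TOTAL_BYTES = 256 * 1024  # all keys and values of one object combined
NAME_RE = re.compile(r"^[A-Za-z0-9]([A-Za-z0-9_.-]*[A-Za-z0-9])?$")

def check_annotations(annotations):
    """Return a list of problems; an empty list means the checks passed."""
    problems = []
    for key in annotations:
        prefix, slash, name = key.rpartition("/")
        if not slash:
            # Style check only: unprefixed keys are valid but easy to collide on.
            problems.append(f"{key}: no prefix (prefer e.g. app.example.com/{key})")
        if len(name) > 63 or not NAME_RE.match(name):
            problems.append(f"{key}: invalid name segment")
    total = sum(len(k.encode()) + len(v.encode()) for k, v in annotations.items())
    if total > MAX_TOTAL_BYTES:
        problems.append(f"total size {total} bytes exceeds {MAX_TOTAL_BYTES}")
    return problems

print(check_annotations({"prometheus.io/scrape": "true"}))
print(check_annotations({"build-id": "12345"}))
```

The first call returns an empty list; the second flags the missing prefix.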

How Tools Use Annotations

Tool Reads Annotations For
Prometheus Scraping config
Helm Release tracking
ArgoCD Sync waves
Kubelet Pod behavior
Custom Controllers Triggers, policies

Summary:

  • Labels = Who is this?
  • Annotations = Extra info about this; metadata for tools and automation.
  • Not for filtering
  • Perfect for integration, hooks, and context

Istio

Istio = Service Mesh → Adds traffic control, security, observability to apps without code changes.

Core Architecture

Your App Pods
   ↓
Envoy Sidecar (auto-injected into every Pod)
   ↓
Istiod (Control Plane)

  • Envoy: Deployed alongside each service instance as a sidecar container; intercepts all inbound and outbound traffic for that service. This lets Istio enforce policies, collect telemetry, and manage traffic without changes to the application code.
  • Istiod: Configures the Envoy proxies and distributes certificates and policies.

1. Traffic Management

Feature How
Path-based routing /api/v1 → api-v1, /api/v2 → api-v2
Ratio-based (Canary) 90% → v1, 10% → v2
Header-based x-user: beta → canary
Fault Injection Delay 2s, abort 5%
Timeouts/Retries Auto retry on 5xx

apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata: {name: api}  # example name
spec:
  hosts: [api.example.com]
  http:
  - match:  # beta users calling /api go entirely to v2
    - uri: {prefix: /api}
      headers:
        x-user: {exact: beta}
    route:
    - destination: {host: api-v2, subset: v2}
      weight: 100
  - route:  # everyone else: 90/10 canary split
    - destination: {host: api-v1, subset: v1}
      weight: 90
    - destination: {host: api-v2, subset: v2}
      weight: 10
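
Rules in a VirtualService are tried in order and the first match wins; within the chosen rule, weights split the traffic. A simplified Python sketch of that decision for the two rules above follows. The function pick_route is illustrative, and roll stands in for the proxy's random draw so the example stays deterministic.

```python
def pick_route(path, headers, roll):
    """Return the destination host for one request.

    `roll` is a number in [0, 100) standing in for the proxy's random draw.
    """
    # Rule 1: /api requests from beta users go entirely to v2.
    if path.startswith("/api") and headers.get("x-user") == "beta":
        return "api-v2"
    # Rule 2 has no match block, so it catches everything else: 90/10 split.
    cumulative = 0
    for host, weight in (("api-v1", 90), ("api-v2", 10)):
        cumulative += weight
        if roll < cumulative:
            return host
    raise ValueError("roll must be in [0, 100)")

print(pick_route("/api/items", {"x-user": "beta"}, 50))  # api-v2
print(pick_route("/api/items", {}, 50))                  # api-v1
print(pick_route("/api/items", {}, 95))                  # api-v2
```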

2. mTLS Encryption (Mutual TLS)

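  • Automatic between all sidecar-injected services
  • Zero-trust: Every call encrypted + authenticated
  • Istiod issues short-lived certs (SPIFFE)

apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: istio-system  # root namespace, so the policy applies mesh-wide
spec:
  mtls:
    mode: STRICT  # Enforce mTLS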
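
The identity inside each workload certificate is a SPIFFE ID of the form spiffe://<trust-domain>/ns/<namespace>/sa/<service-account>. A small Python sketch that takes one apart (the function parse_spiffe_id and the example ID are illustrative):

```python
from urllib.parse import urlparse

def parse_spiffe_id(spiffe_id):
    """Split an Istio workload identity into its three components."""
    uri = urlparse(spiffe_id)
    parts = uri.path.strip("/").split("/")
    if (uri.scheme != "spiffe" or len(parts) != 4
            or parts[0] != "ns" or parts[2] != "sa"):
        raise ValueError(f"not an Istio-style SPIFFE ID: {spiffe_id}")
    return {
        "trust_domain": uri.netloc,
        "namespace": parts[1],
        "service_account": parts[3],
    }

identity = parse_spiffe_id("spiffe://cluster.local/ns/prod/sa/api")
print(identity)
```

Because the identity is derived from the namespace and service account, policies can refer to callers by who they are rather than by IP address.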

3. Access Control (Authorization)

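apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: allow-public-get
  namespace: prod  # example name and namespace
spec:
  action: ALLOW  # requests that match no ALLOW rule are denied
  rules:
  - from:
      - source: {principals: ["cluster.local/ns/prod/sa/api"]}
    to:
      - operation: {methods: ["GET"], paths: ["/public/*"]}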
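
Read as logic, the rule says: allow a request only if the caller's principal, the HTTP method, and the path all match. Once an ALLOW policy applies to a workload, anything that matches no rule is denied. A minimal Python sketch of that evaluation for this single rule (the helpers path_matches and is_allowed are illustrative, not Istio code):

```python
RULE = {
    "principals": ["cluster.local/ns/prod/sa/api"],
    "methods": ["GET"],
    "paths": ["/public/*"],
}

def path_matches(pattern, path):
    """String match with exact, 'prefix*', '*suffix', and '*' (any non-empty) forms."""
    if pattern == "*":
        return path != ""
    if pattern.endswith("*"):
        return path.startswith(pattern[:-1])
    if pattern.startswith("*"):
        return path.endswith(pattern[1:])
    return path == pattern

def is_allowed(principal, method, path, rule=RULE):
    """True only when source principal, method, and path all match the rule."""
    return (principal in rule["principals"]
            and method in rule["methods"]
            and any(path_matches(p, path) for p in rule["paths"]))

caller = "cluster.local/ns/prod/sa/api"
print(is_allowed(caller, "GET", "/public/logo.png"))   # True
print(is_allowed(caller, "POST", "/public/logo.png"))  # False
print(is_allowed(caller, "GET", "/admin"))             # False
```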

4. Observability (Golden Signals)

Tool What
Kiali Service graph, health
Prometheus Metrics (istio_requests_total)
Jaeger/Zipkin Traces
Grafana Dashboards

5. Key Resources

Resource Purpose
VirtualService Routing rules
DestinationRule Subsets, load balancing, circuit breaker
Gateway Ingress (L7 LB)
ServiceEntry External services (e.g., api.google.com)
PeerAuthentication mTLS mode
AuthorizationPolicy RBAC for traffic

6. Example: Canary + mTLS

# 1. Subsets
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata: {name: reviews}
spec:
  host: reviews
  subsets:
  - name: v1
    labels: {version: v1}
  - name: v2
    labels: {version: v2}
---
# 2. 90/10 routing
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata: {name: reviews}
spec:
  hosts: [reviews]
  http:
  - route:
    - {destination: {host: reviews, subset: v1}, weight: 90}
    - {destination: {host: reviews, subset: v2}, weight: 10}
---
# 3. Enforce mTLS
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata: {name: default}
spec:
  mtls: {mode: STRICT}

7. Important Concepts

Concept Meaning
Sidecar Envoy injected into every Pod
Subset Group of Pods by labels (e.g., version: v2)
Gateway Configures the edge proxy for ingress/egress traffic (can replace NGINX Ingress)
mTLS Mutually authenticated, encrypted service-to-service traffic
Circuit Breaker Stop cascading failures
Fault Injection Test resilience

Golden Rule:

Istio = Envoy + Istiod → Traffic, Security, Observability without app changes.

Use Istio when:

  • Microservices
  • Canary/Blue-Green
  • Zero-trust security
  • Multi-cluster

Skip if:

  • Simple apps
  • Monolith

Now route, secure, and observe your traffic like a pro!
Try: istioctl dashboard kiali