Kubernetes (K8s)
1. Why Kubernetes? The Problem Docker Solves… and Doesn’t
| Problem | Docker | Kubernetes |
|---|---|---|
| Run app in container | Yes | Yes |
| Run 100 containers | Manual | Automated |
| Auto-restart failed container | No | Yes |
| Scale to 1000 containers | No | Yes |
| Rolling updates | No | Yes |
| Self-healing | No | Yes |
| Multi-host deployment | No | Yes |
Docker = "Run one container"
Kubernetes = "Orchestrate 10,000 containers across 100 machines"
2. What Kubernetes Offers on Top of Docker
| Feature | What It Does |
|---|---|
| Orchestration | Manages 1000s of containers across nodes |
| Self-healing | Auto-restart, reschedule failed pods |
| Auto-scaling | Scale up/down based on CPU/load |
| Rolling Updates | Zero-downtime deployments |
| Service Discovery | api.service → auto DNS |
| Load Balancing | Spread traffic across pods |
| Secret/Config Management | Inject env vars, files securely |
| Multi-cloud | Run same app on AWS, GCP, Azure, on-prem |
3. Kubernetes Architecture – Master vs Worker Nodes
+------------------+ gRPC/HTTP +------------------+
| MASTER NODE | ◄───────────────► | WORKER NODE |
| (Control Plane) | | (Runs Pods) |
+------------------+ +------------------+
MASTER NODE (Control Plane) – The Brain of K8s
Runs on 1 or 3+ nodes (HA)
Never runs user workloads
All components talk via kube-apiserver
+------------------+
| MASTER NODE |
| |
| ┌─────────────┐ |
| │ API Server │ ← All communication
| └─────▲───────┘ |
| │ |
| ┌─────▼───────┐ |
| │ etcd │ ← Single source of truth
| └─────▲───────┘ |
| │ |
| ┌─────▼───────┐ |
| │ Scheduler │ ← "Where to run?"
| └─────▲───────┘ |
| │ |
| ┌─────▼───────┐ |
| │ Controller │ ← "Make it match desired state"
| │ Manager │
| └─────────────┘ |
+------------------+
1. kube-apiserver – The Front Door
| Role | Details |
|---|---|
| Central API | All kubectl, controllers, kubelet → talk to this |
| REST API | GET /api/v1/pods, POST /api/v1/namespaces |
| Authentication | JWT, certificates, OIDC, webhook |
| Authorization | RBAC, ABAC, Node, Webhook |
| Validation | Rejects invalid YAML |
| Scaling | Horizontal (multiple replicas behind LB) |
# You talk to this
kubectl get pods --server=https://master:6443
2. etcd – The Database (Single Source of Truth)
| Role | Details |
|---|---|
| Key-value store | Only stores cluster state (pods, services, secrets) |
| Consistent & HA | Uses Raft consensus |
| Watched by all | Controllers react to changes |
| Backup critical | etcdctl snapshot save |
# See raw data
kubectl exec -n kube-system etcd-master -- etcdctl get /registry/pods/default/myapp
If etcd dies → cluster is brain-dead
Always 3-node etcd cluster in production
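A minimal sketch of that backup, assuming a kubeadm-style cluster where etcd runs as a static pod and its certificates live under /etc/kubernetes/pki/etcd (paths differ per setup):

```bash
# Take a snapshot of etcd (run on a control-plane node; paths are typical kubeadm defaults)
ETCDCTL_API=3 etcdctl snapshot save /var/backups/etcd-snapshot.db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key

# Verify the snapshot
ETCDCTL_API=3 etcdctl snapshot status /var/backups/etcd-snapshot.db --write-out=table
```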
3. kube-scheduler – The Matchmaker
| Role | Details |
|---|---|
| Watches | Unscheduled pods (nodeName: null) |
| Scores nodes | CPU, memory, taints, affinity, topology |
| Assigns | Sets pod.spec.nodeName |
Scoring Example
# Pod wants SSD
nodeSelector:
disktype: ssd
→ Scheduler picks node with label disktype=ssd
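For that match to happen, the node must carry the label; a quick sketch (the node name `worker-1` is just an assumption):

```bash
# Label a node so the scheduler can satisfy nodeSelector: disktype: ssd
kubectl label nodes worker-1 disktype=ssd

# Confirm which nodes carry the label
kubectl get nodes -l disktype=ssd --show-labels
```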
4. kube-controller-manager – The Robot Army
Runs multiple controllers in one process:
| Controller | Job |
|---|---|
| ReplicaSet | Ensure 3 pods → if 2, create 1 |
| Deployment | Manage rollouts, rollback |
| StatefulSet | Ordered pods (db-0, db-1) |
| DaemonSet | Run on every node (logging, monitoring) |
| Job/CronJob | Run to completion |
| Node | Mark node NotReady if kubelet stops |
| Endpoint | Update Service → Pod IP mapping |
# See controllers in action
kubectl get rs,deployments,statefulsets -A
5. cloud-controller-manager
| Role | Cloud Integration |
|---|---|
| Node | Sync cloud node metadata |
| LoadBalancer | Create AWS ELB, GCP LB |
| Route | Cloud network routes |
| Service | Manage cloud-specific services |
Only runs in cloud environments
WORKER NODE – The Muscle
Runs user workloads (pods)
Multiple per cluster
+------------------+
| WORKER NODE |
| |
| ┌─────────────┐ |
| │ kubelet │ ← Talks to API server
| └─────▲───────┘ |
| │ |
| ┌─────▼───────┐ |
| │ kube-proxy │ ← Load balances
| └─────▲───────┘ |
| │ |
| ┌─────▼───────┐ |
| │ containerd │ ← Runs containers
| └─────▲───────┘ |
| │ |
| ┌─────▼───────┐ |
| │ Pods │ ← Your apps
| └─────────────┘ |
+------------------+
1. kubelet – The Node Agent
| Role | Details |
|---|---|
| Watches API server | Gets assigned pods |
| Talks to container runtime | Starts/stops containers |
| Reports status | CPU, memory, pod phase |
| Exec, logs, port-forward | kubectl exec, logs |
| cAdvisor | Built-in metrics |
# See what kubelet sees
journalctl -u kubelet
2. kube-proxy – The Network Cop
| Role | Details |
|---|---|
| Watches Services & Endpoints | When pod IP changes |
| Programs iptables / IPVS | Routes traffic |
| Load balances | Round-robin across pods |
Service Types Handled
type: ClusterIP → 10.96.0.1 → iptables DNAT
type: NodePort → 30080 → iptables
type: LoadBalancer → cloud LB
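To see kube-proxy's work on a node, you can peek at the NAT chains it programs; a rough sketch (requires shell access to the node, output differs in IPVS mode, and port 10249 is the default metrics bind):

```bash
# On a worker node: list the Service DNAT rules kube-proxy maintains (iptables mode)
sudo iptables -t nat -L KUBE-SERVICES -n | head

# Ask kube-proxy which proxy mode it is running in
curl -s http://localhost:10249/proxyMode
```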
3. Container Runtime – The Engine
| Runtime | Status |
|---|---|
| containerd | Default since K8s 1.24 |
| CRI-O | Red Hat, lightweight |
| Docker | Deprecated (shim removed) |
Docker-built images still run fine: containerd consumes the same OCI images. Using Docker Engine itself as the runtime now requires the external cri-dockerd adapter, since the dockershim was removed in v1.24.
# Check runtime
kubectl get nodes -o wide
# → container-runtime: containerd://1.7.0
Real-World Flow
graph TD
A[User: kubectl apply] --> B[API Server]
B --> C[etcd: store desired state]
C --> D[Scheduler: pick node]
D --> E[kubelet on node]
E --> F[containerd: pull image]
F --> G[Start containers]
G --> H[kube-proxy: update iptables]
H --> I[Service ready]
High Availability (HA) Setup
| Component | HA Strategy |
|---|---|
| API Server | 3+ replicas → LB (keepalived, cloud LB) |
| etcd | 3-node cluster (Raft) |
| Scheduler / Controller | Run on all masters (leader election) |
| Worker Nodes | 3+ for redundancy |
Summary Table
| Node | Component | Job |
|---|---|---|
| Master | kube-apiserver | API gateway |
| Master | etcd | Cluster database |
| Master | scheduler | Assign pods to nodes |
| Master | controller-manager | Run control loops |
| Worker | kubelet | Run pods on node |
| Worker | kube-proxy | Network proxy |
| Worker | containerd | Run containers |
Golden Rule:
Master = Think, Store, Schedule
Worker = Run, Report, Route
Now you understand how Kubernetes turns 100 machines into one logical supercomputer.
Try:
kubectl get componentstatuses
kubectl -n kube-system get pods
And see the control plane in action!
Kubernetes Resources
Pods, Deployments, Services, DaemonSets, Secrets, ConfigMaps, StatefulSets & More
Kubernetes (K8s) resources are the declarative building blocks of your cluster. You define desired state in YAML/JSON, and K8s makes it reality through controllers and reconciliation loops.
Key Principle:
Imperative (`kubectl run`) → temporary
Declarative (YAML) → persistent, version-controlled
1. Pod – The Atomic Unit
Definition
- Smallest deployable unit in K8s.
- Runs 1+ containers that share network, storage, and lifecycle.
- Ephemeral – pods die; don't manage directly (use Deployment).
Key Features
- Shared Resources: Containers in one pod communicate via `localhost`.
- Lifecycle: Scheduled to nodes, runs until completion/failure.
- Probes: Readiness/liveness to control traffic/health.
Kubernetes Probes
Kubernetes probes are health checks that determine pod behavior. They use HTTP, TCP, or command-based tests with configurable thresholds (e.g., initial delay, period, timeout, success/failure counts).
- Liveness Probe
  - Purpose: Detects if the pod is alive and healthy. If it fails, Kubernetes restarts the pod (self-healing).
  - When Used: For apps that can deadlock or crash (e.g., memory leaks).
  - Behavior: Failure → pod restarts; doesn't affect traffic routing.
  - YAML Example:

    ```yaml
    livenessProbe:
      httpGet:
        path: /healthz
        port: 8080
      initialDelaySeconds: 30
      periodSeconds: 10
      timeoutSeconds: 5
      failureThreshold: 3   # Restart after 3 failures
    ```

- Readiness Probe
  - Purpose: Detects if the pod is ready to serve traffic. If it fails, Kubernetes removes it from Service endpoints (no traffic sent) but doesn't restart it.
  - When Used: For apps that need warmup time or become temporarily unhealthy (e.g., during DB connection).
  - Behavior: Failure → pod excluded from load balancing; restarts only if liveness fails.
  - YAML Example:

    ```yaml
    readinessProbe:
      tcpSocket:
        port: 8080
      initialDelaySeconds: 5
      periodSeconds: 10
      timeoutSeconds: 1
      successThreshold: 1
      failureThreshold: 3
    ```
Key Differences
| Aspect | Liveness Probe | Readiness Probe |
|---|---|---|
| Failure Action | Restart pod | Exclude from traffic |
| Impact on Service | No (traffic continues to healthy pods) | Yes (removes from endpoints) |
| Default | None | None |
| Use Case | Crash recovery | Traffic routing |
YAML Example
apiVersion: v1
kind: Pod
metadata:
name: nginx-pod
labels:
app: web
spec:
containers:
- name: nginx
image: nginx:1.25
ports:
- containerPort: 80
resources:
limits:
cpu: "100m"
memory: "128Mi"
- name: sidecar-logger
image: fluentd:v1.14
volumeMounts:
- name: logs
mountPath: /var/log/nginx
volumes:
- name: logs
emptyDir: {}
Use Cases
- Simple apps (single container).
- Sidecar pattern (app + logger/monitor).
Pros/Cons
- Pros: Fine-grained control.
- Cons: No auto-restart; use with Deployment.
2. Deployment – The Workhorse for Stateless Apps
Definition
- Manages ReplicaSets to ensure desired pod replicas.
- Handles rolling updates, rollbacks, and scaling.
- Stateless – assumes pods are interchangeable.
Key Features
- Strategy: RollingUpdate (default) or Recreate.
- Selectors: Matches pods via labels.
- Revision History: Tracks changes for rollback.
YAML Example
apiVersion: apps/v1
kind: Deployment
metadata:
name: web-deployment
spec:
replicas: 3
selector:
matchLabels:
app: web
strategy:
type: RollingUpdate
rollingUpdate:
maxSurge: 25% # Extra pods during update
maxUnavailable: 25% # Allowed downtime
template: # Pod template
metadata:
labels:
app: web
spec:
containers:
- name: nginx
image: nginx:1.25
ports:
- containerPort: 80
livenessProbe:
httpGet:
path: /
port: 80
initialDelaySeconds: 30
periodSeconds: 10
Use Cases
- Web servers, APIs, microservices.
- Scaling: `kubectl scale deployment web --replicas=5`.
Pros/Cons
- Pros: Zero-downtime updates, self-healing.
- Cons: Not for stateful apps (use StatefulSet).
Rolling Updates
Rolling update is a zero-downtime deployment strategy in Kubernetes that gradually replaces old pods with new ones in a Deployment or StatefulSet. It ensures service availability by maintaining the desired number of replicas during updates (e.g., image version change), avoiding full outages.
- Why Used?: Minimizes disruption, supports canary/blue-green-like deployments, and auto-rollbacks on failures.
- When Applied?: Triggered by changes in the Deployment spec (e.g., `image: v1.0 → v1.1`).
Strategies
Kubernetes supports 2 strategies in Deployment/StatefulSet .spec.strategy.type:
| Strategy | Description | Use Case |
|---|---|---|
| RollingUpdate (default) | Gradually scales down old pods while scaling up new ones, maintaining availability. | Production apps needing zero-downtime. |
| Recreate | Kills all old pods first, then creates new ones. | Simple apps where downtime is acceptable (e.g., batch jobs). |
RollingUpdate Parameters
- `.maxSurge`: Max extra pods allowed during update (e.g., `25%` or `1` → temporary surge).
- `.maxUnavailable`: Max pods that can be unavailable (e.g., `25%` or `1` → controlled downtime).
- Default: `maxSurge: 25%`, `maxUnavailable: 25%`.
Monitor: `kubectl rollout status deployment/web`
Rollback: `kubectl rollout undo deployment/web`
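A small sketch of triggering and observing a rolling update on the `web-deployment` manifest above (the `nginx:1.26` tag is just an example):

```bash
# Change the image → the Deployment creates a new ReplicaSet and rolls pods gradually
kubectl set image deployment/web-deployment nginx=nginx:1.26

# Watch the rollout respect maxSurge / maxUnavailable
kubectl rollout status deployment/web-deployment

# Inspect revision history, then roll back if needed
kubectl rollout history deployment/web-deployment
kubectl rollout undo deployment/web-deployment --to-revision=1
```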
3. Service – Load Balancer & Service Discovery
Definition
- Stable endpoint for pods (abstracts pod IPs).
- Load balances traffic across matching pods.
- DNS-based discovery (e.g., `web.default.svc.cluster.local`).
- To reach the Deployment, create a Service and attach it to the Deployment's Pods via labels and selectors. The internal URL is then `{serviceName}.{namespace}.svc.cluster.local:{servicePort}`.
Types
| Type | Description | Use Case |
|---|---|---|
| ClusterIP (default) | Internal IP (10.96.x.x) | Internal services |
| NodePort | Exposes on node port (30000-32767) | Basic external access |
| LoadBalancer | Cloud LB (AWS ELB) | Production external |
| ExternalName | CNAME to external service | Integrate with legacy |
YAML Example
apiVersion: v1
kind: Service
metadata:
name: web-service
spec:
selector:
app: web # Matches deployment labels
ports:
- protocol: TCP
port: 80 # Service port
targetPort: 80 # Pod port
type: LoadBalancer
Use Cases
- Expose Deployment: `kubectl get svc` → external IP.
- Internal: Pods call `web-service:80`.
Pros/Cons
- Pros: Automatic load balancing, health checks.
- Cons: ClusterIP not external-facing.
How Services Identify Pods/Deployments
Services discover and route traffic to Pods using label selectors in the Service spec. They don't directly reference Deployments but match Pods created by Deployments/StatefulSets via shared labels.
- Mechanism:
  - Pods (from the Deployment) get labels (e.g., `app: web`).
  - Service `.spec.selector` matches these labels.
  - Kubernetes watches Pods and updates Service endpoints (Pod IPs) dynamically.
- YAML Example:

```yaml
# Deployment (creates Pods with labels)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web   # Pod label
    spec:
      containers:
      - name: nginx
        image: nginx
---
# Service (matches via selector)
apiVersion: v1
kind: Service
metadata:
  name: web-service
spec:
  selector:
    app: web   # Matches Pod labels
  ports:
  - port: 80
```

- Discovery: Pods are reachable at `web-service:80` (DNS: `web-service.default.svc.cluster.local`).
containerPort vs targetPort
- containerPort (Pod spec): The port the container listens on (documentation only; doesn't publish traffic).
- targetPort (Service spec): The port on the Pod that receives Service traffic (maps to containerPort; defaults to Service port if omitted).
| Field | Location | Purpose | Example |
|---|---|---|---|
| containerPort | Pod template (Deployment) | Container's listening port (info only) | containerPort: 8080 (app binds to 8080) |
| targetPort | Service spec | Pod port for incoming traffic | targetPort: 8080 (Service sends to Pod:8080) |
- YAML Example:

```yaml
# In the Deployment Pod spec
containers:
- name: app
  ports:
  - containerPort: 8080   # App listens here
```

```yaml
# In the Service
spec:
  ports:
  - port: 80         # Service port (e.g., DNS:80)
    targetPort: 8080 # Routes to Pod:8080
```
- Flow: Client → Service:80 → Pod:8080 (containerPort).
Other Critical Fields for Service Discovery & Health Checks
Ensure seamless discovery (stable endpoints) and health (traffic routing) with these fields:
| Field | Location | Purpose | Best Practice |
|---|---|---|---|
| selector | Service spec | Matches Pod labels for discovery | Use unique labels (e.g., app: web, tier: frontend). |
| labels | Pod template (Deployment) | Enables selector matching | Consistent across Deployment/Service (e.g., app: web). |
| readinessProbe | Pod template | Checks if Pod is ready for traffic; failure removes from endpoints | HTTP/TCP/exec probe; e.g., initialDelaySeconds: 30 for warmup. |
| livenessProbe | Pod template | Checks if Pod is alive; failure restarts Pod (affects discovery indirectly) | Less frequent than readiness; e.g., periodSeconds: 60. |
| port | Service spec | Service's listening port (e.g., for DNS) | Match app needs; use name: http for multiple ports. |
| protocol | Service/Port spec | Traffic protocol (TCP/UDP/SCTP) | TCP default; UDP for streaming. |
| sessionAffinity | Service spec | Sticky sessions (client IP-based) | ClientIP for stateful apps; timeout configurable. |
| publishNotReadyAddresses | Service spec | Include unready Pods in endpoints | true for pre-warmup traffic (rare). |
| annotations | Service metadata | Metadata (e.g., for Ingress controllers) | e.g., nginx.ingress.kubernetes.io/rewrite-target: /. |
YAML Snippet (Full Example)
apiVersion: apps/v1
kind: Deployment
spec:
template:
spec:
containers:
- name: app
readinessProbe:
httpGet:
path: /ready
port: 8080
initialDelaySeconds: 10
livenessProbe:
httpGet:
path: /health
port: 8080
periodSeconds: 30
apiVersion: v1
kind: Service
spec:
selector:
app: myapp
ports:
- name: http
protocol: TCP
port: 80
targetPort: 8080
sessionAffinity: ClientIP
publishNotReadyAddresses: false
Summary: Selectors enable discovery; probes ensure health; tune ports/probes for reliability. Misconfigured selectors cause "no endpoints" errors.
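A quick debugging sketch for the "no endpoints" case mentioned above (names follow the earlier `web-service` example):

```bash
# If the selector matches no Pods, ENDPOINTS shows <none>
kubectl get endpoints web-service

# Compare the Service selector with the actual Pod labels
kubectl describe svc web-service | grep -i selector
kubectl get pods -l app=web --show-labels
```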
4. DaemonSet – Run on Every Node
Definition
- Ensures one pod per node (or selected nodes).
- Node-specific – ideal for agents.
Key Features
- Scheduling: Runs on all (or tainted) nodes.
- Rolling Updates: Similar to Deployment.
YAML Example
apiVersion: apps/v1
kind: DaemonSet
metadata:
name: fluentd-logging
spec:
selector:
matchLabels:
name: fluentd
template:
metadata:
labels:
name: fluentd
spec:
containers:
- name: fluentd
image: fluent/fluentd:v1.14
volumeMounts:
- name: varlog
mountPath: /var/log
volumes:
- name: varlog
hostPath:
path: /var/log
tolerations: # Run on tainted nodes
- operator: Exists
Use Cases
- Logging (Fluentd), monitoring (Prometheus Node Exporter), storage (CSI drivers).
Pros/Cons
- Pros: Automatic per-node deployment.
- Cons: Scales with nodes; resource-heavy.
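To confirm the DaemonSet really landed one pod per node, a quick check (labels and name follow the fluentd example above):

```bash
# One fluentd pod per node, plus the node each one landed on
kubectl get pods -l name=fluentd -o wide

# DESIRED/CURRENT/READY should equal the number of eligible nodes
kubectl get daemonset fluentd-logging
```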
5. Secrets – Secure Data Management
Definition
- Stores sensitive data (passwords, tokens, keys) as base64-encoded strings.
- Mounts as volumes or env vars (can be encrypted at rest in etcd, but not by default).
Key Features
- Base64 Encoding: Not encryption (use external tools for strong secrets).
- Access Control: RBAC for reading.
Security Aspects of Secrets
Kubernetes Secrets store sensitive data (e.g., passwords, API keys, tokens) as base64-encoded strings in etcd (the cluster's key-value store). Key security features:
- Access Control: Protected by RBAC (Role-Based Access Control) policies; only authorized pods/services can read them.
- Encryption at Rest: etcd can be configured for encryption at rest (via an encryption provider config); Secrets are not encrypted by default but can be, or managed with external tools like Vault.
- Transmission: Data is transmitted over TLS (via the API server); not logged in plaintext.
Mounts
Mounting injects data into Pods as environment variables (env) or volumes (volumeMounts). Volumes are preferred for files; env vars for simple values. Defined in Deployment's Pod template (.spec.template.spec).
- As Environment Variables: Injects keys as vars (e.g., DB_PASSWORD).
- As Volumes: Mounts keys as files in a directory (e.g., /etc/secrets/).
NOTE: When Secrets and ConfigMaps are mounted as volumes (volumeMounts in the Pod spec), updates propagate automatically without a Pod restart: the kubelet periodically re-syncs the mounted files (via atomic symlink swaps), so changes appear in place after a short delay. Env-var injection and subPath mounts do NOT pick up updates automatically.
YAML Example
apiVersion: v1
kind: Secret
metadata:
name: db-secret
type: Opaque
data:
username: YXBwdXNlcg== # base64: "appuser"
password: U3VwZXJTZWNyZXQxMjM= # base64: "SuperSecret123"
apiVersion: v1
kind: Pod
metadata:
name: secret-pod
spec:
containers:
- name: app
image: myapp
env:
- name: DB_USER
valueFrom:
secretKeyRef:
name: db-secret
key: username
volumeMounts:
- name: secret-volume
mountPath: /etc/secrets
volumes:
- name: secret-volume
secret:
secretName: db-secret
Use Cases
- DB credentials, API keys, TLS certs.
Pros/Cons
- Pros: Avoids hardcoding secrets.
- Cons: Base64 is reversible; use Vault for advanced.
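Instead of hand-encoding base64 values as in the manifest above, the same Secret can be created imperatively; a sketch using the example values:

```bash
# kubectl base64-encodes the values for you
kubectl create secret generic db-secret \
  --from-literal=username=appuser \
  --from-literal=password='SuperSecret123'

# Inspect (values remain base64-encoded in the output)
kubectl get secret db-secret -o yaml
```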
6. ConfigMap – Non-Sensitive Configuration
Definition
- Stores config data (env vars, files) as key-value pairs.
- Mounts dynamically without rebuilding images.
Key Features
- Data Sources: Key-value, files, or literals.
- Updates: Reload pods without restart (for some apps).
Mounts
- Same as Secrets: inject as env vars or mount as volume files.
YAML Example
apiVersion: v1
kind: ConfigMap
metadata:
name: app-config
data:
database_url: "postgres://localhost:5432/myapp"
log_level: "INFO"
app_name: "MyApp v1.0"
apiVersion: v1
kind: Pod
metadata:
name: config-pod
spec:
containers:
- name: app
image: myapp
env:
- name: DB_URL
valueFrom:
configMapKeyRef:
name: app-config
key: database_url
volumeMounts:
- name: config-volume
mountPath: /etc/config
volumes:
- name: config-volume
configMap:
name: app-config
Use Cases
- App configs, feature flags, env-specific settings.
Pros/Cons
- Pros: Decouples config from code.
- Cons: Not encrypted (use Secrets for sensitive).
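As with Secrets, a ConfigMap can be created imperatively from literals or a file; a small sketch mirroring the `app-config` example (the `app.properties` filename is hypothetical):

```bash
# From literals
kubectl create configmap app-config \
  --from-literal=log_level=INFO \
  --from-literal=database_url='postgres://localhost:5432/myapp'

# Or from an existing config file (key = filename, value = file contents)
kubectl create configmap app-config-file --from-file=app.properties
```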
7. StatefulSet – For Stateful Apps
Definition
- Manages stateful workloads (e.g., databases) with stable identities.
- Ordered deployment/scaling, persistent storage.
Stateful Workload
A stateful workload is an application or service that maintains persistent state (data, configuration, or identity) across restarts, updates, or failures. It requires stable, ordered, and persistent storage to function correctly, unlike stateless workloads where instances are interchangeable.
Key Characteristics
- Persistent Data: Relies on durable storage (e.g., databases with user records).
- Stable Identity: Needs consistent naming/ordering (e.g., db-0, db-1).
- Ordered Operations: Scaling/updates must follow sequence (e.g., primary replica before secondary).
Examples
- Stateful: Databases (MySQL, MongoDB), message queues (Kafka), clustered apps (ZooKeeper).
- Stateless: Web servers (Nginx), APIs (FastAPI), simple microservices (no local data).
Why It Matters in Kubernetes
- Deployment: Use StatefulSet for stable Pods, headless Services, and PersistentVolumes (PVs).
- Challenges: Scaling requires coordination; failures need data migration.
- vs. Stateless: Deployments handle stateless apps easily (interchangeable replicas).
Summary: Stateful = "remembers who it is and what it knows" (e.g., your bank account balance). Use for data-heavy apps; stateless = "doesn't care" (e.g., a calculator).
Key Features
- Stable Names: Pods named `db-0`, `db-1` (not random).
- Headless Service: Direct pod access via DNS.
- A Headless Service is a Kubernetes Service with clusterIP: None.
- It does NOT get a single virtual IP — instead, it returns direct DNS A records for each Pod.
- Persistent Volumes: Binds storage to pod identity.
YAML Example
apiVersion: apps/v1
kind: StatefulSet
metadata:
name: mysql
spec:
serviceName: "mysql-headless"
replicas: 3
selector:
matchLabels:
app: mysql
template:
metadata:
labels:
app: mysql
spec:
containers:
- name: mysql
image: mysql:8.0
env:
- name: MYSQL_ROOT_PASSWORD
value: "password"
volumeMounts:
- name: mysql-storage
mountPath: /var/lib/mysql
volumeClaimTemplates:
- metadata:
name: mysql-storage
spec:
accessModes: ["ReadWriteOnce"]
resources:
requests:
storage: 10Gi
apiVersion: v1
kind: Service
metadata:
name: mysql-headless
spec:
clusterIP: None # Headless
selector:
app: mysql
ports:
- port: 3306
Use Cases
- Databases (MySQL, MongoDB), message queues (Kafka), clustered apps.
Pros/Cons
- Pros: Ordered scaling, stable storage.
- Cons: Slower scaling than Deployment.
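To see the stable per-Pod DNS records the headless Service provides, a quick sketch (busybox is just a convenient image with `nslookup`; any DNS-capable image works):

```bash
# The headless Service returns one A record per pod instead of a single ClusterIP
kubectl run -it --rm dns-test --image=busybox:1.36 --restart=Never \
  -- nslookup mysql-headless.default.svc.cluster.local

# Individual pods are addressable as <pod>.<service>, e.g.:
#   mysql-0.mysql-headless.default.svc.cluster.local
```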
8. Other Key Resources
ReplicaSet
- Ensures exact replica count (used by Deployment).
- YAML: Similar to Deployment but no strategy.
Job & CronJob
- Job: Run to completion (e.g., batch processing).
- CronJob: Scheduled jobs (e.g., daily backups).
- Example:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: backup-job
spec:
  template:
    spec:
      containers:
      - name: backup
        image: backup-tool
      restartPolicy: Never
```
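CronJob is mentioned above but not shown; a minimal sketch of a daily backup schedule (the `backup-tool` image is reused from the Job example, and the schedule/history limits are illustrative):

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: daily-backup
spec:
  schedule: "0 2 * * *"          # Every day at 02:00
  successfulJobsHistoryLimit: 3  # Keep the last 3 completed Jobs
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: backup
            image: backup-tool
          restartPolicy: Never
```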
Resource Relationships
User (YAML) → API Server → etcd
↓
Controller Loop
↓
Deployment → ReplicaSet → Pod → Container
↓
Service → Load Balance
Summary Table
| Resource | Use Case | Key Feature |
|---|---|---|
| Pod | Basic unit | 1+ containers |
| Deployment | Stateless apps | Rolling updates |
| Service | Exposure | Load balancing |
| DaemonSet | Node agents | Per-node pods |
| Secrets | Sensitive data | Base64-encoded env/files (encrypt etcd at rest) |
| ConfigMap | Config | Dynamic injection |
| StatefulSet | Databases | Ordered, stable |
Golden Rule:
Declarative YAML + Controllers = Self-healing cluster
Define desired state → K8s makes it real.
Now deploy a Deployment + Service and watch K8s orchestrate!
Kubernetes Autoscalers: HPA vs VPA
Kubernetes autoscalers dynamically adjust resources based on workload demands. HPA scales horizontally (more/fewer pods), while VPA scales vertically (CPU/memory allocation). Neither attaches to Services (Services route traffic to existing pods); they target Deployments, StatefulSets, or ReplicaSets (for HPA) or Pods (for VPA).
Horizontal Pod Autoscaler (HPA)
HPA automatically scales the number of pods in a target resource (e.g., Deployment) based on observed metrics like CPU utilization, memory, or custom metrics (via Metrics Server or Prometheus Adapter).
- Monitors metrics (default: 80% CPU threshold).
- Scales up/down to maintain target (e.g., replicas = current load / target utilization).
- Min/max replicas configurable.
Attachment
- Targets: Deployment, StatefulSet, ReplicaSet.
- YAML: Reference via `.spec.scaleTargetRef` (e.g., `kind: Deployment`, `name: web`).
YAML Example
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: web-hpa
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: web # Attaches to Deployment "web"
minReplicas: 2
maxReplicas: 10
metrics:
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 50 # Scale at 50% CPU
- type: Resource
resource:
name: memory
target:
type: AverageValue
averageValue: 500Mi # Scale at 500Mi memory
Commands
- Apply: `kubectl apply -f hpa.yaml`
- Monitor: `kubectl get hpa`, `kubectl describe hpa web-hpa`
- Delete: `kubectl delete hpa web-hpa`
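The same HPA can also be created imperatively; a rough equivalent of the CPU metric above:

```bash
# Creates an HPA targeting Deployment "web" at 50% average CPU utilization
kubectl autoscale deployment web --cpu-percent=50 --min=2 --max=10

# Watch current vs target utilization and replica count
kubectl get hpa web --watch
```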
When to Use
- High-traffic apps (e.g., web servers) with variable load.
- Cost optimization: Scale down during low traffic.
- Not for: Stateful apps (use StatefulSet HPA cautiously) or fixed-size workloads.
Important Details
- Requires Metrics Server (so `kubectl top pods` works).
- Cooldown: 5 min default between scales.
- Pros: Simple, reactive scaling.
- Cons: Doesn't predict spikes; may overprovision.
How HPA Works with StatefulSets
- Scaling Up:
  - HPA increases replicas (e.g., from 3 to 5).
  - StatefulSet controller creates new Pods in order (e.g., `web-3`, then `web-4`).
  - Each new Pod gets a stable hostname (`web-3.<statefulset-name>.<namespace>.svc.cluster.local`) and attaches to its corresponding PersistentVolumeClaim (PVC) (e.g., `data-web-3`).
  - Pods join the cluster (e.g., as replicas in a database like etcd).
- Scaling Down:
  - HPA decreases replicas (e.g., from 5 to 3).
  - StatefulSet controller deletes the highest-indexed Pods first (e.g., `web-4`, then `web-3`).
  - Deleted Pods are terminated gracefully (termination grace period, default 30s).
- Data Persistence on Downscale:
  - Yes, data persists: StatefulSets bind PVCs to Pod identities (ordinal index).
  - When scaling down, only the Pod is deleted; the PVC (and its bound PersistentVolume/PV) remains.
  - Example: Scaling from 3 to 2 deletes `web-2`; the `data-web-2` PVC persists.
  - Re-attach on Scale-Up: If scaled back to 3, `web-2` is re-created and re-mounts the `data-web-2` PVC, preserving data.
  - No Data Loss: Unlike Deployments (ephemeral storage), StatefulSets ensure ordered persistence.
- Metrics & Triggers:
  - Same as Deployments: Monitors CPU/memory/custom metrics.
  - HPA calculates: `desiredReplicas = ceil[currentReplicas × (currentMetricValue / desiredMetricValue)]` (see the worked example below).
  - Cooldown: 5 min default between scales.
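A quick worked instance of that formula, under assumed numbers:

```
currentReplicas = 3, currentMetricValue = 90% CPU, desiredMetricValue = 50% CPU
desiredReplicas = ceil[3 × (90 / 50)] = ceil[5.4] = 6   (then clamped to minReplicas/maxReplicas)
```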
Nuances & Considerations for StatefulSets
- Ordered Scaling: Unlike Deployments (random Pod deletion), StatefulSets scale down from the end (highest ordinal first). Use `.spec.updateStrategy.rollingUpdate.partition` for canary-like control.
- Headless Service: Required for StatefulSet discovery (DNS: `web-2.web-headless.default.svc.cluster.local`); HPA doesn't affect it.
- Storage Coordination: Ensure PVs are zone-aware (topology keys) for multi-zone clusters to avoid data locality issues.
- Metrics Challenges: Stateful Pods may have uneven load (e.g., primary replica); use custom metrics (e.g., via Prometheus Adapter) for accurate scaling.
- Downtime Risk: Downscaling may disrupt state (e.g., lose quorum in a 3-node etcd); set `minReplicas` high and use PodDisruptionBudgets (PDBs) to limit evictions.
- Not for All: HPA works but test thoroughly; for databases, prefer vertical scaling (VPA) or manual control.
- Limits: Max replicas capped by cluster capacity; HPA ignores PVC provisioning.
Best Practice: Combine with PDBs (kubectl create pdb web-pdb --min-available=2) to prevent too many simultaneous downscales.
Summary: HPA scales StatefulSets like Deployments but preserves data via PVCs and ordered identities. Use for elastic stateful apps (e.g., Kafka replicas); monitor for state consistency.
Vertical Pod Autoscaler (VPA)
VPA automatically adjusts Pod resource requests/limits (CPU/memory) based on historical usage, recommending or enforcing changes. It performs vertical scaling (resizing existing pods).
- Analyzes metrics (via Metrics Server/Prometheus).
- Recommends (`updateMode: "Off"`) or applies (`updateMode: "Auto"`) resource changes.
- Evicts/recreates pods to apply changes (downtime risk).
Attachment
- Targets: Pods, Deployments, StatefulSets (via Pod template).
- YAML: No direct "attachment"; VPA watches the target via `.spec.targetRef` (e.g., Deployment name).
YAML Example
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
name: web-vpa
spec:
targetRef:
apiVersion: apps/v1
kind: Deployment
name: web # Targets Deployment "web"
updatePolicy:
updateMode: "Auto" # "Off" for recommendations only
resourcePolicy:
containerPolicies:
- containerName: "*"
minAllowed:
cpu: 100m
memory: 50Mi
maxAllowed:
cpu: 1
memory: 500Mi
Commands
- Apply: `kubectl apply -f vpa.yaml`
- View Recommendations: `kubectl get vpa web-vpa -o yaml` (under `.status.recommendation`).
- Apply Updates: in `Auto` mode the VPA updater evicts Pods itself; to force an immediate resize, delete the Pod (`kubectl delete pod <pod-name>`) so it is recreated with the new requests.
When to Use
- Resource-inefficient apps (e.g., over/under-provisioned pods).
- Cost savings: Right-size based on actual usage.
- Not for: Apps with bursty loads (use HPA) or strict limits (manual tuning better).
Important Details
- Requires VPA admission controller.
- Modes: `Off` (recommendations only), `Initial` (set on create), `Auto` (enforce, with eviction).
- Pros: Optimizes resources; learns from usage.
- Cons: Causes restarts; not for all apps (e.g., databases).
Key Differences & Best Practices
| Aspect | HPA | VPA |
|---|---|---|
| Scaling Type | Horizontal (# pods) | Vertical (CPU/memory) |
| Target | Deployment/StatefulSet | Deployment/Pod |
| Downtime | Minimal (rolling) | Potential (eviction/recreate) |
| Metrics | CPU/memory/custom | Historical usage |
- Combine: Use HPA for traffic spikes, VPA for baseline optimization.
- Monitor: `kubectl top nodes` / `kubectl top pods` for metrics.
- When: HPA for dynamic load; VPA for static apps.
- Caution: VPA in Auto mode can disrupt; start with Off.
Kubernetes Pod Scheduling
Pod scheduling in Kubernetes involves the Scheduler deciding which node runs a Pod based on resource availability, constraints, and preferences. Key mechanisms ensure Pods land on suitable nodes while avoiding unsuitable ones. Below is a concise explanation of the core concepts.
1. Node Taints
- Definition: Taints are repellent markers applied to nodes (via `kubectl taint nodes`) that prevent Pods from scheduling unless they tolerate the taint. They act as "do not disturb" signals.
- Purpose: Reserve nodes for specific workloads (e.g., dedicated DB nodes) or mark unhealthy nodes.
- Types:
- NoSchedule: Prevents new Pods from scheduling.
- PreferNoSchedule: Soft repellent (scheduler prefers avoidance).
- NoExecute: Evicts existing Pods + prevents new ones.
- Example (Apply Taint):

```bash
kubectl taint nodes worker-1 key=value:NoSchedule
```

- Effect: Untolerated Pods are rejected; e.g., taint `dedicated=db:NoSchedule` reserves the node for DB Pods only.
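A couple of follow-up commands for working with taints (node name reused from the example above):

```bash
# Show the taints currently on a node
kubectl describe node worker-1 | grep -A3 Taints

# Remove the taint (trailing "-" deletes it)
kubectl taint nodes worker-1 key=value:NoSchedule-
```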
2. Tolerations
- Definition: Tolerations are Pod-level settings (in `.spec.tolerations`) that allow Pods to ignore specific taints and schedule on tainted nodes.
- Purpose: Enables Pods to run on reserved/tainted nodes (e.g., high-CPU nodes).
- Matching: A toleration must match the taint's key, value, and effect (operator: `Exists` for any value, `Equal` for exact).
- YAML Example (in Pod/Deployment spec):

```yaml
spec:
  tolerations:
  - key: "key"
    operator: "Equal"
    value: "value"
    effect: "NoSchedule"
  - key: "dedicated"
    operator: "Exists"
    effect: "NoExecute"   # Tolerates any value
```
- Nuance: Tolerations don't prefer tainted nodes; they just allow scheduling.
3. Pod Affinity
- Definition: Affinity rules in the Pod spec (`.spec.affinity`) prefer or require Pods to schedule on nodes matching certain conditions (e.g., labels).
- Purpose: Co-locate Pods for performance (e.g., app near its DB).
- Types:
  - RequiredDuringSchedulingIgnoredDuringExecution: Hard requirement (fail if no match).
  - PreferredDuringSchedulingIgnoredDuringExecution: Soft preference (score-based).
- Pod Affinity: Co-locate with other Pods (e.g., `topologyKey: kubernetes.io/hostname` for same node).

```yaml
podAffinity:
  requiredDuringSchedulingIgnoredDuringExecution:
  - labelSelector:
      matchExpressions:
      - key: app
        operator: In
        values: ["cache"]
    topologyKey: kubernetes.io/hostname
```
4. Pod Anti-Affinity
- Definition: Opposite of affinity; avoids scheduling Pods on nodes with matching conditions.
- Purpose: Spread Pods for high availability (e.g., replicas on different nodes/zones).
- Types: Required (hard) or Preferred (soft).
- YAML Example:

```yaml
spec:
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100              # Higher = stronger preference
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app: web
          topologyKey: kubernetes.io/hostname   # Avoid same node
```

- Nuance: Use `topologyKey: topology.kubernetes.io/zone` for zone spreading (the older `failure-domain.beta.kubernetes.io/zone` label is deprecated).
5. Pod Disruption Budget (PDB)
- Definition: PDBs (via `kubectl create pdb` or YAML) limit voluntary disruptions (e.g., node drains, scaling) to ensure a minimum number of available Pods.
- Purpose: Prevents too many Pods from going down simultaneously (e.g., during upgrades).
- Fields:
  - `minAvailable`: Min Pods that must stay available (e.g., `2` or `50%`).
  - `maxUnavailable`: Max Pods that can be unavailable (e.g., `1` or `25%`).
- YAML Example:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb
spec:
  minAvailable: 2   # At least 2 Pods up
  selector:
    matchLabels:
      app: web
```

- Nuance: Applies to Deployments/StatefulSets; ignored during involuntary disruptions (e.g., node failure).
6. Node Selectors
- Definition: Simple, declarative way to constrain Pod scheduling to nodes matching specific labels (key-value pairs on nodes). It's a hard filter—Pods only schedule on matching nodes.
- Purpose: Basic node affinity without complex expressions (e.g., target high-CPU nodes).
- How It Works: Defined in the Pod spec (`.spec.nodeSelector`); the Scheduler filters to nodes where all key-value pairs match.
- YAML Example:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: high-cpu-pod
spec:
  nodeSelector:
    cpu-type: high-performance   # Matches nodes labeled 'cpu-type=high-performance'
  containers:
  - name: app
    image: myapp:1.0
```

- Apply Label to Node: `kubectl label nodes worker-1 cpu-type=high-performance`
- Nuances:
- Ignores taints (combine with tolerations).
- Simple but limited (no OR logic; use nodeAffinity for advanced).
- When to Use: Simple zoning (e.g., dev/prod nodes); not for dynamic rules.
7. Topology Spread Constraints
- Definition: Ensures Pods are evenly distributed across topology domains (e.g., zones, nodes, regions) to improve availability and resource utilization.
- Purpose: Prevents all Pods from landing on one node/zone (e.g., for HA).
- How It Works: Scheduler scores based on `whenUnsatisfiable` (`ScheduleAnyway`/`DoNotSchedule`) and `maxSkew` (max imbalance). Uses `topologyKey` (e.g., `topology.kubernetes.io/zone`).
- YAML Example:

```yaml
apiVersion: apps/v1
kind: Deployment
spec:
  template:
    spec:
      topologySpreadConstraints:
      - maxSkew: 1                                # Max 1 Pod difference per zone
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: DoNotSchedule          # Hard constraint
        labelSelector:
          matchLabels:
            app: web
```

- Nuances:
- Applies to Pods matching the selector.
- Combines with affinity (e.g., spread replicas across AZs).
- When to Use: Multi-zone clusters for fault tolerance; avoids single points of failure.
8. Priority and Preemption
- Definition: Assigns priority levels to Pods via PriorityClasses, enabling preemption (eviction of lower-priority Pods when resources are scarce).
- Purpose: Ensures critical workloads (e.g., system Pods) run first by evicting non-critical ones.
- How It Works:
- PriorityClass: Global resource defining priority (e.g., 1000 for high, -1 for low).
- Preemption: Scheduler evicts lower-priority Pods if a higher one can't schedule.
- YAML: Referenced in the Pod spec (`.spec.priorityClassName`).
- YAML Example:

```yaml
# PriorityClass (cluster-wide)
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000               # Higher = more important
globalDefault: false
description: "Critical workloads"
---
# Pod using it
apiVersion: v1
kind: Pod
spec:
  priorityClassName: high-priority
  containers:
  - name: critical-app
    image: critical:1.0
```
- Nuances:
- Eviction uses PDBs to limit impact.
- System Pods (e.g., kube-system) have high defaults.
- When to Use: Resource-constrained clusters; prioritize monitoring over dev workloads.
9. Scheduler Plugins
- Definition: Extensible components in the kube-scheduler that perform filtering (eliminate unfit nodes) and scoring (rank remaining nodes).
- Purpose: Customizes scheduling logic (e.g., for GPU affinity or cost optimization).
- How It Works:
- Filter Plugins: Hard checks (e.g., NodeAffinity, TaintToleration).
- Score Plugins: Weighted scoring (e.g., ImageLocality for faster pulls).
- Configured via the scheduler configuration (e.g., `kube-scheduler.yaml`).
- YAML Example (Custom Config Snippet):

```yaml
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: my-scheduler
  plugins:
    filter:
      enabled:
      - name: NodeAffinity
      - name: TaintToleration
    score:
      enabled:
      - name: ImageLocality   # Prefer nodes with cached images
        weight: 10
```
- Nuances:
- Default scheduler has ~20 plugins; extend via custom scheduler (e.g., Volcano for batch).
- Order matters (early filters prune faster).
- When to Use: Advanced needs (e.g., gang scheduling for ML jobs); default suffices for most.
10. Node Affinity
Node Affinity is a scheduling constraint that allows Pods to prefer or require specific nodes based on node labels (key-value pairs). It's part of the broader Affinity mechanism (.spec.affinity.nodeAffinity in Pod spec) and extends simple Node Selectors with more flexible expressions (e.g., OR logic, operators).
Definition & Purpose
- Hard Requirement: Ensures Pods only schedule on matching nodes (e.g., nodes with SSDs).
- Soft Preference: Scores nodes for better placement (e.g., prefer low-latency zones).
- Use Case: Resource optimization (e.g., GPU nodes for ML), zoning (dev/prod separation), or performance (local storage nodes).
Types
| Type | Description | Enforcement |
|---|---|---|
| RequiredDuringSchedulingIgnoredDuringExecution | Hard rule: Fail if no match. | Must satisfy for scheduling. |
| PreferredDuringSchedulingIgnoredDuringExecution | Soft rule: Score nodes (0-100); schedule anywhere if no match. | Weighted preference. |
Key Fields
- nodeSelectorTerms: Array of terms (OR logic across terms; AND within expressions).
- matchExpressions: Operators like `In`, `NotIn`, `Exists`, `DoesNotExist`, `Gt`, `Lt`.
- matchFields: Matches node fields (e.g., `spec.unschedulable`); less common.
YAML Example
apiVersion: v1
kind: Pod
metadata:
name: gpu-pod
spec:
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution: # Hard: Must have GPU
nodeSelectorTerms:
- matchExpressions:
- key: gpu-type
operator: In
values: ["nvidia-a100"]
preferredDuringSchedulingIgnoredDuringExecution: # Soft: Prefer zone
- weight: 80 # Higher = stronger preference
preference:
matchExpressions:
- key: topology.kubernetes.io/zone
operator: In
values: ["us-west-2a"]
containers:
- name: ml-app
image: ml-app:1.0
Nuances
- Labels: Apply to nodes via `kubectl label nodes worker-1 gpu-type=nvidia-a100`.
- vs Node Selectors: Affinity is more expressive (multiple terms, operators); Selectors are simple equality.
- Dynamic: Labels can change post-scheduling (no re-evaluation).
- Performance: Soft rules add scoring overhead; use sparingly.
When to Use
- Hard: Critical hardware (e.g., GPUs).
- Soft: Optimization (e.g., zone preference for latency).
- Avoid: Overly restrictive rules causing scheduling failures.
Summary: Node Affinity refines node selection with flexible matching—hard for requirements, soft for preferences. Tune with labels for targeted scheduling.
Overall Scheduling Flow
- Filtering: Apply selectors, taints/tolerations, resources, affinity (hard rules).
- Scoring: Rank survivors (affinity weights, spread, plugins).
- Binding: Assign Pod to best node.
- Preemption: If no fit, evict lower-priority Pods (respects PDBs).
Summary: Taints repel, tolerations allow, affinity attracts/repels, PDB protects availability. Tune for HA, performance, and cost. Selectors filter basically, topology spreads evenly, priority preempts, plugins customize. Use for balanced, resilient clusters.
Kubernetes Storage
Kubernetes storage enables Pods to access persistent data across restarts, nodes, and clusters. Unlike ephemeral container storage, it uses ephemeral volumes (temporary) and persistent storage (durable).
1. Volumes
- Definition: A directory accessible to Pods, providing storage inside containers. Volumes outlive container lifecycle but tie to Pod lifecycle (deleted when Pod dies).
- Purpose: Share data between containers in a Pod or persist temporary data.
- Types (Ephemeral):
| Type | Description | Use Case |
|------|-------------|---------|
| emptyDir | Temporary, node-local (deleted on Pod eviction). | Scratch space, logs. |
| hostPath | Mounts host directory (e.g., /var/log). | Access host files (insecure). |
| configMap/Secret | Mounts ConfigMap/Secret as files. | Config injection. |
- YAML Example (in Pod spec):

```yaml
spec:
  volumes:
  - name: temp-storage
    emptyDir: {}
  - name: host-logs
    hostPath:
      path: /var/log
      type: DirectoryOrCreate
```
- Nuances: Ephemeral; for persistence, use PV/PVC.
2. VolumeMounts
- Definition: Specifies how a Volume is mounted into a container (path and read-only flag).
- Purpose: Injects storage into specific containers within a Pod.
- YAML Example:

```yaml
spec:
  containers:
  - name: app
    volumeMounts:
    - name: temp-storage     # References volume
      mountPath: /app/tmp    # Inside container
      readOnly: false
```

- Nuances: Multiple mounts per volume; `subPath` for selective files (e.g., `subPath: config.yaml`).
3. PersistentVolume (PV)
- Definition: A cluster-wide storage resource representing physical storage (e.g., AWS EBS volume, NFS share). It's a "piece of storage in the cluster."
- Purpose: Abstracts backend storage; provisioned manually or dynamically.
- Key Fields:
- Capacity: Size (e.g., `storage: 10Gi`).
- AccessModes: How it's mounted (`ReadWriteOnce` (RWO): single node; `ReadWriteMany` (RWX): multi-node; `ReadOnlyMany` (ROX)).
- Reclaim Policy: What happens on PVC deletion (Retain: keep PV; Delete: destroy; Recycle: scrub).
- YAML Example (Static PV):

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: my-pv
spec:
  capacity:
    storage: 10Gi
  accessModes:
  - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: standard
  hostPath:
    path: /data
```
- Nuances: Static (manual) vs dynamic (StorageClass provisions); bound to one PVC at a time.
4. PersistentVolumeClaim (PVC)
- Definition: A Pod's request for storage, like a "storage ticket." It binds to a matching PV and is used in Pod specs.
- Purpose: Decouples Pods from storage details; Pods request "10Gi RWO" without knowing the backend.
- Key Fields:
- Requests: Desired capacity/access modes.
- StorageClassName: Matches PV's class for dynamic provisioning.
- YAML Example:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-pvc
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
  storageClassName: standard
```

- Usage in Pod/Deployment:

```yaml
spec:
  volumes:
  - name: persistent-storage
    persistentVolumeClaim:
      claimName: my-pvc   # References PVC
  containers:
  - name: app
    volumeMounts:
    - name: persistent-storage
      mountPath: /data
```
- Nuances: Namespace-scoped; unbound PVCs wait for PV; dynamic provisioning creates PV if no match.
5. StorageClasses
- Definition: Defines storage "classes" (e.g., fast SSD vs cheap HDD) for dynamic provisioning. Acts as a template for PV creation.
- Purpose: Abstracts storage backends (e.g., AWS EBS, GCE PD); enables policy-based provisioning.
- Key Fields:
- Provisioner: Backend driver (e.g., `ebs.csi.aws.com`).
- Parameters: Options (e.g., volume type: `gp3`).
- AllowVolumeExpansion: Resize PVCs.
- Default: Marked for auto-use if unspecified.
- YAML Example:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-ssd
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
allowVolumeExpansion: true
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer   # Delay binding until Pod schedules
```

- Nuances: CSI (Container Storage Interface) drivers for modern backends; multiple classes for tiered storage.
Other Fundamental Storage Concepts
- Access Modes:

| Mode | Description | Example |
|------|-------------|---------|
| RWO | Read/Write by single node | EBS volumes |
| RWX | Read/Write by multiple nodes | NFS, CephFS |
| ROX | Read-only by multiple nodes | CD-ROM images |

- Reclaim Policies (PV spec):

| Policy | Effect on PVC Delete |
|--------|----------------------|
| Retain | PV persists; manual cleanup needed |
| Delete | PV and storage destroyed |
| Recycle | PV scrubbed and reused (deprecated) |

- Dynamic Provisioning: StorageClass + provisioner auto-creates PVs when PVC requests match (e.g., an unbound PVC triggers EBS volume creation).
- Volume Expansion: Resize PVCs online (if the StorageClass allows); e.g., `kubectl edit pvc my-pvc` → increase `requests.storage: 20Gi`.
- CSI Drivers: Modern standard for storage plugins (e.g., AWS EBS CSI); replaces in-tree drivers.
- Storage Ephemerality: Without PV/PVC, data lost on Pod restart; use for caches (emptyDir) vs databases (PV).
Flow
- Create StorageClass (template).
- Create PVC (request) → binds to PV (storage).
- Pod/Deployment references PVC via volumes/volumeMounts.
- Data persists across Pod restarts/nodes (if RWX).
Summary: PV = storage supply, PVC = demand, StorageClass = provisioning rules. Use for stateful apps; ephemeral volumes for temp data.
Custom Resources (CRs) in Kubernetes
Custom Resources (CRs) are user-defined extensions to the Kubernetes API that allow you to create your own objects (like Pod, Deployment) with custom behavior. They are the foundation of Kubernetes Operators and extensibility. Without CRs, you'd be limited to generic resources — forcing complex logic into ConfigMaps, annotations, or external systems.
What is a Custom Resource?
A Custom Resource is a user-defined object stored in Kubernetes etcd that extends the Kubernetes API.
- Example: Instead of only managing `Pod`, you can define `Database`, `Backup`, `GameServer`, etc.
- Analogy:
  Built-in resources = `int`, `string`
  Custom Resources = `class Database { ... }`
apiVersion: mycompany.com/v1
kind: Database
metadata:
name: prod-db
spec:
size: 100Gi
engine: postgres
Core Concepts of Custom Resources
| Concept | Explanation |
|---|---|
| 1. CRD (Custom Resource Definition) | The schema that defines your new object type (like a database table schema). |
| 2. Custom Resource (CR) | An instance of the CRD (like a row in the table). |
| 3. API Group & Version | CRs live in custom API groups (e.g., stable.example.com/v1, databases.mycompany.com/v1alpha1). |
| 4. Controller | A reconciler (usually in an Operator) that watches CRs and makes the world match the desired state. |
| 5. Validation | OpenAPI v3 schema in CRD to enforce structure (e.g., size > 0). |
| 6. Storage | CRs are stored in etcd just like built-in objects. |
| 7. Namespacing | Can be namespaced or cluster-scoped. |
1. CRD – The Blueprint
CRD YAML Structure
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
name: databases.mycompany.com # <plural>.<group>
spec:
group: mycompany.com
versions:
- name: v1
served: true
storage: true
schema:
openAPIV3Schema:
type: object
properties:
spec:
type: object
properties:
size:
type: integer
minimum: 1
engine:
type: string
enum: [postgres, mysql]
scope: Namespaced # or Cluster
names:
plural: databases
singular: database
kind: Database
shortNames: [db]
Key Fields
| Field | Purpose |
|---|---|
| group | Your domain (reverse DNS) |
| versions | Supports multiple (like v1, v1beta1) |
| storage: true | Only one version stores data |
| scope | Namespaced or Cluster |
| names.kind | The object type in YAML (kind: Database) |
| shortNames | CLI shortcuts (kubectl get db) |
2. Custom Resource (CR) – The Instance
apiVersion: mycompany.com/v1
kind: Database
metadata:
name: prod-postgres
namespace: production
spec:
size: 100
engine: postgres
backupPolicy: daily
Apply:
kubectl apply -f database.yaml
View:
kubectl get databases
kubectl get db prod-postgres -o yaml
3. Controller – The Brain (Reconciliation Loop)
A controller watches CRs and makes the actual state match the desired state.
Reconciliation Loop
1. Watch CR events (create/update/delete)
2. Read current state (from cluster)
3. Read desired state (from CR spec)
4. Compare
5. Take action (create PVC, deploy StatefulSet, etc.)
6. Update status
7. Repeat
4. Status Subresource
CRs have two parts:
- .spec → desired state (input)
- .status → observed state (output)
status:
phase: Running
replicas: 3
conditions:
- type: Ready
status: "True"
lastUpdate: "2025-04-05T10:00:00Z"
Controller owns `.status`, the user owns `.spec`.
5. Validation & Defaulting
OpenAPI v3 Schema in CRD
schema:
openAPIV3Schema:
type: object
required: [spec]
properties:
spec:
type: object
required: [size, engine]
properties:
size:
type: integer
minimum: 1
maximum: 1000
engine:
type: string
enum: [postgres, mysql]
Default Values (via Webhook): Use a mutating admission webhook to set defaults, e.g., if `spec.size` is omitted, the webhook sets it to `50`.
Real-World Examples
| Project | CR | Purpose |
|---|---|---|
| Cert-Manager | Certificate | Auto TLS |
| ArgoCD | Application | GitOps sync |
| Prometheus Operator | ServiceMonitor | Auto scraping |
| Istio | VirtualService | Traffic routing |
| Crossplane | PostgreSQLInstance | Cloud DB provisioning |
Kubernetes Ingress & Ingress Controllers
Ingress is not a running component; it's an API object that defines HTTP(S) routing rules.
An Ingress Controller is the actual software (NGINX, Traefik, HAProxy, etc.) that reads Ingress objects and configures a reverse proxy.
Ingress = "Take HTTP/S traffic from outside the cluster and route it to the correct Service (and Pod) based on URL, host, path, and other rules — all declaratively."
# ingress.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: web-ingress
annotations:
nginx.ingress.kubernetes.io/rewrite-target: /$1
spec:
rules:
- host: app.example.com
http:
paths:
- path: /api
pathType: Prefix
backend:
service:
name: api-service
port:
number: 80
- path: /
pathType: Prefix
backend:
service:
name: web-service
port:
number: 80
tls:
- hosts:
- app.example.com
secretName: app-tls-secret
2. What is an Ingress Controller?
| Component | Role |
|---|---|
| Ingress Resource | Declarative rules (YAML) |
| Ingress Controller | Reconciles rules → configures reverse proxy |
Without a controller, Ingress does nothing.
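One common way to get a controller running is the community NGINX Ingress chart; a sketch assuming Helm is installed (namespace and defaults are the usual chart conventions, adjust for your cluster):

```bash
# Install the NGINX Ingress Controller via Helm
helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
helm repo update
helm install ingress-nginx ingress-nginx/ingress-nginx \
  --namespace ingress-nginx --create-namespace

# Verify the controller pod and its Service (LoadBalancer/NodePort)
kubectl -n ingress-nginx get pods,svc
```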
3. Popular Ingress Controllers (2025)
| Controller | Type | Key Features |
|---|---|---|
| NGINX Ingress | L7 | High perf, rewrite, auth |
| Traefik | L7 | Auto service discovery, middleware |
| HAProxy | L7/L4 | TCP/UDP, enterprise |
| Istio Gateway | L7 | mTLS, traffic splitting |
| Contour (Envoy) | L7 | gRPC, observability |
| Gloo | L7 | Function-level routing |
4. How It Works – Step by Step
graph TD
A[User: app.example.com/api] --> B[Load Balancer]
B --> C[Ingress Controller Pod]
C --> D[Reads Ingress YAML]
D --> E[Configures NGINX/Traefik]
E --> F[Routes to Service]
F --> G[Pod]
- User →
app.example.com - Cloud LB → forwards to Ingress Controller
- Controller watches
Ingressobjects - Generates config → reloads proxy
- Routes to correct
Service→Pod
5. Key Ingress Fields
| Field | Purpose |
|---|---|
| spec.rules[].host | Virtual host (e.g., api.example.com) |
| spec.rules[].http.paths[].path | URL path (/api) |
| pathType | Prefix, Exact, ImplementationSpecific |
| backend.service | Target Service + port |
| spec.tls[] | TLS termination (secret with cert/key) |
| metadata.annotations | Controller-specific config |
6. Path Types (Critical!)
| Type | Behavior |
|---|---|
| Prefix | /api → /api, /api/users |
| Exact | /api only |
| ImplementationSpecific | Controller decides (NGINX: regex, Traefik: regex) |
7. Real-World Example (NGINX)
# 1. Services
apiVersion: v1
kind: Service
metadata:
name: web
spec:
selector:
app: web
ports:
- port: 80
---
# 2. Ingress
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: main-ingress
annotations:
nginx.ingress.kubernetes.io/ssl-redirect: "true"
cert-manager.io/cluster-issuer: "letsencrypt"
spec:
ingressClassName: nginx # Points to controller
tls:
- hosts: [app.example.com]
secretName: app-tls
rules:
- host: app.example.com
http:
paths:
- path: /
pathType: Prefix
backend:
service:
name: web
port:
number: 80
8. IngressClass – Avoid Conflicts
- An IngressClass in Kubernetes is a resource that defines a specific Ingress controller to handle Ingress resources, allowing administrators to route traffic based on different controller capabilities and configurations.
- It enables the use of multiple Ingress controllers (such as NGINX, Traefik, or HAProxy) within the same cluster by associating specific Ingress resources with a designated controller through the `ingressClassName` field in the Ingress manifest.
apiVersion: networking.k8s.io/v1
kind: IngressClass
metadata:
name: nginx
spec:
  controller: k8s.io/ingress-nginx

# In the Ingress
spec:
  ingressClassName: nginx
Multiple controllers? Use
ingressClassNameto route.
9. TLS Termination
- Create Secret:

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: app-tls
type: kubernetes.io/tls
data:
  tls.crt: base64(cert)
  tls.key: base64(key)
```

- Auto-TLS with cert-manager:

```yaml
annotations:
  cert-manager.io/cluster-issuer: "letsencrypt"
```
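Instead of hand-crafting the base64 fields, `kubectl` can build the TLS Secret from certificate files; a sketch assuming `tls.crt` / `tls.key` already exist on disk:

```bash
# Creates a Secret of type kubernetes.io/tls, referenced by spec.tls[].secretName
kubectl create secret tls app-tls --cert=tls.crt --key=tls.key

kubectl get secret app-tls -o yaml
```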
10. Advanced Features (Controller-Specific)
| Feature | Annotation | Controller |
|---|---|---|
| Rate limiting | nginx.ingress.kubernetes.io/limit-rps: "10" | NGINX |
| Auth | nginx.ingress.kubernetes.io/auth-url: ... | NGINX |
| Canary | nginx.ingress.kubernetes.io/canary-weight: "20" | NGINX |
| Middleware | traefik.ingress.kubernetes.io/router.middlewares: ... | Traefik |
Summary Table
| Component | Role |
|---|---|
| Ingress | YAML rules |
| Ingress Controller | Proxy (NGINX/Traefik) |
| IngressClass | Route to correct controller |
| Service | Backend target |
| Secret | TLS certs |
Golden Rule:
Ingress = Rules
Ingress Controller = Engine
No controller = No routing
Kubernetes RBAC
Role-Based Access Control (RBAC) is Kubernetes’ default authorization system that controls who (user/service) can do what (verbs) on which resources in which namespace.
RBAC = "Who → Can do → What → Where"
2. Core RBAC Resources
| Resource | Purpose |
|---|---|
| Role / ClusterRole | Define permissions (verbs on resources) |
| RoleBinding / ClusterRoleBinding | Bind permissions to users/groups/service accounts |
| Subject | Who gets access: User, Group, ServiceAccount |
3. Role vs ClusterRole
| | Role | ClusterRole |
|---|---|---|
| Scope | Namespaced | Cluster-wide |
| Use | One namespace only | All namespaces + cluster resources |
| Example | Edit Pods in dev | View Nodes cluster-wide |
4. RoleBinding vs ClusterRoleBinding
| | RoleBinding | ClusterRoleBinding |
|---|---|---|
| Binds | Subject → role within one namespace | Subject → role cluster-wide |
| Can bind | Role or ClusterRole (applied in that namespace) | Only ClusterRole |
5. Verbs (Actions)
| Verb | Meaning |
|---|---|
| get | Read one resource |
| list | Read many |
| watch | Stream changes |
| create | Make new |
| update / patch | Modify |
| delete | Remove |
| deletecollection | Bulk delete |
6. Resources & API Groups
| Resource | API Group |
|---|---|
| pods, services | "" (core) |
| deployments, ingresses | apps, networking.k8s.io |
| nodes, persistentvolumes | cluster-level |
| * | All resources |
7. Full Example: Dev Can Edit Pods in dev Namespace
# 1. Role: What can be done
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
namespace: dev
name: pod-editor
rules:
- apiGroups: [""]
resources: ["pods"]
verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
- apiGroups: ["apps"]
resources: ["deployments"]
verbs: ["get", "list"]
# 2. RoleBinding: Who gets the role
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
name: dev-pod-access
namespace: dev
subjects:
- kind: User
name: alice
apiGroup: rbac.authorization.k8s.io
- kind: ServiceAccount
name: deployer-sa
namespace: dev
roleRef:
kind: Role
name: pod-editor
apiGroup: rbac.authorization.k8s.io
8. Cluster-Wide: View All Nodes
# ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: node-viewer
rules:
- apiGroups: [""]
  resources: ["nodes"]
  verbs: ["get", "list", "watch"]
---
# ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: node-viewer-global
subjects:
- kind: User
  name: monitoring-bot
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: node-viewer
  apiGroup: rbac.authorization.k8s.io
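Quick check that the cluster-wide binding behaves as intended:
kubectl auth can-i list nodes --as=monitoring-bot     # yes
kubectl auth can-i delete nodes --as=monitoring-bot   # no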
9. Built-in ClusterRoles (Use These!)
| ClusterRole | Permissions |
|---|---|
| cluster-admin | Everything |
| admin | Most permissions within a namespace |
| edit | Create/update most resources |
| view | Read-only |
# Give admin in a namespace
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: namespace-admin
  namespace: staging
subjects:
- kind: User
  name: bob
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: admin
  apiGroup: rbac.authorization.k8s.io
10. Service Accounts & RBAC
- A Service Account (SA) is a Kubernetes identity for non-human clients (applications, pods, processes) to authenticate and be authorized in the cluster.
- A Service Account is scoped to a single namespace.
# SA
apiVersion: v1
kind: ServiceAccount
metadata:
  name: backup-sa
  namespace: tools
---
# Bind to ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: backup-access
subjects:
- kind: ServiceAccount
  name: backup-sa
  namespace: tools
roleRef:
  kind: ClusterRole
  name: view
  apiGroup: rbac.authorization.k8s.io
Use in Pod:
spec:
  serviceAccountName: backup-sa
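To confirm what the ServiceAccount is allowed to do, impersonate it using the system:serviceaccount:<namespace>:<name> format:
kubectl auth can-i list pods --as=system:serviceaccount:tools:backup-sa     # yes (view)
kubectl auth can-i create pods --as=system:serviceaccount:tools:backup-sa   # no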
11. Testing RBAC
# Impersonate user
kubectl auth can-i create pods --as=alice -n dev
# → yes
kubectl auth can-i delete nodes --as=alice
# → no
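To dump everything a subject is allowed to do in a namespace at once:
kubectl auth can-i --list --as=alice -n dev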
12. Common Patterns
| Goal | Use |
|---|---|
| Dev team edits in namespace | Role + RoleBinding |
| CI/CD deploys | ServiceAccount + RoleBinding |
| Monitoring reads all | ClusterRole(view) + ClusterRoleBinding |
| Admin per namespace | ClusterRole(admin) + RoleBinding |
13. Best Practices
| Practice | Why |
|---|---|
| Least privilege | Only needed verbs/resources |
| Use groups | e.g. system:developers |
| Avoid cluster-admin | Except for cluster admins |
| Use ServiceAccounts | For apps, not users |
| Audit regularly | kubectl get rolebindings -A |
Golden Rule:
Never give cluster-admin unless absolutely needed.
Always bind a ClusterRole with a RoleBinding when you want namespace isolation.
Kubernetes Monitoring
1. Why Monitor Kubernetes?
| Need | What You Track |
|---|---|
| Reliability | Pod restarts, OOM kills |
| Performance | CPU, memory, latency |
| Capacity | Node saturation |
| Security | Anomalies, failed logins |
| SLOs | 99.9% uptime |
2. Core Monitoring Stack (2025 Standard)
Kubernetes
   ↓
cAdvisor (built into kubelet) → Metrics Server (kubectl top)
   ↓
cAdvisor + kube-state-metrics → Prometheus
   ↓
Grafana (dashboards) + Alertmanager (alerts) + Kiali (Istio)
3. In-Built Components
| Component | Role | Built-in? |
|---|---|---|
| cAdvisor | Collects container metrics (CPU, memory, disk, network) | Yes (in kubelet) |
| Metrics Server | Aggregates cAdvisor metrics → kubectl top | No (lightweight add-on install) |
| kube-state-metrics | Exposes cluster object state (Pods, Deployments, Nodes) | No (install) |
4. Metrics Server – kubectl top
What It Does
- Lightweight, in-memory aggregator of metrics from all cAdvisors
- Enables:
kubectl top nodes
kubectl top pods -n prod
Limits
- No long-term storage
- No alerting
- No custom metrics
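Despite these limits, Metrics Server is what feeds the Horizontal Pod Autoscaler. A minimal HPA sketch (the target Deployment name web is assumed):
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web                    # assumed Deployment
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70   # scale out when average CPU > 70%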
5. Prometheus – The Gold Standard
| Feature | Details |
|---|---|
| Pull-based | Scrapes /metrics endpoints from various sources |
| Time-series DB | Stores years of data |
| PromQL | Powerful query language |
| Service discovery | Finds scrape targets automatically via the Kubernetes API |
Key Targets
| Target | Endpoint | Metrics |
|---|---|---|
| kubelet | /metrics, /metrics/cadvisor | Container CPU/memory |
| API server | /metrics | Request latency |
| Nodes (node-exporter) | :9100/metrics | System stats |
| kube-state-metrics | /metrics | Pod count, phase |
| Your app | /metrics (exposed via a client library) | HTTP requests, errors |
6. Grafana – Visualization
- Dashboards for Prometheus
- Pre-built: Node Exporter, Kubernetes Cluster, Apps
- Alerting via Prometheus rules
# Example Panel
sum(rate(container_cpu_usage_seconds_total{namespace="prod"}[5m])) by (pod)
7. Kiali – Service Mesh Observability (Istio)
| Feature | Use |
|---|---|
| Service Graph | Visual traffic flow |
| Metrics | Golden signals per service |
| Traces | Distributed tracing |
| Config Validation | Istio config errors |
Only with Istio
8. Expose Application Metrics
Go Example
import "github.com/prometheus/client_golang/prometheus/promhttp"
http.Handle("/metrics", promhttp.Handler())
Python
from prometheus_client import start_http_server
start_http_server(8000)
Annotation (Auto-scrape)
metadata:
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8080"
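These annotations only take effect if the Prometheus scrape config honors them; the usual (abbreviated) pod-discovery job looks roughly like this:
scrape_configs:
- job_name: kubernetes-pods
  kubernetes_sd_configs:
  - role: pod
  relabel_configs:
  # keep only pods annotated prometheus.io/scrape: "true"
  - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
    action: keep
    regex: "true"
  # scrape the port given in prometheus.io/port
  - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
    action: replace
    regex: ([^:]+)(?::\d+)?;(\d+)
    replacement: $1:$2
    target_label: __address__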
9. Alertmanager – Handle Alerts
# alert.rules
groups:
- name: node-alerts
  rules:
  - alert: NodeDown
    expr: up{job="node"} == 0
    for: 5m
    labels:
      severity: critical
Routes to Slack, PagerDuty, email.
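The routing itself lives in the Alertmanager config; a minimal sketch that sends critical alerts to Slack (webhook URL and channel are placeholders):
# alertmanager.yml
route:
  receiver: default
  routes:
  - match:
      severity: critical
    receiver: slack-critical
receivers:
- name: default
- name: slack-critical
  slack_configs:
  - api_url: https://hooks.slack.com/services/PLACEHOLDER   # placeholder webhook
    channel: "#alerts"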
10. Full Stack Overview
11. Summary Table
| Tool | Type | Must-Have? |
|---|---|---|
| cAdvisor | Container metrics | Yes (built-in) |
| Metrics Server | kubectl top | Yes |
| Prometheus | Storage + query | Yes |
| Grafana | Dashboards | Yes |
| Kiali | Service mesh | Yes (with Istio) |
| Alertmanager | Alerts | Yes |
Golden Rule:
"If it’s not in Prometheus, it doesn’t exist."
Instrument everything. Alert on SLOs. Visualize trends.
Other Concepts
Annotations
Annotations are arbitrary key-value metadata attached to any Kubernetes object (Pod, Service, Deployment, etc.) — but they are NOT used for selecting or filtering.
metadata:
annotations:
app.kubernetes.io/version: "v1.2.3"
prometheus.io/scrape: "true"
backup.velero.io/backup-at: "2025-04-05T02:00:00Z"
| Feature | Labels | Annotations |
|---|---|---|
| Purpose | Identify & select objects | Attach non-identifying metadata |
| Used by | kubectl get pod -l app=web | Not used in selectors |
| Size | Small, indexed | Up to 256KB total per object |
| Example | app: web, env: prod | description, contact, backup-policy |
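The difference shows up directly in kubectl (the pod name myapp is illustrative):
# Select by label
kubectl get pods -l app=web

# Read annotations (no selector support)
kubectl get pod myapp -o jsonpath='{.metadata.annotations}'

# Add or update an annotation
kubectl annotate pod myapp owner="team-data@company.com" --overwrite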
Why Use Annotations?
| Use Case | Example |
|---|---|
| Tooling Integration | prometheus.io/scrape: "true" → Prometheus auto-scrapes |
| Operators & Controllers | helm.sh/hook: pre-install → Helm runs job |
| Backup & Restore | velero.io/exclude-from-backup: "true" |
| Ingress Rules | nginx.ingress.kubernetes.io/rewrite-target: /$1 |
| CI/CD Metadata | build-id: 12345, git-commit: abc123 |
| Documentation | owner: team-data@company.com |
| Custom Automation | reloader.stakater.com/auto: "true" → ConfigMap reload |
Real-World Examples
# 1. Prometheus
annotations:
  prometheus.io/scrape: "true"
  prometheus.io/port: "8080"

# 2. Helm
annotations:
  meta.helm.sh/release-name: my-app
  meta.helm.sh/release-namespace: prod

# 3. Cert-Manager
annotations:
  cert-manager.io/cluster-issuer: "letsencrypt"

# 4. Custom Operator
annotations:
  database.mycompany.com/backup-policy: daily
Best Practices
| Do | Don’t |
|---|---|
| Use structured prefixes (prometheus.io/, app.example.com/) | Use random keys |
| Store non-identifying data | Use for selectors |
| Keep under 256KB | Store large logs |
| Use for automation hooks | Hardcode in code |
How Tools Use Annotations
| Tool | Reads Annotations For |
|---|---|
| Prometheus | Scraping config |
| Helm | Release tracking |
| ArgoCD | Sync waves |
| Kubelet | Pod behavior |
| Custom Controllers | Triggers, policies |
Summary:
- Labels = Who is this?
- Annotations = Extra info about this; metadata for tools and automation.
- Not for filtering
- Perfect for integration, hooks, and context
Istio
Istio = Service Mesh → Adds traffic control, security, observability to apps without code changes.
Core Architecture
Your App Pods
↓
Envoy Sidecar (auto-injected into every Pod)
↓
Istiod (Control Plane)
- Envoy:
- The Envoy proxy is deployed alongside each service instance as a sidecar container, intercepting all inbound and outbound traffic for that service.
- This sidecar model allows Istio to enforce policies, collect telemetry, and manage traffic without requiring changes to the application code itself.
- Istiod: Configures Envoy, certs, policies
1. Traffic Management
| Feature | How |
|---|---|
| Path-based routing | GET /api → api-v1, POST /api → api-v2 |
| Ratio-based (Canary) | 90% → v1, 10% → v2 |
| Header-based | x-user-type: beta → canary |
| Fault Injection | Delay 2s, abort 5% |
| Timeouts/Retries | Auto retry on 5xx |
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
spec:
  hosts: [api.example.com]
  http:
  - match:
    - uri: {prefix: /api}
      headers:
        x-user: {exact: beta}
    route:
    - destination: {host: api-v2, subset: v2}
      weight: 100
  - route:
    - destination: {host: api-v1, subset: v1}
      weight: 90
    - destination: {host: api-v2, subset: v2}
      weight: 10
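The fault-injection row from the table maps to a fault block in the same VirtualService; a sketch that delays every request by 2s and aborts 5% of them with HTTP 500, reusing the api-v1 destination from above:
http:
- fault:
    delay:
      percentage: {value: 100}
      fixedDelay: 2s
    abort:
      percentage: {value: 5}
      httpStatus: 500
  route:
  - destination: {host: api-v1, subset: v1}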
2. mTLS Encryption (Mutual TLS)
- Automatic between all services
- Zero-trust: Every call encrypted + authenticated
- Istiod issues short-lived certs (SPIFFE)
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
spec:
  mtls:
    mode: STRICT # Enforce mTLS
3. Access Control (Authorization)
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
spec:
  action: ALLOW
  rules:
  - from:
    - source: {principals: ["cluster.local/ns/prod/sa/api"]}
    to:
    - operation: {methods: ["GET"], paths: ["/public/*"]}
4. Observability (Golden Signals)
| Tool | What |
|---|---|
| Kiali | Service graph, health |
| Prometheus | Metrics (istio_requests_total) |
| Jaeger/Zipkin | Traces |
| Grafana | Dashboards |
5. Key Resources
| Resource | Purpose |
|---|---|
| VirtualService | Routing rules |
| DestinationRule | Subsets, load balancing, circuit breaker |
| Gateway | Ingress (L7 LB) |
| ServiceEntry | External services (e.g., api.google.com) |
| PeerAuthentication | mTLS mode |
| AuthorizationPolicy | RBAC for traffic |
6. Example: Canary + mTLS + Auth
# 1. Subsets
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
spec:
  host: reviews
  subsets:
  - name: v1
    labels: {version: v1}
  - name: v2
    labels: {version: v2}

# 2. 90/10 routing
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
spec:
  hosts: [reviews]
  http:
  - route:
    - {destination: {host: reviews, subset: v1}, weight: 90}
    - {destination: {host: reviews, subset: v2}, weight: 10}

# 3. Enforce mTLS
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
spec:
  mtls: {mode: STRICT}
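Once applied, istioctl can validate the configuration and confirm the sidecars picked it up:
istioctl analyze        # lint Istio config for problems
istioctl proxy-status   # check Envoy sidecars are in sync with Istiod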
7. Important Concepts
| Concept | Meaning |
|---|---|
| Sidecar | Envoy injected into every Pod |
| Subset | Group of Pods by labels (e.g., version: v2) |
| Gateway | Ingress controller (replaces NGINX Ingress) |
| mTLS | End-to-end encryption |
| Circuit Breaker | Stop cascading failures |
| Fault Injection | Test resilience |
Golden Rule:
Istio = Envoy + Istiod → Traffic, Security, Observability without app changes.
Use Istio when:
- Microservices
- Canary/Blue-Green deployments
- Zero-trust security
- Multi-cluster
Skip it for:
- Simple apps
- Monoliths
Now route, secure, and observe your traffic like a pro!
Try: istioctl dashboard kiali