
Here’s a step-by-step guide to automate health checks and alerts for failed pods in Kubernetes using Prometheus and Alertmanager.
Step 1: Install Prometheus for Monitoring
Prometheus will monitor the health of your pods and trigger alerts when they fail.
1.1 Create a Namespace
kubectl create namespace monitoring
1.2 Deploy Prometheus using Helm (Recommended)
Helm makes installation easier. First, install Helm if you haven’t:
curl https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash
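You can confirm the Helm client is available with:
helm version --short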
Now, add the Prometheus Helm chart and install it:
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install prometheus prometheus-community/kube-prometheus-stack --namespace monitoring
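The kube-prometheus-stack chart installs Prometheus, Alertmanager, Grafana, and the Prometheus Operator in a single release. You can confirm the release was created with:
helm list -n monitoring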
Step 2: Verify Prometheus Installation
Check if Prometheus is running:
kubectl get pods -n monitoring
If everything is fine, you should see pods similar to:
prometheus-prometheus-kube-prometheus-prometheus-0        Running
alertmanager-prometheus-kube-prometheus-alertmanager-0    Running
Step 3: Install and Configure Alertmanager
Alertmanager will send alerts when a pod fails.
3.1 Configure Alertmanager to Send Alerts
With the kube-prometheus-stack chart, the Alertmanager configuration lives in a Secret managed by the chart, so the cleanest way to change it is through Helm values rather than editing it in place. Create a file alertmanager-values.yaml that adds a receiver (for example, Slack):
alertmanager:
  config:
    route:
      receiver: 'slack-notifications'
    receivers:
    - name: 'null'   # keep the default receiver used by the chart's Watchdog route
    - name: 'slack-notifications'
      slack_configs:
      - channel: '#alerts'
        api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
Apply the change to the existing release:
helm upgrade prometheus prometheus-community/kube-prometheus-stack --namespace monitoring -f alertmanager-values.yaml
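To confirm the new configuration was rendered, you can print the Secret that Alertmanager reads. The Secret name below assumes the release name prometheus from Step 1 and the chart's default alertmanager.yaml key:
kubectl get secret alertmanager-prometheus-kube-prometheus-alertmanager -n monitoring -o jsonpath='{.data.alertmanager\.yaml}' | base64 -d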
3.2 Restart Alertmanager (if needed)
The Operator's config reloader normally picks up the new configuration automatically. If the change does not take effect after a minute or two, restart Alertmanager:
kubectl delete pod -l app.kubernetes.io/name=alertmanager -n monitoring
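The StatefulSet recreates the pod automatically; you can watch it come back to Running:
kubectl get pods -l app.kubernetes.io/name=alertmanager -n monitoring -w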
Step 4: Define an Alert Rule for Failed Pods
Create a Prometheus alert rule to detect failed pods.
4.1 Create a YAML File for the Rule
Create a file alert-rules.yaml:
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: pod-down-alert
  namespace: monitoring
  labels:
    release: prometheus   # the chart's Prometheus only loads rules carrying the release label
spec:
  groups:
  - name: pod-alerts
    rules:
    - alert: PodDown
      expr: kube_pod_status_phase{phase="Failed"} > 0
      for: 2m
      labels:
        severity: critical
      annotations:
        summary: "Pod {{ $labels.pod }} is down"
        description: "The pod {{ $labels.pod }} in namespace {{ $labels.namespace }} has failed."
4.2 Apply the Rule
kubectl apply -f alert-rules.yaml
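To verify the rule object exists, list it, then look under Status > Rules in the Prometheus UI (port-forwarding is shown in Step 6):
kubectl get prometheusrule pod-down-alert -n monitoring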
Step 5: Test the Alerting System
Now, simulate a pod failure to see if the alerting system works.
5.1 Deploy a Sample Pod
Create a failing pod manifest, failing-pod.yaml:
apiVersion: v1
kind: Pod
metadata:
  name: fail-pod
  namespace: default
spec:
  containers:
  - name: fail-container
    image: busybox
    command: ["sh", "-c", "exit 1"]   # forces the container to exit with an error
  restartPolicy: Never
Apply it:
kubectl apply -f failing-pod.yaml
5.2 Check if the Alert Fires
kubectl get pods
The pod should end up in the Failed phase (shown as Error by kubectl). After the rule's 2-minute for duration, the PodDown alert should fire in Prometheus and be forwarded to Alertmanager.
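To confirm the pod really reached the Failed phase (the phase the alert expression matches), you can check it directly:
kubectl get pod fail-pod -o jsonpath='{.status.phase}'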
Step 6: View Alerts in Prometheus
- Port-forward Prometheus UI:
kubectl port-forward svc/prometheus-kube-prometheus-prometheus 9090 -n monitoring
- Open http://localhost:9090 in your browser.
- Go to Alerts, and you should see the PodDown alert.
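With the port-forward above still running, you can also evaluate the alert expression against the Prometheus HTTP API as a quick sanity check (assuming the default port 9090):
curl -g 'http://localhost:9090/api/v1/query?query=kube_pod_status_phase{phase="Failed"}'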
Step 7: Receive Alerts in Slack (or Email)
If everything is set up correctly, Alertmanager should send an alert to your configured Slack channel.
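If nothing arrives in Slack, check whether the alert reached Alertmanager at all. The Prometheus Operator typically exposes Alertmanager through the alertmanager-operated service, so a port-forward lets you inspect it in the browser:
kubectl port-forward svc/alertmanager-operated 9093 -n monitoring
Then open http://localhost:9093 and look for the PodDown alert.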
Conclusion
Now, your Kubernetes cluster will:
✅ Monitor pod health using Prometheus
✅ Detect failed pods in real time
✅ Send alerts via Alertmanager (Slack, email, etc.)