
Here’s a step-by-step guide to automate health checks and alerts for failed pods in Kubernetes using Prometheus and Alertmanager.
Step 1: Install Prometheus for Monitoring
Prometheus will monitor the health of your pods and trigger alerts when they fail.
1.1 Create a Namespace
kubectl create namespace monitoring
1.2 Deploy Prometheus using Helm (Recommended)
Helm makes installation easier. First, install Helm if you haven’t:
curl https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash
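You can confirm the Helm client is available with:
helm version --short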
Now, add the Prometheus Helm chart and install it:
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install prometheus prometheus-community/kube-prometheus-stack --namespace monitoring
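The kube-prometheus-stack chart installs Prometheus, Alertmanager, Grafana, and the Prometheus Operator in a single release. You can confirm the release was created with:
helm list -n monitoring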
Step 2: Verify Prometheus Installation
Check if Prometheus is running:
kubectl get pods -n monitoring
If everything is fine, you should see pods similar to:
prometheus-prometheus-kube-prometheus-prometheus-0        Running
alertmanager-prometheus-kube-prometheus-alertmanager-0    Running
Step 3: Install and Configure Alertmanager
Alertmanager will send alerts when a pod fails.
3.1 Configure Alertmanager to Send Alerts
With the kube-prometheus-stack chart, the Alertmanager configuration lives in a Secret managed by the chart, so the cleanest way to change it is through Helm values rather than editing it in place. Create a file alertmanager-values.yaml that adds a receiver (for example, Slack):
alertmanager:
  config:
    route:
      receiver: 'slack-notifications'
    receivers:
    - name: 'null'   # keep the default receiver used by the chart's Watchdog route
    - name: 'slack-notifications'
      slack_configs:
      - channel: '#alerts'
        api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
Apply the change to the existing release:
helm upgrade prometheus prometheus-community/kube-prometheus-stack --namespace monitoring -f alertmanager-values.yaml
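To confirm the new configuration was rendered, you can print the Secret that Alertmanager reads. The Secret name below assumes the release name prometheus from Step 1 and the chart's default alertmanager.yaml key:
kubectl get secret alertmanager-prometheus-kube-prometheus-alertmanager -n monitoring -o jsonpath='{.data.alertmanager\.yaml}' | base64 -d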
3.2 Restart Alertmanager (if needed)
The Operator's config reloader normally picks up the new configuration automatically. If the change does not take effect after a minute or two, restart Alertmanager:
kubectl delete pod -l app.kubernetes.io/name=alertmanager -n monitoring
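The StatefulSet recreates the pod automatically; you can watch it come back to Running:
kubectl get pods -l app.kubernetes.io/name=alertmanager -n monitoring -w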
Step 4: Define an Alert Rule for Failed Pods
Create a Prometheus alert rule to detect failed pods.
4.1 Create a YAML File for the Rule
Create a file alert-rules.yaml:
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: pod-down-alert
  namespace: monitoring
  labels:
    release: prometheus   # the chart's Prometheus only loads rules carrying the release label
spec:
  groups:
  - name: pod-alerts
    rules:
    - alert: PodDown
      expr: kube_pod_status_phase{phase="Failed"} > 0
      for: 2m
      labels:
        severity: critical
      annotations:
        summary: "Pod {{ $labels.pod }} is down"
        description: "The pod {{ $labels.pod }} in namespace {{ $labels.namespace }} has failed."
4.2 Apply the Rule
kubectl apply -f alert-rules.yaml
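To verify the rule object exists, list it, then look under Status > Rules in the Prometheus UI (port-forwarding is shown in Step 6):
kubectl get prometheusrule pod-down-alert -n monitoring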
Step 5: Test the Alerting System
Now, simulate a pod failure to see if the alerting system works.
5.1 Deploy a Sample Pod
Create a failing pod manifest, failing-pod.yaml:
apiVersion: v1
kind: Pod
metadata:
  name: fail-pod
  namespace: default
spec:
  containers:
  - name: fail-container
    image: busybox
    command: ["sh", "-c", "exit 1"]   # forces the container to exit with an error
  restartPolicy: Never
Apply it:
kubectl apply -f failing-pod.yaml
5.2 Check if the Alert Fires
kubectl get pods
The pod should end up in the Failed phase (shown as Error by kubectl). After the rule's 2-minute for duration, the PodDown alert should fire in Prometheus and be forwarded to Alertmanager.
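To confirm the pod really reached the Failed phase (the phase the alert expression matches), you can check it directly:
kubectl get pod fail-pod -o jsonpath='{.status.phase}'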
Step 6: View Alerts in Prometheus
- Port-forward Prometheus UI:
kubectl port-forward svc/prometheus-kube-prometheus-prometheus 9090 -n monitoring
- Open http://localhost:9090 in your browser.
- Go to Alerts, and you should see the PodDown alert.
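With the port-forward above still running, you can also evaluate the alert expression against the Prometheus HTTP API as a quick sanity check (assuming the default port 9090):
curl -g 'http://localhost:9090/api/v1/query?query=kube_pod_status_phase{phase="Failed"}'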
Step 7: Receive Alerts in Slack (or Email)
If everything is set up correctly, Alertmanager should send an alert to your configured Slack channel.
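If nothing arrives in Slack, check whether the alert reached Alertmanager at all. The Prometheus Operator typically exposes Alertmanager through the alertmanager-operated service, so a port-forward lets you inspect it in the browser:
kubectl port-forward svc/alertmanager-operated 9093 -n monitoring
Then open http://localhost:9093 and look for the PodDown alert.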
Conclusion
Now, your Kubernetes cluster will:
✅ Monitor pod health using Prometheus
✅ Detect failed pods in real time
✅ Send alerts via Alertmanager (Slack, email, etc.)