Heartbeat Monitoring for Cron Jobs Explained
You set up a backup script to run every night at 2 AM. The crontab entry is there, the logs look fine from last week, and the server hasn't crashed. But nobody actually checked whether it ran last night. Three weeks pass. A database corruption hits, and the backup that should have saved you never ran. Nobody noticed.
This scenario plays out thousands of times per day across servers everywhere. It's the exact problem heartbeat monitoring for cron jobs solves: it tells you when a job doesn't show up on time, without you having to ask.
The Problem
Cron is fire-and-forget. You schedule a task, and that's it. Cron doesn't care what happens after it forks the process. If the job fails to start because the executable is missing, hangs because a lock file was never released, or exits with error code 1 that nobody is reading, cron stays completely silent. Apart from mailing job output to a local mailbox that is often unconfigured or unread, there is no built-in mechanism to say "hey, I was supposed to run but something went wrong."
Most teams discover this the hard way, weeks or months after the damage starts accumulating:
- Automated reports that stopped generating, so nobody realized a pricing calculation was wrong
- Database backups that haven't run since the credentials rotated last quarter
- Data sync pipelines that silently dropped records after a schema migration
- Invoice generation that halted mid-run, leaving hundreds of customers unbilled
And the alert always comes from the wrong source: an angry customer, a missing report in a Monday meeting, a compliance audit finding. It never comes from your own infrastructure.
Why It Happens
Cron jobs fail for reasons that have nothing to do with the script itself:
- Resource exhaustion: the server ran out of memory and the OOM killer terminated the process
- Dependency failures: a database connection pool is full, an API endpoint moved
- Silent hangs: a network request stalls with no effective timeout, or a lock file wasn't released from a previous run
- Permission changes: a credentials file rotated, file permissions changed
- Silent success: the script ran but produced corrupt output (exit code 0, wrong result)
None of these necessarily produce an error in the cron log. The system believes everything is fine.
Why It's Dangerous
The danger scales directly with how critical the job is. A daily report that stops generating is annoying. A nightly database backup that silently stops is catastrophic. A payment reconciliation job that misses a batch means real revenue loss.
Here's what makes silent failures especially dangerous:
- They compound silently: the longer a job has been failing, the harder recovery becomes. Missing backups snowball. Unprocessed message queues grow. Data drift widens between systems.
- You lose trust in the entire stack: once you discover one silent failure, you start questioning everything. Which other jobs are silently broken? Should you rewrite all of them?
- Late detection costs far more time: by the time you notice manually, you're not fixing a 5-minute issue. You're reconstructing weeks of lost data and apologising to stakeholders.
- Compliance and SLA impact: many industries require proof that scheduled tasks ran. Missing a backup audit trail can mean regulatory violations.
The worst failures aren't the ones that throw errors. They're the ones that do nothing at all while nobody notices. According to research from PagerDuty, the average incident takes over 30 minutes to detect without proper monitoring. For silent cron failures, that "detection time" is measured in days or weeks.
How Heartbeat Monitoring Works
The concept is borrowed from network monitoring, where a "heartbeat" is a periodic signal that says "I'm alive." Applied to cron jobs, it works like this:
- Your job sends a lightweight HTTP request ("I just finished") to a monitoring endpoint when it completes.
- The monitoring system expects to receive this signal on a defined schedule.
- If the signal doesn't arrive within the expected window, the monitoring system alerts you.
+-------------+     "I ran!"     +--------------+
|  Cron Job   | ---------------> |  Monitoring  |
|  (any task) |                  |   Service    |
+-------------+                  +------+-------+
                                        |
                             Missed? ---+
                                        |
                                     Alert!
The key insight: heartbeat monitoring detects the absence of an expected signal. You don't need to predict every possible failure mode. If the job doesn't check in, something went wrong, and you get told about it.
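On the monitoring side, the check reduces to comparing timestamps. The following is a minimal sketch, assuming the monitor stores the Unix time of the last heartbeat and that the expected interval and a grace period are both given in seconds. The function name heartbeat_overdue and its argument order are illustrative and not taken from any particular tool.

```shell
# Decide whether a job has missed its heartbeat.
# All arguments are integers (seconds):
#   $1  Unix time of the last heartbeat received
#   $2  current Unix time
#   $3  expected interval between heartbeats
#   $4  grace period allowed on top of the interval
# Returns 0 (true) when the heartbeat is overdue, 1 otherwise.
# The body runs in a subshell so "deadline" does not leak to the caller.
heartbeat_overdue() (
    deadline=$(( $1 + $3 + $4 ))
    [ "$2" -gt "$deadline" ]
)

# Example call for a daily job with a one-hour grace period
# (last_seen is a hypothetical variable holding the stored timestamp):
#   if heartbeat_overdue "$last_seen" "$(date +%s)" 86400 3600; then
#       echo "ALERT: job-123 missed its heartbeat"
#   fi
```

A real service would run this comparison on a timer for every registered job; the sketch only shows the decision itself.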
Simple Solution with curl
The simplest way to add heartbeat monitoring to any cron job is a single curl command:
# Run the job; send a heartbeat only if it exits with status 0
if /usr/local/bin/backup.sh; then
    curl -fsS -m 10 --retry 3 https://your-monitor-endpoint.com/beat/job-123
fi
This sends a GET request after the backup script completes successfully. The monitoring endpoint expects this request every 24 hours. If it doesn't arrive, it fires an alert.
For more detailed monitoring, send exit codes:
/usr/local/bin/backup.sh
EXIT_CODE=$?
# SECONDS is a bash built-in: seconds elapsed since the script started
curl -fsS -m 10 --retry 3 -X POST \
    -H "Content-Type: application/json" \
    -d "{\"status\": \"$EXIT_CODE\", \"duration\": \"$SECONDS\"}" \
    https://your-monitor-endpoint.com/beat/job-123
This pattern works with any cron job: shell scripts, Python scripts, Node.js, Go binaries. If your job can make an HTTP request, it can send a heartbeat.
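Because the pattern is the same for every job, it can be factored into a small wrapper. The sketch below is one possible shape, not a definitive implementation: the function name run_with_heartbeat and the HEARTBEAT_URL variable are assumptions, and the JSON fields simply mirror the example above (sent here as numbers rather than strings).

```shell
# Placeholder default; replace with the URL your monitor gives you.
HEARTBEAT_URL="${HEARTBEAT_URL:-https://your-monitor-endpoint.com/beat/job-123}"

# Run any command, then report its exit code and duration in seconds.
# The wrapper returns the job's own exit code, so anything that
# inspects the result of the cron command still sees it.
run_with_heartbeat() {
    hb_start=$(date +%s)
    hb_status=0
    "$@" || hb_status=$?
    hb_duration=$(( $(date +%s) - hb_start ))
    # "|| true": a failed heartbeat must not replace the job's result.
    curl -fsS -m 10 --retry 3 -X POST \
        -H "Content-Type: application/json" \
        -d "{\"status\": $hb_status, \"duration\": $hb_duration}" \
        "$HEARTBEAT_URL" >/dev/null || true
    return "$hb_status"
}
```

With this in place, a script would call run_with_heartbeat /usr/local/bin/backup.sh instead of invoking the backup directly. Note that date +%s is widely available (GNU, BSD, BusyBox) but is not guaranteed by POSIX.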
Integrating QuietPulse into the Workflow
Instead of building this yourself, you can use a simple heartbeat monitoring tool like QuietPulse. You create jobs in the dashboard, copy a unique heartbeat URL into your scripts, and get alerts when jobs don't check in. No infrastructure, no configuration: paste a URL and you're done. QuietPulse supports multiple notification channels including Telegram and webhooks, so you can route alerts wherever your team already pays attention.
Common Mistakes
1. Only Sending Heartbeats on Success
If your job fails and never sends a heartbeat, you'll get an alert, but you'll have no idea why it failed. Send the exit code or at least distinguish between "ran successfully" and "ran with errors."
2. Setting Timeout Windows Too Tight
If your job runs between 30 seconds and 3 minutes, don't set the monitoring window to 60 seconds. Random delays (slow DNS, temporary locks) will cause false alarms. Add buffer.
3. Not Handling the Heartbeat Request Itself
If the heartbeat HTTP call fails (network issue on your server), that shouldn't fail your job. Use curl -f with a timeout (-m), and if the script runs under set -e, append || true to the curl call so a failed heartbeat can't abort the script.
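One way to isolate the heartbeat from the job is a helper that never returns a failure status. This is a sketch under the assumption that a warning on stderr is enough when the ping fails; the function name send_heartbeat is illustrative.

```shell
# Send a heartbeat without ever failing the caller.
#   $1  heartbeat URL
# Always returns 0, so it is safe to call in a script that uses "set -e".
send_heartbeat() {
    # -f: treat HTTP errors as failures, -m 10: give up after 10 seconds
    if ! curl -fsS -m 10 --retry 3 "$1" >/dev/null; then
        echo "warning: heartbeat to $1 failed" >&2
    fi
    return 0
}
```

The trade-off is that a broken network path now shows up only as a missed heartbeat on the monitoring side, which is the signal you want the monitor to raise anyway.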
4. Monitoring Only the Easy Jobs
The jobs you monitor should be the ones that hurt most when they fail. Start with backups, data exports, payment reconciliation, not log rotation.
5. Ignoring the Alert
This sounds obvious, but it happens constantly: teams set up heartbeat monitoring, get the first alert, dismiss it as a fluke, and miss the real pattern. Treat the first missed heartbeat as a real failure until proven otherwise.
Alternative Approaches
Heartbeat monitoring isn't the only way to detect cron job failures, but it's often the most practical. Here's how it compares to other approaches:
Log Monitoring
Parse cron logs (/var/log/cron or journalctl) and look for execution entries. Pros: no code changes. Cons: doesn't detect hangs or silent errors. The job might run and produce garbage output.
Exit Code Tracking
Capture and store exit codes from every cron job execution. Pros: more detail. Cons: requires wrapping every job, and still doesn't detect jobs that never start.
Output Monitoring
Check that your job produces the expected output files or database records. Pros: validates actual results. Cons: complex to set up for every job, requires knowing the expected output format.
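Output checks and heartbeats can be combined: validate the result first, then send the heartbeat only if it looks sane. The sketch below assumes the job writes a single file and that "sane" means non-empty and recently modified; the function name output_is_fresh is hypothetical, and find -mmin is a common extension (GNU, BSD, BusyBox) rather than POSIX.

```shell
# Succeed only if the expected output exists, is non-empty,
# and was modified within the last $2 minutes.
#   $1  path to the file the job should have produced
#   $2  maximum acceptable age in minutes
output_is_fresh() {
    # -s: file exists and has a size greater than zero
    [ -s "$1" ] || return 1
    # find prints the path only when its mtime is newer than $2 minutes
    [ -n "$(find "$1" -mmin "-$2" 2>/dev/null)" ]
}
```

A backup script could then guard its heartbeat with this check, for example by calling output_is_fresh on the archive it just wrote before running curl. This catches the empty-or-stale-file case that an exit code of 0 alone would miss, though it does not prove the contents are correct.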
Uptime Monitoring
Traditional uptime checks (pinging a server, checking HTTP response). Pros: simple. Cons: only tells you the server is up, not that your specific jobs ran.
Heartbeat Monitoring
The job actively reports completion. Pros: detects any failure that prevents the heartbeat from being sent. Cons: requires a small code change (adding the HTTP call).
For most teams, heartbeat monitoring provides the best signal-to-noise ratio: simple to set up, reliable, and it catches exactly what matters, namely the jobs that didn't run.
FAQ
For a complete production checklist around pings, grace periods, alert channels, and logs, see the Cron Job Monitoring Guide.
What is heartbeat monitoring for cron jobs?
Heartbeat monitoring is a pattern where a scheduled task sends a signal (like an HTTP request) when it completes. A monitoring system expects these signals on a defined schedule and alerts you if they stop arriving. It detects the absence of expected activity.
How is heartbeat monitoring different from log monitoring?
Log monitoring checks that cron tried to run a job. Heartbeat monitoring checks that the job actually completed successfully. A job can appear in cron logs while silently failing or hanging; heartbeat monitoring catches this.
Do I need a special tool for heartbeat monitoring?
Technically, no. You can build a basic version with a simple API endpoint. But dedicated tools like QuietPulse handle scheduling, alert routing, history, and edge cases (timezone handling, grace periods) out of the box.
How often should I expect heartbeats?
Your heartbeat interval should match your job's schedule plus some buffer. A daily job should heartbeat every 24 hours with a grace period of 1 to 2 hours. An hourly job might heartbeat every 60 minutes with a 15-minute grace period.
Can I send heartbeats from inside Docker containers or Kubernetes jobs?
Yes, as long as the container can make outbound HTTP requests. The heartbeat call is just a curl or equivalent, so it works from any environment with network access. In Kubernetes, this is especially useful for CronJobs: append the heartbeat call to the container's command so it runs after the job's main process completes.
What happens if my cron job runs longer than expected?
A good heartbeat monitoring service lets you set a "grace period", a buffer window after the expected completion time. If your job usually takes 2 minutes but sometimes takes 10, set the grace period to 15 minutes. That way, slow runs don't trigger false alarms, but genuinely failed jobs still get caught.
Conclusion
Cron is great at starting jobs and terrible at telling you when they fail. Heartbeat monitoring closes that gap by having each job check in when it's done. One extra line in your script, and you'll never find out about a missed backup from an angry user again.
The simplest approach: add a curl call at the end of your critical jobs. If you want something that handles scheduling, history, and alerts without building infrastructure, tools like QuietPulse make it painless. Either way, monitor the jobs that matter.