In This Article
- The Problem with Single-Location Monitoring
- What Is Distributed Monitoring?
- How a Check Actually Runs
- Types of Distributed Checks
- Consensus-Based Alerting
- Degraded vs. Hard Down: Why the Difference Matters
- Region-Aware Alerts
- How Many Monitoring Regions Do You Need?
- Eliminating False Positives
- What to Look for in a Distributed Monitoring Platform
Most monitoring tools check your service from a single server in a single datacenter. When that check fails, you get an alert. Simple enough — until the failure isn't your service at all, but a network issue between the monitoring server and your infrastructure. Or until your service is genuinely down in one region but perfectly healthy in another, and you have no way to tell the difference.
Distributed monitoring solves this by running health checks from multiple geographic locations simultaneously. But there's more to it than just "check from more places." The real value is in how those results are aggregated, how alerts are triggered, and how you can distinguish between a localized network hiccup and a genuine global outage.
This article explains how distributed monitoring checks work under the hood — from the moment a check is scheduled to the moment an alert lands in your inbox.
The Problem with Single-Location Monitoring
When you monitor from a single location, every network path between the monitoring server and your service is a potential point of failure that has nothing to do with your service itself.
Consider what happens when a monitoring server in Virginia checks your website hosted in Oregon. The request traverses multiple networks, internet exchange points, and peering agreements. If any link along that path goes down or degrades, the check fails — even though your website is serving traffic perfectly to everyone else in the world.
Single-location monitoring creates three specific problems:
- False positives. A routing issue, a congested peering point, or a brief DNS resolver hiccup at the monitoring location triggers an alert for an outage that isn't real. Your phone buzzes at 3 AM, you check your server, everything looks fine. The alert was noise.
- Missed regional outages. Your service might be down for users in Europe because of a CDN edge node failure, but your single monitoring location in the US sees everything as healthy. European users are experiencing a real outage, and you don't know about it.
- No geographic performance data. You have no way to know if your service is fast from Asia but slow from South America. Without multi-location data, you're blind to the experience of users outside the monitoring server's region.
The 3 AM Problem
A study by monitoring platform StatusCake found that up to 30% of single-location alerts are false positives caused by transient network issues. That's nearly one in three alerts waking you up for nothing. Distributed monitoring with consensus-based alerting reduces false positives to near zero.
What Is Distributed Monitoring?
Distributed monitoring runs the same health check from multiple geographic locations — called monitoring regions — and combines the results to form a more accurate picture of your service's availability.
Instead of one server in one datacenter deciding whether your site is up or down, you might have servers in US East, US West, Europe, and Asia Pacific all running the same check within seconds of each other. Each region reports its result independently, and a consensus engine evaluates the collective outcome.
This architecture answers questions that single-location monitoring can't:
- Is the service actually down, or is it a network issue? If three out of four regions report success and one reports failure, it's almost certainly a network or routing problem at the failing location — not a service outage.
- Is it a regional outage or a global one? If all regions report failure, the service is genuinely down everywhere. If two regions fail and two succeed, there's a regional issue worth investigating.
- How does performance vary by region? Response times from each location reveal geographic performance differences, CDN effectiveness, and routing inefficiencies.
How a Check Actually Runs
Here's what happens behind the scenes when a distributed monitoring platform runs a check against your service. We'll use a website uptime check as an example, since it's the most common type.
Step 1: Scheduling
A central scheduler determines which monitors are due for a check based on their configured interval — every 30 seconds, 60 seconds, 5 minutes, or whatever you've set. When a monitor is due, the scheduler dispatches the check to every monitoring region assigned to that monitor.
For example, if you've configured your website monitor to check from US East, EU West, and Asia Pacific, the scheduler sends three independent tasks — one to each region's worker queue.
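In code, the dispatch step might look like the following minimal sketch. The monitor definition and per-region queues are illustrative data structures, not any particular platform's internals:

```python
import time

# Illustrative monitor definition: what to check, how often,
# and which monitoring regions should run the check.
MONITORS = [
    {
        "id": "website-prod",
        "url": "https://example.com/health",
        "interval_seconds": 60,
        "regions": ["us-east", "eu-west", "ap-southeast"],
        "last_dispatched": 0.0,
    },
]

REGION_QUEUES = {region: [] for region in ["us-east", "eu-west", "ap-southeast"]}

def dispatch_due_checks(now: float) -> None:
    """Send one independent check task per assigned region for every due monitor."""
    for monitor in MONITORS:
        if now - monitor["last_dispatched"] >= monitor["interval_seconds"]:
            for region in monitor["regions"]:
                # Each region gets its own task; results come back independently.
                REGION_QUEUES[region].append({
                    "monitor_id": monitor["id"],
                    "url": monitor["url"],
                    "scheduled_at": now,
                })
            monitor["last_dispatched"] = now

dispatch_due_checks(time.time())
print({region: len(queue) for region, queue in REGION_QUEUES.items()})
```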
Step 2: Independent Execution
Each monitoring region has its own pool of worker servers. When a check task arrives, a worker picks it up and executes it independently:
- DNS resolution — Resolves your domain from the region's local DNS infrastructure. This alone can reveal DNS propagation issues that affect specific geographies.
- TCP connection — Opens a connection to your server. The connection time reflects the physical network distance and routing quality between the monitoring region and your server.
- TLS handshake — For HTTPS checks, negotiates the encrypted connection. This adds latency proportional to the round-trip distance.
- HTTP request and response — Sends the configured request (GET, POST, etc.) and reads the response, measuring total response time, recording the status code, and optionally validating response content.
Each region's worker records the result independently: pass or fail, response time in milliseconds, HTTP status code, and any error message. These results are stored per-region so you can see exactly what each location observed.
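Here's a simplified sketch of such a worker in Python, timing each phase with only the standard library. A real worker would also follow redirects, read the full response body, and validate content; the host below is a placeholder:

```python
import socket
import ssl
import time

def run_https_check(host: str, path: str = "/", timeout: float = 10.0) -> dict:
    """Run one HTTPS check, timing each phase the way a region worker might."""
    result = {"ok": False, "error": None}
    try:
        t0 = time.monotonic()
        # Phase 1: DNS resolution via this region's local resolver.
        ip = socket.getaddrinfo(host, 443, proto=socket.IPPROTO_TCP)[0][4][0]
        t1 = time.monotonic()
        # Phase 2: TCP connection (reflects network distance and routing quality).
        sock = socket.create_connection((ip, 443), timeout=timeout)
        t2 = time.monotonic()
        # Phase 3: TLS handshake (round trips proportional to distance).
        tls = ssl.create_default_context().wrap_socket(sock, server_hostname=host)
        t3 = time.monotonic()
        # Phase 4: HTTP request and response.
        tls.sendall(f"GET {path} HTTP/1.1\r\nHost: {host}\r\nConnection: close\r\n\r\n".encode())
        status_line = tls.recv(4096).split(b"\r\n", 1)[0].decode()
        t4 = time.monotonic()
        tls.close()
        status_code = int(status_line.split()[1])  # "HTTP/1.1 200 OK" -> 200
        result.update({
            "ok": status_code == 200,
            "status_code": status_code,
            "dns_ms": round((t1 - t0) * 1000),
            "connect_ms": round((t2 - t1) * 1000),
            "tls_ms": round((t3 - t2) * 1000),
            "total_ms": round((t4 - t0) * 1000),
        })
    except (OSError, ValueError, IndexError) as exc:
        result["error"] = str(exc)  # Timeouts and refusals are recorded, not raised.
    return result

print(run_https_check("example.com"))
```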
Step 3: Result Storage and Real-Time Delivery
Each region's result is stored in the check history with a region label. This gives you a granular, per-region timeline of your service's health. You can see that your site responded in 120ms from US East but 340ms from Asia Pacific, or that EU West saw a timeout while other regions reported success.
Results are also pushed to your dashboard in real time via WebSocket connections, so you see status updates as they happen — not on the next page refresh.
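The stored record for each region might look something like the following; the exact fields and names here are illustrative, not a fixed schema:

```python
from dataclasses import dataclass, asdict
import json
import time

@dataclass
class RegionCheckResult:
    monitor_id: str
    region: str            # the label the dashboard and history group by
    ok: bool
    response_ms: int | None
    status_code: int | None
    error: str | None
    checked_at: float

record = RegionCheckResult("website-prod", "ap-southeast", True, 340, 200, None, time.time())
# Serialized once for history storage and once more for the live WebSocket push.
print(json.dumps(asdict(record)))
```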
Step 4: Consensus Evaluation
This is where distributed monitoring gets interesting. After each region reports its result, a consensus engine evaluates the collective outcome across all regions. It asks: given the results from every region that reported in, what is the actual status of this service?
We'll cover the consensus logic in detail in the next sections.
Types of Distributed Checks
Different types of infrastructure benefit from distributed monitoring in different ways:
HTTP/HTTPS Website Checks
The most common distributed check. Each region sends an HTTP request to your URL and evaluates the response. Multi-region website checks catch CDN failures, geographic load balancer issues, and regional DNS problems that single-location checks miss entirely.
Beyond simple availability, multi-region HTTP checks reveal performance asymmetries. If your server is in US East, users in Asia might experience 3x the latency of users in North America. Distributed checks make this visible so you can decide whether to add a CDN edge or a regional server.
ICMP (Ping) Checks
ICMP ping checks measure raw network reachability and round-trip latency at the network layer — below HTTP, below TLS, below DNS. This makes them ideal for monitoring infrastructure that doesn't serve web traffic: routers, switches, firewalls, VPN concentrators, and bare-metal servers.
Distributed ICMP checks from multiple regions show you whether a device is unreachable globally (hardware failure, power outage) or only from specific network paths (routing issue, ISP problem). They also provide latency and packet loss data per region, which is invaluable for diagnosing network quality issues.
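As a rough sketch, a region worker can shell out to the system ping binary and parse its summary line. This assumes the Linux (iputils) output format; raw ICMP sockets require elevated privileges, which is one reason many tools wrap the system binary instead:

```python
import re
import subprocess

def ping_from_this_region(host: str, count: int = 5) -> dict:
    """Ping a host and extract packet loss and average RTT from Linux ping output."""
    proc = subprocess.run(
        ["ping", "-c", str(count), "-W", "2", host],
        capture_output=True, text=True,
    )
    loss = re.search(r"(\d+(?:\.\d+)?)% packet loss", proc.stdout)
    rtt = re.search(r"= [\d.]+/([\d.]+)/", proc.stdout)  # min/avg/max/mdev: capture avg
    return {
        "reachable": proc.returncode == 0,
        "packet_loss_pct": float(loss.group(1)) if loss else None,
        "avg_rtt_ms": float(rtt.group(1)) if rtt else None,
    }

print(ping_from_this_region("1.1.1.1"))
```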
API Endpoint Checks
API checks are HTTP checks with more configuration: custom headers, authentication tokens, request bodies, and response content validation. Distributed API monitoring is especially important for services with global user bases, because API performance is directly tied to user experience in each region.
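A minimal authenticated API check might look like the sketch below. The endpoint URL, bearer token, and the `"status": "healthy"` response field are all hypothetical stand-ins for your own API's contract:

```python
import json
import urllib.request

def check_api_endpoint(url: str, token: str) -> dict:
    """Authenticated API check with response-content validation."""
    req = urllib.request.Request(
        url,
        headers={"Authorization": f"Bearer {token}", "Accept": "application/json"},
    )
    try:
        with urllib.request.urlopen(req, timeout=10) as resp:
            body = json.loads(resp.read())
            # Pass only if the endpoint answers 200 AND reports itself healthy.
            return {"ok": resp.status == 200 and body.get("status") == "healthy"}
    except Exception as exc:  # HTTP errors, timeouts, and bad JSON all land here
        return {"ok": False, "error": str(exc)}

print(check_api_endpoint("https://api.example.com/health", "demo-token"))
```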
Mail Server Checks
SMTP, IMAP, and POP3 checks verify that your mail infrastructure is accepting connections and responding to protocol commands. Running these from multiple regions confirms that your mail server is reachable worldwide — not just from one network. This matters because email delivery depends heavily on network path and DNS MX record resolution, both of which can vary by geography.
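A basic SMTP reachability check fits in a few lines of Python: connect, send a NOOP, and confirm the server answers with a 250. The hostname is a placeholder, and a fuller check would also exercise IMAP/POP3 and verify MX resolution:

```python
import smtplib

def check_smtp(host: str, port: int = 25) -> dict:
    """Verify a mail server accepts connections and answers protocol commands."""
    try:
        with smtplib.SMTP(host, port, timeout=10) as smtp:
            code, _ = smtp.noop()  # lightweight "are you alive?" command
            return {"ok": code == 250, "response_code": code}
    except OSError as exc:
        return {"ok": False, "error": str(exc)}

print(check_smtp("mail.example.com"))
```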
Consensus-Based Alerting
The key innovation in distributed monitoring isn't running checks from more places — it's deciding what to do with conflicting results. When Region A says "up" and Region B says "down," is the service up or down? That's the consensus problem.
A well-designed consensus engine uses a majority-rule approach, with nuance for edge cases:
| Regions Reporting | Failures | Result | Alert |
|---|---|---|---|
| 1 | 0 | Up | None |
| 1 | 1 | Down | Critical |
| 2 | 1 | Degraded | Warning |
| 2 | 2 | Down | Critical |
| 3 | 1 | Degraded | Warning |
| 3 | 2–3 | Down | Critical |
| 4 | 1 | Degraded | Warning |
| 4 | 2+ | Down | Critical |
| 5 | 1–2 | Degraded | Warning |
| 5 | 3+ | Down | Critical |
The general rule: a minority of regions failing triggers a degradation warning, while a failure reported by at least two regions, making up at least half of those reporting, triggers a critical down alert. (A single-region monitor has no second opinion, so any failure alerts critically.) This approach dramatically reduces false positives while still catching real outages immediately.
The consensus window also matters. Regions don't all report at the exact same millisecond. A well-designed system collects results within a time window (typically twice the check interval) and evaluates consensus based on the most recent result from each region. If a region's result has expired — meaning the worker is delayed or offline — it's excluded from the consensus rather than counted as a failure.
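Putting the table and the consensus window together, the evaluation step can be sketched as follows. This is a simplified model of the rules described above, assuming each region supplies its most recent result with a timestamp; it isn't any specific product's engine:

```python
import time

def evaluate_consensus(results: list[dict], interval_s: float, now: float | None = None) -> str:
    """Evaluate the most recent per-region results against the rules in the table above.

    Each entry looks like {"region": "eu-west", "ok": False, "at": 1700000000.0}.
    """
    now = time.time() if now is None else now
    window = 2 * interval_s  # consensus window: twice the check interval
    # A stale result means the worker is delayed or offline; exclude it
    # from consensus rather than counting it as a failure.
    fresh = [r for r in results if now - r["at"] <= window]
    if not fresh:
        return "unknown"
    failures = sum(1 for r in fresh if not r["ok"])
    if failures == 0:
        return "up"
    if len(fresh) == 1:          # single-region monitor: no second opinion
        return "down"
    if failures >= 2 and failures >= len(fresh) / 2:
        return "down"            # at least two regions, and at least half, agree
    return "degraded"            # a minority of regions is failing

now = time.time()
print(evaluate_consensus(
    [{"region": "us-east", "ok": True, "at": now},
     {"region": "eu-west", "ok": False, "at": now},
     {"region": "ap-southeast", "ok": True, "at": now}],
    interval_s=60,
))  # -> degraded
```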
Degraded vs. Hard Down: Why the Difference Matters
Most monitoring tools offer a binary view: your service is either up or down. Distributed monitoring enables a third state — degraded — which turns out to be one of the most operationally useful signals you can have.
What "Degraded" Means
A degraded status means your service is failing from at least one monitoring region but passing from others. This typically indicates:
- A regional network issue. An ISP or backbone provider is having problems on a specific route. Your service is fine; the path to it from one geography is not.
- A CDN edge failure. One of your CDN's points of presence is down or misconfigured. Users in that region are affected; everyone else is fine.
- A DNS propagation issue. A DNS change hasn't reached all regions yet, or one region's resolver is returning stale records.
- A partial infrastructure failure. If you run across multiple availability zones or datacenters, one might be down while others serve traffic normally.
What "Hard Down" Means
A hard-down status means a majority of monitoring regions — or all of them — report failure. This almost always indicates a genuine service outage: your server is down, your application has crashed, your database is unreachable, or your entire hosting infrastructure is offline.
Why This Distinction Changes Your Response
A degraded alert and a hard-down alert demand very different responses:
- Degraded: Investigate, but don't panic. Check if the failing region corresponds to a specific ISP or CDN edge. Look at whether affected users can reach the service through alternative paths. This might resolve on its own as network routes reconverge, or it might require a CDN configuration change.
- Hard down: This is a real outage. All hands on deck. Check your servers, your application logs, your database, your DNS. Something is genuinely broken and needs immediate attention.
Without the degraded state, a single regional failure triggers a full critical alert. Your team scrambles, investigates, finds everything looks fine from their perspective, and writes it off as a false positive. The actual regional issue goes uninvestigated because the alert felt like a false alarm. With the degraded state, the alert accurately conveys "something is wrong, but it's not a total outage" — which is exactly the information you need to respond appropriately.
Region-Aware Alerts
The most actionable distributed monitoring alerts don't just tell you "your service is down." They tell you where it's down and where it's still working.
A well-designed alert includes a region breakdown:
| Region | Status | Response Time |
|---|---|---|
| US East (Virginia) | Up | 185ms |
| US West (Oregon) | Up | 210ms |
| EU West (Frankfurt) | Down — Connection timeout | — |
This alert immediately tells you: the service is reachable from North America but timing out from Europe. That narrows the investigation to European routing, a CDN edge in Frankfurt, or a DNS issue specific to EU resolvers. Without this breakdown, you'd see "service down" and start investigating your server — which is working fine.
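Producing that breakdown from stored per-region results is mostly a formatting exercise. Here's an illustrative formatter; the field names match the hypothetical result record from earlier:

```python
def format_alert(monitor_name: str, results: list[dict]) -> str:
    """Build an alert body with a per-region status breakdown."""
    lines = [f"{monitor_name}: degraded (regional failure detected)", ""]
    for r in results:
        if r["ok"]:
            lines.append(f"  {r['region']:<22} Up    {r['response_ms']}ms")
        else:
            lines.append(f"  {r['region']:<22} Down  {r['error']}")
    return "\n".join(lines)

print(format_alert("example.com", [
    {"region": "US East (Virginia)", "ok": True, "response_ms": 185},
    {"region": "US West (Oregon)", "ok": True, "response_ms": 210},
    {"region": "EU West (Frankfurt)", "ok": False, "error": "Connection timeout"},
]))
```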
Region-aware alerts are especially valuable for:
- Teams with global infrastructure. If you run servers or CDN edges in multiple regions, the failing region tells you which component to investigate.
- Communicating with stakeholders. "Our service is experiencing degraded access from Europe" is a much better incident communication than "our service might be down."
- Post-incident analysis. Historical per-region data shows the exact timeline: which region failed first, when it recovered, and whether the failure spread to other regions.
How Many Monitoring Regions Do You Need?
More regions isn't always better. The right number depends on your user base and infrastructure:
One Region (Primary Only)
Suitable for internal tools, staging environments, or services with users concentrated in a single geography. You get basic uptime monitoring without the complexity of consensus logic. A single check failure triggers an alert — the same as traditional monitoring.
Two Regions
The minimum for reducing false positives. If both regions report failure, it's almost certainly real. If only one fails, you get a degradation warning instead of a critical alert. This setup eliminates the majority of 3 AM false alarms while still catching genuine outages quickly.
Choose two regions that are geographically separated. US East + EU West is a good default for services with transatlantic users.
Three Regions
The sweet spot for most production services. Three regions give you clear majority consensus (2 out of 3 = confirmed outage) and coverage across major geographies. A typical setup might be US East, EU West, and Asia Pacific.
Three regions also provide meaningful performance comparison data. You can see at a glance how your service performs from each continent and identify which users have the worst experience.
Four or Five Regions
For services with truly global user bases, critical SLA requirements, or complex multi-region infrastructure. More monitoring points give you finer geographic granularity and stronger consensus confidence. The tradeoff is cost — more regions means more check executions — and slightly more complex alert interpretation.
A five-region setup might include US East, US West, EU West, Asia Pacific (Singapore or Tokyo), and South America or Australia. This covers every major continent and provides comprehensive geographic performance data.
Recommendation
Start with two monitoring regions to eliminate false positives. If your service has a global user base or SLA requirements above 99.9%, move to three. Reserve four or five regions for enterprise-grade monitoring where you need per-continent visibility.
Eliminating False Positives
False positive alerts are more than just annoying. They erode trust in your monitoring system. After a few 3 AM pages that turn out to be nothing, teams start ignoring alerts or adding delays that slow down response to real incidents. This is the "boy who cried wolf" problem, and it's the number one reason monitoring setups fail in practice.
Distributed monitoring with consensus-based alerting addresses false positives through multiple layers:
- Multi-region confirmation. A single region failure produces a warning, not a critical alert. The alert is still visible but won't wake anyone up. Critical alerts only fire when multiple regions agree that the service is down.
- Consecutive failure thresholds. Before a region even reports "down," the check can be configured to require two or more consecutive failures (see the sketch after this list). This filters out one-off network blips that resolve within seconds.
- Consensus windows. Results are evaluated within a time window, not at a single instant. If a region's check was delayed by worker queue congestion, its result is excluded rather than treated as a failure.
- Alert cooldowns. Once an alert is sent, a cooldown period prevents duplicate alerts for the same ongoing incident. You get one "service is down" alert, not one every 30 seconds for the duration of the outage.
- Recovery confirmation. A recovery alert only fires when the service has been confirmed up across regions, preventing premature "all clear" messages during flapping (rapid up/down oscillation).
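Two of these layers, consecutive-failure thresholds and alert cooldowns, are simple enough to illustrate directly. This per-region sketch uses assumed defaults (two consecutive failures, a five-minute cooldown); real thresholds should match your check interval and tolerance for delay:

```python
import time

class RegionFailureFilter:
    """Per-region noise filters: consecutive-failure threshold plus alert cooldown."""

    def __init__(self, threshold: int = 2, cooldown_s: float = 300.0):
        self.threshold = threshold    # consecutive failures required before alerting
        self.cooldown_s = cooldown_s  # minimum gap between alerts for one incident
        self.consecutive_failures = 0
        self.last_alert_at = 0.0

    def record(self, ok: bool, now: float | None = None) -> bool:
        """Feed one check result; return True only when an alert should fire."""
        now = time.time() if now is None else now
        if ok:
            self.consecutive_failures = 0  # any success resets the streak
            return False
        self.consecutive_failures += 1
        if self.consecutive_failures < self.threshold:
            return False                   # one-off blip: swallow it
        if now - self.last_alert_at < self.cooldown_s:
            return False                   # same ongoing incident: don't re-page
        self.last_alert_at = now
        return True

f = RegionFailureFilter()
print([f.record(ok) for ok in (False, False, False, True, False, False)])
# -> [False, True, False, False, False, False]; the cooldown suppresses repeats
```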
Together, these layers mean that when you do get a critical alert, you can trust it. The monitoring system has already confirmed the failure from multiple independent vantage points, filtered out transient issues, and verified that it's a real, sustained outage. That trust is what makes a monitoring system operationally useful rather than just another source of noise.
What to Look for in a Distributed Monitoring Platform
If you're evaluating monitoring tools, here's what separates a good distributed monitoring platform from one that just checks from multiple places:
- Consensus-based alerting, not just multi-location checking. Some tools check from multiple regions but alert if any single region fails. That's multi-location monitoring without the consensus logic — you'll still get false positives. Look for tools that aggregate results across regions before deciding whether to alert.
- Degraded status support. The tool should distinguish between partial regional failures and full outages. A binary up/down status with multi-region input is a missed opportunity.
- Per-region data in alerts and dashboards. You should be able to see which regions passed, which failed, and what each region observed (response time, error message, status code). This data is critical for diagnosing regional issues.
- Flexible region selection per monitor. Not every service needs five monitoring regions. You should be able to assign different regions to different monitors based on their criticality and user base geography.
- Independent region infrastructure. The monitoring regions should be genuinely independent — different datacenters, different cloud providers, different network paths. If all your "regions" are VMs in the same cloud provider's backbone, a provider-level network issue takes out your monitoring along with your service.
- Real-time result delivery. Per-region results should appear on your dashboard as they happen, not on the next polling interval. WebSocket-based dashboards give you a live view of your infrastructure health across all regions.
- Historical per-region data. Response time trends per region over days and weeks reveal geographic performance patterns that point-in-time checks miss. Your EU response times might be gradually increasing due to a routing change you haven't noticed yet.
Monitor from Multiple Regions Today
Down Device runs health checks from up to five independent monitoring regions with consensus-based alerting. Get degraded and hard-down alerts with per-region breakdowns — so you always know exactly what's failing and where. Free plan available — no credit card required.
Wrapping Up
Distributed monitoring isn't just "checking from more places." It's a fundamentally different approach to determining whether your service is healthy. By running independent checks from multiple geographic regions and applying consensus logic to the results, you get alerts that are both more accurate and more actionable than single-location monitoring can provide.
The key ideas to remember:
- Single-location monitoring generates false positives and misses regional outages. Distributed monitoring fixes both problems.
- Consensus-based alerting uses majority agreement across regions to separate real outages from transient network issues.
- The degraded state — one region failing while others pass — is an operationally useful signal that binary up/down monitoring can't provide.
- Region-aware alerts that show exactly where failures occurred make incident response faster and more targeted.
- Two or three monitoring regions eliminate most false positives. Five regions give you per-continent visibility for global services.
If your current monitoring setup wakes you up for outages that turn out to be nothing, or if you've ever discovered a regional issue because a customer reported it, distributed monitoring with consensus alerting is the fix. Set it up once, and let the monitoring system do the work of confirming whether a failure is real before it pages you.
Ready to set up distributed monitoring? Check out Down Device's plans or contact our team for a walkthrough.