How Distributed Monitoring Checks Work

In This Article

  1. The Problem with Single-Location Monitoring
  2. What Is Distributed Monitoring?
  3. How a Check Actually Runs
  4. Types of Distributed Checks
  5. Consensus-Based Alerting
  6. Degraded vs. Hard Down: Why the Difference Matters
  7. Region-Aware Alerts
  8. How Many Monitoring Regions Do You Need?
  9. Eliminating False Positives
  10. What to Look for in a Distributed Monitoring Platform

Most monitoring tools check your service from a single server in a single datacenter. When that check fails, you get an alert. Simple enough — until the failure isn't your service at all, but a network issue between the monitoring server and your infrastructure. Or until your service is genuinely down in one region but perfectly healthy in another, and you have no way to tell the difference.

Distributed monitoring solves this by running health checks from multiple geographic locations simultaneously. But there's more to it than just "check from more places." The real value is in how those results are aggregated, how alerts are triggered, and how you can distinguish between a localized network hiccup and a genuine global outage.

This article explains how distributed monitoring checks work under the hood — from the moment a check is scheduled to the moment an alert lands in your inbox.

The Problem with Single-Location Monitoring

When you monitor from a single location, every network path between the monitoring server and your service is a potential point of failure that has nothing to do with your service itself.

Consider what happens when a monitoring server in Virginia checks your website hosted in Oregon. The request traverses multiple networks, internet exchange points, and peering agreements. If any link along that path goes down or degrades, the check fails — even though your website is serving traffic perfectly to everyone else in the world.

Single-location monitoring creates three specific problems:

  - False positives: transient network issues between the monitoring server and your service trigger alerts even though the service is healthy.
  - No regional visibility: you can't tell whether a failure is global or limited to one geography.
  - No performance comparison: you never see how response times differ for users in other parts of the world.

The 3 AM Problem

A study by monitoring platform StatusCake found that up to 30% of single-location alerts are false positives caused by transient network issues. That's nearly one in three alerts waking you up for nothing. Distributed monitoring with consensus-based alerting reduces false positives to near zero.

What Is Distributed Monitoring?

Distributed monitoring runs the same health check from multiple geographic locations — called monitoring regions — and combines the results to form a more accurate picture of your service's availability.

Instead of one server in one datacenter deciding whether your site is up or down, you might have servers in US East, US West, Europe, and Asia Pacific all running the same check within seconds of each other. Each region reports its result independently, and a consensus engine evaluates the collective outcome.

This architecture answers questions that single-location monitoring can't:

  - Is the service down everywhere, or only from one part of the world?
  - Is the failure in the service itself, or in the network path to it?
  - How does response time differ for users on each continent?

How a Check Actually Runs

Here's what happens behind the scenes when a distributed monitoring platform runs a check against your service. We'll use a website uptime check as an example, since it's the most common type.

Step 1: Scheduling

A central scheduler determines which monitors are due for a check based on their configured interval — every 30 seconds, 60 seconds, 5 minutes, or whatever you've set. When a monitor is due, the scheduler dispatches the check to every monitoring region assigned to that monitor.

For example, if you've configured your website monitor to check from US East, EU West, and Asia Pacific, the scheduler sends three independent tasks — one to each region's worker queue.
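The fan-out described above can be sketched in a few lines. This is a simplified in-memory illustration, not a production design: the monitor definitions, region names, and queue layout are all assumptions for the example (real schedulers use durable, distributed queues).

```python
import queue

# Hypothetical monitor definitions; ids, intervals, and regions are assumptions.
MONITORS = [
    {"id": "website", "interval_s": 60, "regions": ["us-east", "eu-west", "ap-southeast"]},
    {"id": "api", "interval_s": 30, "regions": ["us-east", "eu-west"]},
]

# One worker queue per monitoring region.
REGION_QUEUES = {r: queue.Queue() for r in ("us-east", "eu-west", "ap-southeast")}

def dispatch_due(monitors, last_run, now):
    """Enqueue one independent check task per assigned region for every due monitor."""
    for m in monitors:
        if now - last_run.get(m["id"], float("-inf")) >= m["interval_s"]:
            for region in m["regions"]:
                REGION_QUEUES[region].put({"monitor": m["id"], "scheduled_at": now})
            last_run[m["id"]] = now
```

On the first pass every monitor is due, so the website monitor produces three tasks (one per region) and the api monitor two; thirty seconds later only the api monitor is due again.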

Step 2: Independent Execution

Each monitoring region has its own pool of worker servers. When a check task arrives, a worker picks it up and executes it independently:

  1. DNS resolution — Resolves your domain from the region's local DNS infrastructure. This alone can reveal DNS propagation issues that affect specific geographies.
  2. TCP connection — Opens a connection to your server. The connection time reflects the physical network distance and routing quality between the monitoring region and your server.
  3. TLS handshake — For HTTPS checks, negotiates the encrypted connection. This adds latency proportional to the round-trip distance.
  4. HTTP request and response — Sends the configured request (GET, POST, etc.) and reads the response. Measures total response time, status code, and optionally validates response content.

Each region's worker records the result independently: pass or fail, response time in milliseconds, HTTP status code, and any error message. These results are stored per-region so you can see exactly what each location observed.
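The phases above can be sketched as a single worker function. This is a deliberately simplified illustration, not a production check: it speaks raw HTTP/1.1 over a plain socket, skips the TLS phase, and does minimal response parsing.

```python
import socket
import time

def run_check(host, port, path="/", timeout=5.0):
    """One region's check: DNS resolution, TCP connect, HTTP request/response."""
    result = {"pass": False, "status": None, "error": None, "timings_ms": {}}
    try:
        t0 = time.monotonic()
        # Phase 1: DNS resolution (real workers use the region's local resolvers)
        family, _, _, _, addr = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)[0]
        result["timings_ms"]["dns"] = (time.monotonic() - t0) * 1000
        # Phase 2: TCP connection (connect time reflects network distance)
        t1 = time.monotonic()
        sock = socket.create_connection(addr[:2], timeout=timeout)
        result["timings_ms"]["tcp"] = (time.monotonic() - t1) * 1000
        # Phase 3 (TLS) would wrap the socket with ssl here for HTTPS checks.
        # Phase 4: HTTP request and response
        t2 = time.monotonic()
        request = f"GET {path} HTTP/1.1\r\nHost: {host}\r\nConnection: close\r\n\r\n"
        sock.sendall(request.encode())
        raw = b""
        while chunk := sock.recv(4096):
            raw += chunk
        sock.close()
        result["timings_ms"]["http"] = (time.monotonic() - t2) * 1000
        result["status"] = int(raw.split(b" ", 2)[1])
        result["pass"] = 200 <= result["status"] < 400
    except (OSError, ValueError, IndexError) as exc:
        result["error"] = str(exc)
    return result
```

The returned dictionary mirrors the per-region record described here: pass/fail, status code, error message, and a timing for each phase.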

Step 3: Result Storage and Real-Time Delivery

Each region's result is stored in the check history with a region label. This gives you a granular, per-region timeline of your service's health. You can see that your site responded in 120ms from US East but 340ms from Asia Pacific, or that EU West saw a timeout while other regions reported success.

Results are also pushed to your dashboard in real time via WebSocket connections, so you see status updates as they happen — not on the next page refresh.

Step 4: Consensus Evaluation

This is where distributed monitoring gets interesting. After each region reports its result, a consensus engine evaluates the collective outcome across all regions. It asks: given the results from every region that reported in, what is the actual status of this service?

We'll cover the consensus logic in detail in the next sections.

Types of Distributed Checks

Different types of infrastructure benefit from distributed monitoring in different ways:

HTTP/HTTPS Website Checks

The most common distributed check. Each region sends an HTTP request to your URL and evaluates the response. Multi-region website checks catch CDN failures, geographic load balancer issues, and regional DNS problems that single-location checks miss entirely.

Beyond simple availability, multi-region HTTP checks reveal performance asymmetries. If your server is in US East, users in Asia might experience 3x the latency of users in North America. Distributed checks make this visible so you can decide whether to add a CDN edge or a regional server.

ICMP (Ping) Checks

ICMP ping checks measure raw network reachability and round-trip latency at the network layer — below HTTP, below TLS, below DNS. This makes them ideal for monitoring infrastructure that doesn't serve web traffic: routers, switches, firewalls, VPN concentrators, and bare-metal servers.

Distributed ICMP checks from multiple regions show you whether a device is unreachable globally (hardware failure, power outage) or only from specific network paths (routing issue, ISP problem). They also provide latency and packet loss data per region, which is invaluable for diagnosing network quality issues.

API Endpoint Checks

API checks are HTTP checks with more configuration: custom headers, authentication tokens, request bodies, and response content validation. Distributed API monitoring is especially important for services with global user bases, because API performance is directly tied to user experience in each region.
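A minimal sketch of what "more configuration" means in practice: a check definition that every region executes identically, plus a validator for the raw response. The function names, fields, and the example URL and token are all hypothetical.

```python
def build_api_check(url, method="GET", headers=None, body=None,
                    expect_status=200, must_contain=None):
    """Describe one API check; each monitoring region runs the same definition."""
    return {"url": url, "method": method, "headers": headers or {},
            "body": body, "expect_status": expect_status,
            "must_contain": must_contain}

def validate_response(check, status, body_text):
    """Evaluate a region's raw response against the check definition."""
    if status != check["expect_status"]:
        return False, f"expected HTTP {check['expect_status']}, got {status}"
    if check["must_contain"] and check["must_contain"] not in body_text:
        return False, "response body missing expected content"
    return True, None

# Hypothetical usage: an authenticated health endpoint with content validation.
check = build_api_check("https://api.example.com/health",
                        headers={"Authorization": "Bearer TOKEN"},
                        must_contain='"status":"ok"')
```

Content validation matters because an API can return 200 with an error payload; checking the body catches failures a status-code check would miss.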

Mail Server Checks

SMTP, IMAP, and POP3 checks verify that your mail infrastructure is accepting connections and responding to protocol commands. Running these from multiple regions confirms that your mail server is reachable worldwide — not just from one network. This matters because email delivery depends heavily on network path and DNS MX record resolution, both of which can vary by geography.

Consensus-Based Alerting

The key innovation in distributed monitoring isn't running checks from more places — it's deciding what to do with conflicting results. When Region A says "up" and Region B says "down," is the service up or down? That's the consensus problem.

A well-designed consensus engine uses a majority-rule approach, with nuance for edge cases:

  Regions   Reporting Failures   Result     Alert
  1         0                    Up         None
  1         1                    Down       Critical
  2         1                    Degraded   Warning
  3         1                    Degraded   Warning
  3         2–3                  Down       Critical
  4         1                    Degraded   Warning
  4         2+                   Down       Critical
  5         1                    Degraded   Warning
  5         3+                   Down       Critical

The general rule: one region failing triggers a degradation warning. A majority of regions failing triggers a critical down alert. This approach dramatically reduces false positives while still catching real outages immediately.
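The table's logic fits in a few lines. The down threshold here (at least two failing regions, and at least half of all regions) is inferred from the table rows, so treat this as one plausible consensus policy rather than a universal rule.

```python
import math

def evaluate_consensus(total_regions, failing_regions):
    """Map per-region results to a (status, alert) pair following the table."""
    if failing_regions == 0:
        return "up", None
    if total_regions == 1:
        # No consensus is possible with one region: any failure is critical.
        return "down", "critical"
    # Down requires at least two failing regions AND at least half of all regions.
    down_threshold = max(2, math.ceil(total_regions / 2))
    if failing_regions >= down_threshold:
        return "down", "critical"
    # A minority failure is a degradation warning, not a critical page.
    return "degraded", "warning"
```

Note the two-region edge case: one failure out of two is exactly half, but a single region's failure is still treated as degradation to avoid paging on a transient local issue.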

The consensus window also matters. Regions don't all report at the exact same millisecond. A well-designed system collects results within a time window (typically twice the check interval) and evaluates consensus based on the most recent result from each region. If a region's result has expired — meaning the worker is delayed or offline — it's excluded from the consensus rather than counted as a failure.
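A sketch of that freshness filter, assuming each region's latest result carries a Unix timestamp under a hypothetical `ts` key:

```python
def fresh_results(latest_by_region, interval_s, now):
    """Keep only results newer than twice the check interval.

    A stale region's worker is delayed or offline, so it is excluded
    from consensus rather than counted as a failure.
    """
    window = 2 * interval_s
    return {region: res for region, res in latest_by_region.items()
            if now - res["ts"] <= window}
```

The surviving dictionary is what the consensus engine evaluates, so a lagging worker shrinks the denominator instead of producing a phantom failure.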

Degraded vs. Hard Down: Why the Difference Matters

Most monitoring tools offer a binary view: your service is either up or down. Distributed monitoring enables a third state — degraded — which turns out to be one of the most operationally useful signals you can have.

What "Degraded" Means

A degraded status means your service is failing from at least one monitoring region but passing from others. This typically indicates:

  - a network routing problem on the path between that region and your server
  - a failing CDN edge or regional load balancer node
  - a DNS issue affecting resolvers in that geography
  - a problem with the monitoring region itself

What "Hard Down" Means

A hard-down status means a majority of monitoring regions — or all of them — report failure. This almost always indicates a genuine service outage: your server is down, your application has crashed, your database is unreachable, or your entire hosting infrastructure is offline.

Why This Distinction Changes Your Response

A degraded alert and a hard-down alert demand very different responses:

  - Degraded: investigate the affected region's network path, CDN edge, or DNS at normal priority; your service is still serving most users.
  - Hard down: treat it as a full incident and begin emergency response immediately; your service is unreachable for everyone.

Without the degraded state, a single regional failure triggers a full critical alert. Your team scrambles, investigates, finds everything looks fine from their perspective, and writes it off as a false positive. The actual regional issue goes uninvestigated because the alert felt like a false alarm. With the degraded state, the alert accurately conveys "something is wrong, but it's not a total outage" — which is exactly the information you need to respond appropriately.

Region-Aware Alerts

The most actionable distributed monitoring alerts don't just tell you "your service is down." They tell you where it's down and where it's still working.

A well-designed alert includes a region breakdown:

  Region               Status   Response Time
  US East (Virginia)   Up       185ms
  US West (Oregon)     Up       210ms
  EU West (Frankfurt)  Down     Connection timeout

This alert immediately tells you: the service is reachable from North America but timing out from Europe. That narrows the investigation to European routing, a CDN edge in Frankfurt, or a DNS issue specific to EU resolvers. Without this breakdown, you'd see "service down" and start investigating your server — which is working fine.

Region-aware alerts are especially valuable for:

  - services behind a CDN or geographic load balancer, where a single edge can fail independently
  - applications with a global user base, where a regional outage affects a specific set of customers
  - multi-region deployments, where the breakdown points directly at the failing region

How Many Monitoring Regions Do You Need?

More regions isn't always better. The right number depends on your user base and infrastructure:

One Region (Primary Only)

Suitable for internal tools, staging environments, or services with users concentrated in a single geography. You get basic uptime monitoring without the complexity of consensus logic. A single check failure triggers an alert — the same as traditional monitoring.

Two Regions

The minimum for reducing false positives. If both regions report failure, it's almost certainly real. If only one fails, you get a degradation warning instead of a critical alert. This setup eliminates the majority of 3 AM false alarms while still catching genuine outages quickly.

Choose two regions that are geographically separated. US East + EU West is a good default for services with transatlantic users.

Three Regions

The sweet spot for most production services. Three regions give you clear majority consensus (2 out of 3 = confirmed outage) and coverage across major geographies. A typical setup might be US East, EU West, and Asia Pacific.

Three regions also provide meaningful performance comparison data. You can see at a glance how your service performs from each continent and identify which users have the worst experience.

Four or Five Regions

For services with truly global user bases, critical SLA requirements, or complex multi-region infrastructure. More monitoring points give you finer geographic granularity and stronger consensus confidence. The tradeoff is cost — more regions means more check executions — and slightly more complex alert interpretation.

A five-region setup might include US East, US West, EU West, Asia Pacific (Singapore or Tokyo), and South America or Australia. This covers every major continent and provides comprehensive geographic performance data.

Recommendation

Start with two monitoring regions to eliminate false positives. If your service has a global user base or SLA requirements above 99.9%, move to three. Reserve four or five regions for enterprise-grade monitoring where you need per-continent visibility.

Eliminating False Positives

False positive alerts are more than just annoying. They erode trust in your monitoring system. After a few 3 AM pages that turn out to be nothing, teams start ignoring alerts or adding delays that slow down response to real incidents. This is the "boy who cried wolf" problem, and it's the number one reason monitoring setups fail in practice.

Distributed monitoring with consensus-based alerting addresses false positives through multiple layers:

  - Independent vantage points: a transient network issue near one region doesn't affect the others.
  - Majority consensus: a single-region failure produces a degradation warning, not a critical page.
  - Freshness windows: stale results from delayed or offline workers are excluded rather than counted as failures.

Together, these layers mean that when you do get a critical alert, you can trust it. The monitoring system has already confirmed the failure from multiple independent vantage points, filtered out transient issues, and verified that it's a real, sustained outage. That trust is what makes a monitoring system operationally useful rather than just another source of noise.

What to Look for in a Distributed Monitoring Platform

If you're evaluating monitoring tools, here's what separates a good distributed monitoring platform from one that just checks from multiple places:

  - Consensus-based alerting with a distinct degraded state, not just a binary up/down
  - Per-region result history, so you can see exactly what each location observed
  - Region-aware alerts that include a breakdown of which regions failed and which passed
  - Configurable regions per monitor, so coverage matches your user base
  - Multiple check types (HTTP/HTTPS, ICMP, API, mail) available from every region

Monitor from Multiple Regions Today

Down Device runs health checks from up to five independent monitoring regions with consensus-based alerting. Get degraded and hard-down alerts with per-region breakdowns — so you always know exactly what's failing and where. Free plan available — no credit card required.

Start Free Trial

Wrapping Up

Distributed monitoring isn't just "checking from more places." It's a fundamentally different approach to determining whether your service is healthy. By running independent checks from multiple geographic regions and applying consensus logic to the results, you get alerts that are both more accurate and more actionable than single-location monitoring can provide.

The key ideas to remember:

  - Each region runs the same check independently; a consensus engine combines the results.
  - One region failing means degraded; a majority failing means hard down.
  - Region-aware alerts tell you where a failure is, not just that one happened.
  - Two regions eliminate most false positives; three are the sweet spot for production services.

If your current monitoring setup wakes you up for outages that turn out to be nothing, or if you've ever discovered a regional issue because a customer reported it, distributed monitoring with consensus alerting is the fix. Set it up once, and let the monitoring system do the work of confirming whether a failure is real before it pages you.

Ready to set up distributed monitoring? Check out Down Device's plans or contact our team for a walkthrough.