Heartbeat

Blip heartbeat works in conjunction with the repl.lag metric collector to measure replication lag. Although MySQL has built-in replication heartbeats and lag metrics, they are not always enabled or accurate. For example, Seconds_Behind_Source from SHOW REPLICA STATUS (or Seconds_Behind_Master from SHOW SLAVE STATUS before MySQL 8.022) is always on but infamously inaccurate: it reports zero when a network issue blocks replication. Consequently, external replication heartbeats are an industry norm because they are easy and accurate—and they work the same across all versions and distributions of MySQL, including the the cloud.

Quick Start

Presuming one source MySQL instance and one read-only replica, the minimal configuration is:

Create the heartbeat table on the source.
Grant the Blip MySQL user these privileges:
• REPLICATION CLIENT ON *.*
• SELECT, INSERT, UPDATE, DELETE ON blip.heartbeat
Enable the heartbeat in the Blip config:

heartbeat:
freq: 2s

Enable the repl.lag metric collector in the Blip plan:

level:
  collect:
       repl.lag:

With this minimal configuration, Blip tries and fails to write heartbeats on the read-only replica, but it keeps trying in the expectation that the replica can become the source after a failover.

Table

The default heartbeat table is blip.heartbeat:

CREATE TABLE IF NOT EXISTS heartbeat (
  src_id   varchar(200)      NOT NULL PRIMARY KEY,
  src_role varchar(200)          NULL DEFAULT NULL,
  ts       timestamp(3)      NOT NULL,  -- heartbeat
  freq     smallint unsigned NOT NULL   -- milliseconds
) ENGINE=InnoDB;

Replication Topology

Replication lag is a point-to-point measurement between a source and a replica, but replication topologies change due to maintenance and failures. That makes it difficult to configure heartbeat because sources and replicas change. Even though the plan changing can change the Blip configuration based on the state of MySQL, that is not sufficient when there are more than three or more nodes in the replication topology and any node might become the source.

To address these challenges, Blip heartbeat has two concepts: source reporting and source following.

Source Reporting

Source reporting determines the value a monitor reports as its replication source ID (or “source” for short). By default, a monitor reports monitor.id as its source.

The default works if MySQL nodes have valid hostnames and replicas are configured to use those hostnames. But this is not always the case, especially in the cloud. To override the default, set config.heartbeat.source (or config.monitors.heartbeat.source) to report a different source value:

monitors:
  - id: host1.local
    heartbeat:
      source: "node1"

The config snippet above will make monitor host1.local report itself as replication source ID node1. Here’s a more advanced configuration that does the same but uses config defaults and interpolation:

heartbeat:
  source: "%{monitor.tags.db_id}"

monitors:
  - id: host1.local
    tags:
      db_id: "node1"

The monitor must report a replication source ID when heartbeat is enabled, and every source must be unique in the replication topology. A monitor can also report an optional source role: a user-defined value that multiple nodes in the replication can report (a shared value). For example, suppose that a replication topology has four nodes in two different regions. The heartbeat config might look like:

heartbeat:
  role: "%{monitor.tags.region}"

monitors:
  - id: host1.local
    tags:
      region: "west"

  - id: host2.local
    tags:
      region: "west"

  - id: host3.local
    tags:
      region: "east"

  - id: host4.local
    tags:
      region: "east"

Nodes host1 and host2 report role west, and nodes host3 and host4 report role east. Roles are used for replication following.

Source Following

Source following refers to the method by which a monitor determines the source from which the lag is measured and reported using the repl.lag collector. The three methods in order of precedence for most to least specific: source ID, role, and latest.

Blip heart does not automatically find the source of a replica, but this feature might be added later. It does not support MySQL multi-source replication, and there are no plans to support this feature.

Source ID

level:
  collect:
    repl.lag:
      options:
        source-id: "node1"

If the source-id option is specified, the monitor will report replication lag only from the monitor reporting as node1, in this example. Option source-id takes precedent over other options because it’s the most specific.

Role

level:
  collect:
    repl.lag:
      options:
        source-role: "east"

If the source-role option is specified, the monitor will report replication lag from the latest timestamp of any monitor reporting role east, in this example. This is useful when a set of nodes replicate only from another specific set of nodes, such as nodes in a disaster recovery (DR) region replicating only from nodes in the primary (or active) region. In this case, monitors in each region can report a role and follow the role of the other region.

Following is an advanced example of the monitor and plan configuration for two nodes in two regions where the one follows the older based on DR region (using config interpolation):

heartbeat:
  role: "%{monitor.tags.region}"

monitors:
  - id: host1.local
    tags:
      region: "west"
      dr_region: "east"

  - id: host3.local
    tags:
      region: "east"
      dr_region: "west"

level:
  collect:
    repl.lag:
      options:
        source-role: "%{monitor.tags.dr_region}"

host1 is active in region west with DR region east. host2 is the opposite: active in east with DR in west. If host1 is the source, then host3 will report replication from it because it’s configured to follow any monitor reporting its DR region (west), which is what host1 reports as its role. After DR failover, the situation reverse: host3 reports its role as east, which is what host1 is configured to follow since that is its DR region.

Latest

level:
  collect:
    repl.lag:
      # Defaults

If source-id and source-role are not specified (the default), the monitor will report replication lag from the latest timestamp of any monitor. This is useful when the replication topology guarantees that only one node is writable (read_only=0) at all times, which is typical for MySQL replication topologies. In this case, every monitor follows (reports replication lag) from whichever node happens to be the (writable) source, as indicated by the fact that it’s able to write heartbeats.

Repl Check

The repl.lag collector option repl-check is used to ignore (not report) replication lag on nodes that are not actually replicas. This is common in the cloud: a reader instance is not a replica but, rather, another MySQL instance that reads data from shared (network-backed) storage. A reader will see heartbeats from the source, but this is not true replication lag.

The repl-check value must be a global MySQL system variable (“sysvar”). If the sysvar value is zero, then Blip does not report replication lag. server_id is the recommended sysvar because reader instances typically have this set to zero. On failover—when a reader becomes the writer—server_id is set to a non-zero value, which makes Blip replication reporting follow the reader-writer changes in the cloud.

Repl check does not use SHOW REPLICA STATUS because that can, in rare cases, be slow to respond or impact MySQL. Using a global MySQL system variable is extremely fast and zero impact.

Accuracy

Blip heartbeat uses a new approach to replication lag monitoring that decouples read accuracy from write frequency. Blip can write heartbeats every 2 seconds, read them every 5 seconds, and still measure (and report) sub-second lag. It is not necessary (or advised) to configure high-frequency heartbeats. The recommended heartbeat frequency is 2 seconds:

heartbeat:
  freq: 2s

Blip always measures and reports replication lag with sub-second precision. (The replication lag metric, repl.lag.current, is reported in milliseconds.) The heartbeat and plan level frequencies do not affect replication lag accuracy. The former determines how frequently replication lag is tested. The latter determines how frequently replication lag is reported as a metric.

Like all external replication heartbeats, accuracy is affected by the clock skew between the source and replica. Blip presumes clock skew is far less than network latency (between source and replica) such that inaccuracy is overwhelmingly due to fluctuations in network latency. The repl.lag collector option network-latency (default: 50 ms) accounts for this presumption. If the source writes a heartbeat at time T, then it should arrive on the replica T + network-latency, presuming no appreciable clock skew. Blip checks for the heartbeat at T + network-latency and subtracts network-latency from the difference of the heartbeat timestamps. If clock skew is negligible, and network latency is steady as configured, and MySQL replication is not lagging, then Blip will measure near-zero lag (skewed only by the few microseconds it takes MySQL to execute the lag check query).