Collectors

Collectors are low-level components that collect metrics for domains. See metrics/status.global/global.go for a reference example with extensive code comments.

Domain Names

Blip metric domain names have three requirements:

Always lowercase
One word: [a-z]+
Singular noun: “size” not “sizes”; “query” not “queries”

Common abbreviation and acronyms are preferred, especially when they match MySQL usage: “thd” not “thread”; “pfs” not “performanceschema”; and so on.

Currently, domain names fit this convention, but if a need arises to allow hyphenation (“domain-name”), it might be allowed. Snake case (“domain_name”) and camel case (“domainName”) are not allowed: the former is used by metrics, and the latter is not Blip style.

Subdomains

Blip domains are dot-separated into subdomains, like status.global: domain is status, subdomain is global. Blip uses subdomains for metric grouping and, in rare cases, Blip organization. See the Domain Quick Reference for a full list of domains and subdomains.

Metric Grouping

Many MySQL metrics are grouped, and Blip mirrors the same grouping in its domain names. This has changed over many years, so if it seems new to you, then consider this:

Long, log ago only SHOW STATUS existed
Then SHOW GLOBAL STATUS came into being
And now we Status Variable Summary Tables

This is why Blip uses domain status.global and not just “status” because the latter is ambiguous.

For brevity, all Blip domains and subdomains are called “domains”. repl is a domain, and status.global is a domain. The subdomain distinction is made only when discussing and naming them.

Global groups, like status.global and var.global, do not have a metric group because what would the group key and value be? During Blip developed, we considered a global group key-value like all="*" or all="", but these are useless magical values that serve no purpose, so we omitted them. Groups are only used when there are meaningful non-global group keys and values, like size.table: these metrics must be grouped by db and tbl to make sense.

Blip Organization

In rare cases, we subdomain for greater clarity and organization in Blip, especially with respect to level collection. For example, with repl and repl.lag a user might want to collect repl.lag frequently (every second) but collect repl info more slowly. If these two were one domain, it would be less clear in the plan and more difficult to code in the collector:

rep_lag:
  freq: 1s
  collect:
    repl:
      options:
        source-role: "east"
      metrics:
        - lag

repl_info:
  freq: 30s
  collect:
    repl:
      metrics:
        - running

That plan is valid, it’s more difficult to code in a single collector given totally different work (and options) at different levels. While collectors must handle different metrics at different levels, it’s usually the same work, which makes coding the collector easier. Also, the plan is less clear since it’s the same domain but configured differently.

By contrast, separate domains makes it easier to develop, test, explain, and use each:

repl collects metrics from SHOW REPLICA STATUS
repl.lag measures and collects replication lag from Blip heartbeat

We subdomain for Blip organization judiciously. If you’re making a custom collector that you want to merge upstream (into public Blip), be sure to file an issue and discuss with us first.

Metric Names

MySQL Metrics

Blip reports MySQL metric names as-is (no renaming) so that what you see in MySQL is what you get in Blip. The only modification Blip makes is lowercasing MySQL metric names for consistency because in MySQL they’re inconsistent:

Foo_bar (most common)
Foo_Bar (replica status)
foo_bar (InnoDB metrics)

If your collector only reports MySQL metrics, then just strings.ToLower() the name as-is from MySQL.

Derived Metrics

If your collector creates and reports derived metrics, then there are three requirements for naming:

Only snake_case
Always lowercase
No prefixes or suffixes

A fully-qualified metric name includes a domain: status.global.threads_running. The metric name is always the last field (split on .).

Code

Example https://github.com/cashapp/blip/tree/main/examples/integrate shows how to create a custom metrics collector for domain “foo”. The high-level work is:

Implement blip.Collector and blip.CollectorFactory
Register the domain/collector by calling metrics.Register(myFactory, "foo")
Use the domain in a plan:

level:
  collect:
    foo: # Custom domain/collector
      metrics:
        - whatever

Long-running

As of Blip v1.2.0, long-running collectors are possible using one of two approaches:

Background thread
ErrMore

But first, it’s important to know that when Collect is called, it’s passed a context (ctx) called the collector max runtime (CMR). The CMR duration is set to the configured level frequency (as set be the user in the plan). A collector must stop running when the CMR context is cancelled.

Background Thread

A collector can run background threads (goroutine) to do anything between and during calls to Collect. Then, when Collect is called, the collector quickly returns the latest value(s).

Using this method, the collector should return a cleanup function (callback) from Prepare. Blip calls collector cleanup functions if/when a monitor is stopped or restarted.

ErrMore

If Collect returns ErrMore (with or without values), Blip keeps calling Collect until the collector stops returning ErrMore or its CMR expires. A collector can run for awhile (within its CMR) by returning ErrMore to signal that it has more values to report.

When used, the second and subsequent calls to Collect have zero value arguments: nil context and empty string for level name. A collector should expect this since it happens only if the collector previously returned ErrMore.