Lecture Notes for Metrics - A Short, Practical Primer

Outline

First Principles

what is a metric

in terms of data

it is a label, a number and a timestamp. so, a time-indexed kv pair. some other details about the number can be kept for fun sometimes.
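
a minimal sketch of that shape, for the in-class discussion. the class name `Metric`, its field names, and the example label `web01.cpu.user` are all made up for illustration; real collectors each have their own representation.

```python
from dataclasses import dataclass, field
import time


@dataclass(frozen=True)
class Metric:
    """One measurement: a label, a number, and when it was taken."""
    label: str        # what was measured (hypothetical naming scheme)
    value: float      # the measurement itself
    timestamp: float = field(default_factory=time.time)  # seconds since epoch


# A single sample with a fixed timestamp so the example is reproducible.
m = Metric(label="web01.cpu.user", value=12.5, timestamp=1700000000.0)

# The "time-indexed kv pair" view: key is (label, timestamp), value is the number.
as_kv = {(m.label, m.timestamp): m.value}
```

the "other details about the number" (units, type, tags) would be extra fields on the same record.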
in terms of use

it is a single piece of information you want to have about a "running" "system". i.e. for anything that you can measure, the output of the measurement is a metric.
in terms of "why"

well, the scientific method.

debugging something? have a target to meet? need to know how to provision a resource? need to know if you're wasting a resource? metrics to the rescue! find a way to a) quantify your targets and b) measure your world, and you have a panacea: answers to all your questions!

(further reading: How to Measure Anything, Douglas Hubbard)

data challenges

- collection and storage
- timestamps, how do they work
- storage optimization

use challenges

- querying and analysis
- cardinality
- size
- bucketing: the ultimate shitty tradeoff between storage and analysis requirements
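
to make the bucketing tradeoff concrete, here is a small sketch. the bucket bounds and the sample latencies are invented; the point is only that a handful of counters replaces every raw sample, and that detail inside a bucket is gone for good.

```python
from bisect import bisect_left

# Upper bounds of each bucket in milliseconds (assumed values for the example).
BUCKET_BOUNDS_MS = [10, 50, 100, 500]


def bucketize(samples, bounds):
    """Count samples per bucket; the last slot catches everything above the top bound."""
    counts = [0] * (len(bounds) + 1)
    for s in samples:
        # bisect_left puts a sample equal to a bound into that bound's bucket.
        counts[bisect_left(bounds, s)] += 1
    return counts


samples = [3, 8, 12, 47, 51, 99, 100, 250, 900]
counts = bucketize(samples, BUCKET_BOUNDS_MS)
# counts is [2, 2, 3, 1, 1]: five numbers stored instead of nine samples.
```

after bucketing, 51 ms and 99 ms are indistinguishable: cheap to store, but any question finer than the bucket width can no longer be answered.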

motivation challenges

measuring irrelevant things
bad labeling
- nothing is more irritating than rewriting, at query time, five labels that ought to have been one label.
treating documents and metrics as the same
- cardinality makes a difference.
- measurement collection and storage is not the same as analysis-artifact caching. Unix principle - small purposeful tools.
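
a back-of-the-envelope sketch of why cardinality makes a difference. the label names and counts are invented, and the product is an upper bound that assumes every combination of label values actually occurs.

```python
from math import prod

# Number of distinct values per label (made-up numbers for illustration).
label_values = {
    "host": 50,
    "endpoint": 20,
    "status_code": 5,
}

# Each distinct combination of label values is its own time series.
series = prod(label_values.values())

# Adding a document-like label (one value per user) multiplies everything.
with_user_id = series * 100_000
```

here `series` is 5,000 while `with_user_id` is 500,000,000, which is the difference between a metric and a document store.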

use cases:

debugging
- how to investigate a problem given a set of metrics?
- what ought you be always collecting in case you need to investigate something?
  - "system" metrics - trying to measure the function of all the critical pieces of an OS
  - "success" metrics - is the thing I want to happen, happening?
    - "uptime" usually tracks the history of these
alerting
- "success" metrics are the zeroth alert.
  - "is this working?" tests.
  - sometimes the answer is on a spectrum, which means you decide on thresholds to divide the spectrum into black and white. (or green, orange, and red, if you like to pretend you care about an orange alert.)
- the reactionary flow of alert setting
  - something went drastically wrong
  - it turns out it was due to a certain "bad behaviour"
  - let's alert on that bad behaviour so that we catch it the next time it goes wrong
- strive to do better
  - catch frequent offenders
  - identify first principles
  - reactionary alerts are bandaids. try hard to fix the deeper problem.
  - always have a way to remember what you created an alert for.
    - DOCUMENT.
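
a sketch of the threshold idea plus the "DOCUMENT" rule, assuming a metric where higher is worse. `AlertRule`, its fields, and the disk example are hypothetical; this is not the rule format of any real alerting tool.

```python
from dataclasses import dataclass


@dataclass
class AlertRule:
    metric: str
    warn_at: float   # value at or above this is "warning"
    crit_at: float   # value at or above this is "critical"
    reason: str      # required: why this alert exists

    def classify(self, value: float) -> str:
        """Cut the spectrum into discrete states using the two thresholds."""
        if value >= self.crit_at:
            return "critical"
        if value >= self.warn_at:
            return "warning"
        return "ok"


disk_rule = AlertRule(
    metric="disk.used_percent",
    warn_at=80.0,
    crit_at=95.0,
    reason="a full disk took the service down once; see the incident notes",
)
states = [disk_rule.classify(v) for v in (50.0, 85.0, 95.0)]
```

making `reason` a required field means nobody can create the alert without writing down what it was for.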
profiling
- making stuff better.
- have a target: "X endpoint needs to respond in Y ms at the Zth percentile".
- find out how to measure the target.
- figure out what correlates with the target
  - stare at graphs of stuff you're already collecting
  - make theories about what's happening and figure out how to falsify them
  - collect new data as needed
    - LABEL WELL
    - if there is a standardised way to collect the data you need, USE IT.
    - if there isn't, well, collect it in as ad hoc a way as you like, but write a note in a wiki that you've done that
      - AND LABEL THE DATA INFORMATIVELY.
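
a sketch of checking a target of the form "X endpoint needs to respond in Y ms at the Zth percentile". it uses the nearest-rank definition of a percentile, which is one of several in common use; the latency samples are invented.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


def meets_target(samples_ms, target_ms, p):
    """True if the p-th percentile latency is within the target."""
    return percentile(samples_ms, p) <= target_ms


latencies_ms = [12, 15, 11, 240, 14, 13, 16, 12, 18, 13]
p90 = percentile(latencies_ms, 90)
ok_at_p90 = meets_target(latencies_ms, target_ms=50, p=90)
ok_at_p99 = meets_target(latencies_ms, target_ms=50, p=99)
```

the same data passes at the 90th percentile and fails at the 99th, because one 240 ms outlier only shows up in the tail. picking Z is part of picking the target.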

System metrics

Collectd

- get them to install collectd
- show them what it collects

Graphite

somebody else gotta step in to discuss; pls no practical.

a graphite endpoint they can push to and navigate to the dashboard of?
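
if a push endpoint does get set up, this is roughly what pushing looks like. carbon's plaintext protocol is one line per sample, `<metric path> <value> <timestamp>`, conventionally on TCP port 2003. the metric path and the host passed to `push` are placeholders; the helper names are made up.

```python
import socket
import time


def format_line(path, value, timestamp=None):
    """Build one line of carbon's plaintext protocol: 'path value timestamp\\n'."""
    ts = int(time.time() if timestamp is None else timestamp)
    return f"{path} {value} {ts}\n"


def push(lines, host, port=2003, timeout=5.0):
    """Send already-formatted lines to a carbon listener over TCP."""
    payload = "".join(lines).encode("ascii")
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(payload)


# Formatting needs no network; the path below is a hypothetical naming scheme.
line = format_line("class.demo.temperature", 21.5, timestamp=1700000000)
```

in class, something like `push([line], host="<the class graphite host>")` would send it, assuming the listener is reachable on the default port.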

whisperdb

carbon-relay

carbon-ng? or something; the go rewrite (possibly go-carbon)

Prometheus

wouldn't mind a practical here. prom is smol.

promtsdb

pull-based arch (and the consequences)

exporters

get them to install node exporter
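
to show what "pull-based" means in practice, here is a toy exporter sketch using only the standard library: it serves the current values as text at `/metrics` and waits to be scraped. the metric name, label, value, and port are invented, and label values are not escaped, so this is a teaching aid rather than a substitute for a real client library.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def render_gauge(name, help_text, samples):
    """Render one gauge in the Prometheus text format; samples is a list of (labels, value)."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} gauge"]
    for labels, value in samples:
        label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        series = f"{name}{{{label_str}}}" if label_str else name
        lines.append(f"{series} {value}")
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        # Hypothetical measurement; a real exporter would read it from the system.
        body = render_gauge(
            "demo_temperature_celsius", "Room temperature.", [({"room": "lab"}, 21.5)]
        ).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Block forever, answering scrapes on the given port (assumed free)."""
    HTTPServer(("", port), MetricsHandler).serve_forever()


text = render_gauge("demo_temperature_celsius", "Room temperature.", [({"room": "lab"}, 21.5)])
```

calling `serve()` and opening `/metrics` in a browser shows the same text as `text`. the consequence of pull: the exporter never needs to know where prometheus lives, but prometheus has to be able to reach every exporter.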

prom web endpoint

alertmanager

alerting rules

Grafana

unified graphs and queries frontend. originally mostly for graphite, but it can talk to EVERYTHING, muhahahaha. somewhat heavy webapp tho.

dashboard

panel

panel query editing

useful walkthrough here to discuss how and why to use various display features

Duncan method informed class structure

Core skill

measure a system.
- create or find metrics
- collect metrics
  - install and use metrics collectors
    - collectd
    - what else?
- collate metrics
- ask useful questions of metrics
  - know which questions are worth asking and when
    - taxonomy of questions
  - know how to construct a given question in statspeak