Lecture Notes for Metrics - A Short, Practical Primer
Outline
First Principles
what is a metric
in terms of data
it is a label, a number, and a timestamp. so, a time-indexed key-value pair. other details about the number (units, type, and so on) can be kept alongside for fun sometimes.
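for example, here is one made-up sample in Graphite's plaintext line protocol and one in Prometheus's text exposition format:
    web01.nginx.requests_per_sec 42 1700000000         (dotted path, value, unix timestamp)
    http_requests_total{method="GET",code="200"} 1027  (name plus labels, value; the timestamp is usually implied at scrape time)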
in terms of use
it is a single piece of information you want to have about a "running" "system", i.e. anything that you can measure; the output of the measurement is a metric.
in terms of "why"
well, the scientific method.
debugging something? have a target to meet? need to know how to provision a resource? need to know if you're wasting a resource? metrics to the rescue! find a way to
quantify your targets
measure your world
and you have panacea-like answers to all your questions!
(further reading: How to Measure Anything, Douglas Hubbard)
data challenges
collection and storage
timestamps, how do they work?
storage optimization
use challenges
querying and analysis
cardinality
size
bucketing: the ultimate shitty tradeoff between storage and analysis requirements
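a toy sketch of that tradeoff in Python - the bucket boundaries and the fake latency distribution are made up, but the shape of the problem is real:
    import bisect
    import random

    BOUNDS = [10, 25, 50, 100, 250, 500, 1000]   # hypothetical bucket edges, in ms

    def bucketize(latencies_ms):
        """collapse raw samples into per-bucket counts - this is all we store"""
        counts = [0] * (len(BOUNDS) + 1)          # final slot catches everything over 1000 ms
        for x in latencies_ms:
            counts[bisect.bisect_left(BOUNDS, x)] += 1
        return counts

    raw = [random.expovariate(1 / 80) for _ in range(10_000)]   # fake latencies, mean ~80 ms
    counts = bucketize(raw)

    # storage win: 8 integers instead of 10,000 floats.
    # analysis loss: "what fraction finished within 250 ms?" still works...
    within_250 = sum(counts[:BOUNDS.index(250) + 1]) / sum(counts)
    print(f"within 250 ms: {within_250:.1%}")
    # ...but "what exactly was p99?" does not - the raw samples are gone, so the best
    # you can do is interpolate between whichever two bucket edges p99 falls between.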
motivation challenges
measuring irrelevant things
bad labeling
nothing is more irritating than rewriting, at query time, five labels that ought to have been one label (concrete example after this list).
treating documents and metrics as the same
cardinality makes a difference.
measurement collection and storage are not the same as analysis-artifact caching. Unix principle: small, purposeful tools.
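the labeling gripe above, concretely (names invented):
    login_errors, signup_errors, checkout_errors, search_errors, upload_errors   <- five names you get to stitch together by hand at query time
    errors_total{endpoint="login"}, errors_total{endpoint="signup"}, ...          <- one name, one label, trivial to sum or slice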
use cases:
debugging
how to investigate a problem given a set of metrics?
what ought you always be collecting in case you need to investigate something?
"system" metrics - trying to measure the function of all the critical pieces of an OS
"success" metrics - is the thing I want to happen, happening?
"uptime" usually tracks the history of these
alerting
"success" metrics are the zeroth alert.
"is this working?" tests.
sometimes the answer is on a spectrum, which means you decide on thresholds to divide the spectrum into black and white. (or green, orange, and red, if you like to pretend you care about an orange alert.)
the reactive flow of alert setting
something went drastically wrong
it turns out it was due to a certain "bad behaviour"
let's alert on that bad behaviour so that we catch it the next time it goes wrong
strive to do better
catch frequent offenders
identify first principles
reactive alerts are band-aids. try hard to fix the deeper problem.
always have a way to remember what you created an alert for.
DOCUMENT.
profiling
making stuff better.
have a target: "X endpoint needs to respond in Y ms at the Zth percentile" (see the sketch after this list).
find out how to measure the target.
figure out what correlates with the target
stare at graphs of stuff you're already collecting
make theories about what's happening and figure out how to falsify them
collect new data as needed
LABEL WELL
if there is a standardised way to collect the data you need, USE IT.
if there isn't, well, collect it in as ad hoc a way as you like, but write a note in a wiki that you've done that
AND LABEL THE DATA INFORMATIVELY.
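a toy sketch of checking the "Y ms at the Zth percentile" target against raw samples (the numbers and the fake data are invented):
    import random

    def percentile(samples, p):
        """nearest-rank percentile of a list of samples"""
        ordered = sorted(samples)
        k = max(0, min(len(ordered) - 1, round(p / 100 * len(ordered)) - 1))
        return ordered[k]

    latencies_ms = [random.gauss(120, 40) for _ in range(5_000)]   # pretend: response times for "X endpoint"

    target_ms, target_pct = 200, 95   # "respond within 200 ms at the 95th percentile"
    observed = percentile(latencies_ms, target_pct)
    verdict = "meeting" if observed <= target_ms else "missing"
    print(f"p{target_pct} = {observed:.1f} ms -> {verdict} the target")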
System metrics
Collectd
get them to install collectd; show them what it collects
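roughly what the practical looks like (a sketch of a Debian-ish setup, not a complete config): install the distro package, then the interesting bit of the config is just a list of LoadPlugin lines:
    # /etc/collectd/collectd.conf (excerpt)
    LoadPlugin cpu
    LoadPlugin memory
    LoadPlugin load
    LoadPlugin df
    LoadPlugin interface
if memory serves, the stock Debian config also enables the rrdtool plugin, so there are files under /var/lib/collectd to poke at when showing what actually gets collected.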
Graphite
somebody else gotta step in to discuss; pls no practical.
a graphite endpoint they can push to and navigate to the dashboard of? (see the push sketch after this list)
whisper (the fixed-size on-disk db)
carbon-relay
the Go rewrites: go-carbon (carbon-cache) and carbon-relay-ng (the relay)
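if we do stand up a carbon endpoint for them, pushing a point is one line of the plaintext protocol over TCP; a minimal sketch (host and metric name are placeholders):
    import socket
    import time

    CARBON_HOST, CARBON_PORT = "graphite.example.internal", 2003   # carbon's plaintext listener

    def push(path, value, timestamp=None):
        """send one 'path value timestamp' line to carbon"""
        line = f"{path} {value} {int(timestamp or time.time())}\n"
        with socket.create_connection((CARBON_HOST, CARBON_PORT)) as sock:
            sock.sendall(line.encode("ascii"))

    push("class.demo.random_walk", 42)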
Prometheus
wouldn't mind a practical here. prom is smol.
promtsdb
pull-based arch (and the consequences)
exporters
get them to install node exporter (scrape-config sketch below)
prom web endpoint
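for the node exporter practical, the moving parts: run node_exporter (it serves /metrics on :9100 by default), then point prometheus at it with a scrape config along these lines (job name and target are ours to pick):
    # prometheus.yml (excerpt)
    global:
      scrape_interval: 15s

    scrape_configs:
      - job_name: "node"
        static_configs:
          - targets: ["localhost:9100"]
the prom web endpoint (:9090 by default) then gives us the expression browser and the targets page to confirm the pull is actually happening.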
alertmanager
alerting rules
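a hedged example of a rule file, to make the threshold talk from the alerting section concrete (the metric name and the numbers are invented):
    # rules/demo.yml (excerpt)
    groups:
      - name: demo
        rules:
          - alert: HighErrorRate
            # more than 5% of requests erroring, sustained for 10 minutes
            expr: |
              sum(rate(http_requests_total{code=~"5.."}[5m]))
                / sum(rate(http_requests_total[5m])) > 0.05
            for: 10m
            labels:
              severity: page
            annotations:
              summary: "error rate above 5% for 10 minutes"
              description: "why this alert exists and where the runbook lives - DOCUMENT."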
Grafana
unified graphs and queries frontend
originally mostly for graphite but it can talk to EVERYTHING muhahahaha
somewhat heavy webapp tho
dashboard
panel
panel query editing
useful walkthrough here to discuss how and why to use various display features
Duncan method informed class structure
Core skill
measure a system.
create or find metrics
collect metrics
install and use metrics collectors
collectd
what else?
collate metrics
ask useful questions of metrics
know which questions are worth asking and when
taxonomy of questions
know how to construct a given question in statspeak
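one worked example of "constructing a question in statspeak", assuming Prometheus histograms and PromQL (the metric name is hypothetical): "what was the 95th-percentile request latency over the last 5 minutes?" becomes
    histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))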