Lecture Notes for Metrics - A Short, Practical Primer
ID: 804671d3-3ec4-4b86-814e-9b15c80f275e REVIEW_SCORE: 0.0 MTIME: [2024-12-25 Wed 15:54]
1. Outline
1.1. First Principles
1.1.1. what is a metric
- in terms of data
it is a label, a number and a timestamp. so, a time-indexed kv pair. some other details about the number can be kept for fun sometimes.
- in terms of use
it is a single piece of information you want to have about a "running" "system", i.e. anything that you can measure: the output of the measurement is a metric.
- in terms of "why"
well, the scientific method.
debugging something? have a target to meet? need to know how to provision a resource? need to know if you're wasting a resource? metrics to the rescue! find a way to a) quantify your targets and b) measure your world,
and you have a panacea for all your questions!
(further reading: How to Measure Anything, Douglas Hubbard)
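a minimal sketch of the "label, number, timestamp" view of a metric from the start of this section. the names `Sample` and `measure` are made up for illustration and do not come from any particular metrics library.

```python
from dataclasses import dataclass
import time


@dataclass(frozen=True)
class Sample:
    """One measurement: a time-indexed key/value pair."""
    label: str        # what was measured, e.g. "web01.cpu.load"
    value: float      # the number
    timestamp: float  # seconds since the Unix epoch


def measure(label, value, now=None):
    """Wrap a raw measurement; `now` can be passed in to make tests repeatable."""
    ts = time.time() if now is None else now
    return Sample(label, float(value), ts)


sample = measure("web01.cpu.load", 0.42, now=1700000000)
print(sample)
```

everything else (units, extra tags, aggregation hints) is the "other details kept for fun" and can be layered on top of these three fields.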
1.1.2. data challenges
- collection and storage
- timestamps, how do they work
- storage optimization
1.1.3. use challenges
- querying and analysis
- cardinality
- size
- bucketing: the ultimate shitty tradeoff between storage and analysis requirements
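to make the bucketing tradeoff concrete, here is a small sketch: raw latencies get folded into a few counters. the bucket bounds and sample values are invented for illustration, and `bucket_counts` is a hypothetical helper, not an API from any metrics tool.

```python
import bisect


def bucket_counts(values, upper_bounds):
    """Count values per bucket; bucket i holds values <= upper_bounds[i].

    The final extra bucket catches everything above the last bound.
    `upper_bounds` must be sorted in ascending order.
    """
    counts = [0] * (len(upper_bounds) + 1)
    for v in values:
        counts[bisect.bisect_left(upper_bounds, v)] += 1
    return counts


latencies_ms = [12, 48, 95, 100, 130, 240, 260, 900]
bounds_ms = [50, 100, 250]
print(bucket_counts(latencies_ms, bounds_ms))
```

storage wins: four counters no matter how many requests arrive. analysis loses: once bucketed, a 260 ms request and a 900 ms request are indistinguishable, so the bounds have to be chosen with the questions you expect to ask in mind.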
1.1.4. motivation challenges
- measuring irrelevant things
- bad labeling
- nothing is more irritating than rewriting five labels that ought to have been one label at query time.
- treating documents and metrics as the same
- cardinality makes a difference.
- measurement collection and storage are not the same as analysis-artifact caching. Unix principle: small, purposeful tools.
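a rough sketch of why cardinality makes a difference: the worst-case number of distinct series is the product of the number of values each label can take. the label names and values below are invented for illustration.

```python
from math import prod


def series_count(label_values):
    """Upper bound on distinct series: product of per-label value counts."""
    return prod(len(values) for values in label_values.values())


labels = {
    "host": ["web01", "web02", "web03"],
    "endpoint": ["/", "/login", "/search", "/checkout"],
    "status": ["200", "301", "404", "500", "503"],
}
print(series_count(labels))  # metric-sized: a few dozen series

# Adding a document-like label (one value per user) multiplies everything.
labels["user_id"] = range(10_000)
print(series_count(labels))
```

this is an upper bound, since not every combination shows up in practice, but it is the number a metrics store has to be prepared for. per-user or per-request detail belongs in a document or log store instead.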
use cases:
- debugging
- how to investigate a problem given a set of metrics?
- what ought you be always collecting in case you need to investigate something?
- "system" metrics - trying to measure the function of all the critical pieces of an OS
- "success" metrics - is the thing I want to happen, happening?
- "uptime" usually tracks the history of these
- alerting
- "success" metrics are the zeroth alert.
- "is this working?" tests.
- sometimes the answer is on a spectrum, which means you decide on thresholds to divide the spectrum into black and white. (or green, orange, and red, if you like to pretend you care about an orange alert.)
- the reactionary flow of alert setting
- something went drastically wrong
- it turns out it was due to a certain "bad behaviour"
- let's alert on that bad behaviour so that we catch it the next time it goes wrong
- strive to do better
- catch frequent offenders
- identify first principles
- reactionary alerts are bandaids. try hard to fix the deeper problem.
- always have a way to remember what you created an alert for.
- DOCUMENT.
- profiling
- making stuff better.
- have a target: "X endpoint needs to respond in Y ms at the Zth percentile".
- find out how to measure the target.
- figure out what correlates with the target
- stare at graphs of stuff you're already collecting
- make theories about what's happening and figure out how to falsify them
- collect new data as needed
- LABEL WELL
- if there is a standardised way to collect the data you need, USE IT.
- if there isn't, well, collect it in as ad hoc a way as you like, but write a note in a wiki that you've done that
- AND LABEL THE DATA INFORMATIVELY.
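a sketch of the "X endpoint needs to respond in Y ms at the Zth percentile" style of target from the use cases above, plus the threshold idea from the alerting notes. it assumes the nearest-rank definition of a percentile (other definitions interpolate and give slightly different numbers); the sample data, thresholds, and the names `percentile` and `status` are invented for illustration.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    if not values:
        raise ValueError("need at least one sample")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def status(observed, warn_at, crit_at):
    """Cut a spectrum into discrete alert states using two thresholds."""
    if observed >= crit_at:
        return "crit"
    if observed >= warn_at:
        return "warn"
    return "ok"


response_ms = [80, 95, 110, 120, 130, 150, 170, 200, 450, 900]
p90 = percentile(response_ms, 90)
print(p90, status(p90, warn_at=300, crit_at=500))
```

note how the mean of this data would hide the slow tail; stating the target as a percentile is what makes the tail visible.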
1.2. System metrics
1.3. Collectd
- get them to install collectd
- show them what it collects
1.4. Graphite
somebody else gotta step in to discuss; pls no practical.
a graphite endpoint they can push to and navigate to the dashboard of?
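if a push endpoint does get set up, a sketch like the following might be enough for the class. it uses carbon's plaintext protocol, which is one line per data point in the form "path value timestamp", by default on TCP port 2003. the metric path and the host name in the usage note are placeholders, not a real endpoint.

```python
import socket
import time


def graphite_line(path, value, timestamp=None):
    """Format one data point for carbon's plaintext protocol."""
    ts = int(time.time() if timestamp is None else timestamp)
    return f"{path} {value} {ts}\n"


def push(lines, host, port=2003, timeout=5.0):
    """Send pre-formatted lines to a carbon plaintext listener over TCP."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall("".join(lines).encode("ascii"))


line = graphite_line("class.demo.cpu.load", 0.42, timestamp=1700000000)
print(line, end="")
```

to actually send it: `push([line], host="graphite.example.invalid")`, with the host swapped for whatever endpoint the class is given.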
1.4.1. whisperdb
1.4.2. carbon-relay
1.4.3. go-carbon (the Go rewrite of carbon)
1.5. Prometheus
wouldn't mind a practical here. prom is smol.
1.5.1. promtsdb
1.5.3. prom web endpoint
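for the practical, it may help to show what a scrape target actually serves. the sketch below hand-renders the Prometheus text exposition format (HELP and TYPE comment lines, then one line per labelled sample) without the official client library. the metric name and values are invented, `exposition` is a hypothetical helper, and escaping of special characters in label values is deliberately left out.

```python
def exposition(name, help_text, metric_type, samples):
    """Render one metric family in the Prometheus text exposition format.

    `samples` is a list of (labels_dict, value) pairs.
    Label values are assumed to need no escaping (no quotes, backslashes, newlines).
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} {metric_type}"]
    for labels, value in samples:
        if labels:
            rendered = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
            lines.append(f"{name}{{{rendered}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


print(exposition(
    "http_requests_total",
    "Total HTTP requests.",
    "counter",
    [({"method": "get", "code": "200"}, 1027), ({"method": "post", "code": "500"}, 3)],
), end="")
```

in a real service the official client library would generate this text and serve it over HTTP; the point here is only that the format is plain text a human can read with curl.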
1.6. Grafana
- unified graphs and queries frontend
- originally mostly for graphite, but it can talk to EVERYTHING muhahahaha
- somewhat heavy webapp tho
1.6.1. dashboard
1.6.2. panel
1.6.3. panel query editing
useful walkthrough here to discuss how and why to use various display features
2. Duncan method informed class structure
2.1. Core skill
- measure a system.
- create or find metrics
- collect metrics
- install and use metrics collectors
- collectd
- what else?
- collate metrics
- ask useful questions of metrics
- know which questions are worth asking and when
- taxonomy of questions
- know how to construct a given question in statspeak