Maybe you’ve been here:

You get a phone call in the middle of the night. The new sysadmin (whom you hired straight out of college) is flipping all of her shits because web app performance has degraded beyond the alert threshold. She’s been clicking through page after page of graphs, checking application logs all the way up and down the stack, and just generally cussing up a storm because she can’t find the source of the issue. You open your laptop, navigate straight to overall performance graphs, drill down to database graphs, see a pattern that looks like mutex contention, log in to the database, find the offending queries, and report them to the on-call dev. You do all this in a matter of minutes.

Or here:

You’re trying to teach your dad to play Mario Kart. It’s like “Okay, go forward… no, forward… you have to press the gas – no, that’s fire – press the gas button… it’s the A button… the blue one… Yeah, there you go, okay, you’re going forward now… so… so go around the corner… why’d you stop? Dad… it’s like driving a car, you can’t turn if you’re stopped… so remember, gas is A… which is the blue one…”

Why is it so hard for experts to understand the novice experience? Well, in his book Sources of Power, decision-making researcher Gary Klein presents some really interesting theories about what makes experts experts. His theories give us insight into the communication barriers between novices and experts, which can make us better teachers and better learners.

Mental Simulation

Klein arrived at his decision-making model, the recognition-primed decision model, by interviewing hundreds of experts over several years. According to his research, experts in a huge variety of fields rely on mental simulation. In Sources of Power, he defines mental simulation as:

the ability to imagine people and objects consciously and to transform those people and objects through several transitions, finally picturing them in a different way than at the start.

Klein has never studied sysadmins, but when I read about his model I recognized it immediately. This is what we do when we’re trying to reason out how a problem got started, and it’s also how we figure out how to fix it. In our head, we have a model of the system in which the problem lives. Our model consists of some set of moving parts that go through transitions from one state to another.

If you and your friend are trying to figure out how to get a couch around a corner in your stairwell, your moving parts are the couch, your body, and your friend’s body. If you’re trying to figure out how a database table got corrupted, your moving parts might be the web app, the database’s storage engine, and the file system buffer. You envision a series of transitions from one state to the next. If those transitions don’t get you from the initial state to the final state then you tweak your simulation and try again until you get a solution.

Here’s the thing, though: we’re people. Our brains have a severely limited amount of working memory. In his interviews with experts about their decision making processes, Klein found that there was a pretty hard upper limit on the complexity of our mental simulations:

  • 3 moving parts
  • 6 transitions

That’s about all we get, regardless of our experience or intelligence. So how do experts mentally simulate so much more effectively than novices?

Abstractions

As we gain experience in a domain, we start to see how the pieces fit together. As we notice more and more causal patterns, we build a mental bank of abstractions. An abstraction is a kind of abbreviation that stands in for a set of transitions or moving parts that usually functions as a whole. It’s like the keyboard of a piano: when the piano’s working correctly, we don’t have to think about the Rube Golberg-esque series of yanks and shoves going on inside it; we press a key, and the corresponding note comes out.

Experts have access to a huge mental bank of abstractions. Novices don’t yet. This makes experts more efficient at creating mental simulations.

When you’re first learning to drive a car, you have to do everything step by step. You don’t have the abstraction bank of an experienced driver. When the driving instructor tells you to back out of a parking space, your procedure looks something like this:

  • Make sure foot is on brake pedal
  • Shift into reverse
  • Release brake enough to get rolling
  • Turn steering wheel (which direction is it when I’m in reverse?)
  • Put foot back on brake pedal
  • Shift into drive

It’s a choppy, nerve-racking sequence of individual steps. But once you practice this a dozen times or so, you start to build some useful abstractions. Your procedure for backing out of a parking space becomes more like:

  • Go backward (you no longer think about how you need to break, shift, and release the brake)
  • Get facing the right direction
  • Go forward

Once you’ve done it a hundred times, it’s just one step: “Back out of the parking space.”

Now if you recall that problem solving involves mental simulations with at most 3 moving parts and 6 transitions, you’ll see why abstractions are so critical to the making of an expert. Whereas a novice requires several transitions to represent a process, an expert might only need one. The right choice of abstraction allows the expert to hold a much richer simulation in mind, which improves their effectiveness in predicting outcomes and diagnosing problems.

Counterfactuals

Klein highlights another important difference between experts and novices: experts can readily process counterfactuals: explanations and predictions that are inconsistent with the data. This is how experts are able to improvise in unexpected situations.

Imagine that you’re troubleshooting a spate of improper 403 responses from a web app that you admin. You expect that the permissions on some cache directory got borked in the last deploy, so you log in to one of the web servers and tail the access log to see which requests in particular are generating 403s. But you can’t find a single log entry with a 403 error code! You refresh the app a few times in your browser, and sure enough you get a 403 response. But the log file still shows 200 after 200. What’s going on?

If you were a novice, you might just say “That’s impossible” and throw up your hands. But an experienced sysadmin could imagine any number of plausible scenarios to accommodate this counterfactual:

  • You logged in to staging instead of production
  • The 403s are only coming from one of the web servers, and it’s not the one you logged in to
  • 403s are being generated by the load balancer before the requests ever make it to the web servers
  • What you’re looking at in your browser is actually a 200 response with a body that says “403 Forbidden”

Why are experts able to adjust so fluidly to counterfactuals while novices aren’t?

It comes back to abstractions. When experts see something that doesn’t match expectations, they can easily recognize which abstraction is leaking. They understand what’s going on inside the piano, so when they expect a tink but hear a plunk, they can seamlessly jump to a lower level of abstraction and generate a new mental simulation that explains the discrepancy.

Empathizing with novices

By understanding a little about the relationship between abstractions and expertise, we can teach ourselves to see problems from a novice’s perspective. Rather than getting frustrated and taking over, we can try some different strategies:

  1. Tell stories. When Gary Klein and his research team want to understand an expert’s thought process, they don’t use questionnaires or ask the expert to make a flow chart or anything artificial like that. The most effective way to get inside an expert’s thought process is to listen to their stories. So when you’re teaching a novice how to reason about a system, try thinking of an interesting and surprising troubleshooting experience you’ve had with that system before, and tell that story.
  2. Use the Socratic method. Novices need practice at juggling abstractions and digesting counterfactuals. When a novice is describing their mental model of a problem or a potential path forward, ask a hypothetical question or two and watch the gears turn. Questions like “You saw Q happen because of P, but what are some ways we could’ve gotten to Q without P?” or “You expect that changing A will have an effect on B, but what would it mean if you changed A and there was no effect on B?” will challenge the novice to bounce between different layers of abstraction like an expert does.
  3. Remember: your boss may be a novice. Take a moment to look around your org chart and find the nearest novice; it may be above you. Even if your boss used to do your job, they’re a manager now. They may be rusty at dealing with the abstractions you use every day. When your boss is asking for a situation report or an explanation for some decision you made, keep in mind the power of narratives and counterfactuals.

I’ve been reading a really excellent book on product development called The Principles of Product Development Flow, by Donald G. Reinertsen. It’s a very appealing book to me, because it sort of lays down the theoretical and mathematical case for agile product development. And you know that theory is the tea, earl grey, hot to my Jean-Luc Picard.

But as much as I love this book, I just have to bring up this chart that’s in it:

terrible chart

This is the Hindenburg of charts. I can’t even, and it’s not for lack of trying to even. Being horrified by the awfulness of this chart is left as an exercise for the reader, but don’t hold me responsible if this chart gives you ebola.

But despite the utter contempt I feel for that chart, I think the point it’s trying to make is very interesting. So let’s talk about highways.

Highways!

Highways need to be able to get many many people into the city in the morning and out of the city in the evening. So when civil engineers design highways, one of their main concerns is throughput, measured in cars per hour.

Average throughput can be measured in a very straightforward way. First, you figure out the average speed, in miles per hour, of the cars on the highway. The cars are all traveling different speeds depending on what lane they’re in, how old they are, etc. But you don’t care about all that variation: you just need the average.

The other thing you need to calculate is the density of cars on the highway, measured in cars per mile. You take a given length of highway, and you count how many cars are on it, then you repeat. Ta-da! Average car density.

Then you do some math:

\frac{cars}{hour} = \frac{cars}{mile} \cdot \frac{miles}{hour}

Easy peasy. But let’s think about what that means. Here’s a super interesting graph of average car speed versus average car speed:

speed vs speed

Stay with me. Here’s a graph of average car density versus average car speed:

density vs speed

This makes sense, right? Cars tend to pack together at low speed. That’s called a bumper-to-bumper traffic jam. And when they’re going fast, cars tend to spread out because they need more time to hit the brakes if there’s a problem.

So, going back to our equation, what shape do we get when we multiply a linear equation by another linear equation? That’s right: we get a parabola:

throughput-vs-speed

That right there is the throughput curve for the highway (which in the real world isn’t quite so symmetric, but has roughly the same properties). On the left hand side, throughput is low because traffic is stopped in a bumper-to-bumper jam. On the right hand side, throughput is low too: the cars that are on the highway are able to go very fast, but there aren’t enough of them to raise the throughput appreciably.

So already we can pick up a very important lesson: Faster movement does not equate to higher throughput. Up to a point, faster average car speed improves throughput. Then you get up toward the peak of the parabola and it starts having less and less effect. And then you get past the peak, and throughput actually goes down as you increase speed. Very interesting.

Congestion

Now, looking at that throughput curve, you might be tempted to run your highway at the very top, where the highest throughput is. If you can get cars traveling the right average speed, you can maximize throughput thereby minimizing rush hour duration. Sounds great, right?

Well, not so fast. Suppose you’re operating right at the peak, throughput coming out the wazoo. What happens if a couple more cars get on the highway? The traffic’s denser now, so cars have to slow down to accommodate that density. The average speed is therefore reduced, which means we’re now a bit left of our throughput peak. So throughput has been reduced, but cars are still arriving at the same rate, so that’s gonna increase density some more.

congestion

This is congestion collapse: a sharp, catastrophic drop in throughput that leads to a traffic jam. It can happen in any queueing system where there’s feedback between throughput and processing speed. It’s why traffic jams tend to start and end all at once rather than gradually appearing and disappearing.

The optimal place for a highway to be is just a little to the right of the throughput peak. This doesn’t hurt throughput much because the curve is pretty flat near the top, but it keeps us away from the dangerous positive feedback loop.

So what does all this have to do with product development workflow?

Kanban Boards Are Just Like Highways

On a kanban board, tickets move from left to right as we make progress on them. If we had a kanban board where tickets moved continuously rather than in discrete steps, we could measure the average speed of tickets on our board in inches per day (or, if we were using the metric system, centimeters per kilosecond):

ticket-speedAnd we could also measure the density of tickets just like we measured the density of cars, by dividing the board into inch-wide slices and counting the tickets per inch:

tickets-per-inchLet’s see how seriously we can abuse the analogy between this continuous kanban board and a highway:

  • Tickets arrive in our queue at random intervals, just as cars pull onto a highway at random intervals.
  • Tickets “travel” at random speeds (in inches/day) because we’re never quite sure how long a given task is going to take. This is just like how cars travel at random speeds (in miles per hour)
  • Tickets travel more slowly when there are many tickets to do (because of context switching, interdependencies, blocked resources, etc.), just as cars travel more slowly when they’re packed more densely onto the highway.
  • Tickets travel more quickly when there are fewer tickets to do, just as cars travel more quickly when the road ahead of them is open.

There are similarities enough that we can readily mine traffic engineering patterns for help dealing with ticket queues. We end up with a very familiar throughput curve for our kanban board:

ticket-throughputAnd just like with highway traffic, we run the risk of congestion collapse if we stray too close to the left-hand side of this curve. Since kanban boards generally have a limit on the number of tickets in progress, however, our congestion won’t manifest as a board densely packed with tickets. Rather, it will take the form of very long queues of work waiting to start. This is just as bad: longer queues mean longer wait times for incoming work, and long queues don’t go away without a direct effort to smash them.

What we can learn from real-world queues

A kanban board is a queueing system like any other, and the laws of queueing theory are incredibly general in their applicability. So we can learn a lot about managing ticket throughput by looking at the ways in which other queueing systems have been optimized.

Measure your system’s attributes

First off: you need metrics. Use automation to measure and graph, at the very least,

  • Number of tickets in queue (waiting to start work)
  • Number of tickets in progress
  • Number of tickets completed per day (or week)

Productivity metrics smell bad to a lot of people, and I think that’s because they’re so often used by incompetent managers as “proof” that employees could be pulling more weight. But these metrics can be invaluable if you understand the theory that they fit into. You can’t improve without measuring.

Control occupancy to sustain throughput

As we’ve seen, when there are too many tickets in the system, completion times suffer in a self-reinforcing way. If we’re to avoid that, we need to control the number of tickets not just in progress, but occupying the system as a whole. This includes queued tickets.

In some cities (Minneapolis and Los Angeles, for example), highway occupancy is controlled during rush hour by traffic lights on the on-ramp. The light flashes green to let a single car at a time onto the highway, and the frequency at which that happens can be tuned to the current density of traffic. It’s a safeguard against an abrupt increase in density shoving throughput over the peak into congestion collapse.

But how can we control the total occupancy of our ticketing system when tickets arrive at random?

Don’t let long queues linger

If you’re monitoring your queue length, you’ll be able to see when there’s a sharp spike in incoming tickets. When that happens, you need to address it immediately.

For every item in a queue, the average wait time for all work in the system goes up. Very long queues cause very long wait times. And long queues don’t go away by themselves: if tickets are arriving at random intervals, then a long queue is just as likely to grow as it is to shrink.

One way to address a long queue is to provision a bit more capacity as soon as you see the queue forming. Think about supermarkets. When checkout lines are getting a bit too long, the manager will staff one or two extra lanes. All it takes is enough capacity to get the queues back down to normal – it’s not necessary to eliminate them altogether – and then those employees can leave the register and go back to whatever they were doing before.

The other way to address a long queue is to slash requirements. When you see a long queue of tickets forming, spend some time going through it and asking questions like

  • Can this ticket be assigned to a different team?
  • Can this feature go into a later release?
  • Are there any duplicates?
  • Can we get increased efficiency by merging any of these tickets into one? (e.g. through automation or reduced context switching)

If you can shave down your queue by eliminating some unnecessary work, your system’s wait times will improve and the required capacity to mop up the queue will be lower.

Provide forecasts of expected wait time

At Disney World, they tell you how long the wait will be for each ride. Do you think they do this because it’s a fun little bit of data? Of course not. It helps them break the feedback loop of congestion.

When the wait for Space Mountain is 15 minutes, you don’t think twice. But when the wait is an hour, you might say to yourself “Eh, maybe I’ll go get lunch now and see if the line’s shorter later.” So these wait time forecasts are a very elegant way to cut down on incoming traffic when occupancy is high. Just like those traffic lights on highway on-ramps.

Why not use Little’s law to make your own forecasts of expected ticket wait time? If you’re tracking the occupancy of your system (including queued tickets) and the average processing rate (in tickets completed per day), it’s just:

\text{Average Wait Time} = \frac{\text{Occupancy}}{\text{Average Processing Rate}}

If you display this forecast in a public place, like the home page for your JIRA project, people will think twice when they’re about to submit a ticket. They might say to themselves “If it’s gonna take that many days, I might as well do this task myself” or “The information I’m asking for won’t be useful a week from now, so I guess there’s no point filing this ticket.”

Forecasts like this allow you to shed incoming ticket load when queues are high without having to tell stakeholders “no.”

Queues are everywhere

If you learn a little bit about queueing theory, you’ll see queues everywhere you look. It’s a great lens for solving problems and understanding the world. Try it out.

 

 

 

 

It’s a great idea to track your MTTR (Mean Time To Recover) as an operational metric. MTTR is defined as the average interval between onset of a failure and recovery from that failure. We acknowledge that failures are part of the game, so we want our organization to be good at responding quickly to them. It’s intuitive that we’d want our MTTR to trend down.

This is one of those places where our intuition can be misleading.

MTTR is an average over incidents of incident duration. That means that the total amount of downtime gets denominatored out. Consider these two brothers who run different websites:

  • Achenar’s site only had 1 outage in September, and it lasted 60 minutes.
  • Sirius’s site had 120 outages in September, lasting 20 minutes each.

Sirius had 40 times as much downtime as Achenar in the month of September. Sirius’s MTTR, however, was 1/3 that of Achenar: 20 minutes rather than 60 minutes.

Lowering your MTTR is a good strategy in certain situations. But you need to make sure it’s the right strategy. If you don’t look at the whole picture, things like nuisance alarms and insufficient automation can be confounded with the meaning of your MTTR. If you fix a whole bunch of meaningless alerts that always recover quickly without intervention (you know the type), your MTTR goes up!

mttr_vs_incident_count

 

MTTR is useful to track, and it can be useful for decision-making. Just remember: our goal is to minimize downtime and noise, not MTTR. If the path of least resistance to lower downtime and a stronger signal is to respond to incidents quicker, then MTTR is your best friend. But that’s not always true.

I often meet with skepticism when I say that server monitoring systems should only page when a service stops doing its work. It’s one of the suggestions I made in my Smoke Alarms & Car Alarms talk at Monitorama this year. I don’t page on high CPU usage, or rapidly-growing RAM usage, or anything like that. Skeptics usually ask some variation on:

If you only alert on things that are already broken, won’t you miss opportunities to fix things before they break?

The answer is a clear and unapologetic yes! Sometimes that will happen.

It’s easy to be certain that a service is down: just check whether its work is still getting done. It’s even pretty easy to detect a performance degradation, as long as you have clearly defined what constitutes acceptable performance. But it’s orders of magnitude more difficult to reliably predict that a service will go down soon without human intervention.

We like to operate our systems at the very edge of their capacity. This is true not only in tech, but in all sectors: from medicine to energy to transportation. And it makes sense: we bought a certain amount of capacity: why would we waste any? But a side effect of this insatiable lust for capacity is that it makes the line between working and not working extremely subtle. As Mark Burgess points out in his thought-provoking In Search of Certainty, this is a consequence of nonlinear dynamics (or “chaos theory“), and our systems are vulnerable to it as long as we operate them so close to an unstable region.

But we really really want to predict failures! It’s tempting to try and develop increasingly complex models of our nonlinear systems, aiming for perfect failure prediction. Unfortunately, since these systems are almost always operating under an unpredictable workload, we end up having to couple these models tightly to our implementation: number of threads, number of servers, network link speed, JVM heap size, and so on.

This is just like overfitting a regression in statistics: it may work incredibly well for the conditions that you sampled to build your model, but it will fail as soon as new conditions are introduced. In short, predictive models for nonlinear systems are fragile. So fragile that they’re not worth the effort to build.

Instead of trying to buck the unbuckable (which is a bucking waste of time), we should seek to capture every failure and let our system learn from it. We should make systems that are aware of their own performance and the status of their own monitors. That way we can build feedback loops and self-healing into them: a strategy that won’t crumble when the implementation or the workload takes a sharp left.

 

People make mistakes, and that’s fine.

We’ve come a long way in recognizing that humans are integral to our systems, and that human error is an unavoidable reality in these systems. There’s a lot of talk these days about the necessity of blameless postmortems (for example, this talk from Devops Days Brisbane and this blog post by Etsy’s Daniel Schauenberg), and I think that’s great.

The discussion around blamelessness usually focuses on the need to recognize human interaction as a part of a complex system, rather than external force. Like any system component, an operator is constrained to manipulate and examine the system in certain predefined ways. The gaps between the system’s access points inevitably leave blind spots. When you think about failures this way, it’s easier to move past human errors to a more productive analysis of those blind spots.

That’s all well and good, but I want to give a mathematical justification for blamelessness in postmortems. It follows directly from one of the fundamental theorems of probability.

Bayes’ Theorem

Some time in the mid 1700s, the British statistician and minister Thomas Bayes came up with a very useful tool. So useful, in fact, that Sir Harold Jeffreys said Bayes Theorem “is to the theory of probability what Pythagoras’s theorem is to geometry.”

Bayes’ Theorem lets us perform calculations on what are called conditional probabilities. A conditional probability is the probability of some event (we’ll call it E) given that another event (which we’ll call D) has occurred. Such a conditional probability would be written

P(E|D)

The “pipe” character between ‘E’ and ‘D’ is pronounced “given that.”

For example, let’s suppose that there’s a 10% probability of a traffic jam on your drive to work. We would write that as

P(\{\mbox{traffic jam}\}) = 0.1

That’s just a regular old probability. But suppose we also know that, when there’s an accident on the highway, the probability of a traffic jam quadruples up to 40%. We would write that number, the probability of a traffic jam given that there’s been an accident, using the conditional probability notation:

P(\{\mbox{traffic jam}\}|\{\mbox{accident}\}) = 0.4

Bayes’ Theorem lets us use conditional probabilities like these to calculate other conditional probabilities. It goes like this:

P(A|B) = \frac{P(B|A)P(A)}{P(B)}

In our traffic example, we can use Bayes’ Theorem to calculate the probability that there’s been an accident given that there’s a traffic jam. Informally, you could call this the percentage of traffic jams for which car accidents are responsible. Assuming we know that the probability of a car accident on any given morning is 1 in 5, or 0.2, we can just plug the events in which we’re interested into Bayes’ Theorem:

P(\{\mbox{accident}\}|\{\mbox{traffic jam}\}) = \frac{P(\{\mbox{traffic jam}\}|\{\mbox{accident}\})P(\{\mbox{accident}\})}{P(\{\mbox{traffic jam}\})}

P(\{\mbox{accident}\}|\{\mbox{traffic jam}\}) = \frac{0.4 \cdot 0.2}{0.1}

P(\{\mbox{accident}\}|\{\mbox{traffic jam}\}) = 0.8 = 80\%

Mistakes

Bayes’ Theorem can give us insight into the usefulness of assigning blame for engineering decisions that go awry. If you take away all the empathy- and complexity-based arguments for blameless postmortems, you still have a pretty solid reason to look past human error to find the root cause of a problem.

Engineers spend their days making a series of decisions, most of which are right but some of which are wrong. We can gauge the effectiveness of an engineer by the number N of decisions she makes in a given day and the probability P(M) that any given one of those decisions is a mistake.

Suppose we have a two-person engineering team — Eleanor and Liz — who work according to these parameters:

  • Eleanor: Makes 120 decisions per day. Each decision has a 1-in-20 chance of being a mistake.
  • Liz: Makes 30 decisions per day. Each decision has a 1-in-6 chance of being a mistake.

Eleanor is a better engineer both in terms of the quality and the quantity of her work. If one of these engineers needs to shape up, it’s clearly Liz. Without Eleanor, 4/5 of the product wouldn’t exist.

But if the manager of this team is in the habit of punishing engineers when their mistakes cause a visible problem (for example, by doing blameful postmortems), we’ll get a very different idea of the team’s distribution of skill. We can use Bayes’ Theorem to see this.

The system that Eleanor and Liz have built is running constantly, and let’s assume that it’s exercising all pieces of its functionality equally. That is, at any time, the system is equally likely to be executing any given behavior that’s been designed into it. (Sure, most systems strongly favor certain execution paths over others, but bear with me.)

Well, 120 out of 150 of the system’s design decisions were made by Eleanor, so there’s an 80% chance that the system is exercising Eleanor’s functionality. The other 20% of the time, it’s exercising Liz’s. So if a bug crops up, what’s the probability that Eleanor designed the buggy component? Let’s define event A as “Eleanor made the decision” and event B as “The decision is wrong”. Bayes’ theorem tells us that

P(A|B) = \frac{P(B|A)P(A)}{P(B)}

which, in the context of our example, lets us calculate the probability P(A|B) that Eleanor made a particular decision given that the decision was wrong (i.e. contains a bug). We already know two of the quantities in this formula:

  • P(B|A) reads as “the probability that a particular decision is wrong given that Eleanor made that decision.” Referring to Eleanor’s engineering skills above, we know that this probability is 1 in 20.
  • P(A) is the probability that Eleanor made a particular decision in the system design. Since Eleanor makes 120 decisions a day and Liz only makes 30, P(A) is 120/150, or 4 in 5.

The last unknown variable in the formula is P(B): the probability that a given design decision is wrong. But we can calculate this too; I’ll let you figure out how. The answer is 11 in 150.

Now that we have all the numbers, let’s plug them in to our formula:

P(A|B) = \frac{[1/20] \cdot [4/5]}{[11/150]}

P(A|B) = 0.545 = 54.5\%

In other words, Eleanor’s decisions are responsible for 54.5% of the mistakes in the system design. Remember, this is despite the fact that Eleanor is clearly the better engineer! She’s both faster and more thorough than Liz.

So think about how backwards it would be to go chastising Eleanor for every bug of hers that got found. Blameful postmortems don’t even make sense from a purely probabilistic angle.

But but but…

Could Eleanor be a better engineer by making fewer mistakes? Of course. The point here is that she’s already objectively better at engineering than Liz, yet she ends up being responsible for more bugs than Liz.

Isn’t this a grotesquely oversimplified model of how engineering works? Of course. But there’s nothing to stop us from generalizing it to account for things like the interactions between design decisions, variations in the severity of mistakes, and so on. The basic idea would hold up.

Couldn’t Eleanor make even fewer mistakes if she slowed down? Probably. That’s a tradeoff that Eleanor and her manager could discuss. But the potential benefit isn’t so cut-and-dry. If Eleanor slowed down to 60 decisions per day with a mistake rate of 1:40, then

  • fewer features (40% fewer!) would be making it out the door, and
  • a higher proportion of the design decisions would now be made by bug-prone Liz, such that the proportion of decisions that are bad would only be reduced very slightly — from 7.33% to 7.22%

So if, for some crazy reason, you’re still stopping your postmortems at “Eleanor made a mistake,” stop it. You could be punishing engineers for being more effective.

I just spent 2 days at Devops Days Minneapolis, the first Devops Days in the Midwest. My brain is still whirling. I was brainstorming while I wrote this; forgive me if it goes astray.

The organizers of the conference (too many to name here) did a super admirable job of building a diverse, empathy-oriented conference while keeping the spotlight on tech. It was really a wonderful conference.

What really strikes me about the fascinating people I’ve met over the last two days is the diversity of their backgrounds and passions. The tech industry attracts all sorts of fascinating people, from philosophy PhDs to roadies who used to tour with Pantera. Our field has become a real melting pot of viewpoints, and that’s a beautiful thing.

However, we’re all doing engineering now. Most of us are doing engineering without any formal engineering training. And I firmly believe that any engineer will be much more effective with a few basic statistical tools in their back pocket.

I proposed an Open Space on this topic, asking the question “Why are so many engineers in tech reluctant to learn any math, despite knowing how useful math can be?” The session attracted a bunch of very smart people, and I came away with many new insights.

We teach how to math, but not why to math

When I give talks to software and ops engineers about math, they eat it right up. They have a strong intuitive sense that math is good and useful and powerful.

On the other hand, when I talk to individuals in tech about math, I’m often shocked at how quickly they shut down. They say things like “Oh, I’m no good at math,” or “I don’t have time to do math,” or “There’s no use for math in what I do.” It smacks of C.P. Snow’s observations on the Two Cultures, but these are people in a career that most humans would place solidly on the analytical side of that intellectual chasm. But chatting at Devops Days reminded me of a really important fact that can help explain this phenomenon.

Math is a part of everyone’s education, but statistics is not.

That’s absurd! Statistics in the single most widely applicable branch of mathematics. You can apply it directly to your life with practical, observable benefits no matter who you are or what you do. Instead, we learn algebra, geometry, trigonometry, and calculus — wonderful, rich disciplines to be sure — but without statistics, you can’t blame students for thinking that math is too abstract and academic to be useful in their lives.

So now, when people inexperienced in math hear “You should learn some math,” they think of impenetrable equations winding across boards, or they draw on “Good Will Hunting” and “A Beautiful Mind” and imagine that they’re being asked to do novel number theory research.

This is one of the biggest misconceptions we need to break. We need to spread the message that statistics is universally applicable, practical, and not very hard to learn with modern tools.

Staring at a blank wall

Local mathematician and Rubyist Kaisa Taipale makes a good point: when you’re new to computers, you worry a lot about making mistakes. The pictures on the screen doesn’t map onto a logical framework, you don’t know the right questions to ask, and as Pagerduty’s David Shackelford puts it, you can’t push past the blankness. Being new to stats is very similar.

In our session, we discussed two ways to remove the hurdles in front of stats-curious engineers. First of all, it’s critical that we make data gathering as trivial as possible. It’s easy to tell someone to look at the data, but when the data’s in the form of million-line Apache logfiles that need to be massaged and crunched and filtered with a custom script, that person will be frustrated and angry before they even start their analysis! At my company, I’ve written scripts to pull data out of Graphite and Logstash into R, but I haven’t been very forthcoming or solicitous about those scripts. I’ll change that.

The other major hurdle to understanding data is the academic undertone of the existing tools. I love me some R, but I’ll be the first to admit that it’s not the friendliest piece of software for someone who’s newly curious about data analysis. We need to spread knowledge of some easier and more accessible stats tools, like Tableau and Statwing, and maybe come up with some open source alternatives.

Once you push past the feeling that you need a PhD to even ask the right questions about data, stats learning becomes a positive feedback loop that benefits everybody in the organization. Let’s keep the discussion going and make sure that the potential data addicts in our midst don’t get discouraged.

You can teach the thought process

Statistical inference, and math in general, can seem like a gift. People think “I just don’t get it. I don’t have the spark for math.”

On the contrary, math is a set of skills that you can learn. They can be taught. In fact, Ian Malpass teaches a course at Etsy called “Graphite Bootcamp,” where interested engineers can learn what an integral means (“it just means adding stuff up,” he says) or how to use time series smoothing techniques.

Math is a tool that can take you from a bunch of opaque, independently useless metrics to a meaningful number that gives you real insight into the system. That’s exactly the mindset that Ian teaches in his class. This is a fantastic idea, and I would love to see other companies copy Ian’s idea and build upon it.

My point

My point here is: a bit of statistics is easy to learn, useful in almost any situation, and beneficial to the level of discourse and decision-making in any organization. If there are barriers stopping engineers from learning it, then those barriers should be attacked furiously until they’re gone. Math still leaves room for arguments over interpretation, but when we’re arguing about data we’re much much more likely to make the right decision.

Monitorama PDX was a fantastic conference. Lots of engineers, lots of diversity, not a lot of bullshit. Jason Dixon and his minions are some top-notch conference-runners.

As someone who loves to see math everywhere, I was absolutely psyched at the mathiness of the talks at Monitorama 2014. I mean, damn. Here, watch my favorite talk: Noah Kantrowitz’s.

I studied physics in college, and I worked in computational research, so the Fourier Transform was a huge deal for me. In his talk, Noah gives some really interesting takes on the application of digital signal processing techniques to ops. I came home inspired by this talk and immediately started trying my hand at this stuff.

What Fourier analysis is

“FOUR-ee-ay” or if you want to be French about it “FOO-ree-ay” with a hint of phlegm on the “r”.

Fourier analysis is used in all sorts of fields that study waves. I learned about it when I was studying physics in college, but it’s most notably popular in digital signal processing.

It’s a thing you can do to waves. Let’s take sound waves, for instance. When you digitize a sound wave, you _sample_ it at a certain frequency: some number of times every second, you write down the strength of the wave. For sound, this sampling frequency is usually 44100 Hz (44,100 times a second).

Why 44100 Hz? Well, the highest pitch that can be heard by the human ear is around 20000 Hz, and you can only reconstitute frequencies from a digital signal at up to half the sampling rate. We don’t need to capture frequencies we can’t hear, so we don’t need to sample any faster than twice our top hearing frequency.

Now what Fourier analysis lets you do is look at a given wave and determine the frequencies that were superimposed to create it. Take the example of a pure middle C (this is R code):

library(ggplot2)
signal = data.frame(t=seq(0, .1, 1/44100))
signal$a = cos(signal$t * 261.625)
qplot(data=signal, t, a, geom='line', color=I('blue'), size=I(2)) +
    ylab('Amplitude') + xlab('Time (seconds)') + ggtitle('Middle C')

Image

Pass this through a Fourier transform and you get:

fourier = data.frame(v=fft(signal$a))
# The first value of fourier$v will be a (zero) DC
# component we don't care about:
fourier = tail(fourier, nrow(fourier) - 1)
# Also, a Fourier transform contains imaginary values, which
# I'm going to ignore for the sake of this example:
fourier$v = Re(fourier$v)
# These are the frequencies represented by each value in the
# Fourier transform:
fourier$f = seq(10, 44100, 10)
# And anything over half our sampling frequency is gonna be
# garbage:
fourier = subset(fourier, f <= 22050)

qplot(data=fourier, f, v, geom='line', color=I('red'), size=I(2)) +
    ylab('Fourier Transform') + xlab('Frequency (Hz)') +
    ggtitle('Middle C Fourier Transform') + coord_cartesian(xlim=c(0,400))

Image

As you can see, there’s a spike at 261.625 Hz, which is the frequency of middle C. Why does it gradually go up, and then go negative and asymptotically come back to 0? That has to do with windowing, but let’s not worry about it. It’s an artifact of this being a numerical approximation of a Fourier Transform, rather than an analytical solution.

You can do Fourier analysis with equal success on a composition of frequencies, like a chord. Here’s a C7 chord, which consists of four notes:

signal$a = cos(signal$t * 261.625 * 2 * pi) + # C
    cos(signal$t * 329.63 * 2 * pi) + # E
    cos(signal$t * 392.00 * 2 * pi) + # G
    cos(signal$t * 466.16 * 2 * pi) + # B flat
qplot(data=signal, t, a, geom='line', color=I('blue'), size=I(2)) +
    ylab('Amplitude') + xlab('Time (seconds)') + ggtitle('C7 chord')

Image

Looking at that mess, you probably wouldn’t guess that it was a C7 chord. You probably wouldn’t even guess that it’s composed of exactly four pure tones. But Fourier analysis makes this very clear:

fourier = data.frame(v=fft(signal$a))
fourier = tail(fourier, nrow(fourier) - 1)
fourier$v = Re(fourier$v)
fourier$f = seq(10, 44100, 10)
fourier = subset(fourier, f <= 22050)

qplot(data=fourier, f, v, geom='line', color=I('red'), size=I(2)) +
    ylab('Fourier Transform') +
    xlab('Frequency (Hz)') +
    ggtitle('Middle C Fourier Transform') +
    coord_cartesian(xlim=c(0,400))

Image

And there are our four peaks, right at the frequencies of the four notes in a C7 chord!

Straight-up Fourier analysis on server metrics

Naturally, when I heard all these Monitorama speakers mention Fourier transforms, I got super psyched. It’s an extremely versatile technique, and I was sure that I was about to get some amazing results.

It’s been kinda disappointing.

By default, a Graphite server samples your metrics (in a manner of speaking) once every 10 seconds. That’s a sampling frequency of 0.1 Hz. So we have a Nyquist Frequency (the maximum frequency at which we can resolve signals with a Fourier transform) of half that: 0.005 Hz.

So, if our goal is to look at a Fourier transform and see important information jump out at us, we have to be looking for oscillations that occur three times a minute or less. I don’t know about you, but I find that outages and performance anomalies rarely show up as oscillations like that. And when they do, you’re going to notice them before you do a Fourier transform.

Usually we get spikes or step functions instead, which bleed into wide ranges of nearby frequencies and end up being much clearer in amplitude-space than in Fourier space. Take this example of some shit hitting some fans:

Image

If we were trying to get information from this metric with Fourier transforms, we’d be interested in the Fourier transform before and after the fan got shitty. But those transforms are much less useful than the amplitude-space data:

Image

I haven’t been able to find the value in automatically applying Fourier transforms to server metrics. It’s a good technique for finding oscillating components of a messy signal, but unless you know that that’s what you’re looking for, I don’t think you’ll get much else out of them.

What about low-pass filters?

A low-pass filter uses a Fourier transform to remove high frequency components from a signal. One of my favorite takeaways from that Noah Kantrowitz talk was this: Nagios’s flapping detection mechanism is a low-pass filter.

If you want to alert when a threshold is exceeded — but not every time your metric goes above and below that threshold in a short period of time — you can run your metric through a low-pass filter. The high-frequency, less-valuable data will go away, and you’ll be left with a more stable signal to check against your threshold.

I haven’t tried this method of flap detection, but I suspect that the low-sampling-frequency problem would make it significantly less useful than one might hope. If you’ve seen Fourier analysis applied as a flap detection algorithm, I’d love to see it. I would eat my words, and they’d be delicious.

I hope I’m wrong

If somebody can show me a useful application of Fourier analysis to server monitoring, I will freak out with happiness. I love the concept. But until I see a concrete example of Fourier analysis doing something that couldn’t be done effectively with a much simpler algorithm, I’m skeptical.

Addendum

As Abe Stanway points out, Fourier analysis is a great tool to have in your modeling toolbox. It excels at finding seasonal (meaning periodic) oscillations in your data. Also, Abe and the Skyline team are working on adding seasonality detection to Skyline, which might use Fourier analysis to determine whether seasonal components should be used.

Theo Schlossnagle coyly suggests that Circonus uses Fourier analysis in a similar manner.

Follow

Get every new post delivered to your Inbox.