People make mistakes, and that’s fine.

We’ve come a long way in recognizing that humans are integral to our systems, and that human error is an unavoidable reality in these systems. There’s a lot of talk these days about the necessity of blameless postmortems (for example, this talk from Devops Days Brisbane and this blog post by Etsy’s Daniel Schauenberg), and I think that’s great.

The discussion around blamelessness usually focuses on the need to recognize human interaction as a part of a complex system, rather than as an external force. Like any system component, an operator is constrained to manipulate and examine the system in certain predefined ways. The gaps between the system’s access points inevitably leave blind spots. When you think about failures this way, it’s easier to move past human errors to a more productive analysis of those blind spots.

That’s all well and good, but I want to give a mathematical justification for blamelessness in postmortems. It follows directly from one of the fundamental theorems of probability.

Bayes’ Theorem

Some time in the mid 1700s, the British statistician and minister Thomas Bayes came up with a very useful tool. So useful, in fact, that Sir Harold Jeffreys said Bayes’ Theorem “is to the theory of probability what Pythagoras’s theorem is to geometry.”

Bayes’ Theorem lets us perform calculations on what are called conditional probabilities. A conditional probability is the probability of some event (we’ll call it E) given that another event (which we’ll call D) has occurred. Such a conditional probability would be written

P(E|D)

The “pipe” character between ‘E’ and ‘D’ is pronounced “given that.”

For example, let’s suppose that there’s a 10% probability of a traffic jam on your drive to work. We would write that as

P(\{\mbox{traffic jam}\}) = 0.1

That’s just a regular old probability. But suppose we also know that, when there’s an accident on the highway, the probability of a traffic jam quadruples up to 40%. We would write that number, the probability of a traffic jam given that there’s been an accident, using the conditional probability notation:

P(\{\mbox{traffic jam}\}|\{\mbox{accident}\}) = 0.4

Bayes’ Theorem lets us use conditional probabilities like these to calculate other conditional probabilities. It goes like this:

P(A|B) = \frac{P(B|A)P(A)}{P(B)}

In our traffic example, we can use Bayes’ Theorem to calculate the probability that there’s been an accident given that there’s a traffic jam. Informally, you could call this the percentage of traffic jams for which car accidents are responsible. Assuming we know that the probability of a car accident on any given morning is 1 in 5, or 0.2, we can just plug the events in which we’re interested into Bayes’ Theorem:

P(\{\mbox{accident}\}|\{\mbox{traffic jam}\}) = \frac{P(\{\mbox{traffic jam}\}|\{\mbox{accident}\})P(\{\mbox{accident}\})}{P(\{\mbox{traffic jam}\})}

P(\{\mbox{accident}\}|\{\mbox{traffic jam}\}) = \frac{0.4 \cdot 0.2}{0.1}

P(\{\mbox{accident}\}|\{\mbox{traffic jam}\}) = 0.8 = 80\%
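
If you’d rather let a computer do the arithmetic, here’s the same calculation as a quick R sketch (the numbers are just the ones from above):

p_jam      <- 0.1    # P(traffic jam)
p_accident <- 0.2    # P(accident)
p_jam_given_accident <- 0.4    # P(traffic jam | accident)

# Bayes' Theorem: P(accident | traffic jam)
p_accident_given_jam <- p_jam_given_accident * p_accident / p_jam
p_accident_given_jam    # 0.8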

Mistakes

Bayes’ Theorem can give us insight into the usefulness of assigning blame for engineering decisions that go awry. If you take away all the empathy- and complexity-based arguments for blameless postmortems, you still have a pretty solid reason to look past human error to find the root cause of a problem.

Engineers spend their days making a series of decisions, most of which are right but some of which are wrong. We can gauge the effectiveness of an engineer by the number N of decisions she makes in a given day and the probability P(M) that any given one of those decisions is a mistake.

Suppose we have a two-person engineering team — Eleanor and Liz — who work according to these parameters:

  • Eleanor: Makes 120 decisions per day. Each decision has a 1-in-20 chance of being a mistake.
  • Liz: Makes 30 decisions per day. Each decision has a 1-in-6 chance of being a mistake.

Eleanor is a better engineer both in terms of the quality and the quantity of her work. If one of these engineers needs to shape up, it’s clearly Liz. Without Eleanor, 4/5 of the product wouldn’t exist.

But if the manager of this team is in the habit of punishing engineers when their mistakes cause a visible problem (for example, by doing blameful postmortems), we’ll get a very different idea of the team’s distribution of skill. We can use Bayes’ Theorem to see this.

The system that Eleanor and Liz have built is running constantly, and let’s assume that it’s exercising all pieces of its functionality equally. That is, at any time, the system is equally likely to be executing any given behavior that’s been designed into it. (Sure, most systems strongly favor certain execution paths over others, but bear with me.)

Well, 120 out of 150 of the system’s design decisions were made by Eleanor, so there’s an 80% chance that the system is exercising Eleanor’s functionality. The other 20% of the time, it’s exercising Liz’s. So if a bug crops up, what’s the probability that Eleanor designed the buggy component? Let’s define event A as “Eleanor made the decision” and event B as “The decision is wrong”. Bayes’ theorem tells us that

P(A|B) = \frac{P(B|A)P(A)}{P(B)}

which, in the context of our example, lets us calculate the probability P(A|B) that Eleanor made a particular decision given that the decision was wrong (i.e. contains a bug). We already know two of the quantities in this formula:

  • P(B|A) reads as “the probability that a particular decision is wrong given that Eleanor made that decision.” Referring to Eleanor’s engineering skills above, we know that this probability is 1 in 20.
  • P(A) is the probability that Eleanor made a particular decision in the system design. Since Eleanor makes 120 decisions a day and Liz only makes 30, P(A) is 120/150, or 4 in 5.

The last unknown variable in the formula is P(B): the probability that a given design decision is wrong. But we can calculate this too; I’ll let you figure out how. The answer is 11 in 150.
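
If you want to check your work: every decision is made by either Eleanor or Liz, so the law of total probability gives

P(B) = P(B|A)P(A) + P(B|\neg A)P(\neg A) = \frac{1}{20}\cdot\frac{4}{5} + \frac{1}{6}\cdot\frac{1}{5} = \frac{11}{150}

or about 7.3% of all decisions.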

Now that we have all the numbers, let’s plug them in to our formula:

P(A|B) = \frac{[1/20] \cdot [4/5]}{[11/150]}

P(A|B) = 0.545 = 54.5\%

In other words, Eleanor’s decisions are responsible for 54.5% of the mistakes in the system design. Remember, this is despite the fact that Eleanor is clearly the better engineer! She’s both faster and more thorough than Liz.

So think about how backwards it would be to go chastising Eleanor for every bug of hers that got found. Blameful postmortems don’t even make sense from a purely probabilistic angle.

But but but…

Could Eleanor be a better engineer by making fewer mistakes? Of course. The point here is that she’s already objectively better at engineering than Liz, yet she ends up being responsible for more bugs than Liz.

Isn’t this a grotesquely oversimplified model of how engineering works? Of course. But there’s nothing to stop us from generalizing it to account for things like the interactions between design decisions, variations in the severity of mistakes, and so on. The basic idea would hold up.

Couldn’t Eleanor make even fewer mistakes if she slowed down? Probably. That’s a tradeoff that Eleanor and her manager could discuss. But the potential benefit isn’t so cut-and-dry. If Eleanor slowed down to 60 decisions per day with a mistake rate of 1:40, then

  • fewer features (40% fewer!) would be making it out the door, and
  • a higher proportion of the design decisions would now be made by bug-prone Liz, such that the proportion of decisions that are bad would only be reduced very slightly — from 7.33% to 7.22%
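
If you’d like to fiddle with these numbers yourself, here’s the whole example as a little R sketch (the variable names are mine, nothing official):

# Decisions per day and per-decision mistake probabilities
eleanor_n <- 120; eleanor_p <- 1/20
liz_n     <- 30;  liz_p     <- 1/6

p_a <- eleanor_n / (eleanor_n + liz_n)          # P(A)   = 4/5
p_b <- (eleanor_n * eleanor_p + liz_n * liz_p) /
       (eleanor_n + liz_n)                      # P(B)   = 11/150
eleanor_p * p_a / p_b                           # P(A|B) = 0.545...

# The "Eleanor slows down" scenario from the bullets above:
eleanor_n <- 60; eleanor_p <- 1/40
(eleanor_n * eleanor_p + liz_n * liz_p) / (eleanor_n + liz_n)   # 0.0722...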

So if, for some crazy reason, you’re still stopping your postmortems at “Eleanor made a mistake,” stop it. You could be punishing engineers for being more effective.

I just spent 2 days at Devops Days Minneapolis, the first Devops Days in the Midwest. My brain is still whirling. I was brainstorming while I wrote this; forgive me if it goes astray.

The organizers of the conference (too many to name here) did a super admirable job of building a diverse, empathy-oriented conference while keeping the spotlight on tech. It was really a wonderful conference.

What really strikes me about the people I’ve met over the last two days is the diversity of their backgrounds and passions. The tech industry attracts all sorts of fascinating people, from philosophy PhDs to roadies who used to tour with Pantera. Our field has become a real melting pot of viewpoints, and that’s a beautiful thing.

However, we’re all doing engineering now. Most of us are doing engineering without any formal engineering training. And I firmly believe that any engineer will be much more effective with a few basic statistical tools in their back pocket.

I proposed an Open Space on this topic, asking the question “Why are so many engineers in tech reluctant to learn any math, despite knowing how useful math can be?” The session attracted a bunch of very smart people, and I came away with many new insights.

We teach how to math, but not why to math

When I give talks to software and ops engineers about math, they eat it right up. They have a strong intuitive sense that math is good and useful and powerful.

On the other hand, when I talk to individuals in tech about math, I’m often shocked at how quickly they shut down. They say things like “Oh, I’m no good at math,” or “I don’t have time to do math,” or “There’s no use for math in what I do.” It smacks of C.P. Snow’s observations on the Two Cultures, but these are people in a career that most humans would place solidly on the analytical side of that intellectual chasm. But chatting at Devops Days reminded me of a really important fact that can help explain this phenomenon.

Math is a part of everyone’s education, but statistics is not.

That’s absurd! Statistics is the single most widely applicable branch of mathematics. You can apply it directly to your life with practical, observable benefits no matter who you are or what you do. Instead, we learn algebra, geometry, trigonometry, and calculus — wonderful, rich disciplines to be sure — but without statistics, you can’t blame students for thinking that math is too abstract and academic to be useful in their lives.

So now, when people inexperienced in math hear “You should learn some math,” they think of impenetrable equations winding across boards, or they draw on “Good Will Hunting” and “A Beautiful Mind” and imagine that they’re being asked to do novel number theory research.

This is one of the biggest misconceptions we need to break. We need to spread the message that statistics is universally applicable, practical, and not very hard to learn with modern tools.

Staring at a blank wall

Local mathematician and Rubyist Kaisa Taipale makes a good point: when you’re new to computers, you worry a lot about making mistakes. The pictures on the screen don’t map onto a logical framework, you don’t know the right questions to ask, and as PagerDuty’s David Shackelford puts it, you can’t push past the blankness. Being new to stats is very similar.

In our session, we discussed two ways to remove the hurdles in front of stats-curious engineers. First of all, it’s critical that we make data gathering as trivial as possible. It’s easy to tell someone to look at the data, but when the data’s in the form of million-line Apache logfiles that need to be massaged and crunched and filtered with a custom script, that person will be frustrated and angry before they even start their analysis! At my company, I’ve written scripts to pull data out of Graphite and Logstash into R, but I haven’t been very forthcoming or solicitous about those scripts. I’ll change that.
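
In the meantime, here’s roughly what such a pull can look like. This is just a sketch, and the host and metric names are made up, but Graphite’s render API does all the heavy lifting:

library(jsonlite)

# Hypothetical Graphite host and metric name; point these at your own server.
url <- paste0('http://graphite.example.com/render',
              '?target=stats.timers.portals.response_time.upper_90',
              '&from=-24h&format=json')
raw <- fromJSON(url)

# Each target comes back as a matrix of (value, epoch-seconds) pairs
points <- as.data.frame(raw$datapoints[[1]])
names(points) <- c('value', 'epoch')
points$time <- as.POSIXct(points$epoch, origin='1970-01-01')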

The other major hurdle to understanding data is the academic undertone of the existing tools. I love me some R, but I’ll be the first to admit that it’s not the friendliest piece of software for someone who’s newly curious about data analysis. We need to spread knowledge of some easier and more accessible stats tools, like Tableau and Statwing, and maybe come up with some open source alternatives.

Once you push past the feeling that you need a PhD to even ask the right questions about data, stats learning becomes a positive feedback loop that benefits everybody in the organization. Let’s keep the discussion going and make sure that the potential data addicts in our midst don’t get discouraged.

You can teach the thought process

Statistical inference, and math in general, can seem like a gift: either you’re born with it or you’re not. People think “I just don’t get it. I don’t have the spark for math.”

On the contrary, math is a set of skills that you can learn. They can be taught. In fact, Ian Malpass teaches a course at Etsy called “Graphite Bootcamp,” where interested engineers can learn what an integral means (“it just means adding stuff up,” he says) or how to use time series smoothing techniques.
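
To give a flavor of how little code that kind of lesson needs, here’s a toy sketch of one smoothing technique, an exponentially weighted moving average, applied to a made-up metric (my sketch, not Ian’s curriculum):

# A made-up noisy metric
metric <- 50 + cumsum(rnorm(100))

# Exponentially weighted moving average: each point blends the newest sample
# with the previous smoothed value
alpha <- 0.2
smoothed <- Reduce(function(prev, x) alpha * x + (1 - alpha) * prev,
                   metric, accumulate = TRUE)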

Math is a tool that can take you from a bunch of opaque, independently useless metrics to a meaningful number that gives you real insight into the system. That’s exactly the mindset that Ian teaches in his class. This is a fantastic idea, and I would love to see other companies copy Ian’s idea and build upon it.

My point

My point here is: a bit of statistics is easy to learn, useful in almost any situation, and beneficial to the level of discourse and decision-making in any organization. If there are barriers stopping engineers from learning it, then those barriers should be attacked furiously until they’re gone. Math still leaves room for arguments over interpretation, but when we’re arguing about data we’re much much more likely to make the right decision.

Monitorama PDX was a fantastic conference. Lots of engineers, lots of diversity, not a lot of bullshit. Jason Dixon and his minions are some top-notch conference-runners.

As someone who loves to see math everywhere, I was absolutely psyched at the mathiness of the talks at Monitorama 2014. I mean, damn. Here, watch my favorite talk: Noah Kantrowitz’s.

I studied physics in college, and I worked in computational research, so the Fourier Transform was a huge deal for me. In his talk, Noah gives some really interesting takes on the application of digital signal processing techniques to ops. I came home inspired by this talk and immediately started trying my hand at this stuff.

What Fourier analysis is

“FOUR-ee-ay” or if you want to be French about it “FOO-ree-ay” with a hint of phlegm on the “r”.

Fourier analysis is used in all sorts of fields that study waves. I learned about it when I was studying physics in college, but it’s most notably popular in digital signal processing.

It’s a thing you can do to waves. Let’s take sound waves, for instance. When you digitize a sound wave, you _sample_ it at a certain frequency: some number of times every second, you write down the strength of the wave. For sound, this sampling frequency is usually 44100 Hz (44,100 times a second).

Why 44100 Hz? Well, the highest pitch that can be heard by the human ear is around 20000 Hz, and you can only reconstitute frequencies from a digital signal at up to half the sampling rate. We don’t need to capture frequencies we can’t hear, so we don’t need to sample any faster than twice our top hearing frequency.

Now what Fourier analysis lets you do is look at a given wave and determine the frequencies that were superimposed to create it. Take the example of a pure middle C (this is R code):

library(ggplot2)
signal = data.frame(t=seq(0, .1, 1/44100))
signal$a = cos(signal$t * 261.625 * 2 * pi)
qplot(data=signal, t, a, geom='line', color=I('blue'), size=I(2)) +
    ylab('Amplitude') + xlab('Time (seconds)') + ggtitle('Middle C')

[Figure: the middle C waveform]

Pass this through a Fourier transform and you get:

fourier = data.frame(v=fft(signal$a))
# The first value of fourier$v will be a (zero) DC
# component we don't care about:
fourier = tail(fourier, nrow(fourier) - 1)
# Also, a Fourier transform contains imaginary values, which
# I'm going to ignore for the sake of this example:
fourier$v = Re(fourier$v)
# These are the frequencies represented by each value in the
# Fourier transform:
fourier$f = seq(10, 44100, 10)
# And anything over half our sampling frequency is gonna be
# garbage:
fourier = subset(fourier, f <= 22050)

qplot(data=fourier, f, v, geom='line', color=I('red'), size=I(2)) +
    ylab('Fourier Transform') + xlab('Frequency (Hz)') +
    ggtitle('Middle C Fourier Transform') + coord_cartesian(xlim=c(0,400))

[Figure: the Fourier transform of middle C]

As you can see, there’s a spike at 261.625 Hz, which is the frequency of middle C. Why does it gradually go up, and then go negative and asymptotically come back to 0? That has to do with windowing, but let’s not worry about it. It’s an artifact of this being a numerical approximation of a Fourier Transform, rather than an analytical solution.
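
If that artifact ever bothers you, the usual trick is to taper the signal with a window function before transforming it. A quick sketch with a Hann window:

# Taper the ends of the signal toward zero before the FFT;
# this reduces the leakage around the peak
n <- nrow(signal)
hann <- 0.5 * (1 - cos(2 * pi * (0:(n - 1)) / (n - 1)))
fourier_windowed <- fft(signal$a * hann)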

You can do Fourier analysis with equal success on a composition of frequencies, like a chord. Here’s a C7 chord, which consists of four notes:

signal$a = cos(signal$t * 261.625 * 2 * pi) + # C
    cos(signal$t * 329.63 * 2 * pi) + # E
    cos(signal$t * 392.00 * 2 * pi) + # G
    cos(signal$t * 466.16 * 2 * pi) # B flat
qplot(data=signal, t, a, geom='line', color=I('blue'), size=I(2)) +
    ylab('Amplitude') + xlab('Time (seconds)') + ggtitle('C7 chord')

[Figure: the C7 chord waveform]

Looking at that mess, you probably wouldn’t guess that it was a C7 chord. You probably wouldn’t even guess that it’s composed of exactly four pure tones. But Fourier analysis makes this very clear:

fourier = data.frame(v=fft(signal$a))
fourier = tail(fourier, nrow(fourier) - 1)
fourier$v = Re(fourier$v)
fourier$f = seq(10, 44100, 10)
fourier = subset(fourier, f <= 22050)

qplot(data=fourier, f, v, geom='line', color=I('red'), size=I(2)) +
    ylab('Fourier Transform') +
    xlab('Frequency (Hz)') +
    ggtitle('C7 Chord Fourier Transform') +
    coord_cartesian(xlim=c(0,400))

[Figure: the Fourier transform of the C7 chord]

And there are our four peaks, right at the frequencies of the four notes in a C7 chord!

Straight-up Fourier analysis on server metrics

Naturally, when I heard all these Monitorama speakers mention Fourier transforms, I got super psyched. It’s an extremely versatile technique, and I was sure that I was about to get some amazing results.

It’s been kinda disappointing.

By default, a Graphite server samples your metrics (in a manner of speaking) once every 10 seconds. That’s a sampling frequency of 0.1 Hz. So we have a Nyquist Frequency (the maximum frequency at which we can resolve signals with a Fourier transform) of half that: 0.05 Hz.
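
In R terms, that works out to:

sample_interval <- 10          # Graphite keeps one point every 10 seconds
fs <- 1 / sample_interval      # sampling frequency: 0.1 Hz
nyquist <- fs / 2              # 0.05 Hz, i.e. one full cycle every 20 seconds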

So, if our goal is to look at a Fourier transform and see important information jump out at us, we have to be looking for oscillations that occur three times a minute or less. I don’t know about you, but I find that outages and performance anomalies rarely show up as oscillations like that. And when they do, you’re going to notice them before you do a Fourier transform.

Usually we get spikes or step functions instead, which bleed into wide ranges of nearby frequencies and end up being much clearer in amplitude-space than in Fourier space. Take this example of some shit hitting some fans:

[Figure: a server metric during an incident]

If we were trying to get information from this metric with Fourier transforms, we’d be interested in the Fourier transform before and after the fan got shitty. But those transforms are much less useful than the amplitude-space data:

[Figure: the metric’s Fourier transforms before and after the incident]

I haven’t been able to find the value in automatically applying Fourier transforms to server metrics. It’s a good technique for finding oscillating components of a messy signal, but unless you know that that’s what you’re looking for, I don’t think you’ll get much else out of them.

What about low-pass filters?

A low-pass filter removes the high-frequency components from a signal; one way to build one is with a Fourier transform. One of my favorite takeaways from that Noah Kantrowitz talk was this: Nagios’s flapping detection mechanism is a low-pass filter.

If you want to alert when a threshold is exceeded — but not every time your metric goes above and below that threshold in a short period of time — you can run your metric through a low-pass filter. The high-frequency, less-valuable data will go away, and you’ll be left with a more stable signal to check against your threshold.
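
Here’s a crude sketch of the idea in R. It uses a plain trailing moving average rather than a real filter design, and it’s certainly not Nagios’s actual implementation, but it shows the shape of the thing: a metric that flaps around a threshold of 1 stops tripping the alert once it’s smoothed, while a real excursion still does.

# A metric that flaps between 0.8 and 1.1 around a threshold of 1,
# then genuinely climbs to 1.4 and stays there
flappy <- c(rep(c(0.8, 1.1), 20), rep(1.4, 20))

# Smooth it with a 10-sample trailing moving average (a crude low-pass filter)
smoothed <- stats::filter(flappy, rep(1/10, 10), sides = 1)

# Alert on the smoothed signal instead of the raw one
alert <- !is.na(smoothed) & smoothed > 1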

I haven’t tried this method of flap detection, but I suspect that the low-sampling-frequency problem would make it significantly less useful than one might hope. If you’ve seen Fourier analysis applied as a flap detection algorithm, I’d love to see it. I would eat my words, and they’d be delicious.

I hope I’m wrong

If somebody can show me a useful application of Fourier analysis to server monitoring, I will freak out with happiness. I love the concept. But until I see a concrete example of Fourier analysis doing something that couldn’t be done effectively with a much simpler algorithm, I’m skeptical.

Addendum

As Abe Stanway points out, Fourier analysis is a great tool to have in your modeling toolbox. It excels at finding seasonal (meaning periodic) oscillations in your data. Also, Abe and the Skyline team are working on adding seasonality detection to Skyline, which might use Fourier analysis to determine whether seasonal components should be used.

Theo Schlossnagle coyly suggests that Circonus uses Fourier analysis in a similar manner.

I just returned to Minneapolis from Velocity NY bursting with ideas as always. The program was saturated with fantastic speakers, like my new ops crush Ilya Grigorik of Google. And my favorite part, as always, was the hallway track. I met dozens of brilliant, inspiring engineers. Allspaw, Souders, and Nash really know how to throw a conference.

One exhilarating thing about Velocity is the focus on culture as a driving force for business. Everybody’s in introspection mode, ready to break down their organizational culture and work new ideas into it. It reminds me of artificial competence in genetic engineering: cells coaxed into a state where they’ll take up new DNA from their surroundings. It’s a joy to experience.

But despite all this wonderful cultural introspection, y’know what word you don’t hear? Y’know what drags up awkward silences and sometimes downright reactionary vitriol?

Feminism.

As long as we’re putting our tech culture under the microscope, why don’t we swap in a feminist lens? If you question any random geek at Velocity about gender, you can bet they’ll say “Women are just as good as men at ops,” or “I work with a female engineer, and she’s really smart!” But as soon as you say “feminism,” the barriers go up. It’s like packet loss: the crowd only hears part of what you’re saying, and they assume that there’s nothing else to hear.

We need to build feminism into our organizations, and Velocity would be a great venue for that. I’m just one engineer, and I’m not by any means a feminism expert, but I do think I can shed some light on the most common wrongnesses uttered by engineers when feminism is placed on the table.

Feminism != “Girls are better than boys”

Mention feminism to a random engineer, and you’re likely to hear some variation on:

I’m against all bias! We don’t need feminism, we just need to treat each other equally.

Feminism is often portrayed as the belief that women are superior, or that men should be punished for the inequality they’ve created. Feminism is often portrayed as man-hating.

Feminism is not that. Everyone defines it differently, but I like the definition at the Geek Feminism Wiki:

Feminism is a movement which seeks respect and equality for women both under law and culturally.

Equality. Everyone who’s not an asshole wants it, but we don’t have it yet. That’s why we need a framework in which to analyze our shortcomings, conscious and unconscious. Feminism can be that framework.

Imagine hearing an engineer say this:

Our product should perform optimally! We don’t need metrics, we just need to build a system that performs well.

Would this not be face-palmingly absurd? Of course it would. Metrics let you define your goals, demonstrate the value of your goals, and check how well you’re doing. Metrics show you where you’re losing milliseconds. Metrics are the compass and map with which you navigate the dungeon of performance.

Feminism is to equality as metrics are to performance. Without a framework for self-examination, all the best intentions in the world won’t get you any closer to an equality culture.

Wanting equality isn’t enough

When feminism comes up, you might hear yourself say something like this:

I already treat female engineers equally. Good engineers are good engineers, no matter their gender.

Hey great! The intention to treat others equally is a necessary condition for a culture of equality. But it’s not a sufficient condition.

This is akin to saying:

I’m really into performance, so our site is as fast as it can be.

You might be a performance juggernaut, but you’re just one engineer. You’re responsible for one cross-section of the product. First of all, one person doesn’t constitute a self-improving or even a self-sustaining performance culture. And even more crucially, there are performance mistakes you don’t even know you’re making!

Promoting equality in your organization requires a cultural shift, just like promoting performance. Cultural shifts happen through discourse and introspection and goal-setting — not wishing. That’s why we need to look to feminism.

If you start actively working to attack inequality in your organization, I guarantee you’ll realize you were already a feminist.

Feminism doesn’t require you to be ashamed of yourself

When your heart’s in the right place and you’re constantly examining your own actions and your organization’s, you start to notice bias and prejudice in more and more places. Most disturbingly, you notice it in yourself.

Biases are baked right into ourselves and our culture. They’re so deeply ingrained that we often don’t see or hear them anymore. Think anti-patterns and the broken windows theory. When we do notice our biases, it’s horrifying. We feel ashamed and we want to sweep them under the rug.

Seth Walker of Etsy gave an excellent talk at Velocity NY entitled “A Public Commitment to Performance.” It’s about how, rather than keeping their performance shortcomings private until everything’s fixed, Etsy makes public blog posts detailing their current performance challenges and recent performance improvements. This way, everyone at the company knows that there will be public eyes on any performance enhancement they make. It promotes a culture of excitement about improvements, rather than one of shame about failures.

When you notice biases in your organization — and moreover when others notice them — don’t hide them. Talk about them, analyze them, and figure out how to fix them. That’s the productive thing to do with software bugs and performance bottlenecks, so why not inequality?

Where to go from here

I’m kind of a feminism noob, but that won’t stop me from exploring it and talking about it. It shouldn’t stop you either. Geek Feminism is a good jumping-off point if you want to learn about feminism, and they also have a blog. @OnlyGirlInTech is a good Twitter account. I know there’s other stuff out there, so if you’ve got something relevant,  jam it in the comment section!

EDIT on 2013-10-21: Here are some links provided in the comments by Alexis Finch (thanks, Alexis Finch!)

Ada Initiative – focused on OpenSource, working to create allies as well as support women directly
http://adainitiative.org/what-we-do/workshops-and-training/

Girls Who Code – working with high school girls to teach them the skills and provide inspiration to join the tech fields
http://www.girlswhocode.com/

LadyBits – adding women’s voices to the media, covering tech and science [w/ a few men writing as well]
https://medium.com/ladybits-on-medium

Reductress – satire addressing the absurdity of women’s portrayal in the media [The Onion, feminized]
http://www.reductress.com/top-five-lip-glosses-paid-tell/

WomenWhoCode & LadiesWhoCode & PyLadies – if you want to find an expert engineer who happens to also be of the female persuasion [to speak at a conference, or to join your team] these are places to find seasoned tech folks, as well as for those new to tech to get started learning, with chapters worldwide.
http://www.meetup.com/Women-Who-Code-SF/ & https://twitter.com/WomenWhoCode
http://www.ladieswhocode.com/ & https://twitter.com/ladieswhocode
http://www.pyladies.com/ https://twitter.com/pyladies

Lately we’ve been getting concerned about our PHP error log. You know the story: errors start popping up, but they’re not causing problems in production, so you move on with your busy life. But you know in your heart of hearts that you should really be fixing the error.

The time has come for us to prune those errors, and I thought the first step should be, as always, to look at the data. Since it’s really the PHP developers who will know what to do with it, I thought it might be useful to make my analysis interactive. Enter Shiny: a web app framework that lets users interact directly with your data.

The first step was to massage my log data into a CSV that looks like this:

"date","error.id","error.count","access.count"
"2013-06-04","inc/foo/mario/journey.php:700",5,308733
"2013-06-04","inc/foo/mario/xenu.php:498",1,308733
"2013-06-04","inc/bar/mario/larp.php:363",14,308733
"2013-06-04","inc/nico.php:1859",3,308733
"2013-06-04","inc/spoot/heehaw.php:728",5,308733
"2013-06-04","inc/spoot/heehaw.php:735",5,308733
"2013-06-04","inc/spoot/heehaw.php:736",5,308733
"2013-06-04","inc/spoot/heehaw.php:737",5,308733
"2013-06-04","inc/spoot/heehaw.php:739",5,308733

For each date, error.id indicates the file and line on which the error occurred, error.count is how many times that error occurred on that date, and access.count is the total number of hits our app received on that date. With me so far?

Now I install Shiny (sure, this is artifice — I already had Shiny installed — but let’s pretend) at the R console like so:

install.packages('devtools')
library(devtools)
install_github('shiny', 'rstudio')
library(shiny)

And from the shell, I start a project:

mkdir portalserr
cd portalserr
cp /tmp/portalserr.csv .

Defining the UI

Now I write my app. I know what I want it to look like, so I’ll start with ui.R. Going through that bit by bit:

shinyUI(pageWithSidebar(
  headerPanel("PHP errors by time"),

I’m telling Shiny how to lay out my UI. I want a sidebar with form controls, and a header that describes the app.

  sidebarPanel(
    checkboxGroupInput("errors_shown", "Most common errors:", c(
      "davidbowie.php:50"="lib/exosite/robot/davidbowie.php:50",
      "heehaw.php:728"="inc/spoot/heehaw.php:728",
      …
      "llamas-10.php:84"="inc/widgets/llamas-10.php:84"
    )
  )),

Now we put a bunch of checkboxes on my sidebar. The first argument to checkboxGroupInput() gives the checkbox group a name. This is how server.R will refer to the checkbox contents. You’ll see.

The second argument is a label for the form control, and the last argument is a list (in non-R parlance an associative array or a hash) defining the checkboxes themselves. The keys (like davidbowie.php:50) will be the labels visible in the browser, and the values are the strings that server.R will receive when the corresponding box is checked.

  mainPanel(
    plotOutput("freqPlot")
  )
))

We’re finished describing the sidebar, so now we describe the main section of the page. It will contain only one thing: a plot called “freqPlot”.

And that’s it for the UI! But it needs something to talk to.

Defining the server

The server goes in — surprise — server.R. Let’s walk through that.

library(shiny)
library(ggplot2)

logfreq <- read.csv('portalserr.csv')
logfreq$date <- as.POSIXct(logfreq$date)
logfreq$perthou <- logfreq$error.count / logfreq$access.count * 10^3

We load the CSV into a data frame called logfreq and translate all the strings in the date column into POSIXct objects so that they’ll plot right.

Then we generate the perthou column, which contains the number of occurrences of a given error on a given day, per thousand requests that occurred that day.

shinyServer(function(input, output) {
  output$freqPlot <- renderPlot({

Okay now we start to see the magic that makes Shiny so easy to use: reactivity. We start declaring the server application with shinyServer(), to which we pass a callback. That callback will be passed the input and output parameters.

input is a list-like object containing the values of all the inputs we defined in ui.R. Whenever the user messes with those checkboxes, the reactive blocks (what does that mean? I’ll tell you in a bit) of our callback will be re-run, and the names of any checked boxes will be in input$errors_shown.

Similarly, output is where you put the stuff you want to send back to the UI, like freqPlot.

But the coolest part of this excerpt is the last bit: renderPlot({. That curly-bracket there means that what follows is an expression: a literal block of R code that can be evaluated later. Shiny uses expressions in a very clever way: it determines which expressions depend on which input elements, and when the user messes with inputs Shiny reevaluates only the expressions that depend on the inputs that were touched! That way, if you have a complicated analysis that can be broken down into independent subroutines, you don’t have to re-run the whole thing every time a single parameter changes.

     lf.filtered <- subset(logfreq, error.id %in% input$errors_shown)

      p <- ggplot(lf.filtered) +
        geom_point(aes(date, perthou, color=error.id), size=3) +
        geom_line(aes(date, perthou, color=error.id, group=error.id), size=2) +
        expand_limits(ymin=0) +
        theme(legend.position='left') +
        ggtitle('Errors per thousand requests') +
        ylab('Errors per thousand requests') +
        xlab('Date')
      print(p)
  })
})

This logic will be reevaluated every time our checkboxes are touched. It filters the logfreq data frame down to just the errors whose boxes are checked, then makes a plot with ggplot2 and sends it to the UI.
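
One more trick worth knowing, though this little app doesn’t strictly need it: if the filtering step ever became expensive, you could pull it into its own reactive() expression. Shiny would then recompute it only when the checkboxes change, and any number of outputs could share the cached result. A sketch:

filteredData <- reactive({
  subset(logfreq, error.id %in% input$errors_shown)
})

output$freqPlot <- renderPlot({
  lf.filtered <- filteredData()
  # ...build and print the ggplot exactly as above...
})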

And we’re done.

Running it

From the R console, we do this:

> runApp('/path/to/portalserr')

Listening on port 3087

This automatically opens up http://localhost:3087 in a browser and presents us with our shiny new… uh… Shiny app:

Why don’t we do it in production?

Running Shiny apps straight from the R console is fine for sharing them around the office, but if you need a more robust production environment for Shiny apps (e.g. if you want to share them with the whole company or with the public), you’ll probably want to use shiny-server. If you’re putting your app behind an SSL-enabled proxy server, use the latest HEAD from Github since it contains this change.

Go forth and visualize!

“Stop Expecting That.”

When you use a program a lot, you start to notice its quirks. If you’re a programmer yourself, you start to develop theories about why the quirks exist, and how you’d fix them if you had the time or the source. If you’re not a programmer, you just shrug and work around the quirks.

I review about 400 virtual flash cards a day in studying for Jeopardy, so I’ve really started to pick up on the quirks of the flash card software I use. One quirk in particular really bothered me: the documentation, along with the first-tier support team, claims that when cards come up for review they will be presented in a random order. But I’ve noticed that, far from being truly random, the program presents cards in bunches of 50: old cards in the first bunch, then newer and newer bunches of cards. By the time I get to my last 50 cards of the day, they’re all less than 2 weeks old.

So I submitted a bug report, complete with scatterplot demonstrating this clear pattern. I explained “I would expect the cards to be shuffled evenly, but that doesn’t appear to be the case.” And do you know what the lead developer of the project told me?

“Stop expecting that.”

Not in so many words, of course, but there you have it. The problem was not in the software; it was in my expectations.

It’s a common reaction among software developers. We think “Look, that’s just the way it works. I understand why it works that way and I can explain it to you. So, you see, it’s not really a bug.” And as frustrating as this attitude is, I can’t say I’m immune to it myself. I’m in ops, so the users of my software are usually highly technical. I can easily make them understand why a weird thing keeps happening, and they can figure out how to work around the quirk. But the “stop expecting that” attitude is wrong, and it hurts everyone’s productivity, and it makes software worse. We have to consciously reject it.

Quirks are bugs.

A bug is when the program doesn’t work the way the programmer expects.

A quirk is when the program doesn’t work the way the user expects.

What’s the difference, really? Especially in the open-source world, where every user is a potential developer, and all your developers are users?

Quirks and bugs can both be worked around, but a workaround requires the user to learn arbitrary procedures which aren’t directly useful, and which aren’t connected in any meaningful way to his mental model of the software.

Quirks and bugs both make software less useful. They make users less productive. Neglected, they necessitate a sort of oral tradition — not dissimilar from superstition — in which users pass the proper set of incantations from generation to generation. Software shouldn’t be like that.

Quirks and bugs both drive users away.

Why should we treat them differently?

Stop “Stop Expecting That”ing

I’ve made some resolutions that I hope will gradually erase the distinction in my mind between quirks and bugs.

When I hear that a user encountered an unexpected condition in my software, I will ask myself how they developed their incorrect expectation. As they’ve used the program, has it guided them toward a flawed understanding? Or have I just exposed some internal detail that should be covered up?

If I find myself explaining to a user how my software works under the hood, I will redirect my spiel toward a search for ways to abstract away those implementation details instead of requiring the user to understand them.

If users are frequently confused about a particular feature, I’ll take a step back and examine the differences between my mental model of the software and the users’ mental model of it. I’ll then adjust one or both in order to bring them into congruence.

Anything that makes me a stronger force multiplier is worth doing.

Ever since I moved to Minneapolis and started working at Exosite, I’ve been using R almost daily to get to the bottom of ops mysteries. Sometimes it pays off and sometimes it doesn’t, but it’s always interesting.

Here’s an enigma I recently had the pleasure to rub some R on.

Huh…

One fine day, after a release of Portals (our webapp that lets you work with data from your cloud-enabled devices), I noticed something odd in the server metrics. The 90th-percentile response time had increased by a factor of four!

[Figure: Awry: the Apache 90th-percentile response time from the access logs]

This would be alarming by itself, but here’s what made it interesting. We have two different ways of reporting the response time of our application. We have the Apache response time metric from the access logs (which is shown above), but we also have a metric generated by the PHP application itself, through statsd. A timer is initialized as soon as the PHP script starts running, and its value is reported as a timer when it’s done. And the PHP metric, for the same time period, looked like this:

[Figure: Agley: the PHP 90th-percentile response time for the same period]

This does not follow the same pattern. Sure there’s some variability, but they’re both response time: shouldn’t they follow each other more closely? And let’s look at the 90th-percentile times for Apache (blue) and PHP (red) on a common scale:

[Figure: Awhat: Apache (blue) and PHP (red) 90th-percentile response times on a common scale]

Oh whoops — I should have warned you not to be drinking water when you looked at that graph. Sorry if you ruined your keyboard. Apache’s 90th-percentile response time is way higher than PHP’s. Like way higher. What gives?

Let’s summarize what we know so far:

  • 90th-percentile response times, as reported by Apache, have climbed above 5 seconds, which is much higher than normal.
  • PHP response times have not experienced any such climb, so something Apache-specific is causing this behavior.
  • Median values were not affected for either metric, so this issue is only affecting a particular subset of traffic.

Now it’s time to do some R.

Munging is my middle name

This disagreement between Apache response times and PHP response times is really intriguing, and might hint us toward the origin of the issue, so let’s dig into it.

The first thing to do is pull some access log entries into R. We aggregate and parse our logs with Logstash, so each entry is a JSON blob like this (irrelevant bits snipped for brevity):

{
    "@fields": {
        "agent": "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.37 Safari/537.36",
        "bytes": 2660,
        "clientip": "XXX.XXX.XXX.XXX",
        "hostheader": "portals.exosite.com",
        "overall_time": 0.097,
        "request": "/manage/devices",
        "response": "200",
        "servtime": 106908
    },
    "@timestamp": "2013-08-22T14:51:32.000Z",
    "@type": "apache_access"
}

The servtime field is the Apache response time in microseconds, and the overall_time field is the PHP response time in seconds. I’ve got a script that goes and grabs a time-slice of logs from the log server and converts them to CSV format, and the result looks like this:

"@timestamp","overall_time","servtime","hostheader"
"2013-08-21T13:00:49.000Z",0.083,93408,"foo.exosite.com"
"2013-08-21T13:00:48.000Z",0.173,185900,"foo.exosite.com"
"2013-08-21T13:00:46.000Z",0.094,104675,"bar.exosite.com"
"2013-08-21T13:00:46.000Z",0.122,131222,"foo.exosite.com"
"2013-08-21T13:00:46.000Z",0.132,141991,"bar.exosite.com"
"2013-08-21T13:00:46.000Z",0.096,106194,"baz.exosite.com"
"2013-08-21T13:00:46.000Z",0.136,146550,"bar.exosite.com"
"2013-08-21T13:00:46.000Z",0.154,163942,"foo.exosite.com"
"2013-08-21T13:00:46.000Z",0.174,184448,"bar.exosite.com"

Which is perfect for slurping into R like so:

> reqdata <- read.csv("/tmp/reqdata.csv", header=TRUE, as.is=TRUE)
> # Parse the timestamps with Hadley's lubridate package
> library(lubridate)
> reqdata$X.timestamp <- parse_date_time(reqdata$X.timestamp, "ymdHMs")
> # Remember, Apache response times are in microseconds, so we want to scale them to seconds
> reqdata$servtime <- reqdata$servtime / 10^6

And now we have a data frame called reqdata with a row for each log entry:

> head(reqdata)
          X.timestamp overall_time servtime      hostheader
1 2013-09-13 13:00:49        0.083 0.093408 foo.exosite.com
2 2013-09-13 13:00:48        0.173 0.185900 foo.exosite.com
3 2013-09-13 13:00:46        0.094 0.104675 bar.exosite.com
4 2013-09-13 13:00:46        0.122 0.131222 foo.exosite.com
5 2013-09-13 13:00:46        0.132 0.141991 bar.exosite.com
6 2013-09-13 13:00:46        0.096 0.106194 baz.exosite.com

The plot coagulates

Now that we have a nicely formatted data frame (we would also have accepted a pretty flower from your sweetie), we can use ggplot to check it out.

To get a handle on the Apache/PHP response time dichotomy, let’s plot one versus the other:

> library(ggplot2)
> p <- ggplot(reqdata, aes(overall_time, servtime))
> p + geom_point() +
    # A blue line showing the 1-to-1 relationship that we'd naïvely expect
    geom_abline(intercept=0, slope=1, color=I('blue')) +
    ggtitle('Apache response time versus PHP execution time for individual requests') +
    ylab('Apache response time') +
    xlab('PHP execution time')

[Figure: Apache response time versus PHP execution time for individual requests]

So we can see here that most of the requests are doing what we expect: getting sent to PHP as soon as they come in, and getting sent back to the client as soon as they’re done. That’s what the blue line indicates: it’s a line of slope 1.

But then we’ve got this big near-vertical chunk of weird. Those are requests where PHP finished quickly (consistently under 200 milliseconds) but Apache took much longer to serve the request (up to 35 seconds). What is Apache doing for so long? Why can’t it just serve content like it’s supposed to?

We can get a big clue if we color the points by Host header (the domain requested by the browser):

p + geom_point(aes(color=hostheader))

[Figure: Apache response time versus PHP execution time, colored by Host header]

Now we’re getting somewhere! Look at all that green. That means that this issue is limited to a particular domain: baz.exosite.com.

The home stretch

Okay, so now we know:

  • Requests for baz.exosite.com are sitting around in Apache for up to 35 seconds before being delivered
  • Our PHP executes very quickly for this domain — it’s only Apache that’s sucking up the time

Given that this traffic is limited to a particular domain, I’m getting the suspicion that it’s homogeneous in other ways too. And we can confirm this (pretend I’d pulled in the user-agent and client IP address from the logs with my initial import):

> # Filter down our dataset to just the baz.exosite.com requests
> baz <- subset(reqdata,hostheader=='baz.exosite.com')
> unique(baz$useragent)
[1] "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.37 Safari/537.36"

There’s only one browser hitting this domain. Is there only one IP too? (These IPs have been changed to protect the irrelevant)

> unique(baz$clientip)
[1] "257.8.55.154" "257.28.192.38" "257.81.101.4" "257.119.178.94"

Now let’s skip to the most neato part. I plotted Apache response time versus time over the course of 2 hours, faceted by client IP address. It looks like this:

ggplot(baz, aes(X.timestamp, servtime)) + geom_point(aes(color=clientip)) + facet_grid(clientip ~ .)

[Figure: Apache response time over two hours, faceted by client IP]

So what are we looking at? Here’s what I think.

Only one person, let’s call him François, uses this domain. François has a Portal with several auto-refreshing widgets on it. Also, François is on a train, and he’s using a wireless access point to browse the internet.

François has his Portal open in a tab that he’s forgotten about. It’s dutifully refreshing his widgets every few seconds, but as the train moves, François’s access point keeps switching towers. It starts getting a weak signal, it switches towers, François’s IP address therefore changes, and repeat.

PHP only logs how long it took to execute a script on the server, but Apache logs how long it takes for the client to acknowledge receipt of the final TCP segment in the response. Since François is on an ever-changing series of unreliable networks, he’s dropping packets left and right. Therefore, Apache is logging response times all over the map.

Alas

Alas, this information is useless. Sometimes that happens: sometimes you dig and dig and learn and learn and prepare your beautiful data, only to find at the last minute that it was all for naught.

But at least you learned something.
