Concert line (Félix Pagaimo via Flickr)

The most important thing to understand about queues

You only need to learn a little bit of queueing theory before you start getting that ecstatic “everything is connected!” high that good math always evokes. So many damn things follow the same set of abstract rules. Queueing theory lets you reason effectively about an enormous class of diverse systems, all with a tiny number of theorems.

I want to share with you the most important fundamental fact I have learned about queues. It’s counterintuitive, but once you understand it, you’ll have deeper insight into the behavior not just of CPUs and database thread pools, but also grocery store checkout lines, ticket queues, highways – really just a mind-blowing collection of systems.

Okay. Here’s the Important Thing:

As you approach maximum throughput, average queue size – and therefore average wait time – approaches infinity.

“Wat,” you say? Let’s break it down.

Queues

A queueing is exactly what you think it is: a processor that processes tasks whenever they’re available, and a queue that holds tasks until the processor is ready to process them. When a task is processed, it leaves the system.

queue_basic
A queueing system with a single queue and a single processor. Tasks (yellow circles) arrive at random intervals and take different amounts of time to process.

The Important Thing I stated above applies to a specific type of queue called an M/M/1/∞ queue. That blob of letters and numbers and symbols is a description in Kendall’s notation. I won’t go into the M/M/1 part, as it doesn’t really end up affecting the Important Thing I’m telling you about. What does matter, however, is that ‘∞’.

The ‘∞’ in “M/M/1/∞” means that the queue is unbounded. There’s no limit to the number of tasks that can be waiting in the queue at any given time.

Like most infinities, this doesn’t reflect any real-world situation. It would be like a grocery store with infinite waiting space, or an I/O scheduler that can queue infinitely many operations. But really, all we’re saying with this ‘∞’ is that the system under study never hits its maximum occupancy.

Capacity

The processor (or processors) of a queueing system have a given total capacity. Capacity is the number of tasks per unit time that can be processed. For an I/O scheduler, the capacity might be measured in IOPS (I/O operations per second). For a ticket queue, it might be measured in tickets per sprint.

Consider a simple queueing system with a single processor, like the one depicted in the GIF above. If, on average, 50% of the system’s capacity is in use, that means that the processor is working on a task 50% of the time. At 0% utilization, no tasks are ever processed. At 100% utilization, the processor is always busy.

Let’s think about 100% utilization a bit more. Imagine a server with all 4 of its CPUs constantly crunching numbers, or a dev team that’s working on high-priority tickets every hour of the day. Or a devOp who’s so busy putting out fires that he never has time to blog.

When a system has steady-state 100% utilization, its queue will grow to infinity. It might shrink over short time scales, but over a sufficiently long time it will grow without bound.

Why is that?

If the processor is always busy, that means there’s never even an instant when a newly arriving task can be assigned immediately to the processor. That means that, whenever a task finishes processing, there must always be a task waiting in the queue; otherwise the processor would have an idle moment. And by induction, we can show that a system that’s forever at 100% utilization will exceed any finite queue size you care to pick:

no_good_0
No good. Processor idle, so utilization can’t be 100%.
no_good_1.png
No good. As soon as the task finishes, we’re back to the above picture.
no_good_2.png
No good. As soon as the current task finishes, we’re back to the above picture, which we’ve already decided is no good.
no_good_3.png
You get the idea.

So basically, the statements “average capacity utilization is 100%” and “the queue size is growing without bound” are equivalent. And remember: by Little’s Law, we know that average service time is directly proportional to queue size. That means that a queueing system with 100% average capacity utilization will have wait times that grow without bound.

If all the developers on a team are constantly working on critical tickets, then the time it takes to complete tickets in the queue will approach infinity. If all the CPUs on a server are constantly grinding, load average will climb and climb. And so on.

Now you may be saying to yourself, “100% capacity utilization is a purely theoretical construct. There’s always some about of idle capacity, so queues will never grow toward infinity forever.” And you’re right. But you might be surprised at how quickly things start to look ugly as you get closer to maximum throughput.

What about 90%?

The problems start well before you get to 100% utilization. Why?

Sometimes a bunch of tasks will just happen to show up at the same time. These tasks will get queued up. If there’s not much capacity available to crunch through that backlog, then it’ll probably still be around when the next random bunch of tasks show up. The queue will just keep getting longer until there happens to be along enough period of low activity to clear it out.

I ran a qsim simulation to illustrate this point. It simulates a single queue that can process, on average, one task per second. I ran the simulation for a range of arrival rates to get different average capacity usages. (For the wonks: I used a Poisson arrival process and exponentially distributed processing times). Here’s the plot:

without_extreme

Notice how things start to get a little crazy around 80% utilization. By the time you reach 96% utilization (the rightmost point on the plot), average wait times are already 20 seconds. And what if you go further and run at 99.8% utilization?

with_extreme

Check out that Y axis. Crazy. You don’t want any part of that.

This is why overworked engineering teams get into death spirals. It’s also why, when you finally log in to that server that’s been slow as a dog, you’ll sometimes find that it has a load average of like 500. Queues are everywhere.

What can be done about it

To reiterate:

As you approach maximum throughput, average queue size – and therefore average wait time – approaches infinity.

What’s the solution, then? We only have so many variables to work with in a queueing system. Any solution for the exploding-queue-size problem is going to involve some combination of these three approaches:

  1. Increase capacity
  2. Decrease demand
  3. Set an upper bound for the queue size

I like number 3. It forces you to acknowledge the fact that queue size is always finite, and to think through the failure modes that a full queue will engender.

Stay tuned for a forthcoming blog post in which I’ll examine some of the common methods for dealing with this property of queueing systems. Until then, I hope I’ve helped you understand a very important truth about many of the systems you encounter on a daily basis.

screens_jpg__1024×576_

3 Things That Make Encryption Easier

Almost everyone (especially in ops) knows they should be better about encrypting secret data. And yet most organizations have at least a few passwords and secret keys checked into Git somewhere.

The ideal solution would be for everyone at your company to use PGP all the time, but that is a huge pain. Encryption tools are annoying to use, and a significant time investment is required to learn to use them correctly. And if security is hard, people will always find a way to avoid it.

In the last few months, I’ve adopted 3 new technologies that make secure storage and exchange of secret information at least bearable.

1: Blackbox

StackExchange’s blackbox tool makes it easy to store encrypted data in a Git repository. First you need to import into your personal keyring all the PGP keys you want to grant access to. Then you initialize the blackbox directory structure:

Once you’ve initialized blackbox, you can start adding administrators, which are keys that will be granted access to the secret data in the repository:

Now you can start adding secrets securely:

I really like how this tool gives my team a distributed, version-controlled repository of secret information. We can even give other teams access to the repository without worrying about exposing secrets!

My team uses this tool for shared passwords and SSL private keys, and it works great. Check it out.

2: Salt

At my company, we use Salt for config management. Like most config management systems, Salt lets you decouple the values in a config file from the file itself. You make a template of the config file that will appear on the node, and you put the values in a pillar (equivalent to a Chef databag, or a Puppet… whatever it’s called in Puppet).

So instead of storing a config file like this:

You store a template like this:

and a pillar (which is just a YAML file) like this:

Now suppose you don’t want to commit that super-secure password directly to your Salt repository. Instead, you can create a PGP keypair, and give the private key to your Salt server. Then you can encrypt the password with that key. Your pillar will now look like this:

When processing your template on the target node, Salt will seamlessly decrypt the password for you.

I love that I can give non-admins access to our Salt repo, and let them submit pull requests, without worrying about leaking passwords. To learn more about this Salt functionality, you can read the documentation for salt.renderers.gpg.

3: SecretShare

Salt’s GPG renderer and blackbox are great ways to store shared secret data, but what about transmitting secrets to particular people? In most organizations, when passwords and such need to be transmitted from employee to employee, insecure methods are used. Email, chat, and Google docs are very common media for transmitting secrets. They’re all saved indefinitely, meaning that an attacker who gains access to your account can gain access to all the secret info you’ve ever sent or received.

To make transmitting secrets as easy and secure as possible, my teammate Alex created secretshare. It lets you transmit arbitrary secret data to others in your organization, and it has immense advantages over other systems:

  • Secrets are never transmitted or stored in the clear, so a snooper can’t even read them if they manage to compromise the Amazon S3 bucket in which they’re stored.
  • Secrets are deleted from S3 after 24-48 hours, so a snooper can’t go back through the recipient’s or sender’s communication history later and retrieve them.
  • Secrets are encrypted with a one-time-use key, so a snooper can’t use the key from one secret to steal another.
  • Users don’t need Amazon AWS credentials, so a snooper can’t steal those credentials from a user.

Right now, secretshare only exists as a command-line utility, but we’re very close to having a web UI as well, which will make it even easier for non-technical people to use.

 

Security’s worst enemy is bad UX. It’s critical to make the most secure path also the easiest path. That’s what these three solutions aim to do, and they’ve made me feel much more comfortable with the security of secret data at my company. I hope they can do the same for you.

Start a Paper Club!

A few months ago, a work friend and I were commiserating about how we never make time to read any research. There’s all this fascinating, challenging stuff being written, we agreed, and we’re missing it all.

When a few more coworkers chimed in, saying they’d also like to push themselves to read more academic literature, we realized we were onto something. The next day, the Exosite Paper Club was born.

It’s been really fun to organize and participate in Paper Club these last few months. I’ve learned a lot, not just about the fields on which our reading material focused, but also about my coworkers and my company. Now I want to share some my excitement about this project and encourage you to start a Paper Club at your own company.

What Paper Club is and does

Paper Club is a group that meets every other week over lunch. We all prepare by reading a particular academic paper, and we discuss it in a free-form group session. Sometimes we teach each other about concepts from the paper, sometimes we brainstorm ways to apply ideas to our work at Exosite, and sometimes we just chat.

The paper for each session is selected by a participant from the previous session, and can be about any topic from project management to statistics to psychology to medicine. Readers of any skill level in the paper’s subject matter are warmly welcomed at meetings, but you can always skip a session if the topic doesn’t interest you. All we ask is that, when the material is hard for you, you push yourself and try to grasp it anyway.

We want a wide variety of participants, from DevOps to UI to sales & marketing, but we also want to read papers that are detailed enough to be challenging. To these ends, we have three guidelines for submitting a paper:

  1. Papers should be accessible. Try to pick papers at a technical level such that at least some of your peers will be able to understand most of the content. You want the session to be a discussion, not a lecture.
  2. Papers should be challenging. While accessibility is important, you don’t want your peers to have too easy a time. The best conversations happen when people are forced to push themselves a bit. It’s okay to suggest papers that are only accessible to those with a particular academic background, as long as you think the paper will create a good discussion with some subset of your coworkers.
  3. Papers should be deep. The best discussions tend to come from papers that dive pretty deep into a topic. Reviews and textbook chapters can be interesting, but we tend to prefer papers that go into detail on a specific topic.

What we’ve read so far

We’ve been doing Paper Club at Exosite since mid-November, and we’ve discussed 8 papers to date. I thought we’d be reading mostly computer science papers, but I couldn’t be happier with the variety we’ve gotten!

Here are a few of the papers that produced the most interesting discussions:

  • Confidence in Judgment: Persistence of the Illusion of Validity. This classic behavioral study from 1978 takes the well-established observation that people (even especially experts) are usually more confident in their judgments than they should be. The authors build a simple but powerful model for the mental processes responsible for this overconfidence, and present some suggestions for systematically curtailing it.
  • On Bullshit. If you’re like most engineers, you’re utterly allergic to bullshit. But have you ever thought about what makes bullshit bullshit? How it’s different from outright lying, and how it’s different from normal speech? This famous philosophical essay tries to answer these questions, and it provides some valuable insights for someone trying to excise bullshit from their life.
  • Why Johnny Can’t Encrypt. This 1999 analysis of PGP 5.0 usability raises some points that are crucial for anyone trying to design intuitive user experiences. The paper led us into a super productive critique of the metaphorical structure used in our own company’s documentation.

How it’s going

I have been really happy with the intellectual diversity of our Paper Club participants. Even when the paper under discussion is an especially wonky one, we get project managers, devs from all different software teams, salespeople, marketers, managers, and occasionally even company executives. Seeing all these people engage in thoughtful dialogue on such a wide variety of topics is inspiring. It takes me out of my DevOps bubble and reminds me that I work with some very interesting and smart folks.

I’d like to see how far we can stretch ourselves. Selfishly, I’d like us to read a math or linguistics paper some time, and help each other through it. I think that would be very rewarding.

Overall, I’m very glad we have a Paper Club here. I’d recommend it to anyone who likes people and/or learning. If you start one at your company, let me know how it goes!

196727__UY475_SS475__jpg__475×475_

The 10 Best Books I Read In 2015

I read 56 books in 2015, which is more than I’d read in the previous 5 years combined. Turns out books are pretty cool. Who knew?

Here are the 10 books I liked the most this year, in no particular order.

1. Metaphors We Live By (1980)

Goodreads___Metaphors_We_Live_By_by_George_Lakoff_—_Reviews__Discussion__Bookclubs__Lists.pngI got real into linguistics this year, and this book offers an interesting perspective on semantics. We usually think of metaphor as a poetic device. But this book argues that the whole human conceptual system is based on metaphor! According to George Lakoff, every concept we understand (short of concepts corresponding to our direct experience) is understood by analogy with a more concrete concept.

There are some very intriguing ideas in here. I didn’t necessarily buy (or even understand) them all, but I’m really glad I read this book.

2. The One World Schoolhouse: Education Reimagined (2012)
one_world_schoolhouse_-_Google_Search.png

This thought-provoking book on education was written by the Khan Academy guy. He presents a lot of research pointing toward the hypothesis that self-paced “mastery learning” is much more broadly effective than the contemporary American model of arbitrarily delineated, one-speed-fits all classes.

Discussing this book with my friends (who tend to be smart academic underachievers), really brought home the point that our education system underserves anyone who understands a concept more slowly, or more quickly, than the rest of the class.

3. The Orphan Master’s Son (2012)

Goodreads___The_Orphan_Master_s_Son_by_Adam_Johnson_—_Reviews__Discussion__Bookclubs__ListsI’m endlessly fascinated by North Korea, and this novel stoked my fascination. That alone would probably have been enough, but it’s also super well written. I found it beautiful and sad and gripping the whole way through.

 

 

 

4. The Immortal Life of Henrietta Lacks (2010)

Goodreads___The_Immortal_Life_of_Henrietta_Lacks_by_Rebecca_Skloot_—_Reviews__Discussion__Bookclubs__ListsThe author of this book tracked down the family of the long-deceased woman whose incredibly robust tumor cells became the most widely studied strain of human tissue in the world. Her cells have been used by scientists to make countless discoveries in genetics and immunology.

Through interviews with Henrietta Lacks’ descendants, all of whom still live in abject poverty, Rebecca Skloot raises important and nuanced questions about the interplay between science and culture and race. It’d be hard to read this book and still think of science as “pure” or “objective.” Scientists aspire to objectivity, but they’re just as boxed in by their cultural preconceptions as anyone else.

5. Red Rising (2014)

Goodreads___Red_Rising__Red_Rising_Trilogy___1__by_Pierce_Brown_—_Reviews__Discussion__Bookclubs__ListsThis is a super fun young-adult novel about badass vicious teens trying to kill each other. The premise is pretty similar to that of The Hunger Games, but I found the character development and storytelling way better. And the second book in the trilogy, Golden Son, is awesome too! The third one is coming out very soon, and I can’t wait.

Don’t expect any deep truths or transcendent prose. This book is just really fun to read.

6. The Road (2006)

Goodreads___The_Road_by_Cormac_McCarthy_—_Reviews__Discussion__Bookclubs__Lists.pngHey, speaking of books that are really fun to read, this book is not one. It’s a painfully stark novel about a father and son trying to survive in post-apocalyptic North America. I definitely wouldn’t call it a “feel-good” book.

But despite the author’s unremittingly bleak vision, the relationship between the man and the boy (those are the only names given for the characters) is very touching. A friend of mine claims to use this book as a parenting handbook of sorts. I don’t know if I’d go that far, but I do see what he means.

7. History in Three Keys: The Boxers as Event, Experience, and Myth (1997)

Goodreads___History_in_Three_Keys__The_Boxers_as_Event__Experience__and_Myth_by_Paul_A__Cohen_—_Reviews__Discussion__Bookclubs__Lists.pngMy favorite non-fiction book of the year. On one level, this is a history book about the Boxer Rebellion: a grimly compelling episode in its own right. But, moreover, this book is about history itself. Paul A. Cohen describes the Boxer Rebellion through 3 different lenses – experience, history, and myth – each of which represents part of the way we engage with the past.

I came away from this book with a newfound appreciation for the role of the historian in creating history, and a newfound skepticism about the idea that history is composed of objective facts and dates.

[I did a lightning talk (transcript in the first slide’s speaker notes) relating what I learned from this book to post-mortem analysis in DevOps.]

8. The Martian Chronicles (1950)

Goodreads___The_Martian_Chronicles_by_Ray_Bradbury_—_Reviews__Discussion__Bookclubs__ListsI tried to read The Martian Chronicles in high school, and I was like “Pfft! This isn’t science fiction. The science doesn’t make any sense.” I’ve read a lot more sci-fi now, so I decided to give this classic another shot.

What I’ve learned since high school is that the sci-fi I like most isn’t about cool technology or mind-bending thought experiments (although those can add some nice seasoning to a story). It’s about humanity: what continues to define us as human, even when the things we think of basic to our humanity – Earth, language, gender, our bodies – are stripped away?

Bradbury knew exactly what he was on about. After reading these stories with the human question in mind, I finally understand why he’s been so influential in science fiction.

9. The New Jim Crow: Mass Incarceration in the Age of Colorblindness (2009)

Goodreads___The_New_Jim_Crow__Mass_Incarceration_in_the_Age_of_Colorblindness_by_Michelle_Alexander_—_Reviews__Discussion__Bookclubs__ListsI already had strong objections to the War on Drugs on account of the senseless imprisonment of people who’ve done nothing harmful. But this book shows how drug arrest quotas, mandatory minimum sentences, and felon profiling work together as an insidious system to maintain white supremacy in America.

I’d recommend this book to any American. We all need to understand that racial oppression didn’t go away on July 2, 1964; it just adopted more clever camouflage.

10. Anathem (2008)

anathem_cover_-_Google_Search.pngThis is now my favorite Neal Stephenson book. It has his trademark mathyness and detail to a truly engaging story in a richly-imagined universe.

Like most Neal Stephenson books, it’s not for everyone. A lot of time is spent on epistemological ruminations and physics. I love that shit, so I loved this book all the way from page 1 to page 937.

 

Welp, those are some books I loved this year.

But I want to see what you’re reading, so I can find more books to love. Friend-poke me on GreatBooks!

star-trek-tng-s2-002

Troubleshooting On A Distributed Team Without Losing Common Ground

I work on a team that fixes complex systems under time pressure. My teammates have different skill sets, different priorities, and different levels of expertise. But we all have to troubleshoot and solve problems together.

This is really hard to do effectively. Fortunately for us in the relatively new domain of DevOps, situations like ours have been studied extensively in the last couple decades. We can use the results of this research to inform our own processes and automation for troubleshooting.

One of the most important concepts to emerge from recent teamwork research, common ground, helps us understand why collaborative troubleshooting breaks down over time. This breakdown leads to wasted effort and mistakes, even if the team maintains constant communication in a chat room. But if we extend ChatOps by drawing on some ideas from medical diagnosis, we can make troubleshooting way easier without losing the benefits of fluid team conversation.

Common Ground

Ergonomics researchers D.D. Woods and Gary Klein (the latter of whom I wrote about in What makes an expert an expert?) published a phenomenally insightful paper in 2004 called Common Ground and Coordination in Joint Activity. In it, they describe a particular kind of failure that occurs when people engage in joint cognition: the Fundamental Common Ground Breakdown. Once you learn about the Fundamental Common Ground Breakdown, you see it everywhere. Here’s how the Woods/Klein paper describes the FCGB:

  • Party A believes that Party B possesses some knowledge
  • Party B doesn’t have this knowledge, and doesn’t know he is supposed to have it.
  • Therefore, he or she doesn’t request it.
  • This lack of a request confirms to Party A that Party B has the knowledge.

When this happens, Party A and Party B lose common ground, which Woods & Klein define as “pertinent knowledge, beliefs and assumptions that are shared among the involved parties.” The two parties start making incorrect assumptions about each other’s knowledge and beliefs, which causes their common ground to break down further and further. Eventually they reach a coordination surprise, which forces them to re-synchronize their understanding of the coordinated activity:

csel_eng_ohio-state_edu_woods_distributed_CG_final_pdf

Seriously, the FCGB is everywhere. Check out the paper.

I’m especially interested in one particular area where an understanding of common ground can help us do better teamwork: joint troubleshooting.

Common Ground Breakdown in Chatroom Troubleshooting

Everybody’s into ChatOps these days, and I totally get it. When a critical system is broken, it’s super useful to get everybody in the same room and hash it out. ChatOps allows everybody to track progress, coordinate activities, and share results. And it also helps to have lots of different roles represented in the room:

  • Operations folks, to provide insight into the differences between the system’s normal behavior and its current state
  • Software engineers, who bring detailed knowledge of the ways subsystems are supposed to work
  • Account managers and product managers and support reps: not just for their ability to translate technical jargon into the customer’s language for status reporting, but also because their understanding of customer needs can help establish the right priorities
  • Q.A. engineers, who can rule out certain paths of investigation early with their intuition for the ways in which subsystems tend to fail

The process of communicating across role boundaries isn’t just overhead: it helps us refine our own understanding, look for extra evidence, and empathize with each other’s perspectives.

But ChatOps still offers a lot of opportunities for common ground breakdown. The FCGB can occur whenever different people interpret the same facts in different ways. Interpretations can differ for many different reasons:

  • Some people have less technical fluency in the system than others. A statement like “OOM killer just killed Cassandra on db014” might change an ops engineer’s whole understanding of the problem, but such a shift could fly under the radar of, say, a support engineer.
  • Some people are multitasking. They may have a stake in the troubleshooting effort but be unable to internalize every detail from the chat room in real time.
  • Some people are co-located. They find it easier to discuss the problem using mouth words or by physically showing each other graphs, thereby adjusting their own shared understanding without transmitting these adjustments to the rest of the team.
  • Some people enter the conversation late, or leave for a while and come back. These people will miss common ground changes that happen during their absence.

These FCGB opportunities all become more pronounced as the troubleshooting drags on and folks become tired, bored, and confused. And when somebody says they’ve lost track of common ground, what do we do? Two main things: we provide a summary of recent events and let the person ask questions until they feel comfortable; or we tell them to read the backlog.

The Q&A approach has serious drawbacks. First of all, it requires somebody knowledgeable to stop what they’re doing and summarize the situation. If people are frequently leaving and entering the chat room, you end up with a big distraction. Second of all, it leaves lots of room for important information to get missed. The Fundamental Common Ground Breakdown happens when somebody doesn’t know what to ask, so fixing it with a Q&A session is kind of silly.

The other way people catch up with the troubleshooting effort is by reading the backlog. This is even more inefficient than Q&A. Here’s the kind of stuff you have to dig through when you’re reading a chat backlog:

tng-hipchat

There’s a lot to unpack there – and that’s just 18 messages! Imagine piecing together a troubleshooting effort that’s gone on for hours, or days. It would take forever, and you’d still make a lot of mistakes. It’s just not a good way to preserve common ground.

So what do we need?

Differential Diagnosis as an Engine of Common Ground

I’ve blogged before about how much I love differential diagnosis. It’s a formalism that doctors use to keep the diagnostic process moving in the right direction. I’ve used it many times in ops since I learned about it. It’s incredibly useful.

In differential diagnosis, you get together with your team in front of a whiteboard – making sure to bring together people from a wide variety of roles – and you go through a cycle of 3 steps:

  1. Identify symptoms. Write down all the anomalies you’ve seen. Don’t try to connect the dots just yet; just write down your observations.
  2. Generate hypotheses. Brainstorm explanations for the symptoms you’ve observed. This is where it really helps to have a good cross-section of roles represented. The more diverse the ideas you write down, the better.
  3. Test hypotheses. Now that you have a list of things that might be causing the problem, you start narrowing down that list by coming up with a test that will prove or disprove a certain hypothesis.

Once you’re done with step #3, you can cross out a hypothesis or two. Then you head back to step #1 and repeat the cycle until the problem is identified.

A big part of the power of differential diagnosis is that it’s written down. Anybody can walk into the room, read the whiteboard, and understand the state of the collaborative effort. It cuts down on redundant Q&A, because the most salient information is summarized on the board. It eliminates inefficient chat log reading – the chat log is still there, but you use it to search for specific pieces of information instead of reading it like a novel. But, most importantly, differential diagnosis cuts down on fundamental common ground breakdowns, because everybody has agreed to accept what’s on the whiteboard as the canonical state of troubleshooting.

Integrating Differential Diagnosis with ChatOps

We don’t want to lose the off-the-cuff, conversational nature of ChatOps. But we need a structured source of truth to provide a point-in-time understanding of the effort. And we (read: I) don’t want to write a whole damn software project to make that happen.

My proposal is this: use Trello for differential diagnosis, and integrate it with the chat through a Hubot plugin. I haven’t written this plugin yet, but it shouldn’t take long (I’ll probably fork hubot-trello and start from there). That way people could update the list of symptoms, hypotheses, and tests on the fly, and they’d always have a central source of common ground to refer to.

In the system I envision, the chat room conversation would be peppered with statements like:

Geordi: hubot symptom warp engine going full speed, but ship not moving

Hubot: Created (symp0): warp engine going full speed, but ship not moving

Beverly: hubot falsify hypo1

Hubot: Falsified (hypo1): feedback loop between graviton emitter and graviton roaster

Geordi: hubot finish test1

Hubot: Marked (test1) finished: reboot the quantum phase allometer

And the resulting differential diagnosis board, containing the agreed-upon state of the troubleshooting effort, might look like this example, with cards labeled to indicate that they’re no longer in play.

What do you think?

Let me know if your organization already has something like this, or has tried a formal differential diagnosis approach before. I’d love to read some observations about your team’s process in the comments. Also, VictorOps has a pretty neat suite of tools that approaches what I have in mind, but I still think a more conceptually structured (not to mention free) solution could be very useful.

Automation is most effective when it’s a team player. By using automation to preserve common ground, we can solve problems faster and more thoroughly, with less frustration and less waste. And that all sounds pretty good to me.

Minding Your Pees And Queues

A couple weeks ago, my wife & I went to Basilica Block Party, a local music festival. It was a good time, and OH MANS you have to see Fitz & The Tantrums live. Their sax player is a hero unit.

Anyhoo, we walked over to the porta-potties between sets. The lines were about 8–10 people long. And my spouse suggested an intriguing strategy for minimizing her wait time. I will call this strategy The Strategy:

The Strategy: All else being equal, get in the line with the most men.

The reasoning behind The Strategy is obvious: women take longer to pee than men, so if 2 queues are the same length, then the faster-moving queue should be the one with fewer women. It’s intuitive, but due to my current obsession with queueing theory, I became intensely interested in the strategy’s implications. In particular, I started to wonder things like:

  • How much can you expect to shave off your wait time by following The Strategy?
  • How does the effectiveness of The Strategy vary with its popularity? Little’s Law tells us that the overall average wait time won’t be affected, but is The Strategy still effective if 10% of the crowd is using it? 25%? 90%?

And then I thought to myself “I could answer these questions through the magic of computation!”

qsim

Lately I’ve been seeing queueing systems everywhere I go, so I figured it’d be worthwhile to write a generic queueing system simulator to satisfy my curiosity. That’s what I did. It’s called qsim.

To quote the README,

A queueing system in qsim processes arbitrary jobs and is composed of 5 pieces:

  • The arrival process controls how often jobs enter the system.
  • The arrival behavior defines what happens when a new job arrives. When the arrival process generates a new job, the arrival behavior either sends it straight to a processor or appends it to a queue.
  • Queues are simply holding pens for jobs. A system may have many queues associated with different processors.
  • A queueing discipline defines the relationship between queues and processors. It’s responsible for choosing the next job to process and assigning that job to a processor.
  • Processors are the entities that remove jobs from the system. A processor may take differing amounts of time to process different jobs. Once a job has been processed, it leaves the queueing system.

qsim provides a framework for implementing these building blocks and putting them together, and it also provides hooks that can be used to gather data about a simulation as it runs. I’m really looking forward to using qsim to gain insight into all sorts of different systems.

For now: porta-potties.

The porta-potty simulation

You’ll recall The Strategy:

The Strategy: All else being equal, get in the line with the most men.

To determine the effectiveness of The Strategy, I implemented PortaPottySystem using qsim. Here are some of the assumptions I made:

  • People arrive very frequently, but if all the queues are too long (8 people) they leave.
  • There are 15 porta-potties, each with its own queue. Once a person enters a queue, they stay in that queue until the corresponding porta-potty is vacant.
  • Shockingly, I couldn’t find any reliable data on the empirical distribution of pee times by sex, so I chose a normal distribution with a mean of 40 seconds for men and 60 seconds for women.
  • Most people just pick a random queue to join (as long as it’s no longer than the shortest queue), but some people use The Strategy of getting into the queue with the highest man:woman ratio (again, as long as it’s no longer than the shortest queue).
  • Nobody’s going number 2 because that’s gross.
  • Everyone is either a man or a woman. I know all about gender being a spectrum, and if you want to submit a pull request that smashes the gender binary, please do.

The first question I wanted to answer was: how does using The Strategy affect your wait time?

To answer this question, I ran a simulation where the probability of a given person deciding to use The Strategy is 1%. The other 99% of people simply join one of the shortest queues without regard to its man:woman ratio. I ran 20 simulations, each for the equivalent of 2 weeks, and came up with these wait time distributions:

waittimes

More of a box-plot person? I hear ya:

boxplot

The Strategy is definitely not a huge win here. On average, your wait time will be reduced by about 10–15 seconds (4–6%) if you use The Strategy. Still, it’s not nothing, right?

Now how does the benefit of using The Strategy vary with its popularity? This is actually really interesting. I never would have guessed it. But the data shows that you should always use The Strategy, even if everybody else is using it too.

Here I’ve charted average wait times against the proportion of people using the strategy. Colors are more prominent where the data set in question is large (and therefore heavily influences the overall average):

eff_v_pop

You’ll notice that the overall average (dark green) does not vary with Strategy popularity. This is good news, because otherwise we’d be violating Little’s Law, which would probably just mean our simulation was broken.

The interesting thing here is that the benefit of using The Strategy decreases pretty much linearly as its popularity increases, but at the same time there accrues a disadvantage to not using it. If everybody else in the system is using The Strategy, and you come along and decide not to, you can still expect to wait 10 seconds longer than everybody else. Therefore using The Strategy is unequivocally better than not using The Strategy.

Unless of course you don’t really care about those 10 seconds, in which case you should do whatever you want.

eyebrows

When efficiency hurts more than it helps

When we imagine how to use a resource effectively – be that resource a development team, a CPU core, or a port-a-potty – our thoughts usually turn to efficiency. Ideally, the resource gets used at 100% of its capacity: we have enough capacity to serve our needs without generating queues, but not so much that we’re wasting money on idle resources. In practice there are spikes and lulls in traffic, so we should provision enough capacity to handle those spikes when they arrive, but we should always try to minimize the amount of capacity that’s sitting idle.

Except what I just said is bullshit.

In the early chapters of Donald G. Reinertsen’s brain-curdlingly rich Principles of Product Development Flow, I learned a very important and counterintuitive lesson about queueing theory that puts the lie to this naïve aspiration to efficiency-above-all-else. I want to share it with you, because once you understand it you will see the consequences everywhere.

Queueing theory?

Queueing theory is an unreasonably effective discipline that deals with systems in which tasks take time to get processed, and if there are no processors available then a task has to wait its turn in a queue. Sound familiar? That’s because queueing theory can be used to study basically anything.

In its easiest-to-consume form, queueing theory tells us about average quantities in the steady state of a queueing system. Suppose you’re managing a small supermarket with 3 checkout lines. Customers take different, unpredictable amounts of time to finish their shopping. So they arrive at the checkout line at different intervals. We call the interval between two customers reaching the checkout line the arrival interval.

And customers also take different, unpredictable amounts of time to get checked out. The time it takes from when the cashier scans a customer’s first item to when they finish checking that customer out is called the processing time.

Each of these quantities has some variability in it and can’t be predicted in advance for a particular customer. But you can empirically determine the probability distribution of these quantities:

distributions

Given just the information we’ve stated so far, queueing theory can answer a lot of questions about your supermarket. Questions like:

  • How long on average will a customer have to wait to check out?
  • What proportion of customers will arrive at the checkout counter without having to wait in line?
  • Can you get away with pulling an employee off one of the registers to go stock shelves? And if you do that, how will you know when you need to re-staff that register?

These sorts of questions are super important in all sorts of systems, and queueing theory provides a shockingly generalizable framework for answering them. Here’s an important theme that shows up in a huge variety of queueing systems:

The closer you get to full capacity utilization, the longer your queues get. If you’re using 100% of capacity all time, your queues grow to infinity.

This is counterintuitive but absolutely true, so let’s think through it.

What happens when you have no idle capacity

What the hell? Isn’t using capacity efficiently how you’re supposed to get rid of queues? Well yes, but it doesn’t work if you do it all the time. You need some buffer capacity.

Let’s think about a generic queueing system with 5 processors. This system’s manager is all about efficiency, so the system operates at 100% capacity all the time. No idle time. That’s ideal, right?

fullcap-0

Sure, okay, now what happens when a task gets completed? If we want to make sure we’re always operating at 100% capacity, then there needs to be a task waiting behind that one. Otherwise we’d end up with an idle processor. So our queueing system must look more like this:

fullcap-1

In order to operate at 100% capacity all the time, we need to have at least as many tasks queued as there are processors. But wait! That means that when another new task arrives, it has to get in line behind those other tasks in the queue! Here’s what our system might look like a little while later:

fullcap-2

Some queues may be longer than others, but no queue is ever empty. This forces the total number of items in the queue to grow without limit. Eventually our system will look like this:

fullcap-3

If you don’t quite believe it, I don’t blame you. Go back through the logic and convince yourself. It took me a while to absorb the idea too.

What this means for teams

You can think of a team as a queueing system. Tasks arrive in your queue at random intervals, and they take unpredictable amounts of time to complete. Each member of the team is a processor, and when everybody’s working as hard as they can, the system is at 100% capacity.

That’s what a Taylorist manager would want: everybody working as hard as they can, all the time, with no waste of capacity. But as we’ve seen, in any system with variability, that’s an unachievable goal. The closer you get to full capacity utilization, the faster your queues grow. The longer your queues are, the longer the average task waits in the queue before getting done. It gets bad real fast:

Quartz_3____

So there are very serious costs to pushing your capacity too hard for too long:

  • Your queues get longer, which itself is demotivating. People are less effective when they don’t feel that their work is making a difference (see The Progress Principle)
  • The average wait time between a task arriving and a getting done rises linearly with queue length. With long wait times, you hemorrhage value: you commit time and energy to ideas that might not be relevant anymore by the time you get around to them (again: read the crap out of Principles of Product Development Flow)
  • Since you’re already operating at or near full capacity, you can’t even deploy extra capacity to knock those queues down: it becomes basically impossible to ever get rid of them.
  • The increased wait time in your ticket queue creates long feedback times, nullifying the benefit of agile techniques.

Efficiency isn’t the holy grail

Any queueing system operating at full capacity is gonna build up giant queues. That includes your team. What should you do about it?

Just by being aware that this relationship exists, you can gain a lot of intuition about team dynamics. What I’m taking away from it is this: There’s a tradeoff between how fast your team gets planned work done and how long it takes your team to get around to tasks. This changes the way I think about side projects, and makes me want to find the sweet spot. Let me know what you take away from it.