Intermittent Conjecture: Debugging servers and bodies

For quite a while I've wanted to write up some thoughts about the nature of cause and effect. In actually trying to do so, I realized two things: first, that I don't know that much about the main streams of thought on this, which go back millennia, and second, that this was material for several posts. Rather than step back and formulate an overarching structure for a series, or dive deep into the philosophical literature, I decided to just start somewhere and call that Part I, with the intention of coming back to the subject from time to time, probably with unrelated posts in between. And then I realized that there was no need to call it Part anything. I could just introduce a new tag, cause and effect, and apply it where necessary. So here we go ...

I press keys on the keyboard and words appear on the screen. It seems pretty clear that one is the cause of the other. I take a cough suppressant and my cough subsides. Quite likely it subsided because of the medicine, but maybe I was getting better anyway. I run an ad online and sales go up. But they've been going up for months. Did they go up more because of the ad? (Half of your advertising budget is wasted; the trick is to know which half.) I wear a lucky shirt and my team wins. I may want to think the two are related, but realistically it's just a bit of fun.

I spend a lot of my time debugging, trying to figure out why something isn't working the way it was expected to, whether it's why the TV isn't working at home or why a server crashed at work. If I'm trying to figure out what's wrong with something electronic at home, I generally just turn things off and on until everything's back in a sane state. That's usually fine at home, but not such a good idea at work. Even if restarting the server fixes the problem, you still want to know why, so next time you won't have to restart it.

How do you tell if one thing caused another? Debugging is much older than computing. For example, the practice of debugging human health, that is, medicine, developed a useful paradigm in 1884, known as Koch's postulates after Robert Koch, for determining whether a particular microorganism causes a particular disease:

1. The microorganism must be found in abundance in all organisms suffering from the disease, but should not be found in healthy organisms.
2. The microorganism must be isolated from a diseased organism and grown in pure culture.
3. The cultured microorganism should cause disease when introduced into a healthy organism.
4. The microorganism must be reisolated from the inoculated, diseased experimental host and identified as being identical to the original specific causative agent.

Koch actually abandoned the "should not be found" part of postulate 1 after discovering that some people could carry a disease without showing symptoms.

Abstracting a little bit with an eye toward applying these postulates to an ailing server, one might say:

1. The conditions you think are causing the problem should be present in affected servers but ideally not in healthy ones. For example, the unhealthy servers are all deployed in sector A and the healthy ones aren't, or, more commonly, the unhealthy ones are running one version of the code while the healthy ones are running the previous version.
2. This one is a bit tricky, but I think it boils down to: There has to be a well-defined way to induce the conditions you think are causing the problem. For example, you can start a new instance of a server in sector A or update a healthy server to the new version you think is buggy. It isn't always immediately obvious how to do this. For example, if you think that the problem is some sort of bad request coming from random parts of the internet, you'll probably have to search through logs to find requests that correspond with problems.
3. If you trigger the conditions, the problem occurs.
4. Again, this doesn't apply directly, but in this context I think it means double-checking for evidence that the conditions you thought were triggered really were triggered. When you brought up the test server and it fell over, was it actually in sector A? When you sent the query-of-death you found in the logs, did the server that fell over actually log receiving it just before it fell over?
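
The first of these server postulates can be checked mechanically if you have an inventory of the fleet. Here is a minimal sketch, assuming a made-up inventory: the Server record, the server names, the sectors, and the versions are all hypothetical, not taken from any real system. It counts how often a suspected condition shows up among unhealthy servers versus healthy ones.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Server:
    name: str
    sector: str
    version: str
    healthy: bool


def condition_split(servers, has_condition):
    """Count how a suspected condition lines up with server health.

    Returns (unhealthy_with, unhealthy_total, healthy_with, healthy_total).
    """
    unhealthy = [s for s in servers if not s.healthy]
    healthy = [s for s in servers if s.healthy]
    return (
        sum(1 for s in unhealthy if has_condition(s)),
        len(unhealthy),
        sum(1 for s in healthy if has_condition(s)),
        len(healthy),
    )


# Hypothetical fleet: every name, sector, and version here is invented.
FLEET = [
    Server("web-1", "A", "v2", healthy=False),
    Server("web-2", "A", "v2", healthy=False),
    Server("web-3", "B", "v1", healthy=True),
    Server("web-4", "B", "v2", healthy=False),
    Server("web-5", "A", "v1", healthy=True),
]

in_sector_a = condition_split(FLEET, lambda s: s.sector == "A")
on_v2 = condition_split(FLEET, lambda s: s.version == "v2")
print("sector A:", in_sector_a)
print("version v2:", on_v2)
```

In this invented fleet, "running v2" is present in every unhealthy server and no healthy one, while "in sector A" is not, so the version is the better suspect. That is only the first postulate, of course; it says nothing yet about whether inducing the condition reproduces the problem.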

The kind of double-checking in postulate 4 is crucial in real debugging. It's very common, for example, to think you restarted a server with a new setting that should cause or fix a given problem, only to find that you restarted a different instance, or accidentally restarted it with the old configuration instead of the new. For example, as I was writing this paragraph I realized that the command I thought would send a problem request to an unwell server I'd been debugging had actually failed, explaining why I saw no evidence of the request having been handled.
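
That kind of double-check can be partly automated. The sketch below assumes a hypothetical log format of "ISO timestamp, space, message", a hypothetical "CRASH" marker in the message, and an arbitrary five-second window; none of these come from a real system. It asks whether the request you believe you sent was actually logged shortly before the crash.

```python
from datetime import datetime, timedelta


def parse_line(line):
    """Split '<ISO timestamp> <message>' into (datetime, message)."""
    stamp, _, message = line.partition(" ")
    return datetime.fromisoformat(stamp), message


def request_seen_before_crash(log_lines, request_id, window_seconds=5):
    """True if request_id was logged within the window before the first crash."""
    events = [parse_line(line) for line in log_lines]
    crash_times = [t for t, msg in events if "CRASH" in msg]
    if not crash_times:
        return False  # no crash logged, so nothing to explain
    crash = min(crash_times)
    window = timedelta(seconds=window_seconds)
    return any(
        request_id in msg
        and "CRASH" not in msg
        and timedelta(0) <= crash - t <= window
        for t, msg in events
    )


# Hypothetical log excerpt, invented for illustration.
LOG = [
    "2020-01-01T12:00:00 received request id=req-41",
    "2020-01-01T12:00:03 received request id=req-42",
    "2020-01-01T12:00:04 CRASH segfault in parser",
]

print(request_seen_before_crash(LOG, "req-42"))
print(request_seen_before_crash(LOG, "req-99"))
```

If the second kind of answer comes back for the request you thought you sent, the likeliest explanation is the one above: the command that was supposed to send it never succeeded.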

There's also a distinction, in both medicine and software, between fixing the problem at hand -- curing the patient or getting the service back online -- and pinning down exactly what happened. In my business a common course of action is to roll the production servers back to the last configuration that was known to work, then use a test setup to try to reproduce the problem without impacting the production system. The ultimate goal is a "red test" that fails with the buggy code and then passes ("goes green") with the fix.
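
To make the red test idea concrete, here is a minimal sketch using Python's standard unittest module. The two parse_port functions are hypothetical stand-ins for "the buggy code" and "the fix"; the bug itself (rejecting input with surrounding whitespace) is invented for illustration. The same test is run against both versions.

```python
import unittest


def parse_port_buggy(text):
    """Hypothetical buggy version: rejects values with surrounding whitespace."""
    if not text.isdigit():
        raise ValueError(f"bad port: {text!r}")
    return int(text)


def parse_port_fixed(text):
    """Hypothetical fix: strip whitespace before validating."""
    text = text.strip()
    if not text.isdigit():
        raise ValueError(f"bad port: {text!r}")
    return int(text)


def make_test(parse):
    """Build the same test case against whichever parser version is passed in."""

    class PortTest(unittest.TestCase):
        def test_accepts_padded_port(self):
            self.assertEqual(parse(" 8080 "), 8080)

    return PortTest


def run(parse):
    """Run the test against one parser version; True means it went green."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(make_test(parse))
    result = unittest.TestResult()
    suite.run(result)
    return result.wasSuccessful()


print("buggy:", "green" if run(parse_port_buggy) else "red")
print("fixed:", "green" if run(parse_port_fixed) else "red")
```

The point of running one test against both versions is that the test itself becomes the evidence: it is red for exactly the reason you think the server broke, and green once that reason is removed.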

In medicine, as I understand it, the work of isolating causes and developing vaccines and drugs similarly goes on in a laboratory environment until everyone is quite certain that the proposed treatment will be safe, and hopefully effective, in real patients. In the meantime, doctors mostly do their best with known treatments.

While Koch's postulates are fairly famous, the kind of thing you remember from high school biology years later, they're not actually what modern medicine goes by, just as modern economists don't consult The Wealth of Nations, influential though it was. One more modern approach can be found in Hill's criteria, a set of nine criteria for determining if a given cause is responsible for a given effect, but there are many other, more recent paradigms.

Notably, Hill's criteria and their modern cousins are not nearly so crisp as Koch's postulates. The very name "postulate" suggests that you can obtain a rigorous proof, while "criteria" suggests something more indirect: if you don't meet the criteria, then you don't have cause and effect. The criteria themselves are of the form "more of this suggests causality is more likely", and the end result is an idea of the probability that something is causing something else.

As in many other areas, switching from a yes/no answer to a probability solves a lot of problems, particularly the problem of gray areas where there are reasons to say either yes or no. It does so, however, at the cost of being able to say for certain "X caused Y". In my world, you very often can say with confidence "The tweaks we made to parsing in change number 271828 caused the server to reject these requests", but in my world we have a high degree of control of the system. I can roll the server back to just before change number 271828, or forward to just after it, and run it in a test environment where I can control exactly what data it's trying to parse (or just write a "unit test" that exercises the problem code directly without spinning up a server).

In the field of medicine, and much of the scientific world, however, that's generally not the case. If we're trying to determine whether eating carrots for years causes freckles, we can't really make people eat carrots for years and count their freckles every week for the duration. Medicine doesn't, and shouldn't, have the same level of control over patients as I have over a server. That is, there's less control over the possible causes.

There's typically also less certainty about the effects. Lots of people get freckles whether or not they eat carrots, so you need more subtle statistical techniques to see if there's even a correlation between carrot eating and freckle getting. This sort of thing is a major reason that it's generally not a good idea to pin too much on any single medical study, even if it's careful with the data and its interpretation.
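
As a rough illustration of what "seeing if there's even a correlation" involves, here is a sketch of one textbook technique, a two-proportion z-test, using only the standard library. The counts are invented for the carrot example and come from no study, and real studies generally use more careful methods than this. Even a clear result here would show correlation only, not cause.

```python
import math


def two_proportion_z(hits_a, n_a, hits_b, n_b):
    """Return (difference in rates, z statistic, two-sided p-value)."""
    p_a, p_b = hits_a / n_a, hits_b / n_b
    pooled = (hits_a + hits_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided tail probability of a standard normal beyond |z|.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return p_a - p_b, z, p_value


# INVENTED counts: 320 of 1000 carrot eaters have freckles,
# versus 300 of 1000 people who don't eat carrots.
diff, z, p_value = two_proportion_z(320, 1000, 300, 1000)
print(f"difference in rates: {diff:.3f}, z = {z:.2f}, p = {p_value:.2f}")
```

With these made-up numbers the carrot eaters are two percentage points more freckled, yet a gap that size would turn up by chance fairly often in groups of a thousand, which is the sort of thing that makes a single study weak evidence.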

Nonetheless, medicine advances. Some of this is because research has pinned down some causes and effects to a good degree of certainty. There's no doubt that vaccines were effective against smallpox and continue to be effective against other diseases, albeit not always perfectly. There's no doubt that antibiotics can be effective against bacterial infections, or that some bacteria have evolved defenses against them. There's no realistic doubt that smoking causes a number of ill effects, up to and including lung cancer and emphysema.

But medicine is useful even in the absence of certainty. If there's an 80% chance you've got condition X and the treatment is 90% effective with a 5% chance of major side effects, and condition X is 95% fatal if left untreated, you should probably go for the treatment. If the treatment is 5% effective with a 90% chance of major side effects, and condition X is almost never fatal, you probably don't want to. You don't need absolute certainty to make a decision like this, or even to know the exact causes and effects involved.
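
The arithmetic behind that judgment is short enough to write down. This sketch assumes that a treatment which fails leaves the odds of dying unchanged. For the second scenario it also assumes that "almost never fatal" means 1% and that the 80% chance of having the condition carries over; both of those numbers are my assumptions for illustration, not part of the example above.

```python
def death_risk(p_condition, p_fatal_untreated, p_treatment_works=0.0):
    """Chance of dying from the condition, assuming a failed treatment changes nothing."""
    return p_condition * (1 - p_treatment_works) * p_fatal_untreated


# First scenario, using the numbers from the text.
untreated_1 = death_risk(0.80, 0.95)
treated_1 = death_risk(0.80, 0.95, p_treatment_works=0.90)
side_effects_1 = 0.05

# Second scenario. ASSUMPTIONS: "almost never fatal" taken as 1%,
# and the 80% chance of having the condition carried over.
untreated_2 = death_risk(0.80, 0.01)
treated_2 = death_risk(0.80, 0.01, p_treatment_works=0.05)
side_effects_2 = 0.90

for label, untreated, treated, side in [
    ("scenario 1", untreated_1, treated_1, side_effects_1),
    ("scenario 2", untreated_2, treated_2, side_effects_2),
]:
    print(
        f"{label}: risk of death {untreated:.4f} untreated, "
        f"{treated:.4f} treated, at a {side:.0%} chance of major side effects"
    )
```

In the first scenario treatment cuts the risk of death from about 76% to about 7.6% for a 5% chance of side effects. In the second, under the assumptions above, it shaves the risk from 0.8% to 0.76% for a 90% chance of side effects. Neither conclusion needed certainty about the diagnosis.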

Saturday, October 6, 2018
