Epic Fail #5: Not writing fast-fail code

For distributed systems class recently, we had to write a simulated distributed network in Erlang.

The assignment involved getting a whole bunch of Erlang processes (which are basically really lightweight threads) to send messages to each other according to a specification. The approach was a bit convoluted, but intentionally so: it produced a model that demonstrated an important concept in distributed systems without the additional complexity of socket programming and networking.

I ended up with a bug that wasted an embarrassing number of hours and had me slowly stepping through a very substantial portion of my program’s functionality in reverse. The entire program culminated in one process producing a small amount of output. Since this was a simulation of a distributed system of several concurrently running nodes, that meant tracing through a web of message-passing.

A lot of the time wasted during debugging was actually due to poor assumptions I made about how the system worked. That’s an entirely separate epic fail on its own (see the appendix section), but that’s not the focus of this particular post.

It turned out that towards the beginning of the entire message exchange, I made a one-character typo. This would’ve been caught instantly with good programming practices.

Disclaimer

I’m kinda a super-noob at Erlang and functional programming in general. I literally crash-coursed it over the course of an hour. If I’ve completely missed the mark on something, feel free to yell at me at [email protected].

The Problematic Code

The erroneous Erlang code took the following form:

if
    Foo =:= commited ->
        doSomething();
    true ->
        doSomethingElse()
end,

This is basically equivalent to the following pseudocode if-else:

IF (Foo == commited) THEN
    doSomething()
ELSE
    doSomethingElse()
END IF

The Foo variable was only expected to hold one of two possible atom values: committed or abort. Because of the typo, doSomethingElse() was always executed, even when doSomething() was meant to be executed instead.

If you’re not familiar with Erlang’s atom type, you can think of atoms as members of one big global enum. They are used in place of things like booleans, magic number constants, and the enum types seen in other languages. For example, instead of a boolean type, Erlang uses the atoms true and false.

From the programmer’s perspective, atom literals can look somewhat like string literals. Atoms don’t need to be declared anywhere, so a misspelt atom is still a perfectly valid atom. This makes them particularly susceptible to typos that only surface at runtime.

Using error-prone language constructs doesn’t have to be this painful, though. All those hours wasted debugging could’ve been saved using one simple technique: writing fast-failing code.

The Fast-Failing Solution

What I should’ve done in my assignment is the following:

case Foo of
    commited ->
        doSomething();
    abort ->
        doSomethingElse()
end,

Here, the variable Foo is pattern-matched against the atoms commited (typo left in on purpose) and abort. If no clause matches, the case expression raises an exception:

Eshell V9.3  (abort with ^G)
1> Foo = committed.
committed
2> case Foo of
2>     commited ->
2>         doSomething();
2>     abort ->
2>         doSomethingElse()
2> end.
** exception error: no case clause matching committed

With this version of the code, the error immediately makes itself known, and I would’ve fixed the typo and moved on with my life.

Another Fast-Failing Example In Python

Different languages work differently, so let’s have a look at a similar example in Python. Suppose we have a string that is expected to be either "committed" or "abort". A fast-failing solution is the following:

if foo == "commited":
    do_something()
else:
    assert foo == "abort"
    do_something_else()

Here, the assert checks that if the else block is entered, foo does indeed hold the expected value. This would’ve caught the typo in much the same way as our Erlang solution:

>>> foo = "committed"
>>> if foo == "commited":
...     do_something()
... else:
...     assert foo == "abort"
...     do_something_else()
...
Traceback (most recent call last):
  File "<stdin>", line 4, in <module>
AssertionError

However, assert statements are stripped when Python runs with the -O flag, so if you absolutely have to raise an exception in production, check the second value in an elif block and raise an exception in the else block:

if foo == "commited":
    do_something()
elif foo == "abort":
    do_something_else()
else:
    raise ValueError("Unexpected value for foo.")
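
Another option, sketched here as an assumption rather than anything from the original assignment, is to turn the raw string into an enum.Enum member as soon as it arrives. The names Outcome and handle are hypothetical, and the return values merely stand in for do_something() and do_something_else(). A misspelt string then fails at the conversion, before any branching happens:

```python
from enum import Enum


class Outcome(Enum):
    # Hypothetical enum standing in for the two expected values.
    COMMITTED = "committed"
    ABORT = "abort"


def handle(raw):
    # Looking up an Enum by value raises ValueError for anything
    # unexpected, so a bad string is rejected before we branch on it.
    outcome = Outcome(raw)
    if outcome is Outcome.COMMITTED:
        return "did something"
    elif outcome is Outcome.ABORT:
        return "did something else"
    else:
        # Unreachable today; guards against a member added later.
        raise ValueError(f"Unhandled outcome: {outcome!r}")
```

With this sketch, handle("commited") raises a ValueError at the Outcome(raw) line. A typo in the code itself, such as Outcome.COMMITED, raises an AttributeError as soon as that line runs instead of silently comparing unequal.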

Taking This Further

If you haven’t been writing fast-failing code before, I hope I’ve convinced you to start now. However, fast-fail goes so much deeper than the scope of this post.

For example, asserts are highly versatile in verifying the state of your program at specific points in the code while also usefully doubling as implicit documentation. The following example shows asserts being used to check function preconditions:

def frobnicate(x, y):
    assert x < 10
    assert isinstance(y, str) and ("frob" in y)
    ...

Fast-fail is also desirable in production, though, and asserts are usually not checked there. There are many different practices for raising and handling errors, especially for critical applications such as aeronautics and distributed infrastructure. After all, you wouldn’t want your million-dollar aircraft exploding shortly after launch due to a missing semicolon, or your entire network of servers failing just because one server threw an error.
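
As one possible illustration, the frobnicate preconditions can be written as explicit checks that keep running under -O. This is only a sketch: the choice of TypeError and ValueError is my assumption, and the function body is a placeholder because the post never says what frobnicate actually does.

```python
def frobnicate(x, y):
    # Explicit checks are ordinary code, so `python -O` does not strip them.
    if not isinstance(y, str):
        raise TypeError(f"y must be a str, got {type(y).__name__}")
    if not x < 10:
        raise ValueError(f"x must be less than 10, got {x!r}")
    if "frob" not in y:
        raise ValueError(f"y must contain 'frob', got {y!r}")
    # Placeholder body (assumption): just hand the validated inputs back.
    return (x, y)
```

The trade-off is verbosity: each assert becomes a few lines, but the error messages can name the offending value, which helps when the failure shows up in a log rather than at a terminal.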

And of course, I haven’t even touched the ultimate in fast-fail: compile-time errors (as opposed to runtime errors), and statically typed languages (as opposed to dynamically typed languages).

Appendix: Other Comments

You can certainly still use the Erlang if-statement and fail fast, like so:

Eshell V9.3  (abort with ^G)
1> Foo = committed.
committed
2> if
2>     Foo =:= commited ->
2>         doSomething();
2>     Foo =:= abort ->
2>         doSomethingElse()
2> end.
** exception error: no true branch found when evaluating an if expression

However, I personally wouldn’t recommend it for this particular use-case (or even the majority of use-cases) since the case statement here is much clearer, more concise, and less error-prone to write.

It should of course also be noted that this is a quirk specific to Erlang. Python would just skip right over the if-statement:

>>> def f(x):
...     if x == "commited":
...         print("bar")
...     elif x == "abort":
...         print("baz")
...     print("foo")
...
>>> f("commited")
bar
foo
>>> f("committed")
foo

Be sure to learn the behaviour of your particular language when trying to write fail-fast code.

Also, Python has historically lacked such a case statement (the match statement only arrived in Python 3.10), though techniques such as the use of a dictionary may be used if it makes things clearer. The Python dictionary also usefully throws an error if no such key exists:

>>> def f(x):
...     {
...         "commited": lambda : print("bar"),
...         "abort":    lambda : print("baz")
...     }[x]()
...     print("foo")
...
>>> f("commited")
bar
foo
>>> f("committed")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 5, in f
KeyError: 'committed'
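
If a bare KeyError feels too terse, the lookup can be wrapped in a small helper. The following is a minimal sketch, with dispatch and handlers as hypothetical names, that reports the valid keys alongside the bad one:

```python
def dispatch(handlers, key):
    # Only the lookup sits inside the try block, so a KeyError raised
    # by the handler itself is not mistaken for an unknown key.
    try:
        handler = handlers[key]
    except KeyError:
        valid = ", ".join(sorted(handlers))
        raise ValueError(
            f"Unexpected value {key!r}; expected one of: {valid}"
        ) from None
    return handler()


# Hypothetical handler table; the lambdas stand in for real work.
handlers = {
    "committed": lambda: "bar",
    "abort": lambda: "baz",
}
```

Here dispatch(handlers, "commited") raises a ValueError whose message names both the misspelt value and the two valid keys. Note that reaching for handlers.get(key, some_default) instead would quietly bring back the catch-all behaviour that caused the original bug.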

Appendix: Epic Fail Extra

I did mention that the way I went about debugging my assignment was its own epic fail, but it’s not particularly deserving of its own separate post so I’ll summarise it here in case it’s of interest to anyone. As a warning though, this explanation might be a bit abstract, and I really don’t blame you if you don’t get it. I’m writing this here for completeness.

Each node in my simulated network potentially spawns a bunch of processes (which I will call “mini processes” for convenience) which it kills only if the network decides to return “abort”. If the network decides to return “committed” instead, the network doesn’t touch them.

These mini processes in my test case receive messages from a controller process after the network makes its committed/abort response.

However, I was finding that despite the network returning “committed”, the messages from the controller process didn’t seem to reach the mini-processes. Or perhaps more accurately, when I put print-statements in the mini-processes, expecting them to all print messages to the terminal, nothing was printed.

Not knowing any better at the time, I focused intensely on why messages sent from one process may not reach a target process.

Maybe the process identifiers used to send the messages are actually subtly wrong, thus the messages are being sent to the wrong place?
Maybe the test case kills all processes too quickly, thus not allowing the mini processes to flush their write buffers?
Maybe it’s because the processes are distant relatives of each other?
- Processes are created by a “parent process” spawning a “child process”.
- I was considering that perhaps processes that are sufficiently distant relatives of each other might not be allowed to send messages.
- On the other hand, perhaps closely related processes such as a parent and child, or two children of the same parent, may be close enough to send messages to each other.
I was also considering that maybe I just haven’t learnt enough about Erlang.

It took me far too long to realize that the bug detailed in this post caused all nodes to instantly kill all mini-processes, since the “abort” code was always executed. The mini-processes were always killed before they could receive the controller’s messages.

Always challenge assumptions, kids.