Hunting the White Whale

Woodcut of five whalers in a boat, with a harpoon, hunting a white sperm whale that has breached. A scene from Herman Melville's Moby-Dick.

Stuart Criley, MBA

Chief Operating Officer of Indelible Learning

Software bugs are a fact of coding. The more a program attempts to do, the more opportunity there is for error. The sources of error are legion, but largely stem from the problem that much of the error-prone code is written by hand, by artisans. The industry is slowly moving toward standardized code that is reliable and interchangeable, but the industrial revolution that transformed manufacturing is still on the horizon for software coding. Generative AI, trained on mountains of human-written code, is becoming more and more capable, and may accelerate trends toward compact, reusable code that can be assembled rapidly into quality software. Google claims up to a quarter of their new code is generated by AI.

But what remains, for human-generated or machine-generated code, are bugs. These are errors in the code that cause unexpected or undesirable behavior.

Bugs that cause a program to crash immediately are paradoxically the easiest kind to find and fix. They fail immediately. The human (or machine-learning assistant) only needs to keep tweaking the code until the thing starts to run.

The more pernicious bugs are intermittent. The code runs fine in your hands, but occasionally something fails. Here, the cause of failure may not be obvious, and that makes it harder to test whether a proposed change in the code actually fixes the problem.

Then there are the white whales, those elusive bugs that cascade into disaster, that your users see, but you do not. They can sink your ship, or worse, the ships of your customers. They roam under the surface, undetected. Because you were not there to witness the crash, you have little idea what the trigger was.

We have been hunting a white whale for the better part of a year.

We named it the ghost bug. In Election Lab, two or more players join a virtual room, and are paired up into 1-on-1 games. The ghost bug would occur when, at some point, a player would find that his opponent had disappeared. At its most destructive, this bug would leave players stranded in a game they could not leave. The host of the virtual room, usually a teacher, would normally be able to kill a game and return the players to the lobby. But not games with a ghost. These players were stuck. The games could not be killed.

We have lopsided pictures of screens, taken with smartphones, from when this happened in classrooms. On Halloween, we saw it firsthand. But in a classroom full of players, we could not find the two participating players in time to discover how it was triggered.

With these ingredients in my head, but not the specific steps it took to reproduce the bug, I met with my partner in bug hunting for a long video call. We played each other, and tried to find the trigger.

One ingredient that sat in my mind was an ambiguity at the end of the game. We prompt the player to identify the strategy they used, from a lineup of possible deployments. Only one of the deployments is the correct one. It is a clever way to reinforce the learning, and a nudge to players who simply button mash their way through the game. Thoughtful deployment of resources is a key component of winning, both in our game and in the Electoral College in actual elections.

The ambiguity lay in the user interface. It was not obvious that this task (identifying the player’s strategy) needed to be done. But until it was done, the player could not return to the lobby to play another round.

So our bug hunting focused on this end-game modal. But how to create a ghost match? Who would ghost whom?

We had four browser tabs open: the room host, Alice, Bob, and Charlie. It took all three players to trigger the bug. The first two players would match, play, and finish a game. Alice would correctly press the top button on the modal, and answer the quiz about the proper strategy. Bob would wait at the modal, modeling the many students who were honestly confused about what to do next, because the user interface did not prompt them.

Looking at the lobby from the room host’s perspective, we made an important discovery: as far as the game was concerned, both Alice and Bob were back in the lobby, ready to match into a new game. As soon as either player finished the quiz and pressed the back to lobby button, the game made both players available for a new game. But Bob had no idea, because he was still on the end-game modal.

We then had the host match the hapless Bob into a new game with Charlie. Off to the next game they went.

Charlie began playing until it became Bob’s turn, and then became stuck. Bob was nowhere to be found, because he had no idea that a new game had started. He would only find out once he completed the quiz and pressed back to lobby; only then would a modal pop up announcing the new game with Charlie. Since he had done neither, the modal was not triggered, and Bob remained ignorant of the new game he was in.

Frustrated, Charlie would now quit the game. Depending on how he quit (return to lobby, browser back button, or close browser), the game would now ghost Bob.

From the host’s point of view, Charlie could be seen playing a ghost: Charlie was in a game, but his opponent was a blank name. Then, when Charlie quit, his name would also disappear, and we had a double ghost match. In other words, the game still existed, but with no players.
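The first half of that sequence, up to the point where Bob is matched into a game he cannot see, can be sketched as a toy model. This is only an illustration of the steps described above, not the game's actual code: the `Room` and `Player` classes, the screen names, and every method here are hypothetical. The model keeps the server's view (who is available, which games exist) separate from what each browser is showing, which is where the two drifted apart.

```python
from dataclasses import dataclass, field


@dataclass
class Player:
    name: str
    screen: str = "lobby"  # what this player's browser is showing (assumed names)


@dataclass
class Room:
    """Hypothetical server-side view of a virtual room."""
    available: set = field(default_factory=set)  # names the host can match
    games: list = field(default_factory=list)    # active games as name pairs

    def join(self, player):
        self.available.add(player.name)

    def match(self, a, b):
        # The host pairs two players the server lists as available.
        assert a.name in self.available and b.name in self.available
        self.available -= {a.name, b.name}
        self.games.append((a.name, b.name))
        for p in (a, b):
            # The new-game announcement only appears on the lobby screen.
            if p.screen == "lobby":
                p.screen = "game"

    def finish_game(self, a, b):
        for p in (a, b):
            p.screen = "end_game_modal"

    def return_to_lobby(self, player):
        # Behavior as described: one player's exit frees BOTH players.
        game = next(g for g in self.games if player.name in g)
        self.games.remove(game)
        self.available |= set(game)
        player.screen = "lobby"


room = Room()
alice, bob, charlie = Player("Alice"), Player("Bob"), Player("Charlie")
for p in (alice, bob, charlie):
    room.join(p)

room.match(alice, bob)
room.finish_game(alice, bob)
room.return_to_lobby(alice)   # Alice answers the quiz and leaves
room.match(bob, charlie)      # host pairs Bob, who never left the modal

bob_in_game = any("Bob" in g for g in room.games)
print(bob_in_game, bob.screen, charlie.screen)
```

In this model the server holds Bob in a game with Charlie while Bob's screen still shows the end-game modal. A scripted sequence like this is also the kind of check that could serve as a regression test once the trigger is known.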

We looked at the code behind the game, specifically when a game ended. The first discovery was there were two mentions in the code of ending a game: “game end” and “end game.” You can guess that confusing “game end” with “end game” in code would cause trouble. The second discovery was that once a player ended the quiz, and pressed return to lobby, the code indeed ended the match for both players, and returned them both to the lobby. This bit of code was the root cause: Alice, who had finished the quiz, was ready for a new game. But Bob, who had not, should not be returned to the lobby, because he was still busy.
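One way to express the intended behavior is to release players from a finished match one at a time, so that only the player who pressed return to lobby becomes available. The sketch below is an assumption about what such a fix could look like, not the change we actually shipped; the `Match` class, `release`, and `available_for_matching` are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class Match:
    """Hypothetical record of a finished 1-on-1 game awaiting player exits."""
    players: tuple
    released: set = field(default_factory=set)  # players who pressed return to lobby

    def release(self, name):
        """Return only `name` to the lobby; report whether everyone has left."""
        if name not in self.players:
            raise ValueError(f"{name} is not in this match")
        self.released.add(name)
        return self.released == set(self.players)


def available_for_matching(match):
    # Only players who left on their own are offered to the host.
    return sorted(match.released)


match = Match(players=("Alice", "Bob"))
match_closed = match.release("Alice")
after_alice = available_for_matching(match)   # Bob is still on the quiz modal
match_closed_after_bob = match.release("Bob")
after_bob = available_for_matching(match)
```

Under this scheme the host could not pair Bob with Charlie until Bob had finished the quiz and left on his own, so the mismatch between the lobby and Bob's screen would not arise.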

These scenarios explained ghost matches, double-ghost matches, and stuck players. On the host’s dashboard, a game with a stuck player could not be killed. All of these problems had a single root cause.

Finding the trigger, which had eluded our team for months, and finding the culprit code, was our white whale.

Four critical elements led to the discovery of this white whale, and to hunting it down:

  1. Customers found the bug first. It escaped our quality assurance testing.
  2. We had screenshots of what it looked like from the host’s lobby, showing a player matched against a blank name (a single ghost match), and showing two blank names playing (a double ghost match).
  3. In the classroom, we witnessed players who were confused by the end-game modal. This confusing user interface was an important clue.
  4. Also in the classroom, we tried ourselves to kill the single- and double-ghost matches from the dashboard, and could not.

These clues were the ingredients to the trigger. But the recipe, the specific steps to triggering the bug, was discovered through patient investigation and dialogue.

And we survived to tell the tale.