The Problem
Have you ever seen a message like this?
You think, “Hey, thanks for telling me! I’ll be watching out for that ‘something’ in the future.”
I was intimately involved with writing software my entire career, and error handling when something goes wrong was an inescapable part of it. When you’re writing code, there are always branches of the code that “should” never be executed, as far as you can tell. There’s no rational reason why the computer should ever be there.
Yet here you are, writing code for when it is there. What should you write? I’m going to explain why the result that you, the user, see is not only still terrible after all these years, it’s actually getting worse. But unlike the generic, “Error messages are so useless!” article (not that they aren’t), I’m going to explain why they are.
Historical Context
The Olden Days: IBM “Messages and Codes”
Let’s say you’re a programmer on an IBM mainframe (yes, those still exist and still do important work), and you get a message that just says “DLZ0041.”
But being a longtime mainframer, you know to pull up your “Messages and Codes” book, and look up DLZ0041, and find this:
IBM takes years to create a new release of its mainframe software. They have enough time to get a book like this together and make sure all the programmers know to log a DLZ0041 message when that condition happens. If a programmer needs a new error code not in the book, he or she has a process to get one approved and added. All this takes a lot of time, but the giant corporations who buy this software expect a book like this, and pay through the nose for it.
Of course, as a regular user, you have no idea what DLZ0041 means, nor should you. But what if instead of “Something went wrong” the computer said:
Error DLZ0041: Contact Support
and “Support” was a link that told IBM what software was running, what you did to trigger the error, and any other relevant information, along with “DLZ0041”? Some person or program receives that message, passes it on to Engineering, and hopefully the problem gets fixed. Someday.
IBM used to be dedicated to doing The Right Thing no matter what it cost, and it cost a lot. The right thing is: tell the user what happened and be sure the right engineer knows it happened, and fixes it.
Why It Persists
Where Did This Message Come From?
Here are two bad examples, and I’ll explain how they might have happened.
"An error has occurred. Error: error." and the similar
"Cannot find the file 'null'."
What are you, the user, supposed to do with that? Even if you go to the Support page (assuming you even find one), it’s not going to be much help.
Raising an Exception
Remember I said:
When you’re writing code, there are always branches of the code that “should” never be executed. …There’s no rational reason why the computer should ever be there.
Yet here you are, writing code for when it is there.
You’re a programmer, sitting at your keyboard, deep into the logic for how the program should work, and yet this annoying problem has to be handled. You have to do something about it. So you “raise an exception” which means, “get out of here and make it someone else’s problem!” Now you can go back to the interesting part, the “normal” logic.
A Techie Diversion
This explanation from Tufts University is a little programmer-ish, so please bear with me, and I’ll try to explain things. It will be over quickly.
def process_file(name):
f = open(fname, 'rb')
...read data from open file f here...
try:
process_file(fname)
except IOError:
print "Could not read file:", fname
sys.exit()
The part after “except IOError:
” is the error-handling for the IOError
exception. In this case, the program prints out the name of the file it couldn’t read, and stops (sys.exit()
). Note that this error is only handled because of the except
clause.
In that example, though, if the error was anything but IOError
, it would propagates up to the procedure that called process_file
, giving it a chance to handle it. If it doesn’t, then the error goes to the procedure that called that, and so on. Eventually, it should hit some chunk of code that knows what the problem was, what to do about it, and what to say to the human. Emphasize that word “should.”
Back to the Human Problem
If you go to StackOverflow (which used to be the programmer’s bible but is now circling the drain with the advent of AI), and query “exception handling” you get 160,343 results. It’s an immense technical problem and has provided full employment for generations of programmers. Programmers nearly always treat it as a purely technical problem, but it isn’t.
How It Gets Handled
Some programmer has to do something about that exception. Frequently it’s a different person than the one who encountered it. Maybe that person does something intelligent with it. Or maybe he doesn’t.
Frequently you get that message I showed at the beginning: “Something went wrong.” Because that’s all that person knows. But what should he do?
Solutions
It’s Not Complicated
As long as software has existed, it’s displayed error messages. They’ve nearly always been terrible, and the principles for making them not terrible have been known. Here’s a book that was first published in 1987, 38 years ago. Although Dr. Shneiderman is mainly imagining an audience of technical people, his suggestions are completely applicable to a modern Web interface.
In this book, Dr. Shneiderman lays out these guidelines:
Product
Be as specific and precise as possible
Be constructive: indicate what needs to be done
Use a positive tone: avoid condemnation
Choose user-centered phrasing
Consider multiple levels of messages
Keep consistent grammatical form, terminology, and abbreviations
Keep consistent visual format and placement
Process
Establish a message quality control group
Include messages in the design phase
Place all messages in a file
Review messages during development
Attempt to eliminate the need for messages
Carry out acceptance tests
Collect frequency data for each message
Review and revise messages over time
Now let’s look again at those two bad examples:
"An error has occurred. Error: error."
Violates all the first four suggestions under “Product.”
"Cannot find the file 'null'."
Same.
Technical Writers
Now you’re probably wondering: don’t they have someone writing those messages who can communicate?
Yes, many companies do, and they’re called “tech writers.” Here’s how those two horrible examples might have made it into the final product despite the best efforts of those talented writers:
I’ve been retired for a few years. If you’re reading this and you’re more current, please correct any anachronisms in the comments.
1) Tech writer writes customer documentation
A qualified technical writer reads the internal docs, talks to the Engineering head, and turns out a draft of the manual and/or online help. Their work gets circulated to the Product Management and Engineering teams, reviewed for correctness, and revised accordingly.
2) Sometimes error messages get the proper attention …
There are certain parts of any product that everyone just knows are important: the get-started and login dialogs, especially. I’m going to share one that I was personally involved with, and which you can still try on Google Maps! (I’m not the hero in this story, in case you’re wondering.)
In this article, I explain a feature I worked on, which you can still try. You can upload a spreadsheet or CSV file of all your places, and My Maps will make them into pins on your very own private map. It does this by converting all your addresses into latitude - longitude pairs (“lat-long” in shorthand).
The interesting thing about this is the process Google went through to make sure users could figure out how to do this.
I’m not claiming Google always does this, by the way. Frequently their user interface is terrible.
The user provides a list of addresses, and some will work while others will not. How does the user even start the process? If it didn’t quite work, how should they be told about it?
It wouldn’t be surprising if some websites just told you, “37 out of 38 of your addresses worked” and that’s all. Or maybe they just give you an error if ANY of the addresses failed. Or put up an error box. There are lots of ways to do this wrong.
We had a user-interface specialist, and he would bring in groups of non-Google users, pay them a nominal fee, and ask them to use the product, speaking their thoughts out loud and being videotaped.
This is the Cadillac version of user testing. As a developer, you think you know what will be intelligible, but you need to be humble and willing to admit you were wrong. You can’t judge that for yourself.
We tried three or four ways of doing this, and none were really satisfactory. The one you see now (see it in the post) was something they came up with after I left. If you think, “Of course. That’s obvious!”: well, everything is obvious once you see it.
Anyhow, sometimes the right people get involved and the results are good. But what happens to the errors the programmers discover after the help page is done? Then you might get messages like this:
"This file is empty and will be deleted. The file cannot be saved because it is empty."
"Uninstallation is complete. You must install everything what was removed before you can use it again."
"This may take a some minutes – DO NOT TURN OFF YOUR XBOX..."
We shouldn’t make fun of non-English speakers for their English. I can’t speak their language, either. Nonetheless, if the audience speaks English, you need a writer who does, too.
3) Some times they don’t
Quite often at a previous job, the tech writers would push back against any suggestion that they review error messages or include them, with troubleshooting help, in the manual. Their objections might be:
“We have a 100-page limit on the manual. That would push it over.”
“It looks bad to emphasize all the bad things that might happen.” Product managers are especially prone to think this way.
“The manual is done and off at the printers. Sorry.” (Nowadays, it would be “The Help pages have been incorporated into Production. It’s too late to make changes.”)
4) What if it’s a bug? Two possibilities.
Sometimes, the software fails spectacularly. You might be tempted to blame yourself, but it’s not your fault; it’s theirs.
It might crash so that the system notices it, and you get something like this:
How many times have you filled out one of these and sent it in? I’ll wait while you tally them up.
If I thought I’d get a prompt reply from an engineer inquiring how to reproduce this, because he was offended that a bug in his code reached the public, then I might. If I knew it would just be an automated “thank you”, then probably not.
Other times: I put this in to balance out the pro-Google story above. In one job at Google, I was using Google Drive almost all day every day, and very often it would do something wrong: not show a file I’d just uploaded, freeze up, etc. Most people in the department would just ignore a bug like that, figuring there was no point in reporting them. Occasionally I’d file a bug. Invariably, it would come back “could not reproduce.”
It’s true: as a programmer I’d often get bug reports that I couldn’t reproduce. But the thing to do is try, not give up. You question the person who reported it: what were you doing? has this happened before? You try to imagine what could have caused it and put in debugging or logging code to try to catch it. Sometimes I’d spend days with a cooperative user, trying to nail it down. They’d be my partner for a few days. Good times.
Technology to the Rescue (not)
There are technical “solutions” to these problems. Here’s a comparison of two of them. Or maybe artificial intelligence will do it! New Relic and Maze are two of the products exemplifying this approach.
Ultimately it’s a human and economic problem, though, not a technological problem. If you don’t care about quality, you won’t achieve it.
So It’s Deliberate
I think I’ve demonstrated here why the error-handling in modern software is so bad: it’s because the app makers want it that way. Or more accurately, they don’t want to spend the time and money to fix it. There is nothing in modern software that makes it more difficult than it was in the IBM mainframe days; in fact, IBM didn’t have all those modern tools I just mentioned, which have more than kept up with the pace of change.
What’s changed is the economics, which features competitors who’ll get to market faster than you do and charge less if you make the effort to get it right.
I've seen code like this:
```
def process_file(fname):
try:
f = open(fname, 'rb')
process_opened_file(f)
close(f)
except IOError as e:
print('Could not read file:', fname)
sys.exit()
except:
pass
```
(for the non-technical: this outputs a message if there's a problem opening or reading the file; and silently keeps going for any other exception)
The person who wrote this was lauded for how quickly they cranked out code that (mostly) worked. I was criticized for filing 300 bugs against that code when I had to support it at a customer's site.
[This code doesn't even output what went wrong with opening or reading the file; providing that extra information would take less than 10 more keystrokes]
I am curious if there is an error message that unpacks into like 3 or 4 tiers of error instructions.
Like, "Click here to read error message for end-user: "(e.g. laymen instructions, call this number and mention this error code, or contact the system administrator)"
"Click here to read error message for developer:" "(Boop beep boop beep jargon code)
"Click here to read error message for manager: (type passcode and contact Steve at ext. 211)"
Or "press 1,2, or 3 depending on the user"