A story about consistency, the usability of error messages (and attention to detail)
Once upon a time, the planets aligned in such a way that a MySQL cluster had to be set up, on top of a Debian server. A colleague had to handle the matter and I was there to provide a second opinion.
The story you are about to read is my account of an epic struggle and the immediate aftermath thereof.
The clustering packages were not in the repositories, so we relied on binaries obtained from the MySQL site. The web-page tricked me into creating an account, just like it did last time I had to download something from it, because the "skip registration" button was cleverly placed in my blind spot.
After a long RTFM session, the configuration files were planted at the right locations, the daemon was started, and it was good.
As the cluster consists of multiple nodes, magic had to be done on other machines too. Eventually we were ready to give the entire set-up a try.
The cluster behaved as if it had a mind of its own. I will not reveal all the details; instead I will focus on the problem itself - the thing wasn't listening on the right interface. Double-checking the configuration file brought no results - everything looked right.
Having had previous experience of errors caused by a misplaced \n, \t, comma or dot, both of us read the configuration files again, carefully. We then reviewed the sample configuration file to ensure we didn't omit any parameter. Everything was fine.
Our first guess was that the config file was not in the right place. One expects it to reside somewhere in /etc, but the package does not entirely follow the conventions of the distro, and the file is expected to be elsewhere. We double-checked it - the config file was where it was expected to be.
Despite that, the daemon acted as if the file wasn't there, applying other settings instead of the ones we chose. We re-verified everything. Within a few minutes, we realized we had a this should never happen™ situation.
if (False) { printf("this should never happen"); ... } >>> this should never happen
I reached for the gun, telling myself that someone had to pay for this. Aahahha, just kidding!
Don't we all love challenges? We turned to the collective wisdom of the planet, binging☦ the hell out of the web, but this was fruitless.
We had to go deeper. I wanted to make sure the daemon loads the configuration file from the directory it was placed in. Yes, the manual says it should be in /X, and it is there, but what if the manual is wrong? That has happened before and it will happen again!
On Windows I would fire up Process Monitor and have a look at the file-system activity, but the battlefield was a Linux server. I am familiar with dtrace on Solaris, which does something similar, so I tried that. It turned out not to be available on that Linux box. The closest alternatives are strace and ltrace, which are good enough for what I had in mind.
My intention was to find a log line that would look like:
open("/another/unexpected/directory/file.conf", O_RDONLY) = -1 ENOENT (No such file or directory)
Which means "I tried to open /another/unexpected/directory/file.conf and it does not exist". The call returns -1 and sets errno to ENOENT, which is defined as 2 somewhere in errno.h (roughly the equivalent of winerror.h on Windows) - a place for defining standard error codes. It is a good idea to follow the platform conventions and use the same error codes everywhere - consistency is a really good thing.
The point is that ltrace or strace is a low-level tool and it would show me what happens at the file-system level - then I would know exactly where the config file is loaded from. This way any inconsistency between the code and the documentation would not affect me.
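To make the error-code convention concrete, here is a minimal Python sketch that tries to open a missing file and reports the result in the style of a trace line. The helper name probe is made up for this illustration, and the output only imitates what a tracing tool prints; it is not produced by strace itself.

```python
import errno
import os


def probe(path):
    """Try to open path read-only and describe the outcome like a trace line.

    probe is a hypothetical helper for this post, not part of any tool.
    """
    try:
        fd = os.open(path, os.O_RDONLY)
    except OSError as exc:
        # exc.errno holds the numeric code; errno.errorcode maps it to its name
        name = errno.errorcode.get(exc.errno, "?")
        return f'open("{path}", O_RDONLY) = -1 {name} ({os.strerror(exc.errno)})'
    os.close(fd)
    return f'open("{path}", O_RDONLY) = {fd}'


if __name__ == "__main__":
    print(errno.ENOENT)  # the numeric value behind the symbolic name
    print(probe("/another/unexpected/directory/file.conf"))
```

On a successful open the number after the equals sign is a file descriptor, not an error code, which is why the failure case is reported as -1 plus the symbolic name.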
***
I did not have to run the trace tool though, because the mistake was spotted with the naked eye, when the stars smiled upon us and their light hit the screen at a perfect angle! MySQL's config file is called "my.cnf", whereas the one we created was called "my.conf". Yep, that's it, all of this was happening because of the 'o'. The file wasn't there, because it wasn't.
Is it a human factor problem? Yes... But it could have been prevented by the developers:
- follow conventions☩ and use the same extension as everybody else does in this part of the galaxy
- don't be silent about a missing configuration file; quietly loading default values will backfire
- design the installer such that it creates a default configuration file if it is not there already
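The second point can be sketched in a few lines of Python. This is only an illustration of the idea, not how MySQL locates its option files: the names locate_config and ConfigNotFound are hypothetical, and the "did you mean" hint assumes that a near-miss file sits in the same directory as the expected one.

```python
import difflib
from pathlib import Path


class ConfigNotFound(Exception):
    """Raised instead of silently falling back to default settings."""


def locate_config(directory, expected_name="my.cnf"):
    """Return the path of the config file, or fail loudly with a hint."""
    directory = Path(directory)
    candidate = directory / expected_name
    if candidate.is_file():
        return candidate

    # Collect the file names that are present, to spot near-misses like my.conf
    present = []
    if directory.is_dir():
        present = [p.name for p in directory.iterdir() if p.is_file()]
    close = difflib.get_close_matches(expected_name, present, n=3, cutoff=0.6)

    message = f"configuration file not found: {candidate}"
    if close:
        message += f" (found similar: {', '.join(close)}; expected {expected_name})"
    raise ConfigNotFound(message)
```

With a message like that in the log, the stray 'o' would have been a matter of seconds rather than an afternoon.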
The lessons:
- Pay attention to teh details™!
- A system that can fail horribly because of a misplaced parenthesis is a bad system.
my.cnf my.cnf my.cnf my.cnf my.cnf my.cnf my.cnf my.cnf my.cnf my.cnf my.cnf my.cnf my.cnf my.cnf my.cnf my.cnf my.cnf my.cnf my.cnf my.cnf my.cnf my.cnf my.cnf my.cnf my.cnf my.cnf my.cnf my.cnf my.cnf my.cnf my.cnf my.cnf my.cnf my.cnf my.cnf my.cnf my.cnf my.cnf my.cnf my.cnf my.cnf my.cnf my.cnf my.cmf my.cnf my.cnf my.cnf my.cnf my.cnf my.cnf my.cnf my.cnf my.cnf my.cnf my.cnf my.cnf my.cnf my.cnf my.cnf my.cnf my.cnf my.cnf ny.cnf my.cnf my.cnf my.cnf my.cnf my.cnf my.cnf my.cnf my.cnf my.cnf my.cnf my.cnf my.cnf my.cnf my.cnf my.cnf my.cnf my.cnf my.cnf my.cnf my.cnf my.cnf my.cnf my.cnf my.cnf my.cnf my.cnf my.cnf my.cnf my.cnf my.cnf my.cnf my.cnf my.cnf my.cnf my.cnf my.cnf my.cnf
One of the systems that I recently created runs a series of sanity checks before loading configuration files. If there is an issue, a helpful error message is displayed; it includes the path to the config file, the line where the problem is and an explanation of what went wrong. The server refuses to start if the config file has problems.
Yes, it took more time and lines of code to implement the sanity checks and the informative errors than to implement the part that actually uses those settings; but this investment will pay off in the future, reducing maintenance efforts.
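For illustration, here is a rough Python sketch of that kind of check. It is not the code of the system mentioned above: the "key = value" file format, the required setting "listen" and the names parse_config and ConfigError are all assumptions made up for the example.

```python
class ConfigError(Exception):
    """Carries the file path, the line number and the reason of the first problem."""


def parse_config(path, text, required=("listen",)):
    """Parse 'key = value' lines and refuse anything suspicious.

    path is only used in the error messages; text is the file content.
    """
    settings = {}
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comments are fine
        if "=" not in line:
            raise ConfigError(f"{path}:{lineno}: expected 'key = value', got {raw!r}")
        key, value = (part.strip() for part in line.split("=", 1))
        if not key:
            raise ConfigError(f"{path}:{lineno}: missing key before '='")
        if key in settings:
            raise ConfigError(f"{path}:{lineno}: duplicate key {key!r}")
        settings[key] = value

    # A file that parses but lacks a mandatory setting is still a broken file
    for key in required:
        if key not in settings:
            raise ConfigError(f"{path}: required setting {key!r} is missing")
    return settings
```

The caller would catch ConfigError at start-up, print the message and exit with a non-zero status instead of carrying on with defaults.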
- ☦ Sounds better than "duckduckgoing" (-:
- ☩ Some servers use .conf, others use .cfg, others use no extension at all. Another school of thought relies on .*rc files. No formal convention exists, but my gut feeling tells me that no one else uses .cnf; what does your gut say about it?
- The XDG base directory specification seems to be the spec, but the Debian server I tested it on did not have those environment variables set.
As a developer of web applications I get your point very well :-)
Good error management, sanity checks and informative logs are all a must.
A good example in the web-world is the usage of start-up listeners which run when the web application is loaded. There you can make all the checks you need, report any errors (to stdout I guess, just to be sure the messages will not be lost) and even interrupt the deployment if something is missing/broken.
Such as - a crucial configuration file.
I would go further - even if the configuration file for logging is missing or malformed, the application should not run.
As for error messages, I can't stress enough how important it is to be aware of the context. I recall www.999.md having trouble with their DB, and you got a nasty error page with the stack trace on it.
That is stuff that should be logged/emailed but not shown to the user.
The user must see only what is appropriate for him - a nice modal box with a message like this: An unexpected error occurred, please contact the site admin at: blabla
Not having a message logged/shown is as bad, or even worse.
I learned a lot about this kind of stuff, especially when I had to make web 2.0 apps.
It was hard, but I enjoyed it thoroughly.
Nice article by the way, I soaked it up in a heartbeat.