A story about attention to detail and usability
It was a regular day, like any other... Until I received an email from the person who controls an upstream VoIP server I use in my company: "We see a lot of failed connection attempts from your IP address. If this goes on, you'll end up in our blacklist and you'll have a hard time getting out of it".
This did not sound right, so I had a look at the system log:
[2012-12-21 20:34:39] NOTICE[1875] chan_sip.c: -- Registration for 'number@anotherserver.sip' timed out, trying again (Attempt #18)
...
[xxxxxxxxxxxxxxx] NOTICE[1875] chan_sip.c: -- Registration for 'number@anotherserver.sip' timed out, trying again (Attempt #BIGNUMBER)
It was filled with a bazillion of such entries, yet calls could be placed and received, the voice quality was excellent: no delays, gaps, echoes or "I hear them, but they don't hear me" effects.
"What the hell is going on?", the cat asked me with a puzzled face. I had no idea. This was a new shiny server built from scratch for the sole purpose of VoIP, to offload the old server (which provided a lot of other services too).
I must admit that the first thing I did was nothing. If it isn't broken, don't fix it, the calls work, why touch it?
But "you'll have a hard time getting out of it" echoed in my head. "Hard... hard... hard... out of it... out of it...."
Any normal person would probably try to ping the upstream server, then run traceroute to make sure the computers can talk to each other. Sure enough - they could. And that made sense, how else would I be able to make calls?
I then reviewed the port-forwarding rules on the router. As I was logging onto the box, I had flashbacks of "Asterisk behind NAT" stories and how this was one of the challenges Sysyphus was facing in Tartar. Alas... the settings were correct.
My next idea was to make a list of all the possible infrastructure changes that could have had an effect like this one. The prime suspect was the new router - because this didn't happen with the old one. But... neither did it happen after we switched the network equipment.. Hmm....
Triple-checking Asterisk's configuration files did not reveal anything unusual.
I configured a software SIP client on my own computer and used the same SIP credentials to see if I could register - yes, I could. This means that other SIP clients behind the same NAT worked fine.
"Sleep over it", the cat said, nesting herself in the geographical center of the bed. "You're right", I thought, curling around her on the small patch of land that she left available.
Beeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeep...... This is what it is like to fall asleep with tinnitus.
***
I wrote back to the upstream server administrator, explaining what I did and asking him about his best guess concerning the nature of the problem. After all, he's doing VoIP for a living- he must be dealing with such issues on a daily basis.
***
On the next day I rechecked the configuration files. I decided it was time to refresh my knowledge of the SIP protocol, so I would know what to look for when I "went deep". This method would take a lot of time, so I decided to try one more thing before doing it "the hard way" - ask a question on Serverfault; perhaps the collective wisdom of the planet will offer some hints while I'm analyzing the situation.
**Server**
- Asterisk 1.x.x
- on Linux Y
- it is registering on another Asterisk server, anotherserver.sip, owned by a VoIP provider
**Network**
- the Asterisk server is on a system running behind NAT
- the router is an ACME XX-x, running the Y firmware
- it is configured to forward TCP/UDP 5060 to the Asterisk server
- as well as a range of UDP ports for RTP (from x0000 to y0000)
**Problem**
- Asterisk's log is filled with entries like `[2012-12-21 20:34:39] NOTICE[1875] chan_sip.c: -- Registration for 'number@anotherserver.sip' timed out, trying again (Attempt #18)`
- over a period of time, the log piles up with thousands of such entries
**The weird part**
- My Asterisk server can ping anotherserver.sip, traceroute indicates that it reaches the right server, going over the right path
- Running `sip show peers` in the CLI shows me something that I interpret as "it works fine": `Out/123456 anotherserver.sip N 5060 OK (2 ms)`
- I am able to place and receive calls, the connection is fine, there are no delays or echoes
Having thought about what else I could mention to make it a reasonable and "answerable" question, I decided to try a few more things... Perhaps posting some configuration excerpts?
I SSH'ed into the server and had another look at Asterisk's sip.conf. I noticed that it was binding to an incorrect interface (in fact, this was the IP address of another server in the LAN). How? Why? This wasn't there when I checked last time, and most importantly - I can make my calls despite that?! I ran "sip show peers" again - yep, everything was fine. A mental note was made, but nothing more.
What else? The cat suggested me to telnet to port 5060 of the upstream server, maybe something weird would happen. Of course, I had to go for port 5060 UDP, so telnet would not be suitable. I knew I needed netcat (a cat's weapon of choice, I must add), but I didn't remember the command line arguments. "man nc" didn't give me the manual, instead the system said it had no idea what the hell I was talking about. That's strange, because I used it just the other day when tinkering with something else. Wait... wait a minute...
A few moments later, it hit me.
- After we migrated the Asterisk server to a new computer, the service was stopped on the old machine - but all of its configurations were left intact; such that we could switch back swiftly if the new server didn't work well.
- Everything was fine with the new server and the new router.
- But then there was a power outage and the systems were shut down.
- When power was restored, Asterisk started automatically on both servers.
- Ports were forwarded to the new server, which is why it registered and could make calls.
- The old server was online and the daemon kept sending connection requests. The upstream provider replied to them, but it never heard back - which is why they were seeing a lot of dropped connections.
My mistake was that when I ran my checks, I SSH'ed into a server and then SSH'ed through that into the new Asterisk server - that's where I saw nothing wrong with the settings, and the bind address was right.
Some time later, I SSH'ed into the server but forgot to further SSH into the new Asterisk. The server I was tunneling through happened to also be the old Asterisk server. Which is why all the configs were there and everything was fine (except the different listening interface). This is the server where the logs had a lot of "timed out" entries.
I then used sysv-rc-conf to prevent Asterisk from booting on the old server automatically. The day was saved.
Cегодня в 5:30 утра закончились неправильные запросы с вашей стороны.
Lessons learned:
- Pay attention to the details
- Pay attention to the details
- Pay attention to teh details
One would say this is a human factor issue. In fact, I anticipated such a thing, so my SSH client is configured to use different background colours for different servers - such that I get a visual hint when I think of "which server am I looking at?". I know that one server is white, another one is yellow...
.
This works great when you're in the LAN and you connect directly. But when you're away and you have to tunnel via other servers - if I SSH into A, then SSH into B via A - the window will have the colour assigned to A, rather than the one assigned to B.
I thought of this scenario in the past, but I decided not to expose the new machine's SSH port to the world, to reduce the attack surface - so I would always connect to A, then to B. I even set up a VPN server, so I could make it all look local when I am far away... Heck, I was connected to the VPN when doing these things, but I chose SSH tunnels out of habit.
If I were the author of an SSH client, I would make it possible to define colour-schemes for connections established via SSH tunnels.
p.s. yes, one should also look at the command prompt and notice the difference between "user@A" and "user@B". However, I connect to these servers very often, and both of them were used as Asterisk servers for a long time, so this detail was easily overlooked - I got used to both of them handling VoIP.
3 comments
Comment from: feasibletrash0 Visitor
Comment from: Constantin Visitor
Great story.
Here’s one suggestion: Instead of color-coding the entire screen, just color-code the “@server” part of “user@server". (Search for “bash color prompt” to find the incantations to do this.) Advantages: (1) the configuration resides on the server, not on the client, so you don’t have to (re-)configure your SSH clients; (2) this should be easier to notice at a glance, because our brains detect colored outliers in “fast O(1) mode” (pre-attentive processing); (3) this will work regardless of how many intermediate servers you SSH through.
Comment from: gr8dude Member
feasibletrash0, I did not get any replies because I didn’t post it. I understood the cause of the problem while I was explaining it to the hypothetical serverfault reader, as I was re-reading what I planned to ask.
Constantin, that is one uber-tip! I will definitely do this from now on! I’ll post an update when I have a spare moment to configure it the way you described.
Did you get any replies from serverfault? Seems like lots of fun