Sniffing the network for fun and profitPosted: September 12, 2007
Every DBA should know something about the network (and OS, and storage, and application development – good DBAs are multi-disciplinary). Probably everyone knows how to debug connectivity issues with ping and trace route, how to diagnose problems with the DNS, and how clients are connecting to the listener.
DBAs with RAC need to know much more about their network – they have virtual ips and interconnect and usually a storage network too. RAC systems can crash due to network latency problems, and the DBA may need to diagnose this. RAC also has this entire load balancing and failover thing where network addresses can move around and you need LOCAL_LISTENER and REMOTE_LISTENER parameters configured.
However, it is very rare that things get as complicated as they did this week:
On Thursday we configured a server with LOCAL_LISTENER and REMOTE_LISTENER values.
Almost immediately a customer started experiencing occasional connection failures. Naturally, we (DBAs) didn’t hear about it until it was escalated to an emergency call at Saturday afternoon. I had a date on Saturday night and didn’t want to be late due to a long debugging session, so I rolled back the change, asked the customer to send me his tnsnames.ora, hosts file and screenshots with the failure, and told them I’ll look into the issue on Monday.
Monday morning arrived sooner than I would have preferred. Their tnsnames.ora that was supposed to connect to our database actually contained an address I did not recognize. A quick meeting with the local network manager, revealed that these guys have VPN, they connect through a NAT and they also have a special firewall configuration. Remember I said that every DBA should know networks? Well, I didn’t mean NATs and VPNs. So I don’t know enough about networks, and the network manager doesn’t understand listeners and RAC, but we had to find a solution together.
It appeared that after I configured LOCAL_LISTENER values, when the listener attempted to redirect their connection to another server, it sent them an address (or maybe IP?) that their client couldn’t connect to and therefore failed. But why did everything work before we configured the LOCAL_LISTENER? According to the documentation we would still send addresses the client can’t connect to, just the local addresses instead of the vip. The network administrator had a theory that maybe the NAT translated the local address as it was sent back to the client to something the client understands, but this is really far fetched.
This is where the sniffer comes into the picture. When you have a complex setup and you have to know exactly what is the behavior – who is initiating the connection, what is the reply, where are the redirects, who closes the connection and how. The sniffer will listen to the network and give you the complete picture.
I use Ethereal, which is graphical and relatively clear and friendly, but still very powerful.
In this case, Ethereal was especially useful – using the customer network capture, we could easily see that we were debugging the wrong issue all along. The tnsnames.ora file he sent us belonged to a different machine that didn’t experience any problem. The machine that experienced the issue connected to a different ip, which no one really knew about until this time. We are still not sure how it is related.
What we do know is that if you have a complicated network configuration, that changed several times, in very inconsistent ways and that no one documented – a sniffer is your only friend.