Sniffing the network for fun and profit

Every DBA should know something about the network (and OS, and storage, and application development – good DBAs are multi-disciplinary). Probably everyone knows how to debug connectivity issues with ping and trace route, how to diagnose problems with the DNS, and how clients are connecting to the listener.

DBAs with RAC need to know much more about their network – they have virtual ips and interconnect and usually a storage network too. RAC systems can crash due to network latency problems, and the DBA may need to diagnose this. RAC also has this entire load balancing and failover thing where network addresses can move around and you need LOCAL_LISTENER and REMOTE_LISTENER parameters configured.

However, it is very rare that things get as complicated as they did this week:
On Thursday we configured a server with LOCAL_LISTENER and REMOTE_LISTENER values.
Almost immediately a customer started experiencing occasional connection failures. Naturally, we (DBAs) didn’t hear about it until it was escalated to an emergency call at Saturday afternoon. I had a date on Saturday night and didn’t want to be late due to a long debugging session, so I rolled back the change, asked the customer to send me his tnsnames.ora, hosts file and screenshots with the failure, and told them I’ll look into the issue on Monday.

Monday morning arrived sooner than I would have preferred. Their tnsnames.ora that was supposed to connect to our database actually contained an address I did not recognize. A quick meeting with the local network manager, revealed that these guys have VPN, they connect through a NAT and they also have a special firewall configuration. Remember I said that every DBA should know networks? Well, I didn’t mean NATs and VPNs. So I don’t know enough about networks, and the network manager doesn’t understand listeners and RAC, but we had to find a solution together.

It appeared that after I configured LOCAL_LISTENER values, when the listener attempted to redirect their connection to another server, it sent them an address (or maybe IP?) that their client couldn’t connect to and therefore failed. But why did everything work before we configured the LOCAL_LISTENER? According to the documentation we would still send addresses the client can’t connect to, just the local addresses instead of the vip. The network administrator had a theory that maybe the NAT translated the local address as it was sent back to the client to something the client understands, but this is really far fetched.

This is where the sniffer comes into the picture. When you have a complex setup and you have to know exactly what is the behavior – who is initiating the connection, what is the reply, where are the redirects, who closes the connection and how. The sniffer will listen to the network and give you the complete picture.

I use Ethereal, which is graphical and relatively clear and friendly, but still  very powerful.

In this case, Ethereal was especially useful – using the customer network capture, we could easily see that we were debugging the wrong issue all along. The tnsnames.ora file he sent us belonged to a different machine that didn’t experience any problem. The machine that experienced the issue connected to a different ip, which no one really knew about until this time. We are still not sure how it is related.
What we do know is that if you have a complicated network configuration, that changed several times, in very inconsistent ways and that no one documented – a sniffer is your only friend.

Advertisements

6 Comments on “Sniffing the network for fun and profit”

  1. Paul says:

    Maybe this trap that I blogged about? Setting the sqlnet.ora to enable client tracing can also be helpful.

    … but I agree, having a sniffer to hand can be a great tool for cutting to the root of a network-related problem, rather than be stuck trying to diagnose purely on the basis of symptoms.

  2. Noons says:

    Just had a similar weird problem in one of our subnets.

    A given workstation in a given subnet started to show errors when connecting to Oracle after a reboot of the db server. Other workstations on same subnet had similar problem when we looked, every other workstation was fine.

    A tnsping from the workstations involved showed everything was fine.

    I did a ping from that ws to the db server and every second packet got dropped!

    The reason tnsping didn’t show the problem is that it doesn’t negotiate a connection like a login does: it simply peeks at port 1521 and leaves it at that. So the loss of every second packet didn’t matter to it.

    On investigation, it was found that the db server had two gateway addresses configured and this was acting like a “load balancer”, splitting the traffic across each gateway. Because the follow-on ip address of one of the gateways was invalid, we lost every second packet!

    One of the weirdest problems I’ve had, so far…

  3. Please mind ethereal has been renamed wireshark.

    The knowledge noons displays is also very important to know (the server builds a connection back up in order to setup a connection to the client, with the exception of a tnsping, which acts as regular network traffic. luckily, most firewalls have specific rules for sqlnet traffic)

    I’ve had the same load balancing/double gateway problem with a DNS server (every second request was sent to void)

  4. Don Seiler says:

    There is no ethereal, only ZUUL. And, by “ZUUL”, I mean “wireshark”. Trademark issues led to a rebranding.

    http://www.wireshark.org/faq.html#q1.2

  5. prodlife says:

    Hi Paul,

    No, its not that trap. Actually, your document was a great help (especially the graphics), it was a very clear explanation of what those parameters mean. Clear enough that I could print it and give to my managers and the customer representatives.

    We actually ran into the reverse problem – the client recognized the internal IP, but not the VIP. Since there were 2 NATs, 2 firewalls and a VPN involved, it is very hard to figure out just where things need to be fixed.

  6. prodlife says:

    Hi Noons,

    Yes, TNSPing is very deceptive this way – it often works when nothing else does.


Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s