What is “Network Delay”?
Posted: June 25, 2009
Data-Guard is often used for disaster planning, where the primary server is usually very far from the failover, often on a different continent.
We are also planning to use Data-Guard, but for a different kind of disaster: we are moving about 90% of our operation to a new data center about 3000 miles away, and we have 4 hours of downtime to do it. There is an old Russian saying – “two moves equal one fire”, so I guess we are planning for a 50% disaster 🙂
One of the questions that pops up a lot is the question of network delays. It's not a very well defined question. The person asking usually starts with “A change was done on production at 3pm. When will it be applied on the failover?” and the next question is: “How much of this gap is spent on network-related waits, rather than Oracle waits or disk waits?”
There are 3 important factors that will influence the answer:
- Network Latency. The time it takes to send 1 packet of data from source to destination. It is the easiest to measure – just ping the destination (ping reports the round trip, so the one-way time is roughly half of that) – and it is influenced mostly by the physical distance between the two locations and the speed of light.
- Bandwidth utilization. The time it takes to send 1 packet is interesting, but we are more interested in the time it takes to send 500 MB of redo log. We have a nice OC3, theoretically capable of doing 155 Mb/s. 500 MB is 4,000 Mb, so theoretically it should take around 26 seconds? Not really. First of all, a network is a queue system, and we all know that we shouldn’t really run our queue system at 100% capacity, so we can’t use all 155 Mb/s. TCP also has congestion control, so it won’t let you send all your data at once: it makes you wait for the other side to start acknowledging, and it carefully controls the amount of data you are allowed to have in flight at any one time. Oh, and maybe other applications are also using the line. Given all that, it should be obvious that the percentage of the nominal bandwidth you actually manage to use has a huge impact on transfer rates. I’ve no idea how to check the line utilization myself – but my network manager can send me very nice graphs 🙂
- Congestion and errors. You have an SLA with your provider and all, but it’s a fact of life that not all the packets that leave the source arrive safely at the destination. Sometimes they get lost, or arrive in the wrong order. These errors have a disproportionate impact on transfer time – 1% lost packets can cause a 200% increase in transfer times. That is because once TCP has to retransmit lost packets it starts sending data very slowly, waiting to make sure it is received on the other side before sending more – and the utilization drops like a rock.
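To get a feel for how these three factors interact, here is a rough back-of-the-envelope sketch in Python. It is not how Data-Guard or your network actually behaves, just three textbook upper bounds: the raw line rate scaled by utilization, the "one window per round trip" limit of a single TCP stream, and the Mathis et al. approximation for throughput under packet loss. The function names and the sample numbers in the usage section (70 ms round trip, 64 KB window, 1460-byte segments, 60% usable bandwidth) are my own assumptions for illustration, not measurements from our line.

```python
import math


def ideal_transfer_seconds(size_mb, link_mbps, utilization=1.0):
    """Seconds to push size_mb megabytes over a link, using a fraction of its bandwidth."""
    return (size_mb * 8) / (link_mbps * utilization)


def window_limited_mbps(window_bytes, rtt_ms):
    """Upper bound for one TCP stream: at most one window of data per round trip."""
    return (window_bytes * 8) / (rtt_ms / 1000) / 1e6


def loss_limited_mbps(mss_bytes, rtt_ms, loss_rate):
    """Mathis et al. approximation: rate <= (MSS / RTT) * (C / sqrt(p)), C ~ sqrt(3/2)."""
    c = math.sqrt(3 / 2)
    return (mss_bytes * 8) / (rtt_ms / 1000) * c / math.sqrt(loss_rate) / 1e6


def estimate_seconds(size_mb, link_mbps, utilization, window_bytes, rtt_ms,
                     mss_bytes, loss_rate):
    """Transfer time at the tightest of the three bounds (a rough estimate only)."""
    rate = min(
        link_mbps * utilization,
        window_limited_mbps(window_bytes, rtt_ms),
        loss_limited_mbps(mss_bytes, rtt_ms, loss_rate),
    )
    return (size_mb * 8) / rate


if __name__ == "__main__":
    # All of these inputs are assumed values for illustration.
    print("line rate only:", ideal_transfer_seconds(500, 155))
    print("window bound (Mb/s):", window_limited_mbps(65535, 70))
    print("loss bound (Mb/s):", loss_limited_mbps(1460, 70, 0.01))
    print("estimate (s):", estimate_seconds(500, 155, 0.6, 65535, 70, 1460, 0.01))
```

With inputs like these, the window and loss bounds come out far below the line rate, which is the point of the list above: on a long, slightly lossy link, the nominal 155 Mb/s is rarely the number that decides how long 500 MB takes. Real transfers depend on window scaling, the congestion control algorithm in use, and whatever else shares the line, so treat the output as an order-of-magnitude guide.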
If you are in the business of getting data across the ocean at a decent speed, you should also know that there are some companies that do WAN acceleration by addressing the causes of delay I mentioned above, and by introducing compression and proxies to the game. It’s worth taking a look.
Most of all, don’t estimate delays by pinging remote machines. Talk to your network manager.