You Can’t DeDupe Oracle DB Files

One of our storage vendors has DeDupe technology. DeDupe is short for deduplication – the idea is that when the storage detects identical blocks, it keeps only one copy. This is a nice idea that saves on storage. It’s especially good on shared filesystems where many users keep copies of identical files.

Our vendor loves DeDupe and managed to convince my storage manager that DeDupe will lead to amazing storage savings. Even on the DB volumes! No matter how much the DBAs protested that databases rarely have full blocks that are identical to each other, the vendor kept insisting that many other customers had seen amazing storage savings using this technology on their data files. “Databases have many empty blocks”, the storage manager said after lunch with the vendor. “And they are all identical! Think how much space you can save by keeping just one empty block!”

We agreed to test DeDupe. As expected, we saw about 2% space savings. Not exactly what the storage manager expected.

I wasn’t surprised. Even empty data blocks in Oracle DB files are not really identical. Each block has a header that contains, among other things, its relative data block address, so even two freshly formatted empty blocks differ slightly.
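If you want to see this for yourself, you can dump a couple of blocks from the same datafile to a trace file and compare the headers. This is just a sketch – the file and block numbers here are made up, so pick real ones (DBA_EXTENTS is a good place to find some):

-- dump two adjacent blocks to the session trace file and diff the headers;
-- the rdba (relative data block address) field differs even for empty blocks
alter system dump datafile 4 block 128;
alter system dump datafile 4 block 129;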

So, no DeDupe. I thought you might want to know, so you won’t have to repeat this experience. Maybe even send a link to your vendor 🙂

If your experience was different though, I’d love to know. The vendor insisted that he had many customers happily deduping their databases.


Oracle on NFS and TCP Throttling

In an old post about the futility of best practices, I mentioned a strange best practice that is used in our organization: “mounting a single NFS volume to multiple mount points and configuring the DB to use them as though they were separate volumes (i.e. put different data files on different mount points).”

I was 100% sure that there was absolutely no reason for this practice. I thought it was there because someone misunderstood OFA, and no one changed it because they worried about imaginary performance implications. Reader comments on this blog further convinced me of this.

This week I had lunch with a former colleague. He worked as a DBA in the same organization for at least 10 years before I joined, and he is a real storage expert (a rare gift – a DBA who knows storage). I had to ask him if this strange best practice was in effect when he was working here, and what he thought of it. In response, he burst out laughing. I thought it was because he also found this practice ridiculous, but it turned out (once he stopped laughing and was able to talk again) that he was the one who invented this best practice. He had excellent reasons for doing it. It is not his fault that the practice was kept long after the reasons were no longer relevant.

So, why would you want to mount the same volume on different mount points?

If you use NFS on top of TCP (I think no one does NFS on top of UDP anymore), and you have a heavy-throughput system (like a data warehouse), you risk reaching the point where the ACKs from the Netapp don’t arrive fast enough, and Linux will throttle your connection.

The reason for this behavior lies in TCP Congestion Control. Congestion control was introduced in the eighties to prevent the internet from choking on noisy lines, and it is built around a dynamic value called the TCP congestion window. The congestion window is the amount of data the sender will agree to send without receiving any ACKs. If this amount of data has been sent and no ACK has arrived yet, the connection will patiently wait until the ACKs arrive. There are good reasons for this behavior: first, we don’t want the OS to risk running out of memory keeping all those packets around. Even more important, it is good network behavior – maybe there is a bottleneck on the way and the packets never reach the target, and if you keep sending them, the bottleneck will only get worse. The practical consequence is that a single connection can never push more than roughly one window of data per round trip, so with a 256KB window and a 1ms round trip you are capped at about 256MB/s, no matter how fast the physical line is.

However, Linux defaults are really too conservative for database systems, which is why Oracle’s installation instructions include changes to these defaults:

net.core.rmem_default=262144
net.core.wmem_default=262144
net.core.rmem_max=1048576
net.core.wmem_max=1048576

These parameters control the send and receive buffer sizes. In this post I’m talking about the send buffer (wmem), which is what effectively caps the TCP congestion window. The receive buffer (rmem) is related to how much data the machine will accept when it is on the receiving side, and is out of scope here. Oracle’s recommendations for the buffer sizes are a nice start, but you can change these values to match the throughput your system needs and your hardware supports.
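Just to sketch where such a calculation might land (the numbers here are made up, not a recommendation): if your warehouse needs to sustain roughly 400MB/s from the filer over a link with a 2ms round trip, the bandwidth-delay product is about 800KB, so you would want the maximums comfortably above that, something like:

net.core.wmem_max=4194304
net.core.rmem_max=4194304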

So, nowadays improving throughput by playing with window sizes is all fun and games. But according to the old and wise DBA, back in the days of Linux 2.2 you did not want to change these defaults. You had to work around the issue in a different way.

By mounting your single NFS volume on several mount points, you could trick Linux into opening a separate TCP connection – and therefore a separate congestion window – for each mount point. Use four mount points and you effectively get four times the window. As long as the physical line keeps up, you are fine.
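To make it concrete, here is a hypothetical sketch of what the practice looks like from the database side, assuming /u01 through /u04 are all mount points for the same Netapp volume (the paths and sizes are made up):

-- each datafile sits behind a different mount point, so its I/O flows
-- over a different NFS connection with its own TCP congestion window
create tablespace dwh_data datafile
  '/u01/oradata/dwh/dwh_data01.dbf' size 10g,
  '/u02/oradata/dwh/dwh_data02.dbf' size 10g,
  '/u03/oradata/dwh/dwh_data03.dbf' size 10g,
  '/u04/oradata/dwh/dwh_data04.dbf' size 10g;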

Great solution. Cheap, simple, scalable, maintainable, stable, everything you want.

Except that this same practice is used 10 years later, on low-throughput OLTP systems that are not even close to saturating Oracle’s default window sizes, and when there is an easier way to adjust the window anyway. Because no one knew why it was done in the first place.

Aside from the cool network details (I love cool network details, and if you enjoy them as well, you probably want to read Tanel Poder’s blog), I think this illustrates the story behind many best practices – Once upon a time, it was a great solution to a problem. Then the problem was forgotten, the technology changed, but the solution stayed.


Things you learn while studying for OCP

So, these days I’m studying for my Oracle certification. I’m studying with a friend, and we use a book from Oracle Press to prepare. We are both experienced DBAs, so we make a game of finding mistakes in the book. It’s a fun game, and it keeps us alert while going over rather boring material.

Yesterday I read that Oracle doesn’t allocate space for a column, even a fixed-size one, until there is data in it.
While this is certainly space efficient, it seemed like a very difficult way to manage space – you would have to keep moving data around when people update those columns. So we suspected that the book was making a mistake, and that Oracle actually allocates the space for a fixed-size column when the row is inserted, even if the column is empty.

Time for verification:
create table t1
(X integer, Y integer);

I created an empty table with two fixed-size columns, and checked the size:

SQL> select owner, segment_name, bytes,blocks, extents from DBA_SEGMENTS
where owner='CHEN';

OWNER      SEGMENT_NAME         BYTES         BLOCKS     EXTENTS
----------  ----------------    ----------   ---------   -----------
CHEN         T1                 65536         8             1

Nothing surprising. So let’s insert some data, but only into the first column, keeping the second one empty:

begin
  for i in 1..30000 loop
    insert into t1 (X) values (i);
  end loop;
end;
/

And check the size again:

SQL> select owner, segment_name, bytes,blocks, extents from DBA_SEGMENTS
where owner='CHEN';

OWNER      SEGMENT_NAME         BYTES         BLOCKS     EXTENTS
----------  ----------------    ----------   ---------   -----------
CHEN         T1                 458752        56             7

Look, the table just got bigger.
Just for the heck of it, let’s see what happens when I insert one row with values in both columns. Will this be enough to trigger allocation in all the rows?

insert into t1 values (1,1);
SQL> select owner, segment_name, bytes,blocks, extents from DBA_SEGMENTS
where owner='CHEN';

OWNER      SEGMENT_NAME         BYTES         BLOCKS     EXTENTS
----------  ----------------    ----------   ---------   -----------
CHEN         T1                 458752        56             7

Nope. Nothing much changed.
So let’s update all the rows, putting a value in the second column, and see what happens:

update t1 set Y=1;
SQL> select owner, segment_name, bytes,blocks, extents from DBA_SEGMENTS
where owner='CHEN';

OWNER      SEGMENT_NAME         BYTES         BLOCKS     EXTENTS
----------  ----------------    ----------   ---------   -----------
CHEN         T1                 851968        104            13

So, the table doubled in size. The book was correct – the space for the second column was not allocated when I did the first insert, only after the update.
Since the rows grew when the second column was filled in, I assume many of them no longer fit in their original blocks, so the update probably involved migrating a bunch of rows to the newly allocated extents. Something to keep in mind when trying to figure out why an update is using too much IO.
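If you suspect that an update like this moved rows around, one rough way to check (just a sketch – ANALYZE is old-fashioned, but it is the easy way to populate CHAIN_CNT) is:

analyze table t1 compute statistics;

select num_rows, chain_cnt
from user_tables
where table_name = 'T1';

-- a CHAIN_CNT that is large relative to NUM_ROWS means many rows were
-- migrated or chained, which is exactly the extra IO the update paid for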


The politics of running Oracle on NFS

Kevin Closson wrote an article making the economic case for running Oracle on NFS. It’s a great article, go read it. It shows how much cheaper, more scalable and easier to configure and administer NFS can be compared to SAN solutions, even when taking possible performance degradations into account.

One thing that Kevin failed to mention is that choosing NFS requires a strong backbone, and not every DBA would wish to test his or her mettle by attempting to introduce NFS into a production environment.

You know how the database is always blamed for every application issue? How the DBAs always have to run around and prove that the DB is innocent while everyone else is pointing their fingers at them? I’m sure everyone reading this blog is familiar with the “the database is guilty until proven innocent beyond all shadow of doubt” mentality.

Well, it’s the same with NFS, only a bit worse. We’ve been running several large production RAC systems on NFS for the last two years, and for those two years Netapp and the NFS protocol have been blamed for OS crashes, cluster errors, Oracle bugs, application errors, database misconfiguration, badly written backup scripts, errors that occurred within third-party network devices, and the list goes on and on.

Our RAC systems had tons of issues – Netapp and the NFS protocol have been blamed for about 90% of them. I don’t believe we found even one case where it was clearly a storage issue. There used to be a saying, “No one was ever fired for deciding to use IBM”. I can’t say the same for deciding to run Oracle on NFS.

The already sensitive situation was made even worse by the fact that we were unable to find any consultant – from Oracle, Netapp or a third party – who was willing to review our storage configuration top to bottom and reassure us (and mostly our twitchy management) that we are doing things right, that everything is configured as it should be, and that if your Linux servers and Oracle cluster crash twice a day, the problem must be somewhere else.

Maybe Kevin should start NFS auditing services.

(I actually planned to write about auditing today, but I got so worked up about storage that I totally forgot what is so interesting about audits anyway. Don’t worry – I’m testing fine-grained auditing for a customer these days, so some interesting bits about it are sure to come up soon.)