Concurrent Joins – Coming Soon to a DW near you?

Oracle’s Concepts book says the following about the characteristics of Data Warehouses:

“A typical data warehouse query scans thousands or millions of rows. For example, ‘Find the total sales for all customers last month.’”

They also say:

“Data warehouses are designed to accommodate ad hoc queries. You might not know the workload of your data warehouse in advance, so a data warehouse should be optimized to perform well for a wide variety of possible query operations.”

So we have multiple ad-hoc queries, each with its own plan, unaware of other similar queries that may be running at the same time, each reading millions of rows. Obviously they compete for resources. What if several concurrent queries are full-scanning the same table? What if this table is too large to fit entirely into memory?

It sounds obvious that in this case, all the separate execution threads will read their own blocks into memory, often “aging out” blocks that will be needed in just a few seconds by another query, all of them competing for buffer-cache latches and blocks. The more queries we run concurrently, the slower response times will be, due to competition for limited memory and IO resources.

Jonathan Lewis famously said: “The performance of a query should be related to the size of the data set you’re interested in, not to the size of the database”

I would like to add “… or to the number of concurrent users in the system”, but this is obviously untrue. Resource contention when the number of users rises has a dramatic negative impact on performance. This is why we run load tests before putting new systems in production. We know that good performance with one user in the lab does not guarantee good performance with 40 users in production.

But what if it doesn’t have to be this way? What if your database could, instead of optimizing each query separately, optimize the entire workload? So that with one or 40 or 256 users in the system we would still see very similar response times? What if the queries could share resources instead of competing for them?

All this is a rather lengthy introduction to a cool idea I ran into while scanning the program of the upcoming VLDB conference.

The paper I’ve read is called “A Scalable, Predictable Join Operator for Highly Concurrent Data Warehouses” and it is by George Candea, Neoklis Polyzotis and Radek Vingralek.

In the paper, the authors introduce CJOIN – a physical operator (i.e. an implementation of a relational operator) that can evaluate concurrent joins efficiently. It was written to allow sharing of CPU and IO resources and to fit modern DW systems – star schema, multiple cores, fast sequential scans and large memory.

The idea behind the CJOIN design is that there is a single physical plan that is “always on” and is optimized based on run-time statistics. Each new query can use this plan at any time and start sharing work with concurrent queries in the same plan.

Since all this sounds a bit vague, let me present the example that is given in the paper to demonstrate the idea. Then I’ll mention a few of the more interesting points that are detailed in the paper, and hopefully after reading my descriptions you’ll decide to read the paper (which requires some effort on the part of the reader).

The design of CJOIN is based on a simple observation: star queries all work by filtering a fact table through dimension tables and aggregating the results.

CJOIN works as a pipeline – receiving the input from a continuous scan of the fact table, passing the data through a set of filters, one for each dimension table, and distributing the results to aggregation operators that produce the output for *all the queries* using the operator.

Since the scan of the fact table is continuous, a query can start using the operator at any time – it remembers the point in the scan at which it registered and completes when the scan reaches that point again.

Suppose we have a fact table “Sales” with dimension tables “Customers” and “Products”.
Let’s imagine the following two queries running concurrently:

Select sum(quantity) from sales, customers, products
where sales.customer_id=customers.customer_id and sales.product_id=products.product_id
and customers.city='Japan' and products.type='High-Price';

Select avg(dollar) from sales, customers, products
where sales.customer_id=customers.customer_id and sales.product_id=products.product_id
and customers.service_level='Gold' and products.type='High-Price';

As you can see, they share the same data source, but apply different filters and predicates.

Here’s how the CJOIN pipe will work:

The pipeline starts with a pre-processor, which receives rows from the continuous scan of the fact table and forwards them to the filtering part of the pipeline. Before doing so, the pre-processor adds a few bits to each row – one bit for every query that is registered on the pipeline (i.e. queries that are interested in rows from this fact table). All the bits start out as “1”, signifying that at this stage every query is interested in every row.

Now let’s take a look at the filters:
We have a filter for each dimension table. The filter is a hash table that stores all the rows of that dimension that are of interest to any of our queries. Remember that while the fact table is too big to fit into memory, dimension tables are typically small enough to fit in the memory of a decent DW server. Like the fact rows, the filter rows also carry an additional bit per query.

So in our example, the “customers” filter will contain all the customers from Japan and all the customers with service_level “Gold”. The rows for customers from Japan will have the first bit turned on and the second turned off, and the rows for Gold customers will have the reverse, because only the first query checks for customers from Japan and only the second checks for Gold customers. The “products” filter will contain the products of type “High-Price”, and both bits will be on, as both queries check for High-Price products.

Note that when we start a new query, we need to add the appropriate rows from the dimension tables to the filters, and remove them when the query finishes running. This is relatively quick because dimension tables are relatively small.

Now a row from the fact table arrives at the customers filter. We quickly check the customer_id on this row and see if it matches any row in the filter. If it does, we know that at least one query wants this fact row. We can then check the query bits on the matching filter row to see which queries need it. If we see that only query 1 needs this fact row, then the row no longer interests query 2 and we can mark the second bit of the fact row as 0. If all the query bits are marked 0, we can throw the fact row away – no one will need it.

In this way the row from the fact table passes through all the filters and arrives at the distributor. The distributor receives fact rows that are relevant to at least one query in the current workload. It checks the bits to see which queries are interested in the row and sends it to the aggregators of those queries.
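
To make the mechanics concrete, here is a minimal Python sketch of the idea – my own simplification for this example, not the paper’s implementation. The dimension filters are hash tables keyed on the join column, each entry holds one bit per registered query, and a fact row’s bit vector is ANDed with the bits of the matching filter row at every stage:

# Toy sketch of the CJOIN filtering idea (not the paper's actual code).
# Two registered queries, as in the example above:
#   query 1: customers.city = 'Japan'          and products.type = 'High-Price'
#   query 2: customers.service_level = 'Gold'  and products.type = 'High-Price'
NUM_QUERIES = 2

# Dimension filters: join key -> bit vector of queries that want this dimension row.
customers_filter = {
    101: 0b01,  # a customer from Japan       -> only query 1 cares
    102: 0b10,  # a Gold customer             -> only query 2 cares
    103: 0b11,  # a Gold customer from Japan  -> both queries care
}
products_filter = {
    500: 0b11,  # a High-Price product        -> both queries care
}

def cjoin_pipeline(fact_rows):
    """Pass each fact row through the dimension filters and yield
    (row, query_bits) pairs for the distributor. In this example both
    queries filter on both dimensions, so a miss means no one needs the row."""
    for row in fact_rows:
        bits = (1 << NUM_QUERIES) - 1                       # pre-processor: all bits start at 1
        bits &= customers_filter.get(row["customer_id"], 0)  # customers filter
        if bits == 0:
            continue                                         # no query needs this row any more
        bits &= products_filter.get(row["product_id"], 0)    # products filter
        if bits == 0:
            continue
        yield row, bits                                      # distributor routes to aggregators

fact_rows = [
    {"customer_id": 101, "product_id": 500, "quantity": 3, "dollar": 30.0},
    {"customer_id": 102, "product_id": 500, "quantity": 1, "dollar": 99.0},
    {"customer_id": 104, "product_id": 500, "quantity": 5, "dollar": 10.0},  # dropped at the filter
]

for row, bits in cjoin_pipeline(fact_rows):
    if bits & 0b01:
        print("query 1 (sum of quantity) gets", row)
    if bits & 0b10:
        print("query 2 (avg of dollar) gets", row)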

Once you’ve got this example, you should be able to enjoy the paper. The paper actually contains the same example, but with D1 instead of customers and d21 instead of Gold. I’m just a simple DBA, and I understand better with a more concrete example.
You probably want to read the paper because it contains the algorithms for adding and removing queries from the pipe, so you’ll be convinced of how fast and clever this is.

The paper also contains a discussion of how to best parallelize this pipeline. Parallelization is very important in DW, and the paper offers several ideas and picks the best one. It also has some ideas on how to handle concurrent updates to the tables, and on how to adapt CJOIN to models other than star schemas.

Finally, the authors implemented their idea on top of the PostgreSQL database, and they have an extensive analysis of how CJOIN indeed improves performance for concurrent workloads (they seem to achieve almost linear growth in throughput as the number of concurrent queries grows!).

I hope you enjoyed this peek into the future of DBMS as much as I did and I hope to see CJOIN in Oracle soon 🙂


Chicken and Egg Troubleshooting

It all started about 10 days ago, with a phone call at 8am. 8am phone calls mean that something bad happened at 4am, but the system kept functioning, so it wasn’t worth waking anyone up about it.

One of our stand-alone databases was swapping like there was no tomorrow (or, more precisely, no free memory). Load average was up at around 20 – on a 2-CPU machine. We could not find any process to blame for the situation, so we did the usual – restarted the instance. This solved the immediate problem.

During the post mortem, one of the night DBAs reported that at some point during the night this machine lost connectivity to the Netapp for about 10 minutes. He did not have an exact time for when it happened, we couldn’t find any trace of it in the database logs, and the system people denied it completely – it did not happen, it could not happen. Except that we had already seen that loss of connectivity to the Netapp results in a large number of processes in “uninterruptible sleep” mode, which brings load average up… So this was still our best explanation. And it was not a very good one.

During the next few days, at least twice a day, load average and swap went way, way up on that machine. Most of the CPU time was system time, and again we couldn’t find a single process that was taking too much memory. Sometimes the situation went away quickly, sometimes I had to kill a bunch of old connections to stabilize it. Naturally, I was unhappy about it. When I’m unhappy, I start looking at numbers. I compared the broken stand-alone server to two other stand-alone servers that were doing similar work. The good servers had an average of 200 concurrent connections – each serving about 20 customers with 10 connections each. The broken server had 500 connections – 30 customers, each with 15 connections. It also had only 12G of memory, compared to 16G on the other servers.

Aha! I said to my boss. The machine is overloaded. We can try to move a few customers to a less loaded server, but we should also consider moving to a better machine, or adding another server.

My boss was skeptical – if this is just connection load, why did it spring on us all of a sudden? Shouldn’t we have seen a slow degradation? Well, each connection consumes some memory. As long as the machine had enough memory, there was no issue. Once it started swapping, things went wrong very fast.

My boss was not convinced, and I was not convinced either. There was the possibility that the high load was the cause of the large number of connections, not the result of it. When each task takes much longer to complete, you will have more users trying to complete their tasks at the same time. But my plan was the only thing that was even remotely close to a solution, so we decided to try it anyway.

Lucky for us, that night the machine lost connectivity to the Netapp again. This time the issue did not resolve itself, and the sysadmins had to replace the network cards on the server to solve the problem.

Needless to say, all my problems – load, swap, too many connections – went away that night and never came back. I had been staring at a bunch of eggs for days, trying to decide which one of them was the chicken.

How did the broken network card cause all these problems? In two ways, I think.

The first one is quite simple – every IO operation the DB had to do took much longer due to the broken network connectivity. This means that everything takes longer, which means more connections, all of them waiting for IO most of the time.

Which leads straight into the second problem – since the network was now unreliable, and since TCP tries to cover up for the problems and keep communication reliable, each socket now had to keep a huge buffer of information that had to be retransmitted, again and again. With more connections, each taking much more memory, we went deep into swap. Now the OS was so busy managing the broken connections and swapping that it had very little time for the work we actually wanted it to do, and load average went way up too. We had more and more connections, taking more and more memory and getting less and less IO and CPU to do their work.

Next time I see high swap and system taking lots of CPU, I’ll remember to test all my network connections.


Unusual IO activity on shared clusterware home

Sometimes a problem exists in a system for years, but only becomes apparent when you prepare for a big change. This war story begins when our storage admin decided to replace our Netapp disks with new disks, twice as large. It is a cheap way to increase disk space and IO wait times.

While assessing the impact of this change, he found out that the volumes holding the shared Oracle home for our RAC clusters were doing 6000 IO operations per second (IOPS). The data and redo volumes never exceeded 2000 IOPS, so 6000 is quite significant, especially on disks that should be practically idle.

Initial debugging showed that almost all the IO was neither read nor write, but things like “get attribute” and “access”. At this point I discovered that there is almost no way to get information about IO activity on NFS mounts. I could not see which processes were doing this IO, nor which files or directories it was done on.

Time to get advice from the experts on Oracle-L. Vasu Balla of Pythian provided the solution:

“Oracle recommends using the noac or actimeo=0 options when mounting NFS for datafiles, voting disk and OCR. Noac means ‘no attribute cache’ – none of the file attributes are cached in the filesystem cache, which is very much needed for RAC. If you put your shared Oracle home in a mountpoint which is mounted noac, every access to a file in the Oracle home requires a physical IO at the Netapp. So I recommend moving all software directories (DB Oracle home, ASM Oracle home, CRS Oracle home etc.) to an NFS mount which is not mounted with noac or actimeo=0.”

What a wonderful explanation. I now understand the issue and know what to do to solve it. It took me about 3 minutes to test this solution on our staging environment, and it worked like a charm.

Unfortunately, both Netapp and Oracle insisted that a shared Oracle home on Netapp must be mounted with actimeo=0, and that if this was causing me trouble, I should move to a local home instead of a shared one. Only after very long discussions with two experts from Oracle did I get an unofficial confirmation that the official documentation is probably wrong and that mounting Oracle home with actimeo=0 is a bad idea.

To my surprise, my boss agreed to go ahead with the unofficial but working solution and change NFS mounts to remove “actimeo=0”.

So, we schedule downtime on our production RACs, we change the mount options, and… nothing happens. At all. 6000 IOPS before and after the change. If I hadn’t been so shocked, I might have noticed my professional credibility taking a hit there.

Why didn’t it work on production? For weeks I had no answer – until our network admin mentioned that I could use rpcdebug to get more insight into the issue. It turns out that NFS runs over RPC, and that Linux has flags for debugging RPC. By throwing magic numbers into /proc/sys/sunrpc/nfs_debug I could get NFS trace messages thrown into /var/log/messages. Now we were getting somewhere.
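
For reference, here is roughly what that looks like – a hedged sketch, since the exact bitmask values are defined by the kernel’s NFS debug flags, and the all-ones mask used here is simply the lazy way to turn everything on:

# Minimal sketch: toggle NFS client debugging by writing a bitmask to
# /proc/sys/sunrpc/nfs_debug (requires root). 65535 sets all the flag bits;
# the resulting trace messages land in the kernel log (/var/log/messages or
# dmesg, depending on your syslog configuration).

def set_nfs_debug(mask):
    with open("/proc/sys/sunrpc/nfs_debug", "w") as f:
        f.write(str(mask))

set_nfs_debug(65535)   # enable all NFS debug messages
# ... reproduce the workload, then read /var/log/messages ...
set_nfs_debug(0)       # switch the tracing back off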

Except that it didn’t get me very far. I could see which devices NFS accessed, but I already knew that. I could see that our production server made many, many calls to “getattr”, while our staging system didn’t. To complete my tests I decided to turn off attribute caching on staging again and compare the logs – just to see what it looks like when both systems are in the same state.

A strange difference caught my eye: the staging system had messages saying “NFS: Refresh_inode” which did not exist in production. A tiny difference, but maybe it had an impact? What does refreshing the inode mean? Time to go to lxr.linux.no and look at the Linux kernel code for clues. I just needed to recall which version to look at.

When the lightbulb went off it nearly blinded me. The staging system runs Linux 2.4.27, production is running 2.6.9. I was the one who pushed for the upgrade. I said “there are many NFS improvements in the new kernel versions.”

From here it was easy to find the change. In 2.4 the code for getting file attributes from the server looked like this:

static inline int
nfs_revalidate_inode(struct nfs_server *server, struct inode *inode)
{
        if (time_before(jiffies, NFS_READTIME(inode)+NFS_ATTRTIMEO(inode)))
                return NFS_STALE(inode) ? -ESTALE : 0;
        return __nfs_revalidate_inode(server, inode);
}

Which basically means – get the new attributes if the cache has timed out.

In 2.6 the code changed and the following check was added:

/* We may force a getattr if the user cares about atime */
if (need_atime)
        err = __nfs_revalidate_inode(NFS_SERVER(inode), inode);
else
        err = nfs_revalidate_inode(NFS_SERVER(inode), inode);

Which means that if the user cares about the file’s last access time (atime), we skip the cache timeout check and force a “getattr” from the server – another IO operation, even if the cache has not timed out.

Luckily, the fix is also easy. Just add “noatime” to the NFS mount options, to let the kernel know that we don’t care about access times, so it can go back to using the attribute cache.

So easy once you know what to look for!


Newsflash – 100% CPU is worse than you think!

Found a recent thread on OTN that discusses my recent obsession: http://forums.oracle.com/forums/thread.jspa?threadID=650986&start=0&tstart=0

Jonathan Lewis points out a major issue with running your DB at 100% cpu:

“Consider the simple case of 8 queries running on 8 CPUs. They will be competing for the same cache buffers chains latches – which means that seven processes could be spinning on the same latch while the eighth is holding it. None of the processes ever need wait, but most of them could be wasting CPU most of the time.”

Amazing. During my entire discussion of CPU load and process priorities I completely ignored the fact that I’m using 2 dual-core CPUs on that system, and that all Oracle processes use shared memory, which means a shared resource, which means locks, which means wasting resources while waiting for those locks.

And this complicates the discussion, because 6 processes on 8 CPUs will also waste time waiting for locks – you don’t need 100% CPU to suffer from this. The thread mentions that the book “Forecasting Oracle Performance” discusses this issue and cites 75% CPU as the tipping point, but I’d assume that the number is different for systems with different numbers of CPUs. I definitely need to read that book.

I also was not aware that processes stay on the CPU while waiting for a latch. I’d have assumed the scheduler would replace them with a runnable process? Of course the context switch also costs resources, so you lose either way.

I can’t believe I ignored this until now (and that not one of my readers mentioned this!). The thread is well worth reading.


Oracle on NFS and TCP Throttling

In an old post about the futility of best practices, I mentioned a strange best practice that is used in our organization: “mounting a single NFS volume to multiple mount points and configuring the DB to use them as though they were separate volumes (i.e. put different data files on different mount points).”

I was 100% sure that there was absolutely no reason for this practice. I thought it was there because someone misunderstood OFA, and no one changed it because they worried about imaginary performance implications. Reader comments on this blog further convinced me of this.

This week I had lunch with a former colleague. He worked as a DBA in the same organization for at least 10 years before I joined, and he is a real expert on storage (a rare gift – a DBA who knows storage). I had to ask him whether this strange best practice was in effect when he was working here, and what he thought of it. In response he burst out laughing. I thought it was because he also found the practice ridiculous, but it turned out (once he stopped laughing and was able to talk again) that he was the one who invented it. He had excellent reasons for doing it. It is not his fault that the practice was kept long after the reasons were no longer relevant.

So, why would you want to mount the same volume on different mount points?

If you use NFS on top of TCP (I think no one does NFS on top of UDP anymore), and you have a high-throughput system (like a data warehouse), you risk reaching the point where the ACKs from the Netapp are not arriving fast enough, and Linux will throttle your connection.

The reason for this behavior lies in TCP congestion control. Congestion control was introduced in the eighties to prevent the internet from choking on noisy lines, and it is built around a dynamic value called the TCP congestion window. The congestion window is the amount of data a server will agree to send without receiving any ACKs. If this amount of data has been sent and no ACK has arrived yet, the connection will patiently wait until the ACKs arrive. There are good reasons for this behavior: first, we don’t want the OS to risk running out of memory keeping all those packets. But even more important, it is good network behavior – maybe there is a bottleneck on the way and the packets never reach the target; if you continue to send them, the bottleneck will only get worse.

However, Linux defaults are really too conservative for database systems, which is why Oracle’s installation instructions include changes to these defaults:

net.core.rmem_default=262144
net.core.wmem_default=262144
net.core.rmem_max=1048576
net.core.wmem_max=1048576

These parameters control the send and receive buffer sizes. In this post, I’m talking about the send buffer (wmem), which is used to govern the TCP congestion window. The receive buffer (rmem) relates to how much data the machine will accept when acting as a client, and it is out of scope here. Oracle’s recommendations for the buffer sizes are a nice start, but you can change these values to match the throughput your system needs and your hardware supports.
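
If you want to see what your system is currently running with, the values are easy to read back from /proc – a small sketch, relying only on the fact that the file names under /proc/sys/net/core mirror the sysctl parameter names above:

# Print the current kernel defaults and maximums for socket send/receive buffers.
# These are the same values set by the net.core.* sysctl parameters above.
for name in ("rmem_default", "wmem_default", "rmem_max", "wmem_max"):
    with open("/proc/sys/net/core/" + name) as f:
        print("net.core." + name, "=", f.read().strip())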

So, nowadays, improving throughput by playing with window sizes is all fun and games. But according to the old and wise DBA, back in the days of Linux 2.2 you did not want to change them. You had to work around the issue in a different way.

By mounting your single NFS volume on different mount points you could trick Linux into creating a separate TCP connection – and therefore a separate send buffer – for each mount point, effectively multiplying the size of the original window by the number of mounts. As long as the physical line keeps up, you are fine.

Great solution. Cheap, simple, scalable, maintainable, stable, everything you want.

Except that this same practice is still used 10 years later, on low-throughput OLTP systems that are not even close to saturating Oracle’s default window sizes, and when there is an easier way to adjust the window anyway. Because no one knew why it was done.

Aside from the cool network details (I love cool network details; if you enjoy them as well, you probably want to read Tanel Poder’s blog), I think this illustrates the story behind many best practices – once upon a time it was a great solution to a problem. Then the problem was forgotten, the technology changed, but the solution stayed.


Nice Processes Finish Last

I’ve run into lots of myths and misconceptions about the Linux “nice” command. Lots of people seem to view it as a magic wand that can be waved at misbehaving processes – they are not sure what it really does, but maybe it will make the process nicer?

Indeed there is so much confusion around it that it is incredibly difficult to find good information about it, especially since the behavior of nice changes between kernel releases, and information that was great two years ago may not be true today.

Fortunately, Linux is open source and one can always read the source code of the relevant kernel version to know exactly what Nice is doing today. I mean it – the relevant code is all in one file (kernel/sched.c) and is very readable.

Since I’ve already done some reading on the 2.4.31 kernel, I’ll summarize here the behavior of nice in that version and then shoot down specific myths. To explain what nice does I’ll have to start with a short overview of how the Linux scheduler works. It should be entirely painless, and it makes things much clearer later on.

The important part of the scheduler is the schedule() function. It runs every once in a while (I’m not getting into exactly when, except to say that since the scheduler itself takes CPU, we definitely don’t want it to run too often!), stops the process that is currently on the CPU and gives the CPU to a new process. The main question is which process goes next. That’s where the scheduling algorithm comes in.

The scheduling algorithm in Linux works in “epochs” – when an epoch starts, each process in the run queue is allocated a certain number of CPU ticks that it can use during this epoch. When a process has used all its ticks it has to give the CPU to another process. If it waits for IO it doesn’t use any ticks, and another process can run in the meantime, but once it finishes waiting it can use its remaining ticks during the same epoch. When all the processes have used all their ticks, the epoch is over and everyone gets more ticks.

So, which process runs next? The algorithm is implemented in the goodness() function, which returns a number for each process. The higher the goodness of a process, the more likely it is to run next. Goodness is highest for special real-time processes, which always run first; higher for the process currently using the CPU (because it has things in the cache and you don’t want to mess up the cache); high for processes with many ticks left as opposed to those with only a few; and 0 for processes that are out of ticks.
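
As a rough mental model, here is a toy sketch of that selection loop – my own simplification, not the kernel code, which also accounts for CPU affinity, real-time policies and the nice bonus:

# Toy sketch of epoch-based scheduling: pick the runnable process with the
# highest "goodness"; when everyone has used up its quota, start a new epoch.

DEFAULT_TICKS = 6   # placeholder quota; the real value comes from NICE_TO_TICKS (below)

class Proc:
    def __init__(self, name, quota=DEFAULT_TICKS):
        self.name = name
        self.quota = quota          # ticks handed out at the start of each epoch
        self.ticks_left = quota     # ticks still unused in the current epoch

def goodness(p, current):
    if p.ticks_left == 0:
        return 0                    # out of quota - must wait for the next epoch
    g = p.ticks_left                # more remaining ticks -> more attractive
    if p is current:
        g += 1                      # small bonus for the process already on the CPU (warm cache)
    return g

def schedule(run_queue, current):
    best = max(run_queue, key=lambda p: goodness(p, current))
    if goodness(best, current) == 0:        # everyone is out of ticks: the epoch is over
        for p in run_queue:
            p.ticks_left = p.quota          # hand out fresh quotas
        best = max(run_queue, key=lambda p: goodness(p, current))
    return best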

Now we get to nice. Nice impacts the number of ticks a process gets. Ticks are assigned in batches that amount to about 50ms, and therefore contain different numbers of ticks for different CPU speeds, but the basic formula in 2.4 is this:

#define NICE_TO_TICKS(nice)     (TICK_SCALE(20-(nice))+1)

TICK_SCALE is where the number of ticks is matched to the CPU speed, but as you can see there is an inverse relation between the nice value and the number of ticks: nice 19 gets you the fewest ticks and -20 gets you the most, and the relation is linear – moving from nice 10 to nice 11 is the same as moving from -5 to -4. This is no longer true in 2.6, where the slope is different for positive and negative values.
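
To put numbers on it, here is a small sketch that mirrors the 2.4 formula, under the assumption of a HZ=100 system (10ms ticks), where TICK_SCALE(x) works out to x >> 2:

# Rough reproduction of the 2.4 NICE_TO_TICKS formula, assuming HZ=100,
# where TICK_SCALE(x) is x >> 2 and one tick is 10ms.

def nice_to_ticks(nice):
    return ((20 - nice) >> 2) + 1

for nice in (-20, -10, 0, 10, 19):
    ticks = nice_to_ticks(nice)
    print(f"nice {nice:>3}: {ticks} ticks per epoch (~{ticks * 10}ms of CPU)")

# nice -20 gets 11 ticks (~110ms), nice 0 gets 6 (~60ms), nice 19 gets 1 (~10ms):
# being "nicer" simply shrinks your quota per epoch.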

To summarize – the nicer you are, the less CPU you get and the less often you will be scheduled. Nice processes will finish last.

What Nice does not do:

  • Many people (including Linux kernel experts!) believe that “nice 19” means your process will only run if no one else wants the CPU. Not exactly. Every process has to use up its entire tick quota before the scheduler starts a new epoch (unless the process is sleeping), so while your nice process will probably run last in each epoch, other processes still wait for it to finish before the new epoch starts and they get a new tick quota.
  • Nice will not have much effect on a low-utilization system. Your CPU is at 0% now. You start a heavy process (maybe an export) with nice 19 – what will happen? It will take its usual 70% CPU, because as long as there are no queues this nice process will be rescheduled again and again. What’s the rest of the CPU doing? Running the scheduler, which is now working much harder, since it has to reschedule this process again and again with a tiny quota each time.
  • Top shows 100% CPU utilization, and you see that 80% of it is taken by a single process. Should you nice it? Not always. Top will show IO wait time as part of the CPU time, but processes don’t use ticks while they wait for IO, so if the process is IO bound it will behave very similarly regardless of nice – use very little CPU, wait a long time for IO, and show up in top as a very busy process.
  • Export takes tons of CPU and I don’t want other processes to wait for it; I don’t mind if the export takes a bit longer. Should I nice it? Probably not. For one thing, export is IO bound, which makes it insensitive to nice anyway. What is more important – export doesn’t do most of the work. The work is done by the Oracle server process opened to serve the export, and you would actually have to renice that Oracle child process if you hope for anything at all to happen. What you can nice with pretty good effect is the “zip” process that you run after the export.
  • And since we mentioned child processes, keep in mind that when a process forks, the child gets the same “nice” value as the parent, but from that point on you can renice each of them independently – see the small sketch below.
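
A minimal sketch of that inheritance, using Python’s os module (os.nice() adds an increment to the calling process’s nice value and returns the new value):

import os

# The parent raises its own niceness; the child created afterwards inherits it.
print("parent nice:", os.nice(10))      # e.g. goes from 0 to 10

pid = os.fork()
if pid == 0:
    # Child: starts with the inherited value, and can be reniced independently.
    print("child nice after fork:", os.nice(0))    # os.nice(0) just reads the current value
    print("child nice after renice:", os.nice(5))  # now 15, while the parent stays at 10
    os._exit(0)
else:
    os.waitpid(pid, 0)
    print("parent nice unchanged:", os.nice(0))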