Testing. Again. And Again.

Suppose your boss asks you how long it will take to run a certain query, or update, or maybe an export.

You run the query (or update, or export) and it takes 5 minutes to complete. You can tell your boss “The query takes 5 minutes to complete”. But of course, you can’t be sure it will take 5 minutes every time. What if your boss runs it and it takes 6 minutes and 10 seconds? We all know that it is possible. For example, I may hack into your server and lock the table for one minute and 10 seconds just to make sure 🙂

A more experienced DBA (or developer) may run the query 10 times, average the results and tell the boss “I ran the query 10 times, and it took on average 5 minutes to complete”.

But this won’t tell the whole story. Maybe you should say “I ran the test 10 times. The first run took 8 minutes, because most of the data wasn’t in the cache. The rest of the runs took 4 minutes and 40 seconds on average.”

Or maybe even “I ran the test 10 times. It took between 3 and 8 minutes”. Maybe you even want to send him a small histogram of your results.

If you studied statistics, you can do even better:

First, you want to agree with your boss on an acceptable error. For example, if the average response time in your test turns out to be 5 minutes, you can agree that any response time between 4 and 6 minutes is acceptable. This gives you a 20% margin of error.

You also want to agree on a confidence level. Confidence level says how sure you are that your results are meaningful and not the product of random luck. You know, you can throw a perfectly balanced coin 10 times and get heads every single time. Obviously, the more you run the test, the more confident you can be that your results are valid.

Then, using your error margin and confidence level, you can decide how many times you want to repeat the test. Actually, you’ll also need to know the variance between your tests. If you run the test 5 times and get wildly varying response times, you’ll want to run it many more times before you are sure of your averages. If you run it 5 times and get the same result – you are either doing something very wrong, or you can just send a very reliable answer to your boss.

So, run the query a few times, estimate the standard deviation (a process known as bootstrapping) and then use R to give us our sample size:

n <- ((qnorm(confidence_level) * stddev) / error_margin)^2
(for example: ((qnorm(0.99) * 57.2) / 60)^2)

(qnorm is how R translates our confidence level into a point on the normal curve).
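If R isn’t handy, the same computation fits in a few lines of Python’s standard library (a sketch; `sample_size` is my own helper name, not a standard function):

```python
from math import ceil
from statistics import NormalDist

def sample_size(confidence, stddev, error_margin):
    """Number of test runs needed: same formula as the R snippet above,
    with NormalDist().inv_cdf standing in for qnorm."""
    z = NormalDist().inv_cdf(confidence)
    return ceil((z * stddev / error_margin) ** 2)

print(sample_size(0.99, 57.2, 60))  # the R example above: 5 runs
```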

If you think statisticians are anal, it’s only because you never worked with a real performance professional. A real professional is going to demand all of the above, and also grill you about the following:

  • Are you performing the tests sequentially or in parallel? How many sessions? Why?
  • Is there other load on the system while you are running the test? If there is, is it your typical load?
  • What if someone else modifies the same data in parallel?
  • Do you need to know how the query will perform during peak hours? If so, we need to test during peak hours.
  • Planning to clean the cache between runs? Why or why not?
  • Does your system need a ramp-up period before performance stabilizes?
  • Do you have other differences between the test system and production that can impact the results?
  • Are you sure you only want to look at the averages? After all, the outliers, 90th percentile and variance values can also indicate problems and future user complaints.
  • What about a nice histogram and 3 graphs to justify my consulting fees?

I’m not saying you should do all that whenever you need a quick estimate. It’s just that sometimes you want to be a little more accurate – and you should know how to do it.

Concurrent Joins – Coming Soon to a DW near you?

Oracle’s Concepts book says the following about the characteristics of Data Warehouses:

“A typical data warehouse query scans thousands or millions of rows. For example, “Find the total sales for all customers last month.””

They also say:

“Data warehouses are designed to accommodate ad hoc queries. You might not know the workload of your data warehouse in advance, so a data warehouse should be optimized to perform well for a wide variety of possible query operations.”

So we have multiple ad-hoc queries, each with its own plan, unaware of other similar queries that may be running at the same time, each reading millions of rows. Obviously they compete for resources. What if several concurrent queries are full-scanning the same table? What if this table is too large to fit entirely into memory?

It sounds obvious that in this case, all the separate execution threads will read their own blocks into memory, often “aging out” blocks that will be needed in just a few seconds by another query, all while competing for buffer-cache latches and blocks. The more concurrent queries we have on the system, the slower response times will be, due to competition over limited memory and IO resources.

Jonathan Lewis famously said: “The performance of a query should be related to the size of the data set you’re interested in, not to the size of the database”

I would like to add “… or to the number of concurrent users in the system”, but this is obviously untrue. Resource contention when the number of users rises has a dramatic negative impact on performance. This is why we run load tests before putting new systems in production. We know that good performance with one user in the lab does not guarantee good performance with 40 users in production.

But what if it doesn’t have to be this way? What if your database could, instead of optimizing each query separately, optimize the entire workload? So that with one or 40 or 256 users in the system we would still see very similar response times? What if the queries could share resources instead of competing for them?

All this is a rather lengthy introduction to a cool idea I ran into when scanning the program of the upcoming VLDB conference.

The paper I’ve read is called “A Scalable, Predictable Join Operator for Highly Concurrent Data Warehouses” and it is by George Candea, Neoklis Polyzotis and Radek Vingralek.

In the paper, the authors introduce CJOIN – a physical operator (i.e. an implementation of a relational operator) that can evaluate concurrent joins efficiently. It was written to allow sharing of CPU and IO resources and to fit modern DW systems – star schema, multiple cores, fast sequential scans and large memory.

The idea behind the CJOIN design is that there is a single physical plan that is “always on” and is optimized based on run-time statistics. Each new query can use this plan at any time and start sharing work with concurrent queries in the same plan.

Since all this sounds a bit vague, let me present the example given in the paper to demonstrate the idea. Then I’ll mention a few of the more interesting points detailed in the paper, and hopefully after reading my description you’ll decide to read the paper itself (which requires some effort on the part of the reader).

The design of CJOIN is based on a simple observation: star queries all work by filtering a fact table through dimension tables and aggregating the results.

CJOIN works as a pipeline – receiving the input from a continuous scan of the fact table, passing the data through a set of filters, one for each dimension table, and distributing the results to aggregation operators that produce the output for *all the queries* using the operator.

Since the scan of the fact table is continuous, a query can start using the operator at any time: it remembers the point at which it registered, and completes when the scan reaches that point again.

Suppose we have a fact table “Sales” with dimension tables “Customers” and “Products”.
Lets imagine the following two queries running concurrently:

Select sum(quantity) from sales, customers, products
where sales.customer_id=customers.customer_id and sales.product_id=products.product_id
and customers.city=’Japan’ and products.type=’High-Price’;

Select avg(dollar) from sales, customers, products
where sales.customer_id=customers.customer_id and sales.product_id=products.product_id
and customers.service_level=’Gold’ and products.type=’High-Price’;

As you can see, they share the same data source, but apply different filters and predicates.

Here’s how the CJOIN pipe will work:

The pipeline starts with a pre-processor, which receives rows from the continuous scan of the fact table and forwards them to the filtering part of the pipeline. Before doing so, the pre-processor adds a few bits to each row – one bit for every query that is registered on the pipeline (i.e. queries that are interested in rows from this fact table). All the bits start out as “1”, signifying that at this stage every query is interested in every row.

Now lets take a look at the filters:
We have a filter for each dimension table. The filter is a hash table that stores all the rows of that dimension that are of interest to any of our queries. Remember that while the fact table is too big to fit into memory, dimension tables are typically small enough to fit in the memory of a nice DW server. Like the fact rows, the filter rows also have an additional bit per query.

So in our example, the “customers” filter will contain all the customers from Japan and all the customers with service_level “Gold”. The rows for customers from Japan will have the first bit turned on and the second turned off; the rows for Gold customers will have the reverse, because only the first query checks for customers from Japan and only the second checks for Gold customers. The products filter will contain the products of type “High Price”, with both bits on, as both queries check for High Price products.

Note that when we start a new query, we need to add the appropriate rows from the dimension tables to the filters and remove them when the query is finished running. This is relatively quick because dimension tables are relatively small.

Now a row from the fact table arrives at the Customers filter. We quickly check the customer_id on this row and see if it matches any row in the filter. If it does, we know that at least one query wants this fact row. We can then check the query bits in the matching filter row to see which queries need it. If we see that only query 1 needs this fact row, then this row no longer interests query 2 and we can mark the second bit of the fact row as 0. If all query bits are marked as 0, we can throw the fact row away – no one will need it.

In this way the row from the fact table passes through all the filters and arrives at the distributor. The distributor receives fact rows that are relevant to at least one query in the current workload. It checks the bits to see which queries are interested in the row and sends it to the aggregators for those queries.
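The filtering logic is easy to sketch in code. Here is a toy Python simulation of the pre-processor, filters and distributor for the two example queries; the data and the bitmask encoding are my own simplification, not the paper’s implementation:

```python
# Dimension "filters": hash tables keyed on the join column, holding a
# per-query bitmask (bit i set = query i wants rows matching this key).
customer_filter = {
    "C1": 0b01,  # a customer in Japan   -> query 1 only
    "C2": 0b10,  # a Gold customer       -> query 2 only
}
product_filter = {
    "P1": 0b11,  # a High-Price product  -> both queries
}

fact_rows = [
    {"customer_id": "C1", "product_id": "P1", "quantity": 3, "dollar": 30},
    {"customer_id": "C2", "product_id": "P1", "quantity": 5, "dollar": 70},
    {"customer_id": "C3", "product_id": "P1", "quantity": 9, "dollar": 90},  # no query wants C3
]

def cjoin(rows, filters, n_queries):
    """Pre-processor tags each row with all-ones query bits; each filter
    ANDs in its bitmask; the distributor routes surviving rows per query."""
    per_query = [[] for _ in range(n_queries)]
    for row in rows:
        bits = (1 << n_queries) - 1             # pre-processor: all queries interested
        for key_col, table in filters:
            bits &= table.get(row[key_col], 0)  # filter: narrow or drop interest
            if bits == 0:
                break                           # no query needs this row; discard it
        for q in range(n_queries):
            if bits & (1 << q):
                per_query[q].append(row)        # distributor: send to query q's aggregator
    return per_query

q1_rows, q2_rows = cjoin(
    fact_rows,
    [("customer_id", customer_filter), ("product_id", product_filter)],
    n_queries=2,
)
print(sum(r["quantity"] for r in q1_rows))               # query 1: sum(quantity)
print(sum(r["dollar"] for r in q2_rows) / len(q2_rows))  # query 2: avg(dollar)
```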

Once you’ve got this example, you should be able to enjoy the paper. The paper actually contains this example, but with D1 instead of customers and d21 instead of Gold. I’m just a simple DBA, and I understand better with a more concrete example.
You’ll probably want to read the paper because it contains the algorithms for adding and removing queries from the pipe, so you’ll be convinced of how fast and clever this is.

The paper also contains a discussion of how best to parallelize this pipeline. Parallelization is very important in DW, and the paper offers several ideas and picks the best. It also has some ideas on how to handle concurrent updates to the tables, and ideas on how to adapt CJOIN to models other than the star schema.

Finally, the authors implemented their idea on top of the PostgreSQL database, and they include an extensive analysis of how CJOIN indeed improves performance for concurrent workloads (they seem to achieve almost linear growth in throughput as the number of concurrent queries grows!).

I hope you enjoyed this peek into the future of DBMS as much as I did and I hope to see CJOIN in Oracle soon 🙂

Latches, Spinning and Queues

You know that you care a lot about a topic, if you find yourself thinking about it again and again, each time hoping to gain enough insights to stop this cycle.

Turns out, I care a lot about what my CPUs are doing. Last time I came up with the epiphany that 100% CPU utilization is a bad idea. During the discussion that followed, Jonathan Lewis and Noons took the time to explain to me the difference between waiting and spinning.

The topic came up again as I dig into the worst ways concurrency can go wrong.

Concurrency becomes interesting when the concurrent processes attempt to access shared resources, and since Oracle has shared memory, the shared resources tend to be areas in the memory.

We are in 11g now, so we have 3 Oracle ways to protect memory structures – locks, latches and mutexes (Oracle mutexes, which should not be confused with OS mutexes). Below, I’ll summarize the main differences between them. Nearly everything I wrote (and a lot more, including cool examples) is covered by Tom Kyte’s Expert Oracle Database Architecture book. I’m just summarizing the important points below for my (and your) convenience.

When you read about latches, the first thing you hear is that “latches are lightweight locks”. Lightweight in this case means “takes fewer bits in memory”. Why do we care about our locking structures being small? A small memory footprint for the locking mechanism translates to faster checks and changes to it. Latches are smaller than locks, and the new 10g mutexes are even smaller than latches.

Locks work by queueing. One process holds the lock for a resource; everyone else who tries to access it queues up and goes to sleep (i.e. off the CPU). When the current process finishes, the next in line becomes runnable and now owns the lock on the resource. Queueing is nice because we have a bunch of queueing theory that lets us predict response times, waits and so on. It is also nice because while Oracle manages the locks and queues, it gives us tons of information about who is blocking and who is waiting. And one last nice thing: while all those processes are waiting for locks, they are not using CPU, nor do they add any CPU scheduling overhead.

Latches work by spinning (mostly). Think of a case where you know a process will need the memory structure for a very short amount of time. Do you really want to maintain queues, waste time on context switching and lose your CPU cache, all for just a few milliseconds of waiting? Latches exist for this reason. If a process tries to take the latch and it’s busy, it will keep retrying for a while, still using the CPU. If during this “retry time” the latch becomes free, the process can take the latch and we are saved from the need to context switch. If it doesn’t get the latch after several retries, the process goes off the CPU.

The important thing to note is that there is no queue. So there is a lot of uncertainty around when your process will finally get its latch. It is entirely possible that a process that started spinning on the latch later will get the latch first due to a fluke of luck. Because there is no queue, there seems to be no good way to find a list of processes that are waiting for a latch (maybe by looking at the statistics and wait tables?). You do have good information about how many requests, misses, spins and sleeps there are per latch, which is very useful.

It is interesting to see how Oracle attempts to prevent the situation where a process waits forever for a latch, and keeps missing it because newer processes keep snatching the latch away as soon as it is freed. When describing the “latch free” wait event, the documentation says: “The wait time increases exponentially and does not include spinning on the latch (active waiting). The maximum wait time also depends on the number of latches that the process is holding. There is an incremental wait of up to 2 seconds.” It is nearly the same mechanism Ethernet uses to keep machines from starving for network access (“truncated binary exponential backoff”). Incrementally increasing the wait times reduces the probability of collision.
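The backoff shape the documentation describes is easy to visualize. A tiny Python sketch (the starting wait and try count are made-up illustration values; only the doubling-up-to-a-cap behavior comes from the quote above):

```python
def latch_sleep_times(first=0.01, cap=2.0, tries=10):
    """Exponential backoff with a cap: each sleep doubles the previous
    one until it hits the 2-second maximum the documentation mentions."""
    waits, w = [], first
    for _ in range(tries):
        waits.append(w)
        w = min(w * 2, cap)
    return waits

print(latch_sleep_times())
```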

Mutexes are best covered by Tanel Poder. They are even smaller and faster to take than latches; they also work as a cursor pin (signifying shared or exclusive ownership of a cursor), and they give you even less information about who is waiting for what and for how long. You have information about sleep times, but not the number of requests and misses.

Margin of Error

A few weeks ago, I was at a friendly dinner party, discussing the upcoming elections, and specifically the results of recent voter surveys. One of the participants in the discussion said “I never pay attention to the error margins, since they apply to both candidates”.

I think he meant that if a specific poll said that 52% of the sampled voters preferred Obama and 46% preferred McCain, and the poll has a margin of error of 3%, then perhaps the “real” numbers are 55% for Obama and 49% for McCain, or maybe 49% Obama and 43% McCain, but it doesn’t really matter since the difference between them is constant.

This is, of course, very false, for three important reasons:

  1. The margin of error is 3%, which means that a result of 49% for Obama and 49% for McCain cannot be ruled out. It is possible that Obama has no lead at all. It is important to understand that a 0% difference between the candidates is just as likely as the 6% difference the poll results actually show. There is no statistical way to differentiate between these scenarios and both are just as real.
  2. A 3% margin of error actually means that there is a 95% chance that the “real” result is within 3% of the reported result (where “real” means what the results would be if the entire adult population had been polled with complete accuracy). Remember that around election time, many polls are published. 5% of them have a bigger error than they report. How big? We have no idea.
  3. The reported margin of error is correct assuming the sampling was perfect, which means that no one refused to answer questions, no one lied, the questions were not worded or ordered in a way that caused bias, the selection of the sample was not biased, etc. All these factors are likely to cause errors much larger than the theoretical sampling error, and what’s worse – we have no idea how big they can be or in which direction.
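For the curious, the reported margins come from the textbook formula for a sampled proportion. A quick Python sketch (the idealized formula only, ignoring all the real-world caveats in point 3):

```python
from math import sqrt

def margin_of_error(p, n, z=1.96):
    """95% margin of error for a proportion p estimated from a simple
    random sample of size n (z=1.96 is the 95% normal quantile)."""
    return z * sqrt(p * (1 - p) / n)

# a simple random sample of about 1,067 people gives the familiar +/-3%:
print(round(margin_of_error(0.5, 1067), 3))
```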

If you are really interested in the subject and not afraid of some mathematical notation, Terence Tao has a much deeper analysis of the subject.

Unusual IO activity on shared clusterware home

Sometimes problems exist in a system for years, but only become apparent when you prepare for a big change. This war story begins when our storage admin decided to replace our Netapp disks with new disks, twice as large. It is a cheap way to increase disk space and IO wait times.

While assessing the impact of this change, he found that the volumes holding the shared Oracle home for our RAC clusters were doing 6000 IO operations per second (IOPS). The data and redo volumes never exceeded 2000 IOPS, so 6000 is quite significant, especially on disks that should be practically idle.

Initial debugging showed that almost all the IO was neither reads nor writes, but things like “get attribute” and “access”. At this point I discovered that there is almost no way to get information about IO activity on NFS mounts. I could not see which processes were generating this activity, nor which files or directories it was hitting.

Time to get advice from the experts on Oracle-L. Vasu Balla of Pythian provided the solution:

“Oracle recommends using the noac or
actimeo=0 options when mounting NFS for datafiles, voting disk and OCR. Noac
means “no attribute cache” – none of the file attributes are cached in
the filesystem cache, which is very much needed for RAC. If you put your
shared oracle home also in a mountpoint which is mounted noac, every
access to a file in the oracle home requires a physical IO at the netapp. So
I recommend moving all software directories (db oracle home, asm oracle
home and crs oracle home etc.) to an nfs mount which is not mounted with noac
or actimeo=0.”

What a wonderful explanation. I now understand the issue and know what to do to solve it. It took me about 3 minutes to test this solution on our staging environment, and it worked like a charm.

Unfortunately, both Netapp and Oracle insisted that a shared Oracle home on Netapp must be mounted with actimeo=0, and that if this was causing me trouble, I should move to a local home instead of a shared one. Only after very long discussions with two experts from Oracle did I get a non-official confirmation that the official documentation is probably wrong and that mounting Oracle home with actimeo=0 is a bad idea.

To my surprise, my boss agreed to go ahead with the unofficial but working solution and change NFS mounts to remove “actimeo=0”.

So, we schedule downtime on our production RACs, and we change the mount options, and… Nothing happens. At all. 6000 IOPS before and after the change. If I wasn’t so shocked, I might have noticed my professional credibility taking a hit there.

Why didn’t it work in production? For weeks I had no answer. Until our network admin mentioned that I could use rpcdebug to get more insight into the issue. It turns out that NFS is RPC, and that Linux has flags for debugging RPC. By throwing magic numbers into /proc/sys/sunrpc/nfs_debug I could get NFS trace messages thrown into /var/log/messages. Now we were getting somewhere.

Except that it didn’t get me very far. I could see which devices NFS accessed, but I already knew that. I could see that our prod server had many, many calls to “getattr”, while our staging system didn’t. To complete my tests I decided to turn off attribute caching on staging again and compare the logs, just to see what it looks like when both systems are in the same state.

A strange difference caught my eye: the staging system had messages saying “NFS: Refresh_inode” which did not exist in production. A tiny difference, but maybe it has an impact? What does refreshing the inode mean? Time to go to lxr.linux.no and look at the Linux kernel code for clues. I just needed to recall which version to look at.

When the lightbulb went off it nearly blinded me. The staging system runs Linux 2.4.27; production runs 2.6.9. I was the one who pushed for the upgrade. I said “there are many NFS improvements in the new kernel versions.”

From here it was easy to find the change. In 2.4 the code for getting file attributes from the server looked like this:

 static inline int
 nfs_revalidate_inode(struct nfs_server *server, struct inode *inode)
 {
         if (time_before(jiffies, NFS_READTIME(inode)+NFS_ATTRTIMEO(inode)))
                return NFS_STALE(inode) ? -ESTALE : 0;
         return __nfs_revalidate_inode(server, inode);
 }

Which basically means – get the new attributes if the cache has timed out.

In 2.6 the code changed and the following check was added:

/* We may force a getattr if the user cares about atime */
       if (need_atime)
                err = __nfs_revalidate_inode(NFS_SERVER(inode), inode);
       else
                err = nfs_revalidate_inode(NFS_SERVER(inode), inode);

Which means that if the user cares about the file’s last access time (atime), we skip the cache-timeout check and force a get-attribute call to the server. Another IO operation, even if the cache did not time out.
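To make the difference concrete, here is a Python paraphrase of the two code paths (my own simplification, not real kernel logic):

```python
def should_fetch_attrs(now, read_time, attr_timeo, need_atime, kernel):
    """Paraphrase of the two revalidation paths: 2.4 only asks whether
    the attribute cache timed out; 2.6 also forces a server getattr
    whenever the caller cares about atime."""
    cache_expired = now >= read_time + attr_timeo
    if kernel == "2.6" and need_atime:
        return True   # forced round-trip, even with a fresh cache
    return cache_expired

# fresh cache (1 tick old, 30-tick timeout), caller wants atime:
print(should_fetch_attrs(100, 99, 30, need_atime=True, kernel="2.4"))  # False
print(should_fetch_attrs(100, 99, 30, need_atime=True, kernel="2.6"))  # True
```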

Luckily, the fix is also easy. Just add “noatime” to the NFS mount options, to let the kernel know that we don’t care about access times, so it can go back to using the cache.

So easy once you know what to look for!

How Many Parameters can Rank() Take?

Laurent Schneider and I were discussing the analytics chapter of his nearly finished book, when he casually mentioned that Rank() can take an unlimited number of parameters. “Wow!” I said, “That’s cool. But how many parameters can it take before something crashes?”

I checked on my test 11g system, running on a puny Windows XP machine.

Starting with 80 parameters, everything works fine:

SQL> select
2    rank(
3  1,2,3,4,5,6,7,8,9,0,1,2,3,4,5,6,7,8,9,0,1,2,3,4,5,6,7,8,9,0,1,2,3,4,5,6,7,8
4    ) within group (order by
5  1,2,3,4,5,6,7,8,9,0,1,2,3,4,5,6,7,8,9,0,1,2,3,4,5,6,7,8,9,0,1,2,3,4,5,6,7,8
6  ) x
7  from dual
8  /


160 parameters also worked fine, and 240 worked fine too, returning results within two seconds.

So I tried with 320. It had been running for almost 15 minutes when I decided to also check 310 and 300. At that point Oracle took 100% CPU and my mouse could barely move. I had to kill Oracle. You could say that three processes, each with slightly over 300 parameters, crash Oracle 🙂

I continued donating my CPU to scientific discovery and reached interesting results:

At 255 parameters Rank still works fine. At 256, Oracle seems to go into an infinite loop – it never returns an answer and CPU climbs to 100%. Suspicious numbers, I’d say.
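If you want to repeat the experiment without typing 300 literals by hand, a few lines of Python can generate the statement (a hypothetical helper; any scripting language would do):

```python
def rank_query(n):
    """Build a RANK(...) WITHIN GROUP (ORDER BY ...) statement with n
    literal parameters in each list, for probing the limit."""
    params = ",".join(str(i % 10) for i in range(1, n + 1))
    return (f"select rank({params}) within group (order by {params}) x"
            f" from dual")

q = rank_query(256)  # the suspicious boundary from the experiment above
print(len(q))
```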

Just Return Any Random Row

Sometimes I see a developer try to run something like “select deptno,sal from emp group by deptno”. I usually ask the developer “But employees in the department have many different salaries. Which one do you want?”, and sometimes I get an amazing answer: “I don’t really care, I just want to see any random salary”. Usually, I tell her to aggregate with “max” or something similar, if she doesn’t care about the result.

But today I was really annoyed. So I wrote a custom aggregation function that will return a random salary.  I admit, it is not as useful as str_agg, but there seem to be few developers who are interested in this feature.

The main challenge was to make the aggregation truly random. When you aggregate, you always have the current aggregation value and a new one. I have to randomly choose one of them – but I can’t make it a simple 50/50 selection.

Suppose I have three rows. The way aggregation works, I first take two rows and flip a coin to pick one. Now I have a current value – and I have to take the third row and decide whether to keep the current value or the new one. I can’t flip the coin again, because if the third row has a 50% chance of being selected, the first and second rows only have a 25% chance each. Not fair. So I need to give the third row a 1/3 chance, and the current value 2/3.
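This keep-the-new-row-with-probability-1/n rule is classic reservoir sampling (with a reservoir of size one). A quick Python sketch of the same logic, with a frequency check to show each row really does win about a third of the time:

```python
import random
from collections import Counter

def random_row(rows):
    """Single pass: the i-th row (1-based) replaces the current pick
    with probability 1/i -- the same rule the aggregate function's
    running_count comparison implements."""
    pick = None
    for i, row in enumerate(rows, start=1):
        if random.randrange(i) == 0:   # True with probability 1/i
            pick = row
    return pick

random.seed(1)  # fixed seed just to make the demo repeatable
counts = Counter(random_row([10, 20, 30]) for _ in range(30000))
print(counts)  # each value should win roughly 10,000 times
```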

create or replace type agg_t as object (

    curr_value number,
    running_count number,

    static function ODCIAggregateInitialize(sctx  in out agg_t)
                    return number,

    member function ODCIAggregateIterate   (self  in out agg_t,
                                            new_value in number)
                    return number,

    member function ODCIAggregateTerminate (self         in     agg_t   ,
                                            return_value    out number,
                                            flags        in number      )
                    return number,

    member function ODCIAggregateMerge(self in out agg_t,
                                       ctx2 in agg_t    )
                    return number
);
/

create or replace type body agg_t is

    static function ODCIAggregateInitialize(sctx in out agg_t)
        return number is
    begin
        sctx := agg_t(null,0);
        return ODCIConst.Success;
    end;

    member function ODCIAggregateIterate(
      self in out agg_t, new_value in number)
        return number is
    begin
        running_count := running_count + 1;
        -- keep the new value with probability 1/running_count
        if (mod(dbms_random.random,running_count)=0) then
            curr_value := new_value;
        end if;
        return ODCIConst.Success;
    end;

    member function ODCIAggregateTerminate(self in agg_t,
        return_value out number, flags in number) return number is
    begin
        return_value := curr_value;
        return ODCIConst.Success;
    end;

    member function ODCIAggregateMerge(self in out agg_t,
        ctx2 in agg_t) return number is
    begin
        -- adopt the other context's value with probability
        -- proportional to the number of rows it has seen
    	if (mod(dbms_random.random,running_count+ctx2.running_count)<ctx2.running_count) then
            curr_value := ctx2.curr_value;
    	end if;
        running_count := running_count + ctx2.running_count;
        return ODCIConst.Success;
    end;

end;
/
create or replace function agg_random (input number) return number
    parallel_enable aggregate using agg_t;
/

So easy! I love user defined aggregations! But you have to be careful when writing them. I accidentally replaced "number" with "varchar" somewhere in the code and got a lovely error message when I tried to run it:
SQL> select deptno,agg_random(sal) from emp group by deptno;
select deptno,agg_random(sal) from emp group by deptno
ERROR at line 1:
ORA-03113: end-of-file on communication channel
Process ID: 2104
Session ID: 138 Serial number: 1496

And in the alert log: Exception [type: ACCESS_VIOLATION, UNABLE_TO_WRITE] [ADDR:0x0] [PC:0x3D31168, ___intel_new_memcpy()+40]

I segfaulted! And you can imagine how much fun it was to debug my PL/SQL code using the trace file...

Anyway, after all the debugging is done, its time to show the code to my developer:

SQL> select deptno,agg_random(sal) from emp group by deptno;

    DEPTNO AGG_RANDOM(SAL)
---------- ---------------
        10            2450
        20            2975
        30             950

SQL> select deptno,agg_random(sal) from emp group by deptno;

    DEPTNO AGG_RANDOM(SAL)
---------- ---------------
        10            1300
        20            2975
        30            1600

SQL> select deptno,agg_random(sal) from emp group by deptno;

    DEPTNO AGG_RANDOM(SAL)
---------- ---------------
        10            2450
        20            2975
        30             950

SQL> select deptno,agg_random(sal) from emp group by deptno;

    DEPTNO AGG_RANDOM(SAL)
---------- ---------------
        10            2450
        20            3000
        30            2850

But I still wasn’t happy. What if I have a bug and the selection is not random enough? After all, the values in the example seem a bit repetitive.

No problem. I’ll use my function on a table containing the numbers 1 to 99, pick a few random values, and then use the well-known chi-square test to check whether my random selection matches the uniform distribution.

SQL> create table r1 as select level l from dual connect by level<100;

Table created.

SQL> create table r2 (n number);

Table created.

SQL> insert into r2 select agg_random(l) from r1 group by ();

1 row created.

-- repeated 20 times

SQL> select count(*) from r2;

  COUNT(*)
----------
        20

SQL> DECLARE
  2   sig   NUMBER;
  3   alpha NUMBER := 1;
  4   beta  NUMBER := 99;
  5   ttype VARCHAR2(20) := 'CHI_SQUARED';
  6  BEGIN
  7    dbms_stat_funcs.uniform_dist_fit('SCOTT', 'R2', 'N', 'DISCRETE', ttype, alpha, beta, sig);
  9    dbms_output.put_line(sig);
 10  END;
 11  /
X-squared value : .9000000000000000000000000000000000000026 
Degree of freedom : 18.0000000013919621366679664

PL/SQL procedure successfully completed.

90% probability of getting my result assuming the function is good. I’m so happy. I knew that getting a degree in statistics was not a complete waste of time 🙂