Concurrent Joins – Coming Soon to a DW near you?Posted: August 5, 2009 Filed under: hardcore, musing, nerdism, optimizer 11 Comments
Oracle’s Concepts book says the following about the characteristics of Data Warehouses:
“A typical data warehouse query scans thousands or millions of rows.For example, “Find the total sales for all customers last month.””
They also say:
“Data warehouses are designed to accommodate ad hoc queries. You might not know the workload of your data warehouse in advance, so a data warehouse should be optimized to perform well for a wide variety of possible query operations.”
So we have multiple ad-hoc queries, each with its own plan, unaware of other similar queries that may be running at the same time, each reading millions of rows. Obviously they compete for resources. What if several concurrent queries are full-scanning the same table? What if this table is too large to fit entirely into memory?
It sounds obvious that in this case, all the separate execution threads will read their own blocks into memory, often “aging out” blocks that will be needed in just few seconds by another query. All competing for buffer-cache latches and blocks. The more queries we have concurrently on the system, the slower response times will be, due to competition on limited memory and IO resources.
Jonathan Lewis famously said: “The performance of a query should be related to the size of the data set you’re interested in, not to the size of the database”
I would like to add “… or to the number of concurrent users in the system”, but this is obviously untrue. Resource contention when the number of users rises has dramatic negative impact on performance. This is why we run load tests before putting new systems in production. We know that good performance with use in the lab does not guarantee good performance with 40 users in production.
But what if it doesn’t have to be this way? What if your database could, instead of optimizing each query seperately, could optimize the entire workload? So that with one or 40 or 256 users in the system we will still see very similar response times? What if the queries could share resources instead of competing for them?
All this is a rather lenghthy introduction to a cool idea I’ve ran into when scanning the program of the upcoming VLDB conference.
The paper I’ve read is called “A Scalable, Predictable Join Operator for Highly Concurrent Data Warehouses” and it is by George Candea, Neoklis Polyzotis and Radek Vingralek.
In the paper, the authors introduce CJOIN – a physical operator (i.e. an implementation of a relational operator) that can evaluate concurrent joins efficiently. It was written to allow sharing of CPU and IO resources and to fit modern DW systems – star schema, multiple cores, fast sequential scans and large memory.
The idea behind the CJOIN design is that there is a single physical plan that is “always on” and is optimized based on run-time statistics. Each new query can use this plan at any time and start sharing work with concurrent queries in the same plan.
Since all this sounds a bit vague, let me present the example that is given in the paper to demonstrate the idea. Then I’ll mention few of the more interesting points that are detailed in the paper, and hopefully after reading my descriptions you’ll decide to read the paper (which requires some effort on the part of the reader).
The design on CJOIN is based on a simple observation: Star queries all work by filtering a fact table through dimension tables and aggregating the results.
CJOIN works as a pipeline – receiving the input from a continuous scan of the fact table, passing the data through a set of filters, one for each dimension table, and distributing the results to aggregation operators that produce the output for *all the queries* using the operator.
Since the scan of the fact table is continuous, a query can start using the operator at any time, by remembering the point it registered and completing when the scan reaches this point again.
Suppose we have a fact table “Sales” with dimension tables “Customers” and “Products”.
Lets imagine the following two queries running concurrently:
Select sum(quantity) from sales, customers, products
where sales.customer_id=customer.customer_id and sales.product_id=products.product_id
and customers.city=’Japan’ and products.type=’High-Price’;
Select avg(dollar) from sales, customers, products
where sales.customer_id=customer.customer_id and sales.product_id=products.product_id
and customers.service_level=’Gold’ and products.type=’High-Price’;
As you can see, they share the same data source, but apply different filters and predicates.
Here’s how the CJOIN pipe will work:
The pipeline starts with a pre-processor, which receives rows from the continuous scan of the fact table and forwards them to the filtering part of the pipeline. Before doing so, the pre-processor adds few bits to each row – one bit for every query that is registered on the pipeline (i.e. queries that are interested in rows from this fact table). All the bits start out as “1”, signifying that at this stage all queries are interested in every row.
Now lets take a look at the filters:
We have a filter for each dimension table. The filter is a hash table that stores all the rows of that dimension that are of interest to any of our queries. Remember that while the fact table is too big to fit into memory, dimension tables are typically small enough to fit the memory of a nice DW server. Like the fact rows, the filter rows also have an additional byte per query.
So in our example, the “customers” filter will contain all the customers from Japan and customers with service_level “Gold”. The rows for customer from Japan will be have the first bit turned on and the second turned off, the row for Gold customers will have the reverse, because only the first query checks for customers from Japan and only the second checks for Gold customers. Products filter will contain the products of type “High Price” and both bits will be on, as both queries check for High Price products.
Note that when we start a new query, we need to add the appropriate rows from the dimension tables to the filters and remove them when the query is finished running. This is relatively quick because dimension tables are relatively small.
Now a row from the fact table arrives at the Customers filter. We will quickly check the customer_id on this row and see if it matches any row in the filter. If it exists, we know that at least one query wants this fact row. We can then check the query bits in the matching filter row to see which query needs it. If we see that only query 1 needs this fact row, then this row no longer interests query 2 and we can mark the second bit of the fact row as 0. If all query bits are marked as 0 , we can throw the fact row away. No one will need it.
In this way the row from the fact table passes through all the filters and arrives at the distributor. The distributor recieves fact rows that are relevant for at least one query in the current work load. It checks the bits to see which queries are interested in this row and sends it to the aggregators for these queries.
Once you got this example, you should be able to enjoy the paper. The paper actually contains this example, but with D1 instead of customers and d21 instead of Gold. I’m just a simple DBA and I understand better with a more concrete example.
You probably want to read the paper because it contains the algorithms for adding and removing queries from the pipe, so you’ll be convinced of how fast and clever this is.
The paper also contains a discussion of how to best parallelize this pipeline. Parallelization is very important in DW, and the paper offers several ideas and picks the best. It also has some ideas on how to handle concurrent updates to the tables, and ideas of how to adapt the CJOIN to other models except star schema.
Finally, the authors of the paper implemented their idea on top of PostgreSQL database, and they have extensive analysis of how the CJOIN indeed improve performance for concurrent workload (They seem to achieve almost linear growth in throughput as the number of concurrent queries grow!).
I hope you enjoyed this peek into the future of DBMS as much as I did and I hope to see CJOIN in Oracle soon 🙂
“I hope to see CJOIN in Oracle soon”
It would be a pain reconciling it with Oracle’s read-consistency model. You’d want to have all the participating queries running as of the same SCN.
The paper actually gave some suggestions on how to match SCNs to queries and have queries with different SCNs use the same pipeline. It does sound painful 🙂
The CJOIN will probably work better in a “load data once a day” environment, if such still exist.
I think some shared nothing db’s like Teradata are already doing something like this. ie if a scan is happening on a node, another process on the node can use the same scan as it’s happening. It might just be marketing speak though.
Teradata could be doing something very similar (there are other known methods for plan sharing) but probably not exactly the same – the paper seems too new for Teradata to implement this already.
If you hadn’t implied it would take quite a while to read the paper, then I would read the paper to answer my question. But I’ll ask you instead. Is the goal of the method in the paper to minimize operating system reads only? Or will it significantly reduce CPU too? The reason I ask is that I/O capacity is not nearly as big of a problem as CPU capacity.
The paper is a bit longish, but I mostly implied that it will take a while to read for someone unused to mathematical notation. I’m not sure it includes you.
In any case, it appears that the main goal of the paper is to make query times more consistent. So that one user running the query and 256 users running the same query will experience very similar response times.
The paper said very little about conserving CPU resources – I believe there will be savings under high concurrency (more concurrent queries than you have cores), but probably not before that: On one hand CJOIN is essentially doing nested join, which is CPU intensive. On the other hand, adding more queries will not require more resources (unless they add a new filter). CJOIN also seems to use multiple-cores more efficiently than traditional parallel execution plans, so depending on your system, this can be a big deal.
Figure 2 does a good job of answering my question. Workload sharing is the term I should have used. If the workload can be shared, then throughput will increase. Figures 5 and 6 indicate some of their tests were able to leverage the sharing of work.
However, as one would expect, when workload cannot be shared or when selectivity (s) increases, then CJOIN benefit decreases. Figure 7 indicates big changes for s between 0 and 10. Why did the authors not show the figure for s between 0 and 100? Could it be that as s approaches 100 there is no difference between any of the three systems compared?
The paper explains explicitly why they did not look at selectivity larger than 10%:
“As noted above, the performance of CJOIN decreases significantly
for higher values of s. Essentially, the dimension hash tables
have to hold an increased number of tuples, which has adverse
effects on cache locality and hence access times. Moreover, as we
explain below, the overhead of submitting new queries grows substantially,
which contributes to the slow down of the operator.”
I suspect that when s>10% CJOIN is far slower than traditional join (which probably uses hash-joins and not nested loops).
Thanks for that peek into DB-research. It is sometimes good to have a look over the limited border of existing systems.
Exactly! I love this kind of practical research (as opposed to the highly theoretical stuff) and wanted to share with my readers 🙂
I thought I had read something similar to this before. So I checked out the references at the end of the paper.
Qpipe does something similar to this.
I hope the paper leads to something more than a paper or two and nothing in the way of something tangible to install and play around with.