CPU Cache – For Serious Nerds

Posted: December 14, 2007
For DBAs, memory access is the fastest thing in the world. Disk IO is slow and evil; having the data in memory is fast and good. Last night I had the opportunity to listen to a talk by Ulrich Drepper of Red Hat. This man is obsessed with performance in a way I’ve never seen before. His mission in life is to utilize every CPU cycle, never wasting precious CPU time waiting for anything – especially not memory. How can you not like someone like this? Of course the talk was fascinating, and I was furiously scribbling notes for two hours. The rest of this post is those notes.
CPUs are fast and are getting faster. Memory, on the other hand, is slow. Each main memory access takes about 240 core cycles to complete, which is an incredible waste of resources.
To make matters worse, there are two types of memory these days: DRAM and SRAM. DRAM is relatively cheap and takes less energy and less real estate than SRAM, but it has a physical limit on its speed – each cell needs to discharge and recharge, and this takes time. SRAM is a huge energy hog, but it has no such theoretical limit on its speed. For energy and real estate reasons, SRAM can never be used as the machine’s main memory, so DRAM is used.
The current solution for the discrepancy is using CPU Caches. They are extra bits of fast memory that sit between the CPU and the main memory with the intention to speed things up. The use of these caches is transparent to the application developers (although they can do a lot to optimize cache usage), and requires only minor changes in the OS, as they are controlled by the CPU and the related chipset.
An important part of this solution is that there are several layers of caches, with different sizes and speeds. Usually we are talking about two or three such layers, and the closer a layer is to the CPU, the faster – and smaller – it is.
CPU Layer (AKA registers) – Cache so fast it takes less than one core cycle to access it.
L1 – These days there are two L1 caches – one for code instructions and the other for program data. That’s because programs don’t modify their own instructions while running (at least not often), and separating the read-only instructions makes things faster. If you use an interpreted language such as Python, your code will be kept in the data cache and you will lose this optimization.
Accessing L1 cache takes about 3 core cycles, which is fast enough to be irrelevant. CPUs use pipelines, so by the time the instruction actually runs the data will already be in the CPU.
L2 – Takes about 14 cycles to access. If you have multiple cores, this layer will be shared between all of them.
Main memory – As we mentioned before, this takes about 240 cycles to get to. This is a huge gap. It turns out that if a machine needs to be really, really fast, like a backbone router or a Cray, it will not contain any DRAM at all.
At this point Ulrich talked at length about how memory is addressed in the cache. The L2 cache works with physical addresses, just like real memory.
L1 cache works a bit differently. It uses something called Cache Lines. Each cache line holds 32 or 64 bytes of information, which can belong to different variables. Cache lines are addressed by a hash key and a tag. The hash key can match up to four different cache lines, and the correct line is then chosen by comparing tags. Ulrich showed that this is much faster than direct mapping, where each key matches exactly one line.
The nerdiest part of the talk is now behind us, and now Ulrich talks about different programming practices and how they impact performance by allowing greater or lesser cache utilization. He did amazing benchmarks and showed us very cool graphs that demonstrate these ideas visually. I’ll just summarize the ideas.
Cache usage recommendation for C++ developers:
- Sequential memory access is almost 10 times faster than random access (no surprise for DB developers, I hope). In your data structures, this means preferring arrays over linked lists. In your code, it means avoiding indirect calls – C++ developers should avoid virtual functions and use templates instead.
- Keep data structures as small as possible. There is a huge speed penalty for data elements that are larger than a cache line, and if they are larger than a memory page it gets much worse.
- Pre-fetching data (reading it into the cache before it is actually used) can hide much of the cost of main memory access. C developers can force a pre-fetch by touching the variable before it is actually needed, or by issuing an appropriate prefetch instruction.
- Use structures to keep together variables that are used together. This will help you get them on the same cache line.
- L3 cache is used in high-end servers. A large L3 cache, say 32 MB, can give you tremendous speedups – much more than a faster CPU will – which is why servers with large L3 caches are so expensive. Ulrich gave an example of having Oracle fit entirely into the L3 cache, which would make it 10 times faster and justify the cost. I burst out laughing, because to me fitting the DB into 32 MB of memory is ridiculous – we need 12 GB for the SGA! But the rest of the audience took this seriously, so maybe I missed something.
- Multi-threading. Generally, a very bad idea for cache utilization: if a piece of memory is in the L1 cache of CPU #1, CPU #2 can’t use it until #1 releases it; #2 then reads it from main memory, and once #2 modifies it, #1’s cached copy is invalidated. So if you use multi-threading carelessly, your expensive L1 cache is wasted. This part reminded me of the way cache fusion works in RAC.
You can find the slides for the talk here, and the paper that the slides are based on here. Ulrich claims that the talk contains only about 1% of the content of the paper. I encourage you to at least look at the slides, because the graphs demonstrating the recommendations make a powerful impact.
According to Ulrich, Red Hat and Oracle are working hard so that our DB code will be as optimized as humanly possible, making the most of the CPU caches and the related speedups. Which is a nice thing to keep in mind – while we work hard to optimize our applications, our queries, and the DB itself, someone is working hard to make sure that everything below that level is already as optimal as possible.