I'm just a simple DBA on a complex production system

Writing about all things production. Especially Oracle databases.

First International NoCoug SQL Challenge – And the Winner is… July 31, 2009

Filed under: musing, nocoug, sql — prodlife @ 4:45 am

Iggy Fernandz invented an impossible challenge.
NoCoug and Apress sponsored an international competition.
9 SQL experts from all over the world submitted clever solutions.
SQL Guru Dan Tow reviewed the solutions and picked his favorite.

Alberto Dell’Era wins the challenge for his wonderful solution using Discrete Fourier Transforms; the runner-up is André Araujo from Australia, who used binary arithmetic and common table expressions in his solution.

The Order of the Wooden Pretzel will be bestowed on Alberto but the real prize is six books of his choice from the Apress catalog. André will receive a prize of six e-books of his choice.

Congratulations to the winners! Iggy will contact you soon to arrange for the delivery of your bounty. Thanks to everyone who participated. We’ll see you again in April for the second SQL challenge. BTW. If you have good SQL riddles – something fun and difficult, send them my way. Who knows, maybe you’ll be featured as the next challenging wizard.

Which reminds me: Last April, when Iggy presented his challenge, he told me that he wrote his own highly efficient solution to the problem. Now that the challenge is over and the winners have been announced – it is high time he will present his own solution to the world. Please go to his blog and bug him a bit about this. I’m sure a little peer pressure will make him see the light. Thanks :)

NoCoug will publish Iggy’s review of the winning result in the NoCoug Journal. If it’ll be posted online I’ll link to it. Iggy’s article  includes an explanation of the decision by Dan Tow and also an explanation of the mathematics behind Alberto’s solution that I actually almost kind-of understood after reading both Alberto’s and Iggy’s explanations for around 4 hours one afternoon. My university would take my degree back if they knew, I’m sure.

One thing that bugs me is that Alberto’s winning solution is not very interesting from a SQL perspective. Note that the winning solution is the DFT solution which uses cartesian joins and subquery factoring, not the FFT solution which uses SQL Model. Dan Tow said that Model is iterative and therefore violates the spirit of SQL – which is about sets and relations. I can understand his sentiment, except that Model is part of the SQL standard.

So the winning solution is just plain SQL. It is algorithmically brilliant and I completely admire Alberto for coming up with it, but still just plain (very long) SQL. And I was wondering – this is a SQL challenge and not a math challenge. Shouldn’t the winning solution demonstrate more SQL brilliance and less math brilliance?

On the other hand, the winning solution demonstrates one of the most important principles of programming – Using a clever algorithm (such as FFT) will give you performance and scalability that no amount of query and database tuning will ever achieve. So in away its light use of SQL features does make it a good win for a SQL challenge.

My personal favorite solution is Fabien (Walder) Contaminard’s. He still uses more math than SQL. He uses the multinomial distribution (which unlike’s Alberto’s math I actually understood). I thought it was a simple, straight forward, readable and very clever solution. Of course, it was slower than the winning entry, but he did find some interesting differences between SQL Server and Oracle while trying to optimize it.
Here’s a summary of all submitted solutions:

  1. Laurent Schneider’s solution using CONNECT BY and XMLQUERY (Switzerland).
  2. Rob van Wijk’s solution using the MODEL clause (Netherlands).
  3. Vadim Tropashko’s solution using 11gR2 recursive joins This entry should win the shortest solution award. (USA).
  4. Craig Martin’s solution using CONNECT BY and logarithms (USA).
  5. Alberto Dell’Era’s two solutions using Fourier transforms (Italy).
  6. Fabien (Waldar) Contaminard’s solution using the multinomial distribution (France).
  7. Almost anonymous Postgres solution using pipeline functions (Romania).
  8. André Araujo’s solution using binary arithmetics and subquery factoring (Australia)

Thanks for playing. See you all next Year. Don’t forget to send me your SQL riddles so I can pick the best for the next challenge (and spend long weekends trying to solve them all!)

 

More Troubleshooting Monsters – Correlation and Causation July 24, 2009

Filed under: musing — prodlife @ 2:51 am

We have this application that hangs for few minutes almost every hour. Why? No one knows. But naturally, the database is a suspect.

The developers responsible for this application requested an AWR report of the last hour. No problem. AWR report shows a significant amount of time spent waiting for locks on a specific table. It happened to be a timesheet table. So the developers say “Aha! We found the cause of the problem – contention on the timesheet table!”.

Oh no, I say. You did not find the cause of THE PROBLEM, you found the cause of A PROBLEM. What you are now looking at is our usual Monday morning slowdown. Timesheets are due at 12pm, so everyone is inserting their timesheets at 10am, causing contention and slowdown. I’ll be forever gratefull if you fix it, but it is unrelated to THE PROBLEM of the application hangs. Here are few AWR reports from Tuesday morning, when the application still hangs, but we no longer have any locks.

It took me about 4 conference calls and 3 emails to get the following point across: Even if the locks occured at every hang, you can’t be completely sure they are the cause for the hangs. When the locks occure only once a week and the hangs occure every hour, you can be quite sure that the locks don’t cause the hang. If they did cause the hang, they would appear at every hang.

Tanel Poder is going to give his Advanced Troubleshooting at NoCoug in November (More about this soon!). Some people asked me – how do I know if I’m advanced enough for the course. Well, being able to understand correlation and therefore avoid jumping to conclusions about what caused an event would be a great start.

 

When-Others-Then and other Troubleshooting Monsters July 16, 2009

Filed under: rants — prodlife @ 8:15 pm

I’m definitely having one of these weeks.

We have this insanely complicated, highly visible, ultra high priority project going live this this Wednsday. The DBA who worked on this project nearly full time for the last 6 month left on 3 weeks vacation on Monday. I’m the replacement.

Tuesday morning, I’m on a conference call. The night job failed due to an issue with “dblink”. Must be fixes ASAP or the go-live will be delayed. I was all “Wait. I’m just a simple DBA. What dblink are you talking about?”.

So the morning was spent with me trying to slowly work my way through huge amounts of night-job code. While everyone around me was running in circles screaming. Mostly screaming “network” and “dblink”.

You know what I found out? That we have no clue why the night job failed. Because at the end of every single procedure in the job was code that said:

EXCEPTION
 WHEN OTHERS
 THEN
 INSERT INTO NIGHT_JOB
 (status, last_run_stmp)
 VALUES ('failed', SYSDATE);
 COMMIT;

The job failed. What more information could you possibly want?

And the funny thing that happens when you have such useful error messages, is that everyone starts developing theories about why the failure occurred. Maybe someone tested the dblink few hours after the failure occurred, noticed that the remote site is down, and decides that this is the issue. Someone else tries rerunning some of the code and gets “unique constraint” error, so he guesses that this is the issue. But of course, its all guesses after the fact. No one can know what caused the night job to fail on its original run.

Trying to troubleshoot an issue based on guesses about what was the error is a complete waste of time – what are the chances that you’ll find the real issue and that the next time the night-job runs everything will work?

Proper error handling is a complex topic – which exceptions can be handled locally and which should be raised? At which levels do we trap exceptions? These questions are a topic of much discussions among developers and architects. But the basics of error handling should not be ignored – When an error occurs,  we must know what was the error.

 

Optimal Page Replacement Algorithm July 7, 2009

Filed under: musing — prodlife @ 4:36 am

I was reviewing memory management basics this weekend (don’t ask), and I ran into this “algorithm” that totally cracked me up.

The idea behind all page replacement algorithms is that when the physical memory on a machine is completely full, but a process tries to load something into the memory, something else that is currently in the memory has to be written to disk and removed from memory to make space for the new data. The problem is – which page should be taken out of memory and written to disk?

This should sound quite familiar. Oracle’s buffer cache faces the same problem when you read a block but the cache is full. Oracle solves this problem by maintaining a list of all the blocks in the cache, ordered (roughly) from the most recently used (MRU) to the least recently used (LRU). When it needs to throw out a block, it will pick one of the least recently used blocks. Makes sense, if a block was used recently, there may be someone who still needs it, so throwing it out can be wasteful because we may need to read it back in again soon. On the other hand, if the block wasn’t used recently, perhaps we no longer need it at all, so why waste space?

So this particular method is pretty good, but is it optimal?

A smart scientist called Laszlo Belady came up with the optimal page replacement algorithm – when a page has to be swapped out, we should swap out the page whose next use will occur farthest in the future. So we prefer to swap out a page that will be used in 600 seconds and not in 3 seconds. Optimal indeed.
I’m sure that later on Belady came up with the optimal stock trading algorithm – buy stocks whose value will increase most in the future, and sell those whose value will decrease. The guy is a genius.

Thankfully, the optimal algorithm is useful even for operating system designers who lack regular access to a crystal ball. The idea is that for a given set of processes and actions, you know what the optimal algorithm will do. Then you can compare several practical algorithms (Say Oracle’s vs. MySQL’s) and see which one is closer to doing the optimal thing in your case. Or if someone suggests a modification o Oracle’s algorithm, we can know if the modification makes the system closer to optimal. Its a benchmark of sorts.

BTW. If you enjoy memory management, you should probably read about the paging game.

****

The cool guys from HotSos put up  a video from their symposium on Youtube. I’m there right at the beginning and its my first youtube appearance. Maybe you want to take a look :) While there, I noticed they put up a bunch of videos demonstrating the use of their products. Pretty neat – soon screenshots will be a thing of the past. Oh, and while looking around in their website, I noticed they already put up their 2010 call for papers. Which is funny because its way early and also no one mentioned it anywhere. Don’t panic – you have until September 30 to send your papers :)