I'm just a simple DBA on a complex production system

Writing about all things production. Especially Oracle databases.

Mad Troubleshooting Skillz! November 24, 2009

Filed under: musing — prodlife @ 3:06 am

Last week I attended Tanel Poder’s Advanced Oracle Troubleshooting seminar, organized by NoCOUG.

Well, actually it was organized by me, with lots of help from Iggy and the rest of NoCOUG. Organizing a seminar was not trivial, but wow – it was totally worth it!

When the seminar was over, we asked the attendees to fill a small survey and tell us what was good and what was bad. Turns out that there was just a single answer for “What was the best thing about the seminar”. The answer is – Tanel.

He is clearly an expert. He loves what he is doing. He can think on his feet and answer all sorts of more-or-less related questions from the audience. He is extremely generous with his time and his knowledge and his scripts (I know lots of DBAs who guard their scripts with their lives, Tanel happily shares them in his blog). Tanel is also funny, entertaining and the course is pretty well structured. He also talks very fast (and seems somewhat obsessed about not wasting a single millisecond) – you won’t believe how much you’ll learn in two days.

I had some trouble explaining explaining to my team and boss what I learned:
“Well, the first half day was troubleshooting hangs and slowdowns using v$views”
“But you knew that before!”
“We also learned lots of Oracle internals. How the shared pool really works, and how SQL is really processed. Lots of cool stuff.”
“OK, but what is it good for?”
“Knowing how stuff works is always good. Anyway, I also learned to use Unix tools to debug problems. Like dumping process stuck to see where it hangs! Also, we learned how to handle free memory issues. Remember that awful leak we had on that test server?”
“No.”
“Oh! Look! Shiny scripts!”

But the proof of the pudding is in the eating. No one can argue the fact that last week I already managed to troubleshoot and solve two problems that other team members failed to make much progress on. I did it very quickly too.
Now here is the strange thing – the two problems were in areas of Oracle that Tanel very explicitly did not mention during the seminar. Streams and Clusterware. I did not even use any of his scripts to shoot them. And yet I’m still convinced that the reason I was so effective in solving those problems is directly related to the seminar. How is that? Here are the important non-technical things I learned at the seminar:

  1. Systematic approach – You don’t work off lists, you don’t waste time by looking at random places and you don’t guess (much). You gather symptoms, you use them to pinpoint the problem and you use the pinpointed knowledge to work your way toward a solution. The last part sometimes involves Oracle support. I knew about the systematic approach thing before, but two days of looking at someone demonstrating it makes a difference.
  2. Don’t believe anyone (except the OS) – Users lie, other DBAs lie, even Oracle sometimes lies. Always crosscheck and double check the facts. No one lies intentionally, but the result is still misleading.
  3. Problems have causes. I know it sounds funny, but very often we stop troubleshooting too soon, attributing a problem to mysterious unknown forces or at least say “well, I don’t know how to know this” and leave things at that. Tanel went farther than anyone I’ve ever seen by saying “I have to know why this behaves like that” and when Oracle doesn’t tell him, he goes to the OS, or the network, or writes his own tools. Thats a good lesson – don’t take no for an answer.
  4. All DBAs have tons of troubleshooting scripts. Real experts have scripts with very short names and very flexible arguments. They also have a script for reminding them how to use their scripts
  5. I no longer view trouble as something annoying that wastes my time and prevents me from doing stuff I want to do. Instead every trouble is now cherished as an opportunity to practice what I learned, learn more and polish my skills.

I highly recommend Tanel’s course to DBAs who want to suddenly become the best troubleshooters in their team. Its not a comfortable position to be in (suddenly a lot more trouble finds its way to you), but it can be lots of fun.

 

The Senile DBA Guide to Troubleshooting Sudden Growth in Redo Generation November 4, 2009

Filed under: scripts,tips — prodlife @ 2:03 am

I just troubleshooted a server where the amounts of redo generated suddenly exploded to the point of running out of disk space.

After I was done, the problem was found and the storage manager pacified, I decided to save the queries I used. This is a rather common issue, and the scripts will be useful for the next time.

It was very embarrassing to discover that I actually have 4 similar but not identical scripts for troubleshooting redo explosions. Now I have 5 :)

Here are the techniques I use:

  1. I determine whether there is really a problem and the times the excessive redo was generated by looking at v$log_history:
    select trunc(FIRST_TIME,'HH') ,sum(BLOCKS) BLOCKS , count(*) cnt
    ,sum(decode(THREAD#,1,BLOCKS,0)) Node1_Blocks
    ,sum(decode(THREAD#,1,1,0)) Node1_Count
    ,sum(decode(THREAD#,2,BLOCKS,0)) Node2
    ,sum(decode(THREAD#,2,1,0)) Node2_Count
    ,sum(decode(THREAD#,3,BLOCKS,0)) Node3
    ,sum(decode(THREAD#,3,1,0)) Node3_Count
    from v$archived_log
    where FIRST_TIME >sysdate -30
    group by trunc(FIRST_TIME,'HH')
    order by trunc(FIRST_TIME,'HH') desc
    
  2. If the problem is still happening, I can use Tanel Poder’s Snapper to find the worse redo generating sessions. Tanel explains how to do this in his blog.
  3. However, Snapper’s output is very limited by the fact that it was written specifically for situations where you cannot create anything on the DB. Since I’m normally the DBA on the servers I troubleshoot, I have another script that actually gets information from v$session, sorts the results, etc.
    create global temporary table redo_stats
    ( runid varchar2(15),
      sid number,
      value int )
    on commit preserve rows;
    
    truncate table redo_stats;
    
    insert into redo_stats select 1,sid,value from v$sesstat ss
    join v$statname sn on ss.statistic#=sn.statistic#
    where name='redo size'
    
    commit;
    
    insert into redo_stats select 2,sid,value from v$sesstat ss
    join v$statname sn on ss.statistic#=sn.statistic#
    where name='redo size'
    
    commit;
    
    select *
            from redo_stats a, redo_stats b,v$session s
           where a.sid = b.sid
           and s.sid=a.sid
             and a.runid = 1
             and b.runid = 2
             and (b.value-a.value) > 0
           order by (b.value-a.value)
    
  4. Last technique is normally the most informative, and requires a bit more work than the rest, so I save it for special cases. I’m talking about using logminer to find the offender in the redo logs. This is useful when the problem is no longer happening, but we want to see what was the problem last night. You can do a lot of analysis with the information in logminer, so the key is to dig around and see if you can isolate a problem. I’m just giving a small example here. You can filter log miner data by users, segments, specific transaction ids – the sky is the limit.

    -- Here I pick up the relevant redo logs.
    -- Adjust the where condition to match the times you are interested in.
    -- I picked the last day.
    select 'exec sys.dbms_logmnr.add_logfile(''' || name ||''');'
    from v$archived_log
    where FIRST_TIME >= sysdate-1
    and THREAD#=1
    and dest_id=1
    order by FIRST_TIME desc
    
    -- Use the results of the query above to add the logfiles you are interest in
    -- Then start the logminer
    exec sys.dbms_logmnr.add_logfile('/u10/oradata/MYDB01/arch/1_8646_657724194.dbf');
    
    exec sys.dbms_logmnr.start_logmnr(options=>sys.dbms_logmnr.DICT_FROM_ONLINE_CATALOG);
    
    -- You can find the top users and segments that generated redo
    select seg_owner,seg_name,seg_type_name,operation ,min(TIMESTAMP) ,max(TIMESTAMP) ,count(*)
    from v$logmnr_contents 
    group by seg_owner,seg_name,seg_type_name,operation
    order by count(*) desc
    
    -- You can get more details about the specific actions and the amounts of redo they caused.
    select LOG_ID,THREAD#,operation, data_obj#,SEG_OWNER,SEG_NAME,TABLE_SPACE,count(*) cnt ,sum(RBABYTE) as RBABYTE ,sum(RBABLK) as RBABLK 
    from v$logmnr_contents 
    group by LOG_ID,THREAD#,operation, data_obj#,SEG_OWNER,SEG_NAME,TABLE_SPACE
    order by count(*)
    
    -- don't forget to close the logminer when you are done
    exec sys.dbms_logmnr.end_logmnr;
    
  5. If you have ASH, you can use it to find the sessions and queries that waited the most for “log file sync” event. I found that this has some correlation with the worse redo generators.
    -- find the sessions and queries causing most redo and when it happened
    
    select SESSION_ID,user_id,sql_id,round(sample_time,'hh'),count(*) from V$ACTIVE_SESSION_HISTORY
    where event like 'log file sync'
    group by  SESSION_ID,user_id,sql_id,round(sample_time,'hh')
    order by count(*) desc
    
    -- you can look the the SQL itself by:
    select * from DBA_HIST_SQLTEXT
    where sql_id='dwbbdanhf7p4a'
    
 

Thinking about Design Patterns November 3, 2009

Filed under: books,design — prodlife @ 3:46 am

I have this friend who is an ambitious young corporate climber. When I was started out as a team lead, I was totally overwhelmed by the office politics I was suddenly exposed to. Naturally, I turned to my friend for advice. She told me to read Machiaveli’s “The Prince”. So I did. Its an interesting read, but not that amazing as far as management advice goes. When I later tried to discuss the book with my friend, I found out that she never actually read the book herself – she just thought it is good advice.

6 years later and I still believe people when they tell me that I have to read a book. I’m naive like that.

So I spent the last two weeks reading “The Timeless Way of Building”. Its an architecture book. Architecture as in cities and buildings. The reason I spent two weeks reading a professional book intended for a different profession is that at some point in history (1994, I believe), some people though that the ideas in the book are relevant to software development. These days you can’t really be a Java developer without being fully fluent in the development pattern language.

And of course, everyone was saying “You have to read The Timeless Way of Building. It will change the way you think about software.”. From my days in development, I still remember quite a bit of stuff about design patterns, and I never really liked that particular approach to software development, but I didn’t really figure out why. After reading the authoritative source on patterns, I can say the following:

  1. Christopher Alexander had some good ideas about patterns. The book is readable to non-architects and is very enlightening. I recommend reading it if you are interested in what makes some cities and buildings feel better than others.
  2. I am pretty sure that his advice on how to design good buildings is not really applicable to software development field.
    A lot of his ideas are based on the fact (which he did the research to prove) that people intuitively know how buildings should be designed, and that when you ask a large number of people “How do you imagine you will feel in such room?”, you’ll get an overwhelming consensus. This is far from being true in the software field.
  3. What is common known today as software design patterns is so far removed from what Christopher Alexander recommended for architects, that software developers should really go and find a different name for what they are doing.

I’m still rather shocked by the differences between Christopher Alexander’s patterns, and what design patterns look like today. It is not few tiny differences that occur whenever ideas are translated from one domain to another. Some of the changes are profound.

First of all, Christopher Alexander says that patterns describe the way people already do things. “Night Life” and “Parallel Streets” existed before the book “A Pattern Language” was written. I’m not at all sure this is the case for design patterns. People buy design pattern books to learn the patterns themselves, not just the language or which patterns are better than others.

Second, patterns should have an intuitive meaning and intuitive name. Again, you don’t need a book to know what is a “Bus Stop” or “Small Parking Lots”. You may want to read the book to find out why they are a good idea, or how to make a good bus stop, but you know what it is. I don’t believe that anyone knew what is an “Abstract Factory” before reading a document about design patterns. Even patterns that have been used for decades got a fancy name. It can take a while to figure out that a Singleton is a global variable. One of the simplest and most common patterns in software development “A function that does exactly one task” is missing from software design patterns. “A Loop” is also a pattern which is missing in action. All this gives the wrong impression that patterns are very complicated and something that can be mastered by experts only – which is exactly the opposite of what Christopher Alexander intended.

Third, patterns are abstract concepts. They are always implemented in a different way, because the entire idea is to be sensitive to the context, which is never the same twice. There is a pattern called “Six-foot balcony”, but it would be wrong to mass-manufacture six-foot balconies and start attaching them to buildings. Six-foot balcony is the idea, the exact shape of the balcony will be designed to match the building, the view, the trees, the sun, etc.
So it is rather annoying to discover that all patterns have “implementation examples”, which developers enjoy copying into their code. I’m all for code reuse, but this is not patterns mean.Wikipedia has a decent description of what defines a pattern, and “being implementable in one or two simple classes that can be copy-pasted” is not part of it.

Executive summary: “Timeless Way of Building” is an interesting book on architecture, with some good insights about how humans like to live and a bit of a Zen feel. You will not learn anything about software development from reading it. If you already know software design patterns, you will be struck by how different the ideas in the book seem.

 

 
Follow

Get every new post delivered to your Inbox.

Join 46 other followers