I'm just a simple DBA on a complex production system

Writing about all things production. Especially Oracle databases.

The Senile DBA Guide to Troubleshooting Sudden Growth in Redo Generation November 4, 2009

Filed under: scripts, tips — prodlife @ 2:03 am

I just troubleshooted a server where the amounts of redo generated suddenly exploded to the point of running out of disk space.

After I was done, the problem was found and the storage manager pacified, I decided to save the queries I used. This is a rather common issue, and the scripts will be useful for the next time.

It was very embarrassing to discover that I actually have 4 similar but not identical scripts for troubleshooting redo explosions. Now I have 5 :)

Here are the techniques I use:

  1. I determine whether there is really a problem and the times the excessive redo was generated by looking at v$log_history:
    select trunc(FIRST_TIME,'HH') ,sum(BLOCKS) BLOCKS , count(*) cnt
    ,sum(decode(THREAD#,1,BLOCKS,0)) Node1_Blocks
    ,sum(decode(THREAD#,1,1,0)) Node1_Count
    ,sum(decode(THREAD#,2,BLOCKS,0)) Node2
    ,sum(decode(THREAD#,2,1,0)) Node2_Count
    ,sum(decode(THREAD#,3,BLOCKS,0)) Node3
    ,sum(decode(THREAD#,3,1,0)) Node3_Count
    from v$archived_log
    where FIRST_TIME >sysdate -30
    group by trunc(FIRST_TIME,'HH')
    order by trunc(FIRST_TIME,'HH') desc
    
  2. If the problem is still happening, I can use Tanel Poder’s Snapper to find the worse redo generating sessions. Tanel explains how to do this in his blog.
  3. However, Snapper’s output is very limited by the fact that it was written specifically for situations where you cannot create anything on the DB. Since I’m normally the DBA on the servers I troubleshoot, I have another script that actually gets information from v$session, sorts the results, etc.
    create global temporary table redo_stats
    ( runid varchar2(15),
      sid number,
      value int )
    on commit preserve rows;
    
    truncate table redo_stats;
    
    insert into redo_stats select 1,sid,value from v$sesstat ss
    join v$statname sn on ss.statistic#=sn.statistic#
    where name='redo size'
    
    commit;
    
    insert into redo_stats select 2,sid,value from v$sesstat ss
    join v$statname sn on ss.statistic#=sn.statistic#
    where name='redo size'
    
    commit;
    
    select *
            from redo_stats a, redo_stats b,v$session s
           where a.sid = b.sid
           and s.sid=a.sid
             and a.runid = 1
             and b.runid = 2
             and (b.value-a.value) > 0
           order by (b.value-a.value)
    
  4. Last technique is normally the most informative, and requires a bit more work than the rest, so I save it for special cases. I’m talking about using logminer to find the offender in the redo logs. This is useful when the problem is no longer happening, but we want to see what was the problem last night. You can do a lot of analysis with the information in logminer, so the key is to dig around and see if you can isolate a problem. I’m just giving a small example here. You can filter log miner data by users, segments, specific transaction ids – the sky is the limit.
    -- Here I pick up the relevant redo logs.
    -- Adjust the where condition to match the times you are interested in.
    -- I picked the last day.
    select 'exec sys.dbms_logmnr.add_logfile(''' || name ||''');'
    from v$archived_log
    where FIRST_TIME >= sysdate-1
    and THREAD#=1
    and dest_id=1
    order by FIRST_TIME desc
    
    -- Use the results of the query above to add the logfiles you are interest in
    -- Then start the logminer
    exec sys.dbms_logmnr.add_logfile('/u10/oradata/MYDB01/arch/1_8646_657724194.dbf');
    
    exec sys.dbms_logmnr.start_logmnr(options=>sys.dbms_logmnr.DICT_FROM_ONLINE_CATALOG);
    
    -- You can find the top users and segments that generated redo
    select seg_owner,seg_name,seg_type_name,operation ,min(TIMESTAMP) ,max(TIMESTAMP) ,count(*)
    from v$logmnr_contents
    group by seg_owner,seg_name,seg_type_name,operation
    order by count(*) desc
    
    -- You can get more details about the specific actions and the amounts of redo they caused.
    select LOG_ID,THREAD#,operation, data_obj#,SEG_OWNER,SEG_NAME,TABLE_SPACE,count(*) cnt ,sum(RBABYTE) as RBABYTE ,sum(RBABLK) as RBABLK
    from v$logmnr_contents
    group by LOG_ID,THREAD#,operation, data_obj#,SEG_OWNER,SEG_NAME,TABLE_SPACE
    order by count(*)
    
    -- don't forget to close the logminer when you are done
    exec sys.dbms_logmnr.end_logmnr;
    
  5. If you have ASH, you can use it to find the sessions and queries that waited the most for “log file sync” event. I found that this has some correlation with the worse redo generators.

    -- find the sessions and queries causing most redo and when it happened
    
    select SESSION_ID,user_id,sql_id,round(sample_time,'hh'),count(*) from V$ACTIVE_SESSION_HISTORY
    where event like 'log file sync'
    group by  SESSION_ID,user_id,sql_id,round(sample_time,'hh')
    order by count(*) desc
    
    -- you can look the the SQL itself by:
    select * from DBA_HIST_SQLTEXT
    where sql_id='dwbbdanhf7p4a'
    
 

8 Responses to “The Senile DBA Guide to Troubleshooting Sudden Growth in Redo Generation”

  1. CJ Says:

    Good article.

    I normally create a database logoff trigger that reads v$sesstat and inserts data into a table, and leave it running for a day or so. That usually reveals which session/schema/app is the worst offender.

    • prodlife Says:

      For reasons I don’t 100% agree with, we avoid creating triggers in our environment at all. Your method sounds useful – as long as the offending session logged off.

      • CJ Says:

        To give credit where due, I learnt this trick from the OWI book by Kirtikumar Deshpande et al. In fact they recommend it as a ‘performance data collector’ tool. Since we don’t have ASH licenses, I find this extremely useful. Lots of useful information can be derived from recording v$sesstat and v$session_event.

  2. Marko Sutic Says:

    Nice trick using logminer to diagnose redo growth.

    We had similar situation with disk running out of space cause of redo generation. If I knew then this trick with logminer troubleshooting would be much simpler.
    Great article!

    Regards,
    Marko

  3. Yas Says:

    Chen, nice queries. But I do not agree with 5. A session generating the most redo may not be the one mostly waiting on log file sync. If the session runs a massive DML statement without a commit it never waits for the log file sync event. Right?

  4. Freek Says:

    Also, you need to be carefull when there are many sessions, each doing a little bit of redo instead of 1 big session.
    We had this problem once, when an application went crazy and restarted a certain action again and again (each time creating a session to the db, do some work and logging off again).

    I had to use logminer to go through the actions to find out what was happening.

  5. [...] Shapira offers The Senile DBA Guide to Troubleshooting Sudden Growth in Redo Generation, which begins, “I just troubleshooted a server where the amounts of redo generated suddenly [...]

  6. [...] Chen Shapira-The Senile DBA Guide To Troubleshooting Sudden Growth In Redo Generation [...]


Leave a Reply