The Senile DBA Guide to Troubleshooting Sudden Growth in Redo Generation
Posted: November 4, 2009 Filed under: scripts, tips 13 Comments »I just troubleshooted a server where the amounts of redo generated suddenly exploded to the point of running out of disk space.
After I was done, the problem was found and the storage manager pacified, I decided to save the queries I used. This is a rather common issue, and the scripts will be useful for the next time.
It was very embarrassing to discover that I actually have 4 similar but not identical scripts for troubleshooting redo explosions. Now I have 5
Here are the techniques I use:
- I determine whether there is really a problem and the times the excessive redo was generated by looking at v$log_history:
select trunc(FIRST_TIME,'HH') ,sum(BLOCKS) BLOCKS , count(*) cnt ,sum(decode(THREAD#,1,BLOCKS,0)) Node1_Blocks ,sum(decode(THREAD#,1,1,0)) Node1_Count ,sum(decode(THREAD#,2,BLOCKS,0)) Node2 ,sum(decode(THREAD#,2,1,0)) Node2_Count ,sum(decode(THREAD#,3,BLOCKS,0)) Node3 ,sum(decode(THREAD#,3,1,0)) Node3_Count from v$archived_log where FIRST_TIME >sysdate -30 group by trunc(FIRST_TIME,'HH') order by trunc(FIRST_TIME,'HH') desc
- If the problem is still happening, I can use Tanel Poder’s Snapper to find the worse redo generating sessions. Tanel explains how to do this in his blog.
- However, Snapper’s output is very limited by the fact that it was written specifically for situations where you cannot create anything on the DB. Since I’m normally the DBA on the servers I troubleshoot, I have another script that actually gets information from v$session, sorts the results, etc.
create global temporary table redo_stats ( runid varchar2(15), sid number, value int ) on commit preserve rows; truncate table redo_stats; insert into redo_stats select 1,sid,value from v$sesstat ss join v$statname sn on ss.statistic#=sn.statistic# where name='redo size' commit; insert into redo_stats select 2,sid,value from v$sesstat ss join v$statname sn on ss.statistic#=sn.statistic# where name='redo size' commit; select * from redo_stats a, redo_stats b,v$session s where a.sid = b.sid and s.sid=a.sid and a.runid = 1 and b.runid = 2 and (b.value-a.value) > 0 order by (b.value-a.value) - Last technique is normally the most informative, and requires a bit more work than the rest, so I save it for special cases. I’m talking about using logminer to find the offender in the redo logs. This is useful when the problem is no longer happening, but we want to see what was the problem last night. You can do a lot of analysis with the information in logminer, so the key is to dig around and see if you can isolate a problem. I’m just giving a small example here. You can filter log miner data by users, segments, specific transaction ids – the sky is the limit.
-- Here I pick up the relevant redo logs. -- Adjust the where condition to match the times you are interested in. -- I picked the last day. select 'exec sys.dbms_logmnr.add_logfile(''' || name ||''');' from v$archived_log where FIRST_TIME >= sysdate-1 and THREAD#=1 and dest_id=1 order by FIRST_TIME desc -- Use the results of the query above to add the logfiles you are interest in -- Then start the logminer exec sys.dbms_logmnr.add_logfile('/u10/oradata/MYDB01/arch/1_8646_657724194.dbf'); exec sys.dbms_logmnr.start_logmnr(options=>sys.dbms_logmnr.DICT_FROM_ONLINE_CATALOG); -- You can find the top users and segments that generated redo select seg_owner,seg_name,seg_type_name,operation ,min(TIMESTAMP) ,max(TIMESTAMP) ,count(*) from v$logmnr_contents group by seg_owner,seg_name,seg_type_name,operation order by count(*) desc -- You can get more details about the specific actions and the amounts of redo they caused. select LOG_ID,THREAD#,operation, data_obj#,SEG_OWNER,SEG_NAME,TABLE_SPACE,count(*) cnt ,sum(RBABYTE) as RBABYTE ,sum(RBABLK) as RBABLK from v$logmnr_contents group by LOG_ID,THREAD#,operation, data_obj#,SEG_OWNER,SEG_NAME,TABLE_SPACE order by count(*) -- don't forget to close the logminer when you are done exec sys.dbms_logmnr.end_logmnr; -
If you have ASH, you can use it to find the sessions and queries that waited the most for “log file sync” event. I found that this has some correlation with the worse redo generators.
-- find the sessions and queries causing most redo and when it happened select SESSION_ID,user_id,sql_id,round(sample_time,'hh'),count(*) from V$ACTIVE_SESSION_HISTORY where event like 'log file sync' group by SESSION_ID,user_id,sql_id,round(sample_time,'hh') order by count(*) desc -- you can look the the SQL itself by: select * from DBA_HIST_SQLTEXT where sql_id='dwbbdanhf7p4a'

Good article.
I normally create a database logoff trigger that reads v$sesstat and inserts data into a table, and leave it running for a day or so. That usually reveals which session/schema/app is the worst offender.
For reasons I don’t 100% agree with, we avoid creating triggers in our environment at all. Your method sounds useful – as long as the offending session logged off.
To give credit where due, I learnt this trick from the OWI book by Kirtikumar Deshpande et al. In fact they recommend it as a ‘performance data collector’ tool. Since we don’t have ASH licenses, I find this extremely useful. Lots of useful information can be derived from recording v$sesstat and v$session_event.
Nice trick using logminer to diagnose redo growth.
We had similar situation with disk running out of space cause of redo generation. If I knew then this trick with logminer troubleshooting would be much simpler.
Great article!
Regards,
Marko
Chen, nice queries. But I do not agree with 5. A session generating the most redo may not be the one mostly waiting on log file sync. If the session runs a massive DML statement without a commit it never waits for the log file sync event. Right?
Also, you need to be carefull when there are many sessions, each doing a little bit of redo instead of 1 big session.
We had this problem once, when an application went crazy and restarted a certain action again and again (each time creating a session to the db, do some work and logging off again).
I had to use logminer to go through the actions to find out what was happening.
[...] Shapira offers The Senile DBA Guide to Troubleshooting Sudden Growth in Redo Generation, which begins, “I just troubleshooted a server where the amounts of redo generated suddenly [...]
[...] Chen Shapira-The Senile DBA Guide To Troubleshooting Sudden Growth In Redo Generation [...]
Do you have a query where you can find how to find which node(Rac) is generating the most redo log?
The first code snippet in my post will do just that – see how I have a separate column for each node?
Just had to use this for a client. Thanks!
For the below query,
In our database the sql_id is null for few rows… Would you explain me what would be cause? please; thanks.
select SESSION_ID,user_id,sql_id,round(sample_time,’hh’),count(*) from V$ACTIVE_SESSION_HISTORY
where event like ‘log file sync’
group by SESSION_ID,user_id,sql_id,round(sample_time,’hh’)
order by count(*) desc
SESSION_ID user_id sql_id round(sample_time,’hh’) count(*)
429 1202 16/08/2011 12:00:00 PM 560
345 22 16/08/2011 11:00:00 AM 4
330 2784 16/08/2011 11:00:00 AM 3
blank sql_id usually means the sql is part of a scheduled job. Its a well known issue in the data reported in v$session.