The Senile DBA Guide to Troubleshooting Sudden Growth in Redo Generation

I just finished troubleshooting a server where the amount of redo generated suddenly exploded, to the point of running out of disk space.

After I was done, the problem found and the storage manager pacified, I decided to save the queries I used. This is a rather common issue, and the scripts will be useful next time.

It was very embarrassing to discover that I actually have 4 similar but not identical scripts for troubleshooting redo explosions. Now I have 5 🙂

Here are the techniques I use:

  1. I determine whether there is really a problem, and when the excessive redo was generated, by looking at v$archived_log:
    select trunc(FIRST_TIME,'HH'), sum(BLOCKS) BLOCKS, count(*) cnt
    ,sum(decode(THREAD#,1,BLOCKS,0)) Node1_Blocks
    ,sum(decode(THREAD#,1,1,0)) Node1_Count
    ,sum(decode(THREAD#,2,BLOCKS,0)) Node2_Blocks
    ,sum(decode(THREAD#,2,1,0)) Node2_Count
    ,sum(decode(THREAD#,3,BLOCKS,0)) Node3_Blocks
    ,sum(decode(THREAD#,3,1,0)) Node3_Count
    from v$archived_log
    where FIRST_TIME > sysdate-30
    group by trunc(FIRST_TIME,'HH')
    order by trunc(FIRST_TIME,'HH') desc;
    
  2. If the problem is still happening, I can use Tanel Poder’s Snapper to find the worst redo-generating sessions. Tanel explains how to do this in his blog.
  3. However, Snapper’s output is very limited by the fact that it was written specifically for situations where you cannot create anything on the DB. Since I’m normally the DBA on the servers I troubleshoot, I have another script that actually gets information from v$session, sorts the results, etc.
    create global temporary table redo_stats
    ( runid varchar2(15),
      sid number,
      value int )
    on commit preserve rows;
    
    truncate table redo_stats;
    
    insert into redo_stats select 1,sid,value from v$sesstat ss
    join v$statname sn on ss.statistic#=sn.statistic#
    where name='redo size';
    
    commit;
    
    -- wait a while between the two snapshots so there is a delta to measure
    -- (for example: exec dbms_lock.sleep(60))
    
    insert into redo_stats select 2,sid,value from v$sesstat ss
    join v$statname sn on ss.statistic#=sn.statistic#
    where name='redo size';
    
    commit;
    
    select *
            from redo_stats a, redo_stats b,v$session s
           where a.sid = b.sid
           and s.sid=a.sid
             and a.runid = 1
             and b.runid = 2
             and (b.value-a.value) > 0
           order by (b.value-a.value)
    
  4. The last technique is normally the most informative, and requires a bit more work than the rest, so I save it for special cases. I’m talking about using LogMiner to find the offender in the redo logs. This is useful when the problem is no longer happening, but we want to see what the problem was last night. You can do a lot of analysis with the information in LogMiner, so the key is to dig around and see if you can isolate a problem. I’m just giving a small example here. You can filter LogMiner data by users, segments, specific transaction IDs; the sky is the limit.
    -- Here I pick up the relevant redo logs.
    -- Adjust the where condition to match the times you are interested in.
    -- I picked the last day.
    select 'exec sys.dbms_logmnr.add_logfile(''' || name ||''');'
    from v$archived_log
    where FIRST_TIME >= sysdate-1
    and THREAD#=1
    and dest_id=1
    order by FIRST_TIME desc
    
    -- Use the results of the query above to add the logfiles you are interested in.
    -- Then start the logminer.
    exec sys.dbms_logmnr.add_logfile('/u10/oradata/MYDB01/arch/1_8646_657724194.dbf');
    
    exec sys.dbms_logmnr.start_logmnr(options=>sys.dbms_logmnr.DICT_FROM_ONLINE_CATALOG);
    
    -- You can find the top users and segments that generated redo
    select seg_owner,seg_name,seg_type_name,operation ,min(TIMESTAMP) ,max(TIMESTAMP) ,count(*)
    from v$logmnr_contents 
    group by seg_owner,seg_name,seg_type_name,operation
    order by count(*) desc
    
    -- You can get more details about the specific actions and the amounts of redo they caused.
    select LOG_ID,THREAD#,operation, data_obj#,SEG_OWNER,SEG_NAME,TABLE_SPACE,count(*) cnt ,sum(RBABYTE) as RBABYTE ,sum(RBABLK) as RBABLK 
    from v$logmnr_contents 
    group by LOG_ID,THREAD#,operation, data_obj#,SEG_OWNER,SEG_NAME,TABLE_SPACE
    order by count(*)
    
    -- don't forget to close the logminer when you are done
    exec sys.dbms_logmnr.end_logmnr;
    
  5. If you have ASH, you can use it to find the sessions and queries that waited the most on the “log file sync” event. I found that this has some correlation with the worst redo generators.

    -- find the sessions and queries causing most redo and when it happened
    
    select SESSION_ID,user_id,sql_id,round(sample_time,'hh'),count(*) from V$ACTIVE_SESSION_HISTORY
    where event = 'log file sync'
    group by SESSION_ID,user_id,sql_id,round(sample_time,'hh')
    order by count(*) desc
    
    -- you can look at the SQL itself with:
    select * from DBA_HIST_SQLTEXT
    where sql_id='dwbbdanhf7p4a'
    

Lessons From OOW09 #1 – Shell Script Tips

During OpenWorld I went to a session about shell scripting. The speaker, Ray Smith, was excellent. Clear, got the pace right, educating and entertaining.

His presentation was based on the book “The Art of Unix Programming” by one Eric Raymond. He recommended reading it, and I may end up doing that.

The idea is that shell scripts should obey two important rules:

  1. Shell scripts must work
  2. Shell scripts must keep working (even when Oracle takes BDUMP away).

Hard to object to that 🙂

Here’s some of his advice on how to achieve these goals (He had many more tips, these are just the ones I found non-trivial and potentially useful. My comments in italics.)

  1. Document dead ends, the things you tried and did not work, so that the next person to maintain the code won’t try them again.
  2. Document the script purpose in the script header, as well as the input arguments
  3. Be kind – try to make the script easy to read. Use indentation. It’s 2009; I’m horrified that “please indent” is still a relevant tip.
  4. Clean up temporary files you will use before trying to use them:

    function CleanUpFiles {
      [ -f "$LOGFILE" ] && rm -f "$LOGFILE"
      [ -f "$SPOOLFILE" ] && rm -f "$SPOOLFILE"
    }
  5. Revisit old scripts. Even if they work. Technology changes. This one is very controversial – do we really need to keep chasing the latest technology?
  6. Be nice to the users by working with them – verify before taking actions and keep user informed of what the script is doing at any time. OPatch is a great example.
  7. Error messages should explain errors and advise how to fix them
  8. The same script can work interactively or in cron by testing for a terminal: if tty -s; then … fi
  9. When sending email notifying of success or failure, be complete. Say which host, which job, what happened, how to troubleshoot, when is the next run (or what is the schedule).
  10. Dialog/Zenity – tools that let you easily create cool dialog screens
  11. Never hardcode passwords, hostnames, DB names, or paths. Use ORATAB, command line arguments or parameter files. I felt like clapping here. This is so obvious, yet we are now running a major project to modify all our scripts to be like that.
  12. Be consistent – try to use same scripts whenever possible and limit editing permissions
  13. Use version control for your scripts. Getting our team to use version control was one of my major projects this year.
  14. Subversion has HTTP access, so the internal knowledge base can point at the scripts. Wish I knew that last year.
  15. Use deployment control tool like CFEngine. I should definitely check this one out.
  16. Use getopts for parameters. Getopts looked too complicated when I first checked it out, but I should give it another try.
  17. Create everything you need every time you need it. Don’t fail just because a directory does not exist. Output what you just did.
  18. You can have common data files with things like hosts list or DB lists that are collected automatically on regular basis and that you can then reference in your scripts.
  19. You can put comments and descriptions in ORATAB
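Tips 8 and 16 can be combined into one skeleton. Here is a minimal sketch (the option letters and variable names are mine, not from the talk):

```shell
#!/bin/sh
# Parse -d <dbname> and -v with getopts, then detect whether we are
# running interactively (a terminal on stdin) or from cron.
DBNAME=""
VERBOSE=0

while getopts "d:v" opt; do
  case "$opt" in
    d) DBNAME="$OPTARG" ;;
    v) VERBOSE=1 ;;
    *) echo "Usage: $0 -d dbname [-v]" >&2; exit 1 ;;
  esac
done

if tty -s; then
  MODE="interactive"   # safe to prompt the user
else
  MODE="batch"         # cron run: no prompts, write to a log instead
fi

echo "db=$DBNAME verbose=$VERBOSE mode=$MODE"
```

Run as `./script.sh -d MYDB -v` from a terminal and MODE comes out interactive; the same invocation from cron takes the batch branch.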

Preparing to Clone!

We are moving to this new data center. The new servers are in the new data centers, just waiting for Oracle to be installed on them. We have about 50 new servers to install.

Obviously we want to install them as quickly as possible, with as little hands-on time as possible. To achieve this, we are checking two options: scripted silent installation and installation cloning.

I’m checking the installation cloning part, and I’m using Oracle’s Universal Installer User Guide for the process. One of the first steps in the process is:

At the source, you run a script called prepare_clone.pl.
This is a Perl script that prepares the source for cloning by recording the information that is required for cloning. This is generally located in the following location: $ORACLE_HOME/clone/bin/prepare_clone.pl.

When I tried to run the script, I found out that oracle user did not have execute permissions on the file. Then I found out that the script had the location of PERL hardcoded to the wrong location. Finally, I found out that the file had the usual amount of comments for an Oracle script, but only one line of code:
exit 0;

I found this incredibly amusing, so I decided to blog on this. While blogging, I took a closer look at the documentation, and found the following comment:

The need to perform the preparation phase depends on the Oracle product that you are installing. This script needs to be executed only for the Application Server Cloning. Database and CRS Oracle home Cloning does not need this.

I guess the joke is on me. Serves me right for not reading the instructions carefully.


Group By in Shell

The storage manager dropped by my cube to say that redo log archives for one of our DBs grew from 40G to 200G in the last day. Running “du -Sh” a few times showed that the files are no longer getting written as fast, so the issue is probably over. But what was it? And to start, when was it? Once I figure out when the issue occurred, I may be able to find the offending DML in ASH.

So, I need to do something like:

select day,hour,sum(bytes)
from (ls -l)
group by day,hour
order by day,hour

I guess I could have used the output of “ls -l”  as an external table and actually run this query. But I’m actually trying to improve my shell scripting skills. There must be some way to do it!

I asked on Twitter, and Kris Rice (of SQLDeveloper fame) pointed me in the right direction – use AWK.

Here’s the result:

ls -l | sed 's/:/ /g' | awk '{sizes[$7$8] += int($5)} END {for (hour in sizes)
{ printf("At %s %d bytes were inserted\n",hour,sizes[hour])}}' | sort

I run ls, then use sed to split the time into separate fields (Laurent Schneider will kill me for mixing sed and awk, but I’m not at his level yet…). After the split, the fifth field contains the file size, the seventh contains the date and the eighth the hour. So I have an associative array called “sizes”, indexed on the date and hour, where each bucket accumulates the sum of the file sizes for that hour.

Then I loop over the array, print the results, and sort them to make the output more readable.
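For the record, the same grouping can be done with awk alone, splitting the hh:mm field inside awk instead of pre-processing with sed. A sketch, assuming the common `ls -l` layout with the size in $5, the day in $7 and the time in $8:

```shell
# Sum file sizes per day+hour; split($8, t, ":") breaks "hh:mm"
# so t[1] is the hour. The $5 guard skips the "total" line.
ls -l | awk '$5 ~ /^[0-9]+$/ {
  split($8, t, ":")
  sizes[$7 " " t[1]] += $5
} END {
  for (hour in sizes)
    printf("At %s %d bytes were inserted\n", hour, sizes[hour])
}' | sort
```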

A long post for a short script, but I want to make sure I don’t forget it 😉

Anyway, the best part by far was everyone who tried to help with my desperate call on Twitter/Facebook. It was unexpected and very, very nice. About 5 people sent tips, ideas and even solutions! I can only imagine that doing my work is more fun than doing their own, but I appreciate the help a lot.


Fork() on Windows.

I was assigned a very simple task – write a script that opens SSH connection to a remote server, waits 30 seconds and closes the connection. About 5 minutes of work on Linux.

On windows it took an entire day.

First attempt: Using MKS32’s KSH.

d:/sitescope/tools/plink sitescope@testdb -pw pass -m d:/sshTest_cmd2 &
SSH_PID=$!
echo $SSH_PID
(sleep 30 ; kill $SSH_PID)

Problem: The ssh process executed within a subshell, and I only got the PID for the subshell. Killing the subshell did not kill the SSH process.
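For comparison, on Linux the plain shell version does behave: `&` backgrounds the command itself and `$!` holds its real PID. A sketch, using `sleep` as a stand-in for the plink call:

```shell
# Stand-in for the ssh/plink command: background it, grab its PID,
# wait, then kill it if it is still alive.
sleep 300 &          # in the real script this would be: plink ... &
SSH_PID=$!
echo "started pid $SSH_PID"

sleep 3              # the real script waits 30 seconds
if kill -0 "$SSH_PID" 2>/dev/null; then
  kill "$SSH_PID"    # still running: terminate it
  echo "killed $SSH_PID"
fi
```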

Second attempt: Using Perl’s fork, exec and wait.

$pid = fork;
if($pid == 0) {
    exec("d:/sitescope/tools/plink sitescope\@testdb -pw pass -m d:/sshTest_cmd2");
}elsif ($pid > 0) {
    sleep 30;
    kill(9,$pid);
} else {
    print "fork failed";
};

Problem: Under Windows, fork is not a real fork. It creates a “pseudo-process” (actually a thread), and $pid is a negative number, not a real process ID. After the exec, the child thread becomes a real process with a PID, but again, I won’t have that PID.

Third attempt: Perl has a WIN32 package, which includes a process object, allowing me to create a process, wait for it, and kill it.

use Win32;
use Win32::Process;
use Win32::Process qw(STILL_ACTIVE);

sub ErrorReport{
    print Win32::FormatMessage( Win32::GetLastError() );
}

$params = "plink sitescope\@$ARGV[0] -pw $ARGV[1] -m D:\\sshTest_cmd";
print "$params \n";

Win32::Process::Create($ProcessObj,"D:\\SiteScope\\tools\\plink.exe",$params,0,
NORMAL_PRIORITY_CLASS,".")|| die ErrorReport();

$PID = $ProcessObj->GetProcessID();
print "$PID \n";

$ProcessObj->Wait(30000);
$ProcessObj->GetExitCode($exitcode);

print "exit code is: $exitcode \n";

if ($exitcode == 0) {
    print "exiting normally \n";
    exit 0;
} elsif ($exitcode == STILL_ACTIVE) {
    print "killing ssh \n";
    $ProcessObj->Kill(59);
    exit 59;
} else {
    print "ssh exited abnormally \n";
    exit $exitcode;
}

Finally something works 🙂


Read Only User

“Oh mighty DBA” said the developer, “You are wise and very attractive. Can I please have access to the TEST system of my application? I need to debug an issue that QA found and I cannot reproduce”.

(I’m paraphrasing a bit).

“No way. You will change a few things while you are there and ruin the test cycle. Even if you don’t, QA will think you did, and that will ruin the test cycle.”

“I see your point, wise and attractive DBA. In this case, can I have just read-only access to the system?”

(I hate this part in the conversation. When a request sounds innocent and trivial, but is in-fact complicated and full of trickery.)

“I’ll see what I can do. But you don’t get to execute any functions. Just select on table and views. And you’ll need to prefix the table names with the name of the original user.”

(Well, actually, giving him synonyms is easy enough. If he insists, I’ll add them.)

I wrote the function that does it. It is horrible in several ways:

  1. It uses “execute immediate”.
  2. No bind variables. DDL statements cannot use them, but it still means I’m doing bad things to the library cache.
  3. I add “with grant” privilege to all object privileges the original user had. I did not want to touch the original user, but as I explained before, I had no choice. I could have been more selective here, though.

I hope it will be useful, but I post it mostly as a rough reference. Think twice before you use it.

CREATE OR REPLACE
PROCEDURE GRANT_READ_ONLY(RO_USER in varchar2, ORIG_USER in varchar2,
    with_grant in BOOLEAN default true) AUTHID current_user AS
  stmt varchar2(4000);
  get_grant_objs varchar2(4000);
  TYPE cur_typ IS REF CURSOR;
  c_cursor cur_typ;
  TYPE user_priv_typ is record (
      table_name all_tab_privs.table_name%type,
      privilege all_tab_privs.privilege%type,
      grantor all_tab_privs.grantor%type);
  user_priv_rec user_priv_typ;

BEGIN
  if (with_grant) then
      get_grant_objs := 'select table_name,privilege,grantor from dba_TAB_PRIVS dtp
          join dba_objects do on dtp.grantor=do.owner and dtp.table_name=do.object_name
          where grantee=''' || ORIG_USER || '''  order by TABLE_NAME';
      open c_cursor for get_grant_objs;
      LOOP
        FETCH c_cursor into user_priv_rec;
        EXIT WHEN c_cursor%NOTFOUND;
        stmt := 'GRANT ' || user_priv_rec.privilege || ' ON ' || user_priv_rec.grantor || '.' ||
            user_priv_rec.table_name || ' TO ' || ORIG_USER || ' WITH GRANT OPTION';
        execute immediate stmt;
        --dbms_output.put_line(stmt);
      END LOOP;
      CLOSE c_cursor;
  end if;
  for obj in 
  (
  select object_name from all_objects
  where owner=ORIG_USER
  and object_type in ('SEQUENCE','TABLE','VIEW','MATERIALIZED VIEW'))
  LOOP
    stmt := 'GRANT SELECT ON ' || ORIG_USER || '.' || obj.object_name || ' TO ' || RO_USER;
    execute immediate stmt;
    --dbms_output.put_line(stmt);
  END LOOP;
END;

Joining Unrelated Data

I’m sure this happened to every DBA – an important customer sends an Excel spreadsheet and explains that he wants this report generated from the DB on a weekly basis.

At first look the report seems easy enough and you say “Sure, you’ll have it ready by the end of the week”, but on second look something seems off and you give the customer a call: “Listen, it seems that the data in the columns is not really related. The first column contains your customers and the second contains the contact emails for each customer, which is fine. But the third column contains the open issues for the customer, and it is not really related to the contact emails column.”
And the customer calmly replies: “You are right, it isn’t related. That’s how I want the report. I’ll have it by Friday, yes?”

No choice but to write a non-relational join. I do this by generating an artificial row-number column on each side (with an analytic function) and joining on that.
For example:

SQL> create table customers (cust_id number, cust_name varchar(20));

Table created.

SQL> insert into customers values (1,'cust A');

1 row created.

SQL> insert into customers values (2,'cust B');

1 row created.

SQL> insert into customers values (3,'cust C');

1 row created.

SQL> create table contacts (cust_id number, email varchar(64));

Table created.

SQL> insert into contacts values (1,'a1@a.com');

1 row created.

SQL> insert into contacts values (1,'a2@a.com');

1 row created.

SQL> insert into contacts values (2,'b1@b.com');

1 row created.

SQL> insert into contacts values (3,'c1@c.com');

1 row created.

SQL> insert into contacts values (3,'c2@c.com');

1 row created.

SQL> insert into contacts values (3,'c3@c.com');

1 row created.

SQL> create table issues (cust_id number, description varchar(4000));

Table created.

SQL> insert into issues values (1,'Late delivery');

1 row created.

SQL> insert into issues values (1,'Missing components');

1 row created.

SQL> insert into issues values (2,'Wrong address');

1 row created.

SQL> insert into issues values (3,'Malfunctions');

1 row created.

SQL> col email format a10
SQL> col description format a20
SQL> select nvl(t1.cust_name,t2.cust_name),email,description from
  2  (select dense_rank() over (partition by cust_name order by email) n, cust_name, email
  3  from customers cr join contacts cn on cr.cust_id=cn.cust_id) t1
  4  full outer join
  5  (select dense_rank() over (partition by cust_name order by description) n, cust_name, description
  6  from customers cr join issues i on cr.cust_id=i.cust_id) t2
  7  on t1.cust_name=t2.cust_name and t1.n=t2.n;

NVL(T1.CUST_NAME,T2. EMAIL      DESCRIPTION
-------------------- ---------- --------------------
cust A               a1@a.com   Late delivery
cust A               a2@a.com   Missing components
cust B               b1@b.com   Wrong address
cust C               c1@c.com   Malfunctions
cust C               c2@c.com
cust C               c3@c.com

6 rows selected.

Very ugly. I hope the customer will be happy with his report.