I'm just a simple DBA on a complex production system

Writing about all things production. Especially Oracle databases.

Autocorrelation and Causation December 7, 2009

Filed under: musing, statistics — prodlife @ 7:48 pm

Everyone talks about how correlation doesn’t imply causation, but no one says what autocorrelation implies.

Maybe its because we don’t talk about autocorrelation at all :)

Lets start talking about autocorrelation by saying what it is:
First of all, autocorrelation is a concept related to time series. Time series is an ordered series of measurements taken at intervals over time. You know how we measure CPU utilization every 10 minutes and then display nice “CPU over time” graphs? thats a time series.
Autocorrelation is the correlation between points in the time series. So we can compare every point in the time series to the measurement taken 10 minutes later, 20 minutes later, etc. And we can find out that every point in our graph is strongly correlated with the measurement taken 10 minutes later and the point taken 60 minutes later.

But does that imply causation?
We normally don’t assume that the current value of the CPU caused the value that the CPU has in 60 minutes. It makes much more sense to assume that there is a third factor that causes the CPU to peak every 60 minutes. This effect is also called seasonality. The weather today is strongly correlated with the weather on Dec 9th 2008. The third factor in this case is the circles our planet makes around the sun.

However the autocorrelation with the point immediately following the current value, often does imply causation of sorts. If you want to make a good guess about the value of IBM stock tomorrow, your best bet is to guess that it will be the same as the value today. Stock values usually have very strong short-term autocorrelation, and we can say that tomorrow’s value is todays value plus some error. IBM stock prices are normally stable, so the error is normally small. So you can say that today’s stock value is caused by today’s value. In a similar way the CPU in 2 minutes can be predicted to be identical to the CPU right now.

I’m hesitant to call this “causation”, because although the stock price today does cause the stock price of tomorrow (plus an error!), the “real” cause is that stock prices and cpus behave in a specific way. On the other hand, we know that they behave in a specific way because we measured the autocorrelation, modeled it and made predictions that work. So in two important uses of causation, understanding the behavior of the thing we measured and making predictions, we can say that we have a cause-and-effect relation. Albeit a bit less intuitive that usual.

If you dig the idea of explaining and predicting CPU and other important performance measurements by using only the measure itself without looking for other explaining factors, then you should definitely attend my presentation about time series analysis at RMOUG. I’ll show exactly how we find autocorrelations and how to predict future values and we’ll discuss whether or not this is a useful method.

 

Goodness of Data December 2, 2009

Filed under: Analysis — prodlife @ 4:16 pm

I’m working on my time-series analysis presentation for RMOUG, and one of the topics I may include (or may not, because it is only marginally relevant) is that of data quality.

You cannot do good analysis and get meaningful results if your data is distorted. If you start your analysis with garbage you’ll end up with garbage.

So, before starting your analysis, you have to look at the data and make sure it doesn’t have any obvious problems. One of the favorite ways of doing this is by literally looking at the data. Plotting a graph of the data is the fastest and easiest way to spot issues.

What kinds of issues?

  1. Outliers: It is easy to spot outliers in a graph (especially if you use a box plot), but not all outliers are bad data. It is important to differentiate between extreme but legitimate data and bad measurements. In order to do that, you really need to understand the data you are looking at and the system it describes. There is a world of difference between “Yes, the system did hit 75 load average that morning” and “Oh yeah, thats the morning when we hit a freaky problem with the /proc system and top reported 75 load average even though the system was not loaded at all”. Some outliers don’t offer a ready explanation – the data says load average was 75, but you have no clue if it was a real issue or not. In those cases my tendency is to err on the side of including the data – if I can’t explain why the data is bad and should be excluded, then I keep it.
  2. Missing data: Missing data isn’t always bad. Systematically missing data is fine – ASH samples Oracle sessions every 10 seconds, so you can say that data in between is missing. But thats just random sampling and all analysis tools deal with this, so you are safe. The bad kind of missing data is the biased missing data – if you monitor your sessions from an external tool that queries v$session, it is likely that when the load is very high, the tool will be unable to connect and query your database. So the data isn’t randomly missing – it is always missing the points of the highest load. Your data set will lack the most important data, and worse it will show your system load as much lower than it really is. Obviously any analysis based on this data will be hopelessly flawed.
  3. Breaks: By breaks I mean specific points in time where the nature of the data completely changed. Example: You are looking at monthly response-times data, and at Nov 15th someone upgraded the SAN. Before the upgrade average response time was 8s with standard deviation of 4s, but after the average response time was 3s with standard deviation of 0.01s. It should be obvious that you can’t analyze November as one single time series, because the behavior of the system changed dramatically. Any forecast made based on the first half of the month will be completely irrelevant.

How do we fix the bad data? Here are few relatively simple suggestions.

  1. Replace outliers and missing data with average value. Note that if your data has strong trend or seasonality (higher load on Monday morning for example), you will need to use local average values because overall average will be meaningless. The process of replacing missing data (or outliers) with meaningful average data is also known as interpolation. You do this whenever you display your data as a graph with one continuous line instead of a series of dots. You only sampled the system every 10 seconds, but your show the graph as if you also have all the data in between. This is probably the most common and most intuitive way of fixing data.
  2. If the missing data has a bias (data is only missing when the load is very high) then replacing missing data with averages is not a good idea. You know that the missing data was not average. In this case you can replace the missing values with maximum values of the data you did measure.
  3. If the data has breaks in it, analyze each part of the data separately and fit a different model to each part. In forecasts you will probably want to use only the latest model (if you have reasons to believe that this behavior will continue into the future).
 

Mad Troubleshooting Skillz! November 24, 2009

Filed under: musing — prodlife @ 3:06 am

Last week I attended Tanel Poder’s Advanced Oracle Troubleshooting seminar, organized by NoCOUG.

Well, actually it was organized by me, with lots of help from Iggy and the rest of NoCOUG. Organizing a seminar was not trivial, but wow – it was totally worth it!

When the seminar was over, we asked the attendees to fill a small survey and tell us what was good and what was bad. Turns out that there was just a single answer for “What was the best thing about the seminar”. The answer is – Tanel.

He is clearly an expert. He loves what he is doing. He can think on his feet and answer all sorts of more-or-less related questions from the audience. He is extremely generous with his time and his knowledge and his scripts (I know lots of DBAs who guard their scripts with their lives, Tanel happily shares them in his blog). Tanel is also funny, entertaining and the course is pretty well structured. He also talks very fast (and seems somewhat obsessed about not wasting a single millisecond) – you won’t believe how much you’ll learn in two days.

I had some trouble explaining explaining to my team and boss what I learned:
“Well, the first half day was troubleshooting hangs and slowdowns using v$views”
“But you knew that before!”
“We also learned lots of Oracle internals. How the shared pool really works, and how SQL is really processed. Lots of cool stuff.”
“OK, but what is it good for?”
“Knowing how stuff works is always good. Anyway, I also learned to use Unix tools to debug problems. Like dumping process stuck to see where it hangs! Also, we learned how to handle free memory issues. Remember that awful leak we had on that test server?”
“No.”
“Oh! Look! Shiny scripts!”

But the proof of the pudding is in the eating. No one can argue the fact that last week I already managed to troubleshoot and solve two problems that other team members failed to make much progress on. I did it very quickly too.
Now here is the strange thing – the two problems were in areas of Oracle that Tanel very explicitly did not mention during the seminar. Streams and Clusterware. I did not even use any of his scripts to shoot them. And yet I’m still convinced that the reason I was so effective in solving those problems is directly related to the seminar. How is that? Here are the important non-technical things I learned at the seminar:

  1. Systematic approach – You don’t work off lists, you don’t waste time by looking at random places and you don’t guess (much). You gather symptoms, you use them to pinpoint the problem and you use the pinpointed knowledge to work your way toward a solution. The last part sometimes involves Oracle support. I knew about the systematic approach thing before, but two days of looking at someone demonstrating it makes a difference.
  2. Don’t believe anyone (except the OS) – Users lie, other DBAs lie, even Oracle sometimes lies. Always crosscheck and double check the facts. No one lies intentionally, but the result is still misleading.
  3. Problems have causes. I know it sounds funny, but very often we stop troubleshooting too soon, attributing a problem to mysterious unknown forces or at least say “well, I don’t know how to know this” and leave things at that. Tanel went farther than anyone I’ve ever seen by saying “I have to know why this behaves like that” and when Oracle doesn’t tell him, he goes to the OS, or the network, or writes his own tools. Thats a good lesson – don’t take no for an answer.
  4. All DBAs have tons of troubleshooting scripts. Real experts have scripts with very short names and very flexible arguments. They also have a script for reminding them how to use their scripts
  5. I no longer view trouble as something annoying that wastes my time and prevents me from doing stuff I want to do. Instead every trouble is now cherished as an opportunity to practice what I learned, learn more and polish my skills.

I highly recommend Tanel’s course to DBAs who want to suddenly become the best troubleshooters in their team. Its not a comfortable position to be in (suddenly a lot more trouble finds its way to you), but it can be lots of fun.

 

The Senile DBA Guide to Troubleshooting Sudden Growth in Redo Generation November 4, 2009

Filed under: scripts, tips — prodlife @ 2:03 am

I just troubleshooted a server where the amounts of redo generated suddenly exploded to the point of running out of disk space.

After I was done, the problem was found and the storage manager pacified, I decided to save the queries I used. This is a rather common issue, and the scripts will be useful for the next time.

It was very embarrassing to discover that I actually have 4 similar but not identical scripts for troubleshooting redo explosions. Now I have 5 :)

Here are the techniques I use:

  1. I determine whether there is really a problem and the times the excessive redo was generated by looking at v$log_history:
    select trunc(FIRST_TIME,'HH') ,sum(BLOCKS) BLOCKS , count(*) cnt
    ,sum(decode(THREAD#,1,BLOCKS,0)) Node1_Blocks
    ,sum(decode(THREAD#,1,1,0)) Node1_Count
    ,sum(decode(THREAD#,2,BLOCKS,0)) Node2
    ,sum(decode(THREAD#,2,1,0)) Node2_Count
    ,sum(decode(THREAD#,3,BLOCKS,0)) Node3
    ,sum(decode(THREAD#,3,1,0)) Node3_Count
    from v$archived_log
    where FIRST_TIME >sysdate -30
    group by trunc(FIRST_TIME,'HH')
    order by trunc(FIRST_TIME,'HH') desc
    
  2. If the problem is still happening, I can use Tanel Poder’s Snapper to find the worse redo generating sessions. Tanel explains how to do this in his blog.
  3. However, Snapper’s output is very limited by the fact that it was written specifically for situations where you cannot create anything on the DB. Since I’m normally the DBA on the servers I troubleshoot, I have another script that actually gets information from v$session, sorts the results, etc.
    create global temporary table redo_stats
    ( runid varchar2(15),
      sid number,
      value int )
    on commit preserve rows;
    
    truncate table redo_stats;
    
    insert into redo_stats select 1,sid,value from v$sesstat ss
    join v$statname sn on ss.statistic#=sn.statistic#
    where name='redo size'
    
    commit;
    
    insert into redo_stats select 2,sid,value from v$sesstat ss
    join v$statname sn on ss.statistic#=sn.statistic#
    where name='redo size'
    
    commit;
    
    select *
            from redo_stats a, redo_stats b,v$session s
           where a.sid = b.sid
           and s.sid=a.sid
             and a.runid = 1
             and b.runid = 2
             and (b.value-a.value) > 0
           order by (b.value-a.value)
    
  4. Last technique is normally the most informative, and requires a bit more work than the rest, so I save it for special cases. I’m talking about using logminer to find the offender in the redo logs. This is useful when the problem is no longer happening, but we want to see what was the problem last night. You can do a lot of analysis with the information in logminer, so the key is to dig around and see if you can isolate a problem. I’m just giving a small example here. You can filter log miner data by users, segments, specific transaction ids – the sky is the limit.
    -- Here I pick up the relevant redo logs.
    -- Adjust the where condition to match the times you are interested in.
    -- I picked the last day.
    select 'exec sys.dbms_logmnr.add_logfile(''' || name ||''');'
    from v$archived_log
    where FIRST_TIME >= sysdate-1
    and THREAD#=1
    and dest_id=1
    order by FIRST_TIME desc
    
    -- Use the results of the query above to add the logfiles you are interest in
    -- Then start the logminer
    exec sys.dbms_logmnr.add_logfile('/u10/oradata/MYDB01/arch/1_8646_657724194.dbf');
    
    exec sys.dbms_logmnr.start_logmnr(options=>sys.dbms_logmnr.DICT_FROM_ONLINE_CATALOG);
    
    -- You can find the top users and segments that generated redo
    select seg_owner,seg_name,seg_type_name,operation ,min(TIMESTAMP) ,max(TIMESTAMP) ,count(*)
    from v$logmnr_contents
    group by seg_owner,seg_name,seg_type_name,operation
    order by count(*) desc
    
    -- You can get more details about the specific actions and the amounts of redo they caused.
    select LOG_ID,THREAD#,operation, data_obj#,SEG_OWNER,SEG_NAME,TABLE_SPACE,count(*) cnt ,sum(RBABYTE) as RBABYTE ,sum(RBABLK) as RBABLK
    from v$logmnr_contents
    group by LOG_ID,THREAD#,operation, data_obj#,SEG_OWNER,SEG_NAME,TABLE_SPACE
    order by count(*)
    
    -- don't forget to close the logminer when you are done
    exec sys.dbms_logmnr.end_logmnr;
    
  5. If you have ASH, you can use it to find the sessions and queries that waited the most for “log file sync” event. I found that this has some correlation with the worse redo generators.

    -- find the sessions and queries causing most redo and when it happened
    
    select SESSION_ID,user_id,sql_id,round(sample_time,'hh'),count(*) from V$ACTIVE_SESSION_HISTORY
    where event like 'log file sync'
    group by  SESSION_ID,user_id,sql_id,round(sample_time,'hh')
    order by count(*) desc
    
    -- you can look the the SQL itself by:
    select * from DBA_HIST_SQLTEXT
    where sql_id='dwbbdanhf7p4a'
    
 

Thinking about Design Patterns November 3, 2009

Filed under: books, design — prodlife @ 3:46 am

I have this friend who is an ambitious young corporate climber. When I was started out as a team lead, I was totally overwhelmed by the office politics I was suddenly exposed to. Naturally, I turned to my friend for advice. She told me to read Machiaveli’s “The Prince”. So I did. Its an interesting read, but not that amazing as far as management advice goes. When I later tried to discuss the book with my friend, I found out that she never actually read the book herself – she just thought it is good advice.

6 years later and I still believe people when they tell me that I have to read a book. I’m naive like that.

So I spent the last two weeks reading “The Timeless Way of Building”. Its an architecture book. Architecture as in cities and buildings. The reason I spent two weeks reading a professional book intended for a different profession is that at some point in history (1994, I believe), some people though that the ideas in the book are relevant to software development. These days you can’t really be a Java developer without being fully fluent in the development pattern language.

And of course, everyone was saying “You have to read The Timeless Way of Building. It will change the way you think about software.”. From my days in development, I still remember quite a bit of stuff about design patterns, and I never really liked that particular approach to software development, but I didn’t really figure out why. After reading the authoritative source on patterns, I can say the following:

  1. Christopher Alexander had some good ideas about patterns. The book is readable to non-architects and is very enlightening. I recommend reading it if you are interested in what makes some cities and buildings feel better than others.
  2. I am pretty sure that his advice on how to design good buildings is not really applicable to software development field.
    A lot of his ideas are based on the fact (which he did the research to prove) that people intuitively know how buildings should be designed, and that when you ask a large number of people “How do you imagine you will feel in such room?”, you’ll get an overwhelming consensus. This is far from being true in the software field.
  3. What is common known today as software design patterns is so far removed from what Christopher Alexander recommended for architects, that software developers should really go and find a different name for what they are doing.

I’m still rather shocked by the differences between Christopher Alexander’s patterns, and what design patterns look like today. It is not few tiny differences that occur whenever ideas are translated from one domain to another. Some of the changes are profound.

First of all, Christopher Alexander says that patterns describe the way people already do things. “Night Life” and “Parallel Streets” existed before the book “A Pattern Language” was written. I’m not at all sure this is the case for design patterns. People buy design pattern books to learn the patterns themselves, not just the language or which patterns are better than others.

Second, patterns should have an intuitive meaning and intuitive name. Again, you don’t need a book to know what is a “Bus Stop” or “Small Parking Lots”. You may want to read the book to find out why they are a good idea, or how to make a good bus stop, but you know what it is. I don’t believe that anyone knew what is an “Abstract Factory” before reading a document about design patterns. Even patterns that have been used for decades got a fancy name. It can take a while to figure out that a Singleton is a global variable. One of the simplest and most common patterns in software development “A function that does exactly one task” is missing from software design patterns. “A Loop” is also a pattern which is missing in action. All this gives the wrong impression that patterns are very complicated and something that can be mastered by experts only – which is exactly the opposite of what Christopher Alexander intended.

Third, patterns are abstract concepts. They are always implemented in a different way, because the entire idea is to be sensitive to the context, which is never the same twice. There is a pattern called “Six-foot balcony”, but it would be wrong to mass-manufacture six-foot balconies and start attaching them to buildings. Six-foot balcony is the idea, the exact shape of the balcony will be designed to match the building, the view, the trees, the sun, etc.
So it is rather annoying to discover that all patterns have “implementation examples”, which developers enjoy copying into their code. I’m all for code reuse, but this is not patterns mean.Wikipedia has a decent description of what defines a pattern, and “being implementable in one or two simple classes that can be copy-pasted” is not part of it.

Executive summary: “Timeless Way of Building” is an interesting book on architecture, with some good insights about how humans like to live and a bit of a Zen feel. You will not learn anything about software development from reading it. If you already know software design patterns, you will be struck by how different the ideas in the book seem.

 

Lessons From OOW09 #3 – What’s New with Streams? October 20, 2009

Filed under: 11g, openworld09, streams — prodlife @ 1:24 am

The big news with streams is the Golden Gate acquisition. Oracle acquired Golden Gate, a smaller company that has a solution very similar to what Oracle Streams is doing. During OpenWorld Oracle announced that Golden Gate is the new strategic direction and that future development will be towards integrating Streams with Golden Gate. Oracle will still support streams, fix bugs, etc – but don’t expect new features other than the Golden Gate integration.

I missed the session where Golden Gate explain what its all about, but I’m planning to invite Golden Gate representative to give us a presentation and explain exactly what they do.

In the mean while, interesting new stuff with the old streams:

  1. One to many replication should be faster and more reliable in 11gR2.
  2. New statement DML handlers – allows manipulating LCRs using SQL statements instead of the PL/SQL code used in the past. According to Oracle it should be 4 times faster this way. One common use case of DML handlers that can now be implemented with the statement handlers is converting “delete” statements into an update that marks a row as “deleted” by modifying a varchar in a column.
  3. Keep Columns – new rule based transformation that allows you to specify which columns should be preserved in an LCR. In the past you can specify columns to drop, but not which columns to keep.
  4. Built in script for recording changes in a log table. This is an incredibly common use case for Streams, and Oracle now has a single command that automatically sets up the necessary statement handles and keep columns. Just call DBMS_STREAMS_ADM.MAINTAIN_CHANGE_TABLE.
  5. XSTREAM – Oracle says its a new API for allowing external programs to insert and pull changes from streams. Its cool, but I’m not convinced its new, since I’ve heard about a similar feature in previous versions under a different name.
  6. Streams performance advisor – thats an 11gR1 feature, but I didn’t know about it. Its a PL/SQL package and a bunch of views that can be used to report on streams performance. It should also be able to detect your topology automatically. You use this by running DBMS_PERFORMANCE_ADVISOR_ADM.ANALYZE_CURRENT_PERFORMANCE and checking a bunch of views. Documented here.
  7. If I followed correctly, the new advisor package uses the UTL_SPADV package, that you can also use for performance analysis or monitoring.
  8. DBMS_COMPARISON – a package that can be used to compare tables between two servers (one should be 11g and the other 10g or 11g). Can be useful when troubleshooting streams.
  9. Streams cross DB tracing – allows tracking a message when trying to troubleshoot apply that doesn’t work properly. You enable message tracking in a session (set message tracking) and then you monitor the views to see the actions that happen to it.
  10. 11g has greatly improved and more detailed apply error messages. This is probably my favorite new feature :) Most of the time I no longer need to print the LCR to debug the issue.
  11. Not sure if this is 11gR1 or R2 – but apparently propagation is no longer managed by “jobs” but has now moved to the new scheduler, making it much more manageable.
  12. Bunch of nice improvements. I’m looking forward to seeing what Golden Gate is doing and why it is so much better.

 

Lessons From OOW09 #2 – Consolidation Tips October 19, 2009

Filed under: openworld09, tips — prodlife @ 5:54 am

The session was called “All in One” and it was given by Husnu Sensoy. A young and very accomplished DBA from Turkey. I chatted with him during ACE dinner and it turns out we have many colleagues in common. This was probably the most useful presentation I’ve heard this OpenWorld. As I am going into a large consolidation project for next year, I am glad I can learn from the experience of someone like Husnu who already gone through this and is very willing to share the experience.

His presentation is shared on his blog, so I’ll just give the parts that I consider to be highlights. You will probably learn more by reading his slides. He had tons of good content and he talks very very fast, so I’m sure I missed a bunch of good stuff. Since his content is readily available, I’m mixing in a lot of my own thoughts here.

The problems he set out to solve:

  • Too many DBs and too few DBAs.
  • Some servers are doing almost nothing.
  • Some servers have no HA.

Pick candidates for consolidation based on: License costs, utilization, data center location, dependencies, I/O characteristics, risk levels.

The driver for our consolidation were license costs – our new machines had two quad cores instead of one dual core, so license costs suddenly quadrapled and we were forced into cost saving consolidations. We mostly used data center location and risk levels to decide on the plan. Our most OLTP system, the one that is most sensitive to slow-downs, will remain unconsolidated for now.

Prior to consolidation collect lots of system/performance metrics. They will help pick candidates, plan capacity, test and later troubleshoot.

Don’t forget to talk to DBAs and business reps when making the consolidation plans, they will have their own ideas and this can be important input.

Additive linear models are recommended for capacity planning. He gave lots of guidelines on how to do this. Pages 26-36 in his PDF have the details. I could have sworn he recommended to stay below the 65% utilization when planning for CPU capacity, but I cannot see it in his slides. In any case – do this, because any higher than that and the linear additive model is questionable.

Also pay attention to the part about preferring larger servers and less RAC nodes, since RAC adds complexity. And to the part about every storage system delivering about 70-80% of spec. Actually, this is more true for the EMC system he used. Our Netapps seem to be up to spec.

Don’t mix sequential and random IO (i.e. OLTP and DW) is a good idea. A lot of places can’t really do this because of the way their apps are designed.

Benchmarking the new system to test the capacity plan is a great idea. I’d love to see more concrete information on how to benchmark, maybe a whole other presentation on this. One of the things that worry me most about our consolidation plans is that I’m not sure how good our tests will be. Husnu recommended HammerOra, which I’ll check out.

Crash tests. We did those ages ago when we moved to RAC architecture and then again for Netapp clusters. Was lots of fun and maybe its time to do this again. Husnu advised to ask support for a list of good test scenarios. I recommend taking your sysadmins, storage admins and net admins for few beers and asking them for scenarios – they generally come up with very creative stuff.

Good tip: In 11g, memory target does not work with huge pages. You are using huge pages, right?

Write the backup and recovery document as the first document on the new system.

Pages 76-86 have good advice on merging databases. We’ll be doing that too and I was glad to see that we came up with the same plans and same problems as Husnu.

His last advice is the best: “You never know how long this is going to take.” So true! Who could have known that we will be delayed for 3 month by an IT security group that popped up from no-where with requirement that we will pass certain security audits that we’ve never heard of. Life in a big organization can be full of surprises, so be prepared :)

 

Lessons From OOW09 #1 – Shell Script Tips October 17, 2009

Filed under: Linux, openworld09, scripts, tips — prodlife @ 1:46 am

During OpenWorld I went to a session about shell scripting. The speaker, Ray Smith, was excellent. Clear, got the pace right, educating and entertaining.

His presentation was based on the book “The Art of Unix Programming” by one Eric Raymond. He recommended reading it, and I may end up doing that.

The idea is that shell scripts should obey two important rules:

  1. Shell scripts must work
  2. Shell scripts must keep working (even when Oracle takes BDUMP away).

Hard to object to that :)

Here’s some of his advice on how to achieve these goals (He had many more tips, these are just the ones I found non-trivial and potentially useful. My comments in italics.)

  1. Document dead ends, the things you tried and did not work, so that the next person to maintain the code won’t try them again.
  2. Document the script purpose in the script header, as well as the input arguments
  3. Be kind – try to make the script easy to read. Use indentation. Its 2009, I’m horrified that “please indent” is still a relevant tip.
  4. Clean up temporary files you will use before trying to use them:

    function CleanUpFiles {
    [ $LOGFILE ] && rm -rf ${LOGFILE}
    [ $SPOOLFILE ] && rm -rf ${SPOOLFILE}
    }
  5. Revisit old scripts. Even if they work. Technology changes. This one is very controversial – do we really need to keep chasing the latest technology?
  6. Be nice to the users by working with them – verify before taking actions and keep user informed of what the script is doing at any time. OPatch is a great example.
  7. Error messages should explain errors and advise how to fix them
  8. Same script can work interactively or in cron by using: if [ tty -s ] …
  9. When sending email notifying of success or failure, be complete. Say which host, which job, what happened, how to troubleshoot, when is the next run (or what is the schedule).
  10. Dialog/Zenity – tools that let you easily create cool dialog screens
  11. Never hardcode passwords, hostname, DB name, path. Use ORATAB, command line arguments or parameter files.I felt like clapping here. This is so obvious, yet we are now running a major project to modify all scripts to be like that.
  12. Be consistent – try to use same scripts whenever possible and limit editing permissions
  13. Use version control for your scripts. Getting our team to use version control was one of my major projects this year.
  14. Subversion has HTTP access, so the internal knowledge base can point at the scripts. Wish I knew that last year.
  15. Use deployment control tool like CFEngine. I should definitely check this one out.
  16. Use getopts for parameters. Getopts looked to complicated when I first checked it out, but I should give it another try.
  17. Create everything you need every time you need it. Don’t fail just because a directory does not exist. Output what you just did.
  18. You can have common data files with things like hosts list or DB lists that are collected automatically on regular basis and that you can then reference in your scripts.
  19. You can put comments and descriptions in ORATAB
 

Most Important Thing I’ve Learned at OOW09 October 13, 2009

Filed under: musing, openworld09 — prodlife @ 2:11 pm

It is only Tuesday morning, OOW is not even half-way through. But there is something I’ve learned on Monday morning, and it left such a huge impression, that I know right now that nothing else that will happen until Thursday can top this.

Jonathan Lewis gave a presentation on “How to be an Expert” in the Unconference. Later I went on to discuss the issue of expertise with him, especially how much work it takes to become an expert. He told me that this morning (Meaning Monday morning) he was preparing a demo for a presentation, and after testing it on 9i, 10g and 11g, he noticed something strange in the way 10g behaved. He then went on to spend the next hour figuring out the strange behavior he saw. Even though it was not part of the presentation and not one was really expecting him to solve this problem.

This story was like a lightbulb going off in my head.

Because I wish I was sure that I would do just what he did. I am geek enough that there is a possibility I would do it. But there is also a voice in my head that tells me things like “Why are you wasting time just playing with this? Its not like solving this problem is going to be useful in any way. You have more important things to do, stop playing!” (Yes, I’m hearing voices. Don’t you?)

And the problem is that very often that internal manager is correct. There are more important things to do. Always. Stop playing with that interesting issue with the TCP/IP stack, reboot that machine already and solve the next SR, it is urgent and has to get done by lunch. Jonathan repeated that several times in his presentation – DBAs are under a lot of pressure not to be experts.

But I also believe that much of life is a story we tell ourselves. Its invented. Not only the relative importance of understanding the system and solving SRs fasters is an invention, also the importance of having our manager approve of the speed in which we solve SRs is something we decide on.

I can invent a story in which I am an Oracle expert. And as an Oracle expert I take time to understand how things behave and why they behave the way they do. Because this is how experts work.

Of course experts also take into account the wishes and desires of the people who pay them, but it cannot replace the importance of really understanding things. Because in the long term, people do pay you to be an expert and understand what is Oracle really doing.

I really hope this post can be as much of an epiphany to others as Jonathan Lewis’s presentation was to me. Invent yourself as an expert. Take time to learn, research, think, play and understand.

 

Visualization Session – The Slides October 13, 2009

Filed under: Uncategorized — prodlife @ 1:49 pm

The “Visualization Session” at OOW Unconference was great. Thanks to everyone who showed up for the lively discussion. It was probably the most fun I’ve ever had at a presentation.
Also thanks for the fine folks whom I later met at the OTN lounge and explained that they wanted to attend my presentation but the OTN lounge had free beer and I did not. I’ll see what I can do about the beer next year.

For those who missed the presentation whether due to beer or to distance from OpenWorld, you can get my slides here. As usual for my presentations, I’m not sure if my slides are meaningful without me standing next to them. It is just a bunch of graphs without the stories. If you really want to hear the stories, you can invite me to speak at your usergroup :)