I'm just a simple DBA on a complex production system

Writing about all things production. Especially Oracle databases.

Lessons From OOW09 #3 – What’s New with Streams? October 20, 2009

Filed under: 11g, openworld09, streams — prodlife @ 1:24 am

The big news with streams is the Golden Gate acquisition. Oracle acquired Golden Gate, a smaller company that has a solution very similar to what Oracle Streams is doing. During OpenWorld Oracle announced that Golden Gate is the new strategic direction and that future development will be towards integrating Streams with Golden Gate. Oracle will still support streams, fix bugs, etc – but don’t expect new features other than the Golden Gate integration.

I missed the session where Golden Gate explain what its all about, but I’m planning to invite Golden Gate representative to give us a presentation and explain exactly what they do.

In the mean while, interesting new stuff with the old streams:

  1. One to many replication should be faster and more reliable in 11gR2.
  2. New statement DML handlers – allows manipulating LCRs using SQL statements instead of the PL/SQL code used in the past. According to Oracle it should be 4 times faster this way. One common use case of DML handlers that can now be implemented with the statement handlers is converting “delete” statements into an update that marks a row as “deleted” by modifying a varchar in a column.
  3. Keep Columns – new rule based transformation that allows you to specify which columns should be preserved in an LCR. In the past you can specify columns to drop, but not which columns to keep.
  4. Built in script for recording changes in a log table. This is an incredibly common use case for Streams, and Oracle now has a single command that automatically sets up the necessary statement handles and keep columns. Just call DBMS_STREAMS_ADM.MAINTAIN_CHANGE_TABLE.
  5. XSTREAM – Oracle says its a new API for allowing external programs to insert and pull changes from streams. Its cool, but I’m not convinced its new, since I’ve heard about a similar feature in previous versions under a different name.
  6. Streams performance advisor – thats an 11gR1 feature, but I didn’t know about it. Its a PL/SQL package and a bunch of views that can be used to report on streams performance. It should also be able to detect your topology automatically. You use this by running DBMS_PERFORMANCE_ADVISOR_ADM.ANALYZE_CURRENT_PERFORMANCE and checking a bunch of views. Documented here.
  7. If I followed correctly, the new advisor package uses the UTL_SPADV package, that you can also use for performance analysis or monitoring.
  8. DBMS_COMPARISON – a package that can be used to compare tables between two servers (one should be 11g and the other 10g or 11g). Can be useful when troubleshooting streams.
  9. Streams cross DB tracing – allows tracking a message when trying to troubleshoot apply that doesn’t work properly. You enable message tracking in a session (set message tracking) and then you monitor the views to see the actions that happen to it.
  10. 11g has greatly improved and more detailed apply error messages. This is probably my favorite new feature :) Most of the time I no longer need to print the LCR to debug the issue.
  11. Not sure if this is 11gR1 or R2 – but apparently propagation is no longer managed by “jobs” but has now moved to the new scheduler, making it much more manageable.
  12. Bunch of nice improvements. I’m looking forward to seeing what Golden Gate is doing and why it is so much better.

 

Lessons From OOW09 #2 – Consolidation Tips October 19, 2009

Filed under: openworld09, tips — prodlife @ 5:54 am

The session was called “All in One” and it was given by Husnu Sensoy. A young and very accomplished DBA from Turkey. I chatted with him during ACE dinner and it turns out we have many colleagues in common. This was probably the most useful presentation I’ve heard this OpenWorld. As I am going into a large consolidation project for next year, I am glad I can learn from the experience of someone like Husnu who already gone through this and is very willing to share the experience.

His presentation is shared on his blog, so I’ll just give the parts that I consider to be highlights. You will probably learn more by reading his slides. He had tons of good content and he talks very very fast, so I’m sure I missed a bunch of good stuff. Since his content is readily available, I’m mixing in a lot of my own thoughts here.

The problems he set out to solve:

  • Too many DBs and too few DBAs.
  • Some servers are doing almost nothing.
  • Some servers have no HA.

Pick candidates for consolidation based on: License costs, utilization, data center location, dependencies, I/O characteristics, risk levels.

The driver for our consolidation were license costs – our new machines had two quad cores instead of one dual core, so license costs suddenly quadrapled and we were forced into cost saving consolidations. We mostly used data center location and risk levels to decide on the plan. Our most OLTP system, the one that is most sensitive to slow-downs, will remain unconsolidated for now.

Prior to consolidation collect lots of system/performance metrics. They will help pick candidates, plan capacity, test and later troubleshoot.

Don’t forget to talk to DBAs and business reps when making the consolidation plans, they will have their own ideas and this can be important input.

Additive linear models are recommended for capacity planning. He gave lots of guidelines on how to do this. Pages 26-36 in his PDF have the details. I could have sworn he recommended to stay below the 65% utilization when planning for CPU capacity, but I cannot see it in his slides. In any case – do this, because any higher than that and the linear additive model is questionable.

Also pay attention to the part about preferring larger servers and less RAC nodes, since RAC adds complexity. And to the part about every storage system delivering about 70-80% of spec. Actually, this is more true for the EMC system he used. Our Netapps seem to be up to spec.

Don’t mix sequential and random IO (i.e. OLTP and DW) is a good idea. A lot of places can’t really do this because of the way their apps are designed.

Benchmarking the new system to test the capacity plan is a great idea. I’d love to see more concrete information on how to benchmark, maybe a whole other presentation on this. One of the things that worry me most about our consolidation plans is that I’m not sure how good our tests will be. Husnu recommended HammerOra, which I’ll check out.

Crash tests. We did those ages ago when we moved to RAC architecture and then again for Netapp clusters. Was lots of fun and maybe its time to do this again. Husnu advised to ask support for a list of good test scenarios. I recommend taking your sysadmins, storage admins and net admins for few beers and asking them for scenarios – they generally come up with very creative stuff.

Good tip: In 11g, memory target does not work with huge pages. You are using huge pages, right?

Write the backup and recovery document as the first document on the new system.

Pages 76-86 have good advice on merging databases. We’ll be doing that too and I was glad to see that we came up with the same plans and same problems as Husnu.

His last advice is the best: “You never know how long this is going to take.” So true! Who could have known that we will be delayed for 3 month by an IT security group that popped up from no-where with requirement that we will pass certain security audits that we’ve never heard of. Life in a big organization can be full of surprises, so be prepared :)

 

Lessons From OOW09 #1 – Shell Script Tips October 17, 2009

Filed under: Linux, openworld09, scripts, tips — prodlife @ 1:46 am

During OpenWorld I went to a session about shell scripting. The speaker, Ray Smith, was excellent. Clear, got the pace right, educating and entertaining.

His presentation was based on the book “The Art of Unix Programming” by one Eric Raymond. He recommended reading it, and I may end up doing that.

The idea is that shell scripts should obey two important rules:

  1. Shell scripts must work
  2. Shell scripts must keep working (even when Oracle takes BDUMP away).

Hard to object to that :)

Here’s some of his advice on how to achieve these goals (He had many more tips, these are just the ones I found non-trivial and potentially useful. My comments in italics.)

  1. Document dead ends, the things you tried and did not work, so that the next person to maintain the code won’t try them again.
  2. Document the script purpose in the script header, as well as the input arguments
  3. Be kind – try to make the script easy to read. Use indentation. Its 2009, I’m horrified that “please indent” is still a relevant tip.
  4. Clean up temporary files you will use before trying to use them:

    function CleanUpFiles {
    [ $LOGFILE ] && rm -rf ${LOGFILE}
    [ $SPOOLFILE ] && rm -rf ${SPOOLFILE}
    }
  5. Revisit old scripts. Even if they work. Technology changes. This one is very controversial – do we really need to keep chasing the latest technology?
  6. Be nice to the users by working with them – verify before taking actions and keep user informed of what the script is doing at any time. OPatch is a great example.
  7. Error messages should explain errors and advise how to fix them
  8. Same script can work interactively or in cron by using: if [ tty -s ] …
  9. When sending email notifying of success or failure, be complete. Say which host, which job, what happened, how to troubleshoot, when is the next run (or what is the schedule).
  10. Dialog/Zenity – tools that let you easily create cool dialog screens
  11. Never hardcode passwords, hostname, DB name, path. Use ORATAB, command line arguments or parameter files.I felt like clapping here. This is so obvious, yet we are now running a major project to modify all scripts to be like that.
  12. Be consistent – try to use same scripts whenever possible and limit editing permissions
  13. Use version control for your scripts. Getting our team to use version control was one of my major projects this year.
  14. Subversion has HTTP access, so the internal knowledge base can point at the scripts. Wish I knew that last year.
  15. Use deployment control tool like CFEngine. I should definitely check this one out.
  16. Use getopts for parameters. Getopts looked to complicated when I first checked it out, but I should give it another try.
  17. Create everything you need every time you need it. Don’t fail just because a directory does not exist. Output what you just did.
  18. You can have common data files with things like hosts list or DB lists that are collected automatically on regular basis and that you can then reference in your scripts.
  19. You can put comments and descriptions in ORATAB
 

Most Important Thing I’ve Learned at OOW09 October 13, 2009

Filed under: musing, openworld09 — prodlife @ 2:11 pm

It is only Tuesday morning, OOW is not even half-way through. But there is something I’ve learned on Monday morning, and it left such a huge impression, that I know right now that nothing else that will happen until Thursday can top this.

Jonathan Lewis gave a presentation on “How to be an Expert” in the Unconference. Later I went on to discuss the issue of expertise with him, especially how much work it takes to become an expert. He told me that this morning (Meaning Monday morning) he was preparing a demo for a presentation, and after testing it on 9i, 10g and 11g, he noticed something strange in the way 10g behaved. He then went on to spend the next hour figuring out the strange behavior he saw. Even though it was not part of the presentation and not one was really expecting him to solve this problem.

This story was like a lightbulb going off in my head.

Because I wish I was sure that I would do just what he did. I am geek enough that there is a possibility I would do it. But there is also a voice in my head that tells me things like “Why are you wasting time just playing with this? Its not like solving this problem is going to be useful in any way. You have more important things to do, stop playing!” (Yes, I’m hearing voices. Don’t you?)

And the problem is that very often that internal manager is correct. There are more important things to do. Always. Stop playing with that interesting issue with the TCP/IP stack, reboot that machine already and solve the next SR, it is urgent and has to get done by lunch. Jonathan repeated that several times in his presentation – DBAs are under a lot of pressure not to be experts.

But I also believe that much of life is a story we tell ourselves. Its invented. Not only the relative importance of understanding the system and solving SRs fasters is an invention, also the importance of having our manager approve of the speed in which we solve SRs is something we decide on.

I can invent a story in which I am an Oracle expert. And as an Oracle expert I take time to understand how things behave and why they behave the way they do. Because this is how experts work.

Of course experts also take into account the wishes and desires of the people who pay them, but it cannot replace the importance of really understanding things. Because in the long term, people do pay you to be an expert and understand what is Oracle really doing.

I really hope this post can be as much of an epiphany to others as Jonathan Lewis’s presentation was to me. Invent yourself as an expert. Take time to learn, research, think, play and understand.

 

Visualization Session – The Slides October 13, 2009

Filed under: Uncategorized — prodlife @ 1:49 pm

The “Visualization Session” at OOW Unconference was great. Thanks to everyone who showed up for the lively discussion. It was probably the most fun I’ve ever had at a presentation.
Also thanks for the fine folks whom I later met at the OTN lounge and explained that they wanted to attend my presentation but the OTN lounge had free beer and I did not. I’ll see what I can do about the beer next year.

For those who missed the presentation whether due to beer or to distance from OpenWorld, you can get my slides here. As usual for my presentations, I’m not sure if my slides are meaningful without me standing next to them. It is just a bunch of graphs without the stories. If you really want to hear the stories, you can invite me to speak at your usergroup :)

 

Visualization Session at OOW Unconference October 8, 2009

Filed under: openworld09, presentation — prodlife @ 11:02 pm

OOW is next week, and on Monday, October 12, 4:00pm PST, I’ll be giving my unconference session about visualization.

Visualization is a big topic. What I’m going to talk about is:

  1. Why visualization is such an amazing tool for DBAs. Not just for newbies, it is actually more effective for experienced DBAs. I’ll explain why.
  2. Lots of examples of how you can use different graphs to explore your data set from different angles. I’ll show you some graphs you are familiar with (Histograms), and some that you probably don’t use (QQNorm). I’ll explain when to use each graph type.
  3. Stories about visualization – how a visualization mistake caused a manager to panic and cancel an important maintenance for instance.
  4. Some suggestions on how to format graphs so they’ll have maximum effect in documents, emails and presentations.
  5. If anyone is interested, I can show how I work with R to do visual data analysis. But its not in the slides, so you’ll have to ask for it.

One thing I will not mention:

Edward Tufte.

I know that everyone who talks about visualization talks about him, but I won’t. He does those amazing graphs and demonstrates how powerful visualization can be. I use Excel and R to create my graphs. They are not beautiful or amazing, and they can still be very very effective. This means that you will be able to easily reproduce everything I do in my presentation. You don’t need to be a talented designer to use graphs effectively.

See you on Monday!

 

One More Thing Everyone Should Know About Queues October 8, 2009

Filed under: concurrency, performance — prodlife @ 12:11 am

I already posted two things everyone should know about queues, but the incidents of the last month made me realize I missed another very important queue fact.

Don’t double the load on a server and expect that response times will be “a little slower”. If the server is above 30-40% capacity, doubling the load is likely to be catastrophic.

Suppose you have two servers, for example a Netapp heads, that are operating at 48% capacity each, and suddenly one of them fails and you have to move all the load to one of them. This means that you have one server at close to 100% capacity. Users will not complain that things are a bit slow, they will complain that nothing is working.

Someone once told me that if I have two RAC nodes running at 100% CPU each, I do not have high availability. The truth is that you stop having high availability long before the 200% CPU point.

Oh, and in case anyone wonders what we did about the overloaded Netapp. We stopped every non-essential process reading from this Netapp. This included snapmirrors, non-production databases, exports, vmwares, and probably more stuff than I know. This moved utilization down to 60% point and life was good (except that we weren’t too comfortable about lack of snapmirrors and exports).

 

One of those… Months? October 7, 2009

Filed under: musing — prodlife @ 10:28 pm

I’ve had one of those days, and even some weeks like that, but its the first time we have an entire murphy month – where everything possible goes wrong.

Lets see the list:

  1. DBA accidentally dropped production schema. He thought he was on test DB, of course. We are very proud that we managed to restore said schema with no data loss.
  2. One of our databases magically lost the storage network. No idea why or how. Reboot solved it.
  3. 8 hours downtime caused by a faulty switch. We have high availability, so we automatically failed over to the secondary switch. The secondary switch immediately failed too. Since we test the failover regularly, this can only be described as unbelievably bad luck.
  4. One of our Netapp heads failed. Again, we have high availability, so we fail over to the second head. Except that after we fixed the first head, it refused to recognize the disks. According to Netapp, the first head has to run a “rebuild” on the disks, so it can figure out again where is our data. We could have done it with few hours of downtime, but we already had a lot of downtime this month. So we opted for online rebuild. Which is as fun as online rebuild of indexes. Online rebuild of each disk takes around 12 hours. We have 14 disks. It was the week of unbelievable IO latency. The only upside is that for one week the DBA team was not the target for performance complaints.
  5. Bunch of smaller things: DBA who accidentally reset passwords in 20 servers, backups that stopped working, ORA-600 on capture process for our largest streams customer, accidentally exposing data of one customer to another, etc.

It should be obvious that the gods are out to get us. So much bad luck in one month cannot be accidental or random.

Since this run of production crashes coincides with the Jewish “Day of Atonement” (Yom Kippur) and the preceding repentance days period, the solution seemed obvious – I should repent my sins, promise never to repeat them, and pray for atonement. In Judaism any transgression of law is considered a sin. Even if it is not a moral hazard or if it was done by innocent mistake.

So consider this the reverse of new year’s resolutions. What I resolve not to do next year:

  1. Install new servers and publish them as production before I verify that backups and monitors indeed work on these servers.
  2. Undocumented changes on production servers.
  3. Accuse developers of being stupid and lazy. Not even if I find a nice way to paraphrase this.
  4. Ignore large infrastructural problems just because I prefer to work on something else.
  5. Ignore mysterious production glitches, just because they don’t happen a lot.

These steps should help our production be more stable next year. The more positive resolutions will wait for January :)