I’m just a simple DBA on a complex production system

Writing about all things production. Especially Oracle databases.

What is “Network Delay”? June 25, 2009

Filed under: musing, network — prodlife @ 9:00 pm

Data-Guard is often used for disaster planning, where the primary server is usual very far from the failover. Often on a different continent.

We are also planning to data-guard, but for a different disaster. We move about 90% of our operation to a new data center. About 3000 miles away. We have 4 hours of downtime. There is an old Russian saying – “two moves equal one fire”, so I guess we are planning for 50% disaster :)

One of the questions that pop-up a lot is the question of network delays. Its not a very well defined question. The person asking usually starts with “A change was done on production at 3pm. When will it be applied on the failover?” and the next question is: “How much time out of this gap is spent on network-related waits, rather than oracle-waits or disk-waits”?

There are 3 important factors that will influence the answer:

  1. Network Latency. The time it takes to send 1 packet of data from target to destination. It is the easiet to measure – just ping the destination, and it is influenced mostly by the physical distance between the two locations and the speed of light.
  2. Bandwidth utilization. The time it takes to send 1 packet is interesting, but we are more interested in the time it takes to send 500M of redo log. We have a nice OC3, theoretically capable of doing 155Mb/s. So theoretically 500M should around take around 15 seconds? Not really. First of all, network is a queue system, and we all know that we shouldn’t really run our queue system at 100% capacity, so we can’t use all 155 Mb/s.  TCP has congestion control implemented, so it wouldn’t let you send all your data at once, it will make you wait for the other side to start acknowledging first and carefully control the amounts of data you are allowed to send at one time.  Oh, and maybe other applications are also using the line. Given all that, it should be obvious that the percentage of the given bandwidth that you manage to actually use has a huge impact on the transfer rates. I’ve no idea how to check the line utilization myself – but my network manager can send me very nice graphs  :)
  3. Congestion and errors. You have an SLA with your provider and all, but its a fact of life that not all the packets that leave source arrive safely to destination. Sometimes they get lost, or arrive in the wrong order. All these errors have an proportional impact on transfer time – 1% lost packets can cause 200% change in transfer times. Because once TCP has to retransmit lost packets it will start sending data very slowly, waiting to make sure it is received on the other side before sending more – and the utilization will drop like a rock.

If you are in the business of getting data across the ocean at a decent speed, you should also know that there are some companies that do WAN acceleration by addressing the causes for delay I mentioned about, and by introducing compression and proxies to the game. Its worth taking a look.

Most of all, don’t estimate delays by pinging remote machines. Talk to your network manager.

 

Cause and Effect – Warranties and Equipment Half-life. June 19, 2009

Filed under: musing — prodlife @ 3:00 am

Cause and effect is much discussed topic. People are always jumping to conclusions about causality, when all they know is that there is a correlation between two events.

XKCD even made a cute cartoon about this subject.

So this eternal question showed up for lunch today. We were discussing the recent epidemic of network equipment failures. Within a week we lost two switches and a router. Someone remarked that we probably purchased all this equipment at the same time, and the warranty on the equipment probably expired last week.

Everyone laughed. Because we all know that if you purchase something with 1 year warranty, the equipment is sure to fail after a year and a day.

But this started a chicken and egg discussion – do manufacturers build equipment that will last exactly a year and a day, or perhaps we they are setting the warranty time to be slightly lower than the mean time to failure of the equipment.

Both the senior DBA and I thought the answer was completely obvious. We just didn’t agree on which answer.

Senior DBA thought that manufacturing technology advanced to a point that it is very easy to build equipment that will last for very long times. However, since they only had incentive to offer warranties for around three years, they built the equipment so that it will fail shortly after the warranty expired.

I thought the opposite. Equipement durability is like SLAs – building something that will last 4 years is vastly more expensive than building something that will last 3 years. Also, I’m not sure that consumers take durability into account when deciding which brand to purchase. So, the manufacturers build something that can be profitable at a price the market will pay. Then they test or calculate how long it will continue operating on average, substract a month from that and call it their warranty.

We had a long and not very usefull discussion around that – each of us was so convinced that his point of view is completely obvious and could not understand how the other person (who is normally very smart) does not immediately see the obvious.

Any readers working in the manufacturing industry cares to shed some light on this mystery?

 

Everything a Junior DBA Should Know – a Book Review June 18, 2009

Filed under: books — prodlife @ 3:50 am

The book is called “Beginning Oracle Database 11g Administration”.

I know what you are thinking. A book called “Beginning Oracle Database 11g Administration” cannot possibly be a good book. It just sounds like one of those books. You know the kind of book I’m talking about -  “DBA in 21 days”, “Oracle 11g handbook”, etc. Big, boring, completely useless.

In this case, looks are so misleading that I feel compelled to correct this first impression. This is not a big, boring, useless book. This is a brilliant and funny book and it is the most perfect book for newbie DBAs. I wish someone had given me this book few years ago. If I ever write a book, I want it to be just like this one. I’m definitely going to make management buy it for every new junior DBA hired in our department (Not that we are hiring new DBAs. But thats a different and very sad story).

I will now proceed to explain (in detail!) why this book is so perfect, and few of the less perfect things about this book.

In my opinion, the thing that makes a technical book fun and brilliant (as opposed to boring and useless) is the author’s voice. My favorite books have a very specific voice to them. The book sounds pretty much like the author sounds in a presentation or conversation or email.

Tom Kyte’s book has that “make a complex subject crystal clear with one good example” voice that I associate with AskTom. Cost-Based Fundamentals sound as obsessively details-oriented as their author, Jonathan Lewis. Laurent Schneider’s Advanced SQL book sounds exactly like Laurent – “The example is self explanatory, words are a waste of time”.

Books that don’t have that personal voice, that sound totally mechanical and robotic, are boring books, and since they probably bored their authors as well, they are usually not very good books.

“Beginning Oracle Database 11g Administration” sound exactly like the author – Iggy Fernandez. And Iggy is one of the funniest people I know. He’s also very no-nonsense type of guy, very realistic about what databases, customers and DBA jobs are like. And he has tons of experience. I’ve known Iggy for about a year, but you’ll be able to figure all this out after reading just few pages – the book is amusing, and slightly cynical and contains lots of real world stories all ending with “This happened to me, don’t let it happen to you!” and there are a bunch of exercises at the end of each chapter to test your knowledge, oh and a nice quote at the beginning of each chapter. Did I mention that this is exactly the type of book I wish I could write?

So, what’s in the book?

The book starts at unusual place – The first chapter is about relational theory. Which is a very welcome change from the usual “SGA, buffer cache, shared pool, 5 processes, data files, redo files, control files” chapter that normally open the newbie books. The chapter introduces important concepts about relations, ACID and data integrity. I liked that – starting with basic concepts before diving into Oracle mechanisms is a great way to start.

Follows is a chapter about SQL and PL/SQL. Its a decent introduction, but one of my least favorite chapters. First of all, the chapter about SQL and PL/SQL contains exactly 2 pages about PL/SQL and one example. I understand the limitations involved in writing an introductory book, but it still didn’t seem right. To make things even more annoying – 4 pages are dedicated to “Criticisms of SQL”. Thats the downside of reading a book with a personal author voice – it is clear that the author fell in love with Chris Date, but objectively, Date’s theories are somewhat of an advanced topic and seem out of place in a beginner level book. Not that I’m complaining – I found the topic fascinating :)

One other issue I found is that the first example in the chapter attempts to demonstrate efficient vs. inefficient SQL, by showing a query that due to a function in the where clause will fail to use an index. Its a great example, but indexes were not introduced in the book at this point and may be unfamiliar to the reader. This kind of problem happens several times in the book – examples and explanations use concepts that were not introduced yet. Not a big deal, but it may stump complete beginners.

Chapter 3 gives the expected overview of Oracle architecture and with that the “concept” chapters are finished.

Chapter 4 is called “Planning” and is one of the reasons I wish I had this book five years ago. It covers license costs, the different editions, different architecture choices (like RAC and dataguard) and most important – sizing. This is the only general-purpose DBA book I’ve seen to cover the essential issue of sizing.

Chapter 5 is installation and 6 is database creation. They contain all the usual screenshots of the usual installation and DBCA graphical interfaces. In the end of chapter 6 is a nice surprise – a listing that shows how to create a database without DBCA, using the almost forgotten “create database” command.

Chapter 7 covers physical design – partitions, indexes, clusters, etc. Chapter 8 covers user creation, some of the basics of privilege management (although options like “with grant” are left out) and also tools like export, datapump and sqlloader. I loved the explanation about datapump – because the listing contained many of the errors DBAs run into when using it. Much better than the examples in which everything works!

This ends the section about implementation. The next section is about support.

Chapter 9 covers some of the administration tools – SQL Developer and EM. I would have dropped this chapter in favor of PL/SQL chapter, but thats me.

Chapter 10 is about monitoring. Like planning, monitoring is a huge and critical part of the DBA job, and most books don’t give it the attention it deserves. This chapter is another reason this book is a must-have for new DBAs. Iggy avoids the trap of saying “EM takes care of monitoring. Here’s how you configure EM”. The chapter is completely tool-independent and covers the most important things a DBA should monitor. (I know it took me few years to figure out that backups should be monitored and very carefully too!). The chapter is also full of the queries that you’ll want to use for monitoring – which makes it a very valuable and useful reference.

Chapter 11 is about troubleshooting. It starts with a systematic overview of the troubleshooting process (because troubleshooting is a process and not something that is done once) and the process Iggy gives is identical to the one I give in my “Troubleshooting Streams” presentation. We did not steal from each other – its the very well known scientific process and everyone should follow it. Its a very promising start, but it gets much better. Iggy gives troubleshooting best practices and an hillarious dialog between a DBA and a user demonstrating the troubleshooting process. The chapter finishes with a list of resources that can be used for troubleshooting, and some advice about troubleshooting common error codes (although ORA-01555 is not that common anymore). The explanation on troubleshooting ORA-600 also demonstrates some metalink basics (The only Oracle book I know that explains Metalink. This is amazing, considering that many DBAs use metalink as often as they use SQL*PLUS)

Chapters 12 and 13 are a decent overview of backup and recovery topics.

Chapter 14 covers the things DBAs normally do on a regular basis – backups, statistics collection, data archiving, cleanup of trace files, reviewing audit records, password changes, capacitiy management and patching. Putting all this in a single chapter is unusual, and I’m not 100% sure it works. Statistics make more sense in the context of tuning, backups were already discussed, etc. I would have reduced this chapter to a single page checklist and a strong recommendation to automate as many of these tasks as possible.

Chapter 15 is another great one. This is the only chapter in the book that is as useful to senior DBAs as it is to beginners. The chapter covers the correct way to run a DBA team and make management happy. It covers ITIL basics and the recommended IT management processes. It explains what the business usually expects from the DBA team. And it gives very good ideas on how to build a documentation library and more important – why.

The last section is about database tuning. Books can be written about this topic, and Iggy only gives a basic beginner-level intro to the subject. Thankfully he avoids saying anything that will require unlearning later on (no ratio-based tuning!).

Chapter 16 covers instance tuning. Starting with the systematic process one should follow (instead of just messing with random hidden parameters) and discusses the main techniques used – DB Time analysis, Oracle wait interface and statspack.

Chapter 17 is about SQL statement tuning – it discusses the use of session traces to find these queries and some tuning techniques such as indexes, hints and collecting better statistics. The example that ends this chapter is very in-depth and very useful.

I hope that by now everyone who actually bothered reading my slightly longish review understands why I love this book so much. I can’t overemphasize how fun and readable the book is, and how useful it will be for DBAs with 1-2 years of experience.

In the interest of transparency, I’d like to mention that Iggy is a friend of mine and he gave me a copy of his book for free. He did not ask me to review it, its just that if a friend of mine happens to write the best beginner DBA book in existance, I think everyone should know.

 

Did you know you can rename tablespaces? June 13, 2009

Filed under: tips — prodlife @ 2:07 am

You probably knew that. There is some possibility that I’m the last person to figure it out and even I knew about it for at least two years. After all renaming tablespaces existed since 10.1, which makes it almost 6 years old.

Just in case you didn’t know:
alter tablespace users rename to cool_users;

According to the docs (and my tests confirmed that), the rename changes the tablespace name but not the tablespace id – so it will not break behaviors such as default tablespace and file ownership and such.

Why is it such a cool feature? Because sometimes I get an export dump file from a customer who is moving to our service. I have to import the schemas from the dump file to our DB. I want his schemas to reside on CUST_X tablespace, but the dump file says they are in MYTBS. What do I do?

Before the magical rename: import to an indexfile. Use VI to modify the file to create all objects in CUST_X tablespace instead of MYTBS. Run the indexfile. Run the import.

After magical rename: rename CUST_X to MYTBS. Import. Rename back.

Much better!

Of course, datapump has the even more magical remap_tablespace parameter. Alas, not all my customers use datapump.

 

NoCoug Spring Conference – May 21 May 8, 2009

Filed under: nocoug — prodlife @ 1:25 am

I’m into local things these days. I try to buy food at the local farmers market and my books in local book stores. I’m lucky to live in California where the farmer markets contain tons of fresh veggies even in February, and the local book shop carries several copies of “Cost Based Fundamentals”  :)

I’m also happy when I can get my Oracle Geekery locally. I mean,  I enjoy meeting DBAs from all over the world in the big international conferences. But practically speaking, my next job is more likely to come from the DBA I met at a local user group and tells me about that great company he is working for, and the local OUG is where I meet the guys (and girls) that I can call up when I’m completely stuck on a problem, or if I want to meet someone to have a drink and talk shop for a bit.

Local user groups rule. Especially in California.

After all that chat, here are my reasons for telling my bosses that I’ll be on training on May 21st and they can only call me on emergencies:

  1. Cary Millsap will be giving the keynote and a presentation. I’m a huge fan of his :)
  2. Gaja Krishna Vaidyanatha will be presenting “The SAN is Guilty… Until Proven Otherwise!”. He is such a good presenter that you’ll enjoy this even if you don’t have SAN.
  3. I’m giving my “Streams Troubleshooting” presentation. If you ever stared at the screen helpless while streams simply did not work, you probably want to attend. I have tons of experience to share.
  4. Dave Abercrombie from Convio will talk about session tagging for end-to-end monitoring and troubleshooting. I met him at HotSos – he’s a good speaker with great approach to monitoring.
  5. Did I mention how cheap NoCoug conferences are? I still can’t believe that you get to listen to all those great presentations + breakfast + lunch for 25$. I mean, this is 1% from the cost of Oracle Open World, and you get so much more value.
  6. Its a good chance to catch up with my local Oracle friends.
  7. People go to me and say “Hey! I love your blog!” :)
  8. I’m not in the office. This is almost like vacation, but it doesn’t count as PTO because I’m in a conference.

See you there!

 

Always Make New Mistakes May 6, 2009

Filed under: musing — prodlife @ 8:06 pm

I found a cute fridge magnet with this sentence on it, and I liked it so much that I copied it to a post it and stuck it to the wall next to the monitor.

I love it because I want to always be doing new stuff, and when you do new stuff you naturally make mistakes, and you learn from them, and its great. You just want to avoid doing the same mistake twice, because this may mean that you are not learning. What’s not to like? Oh, and I also get to write blog posts about my new mistakes. Old mistakes are usually not worth writing about.

But this should probably be improved to “Always Make New Mistakes (in staging)”, because when I work on production I want to always see nice old mistakes that I’ve seen before and know exactly how to solve in the 3 minutes of maintenance downtime I have this month while my boss is looking over my shoulder asking questions.

Yay for new mistakes, but not on production.

 

Streams on RAC May 5, 2009

Filed under: rac, streams — prodlife @ 6:51 pm

We had RAC system as streams source for 18 month now. But just today I configured a RAC as streams target.

It was somewhat of an anticlimax since there is absolutely nothing interesting to do.

We do downstream capture, so I had to place the archive logs on a shared drive. Every place where I used the SID before, I now used the DB name.

Capture and Apply processes both started on the same node. When I stopped that node, I saw the other node modifying a service called SYS${streams user}.{streams queue name}.{target db name}’ , changing it to run on the remaining node.

Then the capture and apply processes started on the remaining node and everything continued as usual.

As I said, no big deal. I just wanted to let everyone know that it is no big deal.

 

When Archive Logs Are Not Copied To Standby May 5, 2009

Filed under: streams — prodlife @ 1:01 am

I debug this same issue at least once a month since January 2008, and I always debug it from scratch because I’m a DBA with bad memory and I never remember how I solved it the last 18 times.

So, if you added a second (or third) archive log destination and then you see errors in the alter log that look like:

————————————————————
Check that the primary and standby are using a password file
and remote_login_passwordfile is set to SHARED or EXCLUSIVE,
and that the SYS password is same in the password files.
returning error ORA-16191
————————————————————

Check:

  1. That you are using password files
  2. That the name of the password file is $ORACLE_HOME/dbs/orapw${ORACLE_SID}
  3. That the password of sys is the same on all the servers involved in the archive copy
  4. If one of the databases is 11g, make sure you set sec_case_sensitive_logon to false
  5. Also check that you created the password file with ignore_case=y

I think this covers my usual debug process. Except it normally takes me an hour to remember that 11g has this thing with case sensitive passwords, and what the parameter is called, and that there is another thingy to do with the password files… Hopefully, not anymore.

 

Everyone Loves a Reorganization April 30, 2009

Filed under: musing — prodlife @ 1:46 am

At least VPs love them. So when our division’s VP retired and we got a brand new VP, we pretty much knew what to expect.

This reorg has been a good one. Several of my favorite co-workers and managers got amazing promotions. It is terrific to see good and capable people recognized and moved up the management chain. I’m happy to work in an organization that will be run by such a great team.

The DBA team was moved from being part of application production team to the infrastructure team. Now my team mates are the storage manager, the network manager and the monitoring manager. This is a big change, but its going to be great. I love the folks of the infrastructure team. Naturally, we worked together quite a bit in the past and leaving mutual finger-pointing aside, they are really really good at what they do and I’m very happy to be in their team. One of them even reads my blog :)

But even the best reorg is still a jolt. I find myself agonizing over all sort of stuff. The silliest thing I’m worried about is what to put in “rollback decision maker” field in my change requests. My last manager thought that rollback decisions should be made by the person doing the change. The manager before that insisted that he will be the one to make the decision. Both had a point. What is my new manager going to prefer? (I’ll find out in the CAB meeting tomorrow).

This is a good time to reflect on how much my work is shaped by the specific preferences of my direct manager. Some things are very constant – backups will be taken and tested, patches will be applied, production schema will be cloned for development.

But everything else is up for grabs – whether or not the formal change process should be followed, should we get involved in long-term projects, involvement in what other teams are doing, do we try new solutions or stick to what works, whether team members should know what is going on or be kept in the dark, whether team members are allowed to have opinions, etc.

I’m also fretting about bigger things. For example, I hope the new boss will not have problem with blogs, conferences and community involvement.

*******

If you are in the Netherlands on May 26, you should seriously consider going to DBA Symposium. For some reason, many of my favorite Oracle experts are Dutch, and three of them (Rob van Wijk, Frits hoogland and Toons Koppelaars) will be presenting there. Its got to be an amazing symposium.

 

Group By in Shell April 22, 2009

Filed under: Linux, scripts, tips — prodlife @ 6:35 pm

Storage manager dropped by my cube to say that redo log archives for one of our DBs grew from 40G to 200G in the last day. Running “du -Sh” few times showed that the files are no longer getting written as fast, so the issue is probably over. But what was it? And to start, when was it? Once I figure out when the issue occure, I may be able to find the offending DML in ASH.

So, I need to do something like:

select day,hour,sum(bytes)
from (ls -l)
group by day,hour
order by day,hour

I guess I could have used the output of “ls -l”  as an external table and actually run this query. But I’m actually trying to improve my shell scripting skills. There must be some way to do it!

I asked on Twitter, and Kris Rice (of SQLDeveloper fame) pointed me in the right direction – use AWK.

Here’s the result:

ls -l | sed ’s/:/ /g’ | awk ‘{sizes[$7$8] += int($5)} END {for (hour in sizes)
{ printf(”At %s %d bytes were inserted\n”,hour,sizes[hour])}}’ | sort

I run ls. Use sed to seperate the time into fields (Laurent Schneider will kill me, because I’m mixing sed and awk, but I’m not it his level yet…). After the split, the fifth field contains the file size, the seventh contains the date and the eighth is the hour. So I  have an associative array called “sizes”, indexed on the date and hour, and each bucket contains the sum of the file sizes in this hour.

Then I loop on the array and print the results, and sorted them to make it more readable.

A long post for a short script, but I want to make sure I don’t forget it ;)

Anyway, the best part by far was everyone who tried to help my desperate call in Twitter/Facebook. It was unexpected and very very nice. About 5 people sent tips, ideas and even solutions! I only imagine that doing my work is more fun than doing their own work, but I appreciate the help a lot.