A Year with Streams

Whenever Tom Kyte is asked about replication, he has a default answer: Don’t.

Well, last month marked the first anniversary of our use of Streams-based replication to create a reporting DB.

In many ways, I’m very proud of this solution. Our reporting DB allows our software-as-a-service to offer very flexible integrations with all kinds of reporting systems without our having to worry about the impact on our production environment. Our senior DBA designed a terrific architecture that allows for relatively pain-free maintenance on the system. We have several customers using this solution, and it seems to be making some profit.

To celebrate the anniversary, we are announcing a formal SLA for the service. SLA discussions are always interesting. We are looking at three important components:

  1. Setup times – How long after a customer requests this service do we expect to deliver it?
  2. Availability – How many hours a month do we expect the service to be down?
  3. Time to replication – A change was made on production. How soon can we show it on the reporting DB?

Setup times are our most volatile component – we range between 15 minutes, when everything goes well, and around two months, when we run into an Oracle bug, Oracle offers no workaround, and patch after patch fails to solve the problem.

With that kind of variance, offering any kind of SLA seems a bit futile. Next week I’m at Hotsos; maybe Robyn Sands has the solution. BTW, I find it incredible that until last month I’d never heard of Robyn Sands, and now not only do I hear about her all over the place, but the more I hear about her, the more I like her. A lady who’s an Oracle pro, knows her stats, and wants to stop the root cause madness. I can’t wait to meet her.

Back to the point: we don’t see how we can offer an SLA on the setup time, which is bad. What about availability after the service is up? We are doing much better here, in the sense that downtimes are predictable. The problem is that we have many of them. We have a mandatory two hours of downtime a month due to an Oracle bug that requires us to rebuild the environment from scratch every month. We also have an additional monthly downtime for applying the various patches we need in order to solve other bugs.

In short, Streams is still buggy. So even if you design your replication to be very safe, put primary keys on all your tables, and get everything else correct, you still have trouble maintaining high availability and predictable setup times. This is for a product that has been out since 9i.

What about replication times? Here, I feel like the guy who jumped out of a 50th-floor window. Somewhere around the 30th floor, someone looks out his window and asks him, “Hey! How are you doing?” and he replies, “So far, so good!”. So far, we’ve been able to resolve all the performance problems we had with replication (#1 tip – don’t use “like” in replication rules; see the sketch below).
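
To illustrate the tip: a LIKE (or NOT, or !=) in a rule condition makes the rule “complex” in rules-engine terms, so it can no longer be evaluated on the fast path. A minimal sketch of the alternative – one simple table rule per replicated table instead of a single pattern-matching rule – is below; the schema, table, capture, and queue names are all made up:

    -- AVOID: a custom rule condition such as
    --   :dml.get_object_name() LIKE 'ORD%'
    -- makes the rule complex and forces slow rule evaluation.
    --
    -- PREFER: one simple rule per table; simple rules stay on the
    -- rules engine's fast evaluation path.
    BEGIN
      DBMS_STREAMS_ADM.ADD_TABLE_RULES(
        table_name   => 'app.orders',           -- made-up schema/table
        streams_type => 'capture',
        streams_name => 'capture_rep',          -- made-up capture name
        queue_name   => 'strmadmin.rep_queue',  -- made-up queue name
        include_dml  => TRUE,
        include_ddl  => FALSE);
    END;
    /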

With all the troubles we are having with Streams, would I use this solution again? I think that given all the constraints we had to work with, replicating data to a reporting DB was the best solution.

I would have preferred a different solution. I would have liked our product to have a strong, consistent web services API that would allow integration with external systems without any need for direct access to the database. Unfortunately, the web services API arrived late, and we have to support versions that don’t have it yet. Also, integrating with a web services API seems to require more work than database-level integration, and the business was very eager to offer an easier solution.

Another solution that might have been preferable (although I’m less sure) is a reporting utility where we could load customized queries for our customers, run them as a low-priority process on production, and send the customers the results. The problem with this solution is that it may require more work than Streams (keeping an eye on all that custom code running on production), and guaranteeing response times for the reports would have been an issue. With the reporting database we somehow managed to avoid any guarantee regarding the performance of the reports.

To sum things up, when Streams is good it is very, very good, and when it is bad, it is still somewhat better than the alternatives. At least it gives me lots of blog material.

14 Comments on “A Year with Streams”

  1. Dan Norris says:

    Robyn is awesome indeed. I look for ways to spend more time around her as well and look forward to Hotsos for that reason, plus I’ll get to see you again too!

  2. Surachart says:

    Good idea! Streams is a good solution for replication.

  3. Narendra says:

    Chen,

    If possible, can you please shed some light on how you arrived at the decision to implement Streams? What other options did you consider, and why did you discard them?
    Wouldn’t a logical/physical standby database have served the purpose (of a reporting DB) without the issues with Streams?
    Not that I know a lot about it (hey, I am nowhere even near…), but I would love to understand the decision-making process.

    • prodlife says:

      @Narendra – that would be a separate post, since it would summarize about a month of research, and there is really a lot to describe.
      Specifically about DG: we could not use it, since we wanted to consolidate only a few schemas from each production DB onto one reporting instance.

  4. Uli says:

    Hi there.

    What is your experience with Streams and running big batch jobs with a low commit rate (say, updating/inserting > 1M records) on your source?

    Thanks
    Uli

    • prodlife says:

      @Uli – I have zero experience with streaming big batch jobs with low commit rates. We stream a more OLTP-type environment.
      Streams has to process each of the updated/inserted rows separately (it does not do batches, just row-by-row updates), so I’d expect terrible performance – way slower than on the source.
      If I had to replicate a DB with that kind of traffic, I’d just run the batch on both DBs (source and target) at the same time, and make sure it is not replicated (see the sketch below).
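
      A minimal sketch of the “run it on both sides, don’t replicate it” approach, using Streams tags. This assumes the capture rules were created with the default include_tagged_lcr => FALSE, so changes tagged in the session are ignored by capture; the tag value itself is arbitrary:

        -- On the source, before the batch: tag this session's redo so
        -- the capture process skips it.
        BEGIN
          DBMS_STREAMS.SET_TAG(tag => HEXTORAW('1D'));  -- any non-NULL tag
        END;
        /

        -- ... run the big batch here, in this same session ...
        -- (and run the identical batch on the target DB as well)

        BEGIN
          DBMS_STREAMS.SET_TAG(tag => NULL);  -- resume normal replication
        END;
        /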

  5. chris_c says:

    I caught Robyn’s presentation on root cause analysis at the UKOUG conference (one of two I managed to get to), which was excellent. I missed her other presentation as I was presenting at the same time, although that may have been lucky, as it meant my audience was much smaller :).
    SLAs are always tricky – it’s not enough just to set them; the clients need to understand what they mean. For instance, in my last permanent role we had a 2-hour on-call response SLA. The actual wording was that after receiving a call, a DBA would be logged in and working on the problem within 2 hours. Unfortunately this often got interpreted as “everything will be fixed in 2 hours”, which led to some interesting phone calls…

    • prodlife says:

      @chris
      I often compare SLAs to insurance policies. The 99.9% is printed in big letters, but it’s the small print that really counts.

  6. We have a similar experience using Streams to replicate an OLTP database out to a reporting database, including:

    – Significant time spent building tools and utilities to manage the process.

    – Significant time for debugging, recreating bugs, patching, unpatching, re-patching, arguing with Oracle, and managing open TARs.

    It’s a headache.

    • prodlife says:

      @Michael Exactly! I’m also slightly disappointed with Oracle’s support of Streams. They try hard, but everything that is not completely trivial takes months to resolve.

  7. Wayne Lang says:

    How about read-only Data Guard? It is much easier to set up and more reliable than Streams. It does not add much load to production and can be an HA solution for it. With 11g, you can use it as a testing DB as well.

    • prodlife says:

      @Wayne – in our specific case, we only wanted to replicate specific schemas, and to combine several source databases on one target. This made Streams a better fit than Data Guard. I agree that if DG can be used, it is an easier and more reliable solution.

  8. @Uli – Streaming batch updates is possible, but painful. We tend to re-write the batch to commit periodically throughout the job rather than wait until the end to commit (e.g., recalc fees on 1,000 customers at a time instead of on all 1 million) – see the sketch below.

    If Streams gets behind, things can go downhill pretty quickly.
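
    A minimal sketch of this chunked-commit pattern; the table, columns, and fee calculation are all made up for illustration:

      -- Recalculate fees 1,000 customers at a time, committing after
      -- each chunk, so Streams sees many small transactions instead
      -- of one huge one.
      DECLARE
        CURSOR c_cust IS SELECT customer_id FROM customers;
        TYPE t_ids IS TABLE OF customers.customer_id%TYPE;
        l_ids t_ids;
      BEGIN
        OPEN c_cust;
        LOOP
          FETCH c_cust BULK COLLECT INTO l_ids LIMIT 1000;
          EXIT WHEN l_ids.COUNT = 0;
          FORALL i IN 1 .. l_ids.COUNT
            UPDATE customers
               SET monthly_fee = ROUND(balance * 0.01, 2)  -- made-up calc
             WHERE customer_id = l_ids(i);
          COMMIT;  -- one small transaction per 1,000 customers
        END LOOP;
        CLOSE c_cust;
      END;
      /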

  9. Mitchell Coak says:

    @Uli,

    I did testing of this scenario last year. When the LCR count is > 1,500, the apply process switches from parallel to serial, and performance slows to a crawl.
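
    For reference, apply parallelism can be inspected and set through DBMS_APPLY_ADM; a minimal sketch, with a made-up apply name (note this does not change the serialization threshold described above):

      -- Check current apply parameters, including parallelism:
      SELECT parameter, value
        FROM dba_apply_parameters
       WHERE apply_name = 'APPLY_REP';  -- made-up apply name

      -- Raise the number of parallel apply servers:
      BEGIN
        DBMS_APPLY_ADM.SET_PARAMETER(
          apply_name => 'apply_rep',
          parameter  => 'parallelism',
          value      => '4');
      END;
      /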

