Control Charts

Last week, while working on customer engagement, I learned a new method of quantifying the behavior of time-series data. The method is called “Control Chart”, and credit goes to Josh Wills, our director of data science, for pointing it out. I thought I’d share it with my readers, as it’s easy to understand, easy to implement, flexible and very useful in many situations.

The problem is ages old – you collect measurements over time and want to know when your measurements indicate abnormal behavior. “Abnormal” is not well defined, and that’s on purpose – we want our method to be flexible enough to match what you define as an issue.

For example, let’s say Facebook is interested in tracking the usage trend for each user, in order to catch those whose use is decreasing.

There are a few steps to the Control Chart method:

1. Collect all relevant data points. In our case, the number of minutes of Facebook use per day for each user.
2. Calculate a baseline – this can be the average use for each user or the average use for similar demographics. It can even be an adaptive average of the type calculated by Oracle Enterprise Manager, to take into account decreased Facebook use over the weekend.
3. Calculate “zones” one, two and three standard deviations around the baseline.
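To make steps 2 and 3 concrete, here is a minimal Python sketch. It assumes the simplest baseline (a plain average of one user’s measurements) and the sample standard deviation; the function name and the sample numbers are made up for illustration.

```python
from statistics import mean, stdev

def control_zones(measurements):
    """Return the baseline, the standard deviation and the zone boundaries
    one, two and three standard deviations around the baseline."""
    baseline = mean(measurements)
    sigma = stdev(measurements)  # sample standard deviation; needs at least two points
    zones = {k: (baseline - k * sigma, baseline + k * sigma) for k in (1, 2, 3)}
    return baseline, sigma, zones

# Hypothetical minutes of use per day for one user
minutes_per_day = [30, 32, 28, 31, 29, 30, 33, 27]
baseline, sigma, zones = control_zones(minutes_per_day)
# Here the baseline is 30 and the standard deviation is 2,
# so zone 3 spans from 24 to 36 minutes.
```

An adaptive baseline (for example one average for weekdays and another for weekends) would replace the single `mean` call; the zone calculation stays the same.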

Those zones can be used to define rules for normal and abnormal behaviors of the system. These rules are what make the system valuable.
Examples of rules that define abnormal behavior can be:

1. Any point more than 3 standard deviations above the baseline. This indicates an extreme, sudden increase.
2. 7 consecutive measurements more than one standard deviation over the baseline. This indicates a sustained increase.
3. 9 consecutive measurements, each higher than the previous one. This indicates a steady upward trend.
4. 6 consecutive measurements, each more than two standard deviations away from the baseline and each on the opposite side of the baseline from the previous measurement. This indicates instability of the system.
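The four example rules above could be checked programmatically along these lines. This is only a sketch: the function names are invented here, and the baseline and standard deviation are assumed to have been calculated already.

```python
def _has_run(flags, length):
    """True if `flags` contains at least `length` consecutive True values."""
    run = 0
    for flag in flags:
        run = run + 1 if flag else 0
        if run >= length:
            return True
    return False

def rule_extreme_point(values, baseline, sigma):
    """Rule 1: any point more than 3 standard deviations above the baseline."""
    return any(v > baseline + 3 * sigma for v in values)

def rule_sustained_increase(values, baseline, sigma, length=7):
    """Rule 2: `length` consecutive points more than 1 standard deviation above the baseline."""
    return _has_run([v > baseline + sigma for v in values], length)

def rule_upward_trend(values, length=9):
    """Rule 3: `length` consecutive points, each higher than the previous one."""
    # `length` rising points means `length - 1` consecutive increases.
    increases = [b > a for a, b in zip(values, values[1:])]
    return _has_run(increases, length - 1)

def rule_instability(values, baseline, sigma, length=6):
    """Rule 4: `length` consecutive points beyond 2 standard deviations,
    each on the opposite side of the baseline from the previous one."""
    run = 0
    prev_side = 0
    for v in values:
        if v > baseline + 2 * sigma:
            side = 1
        elif v < baseline - 2 * sigma:
            side = -1
        else:
            side = 0
        if side == 0:
            run = 0           # point is inside the two-sigma zone: reset
        elif side == -prev_side:
            run += 1          # far from the baseline and on the opposite side
        else:
            run = 1           # far from the baseline, but starts a new run
        prev_side = side
        if run >= length:
            return True
    return False

# Nine points, each higher than the last, trigger the trend rule.
steady_climb = [25, 26, 27, 28, 29, 30, 31, 32, 33]
trend_detected = rule_upward_trend(steady_climb)
```

To catch decreasing use, as in the Facebook example, the same checks can be mirrored below the baseline.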

There are even sets of standard rules used in various industries, best practices of sorts. Western Electric rules and Nelson rules are particularly well known.

Note how flexible the method is – you can use any combination of rules that highlights the abnormalities you are interested in.
Also note that while the traditional use indeed involves charts, the values and rules are very easy to calculate programmatically, so visualization can be useful but is not mandatory.
If you measure CPU utilization on a few servers, visualizing the chart and actually seeing the server behavior can be useful. If you are Facebook and monitor user behavior, visualizing a time series for every one of your millions of users is hopeless. Calculating baselines, standard deviations and rules for each user is trivial.

Also note how this problem is “embarrassingly parallel”. To calculate behavior for each user, you only need to look at data for that particular user. A parallel, shared-nothing platform like Hadoop can be used to scale the calculation indefinitely, simply by throwing an increasing number of servers at the problem. The only limit is the time it takes to calculate the rules for a single user.
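A small Python sketch of that independence follows. A thread pool stands in for the cluster’s workers, the data and the rule (latest point more than three standard deviations below the user’s own baseline) are made up for illustration, and none of this is Hadoop code.

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import mean, stdev

def flag_user(item):
    """Evaluate one user using only that user's own measurements."""
    user, minutes = item
    history, latest = minutes[:-1], minutes[-1]
    baseline, sigma = mean(history), stdev(history)
    # Illustrative rule: latest point more than 3 standard deviations below the baseline
    return user, latest < baseline - 3 * sigma

# Hypothetical data: minutes of use per day, keyed by user
usage = {
    "alice": [30, 32, 28, 31, 29, 30, 33, 27, 31],
    "bob": [30, 32, 28, 31, 29, 30, 33, 27, 5],
}

# No call needs another user's data, so the work can be split
# across any number of workers without coordination.
with ThreadPoolExecutor(max_workers=4) as pool:
    flagged = dict(pool.map(flag_user, usage.items()))
# flagged == {"alice": False, "bob": True}
```

Because `flag_user` shares no state between users, the same function could run as a per-key step on a shared-nothing platform.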

Naturally, I didn’t dive into some of the complexities of using Control Charts, such as how to select a good baseline, how to calculate the standard deviation (or whether to use another statistic to define zones) and how many measurements should be examined before a behavior signals a trend. If you think this tool is useful for you, I encourage you to investigate more deeply.

Big Data News from Oracle OpenWorld 2013

It has been only a week since Oracle OpenWorld concluded, and I already feel like I’m hopelessly behind on posting news and impressions. Behind or not, I have news to share!

The most prominent feature announced at OpenWorld is the “In-Memory Option” for Oracle Database 12c. This option is essentially a new part of the SGA that caches tables in a columnar format. This is expected to make data warehouse queries significantly faster and more efficient. I would have described the feature in more detail, but Jonathan Lewis gave a better overview in this forum discussion, so just go read his post.

Why am I excited about a feature that has nothing to do with Hadoop?

First, because I have a lot of experience with large data warehouses. So I know that big data often means large tables, but only a few columns used in each query. And I know that in order to optimize these queries and to avoid expensive disk reads every time a query runs, we build indexes on those columns, which makes data loading slow. The In-Memory option will allow us to drop those indexes and just store the columns we need in memory.

Second, because I’m a huge fan of in-memory data warehouses, and am happy that Oracle is now making these feasible. A few TB of memory in a large server is no longer science fiction, which means that most of your data warehouse will soon fit in memory. Fast analytics for all! And what do you do with the data that won’t fit in memory? Perhaps store it in your Hadoop cluster.

Now that I’m done being excited about the big news, let’s talk about the small news that you probably didn’t notice but should have.

Oracle announced two cool new features for the Big Data Appliance. “Announced” may be too big a word: Larry Ellison did not stand up on stage and talk about them. Instead, the features sneaked quietly into the last upgrade and appeared in the documentation.

Perfect Balance – If you use Hadoop as often as I do, you know how data skew can mess with query performance. You run a job with several reducers, each of which aggregates data for a subset of keys. Unless you took great care in partitioning your data, the data will not be evenly distributed between the reducers, usually because it wasn’t evenly distributed between the keys. As a result, you will spend 50% of the time waiting for that one last reducer to finish already.
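To see why skewed keys lead to one slow reducer, here is a small Python simulation. It is not Hadoop code: CRC32 stands in for the partitioner’s hash function, and the keys and record counts are made up.

```python
import zlib
from collections import Counter

def reducer_for(key, num_reducers):
    """Assign a key to a reducer by hashing it (CRC32 is a stand-in hash)."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def reducer_loads(keys, num_reducers):
    """Count how many records each reducer receives."""
    loads = Counter({r: 0 for r in range(num_reducers)})
    for key in keys:
        loads[reducer_for(key, num_reducers)] += 1
    return loads

# Hypothetical skewed input: one hot key holds most of the records
records = ["hot_key"] * 1000 + [f"key_{i}" for i in range(9) for _ in range(10)]
loads = reducer_loads(records, num_reducers=4)

busiest = max(loads.values())  # at least 1000: all hot_key records land together
ideal = len(records) / 4       # 272.5 records per reducer if the load were even
```

All records for a key go to the same reducer, so however the other keys are spread, the reducer that gets the hot key does most of the work while the rest sit idle.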

Oracle’s Perfect Balance makes the “took great care in partitioning your data” part much, much easier. This is just a quick overview, not an in-depth post, so I won’t go into details of how this works (wait for my next post on this topic!). I’ll just mention that Perfect Balance can be used without any change to the application code, so if you are using BDA, there is no excuse not to use it.

And no excuse to play solitaire while waiting for the last reducer to finish.

Oracle XQuery for Hadoop – Announced but not officially released yet, which is why I’m pointing you at an Amis blog post. For now that’s the best source of information about this feature. This feature, combined with the existing Oracle Loader for Hadoop, will allow running XQuery operations on XML documents stored in Hadoop, pushing down the entire data processing bit to MapReduce on the Hadoop cluster. Anyone who knows how slow, painful and CPU-intensive XML processing can be on an Oracle database server will appreciate this feature. I wish I had it a year ago when I had to ingest XML documents at a very high rate. It is also so cool that I’m a bit sorry that we never developed more awesome XQuery capabilities for Hive and Impala. Can’t wait for the release so I can try it!

During OpenWorld there was also additional exposure for existing, but perhaps not very well known Oracle Big Data features – Hadoop for ODI, Hadoop for OBIEE and using GoldenGate with Hadoop. I’ll try to write more about those soon.

Meanwhile, let me know what you think of In-Memory, Perfect Balance and OXH.

My Oracle OpenWorld 2013 Presentations

Oracle OpenWorld was fantastic, as usual. The best show in San Francisco. This is the seventh year in a row that I’m attending – 3 times as an HP employee, 3 times as a Pythian employee, and now as a Clouderan. My life changes, but the event and people are always fantastic.

There will be a separate blog post about what I learned at the event, the new exciting products and my thoughts on them. But first, let me follow up on what I taught.

On Sunday afternoon, and then again on Thursday afternoon, I presented “Data Wrangling with Oracle Connectors for Hadoop”. I presented it twice because both Oracle and IOUG liked my abstract. I was surprised to discover that neither audience had any idea what “Data Wrangling” is! I appreciate the attendees: they trusted me enough to attend without even being sure what I was planning to talk about. In both sessions I had people come up with excellent questions, mentioning that they are current or future Cloudera customers. I absolutely loved it. What a great opportunity to connect with Hadoopers from all industries.

You can find the slides here: Data Wrangling with Oracle Connectors for Hadoop

On Monday, at OakTable World, I presented ETL on Hadoop. I presented it at Surge earlier this year, but this time I think I misjudged the fit of the content to the audience – I gave pretty technical tips on how to implement ETL on Hadoop to an audience with very little experience with Hadoop. They were smart people and mostly followed along, but I should have kept my content at a more introductory level.

You can find the slides here: Scaling ETL with Hadoop

On Wednesday, I was fortunate to present with my former colleague Marc Fielding on SSDs and their use in Exadata. The topic is not very Hadoop related, but I love SSDs regardless. Presenting with Marc was fun, and the audience was highly engaged. I did get a lot of questions on SSDs and Hadoop, so I’ll consider writing about the topic in the future.

Marc has the latest version of the slides, but you can find an approximation here: Databases in a Solid State World.

Thanks again to everyone who attended, to all the customers who stopped to say hello and to everyone who was friendly and made the event fun. I hope to see you again next year.

See Me at Oracle OpenWorld 2013

I’ll be in San Francisco next week, presenting about Hadoop and Big Data at the biggest conference. If you want to say “hi”, you can attend one of my sessions:

• Big Data Panel Discussion (Sunday, 8am at Moscone West room 3003)
• Data Wrangling with Oracle Big Data Connectors (Sunday 3:30 pm at Moscone West room 3003)
• Women in Technology roundtable (Monday 10 am at Oak Table World)
• ETL on Hadoop (Monday 11 am at Oak Table World – I presented the same content at Surge last week, you can peek at the slides).
• It’s a Solid-State World: How Oracle Exadata X3 Leverages Flash Storage (Wednesday 3:30 pm at Westin San Francisco, Metropolitan I –  With Marc Fielding)
• Data Wrangling with Oracle Big Data Connectors (Thursday 2 pm at Moscone South room 300)

I also plan to attend a few sessions by other people:

• Monday 12pm – Oracle Database 12c for Data Warehousing and Big Data [CON8710]
• Monday 6:30pm – Oracle’s Big Data Solutions: NoSQL, Connectors, R, and Appliance Technologies [BOF11057]
• Tuesday 3:30pm – Big Data Deep Dive: Oracle Big Data Appliance [CON8646]
• Wednesday 10am – In-Database MapReduce for DBAs and Database Developers Using SQL or Hadoop [CON8601]
• Thursday 12:30 – Hadoop Your ETL: Using Big Data Technologies to Enhance ‘s Data Warehouses [CON8732]

And a few social events: Oracle’s ACE dinner, Friends of Pythian dinner, Blogger meetup, OTN event and possibly a few more.

You may also find me helping Oracle and Cloudera demonstrate the Big Data Appliance at the Engineered Systems demo booth.

See you there!

Hadoop Summit 2013 – The Schedule

90% of winning is planning. I learned this as a kid from my dad, and I validated this through many years of work in operations. This applies to everything in life, including conferences.

So in order to maximize fun, networking and learning at Hadoop Summit, I’m planning my schedule in advance. Even if only a few hours in advance. It’s the thought that counts.

In addition to social activities such as catching up with my former colleagues from Pythian, dining with my distributed solutions architecture team at Cloudera and participating in the Hadoop Summit bike ride, I’m also planning to attend a few sessions.

There are tons of good sessions at the conference, and it was difficult to pick. It is also very possible that I’ll change my plans at the last minute based on recommendations from other attendees. For the benefit of those who would like some recommendations, or to catch up with me at the conference, here’s where you can find me:

Wednesday:

11:20am: Securing the Hadoop Ecosystem – Security is important, but I’ll admit that I’m only attending this session because I’m a fan of ATM. Don’t judge.

12pm: LinkedIn Member Segmentation Platform: A Big Data Application – LinkedIn is integrating Hive and Pig with Teradata. Just the type of use case I’m interested in, from my favorite social media company.

2:05pm: How One Company Offloaded Data Warehouse ETL To Hadoop and Saved $30 Million – I’m a huge believer in offloading ETL to Hadoop and I know companies who saved big bucks that way. But $30M is more than even I’d expect, so I have to hear this story.

2:55pm: HDFS – What is New and Future – Everything in Hadoop relies on HDFS, so keeping up to date with new features is critical for Hadoop professionals. This should be a packed room.

4:05pm: Parquet: Columnar storage for the People – Parquet is a columnar data store for Hadoop. I’m interested in learning more about Parquet, as it should enable a smoother transition of data warehouse workloads to Hadoop.

Thursday:

11:50am: Mahout and Scalable Natural Language Processing – This session promises so much data science content in one hour that I’m a bit worried that my expectations are too high. Hope it doesn’t disappoint.

2:30pm: Video Analysis in Hadoop – A Case Study – My former colleagues at Pythian are presenting a unique Hadoop use-case they developed. I have to be there.

3:35pm: Large scale near real-time log indexing with Flume and SolrCloud – I’m encountering this type of architecture quite often now. It will be interesting to hear how Cisco is doing it.

5:15pm: Building a geospatial processing pipeline using Hadoop and HBase and how Monsanto is using it to help farmers increase their yield – Using Hadoop to analyze large amounts of geospatial data and help farmers. Sounds interesting. Also, Robert is a customer and a very sharp software architect, so worth attending.

Expect me to tweet every 5 minutes from one of the sessions. It’s my way of taking notes and sharing knowledge. If you are at the conference and want to get together, ping me through here or Twitter. Also ping me if you think I’m missing a not-to-be-missed session.

Beginnings

“A beginning is the time for taking the most delicate care that the balances are correct.”

It is spring. Time for planting new seeds. I started a new job last week, and it seems that a few of my friends and former colleagues are on their way to new adventures as well. I’m especially excited because I’m starting not just a new job – I will be working on a new product, far younger than Oracle and even MySQL. I am also making my first tiny steps in the open-source community, something I’ve been looking to do for a while.

I’m itching to share lessons I’ve learned in my previous job, three challenging and rewarding years as a consultant. The time will arrive for those, but now is the time to share what I know about starting new jobs. Lessons that I need to recall, and that my friends who are also in the process of starting a new job may want to hear.

Say hello
I’m usually a very friendly person, and after years of attending conferences I’m very comfortable talking to people I’ve never met before. But still, Cloudera has around 200 people in the Bay Area offices, which means that I had to say “Hello, I’m Gwen Shapira, the new Solutions Architect, who are you?” around 200 times. This is not the most comfortable feeling in the world. It’s important to go through the majority of the introductions in the first week or two; later on it becomes a bit more awkward. So in the first week it will certainly seem like you are doing nothing except meeting people, chatting a bit and frantically memorizing names and faces. This is perfectly OK.

Get comfortable being unproductive
The first week in a new job feels remarkably unproductive. This is normal. I’m getting to know people, processes, culture, about 20 new products and 40 new APIs. I have incredibly high expectations of myself, and naturally I’m not as fast installing a Hadoop cluster as I am installing a RAC cluster. It takes me far longer to write Python code than it does to write SQL. My expectations create a lot of pressure: I internally yell at myself for taking an hour or so to load data into Hive when it “should” have taken 5 minutes. But of course, I don’t know how long it “should” take, since I’ve done it very few times before. I’m learning, and while learning has its own pace, it is an investment and therefore productive.

Have lunch, share drinks
The best way to learn about culture is from people, and the best way to learn about products is from the developers who wrote them and are passionate about how they are used. Conversations at lunch time are better than tackling people in the corridor or interrupting them at their desk. Inviting people for drinks is also a great way to learn about a product. Going to someone’s cube and asking for an in-depth explanation of Hive architecture can be seen as entitled and bothersome. Sending email to the internal Hive mailing list and saying “I’ll buy beer for anyone who can explain Hive architecture to me” will result in a fun evening.

If it’s not overwhelming, you may be in the wrong job
I’m overwhelmed right now. So many new things to learn. First there are the Hadoop ecosystem products. I know some but far from all of them, and I feel that I need to learn everything in days. Then there is programming. I can code, but I’m not and never have been a proficient programmer. My colleagues are sending out patches left and right. It also seems like everyone around me is a machine learning expert. When did they learn all this? I feel like I will never catch up.

And that is exactly how I like it.

Make as many mistakes as possible
You can learn faster by doing, and you can do faster if you are not afraid of failing and making mistakes. Mistakes are more understandable and forgivable when you are new. I suggest using this window of opportunity to accelerate your learning by trying to do as much as possible. When you make a mistake, smile and say “Sorry about that. I’m still new. Now I know what I should and shouldn’t do.”

Take notes
When you are new, a lot of things will look stupid. Sometimes that is just because they are very different from the way you were used to doing things in a previous job. Don’t give in to the temptation to criticise everything, because you will look like a whiner. No one likes a whiner. But take note of these things, because you will get used to them soon and never see things with a “beginner’s mind” again. In a few months, take a look at your list. If things still look stupid, it will be time to take on a project or two to fix them.

Contribute
I may be new at this specific job, but I still have a lot to contribute. I try hard to look for opportunities and I keep finding out that I’m more useful than I thought. I participate in discussions in internal mailing lists, I make suggestions, I help colleagues solve problems. I participate in interviews and file tickets when our products don’t work as expected. I don’t wait to be handed work or to be sent to a customer, I look for places where I can be of use.

I don’t change jobs often, so it’s quite possible that I don’t know everything there is to know about starting a new job. If you have tips and suggestions to share with me and my readers, please comment!

Environment Variables in Grid Control User Defined Metrics

This post originally appeared at the Pythian blog.

Emerson wrote: “A foolish consistency is the hobgoblin of little minds”. I love this quote, because it allows me to announce a presentation titled “7 Sins of Concurrency” and then show up with only 5. There are places where consistency is indeed foolish, while other times I wish for more consistency.

Here is a nice story that illustrates both types of consistency, or the lack of it.

This customer had Grid Control installed in their environment. We were asked to configure all kinds of metrics and monitors for several databases, and we decided to use Grid Control for this. One of the things we decided to monitor was the success of the backup jobs.

Like many others, this customer runs their backup jobs from cron, and the cron job generates an RMAN logfile. I thought that a monitor that checks the logfile for RMAN- and ORA- errors would be just the thing we need.

To be consistent, I could have moved the backup jobs to run from the Grid Control scheduler instead of cron. But in my opinion, this would have been foolish consistency – why risk breaking perfectly good backups? Why divert my attention from the monitoring project to take on side improvements?

To have Grid Control check the log files, I decided to use an OS UDM: that’s a “User Defined Metric” that is defined on “host” targets and allows you to run a script on the server. I wrote a very small shell script that finds the latest log, greps for errors and counts them. The script returns the error count to Grid Control. More than 0 errors is a critical status for the monitor. I followed the instructions in the documentation to the letter – and indeed, everything worked easily. Hurray!
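The script itself was shell and is not reproduced here. As a rough illustration of the same logic (find the latest log, look for errors, count them), here is a Python sketch. The function name, the log directory argument and the `*.log` pattern are assumptions, and the step that hands the count back to Grid Control is left out.

```python
import glob
import os
import re

# Lines mentioning RMAN- or ORA- are treated as errors
ERROR_PATTERN = re.compile(r"RMAN-|ORA-")

def count_backup_errors(log_dir, pattern="*.log"):
    """Count lines containing RMAN- or ORA- in the most recently modified log file."""
    logs = glob.glob(os.path.join(log_dir, pattern))
    if not logs:
        return None  # no log found; the caller decides how to report this
    latest = max(logs, key=os.path.getmtime)
    with open(latest, errors="replace") as f:
        return sum(1 for line in f if ERROR_PATTERN.search(line))
```

A missing log is returned as `None` rather than 0 so that “the backup never ran” is not mistaken for “the backup ran without errors”.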

Wait. There’s a catch (and a reason for this blog post). I actually had two instances that are backed up, and therefore two logs to check. I wanted to use the same script and just change the ORACLE_SID in the environment.

No worries. The UI has a field called “Environment” and the documentation says: “Enter any environmental variable(s) required to run the user-defined script.”

One could imagine, based on the field name and the documentation, that if I type: “ORACLE_SID=mysid” in this field, and later run “echo $ORACLE_SID” in my script, the result would be “mysid”.

Wrong. What does happen is that $ORACLE_SID is empty. $1, on the other hand, is “{ORACLE_SID=mysid}”.

To get the environment variable I wanted, I had to do: tmp=(`echo $1 | tr '{}' '  '`); eval $tmp
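If the metric script were written in Python instead of shell, the same unpacking might look like the sketch below. It assumes the argument always arrives as a single {NAME=value} pair, as in the case above; how several variables would be passed is not something I checked, and the function name is made up.

```python
def parse_udm_environment(arg):
    """Turn a string like '{ORACLE_SID=mysid}' into a (name, value) pair."""
    inner = arg.strip().strip("{}")          # drop the surrounding braces
    name, sep, value = inner.partition("=")  # split on the first '=' only
    if not sep or not name:
        raise ValueError(f"expected '{{NAME=value}}', got {arg!r}")
    return name, value

name, value = parse_udm_environment("{ORACLE_SID=mysid}")
# name == "ORACLE_SID", value == "mysid"
```

In a real script the argument would come from `sys.argv[1]`, and the pair could then be placed in `os.environ` before calling anything that expects ORACLE_SID to be set.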

It took me a while to figure this out as this behavior is not documented and I found no usage examples that refer to this issue.

Consistency between your product, the UI and the documentation is not foolish consistency. I expect the documentation and field descriptions to help me do my job, and I’m annoyed when they don’t.

At least now this behavior is documented somewhere, so future googlers may have an easier time.