On Error Messages

Here’s a pet peeve of mine: customers who don’t read error messages. The usual symptom is a belief that there is just one error, “Doesn’t work”, and that all forms of “doesn’t work” are the same. So if you tried something, got an error, changed something, and you are still getting an error, then nothing changed.

I hope everyone who reads this blog understands why this behavior makes troubleshooting nearly impossible, so I won’t bother to explain why I find it so annoying and so self-defeating. Instead, I’ll explain what we, as developers, can do to improve the situation a bit. (OMG, did I just refer to myself as a developer? I do write code that is then used by customers, so I may as well take responsibility for it.)

Here’s what I see as the main reasons people don’t read error messages:

  1. The error message is so long that they don’t know where to start reading. Errors with multiple Java stack dumps are especially fun. Stack traces are useful only to people who look at the code, so while it’s important to capture them (for support), in most cases your users don’t need to see all that very specific information.
  2. Many different errors lead to the same message. The error message simply doesn’t indicate what the problem may be, because it can be one of many different things. I think Kerberos is the worst offender here; so many failures look identical. If this happens often enough, you tune out the error message.
  3. The error is so technical and cryptic that it gives you no clue where to start troubleshooting. “Table not Found” is clear. “Call to localhost failed on local exception” is not.

I spend a lot of time explaining to my customers “When <app X> says <this> it means that <misconfiguration> happened and you should <solution>”.

To get users to read error messages, I think error messages should be:

  1. Short. Single line or less.
  2. Clear. As much as possible, explain what went wrong in terms your users should understand.
  3. Actionable. There should be one or two actions that the user should take to either resolve the issue or gather enough information to deduce what happened.
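
To make this concrete, here’s a toy sketch (in bash, with a made-up application name and config path – not taken from any real product) of the difference between a cryptic failure and a short, clear, actionable one:

# Cryptic: "FATAL: IOException during initialization (200-line stack trace follows...)"
# Short, clear, actionable:
CONF=/etc/myapp/myapp.conf
if [ ! -r "$CONF" ]; then
    echo "ERROR: cannot read configuration file $CONF." >&2
    echo "Check that the file exists and that user $(whoami) has read permission on it." >&2
    exit 1
fi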

I think Oracle does a pretty good job of this. Every one of their errors has an ID number, a short description, an explanation and a proposed solution. See here for example: http://docs.oracle.com/cd/B28359_01/server.111/b28278/e2100.htm#ORA-02140

If we don’t make our errors short, clear and actionable – we shouldn’t be surprised when our users simply ignore them and then complain that our app is impossible to use (or worse – don’t complain, but also don’t use our app).


Parameterizing Hive Actions in Oozie Workflows

A very common request I get from my customers is to parameterize the query executed by a Hive action in their Oozie workflow.
For example, the dates used in the query depend on the result of a previous action. Or maybe they depend on something completely external to the system – the operator just decides to run the workflow on specific dates.

There are many ways to do this, including using EL expressions or capturing the output of a shell or Java action.
Here’s an example of how to pass the parameters through the command line. This assumes that whoever triggers the workflow (a human or an external system) has the correct value and just needs to pass it to the workflow so it will be used by the query.

Here’s what the query looks like:

insert into test select * from test2 where dt=${MYDATE}

MYDATE is the parameter that allows me to run the query on a different date each time. When running this query in Hive, I’d use something like “set MYDATE='2011-10-10'” before running the query. But when I run it from Oozie, I need to pass the parameter to the Hive action.
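
If you want to sanity-check the script outside of Oozie first, the Hive CLI can pass the same variable from the command line. A quick sketch (the flag assumes a reasonably recent Hive CLI; note the nested quotes, since dt is compared to a string and the quotes need to end up inside the query):

# Run hive1.hql directly, substituting ${MYDATE} with a quoted date string
hive --hivevar MYDATE="'2013-11-15'" -f hive1.hql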

Let’s assume I saved the query in a file called hive1.hql. Here’s what the Oozie workflow would look like:

<workflow-app name="cmd-param-demo" xmlns="uri:oozie:workflow:0.4">
	<start to="hive-demo"/>
	<action name="hive-demo">
		<hive xmlns="uri:oozie:hive-action:0.2">
			<job-tracker>${jobTracker}</job-tracker>
			<name-node>${nameNode}</name-node>
			<job-xml>${hiveSiteXML}</job-xml>
			<script>${dbScripts}/hive1.hql</script>
			<param>MYDATE=${MYDATE}</param>
		</hive>
		<ok to="end"/>
		<error to="kill"/>
	</action>
	<kill name="kill">
		<message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
	</kill>
	<end name="end"/>
</workflow-app>

The important line is “MYDATE=${MYDATE}”. This is where I translate an Oozie parameter into a parameter that will be used by the Hive script. Don’t forget to copy hive-site.xml and hive1.hql to HDFS! Oozie actions can run on any data node and will not read files from the local file system.
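
Staging the files could look something like this (the HDFS paths here are made up for illustration – use whatever matches your job.properties and the ${hiveSiteXML} and ${dbScripts} properties in the workflow above):

# Copy the workflow and its dependencies to HDFS (illustrative paths)
hadoop fs -mkdir -p /user/myuser/cmd-param-demo/scripts
hadoop fs -put workflow.xml /user/myuser/cmd-param-demo/
hadoop fs -put hive-site.xml /user/myuser/cmd-param-demo/
hadoop fs -put hive1.hql /user/myuser/cmd-param-demo/scripts/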

And here’s how you call Oozie with the command-line parameter:
oozie job -oozie http://myserver:11000/oozie -config ~/workflow/job.properties -run -verbose -DMYDATE='2013-11-15'

That’s it!


Three Things To Do Before Starting a Hadoop Project

I’ve spent the last 6 months helping customers design, implement and deploy successful Hadoop systems. Over time, you start seeing patterns – certain things that, if the customer gets them right before the project even starts, increase the probability that the project will finish successfully and on time.

Let’s assume you already took the most important step – you have an actual use case or a problem to solve, and you think Hadoop could be the right technology to use in this case. What now?

  1. Learn how to automate production deployments: 
    Yes, Cloudera Manager will do a lot for you. But occasionally you’ll need to run the same command on 20+ servers. That’s what large clusters are all about. So get used to the idea from the beginning. Learn how to write loops in bash, how to ssh into remote machines to run commands, how to distribute files and restart services.
    You can build your own tools (I did for years), but these days you can just pick your favorite automation tool and use that instead. My favorite is Ansible. It works exactly the way I would have written a cluster automation tool, so learning it never felt like an effort and its usage is never surprising or unintuitive to me. Others prefer Puppet, Chef or CFEngine. It doesn’t matter what you use, but when I show up at your office as your Cloudera Solutions Architect and ask you to update sysctl.conf on your 50-node cluster, I don’t want you to look surprised, alarmed, or tell me it will take a few hours. (There’s a small sketch of what I mean right after this list.)
  2. Don’t try to boil the ocean:
    Hadoop implementation is often the first chance the dev/ops teams get to do something completely new. There is a blank slate, a white sheet of paper, and you can design the perfect system – fixing every problem the old system had and building functionality you always dreamed of. Better security! Machine learning! Open source!
    I say – using Hadoop successfully is a large and challenging project. Changing organizational processes and culture toward better security and processes is a large and challenging project. Creating a data-driven organization is a huge project. Mixing them doesn’t give you three projects for the price of one. It isn’t even just three times more challenging than one project; I’d say the risk is an order of magnitude higher, and the risk of just implementing Hadoop is high enough, especially in the early stages. Which brings us to…
  3. Do a POC:
     Pay Cloudera or do it yourself. Either way, you need a POC. 
    If you start with a 12-month project, you will have to do a lot of design upfront, at a time when you have too little information. At the beginning of a large project, you won’t know for sure how the system will be used and you probably won’t know enough about Hadoop. Sure, you can call Cloudera Services and discuss the design with us, but even we can sometimes (rarely!) get things wrong. With a 12-month project you will be very deep into it before you find the limitation we completely forgot to mention.
    Be agile (really agile – a 6-month project with a daily scrum doesn’t count): solve the smallest useful problem first. Implement just a single workflow, a single statistical analysis, or parse and search data from one source. Whatever is useful for your users – do it first. Learn in the process and build from there. This will allow you to build experience and iron out issues as the system’s usefulness, load and importance grow.
  4. (Bonus tip) Get the most out of the POC: 
    Not all POCs are created equal. Sometimes the customer hires us to “prove that Hadoop can do X”. We get very specific requirements and a very short time frame, and we need to build a system that does X. I can see why customers need proof that their vendor can deliver, but this approach is of limited value, because at the end of the POC you are left with a non-production system and you don’t know any more Hadoop than you knew before. I love teaching, but when the attitude is “prove to us this works” and the requirements are inflexible, there isn’t much time for discussions and casual learning. Most of the knowledge transfer will only happen in the delivery document, which is not the same as lively discussions.
    A better POC happens when a customer brings in Cloudera to help them build their first Hadoop project. There are still time and scope constraints, but now the POC is not about “prove to us this works” but rather “help us make it work”. We work together as a team. We will brainstorm design possibilities with you and share best practices. We will teach you how to build the system, how to configure it and how to troubleshoot it. You get a chance to learn all our little tricks of development and deployment that make life easier. At the end of the POC, your team will have real Hadoop expertise, relevant to your specific system, problems, culture, data and requirements. I see this as the best investment you can make in Hadoop for your organization.
    But I may be a bit biased.
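
To make tip 1 concrete, here’s the kind of thing I mean – a minimal sketch (the host list, file names and flags are illustrative, and the Ansible syntax is the ad-hoc style from around the time of writing) of pushing a sysctl change to every node, first with plain bash and ssh, then with Ansible:

# bash + ssh version: hosts.txt holds one hostname per line
for host in $(cat hosts.txt); do
    scp sysctl.conf "${host}":/tmp/sysctl.conf
    ssh "${host}" "sudo cp /tmp/sysctl.conf /etc/sysctl.conf && sudo sysctl -p"
done

# Roughly the same thing as Ansible ad-hoc commands
ansible all -i hosts.txt --sudo -m copy -a "src=sysctl.conf dest=/etc/sysctl.conf"
ansible all -i hosts.txt --sudo -m command -a "sysctl -p"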


Big Data News from Oracle OpenWorld 2013

Only a week after Oracle OpenWorld concluded, and I already feel hopelessly behind on posting news and impressions. Behind or not, I have news to share!

The most prominent feature announced at OpenWorld is the “In-Memory Option” for Oracle Database 12c. This option is essentially a new part of the SGA that caches tables in a columnar format. It is expected to make data warehouse queries significantly faster and more efficient. I would have described the feature in more detail, but Jonathan Lewis gave a better overview in this forum discussion, so just go read his post.

Why am I excited about a feature that has nothing to do with Hadoop?

First, because I have a lot of experience with large data warehouses, so I know that big data often means large tables with only a few columns used in each query. I also know that in order to optimize these queries and avoid expensive disk reads every time a query runs, we build indexes on those columns, which makes data loading slow. The in-memory option will allow us to drop those indexes and just store the columns we need in memory.

Second, because I’m a huge fan of in-memory data warehouses, and I am happy that Oracle is now making them feasible. A few TB of memory in a large server is no longer science fiction, which means that most of your data warehouse will soon fit in memory. Fast analytics for all! And what do you do with the data that won’t fit in memory? Perhaps store it in your Hadoop cluster.

Now that I’m done being excited about the big news, let’s talk about some small news that you probably didn’t notice but should.

Oracle announced two cool new features for the Big Data Appliance. “Announced” may be a big word – Larry Ellison did not stand up on stage and talk about them. Instead, the features sneaked quietly into the last upgrade and appeared in the documentation.

Perfect Balance – If you use Hadoop as often as I do, you know how data skew can mess with query performance. You run a job with several reducers, each aggregating data for a subset of keys. Unless you took great care in partitioning your data, the data will not be evenly distributed between the reducers, usually because it wasn’t evenly distributed between the keys. As a result, you will spend 50% of the time waiting for that one last reducer to finish already.

Oracle’s Perfect Balance makes the “took great care in partitioning your data” part much, much easier. This blog post is just a quick overview, not an in-depth post, so I won’t go into details of how it works (wait for my next post on this topic!). I’ll just mention that Perfect Balance can be used without any change to the application code, so if you are using the BDA, there is no excuse not to use it.

And no excuse to play solitaire while waiting for the last reducer to finish.

Oracle XQuery for Hadoop – Announced but not officially released yet, which is why I’m pointing you at an Amis blog post. For now that’s the best source of information about this feature. This feature, combined with the existing Oracle Loader for Hadoop, will allow running XQuery operations on XMLs stored in Hadoop, pushing the entire data processing down to MapReduce on the Hadoop cluster. Anyone who knows how slow, painful and CPU-intensive XML processing can be on an Oracle database server will appreciate this feature. I wish I had it a year ago when I had to ingest XMLs at a very high rate. It is also so cool that I’m a bit sorry we never developed more awesome XQuery capabilities for Hive and Impala. Can’t wait for the release so I can try it!

During OpenWorld there was also additional exposure for existing, but perhaps not very well known, Oracle Big Data features – Hadoop for ODI, Hadoop for OBIEE and using GoldenGate with Hadoop. I’ll try to write more about those soon.

Meanwhile, let me know what you think of In-Memory, Perfect Balance and OXH.


My Oracle OpenWorld 2013 Presentations

Oracle OpenWorld was fantastic, as usual. The best show in San Francisco. This is the seventh year in a row that I’m attending – 3 times as an HP employee, 3 times as a Pythian employee, and now as a Clouderan. My life changes, but the event and the people are always fantastic.

There will be a separate blog post about what I learned at the event, the new exciting products and my thoughts on them. But first, let me follow up on what I taught.

On Sunday afternoon, and then again on Thursday afternoon, I presented “Data Wrangling with Oracle Connectors for Hadoop”. I presented it twice because both Oracle and IOUG liked my abstract. I was surprised to discover that both audiences had no idea what “data wrangling” is! I appreciate the attendees – they trusted me enough to attend without even being sure what I was planning to talk about. In both sessions I had people come up with excellent questions and mention that they are current or future Cloudera customers. I absolutely loved it – what a great opportunity to connect with Hadoopers from all industries.

You can find the slides here: Data Wrangling with Oracle Connectors for Hadoop

On Monday, at OakTable World, I presented ETL on Hadoop. I had presented it at Surge earlier this year, but this time I think I misjudged the fit of the content to the audience – I gave pretty technical tips on how to implement ETL on Hadoop to an audience with very little Hadoop experience. They were smart people and mostly followed along, but I should have kept my content at a more introductory level.

You can find the slides here: Scaling ETL with Hadoop

On Wednesday, I was fortunate to present with my former colleague Marc Fielding on SSDs and their use in Exadata. The topic is not very Hadoop-related, but I love SSDs regardless; presenting with Marc was fun and the audience was highly engaged. I did get a lot of questions on SSDs and Hadoop, so I’ll consider writing about the topic in the future.

Marc has the latest version of the slides, but you can find an approximation here: Databases in a Solid State World.

Thanks again to everyone who attended, to all the customers who stopped to say hello and to everyone who was friendly and made the event fun. I hope to see you again next year.


See Me at Oracle OpenWorld 2013

I’ll be in San Francisco next week, presenting about Hadoop and Big Data at the biggest conference. If you want to say “hi”, you can attend one of my sessions:

  • Big Data Panel Discussion (Sunday, 8am at Moscone West room 3003)
  • Data Wrangling with Oracle Big Data Connectors (Sunday 3:30 pm at Moscone West room 3003)
  • Women in Technology roundtable (Monday 10 am at Oak Table World )
  • ETL on Hadoop (Monday 11 am at Oak Table World – I presented the same content at Surge last week, you can peek at the slides).
  • It’s a Solid-State World: How Oracle Exadata X3 Leverages Flash Storage (Wednesday 3:30 pm at Westin San Francisco, Metropolitan I –  With Marc Fielding)
  • Data Wrangling with Oracle Big Data Connectors (Thursday 2 pm at Moscone South room 300)

I also plan to attend a few sessions by other people:

  • Monday 12am – Oracle Database 12c for Data Warehousing and Big Data [CON8710]
  • Monday 6:30pm – Oracle’s Big Data Solutions: NoSQL, Connectors, R, and Appliance Technologies [BOF11057]
  • Tuesday 3:30pm – Big Data Deep Dive: Oracle Big Data Appliance [CON8646]
  • Wednesday 10am – In-Database MapReduce for DBAs and Database Developers Using SQL or Hadoop [CON8601]
  • Thursday 12:30 – Hadoop Your ETL: Using Big Data Technologies to Enhance ‘s Data Warehouses [CON8732]

And a few social events: Oracle’s ACE dinner, the Friends of Pythian dinner, the blogger meetup, the OTN event and possibly a few more.

You may also find me helping Oracle and Cloudera demonstrate the Big Data Appliance at the Engineered Systems demo booth.

See you there!


On the difficulties of Migrations – Especially to new Blogs

I haven’t posted here in a long while. That’s because I’ve been posting all my stories and ideas over at the Pythian blog.

I knew that migrations are one of the most difficult tasks in IT operations, but I did not realize this also applies to blogs. Yesterday, Alex helped me look at the statistics over at the Pythian blog, and it turns out that over there I have about 10% of the readers I had over here. While I’m just as brilliant on the Pythian blog as I was here, I guess that with all the old links, Google rankings and people not changing their RSS subscriptions, blog locations have a lot more momentum than I suspected.

Anyway, to the 90% of my readers who apparently only read me at this address: in the next few days I’ll copy over the blog posts that I neglected to post here. I’ll try to post new articles here in the future, but they will always appear on the Pythian blog first, so you really should add my new address to whatever it is you use to follow blogs.


VMWare Hires Redis Key Developer – But Why?

My friend MosheZ alerted me to the fact (which a few hours later appeared all over the net) that VMWare hired the key developer of Redis – which is as close to an acquisition as you can get with an open source project.

What is Redis? Redis is yet another NoSQL store: a key-value store, somewhat similar to Tokyo Cabinet, except that Redis does persistence differently, which makes it faster in many cases. Redis started as a Memcached replacement, so a lot of the documentation describes it as follows: Redis is like Memcached, except it supports more data types, it is persistent to some degree, and it is not distributed.
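
For readers who haven’t played with it, here’s a tiny illustration of the “more data types” part, using redis-cli from the shell (the key names are made up; the commands are from the core Redis command set):

# Plain key-value, Memcached-style
redis-cli SET page:home:hits 42
redis-cli INCR page:home:hits

# A list used as a simple work queue - something Memcached can't do
redis-cli LPUSH jobs "resize-image-17"
redis-cli LPUSH jobs "resize-image-18"
redis-cli RPOP jobs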

But the more interesting question is – Why does VMWare need Redis?

VMWare says: “As VMware continues its investments in the context of cloud computing, technologies such as Redis become key for future cloud based apps, whether private or public cloud, and the cloud infrastructure itself.”

So Redis is cloud and VMWare is a major cloud player, therefore VMWare needs Redis. Two discrepancies stand out in this story:

  1. Redis is not a distributed system. Unlike Cassandra, where you can scale by quickly adding more Cassandras to the party, Redis is just one (very fast) server, supporting only master-slave replication. VMWare is all about adding new machines quickly. Something doesn’t fit.
  2. While key-value stores are cloudy and VMWare is cloudy, there is no natural match between their cloudiness. VMWare itself can’t use Redis – because Redis technology is a natural match for big-data websites, which VMWare clearly isn’t. Some VMWare customers can benefit from Redis, but most can’t. What’s going on here?

Clearly, the place to look is not in existing value but in the future. So here are my predictions:

  1. Redis will become distributed. It can certainly be done. Perhaps it can even be done without losing its performance edge.
  2. VMware will announce an Amazon-like, cloud-for-rent service. They have the technology for this, and Redis will help them manage the “huge website” part of it.
  3. They may also offer Redis on top of the virtual servers, as something built in. Like Amazon’s Oracle servers.
  4. VMWare can also offer storage for rent. They can do it with EMC storage (since VMWare is an EMC company), but I’m betting that they’ll do it with NetApp – their favorite cloud partner. I can totally imagine a near-future NetApp-VMWare offering that is similar to Amazon’s EC2 + S3 + AWS.

Predicting is very difficult (especially about the future) and I’m very much ready to regret ever posting my daydreams in public, but these are exciting possibilities. I wonder if they make sense to anyone else.

*********************************

And speaking of MosheZ, he is a prolific songwriter, and he wrote a song about DBAs! I’m thinking of performing it live during one of my presentations. Actually, I’m thinking of writing a presentation called “How to win arguments or influence users” just to have an excuse to sneak this song in 🙂


Lessons From OOW09 #1 – Shell Script Tips

During OpenWorld I went to a session about shell scripting. The speaker, Ray Smith, was excellent: clear, well paced, educational and entertaining.

His presentation was based on the book “The Art of Unix Programming” by one Eric Raymond. He recommended reading it, and I may end up doing that.

The idea is that shell scripts should obey two important rules:

  1. Shell scripts must work
  2. Shell scripts must keep working (even when Oracle takes BDUMP away).

Hard to object to that 🙂

Here’s some of his advice on how to achieve these goals (he had many more tips; these are just the ones I found non-trivial and potentially useful. My comments are in italics).

  1. Document dead ends, the things you tried that did not work, so that the next person to maintain the code won’t try them again.
  2. Document the script purpose in the script header, as well as the input arguments
  3. Be kind – try to make the script easy to read. Use indentation. It’s 2009; I’m horrified that “please indent” is still a relevant tip.
  4. Clean up temporary files you will use before trying to use them:

    function CleanUpFiles {
        # Remove leftovers from previous runs, but only if the variables are actually set
        [ -n "${LOGFILE}" ] && rm -f "${LOGFILE}"
        [ -n "${SPOOLFILE}" ] && rm -f "${SPOOLFILE}"
    }
  5. Revisit old scripts. Even if they work. Technology changes. This one is very controversial – do we really need to keep chasing the latest technology?
  6. Be nice to the users by working with them – verify before taking actions and keep the user informed of what the script is doing at all times. OPatch is a great example.
  7. Error messages should explain errors and advise how to fix them
  8. The same script can work both interactively and from cron by checking whether it is attached to a terminal: if tty -s; then … fi
  9. When sending email notifications of success or failure, be complete. Say which host, which job, what happened, how to troubleshoot, and when the next run is (or what the schedule is).
  10. Dialog/Zenity – tools that let you easily create cool dialog screens
  11. Never hardcode passwords, hostnames, DB names or paths. Use ORATAB, command-line arguments or parameter files. I felt like clapping here. This is so obvious, yet we are now running a major project to modify all our scripts to work like that.
  12. Be consistent – try to use the same scripts whenever possible and limit editing permissions.
  13. Use version control for your scripts. Getting our team to use version control was one of my major projects this year.
  14. Subversion has HTTP access, so the internal knowledge base can point at the scripts. Wish I knew that last year.
  15. Use a deployment control tool like CFEngine. I should definitely check this one out.
  16. Use getopts for parameters. Getopts looked too complicated when I first checked it out, but I should give it another try (there’s a small skeleton after this list).
  17. Create everything you need every time you need it. Don’t fail just because a directory does not exist. Output what you just did.
  18. You can have common data files with things like host lists or DB lists that are collected automatically on a regular basis and that you can then reference in your scripts.
  19. You can put comments and descriptions in ORATAB
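
Since getopts is what scared me off last time, here is the kind of minimal skeleton (option names are made up for illustration) that makes it look a lot less intimidating:

#!/bin/bash
# Minimal getopts skeleton: -s <ORACLE_SID> is required, -v turns on verbose output
usage() { echo "Usage: $0 -s ORACLE_SID [-v]" >&2; exit 1; }

VERBOSE=0
while getopts "s:v" opt; do
    case "$opt" in
        s) ORACLE_SID="$OPTARG" ;;
        v) VERBOSE=1 ;;
        *) usage ;;
    esac
done

[ -z "$ORACLE_SID" ] && usage
echo "Running against ${ORACLE_SID} (verbose=${VERBOSE})"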

Visualization Session – The Slides

The “Visualization Session” at the OOW Unconference was great. Thanks to everyone who showed up for the lively discussion. It was probably the most fun I’ve ever had at a presentation.
Also, thanks to the fine folks whom I later met at the OTN lounge, who explained that they wanted to attend my presentation but the OTN lounge had free beer and I did not. I’ll see what I can do about the beer next year.

For those who missed the presentation, whether due to beer or to distance from OpenWorld, you can get my slides here. As usual for my presentations, I’m not sure if my slides are meaningful without me standing next to them. It is just a bunch of graphs without the stories. If you really want to hear the stories, you can invite me to speak at your user group 🙂