Three Things To Do Before Starting Hadoop Project

I spent the last 6 months helping customers design, implement and deploy successful Hadoop systems. Over time, you start seeing patterns: certain things that, if the customer gets them right before the project even starts, increase the probability that the project will finish successfully and on time.

Let’s assume you already took the most important step: you have an actual use case or a problem to solve, and you think Hadoop could be the right technology for it. What now?

  1. Learn how to automate production deployments: 
    Yes, Cloudera Manager will do a lot for you. But occasionally you’ll need to run the same command on 20+ servers. That’s what large clusters are all about, so get used to the idea from the beginning. Learn how to write loops in bash, how to ssh into remote machines to run commands, how to distribute files and restart services.
    You can build your own tools (I did for years), but these days you can just pick your favorite automation tool and use that instead. My favorite is Ansible. It works exactly the way I would have written a cluster automation tool, so learning it never felt like an effort and its usage is never surprising or unintuitive to me. Others prefer Puppet, Chef or CFEngine. It doesn’t matter what you use, but when I show up at your office as your Cloudera Solutions Architect and ask you to update sysctl.conf on your 50-node cluster, I don’t want you to look surprised or alarmed, or to tell me it will take a few hours.
  2. Don’t try to boil the ocean:
    A Hadoop implementation is often the first chance the dev/ops teams get to do something completely new. There is a blank slate, a white sheet of paper, and you can design the perfect system, fixing every problem the old system had and building functionality you always dreamed of. Better security! Machine learning! Open source!
    I say: using Hadoop successfully is a large and challenging project. Changing organizational processes and culture toward better security is a large and challenging project. Creating a data-driven organization is a huge project. Mixing them doesn’t give you three projects for the price of one. It isn’t even just three times more challenging than one project; I’d say the risk is an order of magnitude higher, and the risk of just implementing Hadoop is high enough, especially in the early stages. Which brings us to…
  3. Do a POC:
    Pay Cloudera or do it yourself. Either way, you need a POC.
    If you start with a 12-month project, you will have to do a lot of design upfront, at a time when you have too little information. At the beginning of a large project, you won’t know for sure how the system will be used and you probably won’t know enough about Hadoop. Sure, you can call Cloudera Services and discuss the design with us, but even we can sometimes (rarely!) get things wrong. With a 12-month project you will be very deep in the project before you find that limitation we completely forgot to mention.
    Be agile (really agile; a 6-month project with a daily scrum doesn’t count): solve the smallest useful problem first. Implement just a single workflow or a single statistical analysis, or parse and search data from one source. Whatever is useful for your users, do it first. Learn in the process and build from there. This will allow you to build experience and iron out issues as the system’s usefulness, load and importance grow.
  4. (Bonus tip) Get the most out of the POC: 
    Not all POCs are created equal. Sometimes the customer hires us to “prove that Hadoop can do X”. We get very specific requirements and a very short time frame, and we need to build a system that does X. I can see why customers need proof that their vendor can deliver. But this approach is of limited value, because at the end of the POC you are left with a non-production system and you don’t know more about Hadoop than you did before. I love teaching, but when the attitude is “prove to us that this works” and the requirements are inflexible, there isn’t much time for discussions and casual learning. Most of the knowledge transfer will happen only in the delivery document, which is not the same as lively discussions.
    A better POC happens when a customer brings in Cloudera to help them build their first Hadoop project. There are still time and scope constraints, but now the POC is not about “prove to us that this works” but rather “help us make it work”. We work together as a team. We will brainstorm design possibilities with you and share best practices. We will teach you how to build the system, how to configure it and how to troubleshoot it. You get a chance to learn all our little tricks of development and deployment that make life easier. At the end of the POC, your team will have real Hadoop expertise, relevant to your specific system, problems, culture, data and requirements. I see this as the best investment you can make in Hadoop for your organization.
    But I may be a bit biased.
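
Going back to tip 1: the “same command on 20+ servers” loop can be sketched in a few lines. This is a minimal illustration in Python rather than bash, and it assumes passwordless ssh to every node is already set up. The function names (`build_ssh_command`, `run_on_hosts`, `failed_hosts`) and the host names in the usage note are made up for this example; they are not part of Cloudera Manager, Ansible or any other tool, and a real automation tool does far more than this.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def build_ssh_command(host, remote_command):
    """Build the argv for running one command on one host over ssh."""
    # BatchMode=yes makes ssh fail instead of stopping to ask for a password,
    # so one badly configured node cannot hang the whole run.
    return ["ssh", "-o", "BatchMode=yes", host, remote_command]


def run_local(argv):
    """Run argv locally and return (exit code, combined stdout and stderr)."""
    completed = subprocess.run(argv, capture_output=True, text=True)
    return completed.returncode, completed.stdout + completed.stderr


def run_on_hosts(hosts, remote_command, runner=run_local, max_workers=10):
    """Run the same command on every host in parallel.

    Returns a dict mapping host -> (exit code, output), in the order of hosts.
    The runner argument exists so the function can be tried without real ssh.
    """
    def run_one(host):
        return host, runner(build_ssh_command(host, remote_command))

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(run_one, hosts))


def failed_hosts(results):
    """Return the sorted list of hosts whose command exited with non-zero."""
    return sorted(host for host, (code, _) in results.items() if code != 0)
```

With something like this, reloading kernel settings on a hypothetical 50-node cluster becomes `results = run_on_hosts(["node%02d.example.com" % i for i in range(1, 51)], "sudo sysctl -p")`, followed by `failed_hosts(results)` to see which nodes need attention. Whether you use a script like this or Ansible matters much less than having a routine answer ready.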

 


2 Comments on “Three Things To Do Before Starting Hadoop Project”

  1. Ofir Manor says:

    Great stuff Chen!
    I agree with all points – automation and incremental value are very important.
    Regarding the POC: in a way, you describe building a prototype together instead of a POC. That is almost always better, but the poor customer has to pick a vendor first… A bit of a chicken-and-egg problem…
    I think that an aggressive competitive POC might be great for well-understood domains, like picking the best NAS storage for enterprise environments. For messy, cutting-edge platforms, pick one of the leaders that you trust (and that has a proven track record of delivery in your region and maybe your vertical) and work together.

  2. public class OceanInputFormat extends Everything

    The POC phase is where we are right now, and it is what we are having the toughest time with. The problem is getting business input, since they are obviously the stakeholders who will pay for the whole thing. However, when we talk to them about doing a POC, the usual response is something akin to “Oh wow, can we get a report that tells us our student sign-ups each week?” Simple stuff we could do with our standard reporting or even a SQL script.

    So prior to doing a POC, we’re having to change people’s perception of what a ‘business problem’ can be. I think people get so used to hearing “we can’t do that” that they forget all the awesome ideas and questions they had when they first came on board.

    So despite the prevalence of “Big Data for Business” type books on everyone’s desk, right now we’re focused on doing small projects with IT data…which unfortunately no one cares enough about to pay for.

