Kafka or Flume?

Posted: February 11, 2014 | Author: prodlife | Filed under: Hadoop, tips | Tags: flume, ingest, kafka | Leave a comment

A question that keeps popping up is “Should we use Kafka or Flume to load data to Hadoop clusters?”

This question implies that Kafka and Flume are interchangeable components. It makes as much sense to me as “Should we use cars or umbrellas?”. Sure, you can hide from the rain in your car and you can use your umbrella when moving from place to place. But in general, these are different tools intended for different use-cases.

Flume’s main use-case is to ingest data into Hadoop. It is tightly integrated with Hadoop’s monitoring system, file system, file formats, and utilities such a Morphlines. A lot of the Flume development effort goes into maintaining compatibility with Hadoop. Sure, Flume’s design of sources, sinks and channels mean that it can be used to move data between other systems flexibly, but the important feature is its Hadoop integration.

Kafka’s main use-case is a distributed publish-subscribe messaging system. Most of the development effort is involved with allowing subscribers to read exactly the messages they are interested in, and in making sure the distributed system is scalable and reliable under many different conditions. It was not written to stream data specifically for Hadoop, and using it to read and write data to Hadoop is significantly more challenging than it is in Flume.

To summarize:
Use Flume if you have an non-relational data sources such as log files that you want to stream into Hadoop.
Use Kafka if you need a highly reliable and scalable enterprise messaging system to connect many multiple systems, one of which is Hadoop.

Beginnings

Posted: May 20, 2013 | Author: prodlife | Filed under: musing, tips | 4 Comments

“A beginning is the time for taking the most delicate care that the balances are correct.”

It is spring. Time for planting new seeds. I started on a new job last week, and it seems that few of my friends and former colleagues are on their way to new adventures as well. I’m especially excited because I’m starting not just a new job – I will be working on a new product, far younger than Oracle and even MySQL. I am also making first tiny steps in the open-source community, something I’ve been looking to do for a while.

I’m itching to share lessons I’ve learned in my previous job, three challenging and rewarding years as a consultant. The time will arrive for those, but now is the time to share what I know about starting new jobs. Lessons that I need to recall, and that my friends who are also in the process of starting a new job may want to hear.

Say hello
I’m usually a very friendly person and after years of attending conferences I’m very comfortable talking to people I’ve never met before. But still, Cloudera has around 200 people in the bay area offices, which means that I had to say “Hello, I’m Gwen Shapira the new Solutions Architect, who are you?” around 200 times. This is not the most comfortable feeling in the world. Its important to go through the majority of the introductions in the first week or two, later on it becomes a bit more awkward. So in the first week it will certainly seem like you are doing nothing except meeting people, chatting a bit and franctically memorizing names and faces. This is perfectly OK.

Get comfortable being unproductive
The first week in a new job feels remarkably unproductive. This is normal. I’m getting to know people, processes, culture, about 20 new products and 40 new APIs. I have incredibly high expectations of myself, and naturaly I’m not as fast installing Hadoop cluster as I am installing RAC cluster. It takes me far longer to write Python code than it does to write SQL. My expectations create a lot of pressure, I internally yell at myself for taking an hour or so to load data into Hive when it “should” have taken 5 minutes. But of course, I don’t know how long it “should” take, I did it very few times before. I’m learning and while learning has its own pace, it is an investment and therefore productive.

Have lunch, share drinks
The best way to learn about culture is from people, and the best way to learn about products is from the developers who wrote them and are passionate about how they are used. Conversations at lunch time are better than tackling people in the corridor or interrupting them at their desk. Inviting people for drinks are also a great way to learn about a product. Going to someones cube and asking for an in-depth explanation of Hive architecture can be seen as entitled and bothersome. Sending email to the internal Hive mailing list and saying “I’ll buy beer to anyone who can explain Hive architecture to me” will result in a fun evening.

If its not overwhelming, you may be in the wrong job
I’m overwhelmed right now. So many new things to learn. First there are the Hadoop ecosystem products, I know some but far from all of them, and I feel that I need to learn everything in days. Then there is programming. I can code, but I’m not and never have been a proficient programmer. My colleagues are sending out patches left and right. It also seems like everyone around me is a machine learning expert. When did they learn all this? I feel like I will never catch up.

And that is exactly how I like it.

Make as many mistakes as possible
You can learn faster by doing, and you can do faster if you are not afraid of failing and making mistakes. Mistakes are more understandable and forgivable when you are new. I suggest using this window of opportunity and accelerate your learning by trying to do as much as possible. When you make a mistake smile and say “Sorry about that. I’m still new. Now I know what I should and shouldn’t do”

Take notes
When you are new a lot of things will look stupid. Sometimes just because they are very different from the way you are used to things in a previous job. Don’t give in to the temptation to criticise everything, because you will look like a whiner. No one likes whiner. But take note of them, because you will get used to them soon and never see things with “beginner mind” again. In few month take a look at your list, if things still look stupid, it will be time to take on a project or two to fix them.

Contribute
I may be new at this specific job, but I still have a lot to contribute. I try hard to look for opportunities and I keep finding out that I’m more useful than I thought. I participate in discussions in internal mailing lists, I make suggestions, I help colleagues solve problems. I participate in interviews and file tickets when our products don’t work as expected. I don’t wait to be handed work or to be sent to a customer, I look for places where I can be of use.

I don’t change jobs often. So its quite possible that I don’t know everything there is to know about starting a new job. If you have tips and suggestions to share with me and my readers, please comment!

Environment Variables in Grid Control User Defined Metrics

Posted: August 4, 2010 | Author: prodlife | Filed under: Oracle, tips | Leave a comment

This post originally appeared at the Pythian blog.

Emerson wrote: “Foolish consistency is the hobgoblin of small minds”. I love this quote, because it allows me to announce a presentation titled “7 Sins of Concurrency” and then show up with only 5. There are places where consistency is indeed foolish, while other times I wish for more consistency.

Here is a nice story that illustrates both types of consistency, or lack of.

This customer Grid Control installed in their environment. We were asked to configure all kinds of metrics and monitors for several databases, and we decided to use the Grid Control for this. One of the things we decided to monitor is the success of the backup jobs.

Like many others, this customer runs his backup jobs from cron and the cron job generates an RMAN logfile. I thought that a monitor that will check the logfile for RMAN- and ORA- errors will be just the thing we need.

To be consistent, I could have moved the backup jobs to run from Grid Control scheduler instead of cron. But in my opinion, this would have been foolish consistency – why risk breaking perfectly good backups? Why divert my attention from the monitoring project to take on side improvements?

To have Grid Control check the log files, I decided to use OS UDM: Thats a “User Defined Metric” that is defined on “host” targets and allows to run a script on the server. I wrote a very small shell script that finds the latest log, greps for errors and counts them. The script returns the error count to Grid Control. More than 0 errors is a critical status for the monitor. I followed the instructions in the documentation to the letter – and indeed, everything easily worked. Hurray!

Wait. There’s a catch (and a reason for this blog post). I actually had two instances that are backed up, and therefore two logs to check. I wanted to use the same script and just change the ORACLE_SID in the environment.

No worries. The UI has a field called “Environment” and the documentation says: “Enter any environmental variable(s) required to run the user-defined script.”

One could imagine, based on the field name and the documentation, that if I type: “ORACLE_SID=mysid” in this field, and later run “echo $ORACLE_SID” in my script, the result would be “mysid”.

Wrong. What does happen is that $ORACLE_SID is empty. $1, on the other hand, is “{ORACLE_SID=mysid}”.

To get the environment variable I wanted, I had to do: tmp=(`echo $1 | tr ‘{}’ ‘ ’`); eval $tmp

It took me a while to figure this out as this behavior is not documented and I found no usage examples that refer to this issue.

Consistency between your product, the UI and the documentation is not foolish consistency. I expect the documentation and field descriptions to help me do my job and I’m annoyed when it doesn’t.

At least now this behavior is documented somewhere so future googlers may have easier time.

The Senile DBA Guide to Troubleshooting Sudden Growth in Redo Generation

Posted: November 4, 2009 | Author: prodlife | Filed under: scripts, tips | 13 Comments

I just troubleshooted a server where the amounts of redo generated suddenly exploded to the point of running out of disk space.

After I was done, the problem was found and the storage manager pacified, I decided to save the queries I used. This is a rather common issue, and the scripts will be useful for the next time.

It was very embarrassing to discover that I actually have 4 similar but not identical scripts for troubleshooting redo explosions. Now I have 5 🙂

Here are the techniques I use:

I determine whether there is really a problem and the times the excessive redo was generated by looking at v$log_history:

select trunc(FIRST_TIME,'HH') ,sum(BLOCKS) BLOCKS , count(*) cnt
,sum(decode(THREAD#,1,BLOCKS,0)) Node1_Blocks
,sum(decode(THREAD#,1,1,0)) Node1_Count
,sum(decode(THREAD#,2,BLOCKS,0)) Node2
,sum(decode(THREAD#,2,1,0)) Node2_Count
,sum(decode(THREAD#,3,BLOCKS,0)) Node3
,sum(decode(THREAD#,3,1,0)) Node3_Count
from v$archived_log
where FIRST_TIME >sysdate -30
group by trunc(FIRST_TIME,'HH')
order by trunc(FIRST_TIME,'HH') desc

If the problem is still happening, I can use Tanel Poder’s Snapper to find the worse redo generating sessions. Tanel explains how to do this in his blog.

However, Snapper’s output is very limited by the fact that it was written specifically for situations where you cannot create anything on the DB. Since I’m normally the DBA on the servers I troubleshoot, I have another script that actually gets information from v$session, sorts the results, etc.

create global temporary table redo_stats
( runid varchar2(15),
  sid number,
  value int )
on commit preserve rows;

truncate table redo_stats;

insert into redo_stats select 1,sid,value from v$sesstat ss
join v$statname sn on ss.statistic#=sn.statistic#
where name='redo size'

commit;

insert into redo_stats select 2,sid,value from v$sesstat ss
join v$statname sn on ss.statistic#=sn.statistic#
where name='redo size'

commit;

select *
        from redo_stats a, redo_stats b,v$session s
       where a.sid = b.sid
       and s.sid=a.sid
         and a.runid = 1
         and b.runid = 2
         and (b.value-a.value) > 0
       order by (b.value-a.value)

Last technique is normally the most informative, and requires a bit more work than the rest, so I save it for special cases. I’m talking about using logminer to find the offender in the redo logs. This is useful when the problem is no longer happening, but we want to see what was the problem last night. You can do a lot of analysis with the information in logminer, so the key is to dig around and see if you can isolate a problem. I’m just giving a small example here. You can filter log miner data by users, segments, specific transaction ids – the sky is the limit.

-- Here I pick up the relevant redo logs.
-- Adjust the where condition to match the times you are interested in.
-- I picked the last day.
select 'exec sys.dbms_logmnr.add_logfile(''' || name ||''');'
from v$archived_log
where FIRST_TIME >= sysdate-1
and THREAD#=1
and dest_id=1
order by FIRST_TIME desc

-- Use the results of the query above to add the logfiles you are interest in
-- Then start the logminer
exec sys.dbms_logmnr.add_logfile('/u10/oradata/MYDB01/arch/1_8646_657724194.dbf');

exec sys.dbms_logmnr.start_logmnr(options=>sys.dbms_logmnr.DICT_FROM_ONLINE_CATALOG);

-- You can find the top users and segments that generated redo
select seg_owner,seg_name,seg_type_name,operation ,min(TIMESTAMP) ,max(TIMESTAMP) ,count(*)
from v$logmnr_contents 
group by seg_owner,seg_name,seg_type_name,operation
order by count(*) desc

-- You can get more details about the specific actions and the amounts of redo they caused.
select LOG_ID,THREAD#,operation, data_obj#,SEG_OWNER,SEG_NAME,TABLE_SPACE,count(*) cnt ,sum(RBABYTE) as RBABYTE ,sum(RBABLK) as RBABLK 
from v$logmnr_contents 
group by LOG_ID,THREAD#,operation, data_obj#,SEG_OWNER,SEG_NAME,TABLE_SPACE
order by count(*)

-- don't forget to close the logminer when you are done
exec sys.dbms_logmnr.end_logmnr;

If you have ASH, you can use it to find the sessions and queries that waited the most for “log file sync” event. I found that this has some correlation with the worse redo generators.

-- find the sessions and queries causing most redo and when it happened

select SESSION_ID,user_id,sql_id,round(sample_time,'hh'),count(*) from V$ACTIVE_SESSION_HISTORY
where event like 'log file sync'
group by  SESSION_ID,user_id,sql_id,round(sample_time,'hh')
order by count(*) desc

-- you can look the the SQL itself by:
select * from DBA_HIST_SQLTEXT
where sql_id='dwbbdanhf7p4a'

Lessons From OOW09 #2 – Consolidation Tips

Posted: October 19, 2009 | Author: prodlife | Filed under: openworld09, tips | 4 Comments

The session was called “All in One” and it was given by Husnu Sensoy. A young and very accomplished DBA from Turkey. I chatted with him during ACE dinner and it turns out we have many colleagues in common. This was probably the most useful presentation I’ve heard this OpenWorld. As I am going into a large consolidation project for next year, I am glad I can learn from the experience of someone like Husnu who already gone through this and is very willing to share the experience.

His presentation is shared on his blog, so I’ll just give the parts that I consider to be highlights. You will probably learn more by reading his slides. He had tons of good content and he talks very very fast, so I’m sure I missed a bunch of good stuff. Since his content is readily available, I’m mixing in a lot of my own thoughts here.

The problems he set out to solve:

Too many DBs and too few DBAs.
Some servers are doing almost nothing.
Some servers have no HA.

Pick candidates for consolidation based on: License costs, utilization, data center location, dependencies, I/O characteristics, risk levels.

The driver for our consolidation were license costs – our new machines had two quad cores instead of one dual core, so license costs suddenly quadrapled and we were forced into cost saving consolidations. We mostly used data center location and risk levels to decide on the plan. Our most OLTP system, the one that is most sensitive to slow-downs, will remain unconsolidated for now.

Prior to consolidation collect lots of system/performance metrics. They will help pick candidates, plan capacity, test and later troubleshoot.

Don’t forget to talk to DBAs and business reps when making the consolidation plans, they will have their own ideas and this can be important input.

Additive linear models are recommended for capacity planning. He gave lots of guidelines on how to do this. Pages 26-36 in his PDF have the details. I could have sworn he recommended to stay below the 65% utilization when planning for CPU capacity, but I cannot see it in his slides. In any case – do this, because any higher than that and the linear additive model is questionable.

Also pay attention to the part about preferring larger servers and less RAC nodes, since RAC adds complexity. And to the part about every storage system delivering about 70-80% of spec. Actually, this is more true for the EMC system he used. Our Netapps seem to be up to spec.

Don’t mix sequential and random IO (i.e. OLTP and DW) is a good idea. A lot of places can’t really do this because of the way their apps are designed.

Benchmarking the new system to test the capacity plan is a great idea. I’d love to see more concrete information on how to benchmark, maybe a whole other presentation on this. One of the things that worry me most about our consolidation plans is that I’m not sure how good our tests will be. Husnu recommended HammerOra, which I’ll check out.

Crash tests. We did those ages ago when we moved to RAC architecture and then again for Netapp clusters. Was lots of fun and maybe its time to do this again. Husnu advised to ask support for a list of good test scenarios. I recommend taking your sysadmins, storage admins and net admins for few beers and asking them for scenarios – they generally come up with very creative stuff.

Good tip: In 11g, memory target does not work with huge pages. You are using huge pages, right?

Write the backup and recovery document as the first document on the new system.

Pages 76-86 have good advice on merging databases. We’ll be doing that too and I was glad to see that we came up with the same plans and same problems as Husnu.

His last advice is the best: “You never know how long this is going to take.” So true! Who could have known that we will be delayed for 3 month by an IT security group that popped up from no-where with requirement that we will pass certain security audits that we’ve never heard of. Life in a big organization can be full of surprises, so be prepared 🙂

Lessons From OOW09 #1 – Shell Script Tips

Posted: October 17, 2009 | Author: prodlife | Filed under: Linux, openworld09, scripts, tips, Uncategorized | 15 Comments

During OpenWorld I went to a session about shell scripting. The speaker, Ray Smith, was excellent. Clear, got the pace right, educating and entertaining.

His presentation was based on the book “The Art of Unix Programming” by one Eric Raymond. He recommended reading it, and I may end up doing that.

The idea is that shell scripts should obey two important rules:

Shell scripts must work
Shell scripts must keep working (even when Oracle takes BDUMP away).

Hard to object to that 🙂

Here’s some of his advice on how to achieve these goals (He had many more tips, these are just the ones I found non-trivial and potentially useful. My comments in italics.)

Document dead ends, the things you tried and did not work, so that the next person to maintain the code won’t try them again.
Document the script purpose in the script header, as well as the input arguments
Be kind – try to make the script easy to read. Use indentation. Its 2009, I’m horrified that “please indent” is still a relevant tip.
Clean up temporary files you will use before trying to use them:
function CleanUpFiles { [ $LOGFILE ] && rm -rf ${LOGFILE} [ $SPOOLFILE ] && rm -rf ${SPOOLFILE} }
Revisit old scripts. Even if they work. Technology changes. This one is very controversial – do we really need to keep chasing the latest technology?
Be nice to the users by working with them – verify before taking actions and keep user informed of what the script is doing at any time. OPatch is a great example.
Error messages should explain errors and advise how to fix them
Same script can work interactively or in cron by using: if [ tty -s ] …
When sending email notifying of success or failure, be complete. Say which host, which job, what happened, how to troubleshoot, when is the next run (or what is the schedule).
Dialog/Zenity – tools that let you easily create cool dialog screens
Never hardcode passwords, hostname, DB name, path. Use ORATAB, command line arguments or parameter files.I felt like clapping here. This is so obvious, yet we are now running a major project to modify all scripts to be like that.
Be consistent – try to use same scripts whenever possible and limit editing permissions
Use version control for your scripts. Getting our team to use version control was one of my major projects this year.
Subversion has HTTP access, so the internal knowledge base can point at the scripts. Wish I knew that last year.
Use deployment control tool like CFEngine. I should definitely check this one out.
Use getopts for parameters. Getopts looked to complicated when I first checked it out, but I should give it another try.
Create everything you need every time you need it. Don’t fail just because a directory does not exist. Output what you just did.
You can have common data files with things like hosts list or DB lists that are collected automatically on regular basis and that you can then reference in your scripts.
You can put comments and descriptions in ORATAB

I Can Has Training Budget

Posted: September 11, 2009 | Author: prodlife | Filed under: tips, training | 3 Comments

We know how it goes – there is a recession, and companies try to reduce expanses. The next thing you know, your training budget is all gone. Or maybe there is some training budget left, but now 6 DBAs share a sum that is not enough for one Oracle University course. How do you convince your managers that paying for your training is the best investment they can make?

Start by convincing yourself. Remember that your manager probably got to his position because he is good at reading people, so if you don’t really want the training, or don’t really believe you need this training, he may see that and you lost. You have to be 100% sure that you want this training because it will really allow you to improve the way you work.

As an example, lets assume you want to go to Linux Administration course. Its an interesting case, because it is not even evident that a DBA should go to such course.

Then think about your boss for a bit – what parts of the job are most important to him? what are his pet projects? pet peeves.

Once you have your desire for the course and your bosses desires in mind, make a list of all the benefits you can see from going to the course. The important thing is to highlight how the things you want to learn will help with the projects that are most important to your boss, or will address his specific pain points.

So, if your boss loves automation say: “I will learn more shell linux tools so I’ll be able to write better automation scripts”.
If he is a capacity planning person, say: “I will be able to better monitor the OS so we can be more proactive about provisioning”.
If he is a big fan of RAC, say: “With my improved Linux knowledge, I’ll be able to understand low-leve clusterware issues and solve them faster!”

Now you need to decide if you make your pitch face to face or by email. I prefer email. Information I put in the email:
* Course title and instructor (or school name)
* Dates/Times
* Location
* Price
* The list of 3-5 reasons I need this course (as you prepared in the previous paragraph).

Until he makes his decision, keep mentioned once or twice a day how the things you do now will be much better after you take the course: “I still don’t understand how to debug coredumps after the process crashes, but the Linux course may help”, “It takes me 2 hours to copy old files to the second disk, but I’ll probably learn how to do it faster in the Linux course”. Don’t force it, but keep an eye open for opportunities to explain and demonstrate the value you see in the course.

And a questionable tactics that sometimes works: Get an ultra-expensive course rejected before asking for a reasonably-priced course. “I can totally understand you don’t have the budget to send me to Collaborate in Denver, but what about one day training given by our local usergroup at a near-by location?”. I’m not sure if this tactic works because the manager feels guilty about rejecting my request, or if the lower-price seminar just looks better in comparison. I’m not even sure if I recommend it, really. Consider and act at your own risk 😉

Automatic Maintenance Tasks

Posted: August 27, 2009 | Author: prodlife | Filed under: 11g, tips | 6 Comments

Automatic Maintenance Tasks is a new 11g feature which I recently noticed. Its a bit embarrassing, since I’ve had 11g in production for nearly a year and apparently I’ve been using the feature all along.

I discovered the feature when I noticed that the automatic statistics gathering job is running several times on a weekend, instead of just once as it did in 10g. Then I discovered that the job has a very strange name starting with ORA$, and that the name changes every time the job runs.

Turns out that Oracle’s automatic jobs are not longer jobs. They are now Maintenance Tasks.

Here’s how Oracle defines the tasks:
“When a maintenance window opens, Oracle Database creates an Oracle Scheduler job for each maintenance task that is scheduled to run in that window. Each job is assigned a job name that is generated at runtime. All automated maintenance task job names begin with ORA$AT. For example, the job for the Automatic Segment Advisor might be called ORA$AT_SA_SPC_SY_26. When an automated maintenance task job finishes, it is deleted from the Oracle Scheduler job system. However, the job can still be found in the Scheduler job history.”

And here’s the reason my statistics ran several times on a weekend:
“In the case of a very long maintenance window, all automated maintenance tasks except Automatic SQL Tuning Advisor are restarted every four hours. This feature ensures that maintenance tasks are run regularly, regardless of window size”

What practically changed? Almost nothing, we had schedule windows in 10g, and the maintenance jobs (not tasks) ran within the defined windows. I’ve no clue why this change was necessary.

It definitely looks like infrastructure prepared for a future cool feature. At present, it just looks weird. For instance:

You can’t add tasks. Oracle has 3 predefined tasks – statistics, space advisor and tuning advisor. You can add or remove maintenance windows and define in which window to run each task, but you can’t add your own task.
There are lots of seemingly unnecessary definitions around. For example, from the dictionary tables, you can see there are task clients and task jobs. Currently it looks like they are the same thing, since there is a one-to-one relation between them, but it probably won’t stay this way.
The documentation doesn’t document much. There are fields such as client attributes with values that are not really explained anywhere.
The API is really weak. As I said, you can’t do much beyond enable/disable tasks in specific schedules

So far, it looks like this feature adds confusion but no value. I hope Oracle will do something fun with it in the future.

Did you know you can rename tablespaces?

Posted: June 13, 2009 | Author: prodlife | Filed under: tips | 6 Comments

You probably knew that. There is some possibility that I’m the last person to figure it out and even I knew about it for at least two years. After all renaming tablespaces existed since 10.1, which makes it almost 6 years old.

Just in case you didn’t know:
alter tablespace users rename to cool_users;

According to the docs (and my tests confirmed that), the rename changes the tablespace name but not the tablespace id – so it will not break behaviors such as default tablespace and file ownership and such.

Why is it such a cool feature? Because sometimes I get an export dump file from a customer who is moving to our service. I have to import the schemas from the dump file to our DB. I want his schemas to reside on CUST_X tablespace, but the dump file says they are in MYTBS. What do I do?

Before the magical rename: import to an indexfile. Use VI to modify the file to create all objects in CUST_X tablespace instead of MYTBS. Run the indexfile. Run the import.

After magical rename: rename CUST_X to MYTBS. Import. Rename back.

Much better!

Of course, datapump has the even more magical remap_tablespace parameter. Alas, not all my customers use datapump.

Group By in Shell

Posted: April 22, 2009 | Author: prodlife | Filed under: Linux, scripts, tips | 19 Comments

Storage manager dropped by my cube to say that redo log archives for one of our DBs grew from 40G to 200G in the last day. Running “du -Sh” few times showed that the files are no longer getting written as fast, so the issue is probably over. But what was it? And to start, when was it? Once I figure out when the issue occure, I may be able to find the offending DML in ASH.

So, I need to do something like:

select day,hour,sum(bytes)
from (ls -l)
group by day,hour
order by day,hour

I guess I could have used the output of “ls -l” as an external table and actually run this query. But I’m actually trying to improve my shell scripting skills. There must be some way to do it!

I asked on Twitter, and Kris Rice (of SQLDeveloper fame) pointed me in the right direction – use AWK.

Here’s the result:

ls -l | sed ‘s/:/ /g’ | awk ‘{sizes[$7$8] += int($5)} END {for (hour in sizes)
{ printf(“At %s %d bytes were inserted\n”,hour,sizes[hour])}}’ | sort

I run ls. Use sed to seperate the time into fields (Laurent Schneider will kill me, because I’m mixing sed and awk, but I’m not it his level yet…). After the split, the fifth field contains the file size, the seventh contains the date and the eighth is the hour. So I have an associative array called “sizes”, indexed on the date and hour, and each bucket contains the sum of the file sizes in this hour.

Then I loop on the array and print the results, and sorted them to make it more readable.

A long post for a short script, but I want to make sure I don’t forget it 😉

Anyway, the best part by far was everyone who tried to help my desperate call in Twitter/Facebook. It was unexpected and very very nice. About 5 people sent tips, ideas and even solutions! I only imagine that doing my work is more fun than doing their own work, but I appreciate the help a lot.

Just a simple Hadoop DBA

Adventures with Data and Massively Parallel Databases