Hive or: how I learned to stop worrying and love SQL again
Today I was going to write about the inaugural Team MetaBroadcast ride we recently went on. However, events have overtaken me, and that'll have to wait for another day. Instead I'm going to write about my new-found love: Hive.
At work we're analysing some largish datasets at the moment, and it's fallen to me to get a feel for them with some ad-hoc queries. We've previously done this with Pig scripts, but as someone who's written more lines of SQL in their lifetime than is good for anyone's sanity (I once even dreamt SQL), I was struggling to adjust to Pig and constantly thinking "this'd be a cinch in SQL".
So, after a bit of a battle trying to get Pig to do a histogram, I conceded defeat and fired up a Hive cluster.
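For context, the histogram I was after boils down to a GROUP BY and a COUNT in SQL terms. Against the actions table defined further down, a rough sketch (with the bucketing column chosen purely for illustration) looks like:-
SELECT description, COUNT(*) AS frequency
FROM actions
GROUP BY description
ORDER BY frequency DESC;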
We're using AWS's Elastic MapReduce (EMR), so it was pretty straightforward. After installing and configuring the command-line tools (yes, yet another set of AWS command-line tools, and yet another place to put your credentials), you can start a cluster with just:-
./elastic-mapreduce \
  --create \
  --alive \
  --name "My Hive cluster" \
  --num-instances 3 \
  --instance-type m3.xlarge \
  --hive-interactive
Wait a bit and, BAM, you've got yourself a cluster of one master node and two core nodes to do with as you please. This brings up an interactive cluster: one that lives until you kill it, the alternative being a batch cluster that runs a specific task or tasks. I opted for an interactive one as I'd got a bunch of ad-hoc queries to run and it saves repeatedly spinning instances up and down.
That's all well and good, but where next? We need to get ourselves some data. Hive natively supports files on S3, which is quite useful. After logging on to the master node and firing up hive, a table that uses S3 files as its source can be defined thus:-
CREATE EXTERNAL TABLE actions (
  user_name STRING,
  action_time STRING,
  description STRING,
  id STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
LOCATION 's3://our-bucket/data/';
It's important to note here that the table is backed by S3 data, rather than populated with S3 data. This has performance implications: every time you query the table, there's a hop to S3. If you can't cope with that, you can always copy the files onto the cluster's HDFS filesystem and import them into your table from there using a LOAD DATA statement.
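As a rough sketch of that alternative, assuming the files have already been copied from S3 onto HDFS (the table name and path below are placeholders, not our real setup):-
CREATE TABLE actions_local (
  user_name STRING,
  action_time STRING,
  description STRING,
  id STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n';

LOAD DATA INPATH '/data/actions/' OVERWRITE INTO TABLE actions_local;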
Now to write some queries! With a browser tab open at the Hive language manual, I was all set. Within minutes I was like a pig in the proverbial: HAVING clauses, subqueries... the list goes on. Even better, EMR supports spot instances, so when a tricky query came up I was able to throw some cheap ones at the problem to make things happen faster:-
./elastic-mapreduce \
  --jobflow my-jobflow-id \
  --add-instance-group task \
  --instance-type m3.xlarge \
  --instance-count 2 \
  --bid-price 0.2
The culmination of my queries was some subquery fun along the lines of:-
SELECT COUNT(actions.id), actions.user_name
FROM actions
JOIN titles ON (actions.id = titles.id)
JOIN (
  SELECT user_name
  FROM actions
  JOIN titles ON (actions.id = titles.id)
  WHERE name = "The X Factor"
  GROUP BY user_name
  HAVING COUNT(id) > 10
) a ON (actions.user_name = a.user_name)
WHERE name != "Strictly Come Dancing"
GROUP BY actions.user_name
HAVING COUNT(id) > 10
Things were made more interesting by the input file format and Hive's lack of support for standard things like ISO-8601 datetimes, so I found myself writing User-Defined Functions to manipulate input strings. All pretty straightforward, mind.
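For the curious, wiring one of those user-defined functions into a Hive session looks roughly like this; the jar path, class name, and function name below are placeholders rather than our actual code:-
-- make the jar containing the UDF available to the session
ADD JAR /home/hadoop/udfs/datetime-udfs.jar;

-- register the Java class as a function that queries can call
CREATE TEMPORARY FUNCTION parse_iso8601 AS 'com.example.hive.ParseIso8601';

-- then use it like any built-in function
SELECT parse_iso8601(action_time), COUNT(id)
FROM actions
GROUP BY parse_iso8601(action_time);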
I'm very pleased I cut my losses with Pig and switched to Hive. I'm sure there are cases where Pig may be the better fit, but for the queries I've been doing, and given I'm already well versed in SQL, Hive was perfect for the job. I'd love to hear your experiences with either of them, so why not come and talk to me about it over a pint at our next MetaBeerTalks, a week today? We'll be upstairs at The Plough in Bloomsbury from 6pm. See you there?
Picture credit: Mark Bold