code | Sam Ritchie

Introducing Cascalog-Contrib

I’ve had the pleasure of working with Cascalog for about ten months now, and have seen the community produce some fantastic work. A number of businesses are using Cascalog in production; I use Cascalog at Twitter every day to write MapReduce queries for the new Twitter Web Analytics product. One thing Cascalog doesn’t yet have is a community repository for generic queries and operations. To fill this gap we’ve created cascalog-contrib. Cascalog-contrib will be home to any higher-level abstractions over Cascalog that the community is willing to submit. If you have an idea for a module, file a pull request on GitHub or bring it up on the mailing list for discussion. ...

Testing Cascalog with Midje

I’ve been working on a Cascalog testing suite these past few weeks, an extension to Brian Marick’s Midje, that eases much of the pain of testing MapReduce workflows. I think a lot of the dull work we see in the Hadoop community is a direct result of fear. Without proper tests, Hadoop developers can’t help but be scared of making changes to production code. When creativity might bring down a workflow, it’s easiest to get it working once and leave it alone. ...

Getting Creative with MapReduce

One problem with many existing MapReduce abstraction layers is the utter difficulty of testing queries and workflows. End-to-end tests are maddening to craft in vanilla Hadoop and frustrating at best in Pig and Hive. The difficulty of testing MapReduce workflows makes it scary to change code, and destroys your desire to be creative. A proper testing suite is an absolute prerequisite to doing creative work in big data. In this blog post, I aim to show how most of the difficulty of writing and testing MapReduce queries stems from the fact that Hadoop confounds application logic with decisions about data storage. These problems are the result of poorly implemented abstractions over the primitives of MapReduce, not problems with the core MapReduce algorithms. ...

Cascalog 1.8.1 Released

Nathan Marz and I are releasing Cascalog 1.8.1 today! We’ve added a few interesting features, and I thought I’d provide a bit more detail here for anyone interested. Cross Join cascalog.api now includes support for cross-joins; just add (cross-join) to your query as its own predicate. Think of a cross-join as a “tuple comprehension”, or cartesian product, with similar results to clojure.core/for; it’s not very efficient, as it forces all tuples through a single reducer (and causes a massive blowup in the number of tuples!). Here’s an example: ...

Haskell in Emacs

I spent some time today getting my emacs config set up to learn Haskell, and ran into a few issues; I figured I’d go ahead and document the process here for everyone’s enjoyment. We’re going to install and configure Haskell mode, then add a few extensions that’ll make learning Haskell fun and easy! I’m currently running haskell-mode for emacs, with the hs-lint plugin, Haskell support for FlyMake (which provides on-the-fly syntax checking from the Haskell compiler), and code autocompletion. The steps covered by this tutorial are: ...

Simple Hadoop Clusters

I’m excited to announce Pallet-Hadoop, a configuration library written in Clojure for Apache’s Hadoop. In the tutorial, we’re going to see how to create a three node Hadoop cluster on EC2, and run a word count on MapReduce. We’ll be following along with Pallet-Hadoop example project for the introduction; for a more in-depth discussion of the design of pallet-hadoop, see the project wiki. Background Hadoop is an Apache java framework that allows for distributed processing of enormous datasets across large clusters. It combines a computation engine based on MapReduce with HDFS, a distributed filesystem based on the Google File System. ...