Programming

Getting Creative with MapReduce

One problem with many existing MapReduce abstraction layers is the utter difficulty of testing queries and workflows. End-to-end tests are maddening to craft in vanilla Hadoop and frustrating at best in Pig and Hive. The difficulty of testing MapReduce workflows makes it scary to change code, and destroys your desire to be creative. A proper testing suite is an absolute prerequisite to doing creative work in big data. In this blog post, I aim to show how most of the difficulty of writing and testing MapReduce queries stems from the fact that Hadoop confounds application logic with decisions about data storage. These problems are the result of poorly implemented abstractions over the primitives of MapReduce, not problems with the core MapReduce algorithms. ...

Cascalog 1.8.1 Released

Nathan Marz and I are releasing Cascalog 1.8.1 today! We’ve added a few interesting features, and I thought I’d provide a bit more detail here for anyone interested. Cross Join cascalog.api now includes support for cross-joins; just add (cross-join) to your query as its own predicate. Think of a cross-join as a “tuple comprehension”, or cartesian product, with similar results to clojure.core/for; it’s not very efficient, as it forces all tuples through a single reducer (and causes a massive blowup in the number of tuples!). Here’s an example: ...

Haskell in Emacs

I spent some time today getting my emacs config set up to learn Haskell, and ran into a few issues; I figured I’d go ahead and document the process here for everyone’s enjoyment. We’re going to install and configure Haskell mode, then add a few extensions that’ll make learning Haskell fun and easy! I’m currently running haskell-mode for emacs, with the hs-lint plugin, Haskell support for FlyMake (which provides on-the-fly syntax checking from the Haskell compiler), and code autocompletion. The steps covered by this tutorial are: ...

Simple Hadoop Clusters

I’m excited to announce Pallet-Hadoop, a configuration library written in Clojure for Apache’s Hadoop. In the tutorial, we’re going to see how to create a three node Hadoop cluster on EC2, and run a word count on MapReduce. We’ll be following along with Pallet-Hadoop example project for the introduction; for a more in-depth discussion of the design of pallet-hadoop, see the project wiki. Background Hadoop is an Apache java framework that allows for distributed processing of enormous datasets across large clusters. It combines a computation engine based on MapReduce with HDFS, a distributed filesystem based on the Google File System. ...