Cascalog + Hadoop Counters, Finally!

I’ve just merged a Cascalog pull request of mine that gives Cascalog operations access to the statistics that Cascading generates at the end of each job. I’ve also added global inc! and inc-by! functions that let you increment custom Hadoop counters from within your functions and operations without having to deal with all that prepfn nastiness we introduced in Cascalog 2.0. Here’s a link to the code. If you want to follow along, or just want to get the hell away from this blog and start playing with the code now, get yourself a copy of the new snapshot: ...

February 21, 2015 · 3 min

Cascalog 2.0 In Depth

Cascalog 2.0 has been out for over a year now, and outside of a post to the mailing list and a talk at Clojure/Conj 2013 (slides here), I’ve never written up the startlingly long list of new features brought by that release. So shameful. This post fixes that. 2.0 was a big deal. Anonymous functions make it easy to reuse your existing, non-Cascalog code. The interop story with vanilla Clojure is much better, which is huge for testing. Finally, users can access the JobConf, Cascading’s counters and other Cascading guts during operations. ...

January 3, 2015 · 10 min

Hardcore Cascalog: Dynamic Queries

A little side note before I get started: pivoting from my last post on ski mountaineering racing to this post on advanced Cascalog patterns has made me realize that I’m a full-fledged connoisseur of the esoteric. I’m embracing it! This is the first in a series of posts on hardcore Cascalog. If you’re stoked, leave me a comment telling me what you want to learn more about and we’ll go from there. ...

January 1, 2015 · 9 min

Cascalog Testing 2.0

A few months ago I announced Midje-Cascalog, my layer of Midje testing macros over the Cascalog MapReduce DSL. These allow you to write tests for your Cascalog jobs in a style that mimics Cascalog’s own query execution syntax. In this post I discuss midje-cascalog’s 0.4.0 release, which brings tighter Midje integration and a number of new ways to write tests. I’ll start with a refresher on the old syntax before debuting the new. If you’re eager, add the following to your project.clj: ...

January 23, 2012 · 6 min

Introducing Cascalog-Contrib

I’ve had the pleasure of working with Cascalog for about ten months now, and have seen the community produce some fantastic work. A number of businesses are using Cascalog in production; I use Cascalog at Twitter every day to write MapReduce queries for the new Twitter Web Analytics product. One thing Cascalog doesn’t yet have is a community repository for generic queries and operations. To fill this gap we’ve created cascalog-contrib. Cascalog-contrib will be home to any higher-level abstractions over Cascalog that the community is willing to submit. If you have an idea for a module, file a pull request on GitHub or bring it up on the mailing list for discussion. ...

November 16, 2011 · 4 min

Testing Cascalog with Midje

I’ve been working on a Cascalog testing suite these past few weeks, an extension to Brian Marick’s Midje, that eases much of the pain of testing MapReduce workflows. I think a lot of the dull work we see in the Hadoop community is a direct result of fear. Without proper tests, Hadoop developers can’t help but be scared of making changes to production code. When creativity might bring down a workflow, it’s easiest to get it working once and leave it alone. ...

September 30, 2011 · 15 min

Getting Creative with MapReduce

One problem with many existing MapReduce abstraction layers is the utter difficulty of testing queries and workflows. End-to-end tests are maddening to craft in vanilla Hadoop and frustrating at best in Pig and Hive. The difficulty of testing MapReduce workflows makes it scary to change code, and destroys your desire to be creative. A proper testing suite is an absolute prerequisite to doing creative work in big data. In this blog post, I aim to show how most of the difficulty of writing and testing MapReduce queries stems from the fact that Hadoop confounds application logic with decisions about data storage. These problems are the result of poorly implemented abstractions over the primitives of MapReduce, not problems with the core MapReduce algorithms. ...

September 29, 2011 · 6 min

Cascalog 1.8.1 Released

Nathan Marz and I are releasing Cascalog 1.8.1 today! We’ve added a few interesting features, and I thought I’d provide a bit more detail here for anyone interested. Cross Join: cascalog.api now includes support for cross-joins; just add (cross-join) to your query as its own predicate. Think of a cross-join as a “tuple comprehension”, or Cartesian product, with similar results to clojure.core/for; it’s not very efficient, as it forces all tuples through a single reducer (and causes a massive blowup in the number of tuples!). Here’s an example: ...

September 26, 2011 · 3 min