Leadville Trail 100, 2014 Edition

Last weekend, I raced the Leadville Trail 100 for the second time. Last year’s race was physically brutal; I sat curled up at the 50 mile point, 11 pounds light and unable to keep down fluids, for almost two hours before rallying and banging out a strong second half for a finish of 26:15 (Strava report). That race earned me the silver belt buckle awarded to all finishers under 30 hours: ...

August 8, 2014 · 28 min

API Authentication with Liberator and Friend

I’ve just finished rewriting a number of PaddleGuru’s internal APIs using two great open-source libraries: Liberator and Friend. Liberator is a library for writing RESTful resources in Clojure. Friend is an authorization and authentication library written by the prolific Chas Emerick, Dominator, Esquire. You’ve certainly seen his stuff around if you’ve played with Clojure(Script) in any level of detail. Authentication and authorization are both really important in RESTful APIs. These libraries are made for each other, I thought to myself. I’ll just use them together and life will be wonderful. Right? ...

January 18, 2014 · 14 min

Upcoming Talks in 2013

This is the year I teach myself to become a better public speaker. I’ve spent the past year coding up a number of powerful Scala and Clojure projects, all the while avoiding the important and difficult work of teaching and writing about the import and use of those projects. Well, no longer. To mind that gap I’ll be giving a number of talks in 2013 on my recent work on Summingbird and Cascalog. If you’re in the Bay Area, Boston, or Northern VA (my home town!), I’d love to meet you. ...

August 24, 2013 · 3 min

Leadville Trail 100

This past weekend I ticked off one of my athletic life goals: the Leadville Trail 100, a brutal 100 mile running race in Colorado. I signed up for Leadville this May after a team injury forced our boat out of the 2013 Texas Water Safari, a 262 mile canoe race down in Texas that is my usual ultra-length torture for the year. Less than three months is less training than recommended for a hundred miler, but what the hell. Rational thinking never led anyone to a successful hundred mile finish. ...

August 20, 2013 · 36 min

Cascalog Testing 2.0

A few months ago I announced Midje-Cascalog, my layer of Midje testing macros over the Cascalog MapReduce DSL. These allow you to write tests for your Cascalog jobs in a style that mimics Cascalog’s own query execution syntax. In this post I discuss midje-cascalog’s 0.4.0 release, which brings tighter Midje integration and a number of new ways to write tests. I’ll start with a refresher on the old syntax before debuting the new. If you’re eager, add the following to your project.clj: ...

January 23, 2012 · 6 min

Introducing Cascalog-Contrib

I’ve had the pleasure of working with Cascalog for about ten months now, and have seen the community produce some fantastic work. A number of businesses are using Cascalog in production; I use Cascalog at Twitter every day to write MapReduce queries for the new Twitter Web Analytics product. One thing Cascalog doesn’t yet have is a community repository for generic queries and operations. To fill this gap we’ve created cascalog-contrib. Cascalog-contrib will be home to any higher-level abstractions over Cascalog that the community is willing to submit. If you have an idea for a module, file a pull request on GitHub or bring it up on the mailing list for discussion. ...

November 16, 2011 · 4 min

Testing Cascalog with Midje

I’ve been working on a Cascalog testing suite these past few weeks, an extension to Brian Marick’s Midje, that eases much of the pain of testing MapReduce workflows. I think a lot of the dull work we see in the Hadoop community is a direct result of fear. Without proper tests, Hadoop developers can’t help but be scared of making changes to production code. When creativity might bring down a workflow, it’s easiest to get it working once and leave it alone. ...

September 30, 2011 · 15 min

Getting Creative with MapReduce

One problem with many existing MapReduce abstraction layers is the utter difficulty of testing queries and workflows. End-to-end tests are maddening to craft in vanilla Hadoop and frustrating at best in Pig and Hive. The difficulty of testing MapReduce workflows makes it scary to change code, and destroys your desire to be creative. A proper testing suite is an absolute prerequisite to doing creative work in big data. In this blog post, I aim to show how most of the difficulty of writing and testing MapReduce queries stems from the fact that Hadoop confounds application logic with decisions about data storage. These problems are the result of poorly implemented abstractions over the primitives of MapReduce, not problems with the core MapReduce algorithms. ...

September 29, 2011 · 6 min

Cascalog 1.8.1 Released

Nathan Marz and I are releasing Cascalog 1.8.1 today! We’ve added a few interesting features, and I thought I’d provide a bit more detail here for anyone interested. Cross Join: cascalog.api now includes support for cross-joins; just add (cross-join) to your query as its own predicate. Think of a cross-join as a “tuple comprehension”, or Cartesian product, with similar results to clojure.core/for; it’s not very efficient, as it forces all tuples through a single reducer (and causes a massive blowup in the number of tuples!). Here’s an example: ...

September 26, 2011 · 3 min

Haskell in Emacs

I spent some time today getting my Emacs config set up to learn Haskell, and ran into a few issues; I figured I’d go ahead and document the process here for everyone’s enjoyment. We’re going to install and configure Haskell mode, then add a few extensions that’ll make learning Haskell fun and easy! I’m currently running haskell-mode for Emacs, with the hs-lint plugin, Haskell support for FlyMake (which provides on-the-fly syntax checking from the Haskell compiler), and code autocompletion. The steps covered by this tutorial are: ...

September 25, 2011 · 5 min