Saw this great article by Ted Dziuba that parallels my programming philosophy.

(In case you’re interested, here’s a reddit thread on the article.)

In short, the article argues that much of what you want to accomplish with any project is possible with the standard, tried-and-true Unix tools. When you shun these tools in favor of the latest-and-greatest buzzword tool or library, you’re not only introducing a lot of unnecessary complexity from the get-go, but also taking on the responsibility of maintaining that code. And you’re introducing a lot of uncertainty by trading a near-bulletproof method for a new-kid-on-the-block library that hasn’t had the benefit of being around the block a few billion times over the last 40 years.

I can’t tell you how many times others have looked at my code and called it “messy” because I tend to make use of standard Unix utilities rather than newfangled libraries. Hey, I tell them, I just wanted to write a quick script that accomplishes a simple task in as little time as possible. I want it to be reliable, and I want it to “just work.” If I had done it your way, it would have taken me 6 hours instead of 20 minutes, and it probably would have broken in a year or two. Unix’s find, xargs, cat, sort, wget, grep, awk, et al. all work splendidly. Why reinvent the wheel?
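
To give you an idea of the kind of thing I mean, here’s a throwaway pipeline in that spirit. It’s a made-up example of my own — the log directory and the awk field are just placeholders for a typical access-log setup — that pulls the ten most-requested paths out of a pile of web server logs:

# Top 10 requested paths; $7 is the request path in the common log format.
find /var/log/nginx/ -name '*.log' -type f -print0 | xargs -0 cat | awk '{print $7}' | sort | uniq -c | sort -rn | head -10

No libraries, no dependencies to babysit — just the tools that are already on the box.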

Dziuba provides an excellent example:

Here’s a concrete example: suppose you have millions of web pages that you want to download and save to disk for later processing. How do you do it? The cool-kids answer is to write a distributed crawler in Clojure and run it on EC2, handing out jobs with a message queue like SQS or ZeroMQ.

The Taco Bell answer? xargs and wget. In the rare case that you saturate the network connection, add some split and rsync. A “distributed crawler” is really only like 10 lines of shell script.

Moving on, once you have these millions of pages (or even tens of millions), how do you process them? Surely, Hadoop MapReduce is necessary, after all, that’s what Google uses to parse the web, right?

Pfft, fuck that noise:

find crawl_dir/ -type f -print0 | xargs -n1 -0 -P32 ./process

32 concurrent parallel parsing processes and zero bullshit to manage. Requirement satisfied.

Simple, yet brilliant.
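
For the curious, the “10 lines of shell script” crawler Dziuba alludes to might look something like this. This is a rough sketch of my own, not his actual script, and it assumes you’ve already got the target URLs in a urls.txt file, one per line:

# Fetch every URL in urls.txt, 16 at a time; wget's -x mirrors each
# URL's host/path structure under crawl_dir/ for later processing.
mkdir -p crawl_dir
cat urls.txt | xargs -n1 -P16 wget -q -x -P crawl_dir --timeout=30 --tries=2

# If one box saturates its pipe, shard the list (GNU split) and rsync
# the shards out to a couple more machines, then run the same line there.
split -n l/4 urls.txt shard_

Give or take some error handling, that’s the whole “distributed crawler.”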