ben vandgrift · on living the dream

Learning Something New: Big Data

We learn in increments. One thing leads to another, which leads to another, and so on until we stand proudly atop a mountain of specialties, each founded on increasingly broad concepts.

Sometimes, we need to learn something completely new, something for which we have little existing knowledge to build on.

Big Data has been like that for me. What follows will be an incremental introduction, taking a path from zero information to enough to work with. I'll be adding to the narrative as I reach certain thresholds of competency, and we'll build a toy application together.

There are a lot of moving parts in here, from concepts to particular implementations.

Big Data, as we are currently using it in the industry, refers to a raft of definitions:

I had not really considered any of these things until a few months ago. Sure, I'd kept up with the terms, something close to the state of the art. When working with one dating startup, we considered these technologies as a way to match users, but we ran out of money before we got the marketing agenda off the ground, so none of us got around to implementing them.

Then BAM! I had a project. Requiring me to know as much of this as possible. Right now. HOKAY.

The sheer volume of what I didn't know frustrated me; in the words of AB, 'There's nothing I hate more than not knowing everything.' Incremental learning did not apply, and it was only after some floundering (and the timely advice of someone much wiser) that I arrived at a learning strategy.

Once you have your head around the theater, pick a single tactical objective, and make it happen. Repeat until done.

The theater, in this case, consists of anyone with a large quantity of data, thought-leaders in Big Data, and a loose survey of the technologies commonly used to deal with it. We don't need to know all the details, but a high-level view of the landscape allows us to pick a valuable objective, rather than an arbitrary one.

This accomplished, pick a small objective and a tactical team to deal with it. In my case, the objective was very small: 'The Complete Works of William Shakespeare'. Not a ton of data, but enough to get me used to working with the tactical team of Hadoop, Hive, and HBase, picked because they enjoy widespread use.

In the next few posts, we'll walk through the theater, then a high-level overview of the tools, then we'll apply them toward building something that might get you the phone number of that library science grad student you've had your eye on. You know the one.

written: Jun 1 2012