Learning Something New: Big Data
We learn in increments. One thing leads to another, which leads to another, and so on until we stand proudly atop a mountain of specialties, each founded on increasingly broad concepts.
Sometimes, we need to learn something completely new, something for which we have little prior knowledge to build on.
Big Data has been like that for me. What follows will be an incremental introduction, taking a path from zero information to enough to work with. I'll be adding to the narrative as I reach certain thresholds of competency, and we'll build a toy application together.
There are a lot of moving parts in here, from concepts to particular implementations.
Big Data, as we currently use the term in the industry, covers a raft of definitions:
- static data sets of a particular size (in the terabytes and above), large enough that representing them in the usual ways (with an RDBMS) doesn't cut it
- any data set with a certain tenacity, in which finding and removing any single record becomes difficult
- streams of dynamic data requiring real-time processing, and attempts to drink from these many firehoses
- techniques of processing large-scale data, from methods of statistical analysis to distributed file systems and attendant retrieval and crunching algorithms
- methods of representing and visualizing data of this size (and velocity) in a way that can be easily and immediately understood
- strategies by which businesses can gain intelligence from their bulging-at-the-seams data warehouses
- specific tools used to deal with data fitting one or more of the above criteria
- and finally, the evolving discussion of Big Data itself; Big Data refers to Big Data in a circular, self-referential way.
I hadn't really considered any of these things until a few months ago. Sure, I'd kept up with the terms and had a rough sense of the state of the art. While working with one dating startup, we considered these technologies as a way to match users, but we ran out of money before we got the marketing agenda off the ground, so it wasn't something any of us got around to implementing.
Then BAM! I had a project requiring me to know as much of this as possible. Right now. HOKAY.
The sheer volume of what I didn't know frustrated me; in the words of AB, 'There's nothing I hate more than not knowing everything.' Incremental learning did not apply, and it was only after some floundering (and with the timely advice of someone much wiser) that I arrived at a learning strategy:
Once you have your head around the theater, pick a single tactical objective, and make it happen. Repeat until done.
The theater, in this case, consists of anyone with a large quantity of data, thought-leaders in Big Data, and a loose survey of the technologies commonly used to deal with it. We don't need to know all the details, but a high-level view of the landscape allows us to pick a valuable objective, rather than an arbitrary one.
This accomplished, pick a small objective and a tactical team to deal with it. In my case, the objective was very small: 'The Complete Works of William Shakespeare'. Not a ton of data, but enough to get me used to working with the tactical team, consisting of Hadoop, Hive, and HBase, picked because they enjoy widespread use.
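Just to give a taste of what we're getting into (the real walkthrough comes in later posts), here's a minimal word-count sketch over that Shakespeare text, written as a Hadoop Streaming mapper and reducer in Python. The file names are hypothetical, and this is just one of several ways to run such a job, not the canonical one.

```python
#!/usr/bin/env python
# mapper.py (hypothetical file name): emit a count of 1 for every word seen.
import sys

for line in sys.stdin:
    for word in line.strip().lower().split():
        print("%s\t%d" % (word, 1))
```

```python
#!/usr/bin/env python
# reducer.py (hypothetical file name): Hadoop sorts mapper output by key,
# so identical words arrive in consecutive runs; sum the counts for each run.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print("%s\t%d" % (current_word, current_count))
        current_word, current_count = word, int(count)

if current_word is not None:
    print("%s\t%d" % (current_word, current_count))
```

For a quick sanity check you don't even need a cluster: `cat shakespeare.txt | python mapper.py | sort | python reducer.py` runs the same logic locally. Handing the two scripts to Hadoop Streaming (the exact jar path varies by install) distributes that identical logic across a cluster.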
In the next few posts, we'll walk through the theater, then take a high-level tour of the tools, and then apply them toward building something that might get you the phone number of that library science grad student you've had your eye on. You know the one.