Applied Lambda for Data Lake
As introduced in the initial chapters, big data is commonly defined by four Vs: Variety, Velocity, Volume, and Veracity. We were also introduced to the Lambda architecture and how it can merge the outputs of two distinct processing pipelines. To leverage big data technologies for processing problems, it is a good idea to marry the Lambda architecture with these technologies so that we can reap the benefits of both. Though big data refers to an end-to-end solution to handle, process, and manage information across all four Vs, the term has become almost synonymous with the Hadoop framework. Hadoop grew up as an open source project in the Apache community, with much of its early development driven by Yahoo! for internal big data scenarios, and it is distributed under an Apache license. Its immediate demand brought in a lot of commercial offerings for support, and over time the community witnessed a number of customized distributions of Hadoop. Some of the most popular ones today are Cloudera, Hortonworks, and MapR. Hortonworks is a spin-off from Yahoo! and continues to maintain its commercial offering in this space, competing closely with Cloudera and MapR. In this chapter, we will take a quick look at the technologies in the Hadoop landscape and how they can conceptually help us realize a Data Lake.