
Applied Lambda for Data Lake

As introduced in the initial chapters, big data is commonly characterized by four Vs: Variety, Velocity, Volume, and Veracity. We were also introduced to the Lambda architecture and how it can merge the outputs of two distinct processing pipelines. To leverage big data technologies for processing problems, it is a good idea to marry the Lambda architecture with these technologies so that we can reap the benefits of both. Though big data refers to an end-to-end solution to handle, process, and manage information across all four Vs, it has become almost synonymous with the Hadoop framework. Hadoop was developed in large part at Yahoo! for internal big data scenarios and was open sourced under an Apache license. Its immediate demand brought in a number of commercial offerings for support, and over time the community witnessed several customized distributions of Hadoop. Some of the most popular ones today are Cloudera, Hortonworks, and MapR. Hortonworks is a spin-off from Yahoo! and continues to maintain its commercial offering in this space, competing closely with Cloudera and MapR. In this chapter, we will take a quick look at the technologies in the Hadoop landscape and how they can conceptually help us realize a Data Lake.
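As a reminder of the Lambda idea mentioned above, the following is a minimal sketch of how the outputs of a batch pipeline and a real-time (speed) pipeline might be merged at query time. It is an illustration only: the function name merge_views, the page names, and the counts are hypothetical and do not come from any Hadoop component. The sketch assumes both pipelines produce simple per-key counts and that the speed view covers only events not yet absorbed by the batch view.

```python
from collections import Counter


def merge_views(batch_view, speed_view):
    """Combine precomputed batch counts with recent speed-layer counts.

    Hypothetical helper: assumes both views map a key to a count and
    that the speed view holds only events newer than the last batch run.
    """
    merged = Counter(batch_view)   # start from the batch results
    merged.update(speed_view)      # add the recent increments per key
    return dict(merged)


# Hypothetical page-view counts from each pipeline.
batch_view = {"page_a": 1000, "page_b": 250}
speed_view = {"page_a": 12, "page_c": 3}

merged = merge_views(batch_view, speed_view)
print(merged)  # {'page_a': 1012, 'page_b': 250, 'page_c': 3}
```

In a real deployment the batch view would typically be recomputed periodically over the full dataset, and the speed view discarded once its events are covered by the next batch run; addition works here only because counts are simple to combine.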

Here we want to establish some common ground on the overall big data landscape and the specific technologies chosen for the forthcoming chapters. As far as possible, this book refers to standard open source distributions, so that the examples and concepts are distribution agnostic and can be run on any distribution of your choice.
