Mastering Hadoop

Summary

In this chapter, you saw optimizations at different stages of the Hadoop MapReduce pipeline. The join example also illustrated a few of the more advanced features available to MapReduce jobs. Some key takeaways from this chapter are as follows:

  • Too many I/O-bound Map tasks should be avoided; the inputs, specifically the input splits, dictate the number of Map tasks.
  • Map tasks are the primary contributors to job speedup because of their parallelism.
  • Combiners not only make data transfers between Map and Reduce tasks more efficient, but also reduce disk I/O on the Map side (see the combiner sketch after this list).
  • The default setting is a single Reduce task.
  • Custom partitioners can be used to balance load among Reducers (a sketch follows this list).
  • DistributedCache is useful for distributing small side files to every node; caching too many files, or very large ones, should be avoided (see the usage sketch after this list).
  • Custom counters should be used to track global, job-level statistics, but defining too many counters burdens the framework (a sketch follows this list).
  • Compression should be used more often. Different compression techniques come with different tradeoffs, and the right technique is application-dependent (see the configuration sketch after this list).
  • Hadoop has many tunable configuration knobs to optimize job execution.
  • Premature optimization should be avoided; the built-in counters are your friends when deciding what to tune.
  • Higher-level abstractions such as Pig or Hive are recommended over hand-coded, low-level MapReduce jobs.
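
To make the combiner takeaway concrete, here is a minimal, self-contained word-count sketch that reuses the Reducer as a combiner. The class and job names are illustrative; the calls are the standard org.apache.hadoop.mapreduce API.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountWithCombiner {

    public static class TokenMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }

    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values,
                Context context) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "wordcount-with-combiner");
        job.setJarByClass(WordCountWithCombiner.class);
        job.setMapperClass(TokenMapper.class);
        // Summing partial counts is commutative and associative, so the
        // Reducer doubles as a combiner, cutting shuffle traffic and
        // Map-side spill I/O.
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Reusing the Reducer works here only because summation is commutative and associative; Hadoop is free to run the combiner zero, one, or many times.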
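
The custom partitioner hook looks like the sketch below. This version simply hashes the key, which mirrors what the default HashPartitioner already does; a real load-balancing partitioner would route keys using application knowledge of the key distribution, which is beyond a summary example.

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Routes each key to a Reducer by hash. A production load-balancing
// partitioner would instead split known hot keys across Reducers.
public class BalancedPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        // Mask the sign bit so the partition index is never negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}
```

The driver registers it with job.setPartitionerClass(BalancedPartitioner.class); since the default is a single Reduce task, it only takes effect once job.setNumReduceTasks(n) requests more than one Reducer.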
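
For side file distribution, a Mapper typically loads the cached file once in setup(). The sketch below assumes a small, tab-separated lookup file registered in the driver with job.addCacheFile(); the file name, path, and record format are hypothetical.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class LookupJoinMapper extends Mapper<LongWritable, Text, Text, Text> {
    private final Map<String, String> lookup = new HashMap<>();

    @Override
    protected void setup(Context context) throws IOException {
        // Files registered in the driver, for example with
        //   job.addCacheFile(new URI("/meta/lookup.txt"));  // hypothetical path
        // are localized on each node before the task starts.
        URI[] cacheFiles = context.getCacheFiles();
        if (cacheFiles != null && cacheFiles.length > 0) {
            // The localized copy is symlinked into the task's working
            // directory under the file's base name.
            try (BufferedReader reader =
                     new BufferedReader(new FileReader("lookup.txt"))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    String[] parts = line.split("\t", 2);
                    if (parts.length == 2) lookup.put(parts[0], parts[1]);
                }
            }
        }
    }
    // map() would then join each input record against the in-memory table.
}
```

Because every task loads the whole file into memory, this pattern is only appropriate for small side files, which is exactly the caveat in the takeaway above.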
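
Custom counters hang off an enum: each constant becomes a counter that Hadoop aggregates across all tasks and reports alongside the built-in counters. The enum name and the three-field record shape below are assumptions for illustration.

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class ParsingMapper extends Mapper<LongWritable, Text, Text, Text> {
    // Each enum constant is a global, job-level counter.
    public enum RecordQuality { WELL_FORMED, MALFORMED }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        if (value.toString().split("\t").length == 3) {  // assumed record shape
            context.getCounter(RecordQuality.WELL_FORMED).increment(1);
            // ... emit the parsed record ...
        } else {
            context.getCounter(RecordQuality.MALFORMED).increment(1);
        }
    }
}
```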
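
Finally, a sketch of the compression knobs: intermediate Map output and final job output are compressed independently. The pairing below, Snappy for the shuffle and gzip for the output, is one reasonable choice, assuming the Snappy native libraries are available on the cluster.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CompressionSettings {
    // Call from the driver before submitting the job.
    public static void enableCompression(Job job) {
        Configuration conf = job.getConfiguration();

        // Compress intermediate Map output: a fast codec such as Snappy
        // trades a little CPU for substantially less shuffle I/O.
        conf.setBoolean("mapreduce.map.output.compress", true);
        conf.setClass("mapreduce.map.output.compress.codec",
                      SnappyCodec.class, CompressionCodec.class);

        // Compress the final job output: gzip is denser but not splittable,
        // so the right choice depends on how the output is consumed.
        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
    }
}
```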

In the next chapter, we will look at Pig, a framework for scripting MapReduce jobs on Hadoop. Pig provides higher-level relational operators that users can employ for data transformations, eliminating the need to write low-level MapReduce code in Java.