Batch layer
Batch processing is one of the most traditional ways of processing large amounts of data, and batch jobs are usually long-running processes. With the advent of recent big data technologies, batch processes have become much more efficient and performant, which has greatly reduced processing times.
A batch process is usually aware of the data it is expected to consume and the output it is expected to produce. Historically, these processes were monolithic in nature: they would process the entire dataset in a single run, perhaps with some level of multi-threading, and relied on specific mechanisms for handling failure scenarios and on operational procedures to keep them running in production.
Hadoop, as a big data technology, provided the framework and technology support for building batch processes that were more efficient and scalable than traditional ones. Hadoop came with the two major components required for executing batch processes: processing (Map-Reduce) and storage (HDFS). Hadoop batch processes proved faster than regular batch processes primarily for the following reasons:
- Fast and optimized execution of processes using the Map-Reduce paradigm.
- Sequential storage for fast sequential reads and writes.
- Replicated storage for higher availability of data.
- Execution of processes near the data they operate on (data locality), managed by job schedulers.
These capabilities of a Hadoop-based batch process provided immense improvements over traditionally built batch processes: data distribution and process distribution are managed by the underlying Hadoop framework, while the mapper and reducer jobs focus on the specific data processing logic.
Figure 08: The Map-Reduce paradigm - batch processing
The Map-Reduce paradigm of process execution is not new; it has been used in many applications ever since mainframe systems came into existence. It is based on divide and conquer and stems from the traditional multi-threading model. The primary mechanism is to divide the batch across multiple processes and then combine (reduce) the output of all the processes into a single result. Each process runs in parallel but independently of the others, on its own partition of the data. This ensures that the data is processed at least once; the outputs of the individual processes are combined and the results de-duplicated if necessary. With these capabilities built into the framework, batch execution proved to be highly optimized and helped Hadoop move into solving mainstream batch problems. Such batch processes also provided a good window to derive more business intelligence from data processing, and they can embed more sophisticated capabilities such as data science and machine learning to serve batch-oriented analytical needs. But there was always the question: what can be done for real-time needs?
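The divide, process-in-parallel, and combine flow described above can be sketched in plain Python with a word-count job. This is a toy illustration of the paradigm, not Hadoop's actual API; the partitioning scheme, worker count, and function names are assumptions made for the example:

```python
from collections import Counter
from multiprocessing import Pool

def mapper(lines):
    # Map phase: produce partial word counts for one partition of the input.
    counts = Counter()
    for line in lines:
        for word in line.split():
            counts[word] += 1
    return counts

def reducer(partial_counts):
    # Reduce phase: combine the independent partial results into one output.
    total = Counter()
    for counts in partial_counts:
        total.update(counts)
    return total

def run_batch(lines, workers=4):
    # Divide the batch into partitions, map them in parallel, then reduce.
    partitions = [lines[i::workers] for i in range(workers)]
    with Pool(workers) as pool:
        partials = pool.map(mapper, partitions)
    return reducer(partials)

if __name__ == "__main__":
    data = ["big data batch", "batch processing", "big batch"]
    print(run_batch(data))
```

In Hadoop the framework, not the programmer, handles the partitioning, scheduling, and shuffling between the map and reduce phases; the developer supplies only the mapper and reducer logic, as in the two small functions above.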
Multiple frameworks originated as an answer to the near-real-time needs of data processing. The Lambda Architecture provides mechanisms to use some of these frameworks, mainly in its speed layer.
One of the early attempts to achieve real-time processing was to trigger frequent batch processes. However, this approach could never get close to near-real-time expectations.
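That frequent-batch workaround can be sketched as a loop that periodically runs a small batch over whatever records arrived since the previous run. This is a hypothetical illustration (the queue, interval, and function names are assumptions for the example); it makes the limitation visible: end-to-end latency can never drop below the batch interval plus the batch's own run time.

```python
import time
from queue import Queue, Empty

def process_batch(records):
    # Stand-in for a real batch job: here we just count the records.
    return len(records)

def micro_batch_loop(source: Queue, interval=1.0, max_batches=3):
    # Trigger a small batch every `interval` seconds over the records
    # that have accumulated since the previous run.
    results = []
    for _ in range(max_batches):
        time.sleep(interval)  # latency floor: at least `interval` per record
        records = []
        while True:
            try:
                records.append(source.get_nowait())
            except Empty:
                break
        results.append(process_batch(records))
    return results
```

However small the interval is made, each record still waits for the next scheduled run, which is why this style of micro-batching only approximates, and never reaches, true record-at-a-time real-time processing.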