Data Lake for Enterprises

Data acquisition layer - get data from source systems

In an organization, data exists in various forms, which can be classified as structured data, semi-structured data, or unstructured data.

Examples of structured data include relational databases, XML/JSON data, and messages exchanged between systems. Semi-structured data is also prevalent in organizations, particularly in the form of e-mails, chats, and documents. Unstructured data exists in the workplace as images, videos, raw text, and audio.

It may not always be possible to define a schema for all of these types of data. Schemas are very useful when translating data into meaningful information. While defining a schema for structured data is usually straightforward, it is often difficult or impractical to define one up front for semi-structured or unstructured data.

One of the key roles of the acquisition layer is to convert the data into messages that can be processed further in the Data Lake; the layer is therefore expected to be flexible enough to accommodate a variety of schema specifications. At the same time, it must have a fast connection mechanism to push the translated data messages into the Data Lake seamlessly.

Figure 03: Data acquisition components

A data acquisition layer may be composed of multiple connector components on the acquisition side, which push the acquired data to a specific target destination. In the case of a Data Lake, the target destination is the messaging layer.

There are specific technology frameworks that enable low-latency acquisition of data from various types of source systems; for every data type, the acquisition connectors are generally required to be configured/implemented depending on the framework used. The data acquisition layer is expected to perform limited transformation on the data acquired so as to minimize the latency. The transformation within the data acquisition layer should be performed only to convert the acquired data into a message/event so that it can be posted to the messaging layer.
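The idea of limiting transformation to message conversion can be sketched as follows. This is a minimal illustration in Python, not the approach of any particular framework: the function name `to_message` and the envelope fields (`id`, `source`, `content_type`, `acquired_at`, `payload`) are assumptions made for this example. The acquired record is wrapped as-is, and only the metadata needed by downstream layers is added.

```python
import json
import time
import uuid


def to_message(source: str, payload, content_type: str) -> bytes:
    """Wrap an acquired record in a message envelope without altering it.

    The envelope layout is hypothetical; a real deployment would follow
    whatever message format its messaging layer expects.
    """
    envelope = {
        "id": str(uuid.uuid4()),        # unique message identifier
        "source": source,               # name of the originating system
        "content_type": content_type,   # hint for downstream parsers
        "acquired_at": time.time(),     # acquisition timestamp (epoch seconds)
        "payload": payload,             # the record itself, left untouched
    }
    return json.dumps(envelope).encode("utf-8")


# Example: a row read from a (hypothetical) relational source.
row = {"order_id": 42, "amount": 19.99}
message = to_message("orders_db", row, "application/json")
decoded = json.loads(message)
```

Because the payload is passed through unchanged, any heavier cleansing or enrichment is left to later layers, which keeps the acquisition step fast.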

If the messaging layer is not reachable (because of a network outage or downtime of the messaging layer itself), the data acquisition layer must also support the required fail-safe and fail-over mechanisms.

For this layer to be fail-safe, it should support local, persistent buffering of messages so that, if needed, the messages can be recovered from the local buffer once the messaging layer is available again. This layer should also support fail-over: if one of the data acquisition processes fails, another process takes over seamlessly.
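The local buffering behavior described above might look like the following sketch. It is an illustration under stated assumptions rather than a production design: `BufferedPublisher` and the `send` callable are hypothetical names, the messaging layer is assumed to signal unavailability by raising `ConnectionError`, and SQLite stands in for whatever persistent store a real framework would use. Fail-over between processes is not covered here.

```python
import sqlite3


class BufferedPublisher:
    """Publish messages, buffering them locally while the target is down."""

    def __init__(self, send, db_path=":memory:"):
        # `send` is a hypothetical callable that delivers one message and
        # raises ConnectionError when the messaging layer is unreachable.
        # Pass a file path as db_path so the buffer survives a restart;
        # ":memory:" is used here only to keep the demo self-contained.
        self._send = send
        self._db = sqlite3.connect(db_path)
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS buffer "
            "(id INTEGER PRIMARY KEY AUTOINCREMENT, body BLOB NOT NULL)"
        )
        self._db.commit()

    def pending(self) -> int:
        """Return the number of messages waiting in the local buffer."""
        return self._db.execute("SELECT COUNT(*) FROM buffer").fetchone()[0]

    def flush(self) -> int:
        """Try to deliver buffered messages in order; return how many were sent."""
        sent = 0
        rows = self._db.execute("SELECT id, body FROM buffer ORDER BY id").fetchall()
        for row_id, body in rows:
            try:
                self._send(body)
            except ConnectionError:
                break  # still unreachable; keep the remaining messages
            self._db.execute("DELETE FROM buffer WHERE id = ?", (row_id,))
            self._db.commit()
            sent += 1
        return sent

    def publish(self, body: bytes) -> None:
        """Deliver a message, or buffer it if delivery is not possible."""
        self.flush()  # older messages go first to preserve ordering
        if self.pending() == 0:
            try:
                self._send(body)
                return
            except ConnectionError:
                pass
        self._db.execute("INSERT INTO buffer (body) VALUES (?)", (body,))
        self._db.commit()


# Demo with a fake messaging layer that starts out unreachable.
state = {"up": False}
delivered = []


def send(body):
    if not state["up"]:
        raise ConnectionError("messaging layer unreachable")
    delivered.append(body)


publisher = BufferedPublisher(send)
publisher.publish(b"event-1")
publisher.publish(b"event-2")
buffered_during_outage = publisher.pending()

state["up"] = True             # the messaging layer comes back
publisher.publish(b"event-3")  # drains the buffer, then sends the new message
```

Flushing before each new publish is one simple way to keep delivery in arrival order; a real implementation would more likely retry in the background.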

Figure 04: Data Acquisition Component Design

For this layer to support low-latency acquisition, it needs to be built on fast and scalable parsing and transformation components.

As shown in the preceding figure, an acquisition layer’s simplified component view comprises connectors, data parsers, data transformers, and a message publisher. We will be discussing these components in detail in specific chapters in the context of the specific technologies and frameworks.
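To make the component view concrete, the four roles can be sketched as a small pipeline. This is only an illustrative arrangement: the function and class names (`connector`, `parser`, `transformer`, `ListPublisher`, `run_pipeline`) are hypothetical, the source is an in-memory CSV string rather than a real system, and the publisher simply collects messages in a list instead of posting to a messaging layer.

```python
import csv
import io
import json


def connector(source_text: str):
    """Connector: open the source and expose its raw lines."""
    return io.StringIO(source_text)


def parser(raw_lines):
    """Data parser: turn raw CSV lines into records (dicts keyed by header)."""
    return csv.DictReader(raw_lines)


def transformer(record: dict, source: str) -> str:
    """Data transformer: convert a parsed record into a message."""
    return json.dumps({"source": source, "payload": record}, sort_keys=True)


class ListPublisher:
    """Message publisher stand-in that keeps messages in memory."""

    def __init__(self):
        self.messages = []

    def publish(self, message: str) -> None:
        self.messages.append(message)


def run_pipeline(source_text: str, source: str, publisher) -> int:
    """Run connector -> parser -> transformer -> publisher; return message count."""
    count = 0
    for record in parser(connector(source_text)):
        publisher.publish(transformer(record, source))
        count += 1
    return count


sample = "id,name\n1,alice\n2,bob\n"
publisher = ListPublisher()
published = run_pipeline(sample, "customers_csv", publisher)
first = json.loads(publisher.messages[0])
```

Keeping each role behind its own small function means a connector or parser for a different source type can be swapped in without touching the rest of the pipeline.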