Modern Big Data Processing with Hadoop

Truncation

Another variant of erasing is truncation, where we cut all the input data down to a uniform size. This is useful when we are sure that the resulting information loss is acceptable in the later stages of the pipeline.
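As a minimal sketch of uniform-size truncation, the following Python snippet keeps only the first few characters of each value. The function name `truncate_to_width`, the sample values, and the width of 6 are assumptions chosen for illustration and do not come from the book.

```python
def truncate_to_width(value: str, width: int) -> str:
    """Keep at most `width` leading characters; the rest is discarded."""
    if width < 0:
        raise ValueError("width must be non-negative")
    return value[:width]


# Hypothetical input values, used only to demonstrate the idea.
records = ["Hadoop Distributed File System", "MapReduce", "YARN"]

# Values longer than 6 characters are cut; shorter ones are left unchanged.
truncated = [truncate_to_width(r, 6) for r in records]
print(truncated)
```

Note that this sketch only enforces a maximum length. Values that are already shorter than the chosen width stay as they are, so padding would be an extra step if strictly equal lengths were required.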

Truncation can also be intelligent, where we use what we know about the data we are dealing with. Consider this example of email addresses:

Input                  Output    What's truncated
alice@localhost.com    alice     @localhost.com
bob@localhost.com      bob       @localhost.com
rob@localhost.com      rob       @localhost.com

From the preceding examples, we can see that the domain portion of every email address is truncated, because all of the addresses belong to the same domain. This technique saves storage space.
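The email example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper name `truncate_domain` and the choice to leave addresses from other domains untouched are hypothetical and are not taken from the book.

```python
def truncate_domain(email: str, domain: str = "@localhost.com") -> tuple[str, str]:
    """Return (output, truncated_part) for one email address.

    The domain is removed only when the address ends with the known,
    shared domain. Other addresses are returned unchanged (an assumption
    made here so that no unexpected information is lost).
    """
    if email.endswith(domain):
        return email[: -len(domain)], domain
    return email, ""


emails = ["alice@localhost.com", "bob@localhost.com", "rob@localhost.com"]
rows = [truncate_domain(e) for e in emails]

for original, (output, removed) in zip(emails, rows):
    print(original, output, removed)
```

Because the shared domain is known in advance, only the local part needs to be stored for each address, which is where the storage saving comes from.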