
Cloud computing platforms
Traditionally, many large organizations have invested in expensive data centers to house their business-critical computing systems. These data centers are integrated with their corporate networks, giving users access both to the data stored within them and to additional processing capacity. One of the main advantages for large organizations in maintaining their own data centers is security: both data and processing capacity are kept on-premise, under their control and administration, within largely closed networks.
However, with the advent of big data and more accessible artificial intelligence and machine learning-driven analytics, organizations need to store, model, process, and analyze huge volumes of data. Doing so requires scalable hardware and software, and potentially distributed clusters containing hundreds or thousands of physical or virtual nodes, which quickly makes maintaining your own 24/7 data centers less and less cost-effective.
Cloud computing platforms, such as Amazon Web Services (AWS), Microsoft Azure, and Google Cloud Platform (GCP), provide a means to offload some or all of an organization's data storage, management, and processing requirements to remote platforms that are managed by technology companies and accessible over the internet. Today, these cloud computing platforms offer much more than just a place to remotely store an organization's ever-increasing data estate. They offer scalable storage, computational processing, administration, data governance, and security services, as well as access to artificial intelligence and machine learning libraries and frameworks. These platforms also tend to offer a Pay-As-You-Go (PAYG) pricing model, meaning that you only pay for the storage and processing capacity that you actually use, and that capacity can easily be scaled up or down depending on requirements.
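For instance, offloading data storage can require only a few lines of code against the cloud provider's SDK. The following is a minimal sketch using the AWS SDK for Python (boto3), assuming that boto3 is installed and that AWS credentials have already been configured; the bucket, file, and object names are hypothetical placeholders:

    # Minimal sketch: offloading a local dataset to a cloud object store.
    # Assumes boto3 is installed and AWS credentials are already configured.
    import boto3

    s3 = boto3.client("s3")

    # Upload a local file to a hypothetical S3 bucket. Under the PAYG model,
    # you are billed only for the storage that you actually consume.
    s3.upload_file(
        Filename="transactions.csv",
        Bucket="my-organization-data-lake",
        Key="raw/transactions.csv",
    )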
Organizations that are wary of storing their sensitive data on remote platforms accessible over the internet tend instead to architect and engineer hybrid systems in which, for example, sensitive data remains on-premise while computational processing on anonymized or unattributable data is offloaded to the cloud.
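As an illustrative sketch of this hybrid pattern, the following Python snippet pseudonymizes a sensitive identifier on-premise, using a salted hash, so that only the anonymized output ever leaves the corporate network; the file and column names are hypothetical placeholders:

    # Minimal sketch: pseudonymize sensitive identifiers on-premise before
    # offloading the resulting file to the cloud for processing.
    import csv
    import hashlib

    def pseudonymize(value, salt="on-premise-secret-salt"):
        """Replace a sensitive value with a salted SHA-256 digest."""
        return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()

    with open("customers.csv", newline="") as src, \
         open("customers_anonymized.csv", "w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            # customer_id is a hypothetical sensitive identifier column
            row["customer_id"] = pseudonymize(row["customer_id"])
            writer.writerow(row)

    # Only customers_anonymized.csv is then uploaded to the cloud.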
A well-architected and well-engineered system should provide a layer of abstraction between its infrastructure and its end users, including data analysts and data scientists; whether the data storage and processing infrastructure is on-premise or cloud-based should be invisible to these types of users. Furthermore, many of the distributed technologies and processing engines that we have discussed so far tend to be written in Java or C++, but expose their APIs or programming models to other languages, such as Python, Scala, and R. This makes them accessible to a wide range of end users and deployable on any machine that can run a JVM or compile C++ code. A significant number of the services offered by cloud computing platforms are, in fact, commercial service wrappers built around these open source technologies, with guaranteed availability and support. Therefore, once system administrators and end users become familiar with a particular class of technology, migrating to a cloud computing platform becomes a matter of configuration to optimize performance rather than learning an entirely new way to store, model, and process data. This is important, since a significant cost for many organizations is the training of their staff; if underlying technologies and frameworks can be reused as much as possible, then this is preferable to migrating to entirely new storage and processing paradigms.
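To illustrate this abstraction, the following sketch, assuming PySpark is available, shows an application that is agnostic as to where its data resides; moving from an on-premise cluster to a cloud object store is simply a matter of changing the (hypothetical) storage path and associated configuration, while the processing code itself is unchanged:

    # Minimal sketch: the same PySpark code runs against on-premise or
    # cloud-based storage; only the path configuration changes.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("storage-agnostic-example").getOrCreate()

    # Swap one hypothetical path for the other without touching the pipeline:
    # path = "hdfs://on-prem-namenode:8020/data/transactions.csv"  # on-premise
    path = "s3a://my-organization-data-lake/raw/transactions.csv"  # cloud

    df = spark.read.option("header", "true").csv(path)
    df.groupBy("account_type").count().show()  # account_type is a hypothetical column

    spark.stop()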