What this book covers
The book contains two logical parts of roughly equal length. In the first half, I lay down the theme of the book, which is the need to bridge the gap between data science and engineering, including in-depth details about the Jupyter + PixieDust solution I'm proposing. The second half is dedicated to applying what we learned in the first half to four industry use cases.
In Chapter 1, Programming and Data Science – A New Toolset, I attempt to provide a definition of data science through the prism of my own experience of building a data pipeline that performs sentiment analysis on Twitter posts. I defend the idea that data science is a team sport and that, most often, silos exist between the data science and engineering teams, causing unnecessary friction, inefficiencies, and, ultimately, a failure to realize its full potential. I also argue that data science is here to stay and that, eventually, it will become an integral part of what is known today as computer science (I like to think that someday new terms will emerge, such as computer data science, that better capture this duality).
In Chapter 2, Python and Jupyter Notebooks to Power your Data Analysis, I start diving into popular data science tools such as Python and its ecosystem of open-source libraries dedicated to data science, and of course Jupyter Notebooks. I explain why I think Jupyter Notebooks will become the big winner in the next few years. I also introduce the capabilities of the PixieDust open-source library, starting with the simple display() method, which lets the user visually explore data in an interactive user interface by building compelling charts. With this API, the user can choose from multiple rendering engines such as Matplotlib, Bokeh, Seaborn, and Mapbox. The display() capability was the only feature in the PixieDust MVP (minimum viable product), but over time, as I interacted with many data science practitioners, I added new features to what would quickly become the PixieDust toolbox (a short code sketch follows the list below):
- sampleData(): A simple API for easily loading data into pandas and Apache Spark DataFrames
- wrangle_data(): A simple API for cleaning and massaging datasets. This capability includes the ability to destructure columns into new columns, using regular expressions to extract content from unstructured text. The wrangle_data() API can also make recommendations based on predefined patterns.
- PackageManager: Lets the user install third-party Apache Spark packages inside a Python Notebook.
- Scala Bridge: Enables the user to run Scala code inside a Python Notebook. Variables defined on the Python side are accessible in Scala, and vice versa.
- Spark Job Progress Monitor: Lets you track the status of your Spark Job with a real-time progress bar that displays directly in the output cell of the code being executed.
- PixieApp: Provides a programming model centered around HTML/CSS that lets developers build sophisticated dashboards to operationalize the analytics built in the Notebook. PixieApps can run directly in the Jupyter Notebook or be deployed as analytic web applications using the PixieGateway microservice. PixieGateway is an open-source companion project to PixieDust.
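To give a feel for how lightweight the core API is, here is a minimal sketch of loading and exploring a dataset with PixieDust in a Jupyter Notebook; the CSV URL is a placeholder, and this assumes PixieDust is installed in the environment:

```python
# In a Jupyter Notebook cell (assumes `pip install pixiedust` was run)
import pixiedust

# sampleData() loads a CSV from a URL (or a built-in sample dataset by ID)
# into a pandas or Apache Spark DataFrame
df = pixiedust.sampleData("https://example.com/data.csv")  # placeholder URL

# display() opens the interactive UI, where you can switch between a table
# view and chart types, and choose renderers such as Matplotlib, Bokeh, or Mapbox
display(df)
```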
The following diagram summarizes the PixieDust development journey, including recent additions such as the PixieGateway and the PixieDebugger, which is the first visual Python debugger for Jupyter Notebooks:
PixieDust journey
One key message to take away from this chapter is that PixieDust is first and foremost an open-source project that lives and breathes through the contributions of the developer community. As is the case for countless open-source projects, we can expect many more breakthrough features to be added to PixieDust over time.
In Chapter 3, Accelerate your Data Analysis with Python Libraries, I take the reader through a deep dive into the PixieApp programming model, illustrating each concept along the way with a sample application that analyzes GitHub data. I start with a high-level description of the anatomy of a PixieApp, including its life cycle and its execution flow, built around the concept of routes. I then go over the details of how developers can use regular HTML and CSS snippets to build the UI of the dashboard, seamlessly interacting with the analytics and leveraging the PixieDust display() API to add sophisticated charts.
The PixieApp programming model is the cornerstone of the tooling strategy for bridging the gap between data science and engineering, as it streamlines the process of operationalizing the analytics, thereby increasing collaboration between data scientists and developers and reducing the time-to-market of the application.
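To make the concept of routes more concrete, here is a minimal sketch of a PixieApp (a toy example, not the GitHub sample from the chapter) with a default route and a second route triggered by a button click:

```python
from pixiedust.display.app import *  # brings in the PixieApp and route decorators

@PixieApp
class HelloPixieApp:
    # Default route: invoked when the app starts
    @route()
    def main_screen(self):
        # pd_options sets the route parameters for the next request
        return """<button type="submit" pd_options="clicked=true">
            Click me
        </button>"""

    # This route matches when the "clicked" parameter equals "true"
    @route(clicked="true")
    def clicked_screen(self):
        return "<div>Hello from a PixieApp route</div>"

# Run the app directly in the Notebook output cell
HelloPixieApp().run()
```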
In Chapter 4, Publish your Data Analysis to the Web - the PixieApp Tool, I discuss the PixieGateway microservice, which enables developers to publish PixieApps as analytical web applications. I start by showing how to quickly deploy a PixieGateway microservice instance, both locally and on the cloud as a Kubernetes container. I then go over the PixieGateway admin console capabilities, including the various configuration profiles and how to live-monitor the deployed PixieApp instances and their associated backend Python kernels. I also feature the chart-sharing capability of the PixieGateway, which lets the user turn a chart created with the PixieDust display() API into a web page accessible to anyone on the team.
The PixieGateway is a ground-breaking innovation with the potential to seriously speed up the operationalization of analytics, which is sorely needed today to fully capitalize on the promise of data science. It represents an open-source alternative to similar products already on the market, such as Shiny Server from RStudio (https://shiny.rstudio.com/deploy) and Dash from Plotly (https://dash.plot.ly).
In Chapter 5, Python and PixieDust Best Practices and Advanced Concepts, I complete the deep dive into the PixieDust toolbox by going over advanced concepts of the PixieApp programming model:
- @captureOutput decorator: By default, PixieApp routes require developers to provide an HTML fragment that will be injected into the application UI. This is a problem when we want to call a third-party Python library that is not aware of the PixieApp architecture and generates its output directly to the Notebook. @captureOutput solves this problem by automatically redirecting the content generated by the third-party library and encapsulating it in a proper HTML fragment (see the sketch after this list).
- Leveraging Python class inheritance for greater modularity and code reuse: Breaks down the PixieApp code into logical classes that can be composed together using Python's class inheritance capability. I also show how to call an external PixieApp using the pd_app custom attribute.
- PixieDust support for streaming data: Shows how the PixieDust display() API and PixieApps can also handle streaming data.
- Implementing dashboard drill-downs with PixieApp events: Provides a mechanism that lets PixieApp components publish and subscribe to events generated when the user interacts with the UI (for example, charts and buttons).
- Building a custom display renderer for the PixieDust display() API: Walks through the code of a simple renderer that extends the PixieDust menus. This renderer displays a custom HTML table showing the selected data.
- Debugging techniques: Goes over the various debugging techniques that PixieDust offers, including the visual Python debugger called PixieDebugger and the %%PixiedustLog magic for displaying Python logging messages.
- Ability to run Node.js code: Discusses the pixiedust_node extension, which manages the life cycle of a Node.js process responsible for executing arbitrary Node.js scripts directly from within the Python Notebook.
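As an illustration of the first item above, here is a minimal sketch of a route decorated with @captureOutput, so that a Matplotlib chart, which writes directly to the Notebook output, is captured into the PixieApp UI:

```python
import matplotlib.pyplot as plt
from pixiedust.display.app import *

@PixieApp
class CaptureOutputDemo:
    @route()
    @captureOutput  # redirects direct Notebook output into the app's HTML
    def main_screen(self):
        # This plotting code knows nothing about PixieApp; without
        # @captureOutput its output would not appear in the app's UI
        plt.plot([1, 2, 3, 4], [10, 20, 25, 30])
        plt.title("Captured Matplotlib output")
        plt.show()

CaptureOutputDemo().run()
```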
Thanks to the open-source model, with its transparent development process and a growing community of users who provided valuable feedback, we were able to prioritize and implement many of these advanced features over time. The key point I'm trying to make is that following an open-source model with an appropriate license (PixieDust uses the Apache 2.0 license, available at https://www.apache.org/licenses/LICENSE-2.0) does work very well. It helped us grow the community of users, which in turn provided us with the necessary feedback to prioritize the new features we knew were high value, and in some instances contributed code in the form of GitHub pull requests.
In Chapter 6, Analytics Study: AI and Image Recognition with TensorFlow, I dive into the first of the four industry use cases. I start with a high-level introduction to machine learning, followed by an introduction to deep learning, a subfield of machine learning, and the TensorFlow framework, which makes it easier to build neural network models. I then proceed to build an image recognition sample application, including the associated PixieApp, in four parts:
- Part 1: Builds an image recognition TensorFlow model using the pretrained ImageNet model. Following the TensorFlow for poets tutorial, I show how to build analytics that load and score a neural network model (a stand-in sketch follows this list).
- Part 2: Creates a PixieApp that operationalizes the analytics created in Part 1. This PixieApp scrapes the images from a web page URL provided by the user, scores them against the TensorFlow model and then graphically shows the results.
- Part 3: I show how to integrate the TensorBoard Graph Visualization component directly in the Notebook, providing the ability to debug the neural network model.
- Part 4: I show how to retrain the model with custom training data and update the PixieApp to show the results from both models.
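As a stand-in illustration of what "loading and scoring a neural network model" looks like, here is a minimal sketch that scores an image against a pretrained Keras MobileNetV2 ImageNet model; note that the chapter itself follows the TensorFlow for poets retraining workflow rather than this Keras API, and the image path is a placeholder:

```python
import numpy as np
from tensorflow.keras.applications import mobilenet_v2
from tensorflow.keras.preprocessing import image

# Load a model pretrained on ImageNet
model = mobilenet_v2.MobileNetV2(weights="imagenet")

# Load and preprocess a test image (placeholder path)
img = image.load_img("cat.jpg", target_size=(224, 224))
x = mobilenet_v2.preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))

# Score the image and decode the top-3 ImageNet labels
preds = model.predict(x)
print(mobilenet_v2.decode_predictions(preds, top=3)[0])
```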
I decided to start the series of sample applications with deep learning image recognition using TensorFlow because it's an important use case that is growing in popularity, and because demonstrating that we can build the models and deploy them in an application within the same Notebook makes a powerful statement in support of the book's theme of bridging the gap between data science and engineering.
In Chapter 7, Analytics Study: NLP and Big Data with Twitter Sentiment Analysis, I talk about doing natural language processing at Twitter scale. In this chapter, I show how to use the IBM Watson Natural Language Understanding cloud-based service to perform sentiment analysis of tweets. This is very important because it reminds the reader that reusing managed hosted services, rather than building the capability in-house, can sometimes be an attractive option.
I start with an introduction to the Apache Spark parallel computing framework, and then move on to building the application in four parts:
- Part 1: Acquiring the Twitter data with Spark Structured Streaming (see the sketch after this list)
- Part 2: Enriching the data with the sentiment and the most relevant entity extracted from the text
- Part 3: Operationalizing the analytics by creating a real-time dashboard PixieApp
- Part 4: An optional section that re-implements the application with Apache Kafka and the IBM Streaming Designer hosted service to demonstrate how to add greater scalability
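For readers new to Spark Structured Streaming, here is a minimal sketch of the acquisition pattern used in Part 1, assuming tweets arrive as text lines on a local socket; this socket source is a stand-in for the Twitter feed wiring done in the chapter:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("TwitterSentiment").getOrCreate()

# Treat a local socket as an unbounded streaming DataFrame
# (stand-in for the Twitter feed used in the chapter)
tweets = (spark.readStream
          .format("socket")
          .option("host", "localhost")
          .option("port", 9999)
          .load())

# Continuously append incoming tweets to an in-memory table
# that can be queried from the Notebook
query = (tweets.writeStream
         .format("memory")
         .queryName("tweets")
         .outputMode("append")
         .start())

spark.sql("SELECT * FROM tweets").show()
```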
I think readers, especially those who are not familiar with Apache Spark, will enjoy this chapter, as it is a little easier to follow than the previous one. The key takeaway is how to build analytics that scale using Jupyter Notebooks connected to a Spark cluster.
In Chapter 8, Analytics Study: Prediction - Financial Time Series Analysis and Forecasting, I talk about time series analysis, a very important field of data science with many practical applications in industry. I start the chapter with a deep dive into the NumPy library, which is foundational to so many other libraries, such as pandas and SciPy. I then proceed with the building of the sample application, which analyzes a time series composed of historical stock data, in two parts:
- Part 1: Provides a statistical exploration of the time series, including various charts such as the autocorrelation function (ACF) and the partial autocorrelation function (PACF)
- Part 2: Builds a predictive model based on the ARIMA algorithm using the statsmodels Python library (see the sketch after this list)
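As a taste of Part 2, here is a minimal sketch of fitting an ARIMA model and forecasting with the current statsmodels API, using a synthetic series in place of the historical stock data from the chapter:

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# Synthetic random-walk series standing in for historical stock prices
np.random.seed(42)
prices = pd.Series(100 + np.cumsum(np.random.randn(250)))

# Fit an ARIMA(p=1, d=1, q=1) model; the chapter explains how the ACF and
# PACF plots guide the choice of the (p, d, q) order
result = ARIMA(prices, order=(1, 1, 1)).fit()

# Forecast the next 5 periods
print(result.forecast(steps=5))
```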
Time series analysis is an important field of data science that I consider underrated. I personally learned a lot while writing this chapter. I certainly hope that the reader will enjoy it as well, and that reading it will spur an interest in learning more about this great topic. If that's the case, I also hope you'll be convinced to try out Jupyter and PixieDust as you continue to explore time series analysis.
In Chapter 9, Analytics Study: Graph Algorithms - US Domestic Flight Data Analysis, I complete this series of industry use cases with the study of graphs. I chose a sample application that analyzes flight delays because the data is readily available and it's a good fit for graph algorithms (in full disclosure, I may also have chosen it because I had already written a similar application that predicts flight delays based on weather data using Apache Spark MLlib: https://developer.ibm.com/clouddataservices/2016/08/04/predict-flight-delays-with-apache-spark-mllib-flightstats-and-weather-data).
I start with an introduction to graphs and associated algorithms, including several of the most popular ones, such as breadth-first search and depth-first search. I then proceed with an introduction to the networkx Python library, which is used to build the sample application.
The application is made of four parts:
- Part 1: Shows how to load the US domestic flight data into a graph.
- Part 2: Creates the USFlightsAnalysis PixieApp, which lets the user select an origin and a destination airport and then displays a Mapbox map of the shortest path between the two airports, according to a selected centrality measure (see the sketch after this list)
- Part 3: Adds data exploration to the PixieApp, including various statistics for each airline that flies out of the selected origin airport
- Part 4: Uses the techniques learned in Chapter 8, Analytics Study: Prediction - Financial Time Series Analysis and Forecasting, to build an ARIMA model for predicting flight delays
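To illustrate the kind of analysis Part 2 performs, here is a minimal sketch that uses networkx to compute a shortest path over a toy flight graph; the airport codes and edge weights are made up, whereas the chapter derives its weights from the selected centrality measure:

```python
import networkx as nx

# Toy flight graph: nodes are airports, edge weights are made-up costs
G = nx.Graph()
G.add_edge("BOS", "JFK", weight=1.0)
G.add_edge("JFK", "DEN", weight=3.0)
G.add_edge("BOS", "ORD", weight=2.0)
G.add_edge("ORD", "SFO", weight=4.0)
G.add_edge("DEN", "SFO", weight=2.0)

# Shortest path by total edge weight (Dijkstra under the hood)
print(nx.shortest_path(G, "BOS", "SFO", weight="weight"))

# One of the centrality measures that can be used to re-weight the graph
print(nx.degree_centrality(G))
```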
Graph theory is another important and growing field of data science, and this chapter nicely rounds out the series, which I hope provides a diverse and representative set of industry use cases. For readers who are particularly interested in applying graph algorithms to big data, I recommend looking at Apache Spark GraphX (https://spark.apache.org/graphx), which implements many of the graph algorithms using a very flexible API.
In Chapter 10, The Future of Data Analysis and Where to Develop your Skills, I end the book with a brief summary and my take on Drew Conway's Venn diagram. I then talk about the future of AI and data science, and how companies could prepare themselves for the AI and data science revolution. I also list some great references for further learning.
Appendix, PixieApp Quick-Reference, is a developer quick-reference guide that provides a summary of all the PixieApp attributes. It explains the various annotations, custom HTML attributes, and methods with the help of appropriate examples.
But enough of introductions: let's get started on our journey with the first chapter, Programming and Data Science – A New Toolset.