Back to our sentiment analysis of Twitter hashtags project
The quick data pipeline prototype we built gave us a good understanding of the data, but then we needed to design a more robust architecture and make our application enterprise-ready. Our primary goal was still to gain experience in building data analytics, not to spend too much time on the data engineering part. This is why we tried to leverage open source tools and frameworks as much as possible:
- Apache Kafka: to process the high volume of tweets in a reliable and fault-tolerant way.
- Apache Spark: provides a programming interface that abstracts the complexity of parallel computing.
- Jupyter Notebooks: users remotely connect to a computing environment (kernel) to create advanced data analytics. Jupyter kernels support a variety of programming languages (Python, R, Java/Scala, and so on) as well as multiple computing frameworks (Apache Spark, Hadoop, and so on).
For the sentiment analysis part, we decided to replace the code we wrote using the textblob Python library with the Watson Tone Analyzer service, which provides sentiment analysis including detection of emotional, language, and social tones. Even though the Tone Analyzer is not open source, a free version that can be used for development and trial is available on IBM Cloud (https://www.ibm.com/cloud).
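To give an idea of how the enrichment step consumes the service's output, here is a minimal sketch of extracting document-level tone scores from a Tone Analyzer-style JSON response. The sample payload and the helper function name are illustrative assumptions, not the exact service response:

```python
import json

def extract_tone_scores(response_json):
    """Map each document-level tone_id to its score.

    Assumes a response shaped like the Tone Analyzer's
    document_tone/tones structure; adjust to the actual payload.
    """
    tones = response_json.get("document_tone", {}).get("tones", [])
    return {t["tone_id"]: t["score"] for t in tones}

# Illustrative sample response (not real service output)
sample = json.loads("""
{
  "document_tone": {
    "tones": [
      {"score": 0.83, "tone_id": "joy", "tone_name": "Joy"},
      {"score": 0.61, "tone_id": "analytical", "tone_name": "Analytical"}
    ]
  }
}
""")

scores = extract_tone_scores(sample)
print(scores)  # {'joy': 0.83, 'analytical': 0.61}
```

In the real pipeline, a dictionary like this is simply merged into each tweet record before it is published downstream.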
Our architecture now looks like this:
Twitter sentiment analysis data pipeline architecture
In the preceding diagram, we can break down the workflow into the following steps:
- Produce a stream of tweets and publish them into a Kafka topic, which can be thought of as a channel that groups events together. In turn, a receiver component can subscribe to this topic/channel to consume these events.
- Enrich the tweets with emotional, language, and social tone scores: use Spark Streaming to subscribe to Kafka topics from component 1 and send the text to the Watson Tone Analyzer service. The resulting tone scores are added to the data for further downstream processing. This component was implemented using Scala and, for convenience, was run using a Jupyter Scala Notebook.
- Data analysis and exploration: For this part, we decided to go with a Python Notebook simply because Python offers a more attractive ecosystem of libraries, especially around data visualization.
- Publish results back to Kafka.
- Implement a real-time dashboard as a Node.js application.
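The topic/subscriber pattern behind steps 1 through 4 can be sketched end to end in a few lines. The following is a simplified simulation only: it stands in an in-memory queue for a Kafka topic and a placeholder scoring function for the Watson Tone Analyzer call, and all names are illustrative:

```python
from queue import Queue

# In-memory stand-ins for the two Kafka topics (illustrative only)
raw_tweets_topic = Queue()
enriched_topic = Queue()

def produce(tweets):
    """Step 1: publish raw tweets to a topic/channel."""
    for text in tweets:
        raw_tweets_topic.put({"text": text})

def enrich():
    """Step 2: consume tweets, attach tone scores, republish.
    (In the real pipeline, Spark Streaming sends the text to
    Watson Tone Analyzer; here we attach a placeholder score.)"""
    while not raw_tweets_topic.empty():
        tweet = raw_tweets_topic.get()
        tweet["tones"] = {"joy": 0.5}  # placeholder, not a real score
        enriched_topic.put(tweet)

def analyze():
    """Steps 3-4: drain the enriched topic into a result set
    that the dashboard (step 5) would subscribe to."""
    results = []
    while not enriched_topic.empty():
        results.append(enriched_topic.get())
    return results

produce(["hello world", "great day"])
enrich()
dashboard_data = analyze()
print(len(dashboard_data))  # 2
```

The decoupling shown here is the main reason for using Kafka in the real architecture: each stage only knows about the topic it reads from and the topic it writes to, so components can be swapped or scaled independently.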
With a team of three people, it took us about 8 weeks to get the dashboard working with real-time Twitter sentiment data. There are multiple reasons for this seemingly long time:
- Some of the frameworks and services, such as Kafka and Spark Streaming, were new to us and we had to learn how to use their APIs.
- The dashboard frontend was built as a standalone Node.js application using the Mozaïk framework (https://github.com/plouc/mozaik), which made it easy to build powerful live dashboards. However, we found a few limitations with the code, which forced us to dive into the implementation and write patches, hence adding delays to the overall schedule.
The results are shown in the following screenshot:
Twitter sentiment analysis real-time dashboard