Offerings
Services
Our Platform
- NucliOS^®
  
  Foundation of Connected Intelligence
WHAT'S NEW?
- Unlock Smarter Insights for CPG in 2025 with Generative AI
  Read more
- 4 Data Engineering Challenges Hurting Your Organization
  Read more
Industries
Industries
FEATURED CASE STUDIES
- Optimizing Recruitment Strategy with Talent Acquisition Data Dashboards
  Read more
- IR and Customer 360 Platform Set Up
  Read more
WHAT'S NEW?
- From Data to Discovery: Harnessing the Power of Knowledge Graphs in Drug Development
  Read more
- Leveraging Multi-Touch Attribution for Optimizing Pharmaceutical Content-Level Messaging
  Read more
Insights
Insights
WHAT’S NEW?
- Procurement of the Future
  Read more
- How Google’s Willow is Paving a New Path for GenAI and Businesses
  Read more
Careers
CAREERS
QUICK LINKS
- Work with MathCo
  
  Join Us
WHAT’S NEW?
- MathCo Certified as a Great Place to Work® in India
  Read more
- TheMathCompany Recognized among the Inspiring Workplaces in North America
  Read more
About
About Us
QUICK LINKS
- Meet Our Leaders
- AI Xecutive Council
  
  Leveraging Collective Knowledge to Lead AI Innovation
  AIXC COMMUNITY
WHAT'S NEW?
- Redefining Enterprise Transformation: The Power of Generative AI
  Read more
- TheMathCompany is among India’s Best Workplaces™ for Diversity, Equity, Inclusion & Belonging 2023, by GPTW
  Read more

Real-Time Streaming Analytics for Faster Decision-Making

Article

Tapas Das

July 17, 2023 5 minute read

The concept of data as a strategic asset has been gaining momentum in recent years. Its importance stems from the invaluable insights it provides, enabling organizations and individuals to make informed choices and drive progress. Data is essential for informed decision-making, problem-solving, innovation, personalization, performance optimization, risk management, scientific advancements, and effective governance. By harnessing the power of data, organizations and individuals can navigate the complexities of the modern world and drive meaningful progress.

Most organizations are continuously striving to make rapid data-driven decisions—both strategic and operational—across multiple critical business units based on trustworthy data. Bad decisions can not only have negative implications internally but can also result in losing an important customer permanently. Therefore, organizations are increasingly adopting emerging technologies, like artificial intelligence (AI), machine learning (ML), Internet of Things (IoT), and cloud computing, both to revolutionize operations and to keep up with competitors.

With the ongoing data explosion, owing to IOT sensors, social media, web and mobile apps, etc., there is a stressing need for real-time and data-driven decision-making by leveraging streaming data analytics. Using streaming analytics, we can analyze multiple streams of data produced by a multitude of components (e.g., IoT devices, social media interactions, financial transactions, customer click-stream, etc.) to generate real-time insights and facilitate faster business decision-making.
This article showcases how we designed a platform-agnostic, real-time streaming analytics engine and how it processes live data feed from Twitter to perform real-time sentiment analysis of users.

Application design

Real-time systems are required to collect data from various sources and process them as they arrive, within a specified time interval, typically in the order of milli-, micro-, or even nanoseconds, and generate a response that delivers value.

Here are some of the key characteristics of real-time applications:

Low Latency (extremely short processing durations)
High Availability (fault-tolerant systems)
Horizontal Scalability (dynamic addition of compute or storage servers based on need)

The above flowchart represents the overall streaming analytics process. Let’s look at each component in detail.

Data ingestion

The Twitter Developer API, along with the “tweepy ^[1]” Python library, is used to read live Twitter feeds and ingest them into Apache Kafka ^[2] queues.

Tweepy is a popular Python library that provides a convenient way to access and interact with the Twitter API. It simplifies the process of connecting to the Twitter API, retrieving tweets, posting tweets, and performing various other Twitter-related tasks, making it easier for developers to incorporate Twitter functionality into their Python applications.

Data transformation

For sentiment analysis, “TextBlob ^[3]” Python library, which provides a simple API for common NLP tasks such as part-of-speech tagging, sentiment analysis, text classification, etc., is used. Text Blob’s intuitive API and wide range of NLP functionalities make it a popular choice for quick prototyping, educational purposes, and lightweight NLP tasks. Its integration with NLTK and availability of pre-trained models make it a convenient library for various text processing needs.
Apache Beam ^[4]–a unified model for both batch and streaming data-parallel processing pipelines—is being used to perform data transformations on the raw Twitter data, then pushed to the Kafka “outgoing” topic.

Data storage

Apache Druid ^[5]—a real-time analytics database designed for fast ad-hoc analytics and supporting high concurrency on large datasets—is being used to store the transformed streaming data from Apache Kafka.

Real-time dashboarding

Apache Superset ^[6]—an open-source, highly scalable application for data exploration and data visualization—is being used as the dashboarding tool for data analysis on users’ sentiments (as shown in the figure below).

Application hosting

The working prototype of a real-time streaming analytics engine has been set up on individual Docker containers, which are hosted on a single virtual machine and orchestrated using Docker Compose.

Summary

Computational analysis of streaming data, such as a Twitter data feed, is a challenging task due to the unstructured and noisy nature of the tweets, which requires integrating the tweet stream processing with ETL tools and ML algorithms to transform and analyze the streaming data.

The framework showcased in this article can be leveraged for a wide variety of use cases, such as real-time recommendation engines, real-time stock trades, fault detection, monitoring and reporting on internal IT systems, real-time cybersecurity for enhanced threat mitigation, and many more.

Further research work in this stream is currently in progress, including integration with fine-tuned ML models for NLP and hosting the Docker containers in a Kubernetes cluster for scalability.

Check out our white paper to learn more about real-time streaming analytics and what it can achieve here.

References

1.“Tweepy/Tweepy: Twitter for Python!” GitHub. Accessed February 1, 2023. https://github.com/tweepy/tweepy.

2. Apache Kafka. Accessed February 1, 2023. https://kafka.apache.org/.

3. “Installation.” Installation – TextBlob 0.16.0 documentation. Accessed February 1, 2023. https://textblob.readthedocs.io/en/dev/install.html.

4.“The Unified Apache Beam Model.” Brand. Accessed February 1, 2023. https://beam.apache.org/.

5. “Apache® Druid.” Apache Druid. Accessed February 1, 2023. https://druid.apache.org/.

6. “Superset.” Welcome. Accessed February 1, 2023. https://superset.apache.org/.

Tapas Das

Data Architect

Tapas has close to a decade's experience in the data engineering field and has expertise in functions related to architecture design, business impact analysis, and much more. A deep-learning enthusiast, he is also making strides in the ML space, frequently cracking ML competitions and picking up titles such as MachineHack Grand Master and Kaggle Notebooks Expert. Beyond his work, he enjoys imparting his tech-knowhow by writing blogs and getting a chuckle out of fine DE-humour.

Keep Reading

All February 13, 2023

The Value of Real-Time Streaming Analytics for Businesses, Demonstrated

February 13, 2023 Read more

All March 17, 2023

A Simple Guide to a Smooth Legacy System Migration

March 17, 2023 Read more

Pharma & Life Sciences May 2, 2023

Architecting MLOps Solutions for Healthcare

May 2, 2023 Read more

Real-Time Streaming Analytics for Faster Decision-Making

Application design

Data ingestion

Data transformation

Data storage

Real-time dashboarding

Application hosting

Summary

References

Tapas Das

Chicago

TheMathCompany Inc.
306W Erie St, Suite 300,
Chicago, IL 60654,
United States
+1 (646) 220-0901

Amsterdam

TheMathCompany BV
Keizersgracht 555
1017 DR
Amsterdam, Netherlands
+31 627613405

Bengaluru

TheMathCompany Pvt Ltd,
8th Floor, Tower A, IWF Campus,
Whitefield Main Rd, B Narayanapura,
Mahadevapura, Bengaluru, Karnataka,
India – 560048
+91-80-46245900