A leading media and entertainment company, with a diverse portfolio spanning television, film, sports, news, streaming, and gaming, sought to gain a competitive edge in the OTT landscape. The goal was to develop a real-time analytics platform to enhance observability and optimize video quality of experience (VQoE) for the client.
Their existing infrastructure faced critical challenges, including performance bottlenecks, data latency issues, and the need for 24/7 availability across multiple devices. To address these concerns, the client partnered with us to architect and implement a scalable, high-performance analytics solution that enabled real-time monitoring, predictive analytics, and intelligent personalization.
Key achievements of this modernization project include:
Operational Efficiency: five-nines (99.999%) system availability, proactive issue detection, and a scalable data ingestion pipeline for high-traffic streaming environments
In today’s world, data is being generated at an unprecedented rate. Every click, every tap, every swipe, every tweet, every post, every like, every share, every search, and every view generates a trail of data. Businesses are struggling to keep up with the speed and volume of this data, and traditional batch-processing systems cannot handle the scale and complexity of this data in real-time.
This is where streaming analytics comes into play, providing faster insights and more timely decision-making. Streaming analytics is particularly useful for scenarios that require quick reactions to events, such as financial fraud detection or IoT data processing. It can handle large volumes of data and provide continuous monitoring and alerts in real-time, allowing for immediate action to be taken when necessary.
Stream processing, or real-time analytics, is a method of analyzing and processing data as it is generated, rather than in batches. It allows for faster insights and more timely decision-making. Popular open-source stream processing engines include Apache Flink, Apache Spark Streaming, and Apache Kafka Streams. In this blog, we will cover the fundamentals of Apache Flink and how it can be used for streaming analytics.
Introduction
Apache Flink is an open-source stream processing framework first introduced in 2014. Flink has been designed to process large amounts of data streams in real-time, and it supports both batch and stream processing. It is built on top of the Java Virtual Machine (JVM) and is written in Java and Scala.
Flink is a distributed system that can run on a cluster of machines, and it has been designed to be highly available, fault-tolerant, and scalable. It supports a wide range of data sources and provides a unified API for batch and stream processing, which makes it easy to build complex data processing applications.
Advantages of Apache Flink
Real-time analytics is the process of analyzing data as it is generated. It requires a system that can handle large volumes of data in real-time and provide insights into the data as soon as possible. Apache Flink has been designed to meet these requirements and has several advantages over other real-time data processing systems.
Low Latency: Flink processes data streams in real-time, which means it can provide insights into the data almost immediately. This makes it an ideal solution for applications that require low latency, such as fraud detection and real-time recommendations.
High Throughput: Flink has been designed to handle large volumes of data and can scale horizontally to handle increasing volumes of data. This makes it an ideal solution for applications that require high throughput, such as log processing and IoT applications.
Flexible Windowing: Flink provides a flexible windowing API that enables the creation of complex windows for processing data streams. This enables the creation of windows based on time, count, or custom triggers, which makes it easy to create complex data processing applications.
Fault Tolerance: Flink is designed to be highly available and fault-tolerant. It can recover from failures quickly and can continue processing data even if some of the nodes in the cluster fail.
Compatibility: Flink is compatible with a wide range of data sources, including Kafka, Hadoop, and Elasticsearch. This makes it easy to integrate with existing data processing systems.
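The flexible windowing described above can be illustrated with a plain-Java sketch (the class and method names here are illustrative, not part of the Flink API): a count-based tumbling window groups every N elements and emits one aggregate per window.

```java
import java.util.ArrayList;
import java.util.List;

// A minimal sketch of count-based tumbling windows (plain Java, not the Flink API):
// every `size` elements form one window, which is aggregated with a sum.
public class CountWindowSketch {
    public static List<Integer> tumblingSum(List<Integer> stream, int size) {
        List<Integer> sums = new ArrayList<>();
        int sum = 0, count = 0;
        for (int value : stream) {
            sum += value;
            if (++count == size) {      // window is full: emit the aggregate and reset
                sums.add(sum);
                sum = 0;
                count = 0;
            }
        }
        return sums;                    // an incomplete trailing window is discarded here
    }

    public static void main(String[] args) {
        // elements 1..6 in windows of 3 -> sums [6, 15]
        System.out.println(tumblingSum(List.of(1, 2, 3, 4, 5, 6), 3));
    }
}
```

In Flink itself, the equivalent would be expressed with `countWindow` or a time-based window on a keyed stream, with the framework handling window state and triggers.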
Flink Architecture
Apache Flink processes data streams in a distributed manner. The Flink cluster consists of several nodes, each of which is responsible for processing a portion of the data. The nodes coordinate with each other over the network, and data typically enters the cluster through a messaging system such as Apache Kafka.
The Flink cluster processes data streams in parallel by dividing the data into small chunks, or partitions, and processing them independently. Each partition is processed by a task, which is a unit of work that runs on a node in the cluster.
Flink provides several APIs for building data processing applications, including the DataStream API, the DataSet API, and the Table API. The below diagram illustrates what a Flink cluster looks like.
Apache Flink Cluster
A Flink application runs on a cluster.
A Flink cluster has a job manager and a bunch of task managers.
A job manager is responsible for effective allocation and management of computing resources.
Task managers are responsible for the execution of a job.
Flink Job Execution
Client system submits job graph to the job manager
A client system prepares and sends a dataflow/job graph to the job manager.
It can be your Java/Scala/Python Flink application or the CLI.
The client is not part of the runtime or program execution.
After submitting the job, the client can either disconnect and operate in detached mode or remain connected to receive progress reports in attached mode.
Given below is an illustration of what the job graph generated from the code looks like:
Job Graph
The job graph is converted to an execution graph by the job manager
The execution graph is a parallel version of the job graph.
For each job vertex, it contains an execution vertex per parallel subtask.
An operator with a parallelism of 100 will consist of a single job vertex and 100 execution vertices.
Given below is an illustration of what an execution graph looks like:
Execution Graph
Job manager submits the parallel instances of the execution graph to task managers
Execution resources in Flink are defined through task slots.
Each task manager will have one or more task slots, each of which can run one pipeline of parallel tasks.
A pipeline consists of multiple successive tasks.
Parallel instances of execution graph being submitted to task slots
Flink Program
Flink programs look like regular programs that transform DataStreams. Each program consists of the same basic parts:
Obtain an execution environment
The ExecutionEnvironment is the context in which a program is executed. This is how the execution environment is set up in Flink code:
ExecutionEnvironment env = ExecutionEnvironment.createLocalEnvironment(); // if the program is running on a local machine
ExecutionEnvironment env = new CollectionEnvironment(); // if the source is a collection
ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment(); // will do the right thing based on context
Connect to data stream
We can use an instance of the execution environment to connect to the data source, which can be a filesystem, a streaming application, or a collection. This is how we can connect to a data source in Flink:
DataSet<String> data = env.readTextFile("file:///path/to/file"); // to read from a file
DataSet<User> users = env.fromCollection(/* elements from a Java Collection */); // to read from a collection
DataStream<User> users = env.addSource(/* streaming source, e.g., a Kafka consumer */); // DataStream API (StreamExecutionEnvironment)
Perform Transformations
We can perform transformation on the events/data that we receive from the data sources. A few of the data transformation operations are map, filter, keyBy, flatmap, etc.
Specify where to send the data
Once we have performed the transformation/analytics on the data that is flowing through the stream, we can specify where we will send the data. The destination can be a filesystem, database, or data streams.
dataStream.sinkTo(/* streaming application or database API */);
Flink Transformations
Map: Takes one element at a time from the stream and performs some transformation on it, and gives one element of any type as an output.
Given below is an example of Flink’s map operator:
stream.map(new MapFunction<Integer, String>() {
    public String map(Integer input) {
        // converts the number to words
        return "input -> " + input + " : output -> " + numberToWords(input.toString().toCharArray());
    }
}).print();
Filter: Evaluates a boolean function for each element and retains those for which the function returns true.
Flink’s filter operator keeps only the elements for which the predicate returns true; for example, stream.filter(value -> value % 2 == 0) retains only the even numbers in the stream.
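The same semantics can be sketched in plain Java (illustrative class and method names, not the Flink API): every element is tested against a predicate, and only the elements that pass are kept.

```java
import java.util.List;
import java.util.stream.Collectors;

// A plain-Java sketch of filter semantics; the Flink operator behaves analogously
// on an unbounded DataStream: elements for which the predicate is false are dropped.
public class FilterSketch {
    public static List<Integer> keepEven(List<Integer> input) {
        return input.stream()
                .filter(value -> value % 2 == 0)  // predicate true -> element retained
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(keepEven(List.of(1, 2, 3, 4, 5, 6)));  // [2, 4, 6]
    }
}
```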
KeyBy: Logically partitions a stream into disjoint partitions.
All records with the same key are assigned to the same partition.
Internally, keyBy() is implemented with hash partitioning.
The figure below illustrates how the keyBy operator works in Flink.
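Flink’s actual implementation hashes keys into key groups with its own hash function, but the principle of hash partitioning can be sketched in plain Java (illustrative names): the same key always maps to the same partition index.

```java
// A sketch of hash partitioning, the principle behind keyBy: records with the same
// key are deterministically assigned to the same parallel partition.
public class KeyBySketch {
    public static int partitionFor(String key, int numPartitions) {
        // mask off the sign bit so the index is always non-negative
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        // identical keys always land in the same partition
        System.out.println(partitionFor("user-42", 4) == partitionFor("user-42", 4));
    }
}
```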
Fault Tolerance
Flink combines stream replay and checkpointing to achieve fault tolerance.
At a checkpoint, each operator’s corresponding state and the specific point in each input stream are marked.
Whenever checkpointing is done, a snapshot of the state of all the operators is saved in the state backend, which is generally the job manager’s memory or configurable durable storage.
Whenever there is a failure, operators are reset to the most recent state in the state backend, and event processing is resumed.
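The checkpoint-and-restore cycle described above can be sketched as a toy in plain Java (illustrative names; real Flink state backends are far more involved): operator state is periodically copied to a backend, and on failure the live state is replaced with the last snapshot.

```java
import java.util.HashMap;
import java.util.Map;

// A toy sketch of checkpoint-and-restore: the operator keeps a running count per
// key; checkpoint() snapshots it to a "state backend" map, and recover() resets
// the live state to the most recent snapshot.
public class CheckpointSketch {
    private Map<String, Integer> state = new HashMap<>();
    private Map<String, Integer> stateBackend = new HashMap<>();

    public void process(String key) { state.merge(key, 1, Integer::sum); }

    public void checkpoint() { stateBackend = new HashMap<>(state); }  // save a snapshot

    public void recover() { state = new HashMap<>(stateBackend); }     // reset to the snapshot

    public int count(String key) { return state.getOrDefault(key, 0); }

    public static void main(String[] args) {
        CheckpointSketch op = new CheckpointSketch();
        op.process("a"); op.process("a");
        op.checkpoint();                    // snapshot holds a -> 2
        op.process("a");                    // a -> 3, not yet checkpointed
        op.recover();                       // failure: roll back to a -> 2
        System.out.println(op.count("a"));  // 2
    }
}
```

After recovery, the events processed since the last checkpoint must be replayed from the source, which is what the next section's barriers and offsets make possible.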
Checkpointing
Checkpointing is implemented using stream barriers.
Barriers are injected into the data stream at the source, e.g., Kafka, Kinesis, etc.
Barriers flow with the records as part of the data stream.
Refer to the diagram below to understand how checkpoint barriers flow with the events:
Checkpoint Barriers
Saving Snapshots
Operators snapshot their state at the point in time when they have received all snapshot barriers from their input streams, and before emitting the barriers to their output streams.
Once a sink operator (the end of a streaming DAG) has received the barrier n from all of its input streams, it acknowledges that snapshot n to the checkpoint coordinator.
After all sinks have acknowledged a snapshot, it is considered completed.
The below diagram illustrates how checkpointing is achieved in Flink with the help of barrier events, state backends, and checkpoint table.
Checkpointing
Recovery
Flink selects the latest completed checkpoint upon failure.
The system then re-deploys the entire distributed dataflow.
Gives each operator the state that was snapshotted as part of the checkpoint.
The sources are set to start reading the stream from the position given in the snapshot.
For example, in Apache Kafka, that means telling the consumer to start fetching messages from an offset given in the snapshot.
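A toy plain-Java sketch of this offset-based recovery (illustrative names): operator state and the read position are snapshotted together, so after a failure the source resumes from the stored offset and the final result matches a failure-free run.

```java
import java.util.List;

// A toy sketch of source recovery: the checkpoint stores both the operator state
// (a running sum) and the stream offset; after a failure, the sum is restored and
// reading resumes from the stored offset, so no event is lost or double-counted.
public class SourceRecoverySketch {
    public static int sumWithRecovery(List<Integer> stream, int checkpointOffset, int crashOffset) {
        int sum = 0;
        // process until the checkpoint, then snapshot state and offset together
        for (int i = 0; i < checkpointOffset; i++) sum += stream.get(i);
        int snapshotSum = sum, snapshotOffset = checkpointOffset;

        // keep processing past the checkpoint; this progress is lost at the crash
        for (int i = checkpointOffset; i < crashOffset; i++) sum += stream.get(i);

        // recovery: restore the snapshotted state and resume from the snapshotted offset
        sum = snapshotSum;
        for (int i = snapshotOffset; i < stream.size(); i++) sum += stream.get(i);
        return sum;
    }

    public static void main(String[] args) {
        // equals the sum over the whole stream despite the simulated failure
        System.out.println(sumWithRecovery(List.of(1, 2, 3, 4, 5), 2, 4));  // 15
    }
}
```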
Scalability
A Flink job can be scaled up and scaled down as per requirement.
This can be done manually by:
Triggering a savepoint (manually triggered checkpoint)
Adding/Removing nodes to/from the cluster
Restarting the job from savepoint
OR
Automatically by Reactive Scaling:
The configuration of a job in Reactive Mode ensures that it utilizes all available resources in the cluster at all times.
Adding a Task Manager will scale up your job, and removing resources will scale it down.
Reactive Mode restarts a job on a rescaling event, restoring it from the latest completed checkpoint.
The only downside is that it works only in standalone mode.
Alternatives
Spark Streaming: It is a streaming extension of the open-source distributed computing engine Apache Spark, but Flink is optimized for low-latency processing of real-time data streams and supports more complex processing scenarios.
Apache Storm: It is another open-source stream processing system that has a steeper learning curve than Flink and uses a different architecture based on spouts and bolts.
Apache Kafka Streams: It is a lightweight stream processing library built on top of Kafka, but it is not as feature-rich as Flink or Spark, and is better suited for simpler stream processing tasks.
Conclusion
In conclusion, Apache Flink is a powerful solution for real-time analytics. With its ability to process data in real-time and support for streaming data sources, it enables businesses to make data-driven decisions with minimal delay. The Flink ecosystem also provides a variety of tools and libraries that make it easy for developers to build scalable and fault-tolerant data processing pipelines.
One of the key advantages of Apache Flink is its support for event-time processing, which allows it to handle delayed or out-of-order data in a way that accurately reflects the sequence of events. This makes it particularly useful for use cases such as fraud detection, where timely and accurate data processing is critical.
Additionally, Flink’s support for multiple programming languages, including Java, Scala, and Python, makes it accessible to a broad range of developers. And with its seamless integration with popular big data platforms like Hadoop and Apache Kafka, it is easy to incorporate Flink into existing data infrastructure.
In summary, Apache Flink is a powerful and flexible solution for real-time analytics, capable of handling a wide range of use cases and delivering timely insights that drive business value.
Stream processing is a technology used to process large amounts of data in real time as it is generated, rather than storing it and processing it later.
Think of it like a conveyor belt in a factory. The conveyor belt constantly moves, bringing in new products that need to be processed. Similarly, stream processing deals with data that is constantly flowing, like a stream of water. Just like the factory worker needs to process each product as it moves along the conveyor belt, stream processing technology processes each piece of data as it arrives.
Stateful and stateless processing are two different approaches to stream processing, and the right choice depends on the specific requirements and needs of the application.
Stateful processing is useful in scenarios where the processing of an event or data point depends on the state of previous events or data points. For example, it can be used to maintain a running total or average across multiple events or data points.
Stateless processing, on the other hand, is useful in scenarios where the processing of an event or data point does not depend on the state of previous events or data points. For example, in a simple data transformation application, stateless processing can be used to transform each event or data point independently without the need to maintain state.
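The distinction can be sketched in plain Java (illustrative names): a stateless transform handles each event independently, while a stateful one carries information, here a running total, across events.

```java
import java.util.ArrayList;
import java.util.List;

// Stateless vs. stateful processing in miniature: doubled() looks at one event at
// a time, while runningTotal()'s output for each event depends on all previous events.
public class StateSketch {
    // stateless: each event is transformed independently
    public static List<Integer> doubled(List<Integer> events) {
        List<Integer> out = new ArrayList<>();
        for (int e : events) out.add(e * 2);
        return out;
    }

    // stateful: a running total is maintained across events
    public static List<Integer> runningTotal(List<Integer> events) {
        List<Integer> out = new ArrayList<>();
        int total = 0;
        for (int e : events) { total += e; out.add(total); }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(doubled(List.of(1, 2, 3)));       // [2, 4, 6]
        System.out.println(runningTotal(List.of(1, 2, 3)));  // [1, 3, 6]
    }
}
```

In a distributed stream processor the stateful variant is the hard case, because that running total must survive failures and rescaling, which is what checkpointing and state backends provide.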
Streaming analytics refers to the process of analyzing and processing data in real time as it is generated. Streaming analytics enable applications to react to events and make decisions in near real time.
Why Stream Processing and Analytics?
Stream processing is important because it allows organizations to make real-time decisions based on the data they are receiving. This is particularly useful in situations where timely information is critical, such as in financial transactions, network security, and real-time monitoring of industrial processes.
For example, in financial trading, stream processing can be used to analyze stock market data in real time and make split-second decisions to buy or sell stocks. In network security, it can be used to detect and respond to cyber-attacks in real time. And in industrial processes, it can be used to monitor production line efficiency and quickly identify and resolve any issues.
Stream processing is also important because it can process massive amounts of data, making it ideal for big data applications. With the growth of the Internet of Things (IoT), the amount of data being generated is growing rapidly, and stream processing provides a way to process this data in real time and derive valuable insights.
Collectively, stream processing provides organizations with the ability to make real-time decisions based on the data they are receiving, allowing them to respond quickly to changing conditions and improve their operations.
How is it different from Batch Processing?
Batch Data Processing:
Batch Data Processing is a method of processing where a group of transactions or data is collected over a period of time and is then processed all at once in a “batch”. The process begins with the extraction of data from its sources, such as IoT devices or web/application logs. This data is then transformed and integrated into a data warehouse. The process is generally called the Extract, Transform, Load (ETL) process. The data warehouse is then used as the foundation for an analytical layer, which is where the data is analyzed, and insights are generated.
Stream/Real-time Data Processing:
Real-Time Data Streaming involves the continuous flow of data that is generated in real time, typically from multiple sources such as IoT devices or web/application logs. A message broker is used to manage the flow of data between the stream processors, the analytical layer, and the data sink. The message broker ensures that the data is delivered in the correct order and that it is not lost. Stream processors are used to perform data ingestion and processing: they take in the data streams and process them in real time. The processed data is then sent to an analytical layer, where it is analyzed and insights are generated.
Processes involved in Stream processing and Analytics:
The process of stream processing can be broken down into the following steps:
Data Collection: The first step in stream processing is collecting data from various sources, such as sensors, social media, and transactional systems. The data is then fed into a stream processing system in real time.
Data Ingestion: Once the data is collected, it needs to be ingested or taken into the stream processing system. This involves converting the data into a standard format that can be processed by the system.
Data Processing: The next step is to process the data as it arrives. This involves applying various processing algorithms and rules to the data, such as filtering, aggregating, and transforming the data. The processing algorithms can be applied to individual events in the stream or to the entire stream of data.
Data Storage: After the data has been processed, it is stored in a database or data warehouse for later analysis. The storage can be configured to retain the data for a specific amount of time or to retain all the data.
Data Analysis: The final step is to analyze the processed data and derive insights from it. This can be done using data visualization tools or by running reports and queries on the stored data. The insights can be used to make informed decisions or to trigger actions, such as sending notifications or triggering alerts.
It’s important to note that stream processing is an ongoing process, with data constantly being collected, processed, and analyzed in real time. The visual representation of this process can be represented as a continuous cycle of data flowing through the system, being processed and analyzed at each step along the way.
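The five steps above can be sketched end to end in plain Java (the names and filter rules are illustrative): raw strings are collected, ingested into a standard numeric form, processed with a filter rule, stored, and finally analyzed.

```java
import java.util.ArrayList;
import java.util.List;

// An end-to-end sketch of the pipeline: collection -> ingestion -> processing
// -> storage -> analysis, on a single stream of sensor-style readings.
public class PipelineSketch {
    public static double collectProcessAnalyze(List<String> rawEvents) {
        List<Double> store = new ArrayList<>();           // "data storage"
        for (String raw : rawEvents) {                    // data collection
            double value;
            try {
                value = Double.parseDouble(raw.trim());   // ingestion: convert to a standard format
            } catch (NumberFormatException e) {
                continue;                                 // processing: drop malformed events
            }
            if (value >= 0) store.add(value);             // processing: filter rule (illustrative)
        }
        // analysis: derive an insight (here, the mean of the valid readings)
        return store.stream().mapToDouble(Double::doubleValue).average().orElse(0.0);
    }

    public static void main(String[] args) {
        System.out.println(collectProcessAnalyze(List.of("10", " 20 ", "oops", "-5", "30")));  // 20.0
    }
}
```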
Stream Processing Platforms & Frameworks:
Stream Processing Platforms & Tools are software systems that enable the collection, processing, and analysis of real-time data streams.
Stream Processing Frameworks:
A stream processing framework is a software library or framework that provides a set of tools and APIs for developers to build custom stream processing applications. Frameworks typically require more development effort and configuration to set up and use. They provide more flexibility and control over the stream processing pipeline but also require more development and maintenance resources.
Let’s first look into the most commonly used stream processing frameworks: Apache Flink & Apache Spark Streaming.
Apache Flink:
Flink is an open-source, unified stream-processing and batch-processing framework. Flink executes arbitrary dataflow programs in a data-parallel and pipelined manner, making it ideal for processing huge amounts of data in real-time.
Flink provides out-of-the-box checkpointing and state management, two features that make it possible to manage enormous amounts of state with relative ease.
Event processing, filter, and map functions further simplify handling large amounts of data.
Flink also comes with real-time indicators and alerts, which make a big difference when it comes to data processing and analysis.
Note: We have discussed the stream processing and analytics in detail in Stream Processing and Analytics with Apache Flink
Apache Spark Streaming:
Apache Spark Streaming is a scalable fault-tolerant streaming processing system that natively supports both batch and streaming workloads. Spark Streaming is an extension of the core Spark API that allows data engineers and data scientists to process real-time data.
Great for solving complicated transformative logic
Easy to program
Runs at blazing speeds
Processes large data within a fraction of a second
Stream Processing Platforms:
A stream processing platform is an end-to-end solution for processing real-time data streams. Platforms typically require less development effort and maintenance as they provide pre-built tools and functionality for processing, analyzing, and visualizing data.
Examples: Apache Kafka, Amazon Kinesis, Google Cloud Pub-Sub
Let’s look into the most commonly used stream processing platforms: Apache Kafka & AWS Kinesis.
Apache Kafka:
Apache Kafka is an open-source distributed event streaming platform for high-performance data pipelines, streaming analytics, data integration, and mission-critical applications.
Because it is open-source, Kafka generally requires a higher skill set to operate and manage, so it’s typically used for development and testing.
APIs allow “producers” to publish data streams to “topics”; a “topic” is a partitioned log of records; a “partition” is ordered and immutable; “consumers” subscribe to “topics”.
It can run on a cluster of “brokers” with partitions split across cluster nodes.
Messages can be effectively unlimited in size (configurable, with a hard upper bound of about 2 GB).
AWS Kinesis:
Amazon Kinesis is a cloud-based service on Amazon Web Services (AWS) that allows you to ingest real-time data such as application logs, website clickstreams, and IoT telemetry data for machine learning and analytics, as well as video and audio.
Amazon Kinesis is a SaaS offering, reducing the complexities of the design, build, and management stages compared to open-source Apache Kafka. It is ideally suited for building microservices architectures.
“Producers” push data onto the stream as soon as it is generated. Kinesis breaks the stream across “shards” (which are like partitions).
Shards have a hard limit on the number of transactions and data volume per second. If you need more volume, you must subscribe to more shards. You pay for what you use.
Most maintenance and configurations are hidden from the user. Scaling is easy (adding shards) compared to Kafka.
Maximum message size is 1MB.
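Shard sizing follows directly from these limits. As a sketch, assuming the commonly documented per-shard write limits of roughly 1 MB/s and 1,000 records/s (check the current AWS documentation for exact values):

```java
// Shard sizing sketch: a stream needs enough shards to satisfy both the
// throughput limit (~1 MB/s per shard) and the record-rate limit
// (~1,000 records/s per shard); the tighter of the two limits wins.
public class ShardSizingSketch {
    public static int shardsNeeded(double mbPerSec, double recordsPerSec) {
        int byThroughput = (int) Math.ceil(mbPerSec / 1.0);
        int byRecords = (int) Math.ceil(recordsPerSec / 1000.0);
        return Math.max(byThroughput, byRecords);  // the tighter limit wins
    }

    public static void main(String[] args) {
        // 4.5 MB/s and 3,000 records/s -> throughput is the bottleneck: 5 shards
        System.out.println(shardsNeeded(4.5, 3000));  // 5
    }
}
```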
Three Characteristics of an Event Stream Processing Platform:
Publish and Subscribe:
In a publish-subscribe model, producers publish events or messages to streams or topics, and consumers subscribe to streams or topics to receive the events or messages. This is similar to a message queue or enterprise messaging system. It allows for the decoupling of the producer and consumer, enabling them to operate independently and asynchronously.
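A minimal in-memory sketch of the model (plain Java, illustrative names): producers append events to a named topic, an ordered log, and each consumer reads the topic independently of the producer.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// A minimal publish-subscribe sketch: topics are ordered in-memory logs; producers
// append to them, and consumers read them independently and asynchronously.
public class PubSubSketch {
    private final Map<String, List<String>> topics = new HashMap<>();

    public void publish(String topic, String event) {
        topics.computeIfAbsent(topic, t -> new ArrayList<>()).add(event);
    }

    // each subscriber gets its own view of the log, decoupled from the producer
    public List<String> subscribe(String topic) {
        return new ArrayList<>(topics.getOrDefault(topic, List.of()));
    }

    public static void main(String[] args) {
        PubSubSketch broker = new PubSubSketch();
        broker.publish("clicks", "page-1");
        broker.publish("clicks", "page-2");
        System.out.println(broker.subscribe("clicks"));  // [page-1, page-2]
    }
}
```

Real platforms add the parts this sketch omits: partitioning of each topic, durable replicated storage, and per-consumer offsets.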
Store streams of events in a fault-tolerant way
This means that the platform is able to store and manage events in a reliable and resilient manner, even in the face of failures or errors. To achieve fault tolerance, event streaming platforms typically use a variety of techniques, such as replicating data across multiple nodes, and implementing data recovery and failover mechanisms.
Process streams of events in real-time, as they occur
This means that the platform can process and analyze data as it is generated rather than waiting for data to be batch-processed or stored for later processing.
Challenges when designing the stream processing and analytics solution:
Stream processing is a powerful technology, but there are also several challenges associated with it, including:
Late arriving data: Data that is delayed or arrives out of order can disrupt the processing pipeline and lead to inaccurate results. Stream processing systems need to be able to handle out-of-order data and reconcile it with the data that has already been processed.
Missing data: If data is missing or lost, it can impact the accuracy of the processing results. Stream processing systems need to be able to identify missing data and handle it appropriately, whether by skipping it, buffering it, or alerting a human operator.
Duplicate data: Duplicate data can lead to over-counting and skewed results. Stream processing systems need to be able to identify and de-duplicate data to ensure accurate results.
Data skew: Data skew occurs when there is a disproportionate amount of data for certain key fields or time periods. This can lead to performance issues, processing delays, and inaccurate results. Stream processing systems need to be able to handle data skew by load balancing and scaling resources appropriately.
Fault tolerance: Stream processing systems need to be able to handle hardware and software failures without disrupting the processing pipeline. This requires fault-tolerant design, redundancy, and failover mechanisms.
Data security and privacy: Real-time data processing often involves sensitive data, such as personal information, financial data, or intellectual property. Stream processing systems need to ensure that data is securely transmitted, stored, and processed in compliance with regulatory requirements.
Latency: Another challenge with stream processing is latency, or the amount of time it takes for data to be processed and analyzed. In many applications, the results of the analysis need to be produced in real time, which puts pressure on the stream processing system to process the data quickly.
Scalability: Stream processing systems must be able to scale to handle large amounts of data as the amount of data being generated continues to grow. This can be a challenge because the systems must be designed to handle data in real-time while also ensuring that the results of the analysis are accurate and reliable.
Maintenance: Maintaining a stream processing system can also be challenging, as the systems are complex and require specialized knowledge to operate effectively. In addition, the systems must be able to evolve and adapt to changing requirements over time.
Despite these challenges, stream processing remains an important technology for organizations that need to process data in real time and make informed decisions based on that data. By understanding these challenges and designing the systems to overcome them, organizations can realize the full potential of stream processing and improve their operations.
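One of the mitigations above, de-duplication by event ID, can be sketched in plain Java (illustrative names): keeping a set of seen IDs drops re-delivered events so counts are not inflated.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// A sketch of stream de-duplication: each event carries an ID, and an event is
// processed only the first time its ID is seen.
public class DedupSketch {
    public static List<String> dedupe(List<String> eventIds) {
        Set<String> seen = new HashSet<>();
        List<String> out = new ArrayList<>();
        for (String id : eventIds) {
            if (seen.add(id)) out.add(id);  // add() returns false for duplicates
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(dedupe(List.of("e1", "e2", "e1", "e3", "e2")));  // [e1, e2, e3]
    }
}
```

In a production system the seen-ID set cannot grow forever, so it is typically bounded by a time window or TTL, which ties de-duplication back to the late-data and windowing challenges above.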
Key benefits of stream processing and analytics:
Real-time processing keeps you in sync all the time:
For example: Suppose an online retailer uses a distributed system to process orders. The system might include multiple components, such as a web server, a database server, and an application server. Real-time processing keeps these components in sync by processing orders as they are received and updating the database accordingly. As a result, orders are accurate and processed efficiently because the system maintains a consistent view of the data.
Real-time data processing is more accurate and timely:
For example, a financial trading system that processes data in real time can help to ensure that trades are executed at the best possible prices, improving the accuracy and timeliness of the trades.
Deadlines are met with Real-time processing:
For example: In a control system, it may be necessary to respond to changing conditions within a certain time frame in order to maintain the stability of the system.
Real-time processing is quite reactive:
For example, a real-time processing system might be used to monitor a manufacturing process and trigger an alert if it detects a problem or to analyze sensor data from a power plant and adjust the plant’s operation in response to changing conditions.
Real-time processing involves multitasking:
For example, consider a real-time monitoring system that is used to track the performance of a large manufacturing plant. The system might receive data from multiple sensors and sources, including machine sensors, temperature sensors, and video cameras. In this case, the system would need to be able to multitask in order to process and analyze data from all of these sources in real time and to trigger alerts or take other actions as needed.
Real-time processing works independently:
For example, a real-time processing system may rely on a database or message queue to store and retrieve data, or it may rely on external APIs or services to access additional data or functionality.
Use case studies:
There are many real-life examples of stream processing in different industries that demonstrate the benefits of this technology. Here are a few examples:
Financial Trading: In the financial industry, stream processing is used to analyze stock market data in real time and make split-second decisions to buy or sell stocks. This allows traders to respond to market conditions in real time and improve their chances of making a profit.
Network Security: Stream processing is also used in network security to detect and respond to cyber-attacks in real-time. By processing network data in real time, security systems can quickly identify and respond to threats, reducing the risk of a data breach.
Industrial Monitoring: In the industrial sector, stream processing is used to monitor production line efficiency and quickly identify and resolve any issues. For example, it can be used to monitor the performance of machinery and identify any potential problems before they cause a production shutdown.
Social Media Analysis: Stream processing is also used to analyze social media data in real time. This allows organizations to monitor brand reputation, track customer sentiment, and respond to customer complaints in real time.
Healthcare: In the healthcare industry, stream processing is used to monitor patient data in real time and quickly identify any potential health issues. For example, it can be used to monitor vital signs and alert healthcare providers if a patient’s condition worsens.
These are just a few examples of the many real-life applications of stream processing. Across all industries, stream processing provides organizations with the ability to process data in real time and make informed decisions based on the data they are receiving.
How to start stream analytics?
Our recommendation when building a dedicated platform is to focus on choosing a versatile stream processor to pair with your existing analytical interface.
Or, keep an eye on vendors who offer both stream processing and BI as a service.
Conclusion:
Real-time data analysis and decision-making require stream processing and analytics in diverse industries, including finance, healthcare, and e-commerce. Organizations can improve operational efficiency, customer satisfaction, and revenue growth by processing data in real time. A robust infrastructure, skilled personnel, and efficient algorithms are required for stream processing and analytics. Businesses need stream processing and analytics to stay competitive and agile in today’s fast-paced world as data volumes and complexity continue to increase.