When React was introduced, it had an edge over the other libraries and frameworks of that era because of a very interesting concept called one-way data binding, or in simpler words, a unidirectional flow of data, introduced as part of the Virtual DOM.
It made for a fantastic developer experience where one didn’t have to think about how updates flow through the UI when data (“state”, to be more technical) changes.
However, as more and more hooks were introduced, syntactical rules appeared to make sure they perform in the most optimal way: essentially, a deviation from the original purpose of React, which is a unidirectional flow with explicit mutations.
To call out a few:
Filling out the dependency arrays correctly
Memoizing the right values or callbacks for rendering optimization
Consciously avoiding prop drilling
And possibly a few more that, if done the wrong way, could cause serious performance issues, i.e., everything just re-renders. A slight deviation from the original purpose of just writing components to build UIs.
The use of signals is a good example of how adopting Reactive programming primitives can remove all this complexity and improve the developer experience by shifting focus to the right things, without having to explicitly follow a set of syntactical rules to gain performance.
What Is a Signal?
A signal is one of the key primitives of Reactive programming. Syntactically, signals are very similar to states in React. However, the reactive capabilities of a signal are what give it the edge.
```javascript
const [state, setState] = useState(0);        // state -> value, setState -> setter

const [signal, setSignal] = createSignal(0);  // signal -> getter, setSignal -> setter
```
At this point, they look pretty much the same, except that useState returns a value and createSignal returns a getter function.
How is a signal better than a state?
Once useState returns a value, the library generally doesn’t concern itself with how that value is used. It’s the developer who has to decide where to use it, explicitly make sure that any effects, memos, or callbacks that want to subscribe to its changes list it in their dependency arrays, and, on top of that, memoize it to avoid unnecessary re-renders. A lot of additional effort.
```javascript
function ParentComponent() {
  const [state, setState] = useState(0);

  const stateVal = useMemo(() => {
    return doSomeExpensiveStateCalculation(state);
  }, [state]); // Explicitly memoize and make sure dependencies are accurate

  useEffect(() => {
    sendDataToServer(state);
  }, [state]); // Explicitly call out the subscription to state

  return (
    <div>
      <ChildComponent stateVal={stateVal} />
    </div>
  );
}
```
createSignal, however, returns a getter function, since signals are reactive in nature. To break it down further, a signal keeps track of who is interested in its changes and, when a change occurs, notifies these subscribers.
To gain this subscriber information, a signal keeps track of the context in which its getter, which is essentially a function, is called. Invoking the getter creates a subscription.
This is super helpful, as the library itself now takes care of tracking the subscribers to a state’s changes and notifying them, without the developer having to explicitly call anything out.
```javascript
createEffect(() => {
  updateDataElsewhere(state());
}); // the effect runs only when `state` changes - an automatic subscription
```
The contexts (not to be confused with the React Context API) that invoke the getter are the only ones the library will notify, which means memoizing, explicitly filling out large dependency arrays, and fixing unnecessary re-renders can all be avoided. This removes the need for a lot of the additional hooks meant for this purpose, such as useRef, useCallback, and useMemo, as well as a lot of re-renders.
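To make the auto-subscription mechanism above concrete, here is a toy Python sketch of how a signal could track and notify its subscribers. This is an illustration of the idea only, not how Solid.js (or any signals library) is actually implemented:

```python
# Toy signal sketch: reading a signal inside an effect subscribes that effect,
# and the setter re-runs only the subscribed effects. Illustrative only.

_current_effect = None  # effect currently being (re)run, if any

def create_signal(value):
    subscribers = set()

    def getter():
        if _current_effect is not None:
            subscribers.add(_current_effect)  # reading inside an effect subscribes it
        return value

    def setter(new_value):
        nonlocal value
        value = new_value
        for effect in list(subscribers):  # notify only the interested effects
            effect()

    return getter, setter

def create_effect(fn):
    global _current_effect
    _current_effect = fn
    fn()  # the first run records which signals the effect reads
    _current_effect = None

count, set_count = create_signal(0)
log = []
create_effect(lambda: log.append(count()))

set_count(1)
set_count(2)
print(log)  # [0, 1, 2]
```

Note that the effect never declares a dependency array; calling `count()` inside it is what creates the subscription.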
This greatly enhances the developer experience and shifts focus back on building components for the UI rather than spending that extra 10% of developer efforts in abiding by strict syntactical rules for performance optimization.
```javascript
function ParentComponent() {
  const [state, setState] = createSignal(0);

  const stateVal = doSomeExpensiveStateCalculation(state()); // no need to memoize explicitly

  createEffect(() => {
    sendDataToServer(state());
  }); // will only be fired if state changes - the effect is automatically added as a subscriber

  return (
    <div>
      <ChildComponent stateVal={stateVal} />
    </div>
  );
}
```
Conclusion
It might look like there’s a very biased stance toward using signals and reactive programming in general. However, that’s not the case.
React is a high-performance, optimized library. Even when there are gaps or misses in using state optimally, leading to unnecessary re-renders, it’s still really fast. After years of using React a certain way, frontend developers are used to visualizing a certain flow of data and re-rendering, and replacing that entirely with a reactive programming mindset is not natural. React is still the de facto choice for building user interfaces, and it will continue to be with every iteration and new feature added.
Reactive programming, in addition to performance enhancements, also makes the developer experience much simpler by boiling everything down to three major primitives: Signals, Memos, and Effects. This helps shift the focus to building components for UIs rather than dealing explicitly with performance optimization.
Signals are getting increasingly popular and are part of many modern web frameworks, such as Solid.js, Preact, Qwik, and Vue.js.
In today’s world, data is being generated at an unprecedented rate. Every click, every tap, every swipe, every tweet, every post, every like, every share, every search, and every view generates a trail of data. Businesses are struggling to keep up with the speed and volume of this data, and traditional batch-processing systems cannot handle the scale and complexity of this data in real-time.
This is where streaming analytics comes into play, providing faster insights and more timely decision-making. Streaming analytics is particularly useful for scenarios that require quick reactions to events, such as financial fraud detection or IoT data processing. It can handle large volumes of data and provide continuous monitoring and alerts in real-time, allowing for immediate action to be taken when necessary.
Stream processing, or real-time analytics, is a method of analyzing and processing data as it is generated, rather than in batches. It allows for faster insights and more timely decision-making. Popular open-source stream processing engines include Apache Flink, Apache Spark Streaming, and Apache Kafka Streams. In this blog, we are going to talk about Apache Flink, its fundamentals, and how it can be useful for streaming analytics.
Introduction
Apache Flink is an open-source stream processing framework first introduced in 2014. Flink has been designed to process large amounts of data streams in real-time, and it supports both batch and stream processing. It is built on top of the Java Virtual Machine (JVM) and is written in Java and Scala.
Flink is a distributed system that can run on a cluster of machines, and it has been designed to be highly available, fault-tolerant, and scalable. It supports a wide range of data sources and provides a unified API for batch and stream processing, which makes it easy to build complex data processing applications.
Advantages of Apache Flink
Real-time analytics is the process of analyzing data as it is generated. It requires a system that can handle large volumes of data in real-time and provide insights into the data as soon as possible. Apache Flink has been designed to meet these requirements and has several advantages over other real-time data processing systems.
Low Latency: Flink processes data streams in real-time, which means it can provide insights into the data almost immediately. This makes it an ideal solution for applications that require low latency, such as fraud detection and real-time recommendations.
High Throughput: Flink has been designed to handle large volumes of data and can scale horizontally to handle increasing volumes of data. This makes it an ideal solution for applications that require high throughput, such as log processing and IoT applications.
Flexible Windowing: Flink provides a flexible windowing API that enables the creation of complex windows for processing data streams. This enables the creation of windows based on time, count, or custom triggers, which makes it easy to create complex data processing applications.
Fault Tolerance: Flink is designed to be highly available and fault-tolerant. It can recover from failures quickly and can continue processing data even if some of the nodes in the cluster fail.
Compatibility: Flink is compatible with a wide range of data sources, including Kafka, Hadoop, and Elasticsearch. This makes it easy to integrate with existing data processing systems.
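The flexible windowing described above can be sketched in a few lines of Python. This is a conceptual illustration with made-up helper names, not Flink's windowing API:

```python
# Conceptual sketch of tumbling count- and time-based windows: events are
# grouped into fixed-size windows and each window can then be aggregated once.

def count_windows(events, size):
    """Group events into tumbling windows of `size` elements."""
    return [events[i:i + size] for i in range(0, len(events), size)]

def time_windows(timestamped_events, window_ms):
    """Group (timestamp_ms, value) pairs into tumbling time windows."""
    windows = {}
    for ts, value in timestamped_events:
        windows.setdefault(ts // window_ms, []).append(value)
    return windows

print(count_windows([1, 2, 3, 4, 5], size=2))  # [[1, 2], [3, 4], [5]]

events = [(10, "a"), (950, "b"), (1010, "c")]
print(time_windows(events, window_ms=1000))    # {0: ['a', 'b'], 1: ['c']}
```

Flink additionally supports sliding windows and custom triggers; the sketch only shows the tumbling case.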
Flink Architecture
Apache Flink processes data streams in a distributed manner. The Flink cluster consists of several nodes, each of which is responsible for processing a portion of the data. The nodes communicate with each other using a messaging system, such as Apache Kafka.
The Flink cluster processes data streams in parallel by dividing the data into small chunks, or partitions, and processing them independently. Each partition is processed by a task, which is a unit of work that runs on a node in the cluster.
Flink provides several APIs for building data processing applications, including the DataStream API, the DataSet API, and the Table API. The below diagram illustrates what a Flink cluster looks like.
Apache Flink Cluster
A Flink application runs on a cluster.
A Flink cluster has a job manager and a bunch of task managers.
A job manager is responsible for effective allocation and management of computing resources.
Task managers are responsible for the execution of a job.
Flink Job Execution
Client system submits job graph to the job manager
A client system prepares and sends a dataflow/job graph to the job manager.
It can be your Java/Scala/Python Flink application or the CLI.
The client is not part of the runtime or of program execution.
After submitting the job, the client can either disconnect and operate in detached mode or remain connected to receive progress reports in attached mode.
Given below is an illustration of what the job graph converted from code looks like:
Job Graph
The job graph is converted to an execution graph by the job manager
The execution graph is a parallel version of the job graph.
For each job vertex, it contains an execution vertex per parallel subtask.
An operator that exhibits a parallelism level of 100 will consist of a single job vertex and 100 execution vertices.
Given below is an illustration of what an execution graph looks like:
Execution Graph
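The job-graph-to-execution-graph expansion can be sketched as follows; the operator names and parallelism values are illustrative:

```python
# Sketch of job graph -> execution graph expansion: each job vertex (operator)
# with parallelism p becomes p execution vertices (parallel subtasks).

job_graph = [("source", 2), ("map", 100), ("sink", 1)]  # (operator, parallelism)

execution_graph = [
    f"{op}[{subtask}]"
    for op, parallelism in job_graph
    for subtask in range(parallelism)
]

print(len(execution_graph))  # 103 execution vertices for 3 job vertices
print(execution_graph[:3])   # ['source[0]', 'source[1]', 'map[0]']
```

This matches the statement above: an operator with parallelism 100 contributes a single job vertex but 100 execution vertices.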
Job manager submits the parallel instances of execution graph to task managers
Execution resources in Flink are defined through task slots.
Each task manager will have one or more task slots, each of which can run one pipeline of parallel tasks.
A pipeline consists of multiple successive tasks.
Parallel instances of execution graph being submitted to task slots
Flink Program
Flink programs look like regular programs that transform DataStreams. Each program consists of the same basic parts:
Obtain an execution environment
ExecutionEnvironment is the context in which a program is executed. This is how the execution environment is set up in Flink code:
```java
ExecutionEnvironment env = ExecutionEnvironment.createLocalEnvironment();  // if the program is running on a local machine
ExecutionEnvironment env = new CollectionEnvironment();                    // if the source is a collection
ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment(); // will do the right thing based on context
```
Connect to data stream
We can use an instance of the execution environment to connect to the data source, which can be a file system, a streaming application, or a collection. This is how we can connect to a data source in Flink:
```java
DataSet<String> data = env.readTextFile("file:///path/to/file");                     // to read from a file
DataSet<User> users = env.fromCollection(/* get elements from a Java Collection */); // to read from collections
DataSet<User> users = env.addSource(/* streaming application or database */);
```
Perform Transformations
We can perform transformations on the events/data that we receive from the data sources. A few of the data transformation operations are map, filter, keyBy, flatMap, etc.
Specify where to send the data
Once we have performed the transformation/analytics on the data that is flowing through the stream, we can specify where we will send the data. The destination can be a filesystem, database, or data streams.
```java
dataStream.sinkTo(/* streaming application or database API */);
```
Flink Transformations
Map: Takes one element at a time from the stream, performs some transformation on it, and emits one element of any type as output.
Given below is an example of Flink’s map operator:
```java
stream.map(new MapFunction<Integer, String>() {
  public String map(Integer integer) {
    // converts the number to words
    return " input -> " + integer + " : output -> " + numberToWords(integer.toString().toCharArray());
  }
}).print();
```
Filter: Evaluates a boolean function for each element and retains those for which the function returns true.
Given below is an example of Flink’s filter operator:
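A filter keeps only the elements for which the predicate returns true; here is a minimal, framework-agnostic Python sketch of that behavior (not Flink's Java API):

```python
# Stand-in sketch of filter semantics: retain only the elements for which the
# predicate function returns True. Conceptual, not Flink's operator.

def filter_stream(stream, predicate):
    return [element for element in stream if predicate(element)]

evens = filter_stream([1, 2, 3, 4, 5, 6], lambda x: x % 2 == 0)
print(evens)  # [2, 4, 6]
```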
KeyBy: Logically partitions a stream into disjoint partitions.
All records with the same key are assigned to the same partition.
Internally, keyBy() is implemented with hash partitioning.
The figure below illustrates how the keyBy operator works in Flink.
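The same-key-same-partition guarantee can be sketched with simple hash partitioning in Python (a conceptual illustration, not Flink's internal implementation):

```python
# Sketch of keyBy-style hash partitioning: records with the same key always
# land in the same logical partition.

def key_by(records, key_fn, num_partitions):
    partitions = [[] for _ in range(num_partitions)]
    for record in records:
        index = hash(key_fn(record)) % num_partitions  # same key -> same index
        partitions[index].append(record)
    return partitions

records = [("user1", 5), ("user2", 3), ("user1", 7)]
partitions = key_by(records, key_fn=lambda r: r[0], num_partitions=4)

# All "user1" records end up in exactly one partition:
user1_partitions = [i for i, p in enumerate(partitions) if any(r[0] == "user1" for r in p)]
print(len(user1_partitions))  # 1
```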
Fault Tolerance
Flink combines stream replay and checkpointing to achieve fault tolerance.
At a checkpoint, each operator’s corresponding state and the specific point in each input stream are marked.
Whenever checkpointing is done, a snapshot of the state of all the operators is saved in the state backend, which is generally the job manager’s memory or configurable durable storage.
Whenever there is a failure, operators are reset to the most recent state in the state backend, and event processing is resumed.
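The checkpoint-and-reset cycle described above can be simulated in a few lines of Python; the state backend, checkpoint id, and stream here are all toy stand-ins:

```python
# Sketch of checkpoint-based recovery: operator state is periodically copied to
# a "state backend", and after a failure processing resumes from the most
# recent checkpointed state and stream position. Conceptual only.

import copy

state_backend = {}  # stands in for job manager memory / durable storage

def checkpoint(checkpoint_id, operator_state, stream_position):
    state_backend[checkpoint_id] = (copy.deepcopy(operator_state), stream_position)

def recover():
    latest = max(state_backend)  # most recent completed checkpoint
    saved_state, position = state_backend[latest]
    return copy.deepcopy(saved_state), position

stream = [1, 2, 3, 4, 5]
state = {"sum": 0}
for position, event in enumerate(stream):
    state["sum"] += event
    if position == 2:
        checkpoint(checkpoint_id=1, operator_state=state, stream_position=position + 1)

# Simulated failure: reset to the checkpoint and replay the rest of the stream.
state, position = recover()
for event in stream[position:]:
    state["sum"] += event

print(state["sum"])  # 15 - same result as an uninterrupted run
```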
Checkpointing
Checkpointing is implemented using stream barriers.
Barriers are injected into the data stream at the source, e.g., Kafka, Kinesis, etc.
Barriers flow with the records as part of the data stream.
Refer to the diagram below to understand how checkpoint barriers flow with the events:
Checkpoint Barriers
Saving Snapshots
Operators snapshot their state at the point in time when they have received all snapshot barriers from their input streams, and before emitting the barriers to their output streams.
Once a sink operator (the end of a streaming DAG) has received barrier n from all of its input streams, it acknowledges snapshot n to the checkpoint coordinator.
After all sinks have acknowledged a snapshot, it is considered completed.
The below diagram illustrates how checkpointing is achieved in Flink with the help of barrier events, state backends, and checkpoint table.
Checkpointing
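The barrier-alignment rule above can be simulated with a small Python sketch: an operator snapshots its state only after it has seen the barrier on every one of its inputs (illustrative only, not Flink's implementation):

```python
# Sketch of barrier alignment: snapshot only once checkpoint barrier n has
# arrived on ALL input streams of the operator.

def maybe_snapshot(received_barriers, num_inputs, operator_state, snapshots):
    if len(received_barriers) == num_inputs:  # barrier seen on every input
        snapshots.append(dict(operator_state))
        received_barriers.clear()
        return True
    return False

snapshots = []
state = {"count": 7}
barriers = set()

barriers.add("input-0")
print(maybe_snapshot(barriers, 2, state, snapshots))  # False - still waiting on input-1
barriers.add("input-1")
print(maybe_snapshot(barriers, 2, state, snapshots))  # True  - snapshot taken
print(snapshots)                                      # [{'count': 7}]
```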
Recovery
Flink selects the latest completed checkpoint upon failure.
The system then re-deploys the entire distributed dataflow.
It gives each operator the state that was snapshotted as part of the checkpoint.
The sources are set to start reading the stream from the position given in the snapshot.
For example, in Apache Kafka, that means telling the consumer to start fetching messages from an offset given in the snapshot.
Scalability
A Flink job can be scaled up and scaled down as per requirement.
This can be done manually by:
Triggering a savepoint (manually triggered checkpoint)
Adding/Removing nodes to/from the cluster
Restarting the job from savepoint
OR
Automatically by Reactive Scaling:
The configuration of a job in Reactive Mode ensures that it utilizes all available resources in the cluster at all times.
Adding a Task Manager will scale up your job, and removing resources will scale it down.
Reactive Mode restarts a job on a rescaling event, restoring it from the latest completed checkpoint.
The only downside is that it works only in standalone mode.
Alternatives
Spark Streaming: It is an open-source distributed computing engine that has added streaming capabilities, but Flink is optimized for low-latency processing of real-time data streams and supports more complex processing scenarios.
Apache Storm: It is another open-source stream processing system that has a steeper learning curve than Flink and uses a different architecture based on spouts and bolts.
Apache Kafka Streams: It is a lightweight stream processing library built on top of Kafka, but it is not as feature-rich as Flink or Spark, and is better suited for simpler stream processing tasks.
Conclusion
In conclusion, Apache Flink is a powerful solution for real-time analytics. With its ability to process data in real-time and support for streaming data sources, it enables businesses to make data-driven decisions with minimal delay. The Flink ecosystem also provides a variety of tools and libraries that make it easy for developers to build scalable and fault-tolerant data processing pipelines.
One of the key advantages of Apache Flink is its support for event-time processing, which allows it to handle delayed or out-of-order data in a way that accurately reflects the sequence of events. This makes it particularly useful for use cases such as fraud detection, where timely and accurate data processing is critical.
Additionally, Flink’s support for multiple programming languages, including Java, Scala, and Python, makes it accessible to a broad range of developers. And with its seamless integration with popular big data platforms like Hadoop and Apache Kafka, it is easy to incorporate Flink into existing data infrastructure.
In summary, Apache Flink is a powerful and flexible solution for real-time analytics, capable of handling a wide range of use cases and delivering timely insights that drive business value.
Stream processing is a technology used to process large amounts of data in real-time as it is generated rather than storing it and processing it later.
Think of it like a conveyor belt in a factory. The conveyor belt constantly moves, bringing in new products that need to be processed. Similarly, stream processing deals with data that is constantly flowing, like a stream of water. Just like the factory worker needs to process each product as it moves along the conveyor belt, stream processing technology processes each piece of data as it arrives.
Stateful and stateless processing are two different approaches to stream processing, and the right choice depends on the specific requirements and needs of the application.
Stateful processing is useful in scenarios where the processing of an event or data point depends on the state of previous events or data points. For example, it can be used to maintain a running total or average across multiple events or data points.
Stateless processing, on the other hand, is useful in scenarios where the processing of an event or data point does not depend on the state of previous events or data points. For example, in a simple data transformation application, stateless processing can be used to transform each event or data point independently without the need to maintain state.
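The distinction can be shown with a small Python sketch; the running average carries state across events, while the doubling transform looks at each event in isolation:

```python
# Sketch of stateful vs. stateless processing (illustrative operations).

def stateless_transform(events):
    return [event * 2 for event in events]  # no memory of prior events

def stateful_running_average(events):
    total, count, averages = 0, 0, []
    for event in events:                    # state: running total and count
        total += event
        count += 1
        averages.append(total / count)
    return averages

print(stateless_transform([1, 2, 3]))       # [2, 4, 6]
print(stateful_running_average([1, 2, 3]))  # [1.0, 1.5, 2.0]
```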
Streaming analytics refers to the process of analyzing and processing data in real time as it is generated. Streaming analytics enables applications to react to events and make decisions in near real time.
Why Stream Processing and Analytics?
Stream processing is important because it allows organizations to make real-time decisions based on the data they are receiving. This is particularly useful in situations where timely information is critical, such as in financial transactions, network security, and real-time monitoring of industrial processes.
For example, in financial trading, stream processing can be used to analyze stock market data in real time and make split-second decisions to buy or sell stocks. In network security, it can be used to detect and respond to cyber-attacks in real time. And in industrial processes, it can be used to monitor production line efficiency and quickly identify and resolve any issues.
Stream processing is also important because it can process massive amounts of data, making it ideal for big data applications. With the growth of the Internet of Things (IoT), the amount of data being generated is growing rapidly, and stream processing provides a way to process this data in real time and derive valuable insights.
Collectively, stream processing provides organizations with the ability to make real-time decisions based on the data they are receiving, allowing them to respond quickly to changing conditions and improve their operations.
How is it different from Batch Processing?
Batch Data Processing:
Batch Data Processing is a method of processing where a group of transactions or data is collected over a period of time and is then processed all at once in a “batch”. The process begins with the extraction of data from its sources, such as IoT devices or web/application logs. This data is then transformed and integrated into a data warehouse. The process is generally called the Extract, Transform, Load (ETL) process. The data warehouse is then used as the foundation for an analytical layer, which is where the data is analyzed, and insights are generated.
Stream/Real-time Data Processing:
Real-time data streaming involves the continuous flow of data that is generated in real time, typically from multiple sources such as IoT devices or web/application logs. A message broker is used to manage the flow of data between the stream processors, the analytical layer, and the data sink. The message broker ensures that the data is delivered in the correct order and that it is not lost. Stream processors are used to perform data ingestion and processing. These processors take in the data streams and process them in real time. The processed data is then sent to an analytical layer, where it is analyzed and insights are generated.
Processes Involved in Stream Processing and Analytics:
The process of stream processing can be broken down into the following steps:
Data Collection: The first step in stream processing is collecting data from various sources, such as sensors, social media, and transactional systems. The data is then fed into a stream processing system in real time.
Data Ingestion: Once the data is collected, it needs to be ingested or taken into the stream processing system. This involves converting the data into a standard format that can be processed by the system.
Data Processing: The next step is to process the data as it arrives. This involves applying various processing algorithms and rules to the data, such as filtering, aggregating, and transforming the data. The processing algorithms can be applied to individual events in the stream or to the entire stream of data.
Data Storage: After the data has been processed, it is stored in a database or data warehouse for later analysis. The storage can be configured to retain the data for a specific amount of time or to retain all the data.
Data Analysis: The final step is to analyze the processed data and derive insights from it. This can be done using data visualization tools or by running reports and queries on the stored data. The insights can be used to make informed decisions or to trigger actions, such as sending notifications or triggering alerts.
It’s important to note that stream processing is an ongoing process, with data constantly being collected, processed, and analyzed in real time. The visual representation of this process can be represented as a continuous cycle of data flowing through the system, being processed and analyzed at each step along the way.
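The five steps above can be sketched end-to-end on a single micro-batch of events; the event shape, field names, and alert threshold are made up for illustration:

```python
# Toy end-to-end sketch: collect -> ingest -> process -> store -> analyze.

import json

raw_events = ['{"sensor": "a", "temp": 21}', '{"sensor": "b", "temp": 35}']  # 1. collection

ingested = [json.loads(e) for e in raw_events]       # 2. ingestion: normalize the format

processed = [e for e in ingested if e["temp"] > 30]  # 3. processing: filter hot readings

store = []
store.extend(processed)                              # 4. storage: retain for later analysis

alerts = [f"ALERT: {e['sensor']} at {e['temp']}" for e in store]  # 5. analysis: trigger alerts
print(alerts)  # ['ALERT: b at 35']
```

In a real system each step runs continuously over an unbounded stream rather than over a fixed list.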
Stream Processing Platforms & Frameworks:
Stream Processing Platforms & Tools are software systems that enable the collection, processing, and analysis of real-time data streams.
Stream Processing Frameworks:
A stream processing framework is a software library or framework that provides a set of tools and APIs for developers to build custom stream processing applications. Frameworks typically require more development effort and configuration to set up and use. They provide more flexibility and control over the stream processing pipeline but also require more development and maintenance resources.
Let’s first look into the most commonly used stream processing frameworks: Apache Flink & Apache Spark Streaming.
Apache Flink:
Flink is an open-source, unified stream-processing and batch-processing framework. Flink executes arbitrary dataflow programs in a data-parallel and pipelined manner, making it ideal for processing huge amounts of data in real-time.
Flink provides out-of-the-box checkpointing and state management, two features that make managing enormous amounts of data relatively easy.
The event processing function, the filter function, and the mapping function are other features that make handling a large amount of data easy.
Flink also comes with real-time indicators and alerts, which make a big difference when it comes to data processing and analysis.
Note: We have discussed stream processing and analytics in detail in Stream Processing and Analytics with Apache Flink
Apache Spark Streaming:
Apache Spark Streaming is a scalable fault-tolerant streaming processing system that natively supports both batch and streaming workloads. Spark Streaming is an extension of the core Spark API that allows data engineers and data scientists to process real-time data.
Great for solving complicated transformative logic
Easy to program
Runs at blazing speeds
Processes large data within a fraction of a second
Stream Processing Platforms:
A stream processing platform is an end-to-end solution for processing real-time data streams. Platforms typically require less development effort and maintenance as they provide pre-built tools and functionality for processing, analyzing, and visualizing data.
Examples: Apache Kafka, Amazon Kinesis, Google Cloud Pub-Sub
Let’s look into the most commonly used stream processing platforms: Apache Kafka & AWS Kinesis.
Apache Kafka:
Apache Kafka is an open-source distributed event streaming platform for high-performance data pipelines, streaming analytics, data integration, and mission-critical applications.
Because it’s open-source, Kafka generally requires a higher skill set to operate and manage, so it’s typically used for development and testing.
APIs allow “producers” to publish data streams to “topics”. A “topic” is a partitioned log of records; each “partition” is ordered and immutable; “consumers” subscribe to “topics”.
It can run on a cluster of “brokers” with partitions split across cluster nodes.
Messages can be effectively unlimited in size (up to 2 GB).
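The topic/partition/offset model described above can be sketched with a toy Python class (this is a conceptual model, not the real Kafka client API):

```python
# Toy sketch of the Kafka-style model: a topic is a set of append-only
# partition logs; producers publish by key, consumers read in order by offset.

class Topic:
    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def publish(self, key, value):            # "producer" side
        index = hash(key) % len(self.partitions)
        self.partitions[index].append(value)  # partitions are append-only

    def consume(self, partition, offset):     # "consumer" side reads from an offset
        return self.partitions[partition][offset:]

orders = Topic(num_partitions=2)
orders.publish("user1", "order-1")
orders.publish("user1", "order-2")

p = hash("user1") % 2
print(orders.consume(p, offset=0))  # ['order-1', 'order-2'] - same key, same partition, in order
```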
AWS Kinesis:
Amazon Kinesis is a cloud-based service on Amazon Web Services (AWS) that allows you to ingest real-time data such as application logs, website clickstreams, and IoT telemetry data for machine learning and analytics, as well as video and audio.
Amazon Kinesis is a SaaS offering, reducing the complexities in the design, build, and manage stages compared to open-source Apache Kafka. It’s ideally suited for building microservices architectures.
“Producers” push data onto the stream as soon as it is generated. Kinesis breaks the stream across “shards” (which are like partitions).
Shards have a hard limit on the number of transactions and data volume per second. If you need more volume, you must subscribe to more shards. You pay for what you use.
Most maintenance and configurations are hidden from the user. Scaling is easy (adding shards) compared to Kafka.
Maximum message size is 1MB.
Three Characteristics of an Event Stream Processing Platform:
Publish and Subscribe:
In a publish-subscribe model, producers publish events or messages to streams or topics, and consumers subscribe to streams or topics to receive the events or messages. This is similar to a message queue or enterprise messaging system. It allows for the decoupling of the producer and consumer, enabling them to operate independently and asynchronously.
Store streams of events in a fault-tolerant way
This means that the platform is able to store and manage events in a reliable and resilient manner, even in the face of failures or errors. To achieve fault tolerance, event streaming platforms typically use a variety of techniques, such as replicating data across multiple nodes, and implementing data recovery and failover mechanisms.
Process streams of events in real-time, as they occur
This means that the platform can process and analyze data as it is generated rather than waiting for data to be batch-processed or stored for later processing.
Challenges When Designing a Stream Processing and Analytics Solution:
Stream processing is a powerful technology, but there are also several challenges associated with it, including:
Late arriving data: Data that is delayed or arrives out of order can disrupt the processing pipeline and lead to inaccurate results. Stream processing systems need to be able to handle out-of-order data and reconcile it with the data that has already been processed.
Missing data: If data is missing or lost, it can impact the accuracy of the processing results. Stream processing systems need to be able to identify missing data and handle it appropriately, whether by skipping it, buffering it, or alerting a human operator.
Duplicate data: Duplicate data can lead to over-counting and skewed results. Stream processing systems need to be able to identify and de-duplicate data to ensure accurate results.
Data skew: Data skew occurs when there is a disproportionate amount of data for certain key fields or time periods. This can lead to performance issues, processing delays, and inaccurate results. Stream processing systems need to be able to handle data skew by load balancing and scaling resources appropriately.
Fault tolerance: Stream processing systems need to be able to handle hardware and software failures without disrupting the processing pipeline. This requires fault-tolerant design, redundancy, and failover mechanisms.
Data security and privacy: Real-time data processing often involves sensitive data, such as personal information, financial data, or intellectual property. Stream processing systems need to ensure that data is securely transmitted, stored, and processed in compliance with regulatory requirements.
Latency: Another challenge with stream processing is latency or the amount of time it takes for data to be processed and analyzed. In many applications, the results of the analysis need to be produced in real-time, which puts pressure on the stream processing system to process the data quickly.
Scalability: Stream processing systems must be able to scale to handle large amounts of data as the amount of data being generated continues to grow. This can be a challenge because the systems must be designed to handle data in real-time while also ensuring that the results of the analysis are accurate and reliable.
Maintenance: Maintaining a stream processing system can also be challenging, as the systems are complex and require specialized knowledge to operate effectively. In addition, the systems must be able to evolve and adapt to changing requirements over time.
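Two of these mitigations, de-duplication and late-data handling, can be sketched in a framework-agnostic way; the event-id and watermark representations are illustrative:

```python
# Sketch of two mitigations: dropping duplicate deliveries by event id, and
# flagging events that arrive behind a watermark as "late".

def deduplicate(events):
    seen, unique = set(), []
    for event_id, value in events:
        if event_id not in seen:  # drop repeated deliveries of the same event
            seen.add(event_id)
            unique.append((event_id, value))
    return unique

def split_late(events, watermark_ms):
    on_time = [e for e in events if e[0] >= watermark_ms]
    late = [e for e in events if e[0] < watermark_ms]  # arrived behind the watermark
    return on_time, late

print(deduplicate([(1, "a"), (2, "b"), (1, "a")]))              # [(1, 'a'), (2, 'b')]
print(split_late([(1000, "x"), (400, "y")], watermark_ms=500))  # ([(1000, 'x')], [(400, 'y')])
```

Real systems typically buffer or reroute late events (e.g., to a side output) rather than simply discarding them.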
Despite these challenges, stream processing remains an important technology for organizations that need to process data in real time and make informed decisions based on that data. By understanding these challenges and designing the systems to overcome them, organizations can realize the full potential of stream processing and improve their operations.
Key benefits of stream processing and analytics:
Real-time processing keeps you in sync all the time:
For example: Suppose an online retailer uses a distributed system to process orders. The system might include multiple components, such as a web server, a database server, and an application server. The different components could be kept in sync by real-time processing: orders are processed as they are received, and the database is updated accordingly. As a result, orders would be accurate and processed efficiently by maintaining a consistent view of the data.
Real-time data processing is more accurate and timely:
For example, a financial trading system that processes data in real time can help ensure that trades are executed at the best possible prices, improving the accuracy and timeliness of the trades.
Deadlines are met with real-time processing:
For example: In a control system, it may be necessary to respond to changing conditions within a certain time frame in order to maintain the stability of the system.
Real-time processing is quite reactive:
For example, a real-time processing system might be used to monitor a manufacturing process and trigger an alert if it detects a problem or to analyze sensor data from a power plant and adjust the plant’s operation in response to changing conditions.
Real-time processing involves multitasking:
For example, consider a real-time monitoring system that is used to track the performance of a large manufacturing plant. The system might receive data from multiple sensors and sources, including machine sensors, temperature sensors, and video cameras. In this case, the system would need to be able to multitask in order to process and analyze data from all of these sources in real time and to trigger alerts or take other actions as needed.
Real-time processing works independently:
For example, a real-time processing system may rely on a database or message queue to store and retrieve data, or it may rely on external APIs or services to access additional data or functionality.
Use case studies:
There are many real-life examples of stream processing in different industries that demonstrate the benefits of this technology. Here are a few examples:
Financial Trading: In the financial industry, stream processing is used to analyze stock market data in real time and make split-second decisions to buy or sell stocks. This allows traders to respond to market conditions in real time and improve their chances of making a profit.
Network Security: Stream processing is also used in network security to detect and respond to cyber-attacks in real-time. By processing network data in real time, security systems can quickly identify and respond to threats, reducing the risk of a data breach.
Industrial Monitoring: In the industrial sector, stream processing is used to monitor production line efficiency and quickly identify and resolve any issues. For example, it can be used to monitor the performance of machinery and identify any potential problems before they cause a production shutdown.
Social Media Analysis: Stream processing is also used to analyze social media data in real time. This allows organizations to monitor brand reputation, track customer sentiment, and respond to customer complaints in real time.
Healthcare: In the healthcare industry, stream processing is used to monitor patient data in real time and quickly identify any potential health issues. For example, it can be used to monitor vital signs and alert healthcare providers if a patient’s condition worsens.
These are just a few examples of the many real-life applications of stream processing. Across all industries, stream processing provides organizations with the ability to process data in real time and make informed decisions based on the data they are receiving.
How to start stream analytics?
Our recommendation when building a dedicated platform is to focus on choosing a versatile stream processor to pair with your existing analytical interface.
Or, keep an eye on vendors who offer both stream processing and BI as a service.
Resources:
Here are some useful resources for learning more about stream processing:
These resources will provide a good starting point for learning more about stream processing and how it can be used to solve real-world problems.
Conclusion:
Real-time data analysis and decision-making require stream processing and analytics in diverse industries, including finance, healthcare, and e-commerce. Organizations can improve operational efficiency, customer satisfaction, and revenue growth by processing data in real time. A robust infrastructure, skilled personnel, and efficient algorithms are required for stream processing and analytics. Businesses need stream processing and analytics to stay competitive and agile in today’s fast-paced world as data volumes and complexity continue to increase.
Machine learning has become part of day-to-day life. Small tasks like searching for songs on YouTube and getting suggestions on Amazon use ML in the background. This is a well-developed field of technology with immense possibilities. But how can we use it?
This blog is aimed at explaining how easy it is to use machine learning models (which will act as a brain) to build powerful ML-based Flutter applications. We will briefly touch on the following points:
1. Definitions
Let’s jump to the part where most people are confused. A person who is not exposed to the IT industry might think AI, ML, & DL are all the same. So, let’s understand the difference.
Figure 01
1.1. Artificial Intelligence (AI):
AI, i.e. artificial intelligence, is a concept of machines being able to carry out tasks in a smarter way. You all must have used YouTube. In the search bar, you can type the lyrics of any song, even lyrics that are not necessarily the starting part of the song or title of songs, and get almost perfect results every time. This is the work of a very powerful AI. Artificial intelligence is the ability of a machine to do tasks that are usually done by humans. This ability is special because the task we are talking about requires human intelligence and discernment.
1.2. Machine Learning (ML):
Machine learning is a subset of artificial intelligence. It is based on the idea that we expose machines to new data, which can be complete or partially raw, and let the machine decide the future output. We can also say it is a sub-field of AI that deals with the extraction of patterns from data sets. By processing new data sets and the feedback from its last results, the machine slowly converges on the expected result. This means that the machine can find rules for optimal behavior to produce new output. It can also adapt itself to changing data, just like humans.
1.3. Deep Learning (DL):
Deep learning is again a smaller subset of machine learning, which is essentially a neural network with multiple layers. These neural networks attempt to simulate the behavior of the human brain, so you can say we are trying to create an artificial human brain. With one layer of a neural network, we can still make approximate predictions, and additional layers can help to optimize and refine for accuracy.
2. Types of ML
Before starting the implementation, we need to know the types of machine learning because it is very important to know which type is more suitable for our expected functionality.
Figure 02
2.1. Supervised Learning
As the name suggests, in supervised learning, the learning happens under supervision. Supervision means the data provided to the machine is already classified, i.e., each piece of data has a fixed label, and inputs are already mapped to outputs. Once the machine has learned, it is ready to classify new data. This learning is useful for tasks like fraud detection, spam filtering, etc.
2.2. Unsupervised Learning
In unsupervised learning, the data given to machines to learn is purely raw, with no tags or labels. Here, the machine is the one that will create new classes by extracting patterns. This learning can be used for clustering, association, etc.
2.3. Semi-Supervised Learning
Both supervised and unsupervised learning have their own limitations, because one requires labeled data and the other does not, so this learning combines the behavior of both, and with that, we can overcome the limitations. In this learning, we feed both raw data and categorized data to the machine so it can classify the raw data and, if necessary, create new clusters.
2.4. Reinforcement Learning
For this learning, we feed the feedback on the last output, along with new incoming data, to the machine so it can learn from its mistakes. This feedback-based process continues until the machine reaches the desired output. The feedback is given by humans in the form of punishment or reward. This is like a search algorithm that gives you a list of results and learns from whether users click on anything other than the first result. It is like a human child who learns from every available option and grows by correcting its mistakes.
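To make the supervised case above concrete, here is a toy sketch in JavaScript: a 1-nearest-neighbour classifier trained on a handful of labelled 2D points. The data and labels are made up for illustration; real systems use far richer features and models.

```javascript
// Toy supervised learning: labelled examples map inputs to known outputs.
const training = [
  { features: [1, 1], label: "ham" },
  { features: [1, 2], label: "ham" },
  { features: [8, 9], label: "spam" },
  { features: [9, 8], label: "spam" },
];

// Euclidean distance between two 2D points.
function distance(a, b) {
  return Math.hypot(a[0] - b[0], a[1] - b[1]);
}

// Classify a new point by the label of its nearest training example.
function classify(point) {
  let best = training[0];
  for (const example of training) {
    if (distance(point, example.features) < distance(point, best.features)) {
      best = example;
    }
  }
  return best.label;
}
```

A point near the "ham" cluster, such as `[2, 1]`, is classified as ham, while one near `[8, 8]` is classified as spam; this is exactly the "classification of new data" that supervised learning enables.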
3. TensorFlow
Machine learning is a complex process where we need to perform multiple activities, like acquiring and processing data, training models, serving predictions, and refining future results.
To perform such operations, Google developed a framework in November 2015 called TensorFlow. All the above-mentioned processes can become easy if we use the TensorFlow framework.
For this project, we are not going to use the complete TensorFlow framework but a small tool called TensorFlow Lite.
3.1. TensorFlow Lite
TensorFlow Lite allows us to run the machine learning models on devices with limited resources, like limited RAM or memory.
3.2. TensorFlow Lite Features
Optimized for on-device ML by addressing five key constraints:
Latency: because there’s no round-trip to a server
Privacy: because no personal data leaves the device
Connectivity: because internet connectivity is not required
Size: because of a reduced model and binary size
Power consumption: because of efficient inference and a lack of network connections
Support for Android and iOS devices, embedded Linux, and microcontrollers
Support for Java, Swift, Objective-C, C++, and Python programming languages
High performance, with hardware acceleration and model optimization
End-to-end examples for common machine learning tasks such as image classification, object detection, pose estimation, question answering, text classification, etc., on multiple platforms
4. What is Flutter?
Flutter is an open-source, cross-platform development framework. With Flutter, using a single code base, we can create applications for Android, iOS, the web, and desktop. It was created by Google and uses Dart as its development language. The first stable version of Flutter was released in December 2018, and since then, there have been many improvements.
5. Building an ML-Flutter Application
We are now going to build a Flutter application through which we can find the state of mind of a person from their facial expressions. The below steps explain the update we need to do for an Android-native application. For an iOS application, please refer to the links provided in the steps.
5.1. TensorFlow Lite – Native setup (Android)
In android/app/build.gradle, add the following setting in the android block:
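For the commonly used TensorFlow Lite Flutter plugins, this is typically an `aaptOptions` entry that stops Gradle from compressing the model files so they can be memory-mapped at runtime; treat this snippet as an assumption and verify the exact setting against your plugin's documentation:

```gradle
android {
    // Prevent Gradle from compressing the model files
    // (setting assumed from common tflite plugin setups).
    aaptOptions {
        noCompress 'tflite'
        noCompress 'lite'
    }
}
```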
There are three different categories of ML projects available. We’ll choose an image project since we’re going to develop a project that analyzes a person’s facial expression to determine their emotional condition.
The other two types, audio project and pose project, will be useful for creating projects that involve audio operation and human pose indication, respectively.
Select Standard Image model
Once more, there are two distinct groups of image machine learning projects. Since we are creating a project for an Android smartphone, we will select a standard image project.
The other type, an Embedded Image Model project, is designed for hardware with relatively little memory and computing power.
Upload images for training the classes
We will create new classes by clicking on “Add a class.”
We must upload photographs to these classes as we are developing a project that analyzes a person’s emotional state from their facial expression.
The more photographs we upload, the more precise our result will be.
Click on train model and wait till training is over
Click on Export model
Select TensorFlow Lite Tab -> Quantized button -> Download my model
5.4. Add files/models to the Flutter project
labels.txt
This file contains all the class names you created during model creation.
*.tflite
This file contains the trained model; it is downloaded as a ZIP along with its associated files.
5.5. Load & Run ML-Model
We import the model from assets; this step is crucial, as the model will serve as the project's brain.
Here, we're configuring the camera using a camera controller and obtaining a live feed (cameras[0] is the front camera).
6. Conclusion
As discussed in this blog, we can easily build powerful ML-based Flutter applications by using TensorFlow Lite models as the brain of the app.
In this blog, we will talk about design systems, dive into the different types of CSS frameworks/libraries, and look into the issues that come with choosing a framework that is not right for your type of project. Then we will go over different business use cases where these frameworks/libraries match the requirements.
Let’s paint a scenario: when starting a project, you start by choosing a JS framework. Let’s say, for example, that you went with a popular framework like React. Depending on whether you want an isomorphic app, you will look at Next.js. Next, we choose a UI framework, and that’s when our next set of problems appears.
WHICH ONE?
It’s hard to go with even the popular ones because it might not be what you are looking for. There are different types of libraries handling different types of use cases, and there are so many similar ones that each handle stuff slightly differently.
These frameworks come and go, so it’s essential to understand the fundamentals of CSS. These libraries and frameworks help you build faster; they don’t change how CSS works.
But, continuing with our scenario, let’s say we choose a popular library like Bootstrap, or Material. Then, later on, as you’re working through the project, you notice issues like:
– Overriding default classes more than required
– Ending up with ugly-looking code that's hard to read
– Bloated CSS that reduces performance (flash of unstyled content, worse CLS and FCP scores)
– Designs changing swiftly while you're stuck with a rigid framework, so migrating requires a lot more effort
– Requiring swift development but ending up building from scratch
– Ending up with a div soup with no semantic meaning
To solve these problems and understand how these frameworks work, we have segregated them into the following category types.
We will dig into each category and look at how they work, their pros/cons and their business use case.
Categorizing the available frameworks:
Vanilla Libraries
These libraries allow you to write vanilla CSS with some added benefits like vendor prefixing, component-level scoping, etc. You can use this as a building block to create your own styling methodology. Essentially, it’s mainly CSS in JS-type libraries that come in this type of category. CSS modules would also come under these as well since you are writing CSS in a module file.
Also, inline styles in React seem to resemble a css-in-js type method, but they are different. For inline styles, you would lose out on media queries, keyframe animations, and selectors like pseudo-class, pseudo-element, and attribute selectors. But css-in-js type libraries have these abilities.
They also differ in how the CSS is output: inline styling results in inline CSS on the HTML element, whereas css-in-js outputs internal styles with generated class names.
Nowadays, these css-in-js types are popular for their optimized critical render path strategy for performance.
```javascript
const Button = styled.a`
  /* This renders the buttons above... Edit me! */
  display: inline-block;
  border-radius: 3px;
  padding: 0.5rem 0;
  margin: 0.5rem 1rem;
  width: 11rem;
  background: transparent;
  color: white;
  border: 2px solid white;

  /* The GitHub button is a primary button
   * edit this to target it specifically! */
  ${props => props.primary && css`
    background: white;
    color: black;
  `}
`;
```
List of example frameworks:
– Styled components
– Emotion
– Vanilla-extract
– Stitches
– CSS modules (CSS modules is not an official spec or an implementation in the browser, but rather, it’s a process in a build step (with the help of Webpack or Browserify) that changes class names and selectors to be scoped.)
Pros:
Fully customizable—you can build on top of it
Doesn’t bloat CSS, only loads needed CSS
Performance
Little to no style collision
Cons:
Requires effort and time to make components from scratch
Danger of writing smelly code
Have to handle accessibility on your own
Where would you use these?
A website with an unconventional design that must be built from scratch.
Where performance and high webvital scores are required—the performance, in this case, refers to an optimized critical render path strategy that affects FCP and CLS.
Generally, it would be user-facing applications like B2C.
Unstyled / Functional Libraries
Before coming to the library, we would like to cover a bit on accessibility.
Apart from a website’s visual stuff, there is also a functional aspect, accessibility.
And many times, when we say accessibility in the context of web development, people automatically think of screen readers. But it doesn't just mean website accessibility for people with a disability; it also means enabling as many people as possible to use your websites, whether or not they have a disability, including people who are only situationally limited.
– Different age groups: font size settings on phones and in browsers should be reflected in the app
– Situational limitations: dark mode and light mode
– Different devices: mobile, desktop, tablet
– Different screen sizes: ultra-wide 21:9, normal 16:9 monitor sizes
– Interaction methods: websites can be accessible with keyboard only, mouse, touch, etc.
But these libraries mostly handle accessibility for the disabled, then interaction methods and focus management. The rest is left to developers, which includes settings that are more visual in nature, like screen sizes, light/dark mode etc.
In general, ARIA attributes and roles are used to provide information about the interaction of a complex widget. The libraries here sprinkle this information onto their components before giving them to be styled.
So, in short, these are low-level UI libraries that handle the functional part of the UI elements, like accessibility, keyboard navigation, or how they work. They come with little-to-no styling, which is meant to be overridden.
Radix UI
```jsx
// Compose a Dialog with custom focus management
export const InfoDialog = ({ children }) => {
  const dialogCloseButton = React.useRef(null);

  return (
    <Dialog.Root>
      <Dialog.Trigger>View details</Dialog.Trigger>
      <Dialog.Overlay />
      <Dialog.Portal>
        <DialogContent
          onOpenAutoFocus={(event) => {
            // Focus the close button when dialog opens
            dialogCloseButton.current?.focus();
            event.preventDefault();
          }}
        >
          {children}
          <Dialog.Close ref={dialogCloseButton}>Close</Dialog.Close>
        </DialogContent>
      </Dialog.Portal>
    </Dialog.Root>
  );
};
```
Pros:
Gives the flexibility to create composable elements
Unopinionated styling, free to override
Cons:
Can’t be used for a rapid development project or prototyping
Have to understand the docs thoroughly to continue development at a normal pace
Where would you use these?
Websites like news or articles won’t require this.
Applications where accessibility is more important than styling and design (Government websites, banking, or even internal company apps).
Applications where importance is given to both accessibility and design, so customizability to these components is preferred (Teamflow, CodeSandbox, Vercel).
Can be paired with Vanilla libraries to provide performance with accessibility.
Can be paired with utility-style libraries to provide relatively faster development with accessibility.
Utility Styled Library / Framework
These types of libraries allow you to style your elements through their interfaces, either through class names or component props using composable individual CSS properties as per your requirements. The strongest point you have with such libraries is the flexibility of writing custom CSS properties. With these libraries, you would often require a “wrapper” class or components to be able to reuse them.
These libraries dump these utility classes into your HTML, impacting your performance. Though there is still an option to improve the performance by purging the unused CSS from your project in a build step, even with that, the performance won’t be as good as css-in-js. The purging would look at the class names throughout the whole project and remove them if there is no reference. So, when loading a page, it would still load CSS that is not being used on the current page but another one.
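A short hypothetical snippet (Tailwind-style utility class names, not taken from any one framework) shows what styling through class names in your markup looks like:

```jsx
{/* Hypothetical utility-class markup: each class applies one CSS property
    or a small composable group of properties directly on the element. */}
<a className="inline-block rounded px-4 py-2 m-2 w-44 bg-transparent text-white border-2 border-white">
  View details
</a>
```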
List of example frameworks:
– Chakra UI (although it has some prebuilt components, its concept is derived from Tailwind)
– Tachyons
– xStyled
Pros:
Rapid development and prototyping
Gives flexibility to styling
Enforces a little consistency; you don’t have to use magic numbers while creating the layout (spacing values, responsive variables like xs, sm, etc.)
Less context switching—you’ll write CSS in your HTML elements
Cons:
End up with ugly-looking/hard-to-read code
No pre-built components, so you have to handle accessibility yourself
Creates a global stylesheet that would have unused classes
Where would you use these?
Easier composition of simpler components to build large applications.
Modular applications where rapid customization is required, like font sizes, color pallets, themes, etc.
FinTech or healthcare applications where you need features like theme-based toggling in light/dark mode to be already present.
Application where responsive design is supported out of the box, along with ease of accessibility and custom breakpoints for responsiveness.
Pre-styled / All-In-One Framework
These are popular frameworks that come with pre-styled, ready-to-use components out of the box with little customization.
These are heavy libraries that have fixed styling that can be overridden. However, generally speaking, overriding the classes would just load in extra CSS, which just clogs up the performance. These kinds of libraries are generally more useful for rapid prototyping and not in places with heavy customization and priority on performance.
These are quite beginner friendly as well, but if you are a beginner, it is best to understand the basics and fundamentals of CSS rather than fully relying on frameworks like these as your crutches. Although, these frameworks have their pros with speed of development.
```jsx
<Accordion isExpanded={true} useArrow={true}>
  <AccordionLabel className="editor-accordion-label">RULES</AccordionLabel>
  <AccordionSection>
    <div className="editor-detail-panel editor-detail-panel-column">
      <div className="label">Define conditional by adding a rule</div>
      <div className="rule-actions"></div>
    </div>
  </AccordionSection>
</Accordion>
```
List of example frameworks:
Bootstrap
Semantic UI
Material UI
Bulma
Mantine
Pros:
Faster development, saves time since everything comes out of the box.
Helps avoid cross-browser bugs
Helps follow best practices (accessibility)
Cons:
Low customization
Have to become familiar with the framework and its nuances
Bloated CSS, since it loads in everything from the framework on top of overridden styles
Where would you use these?
Focus is not on nitty-gritty design but on development speed and functionality.
Enterprise apps where the UI structure of the application isn’t dynamic and doesn’t get altered a lot.
B2B apps mostly where the focus is on getting the functionality out fast—UX is mostly driven by ease of use of the functionality with a consistent UI design.
Applications where you want to focus more on cross-browser compatibility.
Conclusion:
This is not a hard and fast rule; there are still a bunch of parameters that aren’t covered in this blog, like developer preference, or legacy code that already uses a pre-existing framework. So, pick one that seems right for you, considering the parameters in and outside this blog and your judgment.
To summarize a little on the pros and cons of the above categories, here is a TLDR diagram:
Historically, embedded analytics was thought of as an integral part of a comprehensive business intelligence (BI) system. However, when we considered our particular needs, we soon realized something more innovative was necessary. That is when we came across Cube (formerly CubeJS), a powerful platform that could revolutionize how we think about embedded analytics solutions.
This new way of modularizing analytics solutions means businesses can access the exact services and features they require at any given time without purchasing a comprehensive suite of analytics services, which can often be more expensive and complex than necessary.
Furthermore, Cube makes it very easy to link up data sources and start to get to grips with analytics, which provides clear and tangible benefits for businesses. This new tool has the potential to be a real game changer in the world of embedded analytics, and we are very excited to explore its potential.
Understanding Embedded Analytics
When you read a word like “embedded analytics” or something similar, you probably think of an HTML embed tag or an iFrame tag. This is because analytics was considered a separate application and not part of the SaaS application, so the market had tools specifically for analytics.
“Embedded analytics is a digital workplace capability where data analysis occurs within a user’s natural workflow, without the need to toggle to another application. Moreover, embedded analytics tends to be narrowly deployed around specific processes such as marketing campaign optimization, sales lead conversions, inventory demand planning, and financial budgeting.” – Gartner
Embedded Analytics is not just about importing data into an iFrame—it’s all about creating an optimal user experience where the analytics feel like they are an integral part of the native application. To ensure that the user experience is as seamless as possible, great attention must be paid to how the analytics are integrated into the application. This can be done with careful thought to design and by anticipating user needs and ensuring that the analytics are intuitive and easy to use. This way, users can get the most out of their analytics experience.
Existing Solutions
With the rising need for SaaS applications and the number of SaaS applications being built daily, analytics must be part of the SaaS application.
We have identified three different categories of existing solutions available in the market.
Traditional BI Platforms
Many tools, such as GoodData, Tableau, Metabase, Looker, and Power BI, are part of the big, traditional BI platforms. Despite their wide range of features and capabilities, these platforms are held back by their big monolith architecture, limited customization, and less-than-intuitive user interfaces, making them difficult and time-consuming to adopt.
Here are a few reasons these are not suitable for us:
They lack customization, and their UI is not intuitive, so they won’t be able to match our UX needs.
They charge a hefty amount, which is unsuitable for startups or small-scale companies.
They have a big monolith architecture, making integrating with other solutions difficult.
New Generation Tools
The next experiment taking place in the market is the introduction of tools such as Hex, Observable, Streamlit, etc. These tools are better suited for embedded needs and customization, but they are designed for developers and data scientists. Although the go-to-market time is shorter, these tools cannot be integrated into SaaS applications.
Here are a few reasons why these are not suitable for us:
They are not suitable for non-technical people and cannot integrate with Software-as-a-Service (SaaS) applications.
Since they are mainly built for developers and data scientists, they don’t provide a good user experience.
They are not capable of handling multiple data sources simultaneously.
They do not provide pre-aggregation and caching solutions.
In House Tools
Building everything in-house from scratch, instead of paying other platforms, is possible using API servers and GraphQL. However, there is a catch: the requirements for analytics are not straightforward and will require a lot of expertise to build, causing a big hurdle in adoption and resulting in a longer time-to-market.
Here are a few reasons why these are not suitable for us:
Building everything in-house requires a lot of expertise and time, thus resulting in a longer time to market.
It requires developing a secure authentication and authorization system, which adds to the complexity.
It requires the development of a caching system to improve the performance of analytics.
It requires the development of a real-time system for dynamic dashboards.
It requires the development of complex SQL queries to query multiple data sources.
Typical Analytics Features
If you want to build analytics features, the typical requirements look like this:
Multi-Tenancy
When developing software-as-a-service (SaaS) applications, it is often necessary to incorporate multi-tenancy into the architecture. This means multiple users will be accessing the same software application, but with a unique and individualized experience. To guarantee that this experience is not compromised, it is essential to ensure that the same multi-tenancy principles are carried over into the analytics solution that you are integrating into your SaaS application. It is important to remember that this will require additional configuration and setup on your part to ensure that all of your users have access to the same level of tools and insights.
Intuitive Charts
If you look at some of the available analytics tools, they may have good charting features, but they may not be able to meet your specific UX needs. In today’s world, many advanced UI libraries and designs are available, which are often far more effective than the charting features of analytics tools. Integrating these solutions could help you create a more user-friendly experience tailored specifically to your business requirements.
Security
You want to have authentication and authorization for your analytics so that managers can get an overview of the analytics for their entire team, while individual users can only see their own analytics. Furthermore, you may want to grant users with certain roles access to certain analytics charts and other data to better understand how their team is performing. To ensure that your analytics are secure and that only the right people have access to the right information, it is vital to set up an authentication and authorization system.
Caching
Caching is an incredibly powerful tool for improving the performance and economics of serving your analytics. By implementing a good caching solution, you can see drastic improvements in the speed and efficiency of your analytics, while also providing an improved user experience. Additionally, the cost savings associated with this approach can be quite significant, providing you with a greater return on investment. Caching can be implemented in various ways, but the most effective approaches are tailored to the specific needs of your analytics. By leveraging the right caching solutions, you can maximize the benefits of your analytics and ensure that your users have an optimized experience.
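The simplest form of the idea above is a time-to-live (TTL) cache in front of the query layer: repeated requests for the same analytics result are served from memory until the entry expires. A minimal sketch (the TTL value and key scheme are illustrative, not prescriptive):

```javascript
// Minimal TTL cache for query results.
function createCache(ttlMs) {
  const store = new Map();
  return {
    get(key) {
      const entry = store.get(key);
      if (!entry) return undefined;
      // Evict entries older than the TTL so stale results are re-queried.
      if (Date.now() - entry.at > ttlMs) {
        store.delete(key);
        return undefined;
      }
      return entry.value;
    },
    set(key, value) {
      store.set(key, { value, at: Date.now() });
    },
  };
}
```

A dashboard backend would key entries by the serialized query (and tenant), answering repeat requests from the cache and hitting the database only on a miss or after expiry.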
Real-time
Nowadays, every successful SaaS company understands the importance of having dynamic and real-time dashboards; these dashboards provide users with the ability to access the latest data without requiring them to refresh the tab each and every time. By having real-time dashboards, companies can ensure their customers have access to the latest information, which can help them make more informed decisions. This is why it is becoming increasingly important for SaaS organizations to invest in robust, low-latency dashboard solutions that can deliver accurate, up-to-date data to their customers.
Drilldowns
Drilldown is an incredibly powerful analytics capability that enables users to rapidly transition from an aggregated, top-level overview of their data to a more granular, in-depth view. This can be achieved simply by clicking on a metric within a dashboard or report. With drill-down, users can gain a greater understanding of the data by uncovering deeper insights, allowing them to more effectively evaluate the data and gain a more accurate understanding of their data trends.
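Mechanically, a drill-down is usually a query transform: the clicked dimension value is pinned as a filter and the aggregate view is swapped for row-level members. The query shape below (measures/dimensions/filters/drillMembers) is a simplified assumption, loosely modelled on typical analytics query objects rather than any specific tool's API:

```javascript
// Turn an aggregated query into a drill-down query for one clicked value.
function drillDown(query, dimension, value) {
  return {
    ...query,
    measures: [],                    // drop the aggregation...
    dimensions: query.drillMembers,  // ...and show row-level detail instead
    filters: [
      ...(query.filters ?? []),
      { member: dimension, operator: "equals", values: [value] },
    ],
  };
}
```

Clicking "shipped" on an orders-by-status chart would, for example, produce a query filtered to `status = "shipped"` that lists the underlying order rows.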
Data Sources
With the prevalence of software as a service (SaaS) applications, there could be a range of different data sources used, including PostgreSQL, DynamoDB, and other types of databases. As such, it is important for analytics solutions to be capable of accommodating multiple data sources at once to provide the most comprehensive insights. By leveraging the various sources of information, in conjunction with advanced analytics, businesses can gain a thorough understanding of their customers, as well as trends and behaviors. Additionally, accessing and combining data from multiple sources can allow for more precise predictions and recommendations, thereby optimizing the customer experience and improving overall performance.
Budget
Pricing is one of the most vital aspects to consider when selecting an analytics tool. There are various pricing models: some, such as Amazon QuickSight's, can be quite complex; per-user pricing can be very expensive for larger organizations; and custom pricing requires you to contact customer care to get the right price, which can be a difficult process and a big barrier to adoption. Ultimately, it is important to understand the different pricing models available and how they may affect your budget before selecting an analytics tool.
After examining all the requirements, we came across a solution like Cube, which is an innovative solution with the following features:
Open Source: Since it is open source, you can easily do a proof-of-concept (POC) and get good support, as any vulnerabilities will be fixed quickly.
Modular Architecture: It provides good customization options, such as pairing Cube with any custom charting library you prefer in your current framework.
Embedded Analytics-as-a-Code: You can easily replicate your analytics and version control it, as Cube is analytics in the form of code.
Cloud Deployments: It is a new-age tool, so it comes with good support with Docker or Kubernetes (K8s). Therefore, you can easily deploy it on the cloud.
Cube Architecture
Let’s look at the Cube architecture to understand why Cube is an innovative solution.
Cube supports multiple data sources simultaneously; your data may be stored in Postgres, Snowflake, and Redshift, and you can connect to all of them simultaneously. Additionally, they have a long list of data sources they can support.
Cube provides analytics over a REST API; very few analytics solutions provide chart data or metrics over REST APIs.
The security you might be using for your application can easily be mirrored for Cube. This helps simplify the security aspects, as you don’t need to maintain multiple tokens for the app and analytics tool.
Cube provides a unique way to model your data in JSON format; it’s more similar to an ORM. You don’t need to write complex SQL queries; once you model your data, Cube will generate the SQL to query the data source.
Cube has very good pre-aggregation and caching solutions.
Cube Deep Dive
Let’s look into different concepts that we just saw briefly in the architecture diagram.
Data Modeling
Cube
A cube represents a table of data and is conceptually similar to a view in SQL. It's like an ORM where you can define a schema, extend it, or define abstract cubes to make code reusable. For example, if you have a Customer table, you need to write a cube for it. Using cubes, you can build analytical queries.
Each cube contains definitions of measures, dimensions, segments, and joins between cubes. Cube bifurcates columns into measures and dimensions. Similar to tables, every cube can be referenced in another cube. Even though a cube is a table representation, you can choose which columns to expose for analytics: only the columns you add are available, and only the dimensions and measures you actually use are translated into the generated SQL query (push-down mechanism).
cube('Orders', {
  sql: `SELECT * FROM orders`,
});
Dimensions
You can think about a dimension as an attribute related to a measure, for example, the measure userCount. This measure can have different dimensions, such as country, age, occupation, etc.
Dimensions allow us to further subdivide and analyze the measure, providing a more detailed and comprehensive picture of the data.
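As a sketch, a hypothetical Users cube with the userCount measure and the dimensions mentioned above might look like this (table and column names are assumptions for illustration):

```javascript
cube('Users', {
  sql: `SELECT * FROM users`,

  measures: {
    // Count of user rows
    userCount: {
      type: `count`,
    },
  },

  dimensions: {
    country: {
      sql: `country`,
      type: `string`,
    },
    age: {
      sql: `age`,
      type: `number`,
    },
    occupation: {
      sql: `occupation`,
      type: `string`,
    },
  },
});
```

Querying userCount grouped by the country dimension would then generate SQL along the lines of SELECT country, COUNT(*) FROM users GROUP BY country.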
Measures
Measures are the parameters/SQL columns that let you define aggregations over numeric or quantitative data. Measures can be used to perform calculations such as sum, minimum, maximum, average, and count on any set of data.
Measures also help you define filters if you want to add some conditions for a metric calculation. For example, you can set thresholds to filter out any data that is not within the range of values you are looking for.
Additionally, measures can be used to create additional metrics, such as the ratio between two different measures or the percentage of a measure. With these powerful tools, you can effectively analyze and interpret your data to gain valuable insights.
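For illustration, here is a sketch of measures on a hypothetical Orders cube, including a filtered measure and a ratio of two measures (cube, column names, and the threshold are made up):

```javascript
cube('Orders', {
  sql: `SELECT * FROM orders`,

  measures: {
    count: {
      type: `count`,
    },
    totalAmount: {
      sql: `amount`,
      type: `sum`,
    },
    // Filtered measure: only orders above a threshold are counted
    largeOrderCount: {
      type: `count`,
      filters: [{ sql: `${CUBE}.amount > 500` }],
    },
    // Calculated measure: percentage of large orders among all orders
    largeOrderShare: {
      sql: `100.0 * ${largeOrderCount} / NULLIF(${count}, 0)`,
      type: `number`,
    },
  },
});
```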
Joins
Joins define the relationships between cubes, which then allows accessing and comparing properties from two or more cubes at the same time. In Cube, all joins are LEFT JOINs. This also allows you to represent one-to-one and many-to-one relationships easily.
cube('Orders', {
  ...,
  joins: {
    LineItems: {
      relationship: `belongsTo`,
      // Here we use the `CUBE` global to refer to the current cube,
      // so the following is equivalent to `Orders.id = LineItems.order_id`
      sql: `${CUBE}.id = ${LineItems}.order_id`,
    },
  },
});
There are three kinds of join relationships:
belongsTo
hasOne
hasMany
Segments
Segments are filters predefined in the schema instead of a Cube query. Segments help pre-build complex filtering logic, simplifying Cube queries and making it easy to re-use common filters across a variety of queries.
To add a segment that limits results to completed orders, we can do the following:
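A minimal sketch, assuming the orders table has a status column:

```javascript
cube('Orders', {
  sql: `SELECT * FROM orders`,

  segments: {
    // Predefined filter: only completed orders
    completedOrders: {
      sql: `${CUBE}.status = 'completed'`,
    },
  },
});
```

Queries can then reference the Orders.completedOrders segment instead of repeating the filter condition.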
Pre-aggregations
Pre-aggregations are a powerful way of caching frequently used, expensive queries and keeping the cache up-to-date periodically. The most popular kind is the roll-up pre-aggregation: summarized data of the original cube grouped by selected dimensions of interest. It works on measure types like count, sum, min, max, etc.
Cube analyzes queries against the defined set of pre-aggregation rules to choose the optimal one that will be used to create the pre-aggregation table. A pre-aggregation specifies attributes from the source cube, which Cube uses to condense (or crunch) the data. When queries execute over a smaller dataset, the application works well and delivers responses within acceptable thresholds; however, as the size of the dataset grows, the time-to-response from a user's perspective can often suffer quite heavily. This simple yet powerful optimization can reduce the size of the dataset by several orders of magnitude and ensures subsequent queries can be served by the same condensed dataset if any matching attributes are found.
A granularity can also be specified, which defines the granularity of data within the pre-aggregation. If set to week, for example, Cube will pre-aggregate the data by week and persist it to Cube Store.
Cube can also take care of keeping pre-aggregations up-to-date with the refreshKey property. By default, it is set to every: ‘1 hour’.
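Putting the measures, dimensions, granularity, and refreshKey together, a roll-up pre-aggregation might be sketched as follows (cube, member, and column names are assumptions):

```javascript
cube('Orders', {
  sql: `SELECT * FROM orders`,

  measures: {
    count: { type: `count` },
  },

  dimensions: {
    status: { sql: `status`, type: `string` },
    createdAt: { sql: `created_at`, type: `time` },
  },

  preAggregations: {
    // Roll-up: order counts by status, bucketed by week,
    // refreshed every hour
    ordersByStatus: {
      measures: [CUBE.count],
      dimensions: [CUBE.status],
      timeDimension: CUBE.createdAt,
      granularity: `week`,
      refreshKey: {
        every: `1 hour`,
      },
    },
  },
});
```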
Let’s look into some of the additional concepts that Cube provides that make it a very unique solution.
Caching
Cube provides a two-level caching system. The first level is in-memory cache, which is active by default. Cube in-memory cache acts as a buffer for your database when there is a burst of requests hitting the same data from multiple concurrent users, while pre-aggregations are designed to provide the right balance between time to insight and querying performance.
The second level of caching is called pre-aggregations, and requires explicit configuration to activate.
Drilldowns
Drilldowns are a powerful feature to facilitate data exploration. They allow building an interface to let users dive deeper into visualizations and data tables. See ResultSet.drillDown() on how to use this feature on the client side.
A drilldown is defined on the measure level in your data schema. It is defined as a list of dimensions called drill members. Once defined, these drill members will always be used to show underlying data when drilling into that measure.
Subquery
You can use subqueries within dimensions to reference measures from other cubes inside a dimension. Under the hood, it behaves as a correlated subquery, but is implemented via joins for optimal performance and portability.
For example, the following SQL can be written using a subquery in cubes as:
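The original snippet was not preserved here, so the following is an illustrative sketch: exposing a per-user order count as a dimension via a subquery. It assumes an Orders cube with a count measure and a belongsTo join from Orders to Users:

```javascript
// Roughly equivalent to:
//   SELECT u.id,
//     (SELECT COUNT(*) FROM orders o WHERE o.user_id = u.id) AS order_count
//   FROM users u

cube('Users', {
  sql: `SELECT * FROM users`,

  dimensions: {
    id: {
      sql: `id`,
      type: `number`,
      primaryKey: true,
    },
    // Subquery dimension referencing a measure from another cube
    orderCount: {
      sql: `${Orders.count}`,
      type: `number`,
      subQuery: true,
    },
  },
});
```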
Apart from these, Cube also provides advanced concepts such as Export and Import, Extending Cubes, Data Blending, Dynamic Schema Creation, and Polymorphic Cubes. You can read more about them in the Cube documentation.
Getting Started with Cube
Getting started with Cube is very easy. All you need to do is follow the instructions on the Cube documentation page.
You can use Docker to get started quickly. With Docker, you can install Cube in a few easy steps:
1. In a new folder for your project, run the following command:
docker run -p 4000:4000 -p 3000:3000 -v ${PWD}:/cube/conf -e CUBEJS_DEV_MODE=true cubejs/cube
The Developer Playground has a database connection wizard that loads when Cube is first started up and no .env file is found. After database credentials have been set up, an .env file will automatically be created and populated with the same credentials.
Click on the type of database to connect to, and you’ll be able to enter credentials:
After clicking Apply, you should see available tables from the configured database. Select one to generate a data schema. Once the schema is generated, you can execute queries on the Build tab.
Conclusion
Cube is a revolutionary, open-source framework for building embedded analytics applications. It offers a unified API for connecting to any data source, comprehensive visualization libraries, and a data-driven user experience that makes it easy for developers to build interactive applications quickly. With Cube, developers can focus on the application logic and let the framework take care of the data, making it an ideal platform for creating data-driven applications that can be deployed on the web, mobile, and desktop. It is an invaluable tool for any developer interested in building sophisticated analytics applications quickly and easily.
GitHub Actions jobs run in the cloud by default; however, sometimes we want to run jobs in our own customized/private environment where we have full control. That is where a self-hosted runner comes in.
To get a basic understanding of running self-hosted runners on the Kubernetes cluster, this blog is perfect for you.
We’ll be focusing on running GitHub Actions on a self-hosted runner on Kubernetes.
An example use case would be to create an automation in GitHub Actions to execute MySQL queries on MySQL Database running in a private network (i.e., MySQL DB, which is not accessible publicly).
A self-hosted runner normally requires provisioning and configuring a virtual machine instance; here, we are running it on Kubernetes instead. The actions-runner-controller makes it possible to run self-hosted runners on a Kubernetes cluster.
This blog aims to try out self-hosted runners on Kubernetes and covers:
Deploying MySQL Database on minikube, which is accessible only within Kubernetes Cluster.
Deploying self-hosted action runners on the minikube.
Running GitHub Action on minikube to execute MySQL queries on MySQL Database.
Steps for completing this tutorial:
Create a GitHub repository
Create a private repository on GitHub. I am creating it with the name velotio/action-runner-poc.
Install cert-manager
By default, actions-runner-controller uses cert-manager for certificate management of its admission webhook, so we have to make sure cert-manager is installed on Kubernetes before we install actions-runner-controller.
Run the below helm commands to install cert-manager on minikube.
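A typical cert-manager install via Helm looks like the following (the chart version is left unpinned here; pin whichever version you have validated):

```shell
# Add the Jetstack chart repository and install cert-manager with its CRDs
helm repo add jetstack https://charts.jetstack.io
helm repo update
helm install cert-manager jetstack/cert-manager \
  --namespace cert-manager \
  --create-namespace \
  --set installCRDs=true
```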
Verify the installation using "kubectl --namespace cert-manager get all". If everything is okay, you will see output as below:
Setting Up Authentication for Hosted Runners
There are two ways for actions-runner-controller to authenticate with the GitHub API (only 1 can be configured at a time, however):
Using a GitHub App (not supported for enterprise-level runners due to lack of support from GitHub.)
Using a PAT (personal access token)
To keep this blog simple, we are going with PAT.
To authenticate actions-runner-controller with the GitHub API, we can use a PAT, which actions-runner-controller also uses when registering a self-hosted runner.
Go to account > Settings > Developer settings > Personal access tokens. Click on "Generate new token". Under scopes, select "Full control of private repositories".
Click on the “Generate token” button.
Copy the generated token and run the below commands to create a Kubernetes secret, which will be used by action-runner-controller deployment.
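As a sketch (replace <GENERATED_PAT> with the token you copied), the controller conventionally reads the token from a secret named controller-manager under the key github_token, after which the controller itself can be installed from its Helm chart:

```shell
# Create the namespace and the PAT secret the controller expects
kubectl create namespace actions-runner-system
kubectl create secret generic controller-manager \
  -n actions-runner-system \
  --from-literal=github_token=<GENERATED_PAT>

# Install actions-runner-controller itself
helm repo add actions-runner-controller https://actions-runner-controller.github.io/actions-runner-controller
helm upgrade --install -n actions-runner-system \
  actions-runner-controller actions-runner-controller/actions-runner-controller
```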
Verify that actions-runner-controller was installed properly using the below command:
kubectl --namespace actions-runner-system get all
Create a Repository Runner
Create a RunnerDeployment Kubernetes object, which will create a self-hosted runner named k8s-action-runner for the GitHub repository velotio/action-runner-poc
Please update the repo name from "velotio/action-runner-poc" to "<your-repo-name>".
To create the RunnerDeployment object, create the file runner.yaml as follows:
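A minimal runner.yaml sketch (the replica count and namespace are assumptions; adjust them to your setup):

```yaml
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: k8s-action-runner
  namespace: actions-runner-system
spec:
  replicas: 2
  template:
    spec:
      # Repository the runners register against
      repository: velotio/action-runner-poc
```

Apply it with kubectl create -f runner.yaml.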
Check that the pod is running using the below command:
kubectl get pod -n actions-runner-system | grep -i "k8s-action-runner"
If everything goes well, you should see two action runners on Kubernetes, and the same runners registered on GitHub; check under Settings > Actions > Runners of your repository.
Check the pod with kubectl get po -n actions-runner-system
Install a MySQL Database on the Kubernetes cluster
Create the file mysql-svc-deploy.yaml and add the below content to mysql-svc-deploy.yaml
Here, we have used MYSQL_ROOT_PASSWORD as “password”.
apiVersion: v1
kind: Service
metadata:
  name: mysql
spec:
  ports:
    - port: 3306
  selector:
    app: mysql
  clusterIP: None
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mysql
spec:
  selector:
    matchLabels:
      app: mysql
  strategy:
    type: Recreate
  template:
    metadata:
      labels:
        app: mysql
    spec:
      containers:
        - image: mysql:5.6
          name: mysql
          env: # Use secret in real usage
            - name: MYSQL_ROOT_PASSWORD
              value: password
          ports:
            - containerPort: 3306
              name: mysql
          volumeMounts:
            - name: mysql-persistent-storage
              mountPath: /var/lib/mysql
      volumes:
        - name: mysql-persistent-storage
          persistentVolumeClaim:
            claimName: mysql-pv-claim
Create the service and deployment
kubectl create -f mysql-svc-deploy.yaml -n mysql
Verify that the MySQL database is running
kubectl get po -n mysql
Create a GitHub repository secret to store MySQL password
We will use the MySQL password in the GitHub Actions workflow file, and as a good practice, we should not use it in plain text. So we will store the MySQL password in GitHub secrets and use this secret in our workflow file.
Create a secret in the GitHub repository and give the name to the secret as “MYSQL_PASS”, and in the values, enter “password”.
Create a GitHub workflow file
YAML syntax is used to write GitHub workflows. Each workflow gets a separate YAML file, stored in the .github/workflows/ directory. So, create a .github/workflows/ directory in your repository and create a file .github/workflows/mysql_workflow.yaml as follows.
---
name: Example 1
on:
  push:
    branches: [ main ]
jobs:
  build:
    name: Build-job
    runs-on: self-hosted
    steps:
      - name: Checkout
        uses: actions/checkout@v2
      - name: MySQLQuery
        env:
          PASS: ${{ secrets.MYSQL_PASS }}
        run: |
          docker run -v ${GITHUB_WORKSPACE}:/var/lib/docker --rm mysql:5.6 sh -c "mysql -u root -p$PASS -h mysql.mysql.svc.cluster.local < /var/lib/docker/test.sql"
If you check the docker run command in the mysql_workflow.yaml file, we are referring to the .sql file, i.e., test.sql. So, create a test.sql file in your repository as follows:
use mysql;
CREATE TABLE IF NOT EXISTS Persons (
    PersonID int,
    LastName varchar(255),
    FirstName varchar(255),
    Address varchar(255),
    City varchar(255)
);
SHOW TABLES;
In test.sql, we are running MySQL queries like create tables.
Push changes to your repository main branch.
If everything is fine, you will be able to see that the GitHub action is getting executed in a self-hosted runner pod. You can check it under the “Actions” tab of your repository.
You can check the workflow logs to see the output of SHOW TABLES (a command we used in the test.sql file) and verify that the Persons table was created.
As businesses move their data to the public cloud, one of the most pressing issues is how to keep it safe from illegal access.
Using a tool like HashiCorp Vault gives you greater control over your sensitive credentials and fulfills cloud security regulations.
In this blog, we’ll walk you through HashiCorp Vault High Availability Setup.
HashiCorp Vault
HashiCorp Vault is an open-source tool that provides a secure, reliable way to store and distribute sensitive information like API keys, access tokens, and passwords. Vault provides high-level policy management, secret leasing, audit logging, and automatic revocation to protect this information through its UI, CLI, or HTTP API.
High Availability
Vault can run in a High Availability mode to protect against outages by running multiple Vault servers. When running in HA mode, Vault servers have two additional states, i.e., active and standby. Within a Vault cluster, only a single instance will be active, handling all requests, and all standby instances redirect requests to the active instance.
Integrated Storage Raft
The Integrated Storage backend is used to maintain Vault’s data. Unlike other storage backends, Integrated Storage does not operate from a single source of data. Instead, all the nodes in a Vault cluster will have a replicated copy of Vault’s data. Data gets replicated across all the nodes via the Raft Consensus Algorithm.
Raft is officially supported by Hashicorp.
Architecture
Prerequisites
This setup requires Vault, sudo access on the machines, and the below configuration to create the cluster.
Install Vault v1.6.3+ent or later on all nodes in the Vault cluster
In this example, we have 3 CentOs VMs provisioned using VMware.
Setup
1. Verify the Vault version on all the nodes using the below command (in this case, we have 3 nodes node1, node2, node3):
vault --version
2. Configure SSL certificates
Note: Vault should always be used with TLS in production to provide secure communication between clients and the Vault server. It requires a certificate file and key file on each Vault host.
We can generate SSL certs for the Vault Cluster on the Master and copy them on the other nodes in the cluster.
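The per-node Vault configuration then boils down to a file along these lines; the paths, hostnames, and node_id below are placeholders that must match your environment:

```hcl
# /etc/vault.d/vault.hcl -- example for node1
ui           = true
api_addr     = "https://node1.int.us-west-1-dev.central.example.com:8200"
cluster_addr = "https://node1.int.us-west-1-dev.central.example.com:8201"

listener "tcp" {
  address       = "0.0.0.0:8200"
  tls_cert_file = "/etc/vault.d/tls/vault.crt"
  tls_key_file  = "/etc/vault.d/tls/vault.key"
}

storage "raft" {
  path    = "/opt/vault/data"
  node_id = "node1"

  retry_join {
    leader_api_addr = "https://node1.int.us-west-1-dev.central.example.com:8200"
  }
}
```

Each node gets its own node_id and addresses; the retry_join blocks let nodes find the leader automatically.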
If interested, you can view the systemd unit file, then enable and start the Vault service on each node:
cat /etc/systemd/system/vault.service
systemctl enable vault.service
systemctl start vault.service
systemctl status vault.service
9. Check Vault status on all nodes:
vault status
10. Initialize Vault with the following command on vault node 1 only. Store unseal keys securely.
[user@node1 vault.d]$ vault operator init -key-shares=1 -key-threshold=1
Unseal Key 1: HPY/g5OiT8ivD6L4Bqfjx9L1We2MVb4WZAqKZk6zFf8=
Initial Root Token: hvs.j4qTq1IZP9nscILMtN2p9GE0

Vault initialized with 1 key shares and a key threshold of 1. Please securely
distribute the key shares printed above. When the Vault is re-sealed, restarted,
or stopped, you must supply at least 1 of these keys to unseal it before it can
start servicing requests.

Vault does not store the generated root key. Without at least 1 keys to
reconstruct the root key, Vault will remain permanently sealed!

It is possible to generate new unseal keys, provided you have a quorum of
existing unseal keys shares. See "vault operator rekey" for more information.
11. Set the VAULT_TOKEN environment variable for the vault CLI command to authenticate to the server. Use the following command, replacing <initial-root-token> with the value generated in the previous step.
export VAULT_TOKEN=<initial-root-token>
echo "export VAULT_TOKEN=$VAULT_TOKEN" >> /root/.bash_profile

### Repeat this step for the other 2 servers.
12. Unseal Vault1 using the unseal key generated in step 10. Notice the Unseal Progress key-value change as you present each key. After meeting the key threshold, the status of the key value for Sealed should change from true to false.
[user@node1 vault.d]$ vault operator unseal HPY/g5OiT8ivD6L4Bqfjx9L1We2MVb4WZAqKZk6zFf8=
Key                     Value
---                     -----
Seal Type               shamir
Initialized             true
Sealed                  false
Total Shares            1
Threshold               1
Version                 1.11.0
Build Date              2022-06-17T15:48:44Z
Storage Type            raft
Cluster Name            POC
Cluster ID              109658fe-36bd-7d28-bf92-f095c77e860c
HA Enabled              true
HA Cluster              https://node1.int.us-west-1-dev.central.example.com:8201
HA Mode                 active
Active Since            2022-06-29T12:50:46.992698336Z
Raft Committed Index    36
Raft Applied Index      36
13. Unseal Vault2 (Use the same unseal key generated in step 10 for Vault1):
[user@node2 vault.d]$ vault operator unseal HPY/g5OiT8ivD6L4Bqfjx9L1We2MVb4WZAqKZk6zFf8=
Key                Value
---                -----
Seal Type          shamir
Initialized        true
Sealed             true
Total Shares       1
Threshold          1
Unseal Progress    0/1
Unseal Nonce       n/a
Version            1.11.0
Build Date         2022-06-17T15:48:44Z
Storage Type       raft
HA Enabled         true

[user@node2 vault.d]$ vault status
Key                     Value
---                     -----
Seal Type               shamir
Initialized             true
Sealed                  true
Total Shares            1
Threshold               1
Version                 1.11.0
Build Date              2022-06-17T15:48:44Z
Storage Type            raft
Cluster Name            POC
Cluster ID              109658fe-36bd-7d28-bf92-f095c77e860c
HA Enabled              true
HA Cluster              https://node1.int.us-west-1-dev.central.example.com:8201
HA Mode                 standby
Active Node Address     https://node1.int.us-west-1-dev.central.example.com:8200
Raft Committed Index    37
Raft Applied Index      37
14. Unseal Vault3 (Use the same unseal key generated in step 10 for Vault1):
[user@node3 ~]$ vault operator unseal HPY/g5OiT8ivD6L4Bqfjx9L1We2MVb4WZAqKZk6zFf8=
Key                Value
---                -----
Seal Type          shamir
Initialized        true
Sealed             true
Total Shares       1
Threshold          1
Unseal Progress    0/1
Unseal Nonce       n/a
Version            1.11.0
Build Date         2022-06-17T15:48:44Z
Storage Type       raft
HA Enabled         true

[user@node3 ~]$ vault status
Key                     Value
---                     -----
Seal Type               shamir
Initialized             true
Sealed                  false
Total Shares            1
Threshold               1
Version                 1.11.0
Build Date              2022-06-17T15:48:44Z
Storage Type            raft
Cluster Name            POC
Cluster ID              109658fe-36bd-7d28-bf92-f095c77e860c
HA Enabled              true
HA Cluster              https://node1.int.us-west-1-dev.central.example.com:8201
HA Mode                 standby
Active Node Address     https://node1.int.us-west-1-dev.central.example.com:8200
Raft Committed Index    39
Raft Applied Index      39
15. Check the cluster’s raft status with the following command:
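The command, with illustrative output (node addresses follow the naming used earlier; your values will differ):

```shell
$ vault operator raft list-peers
Node     Address                                              State       Voter
----     -------                                              -----       -----
node1    node1.int.us-west-1-dev.central.example.com:8201     leader      true
node2    node2.int.us-west-1-dev.central.example.com:8201     follower    true
node3    node3.int.us-west-1-dev.central.example.com:8201     follower    true
```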
16. Currently, node1 is the active node. We can experiment to see what happens if node1 steps down from its active node duty.
In the terminal where VAULT_ADDR is set to: https://node1.int.us-west-1-dev.central.example.com, execute the step-down command.
$ vault operator step-down   # equivalent of stopping the node or stopping the systemctl service
Success! Stepped down: https://node2.int.us-west-1-dev.central.example.com:8200
In the terminal, where VAULT_ADDR is set to https://node2.int.us-west-1-dev.central.example.com:8200, examine the raft peer set.
Vault servers are now operational in High Availability mode. We can test this by writing a secret from either the active or a standby Vault instance and seeing it succeed, as a test of request forwarding. We can also shut down the active Vault instance (sudo systemctl stop vault) to simulate a system failure and watch a standby instance assume leadership.
We will cover the security concepts of Kafka and walk through the implementation of encryption, authentication, and authorization for a Kafka cluster.
This article will explain how to configure SASL_SSL (Simple Authentication and Security Layer over SSL) security for your Kafka cluster and how to protect the data in transit. SASL_SSL is a communication type in which clients use authentication mechanisms like PLAIN, SCRAM, etc., and the server uses SSL certificates to establish secure communication. We will use the SCRAM authentication mechanism here to help establish mutual authentication between the client and server. We'll also discuss authorization and ACLs, which are important for securing your cluster.
Prerequisites
Running Kafka Cluster, basic understanding of security components.
Need for Kafka Security
The primary reason is to prevent unlawful activities such as misuse, modification, disruption, and disclosure of data. To understand security in a Kafka cluster, we need to know three terms:
Authentication – the method by which a server verifies the identity of a client that is trying to connect, before granting access to its information.
Authorization – implemented together with authentication, it gives servers a methodology for deciding what an identified client may access. Basically, it grants the limited access that is sufficient for the client.
Encryption – the process of transforming data to make it distorted and unreadable without a decryption key. Encryption ensures that no other client can intercept and steal or read data.
Here is the quick start guide by Apache Kafka, so check it out if you still need to set up Kafka.
We’ll not cover the theoretical aspects here, but you can find a ton of sources on how these three components work internally. For now, we’ll focus on the implementation part and how Kafka revolves around security.
This image illustrates SSL communication between the Kafka client and server.
We are going to implement the steps in the below order:
Create a Certificate Authority
Create a Truststore & Keystore
Certificate Authority – It is a trusted entity that issues SSL certificates. As such, a CA is an independent entity that acts as a trusted third party, issuing certificates for use by others. A certificate authority validates the credentials of a person or organization that requests a certificate before issuing one.
Truststore – A truststore contains certificates from other parties with which you want to communicate or certificate authorities that you trust to identify other parties. In simple words, a list of CAs that can validate the certificate signed by the trusted CA.
KeyStore – A KeyStore contains private keys and certificates with their corresponding public keys. Keystores can have one or more CA certificates depending upon what’s needed.
For the Kafka server, we need a server certificate; this is where the KeyStore comes into the picture, since it stores the server certificate. The server certificate should be signed by a Certificate Authority (CA): a signing request for the server certificate is generated from the KeyStore, and in response, the CA sends back a signed certificate that goes into the KeyStore.
We will create our own certificate authority for demonstration purposes. If you don’t want to create a private certificate authority, there are many certificate providers you can go with, like IdenTrust and GoDaddy. Since we are creating one, we need to tell our Kafka client to trust our private certificate authority using the Trust Store.
This block diagram shows you how all the components communicate with each other and their role to generate the final certificate.
So, let’s create our Certificate Authority. Run the below command in your terminal:
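A sketch of the command, where private_key_name and public_certificate_name are the placeholder file names used below:

```shell
# Create a private CA: a private key plus a self-signed certificate,
# valid for 365 days
openssl req -new -x509 -keyout private_key_name -out public_certificate_name -days 365
```

openssl will prompt for a passphrase and the certificate subject fields interactively; add -subj and -passout flags if you prefer a non-interactive run.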
It will ask for a passphrase, and keep it safe for future use cases. After successfully executing the command, we should have two files named private_key_name and public_certificate_name.
Now, let’s create a KeyStore and trust store for brokers; we need both because brokers also interact internally with each other. Let’s understand with the help of an example: Broker A wants to connect with Broker B, so Broker A acts as a client and Broker B as a server. We are using the SASL_SSL protocol, so A needs SASL credentials, and B needs a certificate for authentication. The reverse is also possible where Broker B wants to connect with Broker A, so we need both a KeyStore and a trust store for authentication.
Now let’s create a trust store. Execute the below command in the terminal, and it should ask for the password. Save the password for future use:
keytool -keystore <truststore_name.jks> -alias <alias name of the entry to process> -import -file <public_certificate_name>
Here, we are using the .jks extension for the file, which stands for Java KeyStore. You can also use Public-Key Cryptography Standards #12 (PKCS12) instead of .jks; that's totally up to you. public_certificate_name is the same certificate we generated while creating the CA.
For the KeyStore configuration, run the below command and store the password:
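A sketch, assuming the KeyStore file name kafka.server.keystore.jks and the alias localhost:

```shell
# Generate the broker key pair inside a new KeyStore;
# keytool prompts for the password and certificate details interactively
keytool -keystore kafka.server.keystore.jks -alias localhost -validity 365 -genkey -keyalg RSA
```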
This action creates the KeyStore file in the current working directory. The question “First and Last Name” requires you to enter a fully qualified domain name because some certificate authorities, such as VeriSign, expect this property to be a fully qualified domain name. Not all CAs require a fully qualified domain name, but I recommend using a fully qualified domain name for portability. All other information should be valid. If the information cannot be verified, a certificate authority such as VeriSign will not sign the CSR generated for that record. I’m using localhost for the domain name here, as seen in the above command itself.
Keystore has an entry with alias_name. It contains the private key and information needed for generating a CSR. Now let’s create a signing certificate request, so it will be used to get a signed certificate from Certificate Authority.
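Assuming the KeyStore and alias names from the previous step, the request can be exported like this (cert-file is a hypothetical output file name):

```shell
# Export a certificate signing request (CSR) from the KeyStore
keytool -keystore kafka.server.keystore.jks -alias localhost -certreq -file cert-file
```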
So, we have generated a certificate signing request using the KeyStore (use the same KeyStore name and alias as before). It should ask for the KeyStore password, so enter the same one used while creating the KeyStore.
Now, execute the below command. It will ask for the password, so enter the CA password, and now we have a signed certificate:
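A sketch using the CA files created earlier and a hypothetical cert-signed output name:

```shell
# Sign the CSR with the private CA; prompts for the CA passphrase
openssl x509 -req -CA public_certificate_name -CAkey private_key_name \
  -in cert-file -out cert-signed -days 365 -CAcreateserial
```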
Finally, we need to add the public certificate of CA and signed certificate in the KeyStore, so run the below command. It will add the CA certificate to the KeyStore.
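Import the CA certificate first, then the signed server certificate (the aliases are conventional, not mandatory):

```shell
# Add the CA's public certificate to the KeyStore
keytool -keystore kafka.server.keystore.jks -alias CARoot -import -file public_certificate_name

# Add the signed server certificate to the KeyStore
keytool -keystore kafka.server.keystore.jks -alias localhost -import -file cert-signed
```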
As of now, we have generated all the security files for the broker. For internal broker communication, we are using SASL_SSL (see security.inter.broker.protocol in server.properties). Now we need to create a broker username and password using the SCRAM method; see the Kafka documentation for more details.
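A sketch of creating SCRAM credentials for a broker user (the user name and password are placeholders; older Kafka versions register SCRAM users via ZooKeeper as shown here):

```shell
# Create SCRAM-SHA-256 credentials for the broker user
kafka-configs.sh --zookeeper localhost:2181 --alter \
  --add-config 'SCRAM-SHA-256=[password=broker-secret]' \
  --entity-type users --entity-name broker-admin
```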
NOTE: If you are using an external jaas config file, then remove the ScramLoginModule line and set this environment variable before starting broker. “export KAFKA_OPTS=-Djava.security.auth.login.config={path/to/broker.conf}”
Now, if we run Kafka, the broker should be running on port 9092 without any failure, and if you have multiple brokers inside Kafka, the same config file can be replicated among them, but the port should be different for each broker.
Producers and consumers need a username and a password to access the broker, so let’s create their credentials and update respective configurations.
Create a producer user and update producer.properties inside the bin directory, so execute the below command in your terminal.
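A sketch with a hypothetical user name and password:

```shell
# Create SCRAM credentials for the producer user
kafka-configs.sh --zookeeper localhost:2181 --alter \
  --add-config 'SCRAM-SHA-256=[password=producer-secret]' \
  --entity-type users --entity-name producer-user
```

Then set security.protocol=SASL_SSL, sasl.mechanism=SCRAM-SHA-256, the matching ScramLoginModule sasl.jaas.config line, and the truststore location and password in producer.properties.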
We need a trust store file for our clients (producer and consumer), but as we already know how to create a trust store, this is a small task for you. It is suggested that producers and consumers should have separate trust stores because when we move Kafka to production, there could be multiple producers and consumers on different machines.
As of now, we have implemented encryption and authentication for Kafka brokers. To verify that our producer and consumer are working properly with SCRAM credentials, run the console producer and consumer on some topics.
Authorization is not implemented yet. Kafka uses access control lists (ACLs) to specify which users can perform which actions on specific resources or groups of resources. Each ACL has a principal, a permission type, an operation, a resource type, and a name.
The default authorizer provided by Kafka is AclAuthorizer; Confluent also provides the Confluent Server Authorizer, which is quite different from AclAuthorizer. An authorizer is a server plugin used by Kafka to authorize actions. Specifically, the authorizer controls whether operations should be authorized based on the principal and the resource being accessed.
Format of ACLs – Principal P is [Allowed/Denied] Operation O from Host H on any Resource R matching ResourcePattern RP
Execute the below command to create an ACL that grants write permission to the producer:
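A sketch of the ACL command, saved to a helper script; the topic name (my-topic) and ZooKeeper address are placeholders. Describe permission is also granted, since producers typically need it to fetch topic metadata:

```shell
# Allow the producer principal to write to (and describe) the topic.
cat > create-producer-acl.sh <<'EOF'
bin/kafka-acls.sh --authorizer-properties zookeeper.connect=localhost:2181 \
  --add --allow-principal User:producer \
  --operation Write --operation Describe \
  --topic my-topic
EOF
chmod +x create-producer-acl.sh
```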
Authorization is enabled by setting authorizer.class.name=kafka.security.authorizer.AclAuthorizer in server.properties; this line tells Kafka to use the AclAuthorizer class for authorization.
# consumer group id
group.id=<consumer_group_name>
A consumer group id is mandatory: if we do not specify a group, the consumer will not be able to access data from topics, so a group.id must be provided before starting a consumer.
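Putting the consumer side together as a sketch — the user name, group id, topic, and paths are all placeholders. Note that consumers need a Read ACL on both the topic and the consumer group; the ACL command is saved to a helper script:

```shell
# Client config for the consumer, including the mandatory group.id.
cat > consumer.properties <<'EOF'
security.protocol=SASL_SSL
sasl.mechanism=SCRAM-SHA-512
ssl.truststore.location=/path/to/client.truststore.jks
ssl.truststore.password=changeit
group.id=my-consumer-group
sasl.jaas.config=org.apache.kafka.common.security.scram.ScramLoginModule required \
  username="consumer" \
  password="consumer-secret";
EOF

# Allow the consumer principal to read from the topic and its group.
cat > create-consumer-acls.sh <<'EOF'
bin/kafka-acls.sh --authorizer-properties zookeeper.connect=localhost:2181 \
  --add --allow-principal User:consumer \
  --operation Read --topic my-topic \
  --group my-consumer-group
EOF
```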
Let’s test the producer and consumer one by one: run the console producer, then run the console consumer in another terminal; both should run without errors.
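As a sketch (the topic name, port, and config paths are assumptions for your own setup), the two invocations can be saved to a helper script:

```shell
# The console clients, pointed at the secured listener and the client
# configs created earlier. Topic name and paths are placeholders.
cat > run-clients.sh <<'EOF'
# Terminal 1: produce messages using the SCRAM credentials
bin/kafka-console-producer.sh --broker-list localhost:9092 \
  --topic my-topic --producer.config config/producer.properties

# Terminal 2: consume them back
bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 \
  --topic my-topic --from-beginning --consumer.config config/consumer.properties
EOF
chmod +x run-clients.sh
```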
Voilà! Your Kafka cluster is secured.
Summary
In a nutshell, we have secured our Kafka cluster using the SASL_SSL mechanism and learned how to create ACLs to grant different permissions to different users.
Apache Kafka is the wild west without security. By default, there is no encryption, authentication, or access control list. Any client can communicate with the Kafka broker using the PLAINTEXT port. Access using this port should be restricted to trusted clients only. You can use network segmentation and/or authentication ACLs to restrict access to trusted IP addresses in these cases. If none of these are used, the cluster is wide open and available to anyone. A basic knowledge of Kafka authentication, authorization, encryption, and audit trails is required to safely move a system into production.
All architectures share one common goal: managing the complexity of our application. We may not need to worry about it on a smaller project, but it becomes a lifesaver on larger ones. The purpose of Clean Architecture is to keep that complexity in check by isolating business rules from implementation details.
We must first understand a few things to implement the Clean Architecture in an Android project.
Entities: Encapsulate enterprise-wide critical business rules. An entity can be an object with methods or data structures and functions.
Use cases: Orchestrate the flow of data to and from the entities.
Controllers, gateways, presenters: A set of adapters that convert data from the format most convenient for the use cases and entities into the format most convenient for the outer layer (typically the UI).
UI, external interfaces, DB, web, devices: The outermost layer of the architecture, generally composed of frameworks such as database and web frameworks.
Here is one rule of thumb we need to follow. First, look at the direction of the arrows in the diagram. Entities do not depend on use cases, use cases do not depend on controllers, and so on. Outer layers may depend on inner layers, but never the reverse: all dependencies must point inwards.
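In a multi-module Android project, this rule can be enforced mechanically through Gradle module dependencies. A sketch, assuming hypothetical module names (:domain, :data, :presentation) matching the layers described below:

```kotlin
// presentation/build.gradle.kts -- the outermost module depends inward.
dependencies {
    implementation(project(":domain"))
    implementation(project(":data"))
}

// data/build.gradle.kts -- depends only on the domain abstractions.
dependencies {
    implementation(project(":domain"))
}

// domain/build.gradle.kts -- depends on no other module, and ideally
// on no Android framework artifacts at all, so the build fails if
// anyone accidentally reverses a dependency.
```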
Advantages of Clean Architecture:
Strict architecture—hard to make mistakes
Business logic is encapsulated, easy to use, and tested
Enforcement of dependencies through encapsulation
Allows for parallel development
Highly scalable
Easy to understand and maintain
Testing is facilitated
Let’s understand this through a small case study of an Android project, which gives more practical knowledge than theory alone.
A pragmatic approach
A typical Android project needs to separate the concerns between the UI, the business logic, and the data model, so taking “the theory” into account, we decided to split the project into three modules:
Domain Layer: contains the definitions of the business logic of the app, the data models, the abstract definition of repositories, and the definition of the use cases.
Domain Module
Data Layer: This layer provides the concrete implementations of the data sources. It contains the repository and data source implementations, the database definition and its DAOs, the network API definitions, and some mappers to convert network API models to database models and vice versa.
Data Module
Presentation layer: This is the layer that mainly interacts with the UI. It’s Android-specific and contains fragments, view models, adapters, activities, composables, and so on. It also includes a service locator to manage dependencies.
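A minimal sketch of a hand-rolled service locator. The types here (a simplified repository interface and an in-memory implementation) are hypothetical stand-ins; the real locator would wire the actual MarvelRemoteService, DAOs, and use cases:

```kotlin
// Simplified repository abstraction -- a stand-in for the real one.
interface SimpleCharacterRepository {
    fun getCharacterName(id: Long): String
}

// In-memory implementation used purely for illustration.
class InMemoryCharacterRepository : SimpleCharacterRepository {
    private val names = mapOf(1L to "Iron Man", 2L to "Thor")
    override fun getCharacterName(id: Long) = names[id] ?: "unknown"
}

object ServiceLocator {
    // Lazily created singleton; swap it for a fake in tests.
    val repository: SimpleCharacterRepository by lazy { InMemoryCharacterRepository() }
}

fun main() {
    println(ServiceLocator.repository.getCharacterName(1L)) // prints Iron Man
}
```

In a larger project you would likely replace this with a dependency-injection framework, but a locator keeps the example free of extra libraries.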
Presentation Module
Marvel’s comic characters App
To elaborate on all the above concepts related to Clean Architecture, we are creating an app that lists Marvel’s comic characters using Marvel’s developer API. The app shows a list of Marvel characters, and clicking on each character will show details of that character. Users can also bookmark their favorite characters. It seems like nothing complicated, right?
Before proceeding further into the sample, it’s good to have an idea of the following frameworks because the example is wholly based on them.
Jetpack Compose – Android’s recommended modern toolkit for building native UI.
Retrofit 2 – A type-safe HTTP client for Android for Network calls.
ViewModel – A class responsible for preparing and managing the data for an activity or a fragment.
Kotlin – A cross-platform, statically typed, general-purpose programming language with type inference.
To get the characters list, we use Marvel’s developer API, which returns the list of Marvel characters.
http://gateway.marvel.com/v1/public/characters
The domain layer
In the domain layer, we define the data model, the use cases, and the abstract definition of the character repository. The API returns a list of characters, with some info like name, description, and image links.
data class CharacterEntity(
    val id: Long,
    val name: String,
    val description: String,
    val imageUrl: String,
    val bookmarkStatus: Boolean
)
interface MarvelDataRepository {
    suspend fun getCharacters(dataSource: DataSource): Flow<List<CharacterEntity>>
    suspend fun getCharacter(characterId: Long): Flow<CharacterEntity>
    suspend fun toggleCharacterBookmarkStatus(characterId: Long): Boolean
    suspend fun getComics(dataSource: DataSource, characterId: Long): Flow<List<ComicsEntity>>
}
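The domain layer also defines the use cases that sit on top of this repository. A simplified sketch of the pattern follows; for brevity it uses plain calls and a stripped-down entity rather than the suspend/Flow signatures above, and the fake repository in main is purely illustrative:

```kotlin
// Stripped-down entity for illustration only.
data class SimpleCharacter(val id: Long, val name: String)

// Domain-level abstraction the use case depends on.
interface SimpleCharacterRepository {
    fun getCharacters(): List<SimpleCharacter>
}

// A use case is a single-responsibility callable that delegates to the
// repository abstraction defined in the domain layer.
class GetCharactersUseCase(private val repository: SimpleCharacterRepository) {
    operator fun invoke(): List<SimpleCharacter> = repository.getCharacters()
}

fun main() {
    val fakeRepo = object : SimpleCharacterRepository {
        override fun getCharacters() =
            listOf(SimpleCharacter(1, "Hulk"), SimpleCharacter(2, "Thor"))
    }
    val getCharacters = GetCharactersUseCase(fakeRepo)
    println(getCharacters().map { it.name }) // prints [Hulk, Thor]
}
```

Because the use case only sees the interface, it can be unit-tested with a fake repository, exactly as shown in main.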
As we said before, the data layer must implement the abstract definitions of the domain layer, so the repository’s concrete implementation lives here. To do so, we define two data sources: a “local” data source that provides persistence and a “remote” data source that fetches the data from the API.
class MarvelDataRepositoryImpl(
    private val marvelRemoteService: MarvelRemoteService,
    private val charactersDao: CharactersDao,
    private val comicsDao: ComicsDao,
    private val ioDispatcher: CoroutineDispatcher = Dispatchers.IO
) : MarvelDataRepository {

    override suspend fun getCharacters(dataSource: DataSource): Flow<List<CharacterEntity>> = flow {
        emitAll(
            when (dataSource) {
                is DataSource.Cache -> getCharactersCache()
                    .map { list ->
                        if (list.isEmpty()) {
                            getCharactersNetwork()
                        } else {
                            list.toDomain()
                        }
                    }
                    .flowOn(ioDispatcher)
                is DataSource.Network -> flowOf(getCharactersNetwork())
                    .flowOn(ioDispatcher)
            }
        )
    }

    private suspend fun getCharactersNetwork(): List<CharacterEntity> =
        marvelRemoteService.getCharacters().body()?.data?.results?.let { remoteData ->
            if (remoteData.isNotEmpty()) {
                charactersDao.upsert(remoteData.toCache())
            }
            remoteData.toDomain()
        } ?: emptyList()

    private fun getCharactersCache(): Flow<List<CharacterCache>> =
        charactersDao.getCharacters()

    override suspend fun getCharacter(characterId: Long): Flow<CharacterEntity> =
        charactersDao.getCharacterFlow(id = characterId).map { it.toDomain() }

    override suspend fun toggleCharacterBookmarkStatus(characterId: Long): Boolean {
        val status = charactersDao.getCharacter(characterId)?.bookmarkStatus?.not() ?: false
        return charactersDao.toggleCharacterBookmarkStatus(id = characterId, status = status) > 0
    }

    override suspend fun getComics(
        dataSource: DataSource,
        characterId: Long
    ): Flow<List<ComicsEntity>> = flow {
        emitAll(
            when (dataSource) {
                is DataSource.Cache -> getComicsCache(characterId = characterId)
                    .map { list ->
                        if (list.isEmpty()) {
                            getComicsNetwork(characterId = characterId)
                        } else {
                            list.toDomain()
                        }
                    }
                is DataSource.Network -> flowOf(getComicsNetwork(characterId = characterId))
                    .flowOn(ioDispatcher)
            }
        )
    }

    private suspend fun getComicsNetwork(characterId: Long): List<ComicsEntity> =
        marvelRemoteService.getComics(characterId = characterId)
            .body()?.data?.results?.let { remoteData ->
                if (remoteData.isNotEmpty()) {
                    comicsDao.upsert(remoteData.toCache(characterId = characterId))
                }
                remoteData.toDomain()
            } ?: emptyList()

    private fun getComicsCache(characterId: Long): Flow<List<ComicsCache>> =
        comicsDao.getComics(characterId = characterId)
}
Since we defined a data source to manage persistence, we also need to define the database in this layer; we are using the Room database. In addition, it’s good practice to create some mappers to map the API response to the corresponding database entity.
fun List<Characters>.toCache() = map { character -> character.toCache() }

fun Characters.toCache() = CharacterCache(
    id = id ?: 0,
    name = name ?: "",
    description = description ?: "",
    imageUrl = thumbnail?.let { "${it.path}.${it.extension}" } ?: ""
)

fun List<Characters>.toDomain() = map { character -> character.toDomain() }

fun Characters.toDomain() = CharacterEntity(
    id = id ?: 0,
    name = name ?: "",
    description = description ?: "",
    imageUrl = thumbnail?.let { "${it.path}.${it.extension}" } ?: "",
    bookmarkStatus = false
)
In this layer, we need a UI component such as a fragment, an activity, or a composable to display the list of characters; here, we use the widely adopted MVVM approach. The view model takes the use cases in its constructor and invokes the corresponding use case according to user actions (get a character, get characters and comics, etc.).
Each use case will invoke the appropriate method in the repository.
class CharactersListViewModel(
    private val getCharacters: GetCharactersUseCase,
    private val toggleCharacterBookmarkStatus: ToggleCharacterBookmarkStatus
) : ViewModel() {

    private val _characters = MutableStateFlow<UiState<List<CharacterViewState>>>(UiState.Loading())
    val characters: StateFlow<UiState<List<CharacterViewState>>> = _characters

    init {
        _characters.value = UiState.Loading()
        getAllCharacters()
    }

    private fun getAllCharacters(forceRefresh: Boolean = false) {
        getCharacters(forceRefresh)
            .catch { error ->
                error.printStackTrace()
                when (error) {
                    is UnknownHostException,
                    is ConnectException,
                    is SocketTimeoutException -> _characters.value = UiState.NoInternetError(error)
                    else -> _characters.value = UiState.ApiError(error)
                }
            }
            .map { list -> _characters.value = UiState.Loaded(list.toViewState()) }
            .launchIn(viewModelScope)
    }

    fun refresh(showLoader: Boolean = false) {
        if (showLoader) {
            _characters.value = UiState.Loading()
        }
        getAllCharacters(forceRefresh = true)
    }

    fun bookmarkCharacter(characterId: Long) {
        viewModelScope.launch {
            toggleCharacterBookmarkStatus(characterId = characterId)
        }
    }
}
As you can see, each layer communicates only with the closest one, keeping inner layers independent of outer layers. This way, we can quickly test each module separately, and the separation of concerns helps developers collaborate on the different modules of the project.