Category: Engineering blogs

  • An Introduction to Stream Processing & Analytics

    What is Stream Processing and Analytics?

Stream processing is a technology for processing large amounts of data in real time as it is generated, rather than storing it and processing it later.

    Think of it like a conveyor belt in a factory. The conveyor belt constantly moves, bringing in new products that need to be processed. Similarly, stream processing deals with data that is constantly flowing, like a stream of water. Just like the factory worker needs to process each product as it moves along the conveyor belt, stream processing technology processes each piece of data as it arrives.

    Stateful and stateless processing are two different approaches to stream processing, and the right choice depends on the specific requirements and needs of the application. 

    Stateful processing is useful in scenarios where the processing of an event or data point depends on the state of previous events or data points. For example, it can be used to maintain a running total or average across multiple events or data points.

    Stateless processing, on the other hand, is useful in scenarios where the processing of an event or data point does not depend on the state of previous events or data points. For example, in a simple data transformation application, stateless processing can be used to transform each event or data point independently without the need to maintain state.
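The distinction can be sketched in a few lines of plain Python (an illustrative toy, not a production stream processor; the event fields are made up):

```python
# Stateless vs. stateful stream processing, illustrated with plain Python.

def stateless_transform(event):
    """Stateless: each event is transformed independently of all others."""
    return {"sensor": event["sensor"], "celsius": (event["f"] - 32) * 5 / 9}

class RunningAverage:
    """Stateful: the output depends on every event seen so far."""
    def __init__(self):
        self.count = 0
        self.total = 0.0

    def update(self, value):
        self.count += 1
        self.total += value
        return self.total / self.count

stream = [{"sensor": "s1", "f": 32.0}, {"sensor": "s1", "f": 212.0}]

# Stateless: 32 F -> 0 C, 212 F -> 100 C, each computed in isolation
transformed = [stateless_transform(e) for e in stream]

# Stateful: the running average changes as events accumulate
avg = RunningAverage()
averages = [avg.update(e["f"]) for e in stream]  # [32.0, 122.0]
```

A real stateful processor must also persist this state (for example via checkpoints) so it survives restarts; that is exactly the state management engines like Flink provide.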

Streaming analytics refers to the process of analyzing and processing data in real time as it is generated, enabling applications to react to events and make decisions in near real time.

    Why Stream Processing and Analytics?

    Stream processing is important because it allows organizations to make real-time decisions based on the data they are receiving. This is particularly useful in situations where timely information is critical, such as in financial transactions, network security, and real-time monitoring of industrial processes.

    For example, in financial trading, stream processing can be used to analyze stock market data in real time and make split-second decisions to buy or sell stocks. In network security, it can be used to detect and respond to cyber-attacks in real time. And in industrial processes, it can be used to monitor production line efficiency and quickly identify and resolve any issues.

    Stream processing is also important because it can process massive amounts of data, making it ideal for big data applications. With the growth of the Internet of Things (IoT), the amount of data being generated is growing rapidly, and stream processing provides a way to process this data in real time and derive valuable insights.

In short, stream processing gives organizations the ability to make real-time decisions based on the data they are receiving, allowing them to respond quickly to changing conditions and improve their operations.

    How is it different from Batch Processing?

    Batch Data Processing:

    Batch Data Processing is a method of processing where a group of transactions or data is collected over a period of time and is then processed all at once in a “batch”. The process begins with the extraction of data from its sources, such as IoT devices or web/application logs. This data is then transformed and integrated into a data warehouse. The process is generally called the Extract, Transform, Load (ETL) process. The data warehouse is then used as the foundation for an analytical layer, which is where the data is analyzed, and insights are generated.

    Stream/Real-time Data Processing:

Real-time data streaming involves a continuous flow of data generated in real time, typically from multiple sources such as IoT devices or web/application logs. A message broker manages the flow of data between the stream processors, the analytical layer, and the data sink, ensuring that data is delivered in the correct order and is not lost. Stream processors perform data ingestion and processing: they take in the data streams and process them in real time. The processed data is then sent to an analytical layer, where it is analyzed and insights are generated.

    Processes involved in Stream processing and Analytics:

    The process of stream processing can be broken down into the following steps:

    • Data Collection: The first step in stream processing is collecting data from various sources, such as sensors, social media, and transactional systems. The data is then fed into a stream processing system in real time.
    • Data Ingestion: Once the data is collected, it needs to be ingested or taken into the stream processing system. This involves converting the data into a standard format that can be processed by the system.
    • Data Processing: The next step is to process the data as it arrives. This involves applying various processing algorithms and rules to the data, such as filtering, aggregating, and transforming the data. The processing algorithms can be applied to individual events in the stream or to the entire stream of data.
    • Data Storage: After the data has been processed, it is stored in a database or data warehouse for later analysis. The storage can be configured to retain the data for a specific amount of time or to retain all the data.
    • Data Analysis: The final step is to analyze the processed data and derive insights from it. This can be done using data visualization tools or by running reports and queries on the stored data. The insights can be used to make informed decisions or to trigger actions, such as sending notifications or triggering alerts.
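The five steps above can be sketched end to end as a toy in-memory pipeline (all names and the temperature threshold are illustrative; a real system would use a broker, a processing engine, and a database):

```python
import json

# Data Collection: raw readings arrive continuously from a source
def collect():
    yield from ['{"temp": 21}', '{"temp": 35}', '{"temp": 19}']

# Data Ingestion: convert each record into a standard format (here, dicts)
def ingest(raw_events):
    for raw in raw_events:
        yield json.loads(raw)

# Data Processing: filter events above a threshold as they arrive
def process(events, threshold=30):
    for event in events:
        if event["temp"] > threshold:
            yield event

# Data Storage: stand-in for a database or data warehouse
storage = []
for alert in process(ingest(collect())):
    storage.append(alert)

# Data Analysis: derive an insight from the stored results
insight = f"{len(storage)} high-temperature alert(s)"
```

Because each stage is a generator, events flow through one at a time, which mirrors how a streaming pipeline processes data continuously rather than in batches.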

It’s important to note that stream processing is an ongoing process, with data constantly being collected, processed, and analyzed in real time. This can be visualized as a continuous cycle of data flowing through the system, being processed and analyzed at each step along the way.

    Stream Processing Platforms & Frameworks:

    Stream Processing Platforms & Tools are software systems that enable the collection, processing, and analysis of real-time data streams.

    Stream Processing Frameworks:

    A stream processing framework is a software library or framework that provides a set of tools and APIs for developers to build custom stream processing applications. Frameworks typically require more development effort and configuration to set up and use. They provide more flexibility and control over the stream processing pipeline but also require more development and maintenance resources. 

    Examples: Apache Spark Streaming, Apache Flink, Apache Beam, Apache Storm, Apache Samza

    Let’s first look into the most commonly used stream processing frameworks: Apache Flink & Apache Spark Streaming.

Apache Flink:

    Flink is an open-source, unified stream-processing and batch-processing framework. Flink executes arbitrary dataflow programs in a data-parallel and pipelined manner, making it ideal for processing huge amounts of data in real-time.

• Flink provides out-of-the-box checkpointing and state management, two features that make managing enormous amounts of data relatively easy.
• Event processing, filter, and map functions further simplify handling large amounts of data.

Flink also comes with real-time indicators and alerts, which make a big difference when it comes to data processing and analysis.

Note: We have discussed stream processing and analytics with Flink in detail in Stream Processing and Analytics with Apache Flink.

Apache Spark Streaming:

    Apache Spark Streaming is a scalable fault-tolerant streaming processing system that natively supports both batch and streaming workloads. Spark Streaming is an extension of the core Spark API that allows data engineers and data scientists to process real-time data.

• Great for solving complicated transformation logic
• Easy to program
• Runs at blazing speeds
• Processes large data within a fraction of a second

    Stream Processing Platforms:

    A stream processing platform is an end-to-end solution for processing real-time data streams. Platforms typically require less development effort and maintenance as they provide pre-built tools and functionality for processing, analyzing, and visualizing data. 

    Examples: Apache Kafka, Amazon Kinesis, Google Cloud Pub-Sub

    Let’s look into the most commonly used stream processing platforms: Apache Kafka & AWS Kinesis.

    Apache Kafka: 

    Apache Kafka is an open-source distributed event streaming platform for high-performance data pipelines, streaming analytics, data integration, and mission-critical applications.

• Because it’s open source, Kafka generally requires a higher skill set to operate and manage.
• APIs allow “producers” to publish data streams to “topics”; a “topic” is a partitioned log of records; each “partition” is ordered and immutable; “consumers” subscribe to “topics.”
• It can run on a cluster of “brokers,” with partitions split across cluster nodes.
• Messages can be very large (the size limit is configurable, with a practical upper bound of about 2 GB).

    AWS Kinesis:

Amazon Kinesis is a cloud-based service on Amazon Web Services (AWS) that lets you ingest real-time data, such as application logs, website clickstreams, IoT telemetry, video, and audio, for machine learning and analytics.

• Amazon Kinesis is a fully managed service, reducing the complexity of the design, build, and manage stages compared to self-managed open-source Apache Kafka. It’s well suited for building microservices architectures.
• “Producers” push data onto the stream, and the data is available to consumers as soon as it arrives. Kinesis splits the stream across “shards” (which are similar to partitions).
• Shards have a hard limit on the number of transactions and data volume per second. If you need more throughput, you must provision more shards. You pay for what you use.
• Most maintenance and configuration is hidden from the user. Scaling is easy (just add shards) compared to Kafka.
• Maximum message size is 1 MB.

Three Characteristics of an Event Streaming Platform:

    Publish and Subscribe:

    In a publish-subscribe model, producers publish events or messages to streams or topics, and consumers subscribe to streams or topics to receive the events or messages. This is similar to a message queue or enterprise messaging system. It allows for the decoupling of the producer and consumer, enabling them to operate independently and asynchronously. 
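A minimal in-memory sketch of the publish-subscribe pattern (illustrative only; real brokers add persistence, partitioning, and delivery guarantees):

```python
from collections import defaultdict

class Broker:
    """Toy broker: routes published events to topic subscribers."""
    def __init__(self):
        self.subscribers = defaultdict(list)  # topic -> list of callbacks

    def subscribe(self, topic, callback):
        self.subscribers[topic].append(callback)

    def publish(self, topic, event):
        # Decoupling: the producer only knows the topic name,
        # never the consumers themselves.
        for callback in self.subscribers[topic]:
            callback(event)

broker = Broker()
received = []
broker.subscribe("orders", received.append)    # consumer
broker.publish("orders", {"order_id": 1})      # producer
broker.publish("payments", {"payment_id": 9})  # no subscriber -> dropped
```

Note how neither side references the other directly: the producer and consumer could be separate processes that only share the topic name.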

Store streams of events in a fault-tolerant way:

    This means that the platform is able to store and manage events in a reliable and resilient manner, even in the face of failures or errors. To achieve fault tolerance, event streaming platforms typically use a variety of techniques, such as replicating data across multiple nodes, and implementing data recovery and failover mechanisms.

Process streams of events in real time, as they occur:

    This means that the platform can process and analyze data as it is generated rather than waiting for data to be batch-processed or stored for later processing.
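In practice, real-time processing often means computing results over short windows as events arrive. Here is a toy tumbling-window count in plain Python (the window size and timestamps are illustrative):

```python
from collections import Counter

def window_counts(timestamps, window_size=10):
    """Count events per tumbling window of `window_size` seconds."""
    counts = Counter()
    for ts in timestamps:
        # Each event falls into exactly one window, keyed by its start time
        window_start = (ts // window_size) * window_size
        counts[window_start] += 1
    return dict(counts)

# Event timestamps in seconds: windows are [0,10), [10,20), [20,30)
counts = window_counts([1, 4, 12, 13, 25])  # {0: 2, 10: 2, 20: 1}
```

A window closes as soon as its time range passes, so results are available continuously instead of waiting for a whole batch to complete.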

    Challenges when designing the stream processing and analytics solution:

    Stream processing is a powerful technology, but there are also several challenges associated with it, including:

    • Late arriving data: Data that is delayed or arrives out of order can disrupt the processing pipeline and lead to inaccurate results. Stream processing systems need to be able to handle out-of-order data and reconcile it with the data that has already been processed.
    • Missing data: If data is missing or lost, it can impact the accuracy of the processing results. Stream processing systems need to be able to identify missing data and handle it appropriately, whether by skipping it, buffering it, or alerting a human operator.
    • Duplicate data: Duplicate data can lead to over-counting and skewed results. Stream processing systems need to be able to identify and de-duplicate data to ensure accurate results.
• Data skew: Data skew occurs when there is a disproportionate amount of data for certain key fields or time periods. This can lead to performance issues, processing delays, and inaccurate results. Stream processing systems need to handle data skew by load balancing and scaling resources appropriately.
    • Fault tolerance: Stream processing systems need to be able to handle hardware and software failures without disrupting the processing pipeline. This requires fault-tolerant design, redundancy, and failover mechanisms.
    • Data security and privacy: Real-time data processing often involves sensitive data, such as personal information, financial data, or intellectual property. Stream processing systems need to ensure that data is securely transmitted, stored, and processed in compliance with regulatory requirements.
• Latency: Another challenge with stream processing is latency, or the amount of time it takes for data to be processed and analyzed. In many applications, the results of the analysis need to be produced in real time, which puts pressure on the stream processing system to process the data quickly.
    • Scalability: Stream processing systems must be able to scale to handle large amounts of data as the amount of data being generated continues to grow. This can be a challenge because the systems must be designed to handle data in real-time while also ensuring that the results of the analysis are accurate and reliable.
    • Maintenance: Maintaining a stream processing system can also be challenging, as the systems are complex and require specialized knowledge to operate effectively. In addition, the systems must be able to evolve and adapt to changing requirements over time.
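Two of these challenges, duplicate data and late-arriving data, can be sketched with a seen-set and a simple watermark (the field names and lateness bound are illustrative; engines like Flink formalize this with watermarks and allowed lateness):

```python
def process_stream(events, allowed_lateness=5):
    """De-duplicate by event_id and flag events older than the watermark."""
    seen, accepted, late = set(), [], []
    watermark = 0  # highest event timestamp observed so far
    for event in events:
        if event["event_id"] in seen:
            continue  # duplicate: drop it to avoid over-counting
        seen.add(event["event_id"])
        watermark = max(watermark, event["ts"])
        if event["ts"] < watermark - allowed_lateness:
            late.append(event)      # too late: route for reconciliation
        else:
            accepted.append(event)
    return accepted, late

events = [
    {"event_id": "a", "ts": 10},
    {"event_id": "a", "ts": 10},  # duplicate
    {"event_id": "b", "ts": 20},
    {"event_id": "c", "ts": 3},   # late: 3 < 20 - 5
]
accepted, late = process_stream(events)
```

A production system would bound the seen-set (it cannot grow forever) and decide per use case whether late events are merged back in, corrected retroactively, or escalated to an operator.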

    Despite these challenges, stream processing remains an important technology for organizations that need to process data in real time and make informed decisions based on that data. By understanding these challenges and designing the systems to overcome them, organizations can realize the full potential of stream processing and improve their operations.

    Key benefits of stream processing and analytics:

    • Real-time processing keeps you in sync all the time:

For example: Suppose an online retailer uses a distributed system to process orders. The system might include multiple components, such as a web server, a database server, and an application server. Real-time processing keeps these components in sync by processing orders as they are received and updating the database accordingly. Maintaining a consistent view of the data ensures that orders are accurate and processed efficiently.

• Real-time data processing is more accurate and timely:

For example, a financial trading system that processes data in real time can help ensure that trades are executed at the best possible prices, improving the accuracy and timeliness of the trades.

• Deadlines are met with real-time processing:

    For example: In a control system, it may be necessary to respond to changing conditions within a certain time frame in order to maintain the stability of the system. 

    • Real-time processing is quite reactive:

    For example, a real-time processing system might be used to monitor a manufacturing process and trigger an alert if it detects a problem or to analyze sensor data from a power plant and adjust the plant’s operation in response to changing conditions.

    • Real-time processing involves multitasking:

    For example, consider a real-time monitoring system that is used to track the performance of a large manufacturing plant. The system might receive data from multiple sensors and sources, including machine sensors, temperature sensors, and video cameras. In this case, the system would need to be able to multitask in order to process and analyze data from all of these sources in real time and to trigger alerts or take other actions as needed. 

• Real-time processing rarely works in isolation:

    For example, a real-time processing system may rely on a database or message queue to store and retrieve data, or it may rely on external APIs or services to access additional data or functionality.

    Use case studies:

    There are many real-life examples of stream processing in different industries that demonstrate the benefits of this technology. Here are a few examples:

    • Financial Trading: In the financial industry, stream processing is used to analyze stock market data in real time and make split-second decisions to buy or sell stocks. This allows traders to respond to market conditions in real time and improve their chances of making a profit.
    • Network Security: Stream processing is also used in network security to detect and respond to cyber-attacks in real-time. By processing network data in real time, security systems can quickly identify and respond to threats, reducing the risk of a data breach.
    • Industrial Monitoring: In the industrial sector, stream processing is used to monitor production line efficiency and quickly identify and resolve any issues. For example, it can be used to monitor the performance of machinery and identify any potential problems before they cause a production shutdown.
    • Social Media Analysis: Stream processing is also used to analyze social media data in real time. This allows organizations to monitor brand reputation, track customer sentiment, and respond to customer complaints in real time.
    • Healthcare: In the healthcare industry, stream processing is used to monitor patient data in real time and quickly identify any potential health issues. For example, it can be used to monitor vital signs and alert healthcare providers if a patient’s condition worsens.

    These are just a few examples of the many real-life applications of stream processing. Across all industries, stream processing provides organizations with the ability to process data in real time and make informed decisions based on the data they are receiving.

    How to start stream analytics?

• Our recommendation when building a dedicated platform is to focus on choosing a versatile stream processor that pairs well with your existing analytical interface.
• Alternatively, keep an eye on vendors who offer both stream processing and BI as a service.

    Resources:

    Here are some useful resources for learning more about stream processing:

    Videos:

    Tutorials:

    Articles:

    These resources will provide a good starting point for learning more about stream processing and how it can be used to solve real-world problems. 

    Conclusion:

    Real-time data analysis and decision-making require stream processing and analytics in diverse industries, including finance, healthcare, and e-commerce. Organizations can improve operational efficiency, customer satisfaction, and revenue growth by processing data in real time. A robust infrastructure, skilled personnel, and efficient algorithms are required for stream processing and analytics. Businesses need stream processing and analytics to stay competitive and agile in today’s fast-paced world as data volumes and complexity continue to increase.

  • Machine Learning in Flutter using TensorFlow

Machine learning has become part of day-to-day life. Small tasks like searching for songs on YouTube and product suggestions on Amazon use ML in the background. This is a well-developed field of technology with immense possibilities. But how can we use it?

This blog is aimed at explaining how easy it is to use machine learning models (which will act as a brain) to build powerful ML-based Flutter applications. We will briefly cover definitions, the types of ML, TensorFlow, Flutter, and building an ML-Flutter application.

    1. Definitions

    Let’s jump to the part where most people are confused. A person who is not exposed to the IT industry might think AI, ML, & DL are all the same. So, let’s understand the difference.  

    Figure 01

    1.1. Artificial Intelligence (AI): 

    AI, i.e. artificial intelligence, is a concept of machines being able to carry out tasks in a smarter way. You all must have used YouTube. In the search bar, you can type the lyrics of any song, even lyrics that are not necessarily the starting part of the song or title of songs, and get almost perfect results every time. This is the work of a very powerful AI.
    Artificial intelligence is the ability of a machine to do tasks that are usually done by humans. This ability is special because the task we are talking about requires human intelligence and discernment.

    1.2. Machine Learning (ML):

Machine learning is a subset of artificial intelligence. It is based on the idea that we expose machines to new data, which can be complete or partial, and let the machine decide the future output. We can also say it is a sub-field of AI that deals with the extraction of patterns from data sets. With each new data set, and by processing its previous results, the machine slowly converges on the expected result. This means the machine can find rules for optimal behavior to produce new output. It can also adapt itself to new, changing data, just like humans.

1.3. Deep Learning (DL): 

    Deep learning is again a smaller subset of machine learning, which is essentially a neural network with multiple layers. These neural networks attempt to simulate the behavior of the human brain, so you can say we are trying to create an artificial human brain. With one layer of a neural network, we can still make approximate predictions, and additional layers can help to optimize and refine for accuracy.

    2. Types of ML

    Before starting the implementation, we need to know the types of machine learning because it is very important to know which type is more suitable for our expected functionality.

    Figure 02

    2.1. Supervised Learning

As the name suggests, in supervised learning, the learning happens under supervision. Supervision means the data provided to the machine is already classified, i.e., each piece of data has a fixed label, and inputs are already mapped to outputs.
Once the machine has learned, it is ready to classify new data.
This learning is useful for tasks like fraud detection, spam filtering, etc.
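As a minimal sketch of the idea in plain Python, here is a 1-nearest-neighbor classifier trained on labeled (input, output) pairs; the data and labels below are made up for illustration:

```python
def predict(train, point):
    """Return the label of the training example closest to `point`."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    features, label = min(train, key=lambda ex: dist2(ex[0], point))
    return label

# Labeled training data: (transaction amount, hour of day) -> label
train = [
    ((5.0, 12.0), "legit"),
    ((900.0, 3.0), "fraud"),
    ((7.5, 14.0), "legit"),
]

label_a = predict(train, (850.0, 2.0))  # closest to the "fraud" example
label_b = predict(train, (6.0, 13.0))   # closest to a "legit" example
```

The machine never sees a rule for "fraud"; it generalizes purely from the labeled examples, which is the essence of supervised learning.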

    2.2. Unsupervised Learning

    In unsupervised learning, the data given to machines to learn is purely raw, with no tags or labels. Here, the machine is the one that will create new classes by extracting patterns.
    This learning can be used for clustering, association, etc.

    2.3. Semi-Supervised Learning

Both supervised and unsupervised learning have their own limitations: one requires labeled data, and the other extracts structure without labels. Semi-supervised learning combines the behavior of both, which lets us overcome those limitations.
In this learning, we feed both raw data and categorized data to the machine so it can classify the raw data and, if necessary, create new clusters.

2.4. Reinforcement Learning

For this learning, we feed the previous output’s feedback, along with new incoming data, to the machine so it can learn from its mistakes. This feedback-based process continues until the machine reaches the desired output. The feedback is given in the form of punishment or reward. It is like a search algorithm that shows you a list of results and learns from which results users actually click. It is also like a human child who learns from every available option and grows by correcting its mistakes.
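A toy feedback loop in plain Python illustrates this (not a full reinforcement-learning algorithm; the option names and learning rate are made up): each option's score is nudged toward the reward (+1) or punishment (-1) it earns, so the machine gradually prefers the rewarded option.

```python
def train(feedback, rounds=100, lr=0.1):
    """feedback maps each option to its reward (+1) or punishment (-1)."""
    scores = {option: 0.0 for option in feedback}
    for _ in range(rounds):
        for option, signal in feedback.items():
            # Move the score a small step toward the observed feedback
            scores[option] += lr * (signal - scores[option])
    return scores

# Users reward result 1 (they click it) and punish result 5 (they ignore it)
scores = train({"result_1": 1.0, "result_5": -1.0})
best = max(scores, key=scores.get)  # the machine now prefers result_1
```

Real reinforcement learning adds states, actions, and delayed rewards, but the core loop of act, receive feedback, and adjust is the same.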

    3. TensorFlow

Machine learning is a complex process where we need to perform multiple activities, like acquiring and processing data, training models, serving predictions, and refining future results.

    To perform such operations, Google developed a framework in November 2015 called TensorFlow. All the above-mentioned processes can become easy if we use the TensorFlow framework. 

For this project, we are not going to use the complete TensorFlow framework but a smaller tool called TensorFlow Lite.

    3.1. TensorFlow Lite

TensorFlow Lite allows us to run machine learning models on devices with limited resources, such as limited RAM or storage.

    3.2. TensorFlow Lite Features

• Optimized for on-device ML by addressing five key constraints:
      • Latency: there’s no round trip to a server
      • Privacy: no personal data leaves the device
      • Connectivity: internet connectivity is not required
      • Size: reduced model and binary size
      • Power consumption: efficient inference and no network connections
    • Support for Android and iOS devices, embedded Linux, and microcontrollers
    • Support for Java, Swift, Objective-C, C++, and Python programming languages
    • High performance, with hardware acceleration and model optimization
    • End-to-end examples for common machine learning tasks such as image classification, object detection, pose estimation, question answering, text classification, etc., on multiple platforms

    4. What is Flutter?

Flutter is an open-source, cross-platform development framework. With Flutter, using a single code base, we can create applications for Android, iOS, the web, and desktop. It was created by Google and uses Dart as its development language. The first stable version of Flutter was released in December 2018, and since then, there have been many improvements.

    5. Building an ML-Flutter Application

We are now going to build a Flutter application that can determine a person’s state of mind from their facial expressions. The steps below explain the updates we need to make for an Android-native application. For an iOS application, please refer to the links provided in the steps.

    5.1. TensorFlow Lite – Native setup (Android)

• In android/app/build.gradle, add the following setting in the android block:
    aaptOptions {
        noCompress 'tflite'
        noCompress 'lite'
    }

    5.2. TensorFlow Lite – Flutter setup (Dart)

• Create an assets folder and place your label file and model file in it. (We will create these files shortly.) In pubspec.yaml, add:
    assets:
       - assets/labels.txt
       - assets/<file_name>.tflite

     

    Figure 02

• Run this command (installs the TensorFlow Lite package): 
    $ flutter pub add tflite

    • Add the following line to your package’s pubspec.yaml (and run an implicit flutter pub get):
    dependencies:
         tflite: ^0.9.0

    • Now in your Dart code, you can use:
    import 'package:tflite/tflite.dart';

    • Add camera dependencies to your package’s pubspec.yaml (optional):
    dependencies:
         camera: ^0.10.0+1

    • Now in your Dart code, you can use:
    import 'package:camera/camera.dart';

• As the camera is a hardware feature, there are a few updates we need to make in the native code for both Android & iOS. To learn more, visit:
    https://pub.dev/packages/camera
• Following is the code that will appear under dependencies in pubspec.yaml once the setup is complete.
    Figure 03
• Flutter will automatically download the most recent version if you omit the version number of a package.
    • Do not forget to add the assets folder in the root directory.

    5.3. Generate model (using website)

    • Click on Get Started

    • Select Image project
    • There are three different categories of ML projects available. We’ll choose an image project since we’re going to develop a project that analyzes a person’s facial expression to determine their emotional condition.
    • The other two types, audio project and pose project, will be useful for creating projects that involve audio operation and human pose indication, respectively.

    • Select Standard Image model
• Once more, there are two distinct groups of image machine learning projects. Since we are creating a project for an Android smartphone, we will select a standard image project.
    • The other type, an Embedded Image Model project, is designed for hardware with relatively little memory and computing power.

    • Upload images for training the classes
    • We will create new classes by clicking on “Add a class.”
    • We must upload photographs to these classes as we are developing a project that analyzes a person’s emotional state from their facial expression.
    • The more photographs we upload, the more precise our result will be.
    • Click on train model and wait till training is over
    • Click on Export model
• Select the TensorFlow Lite tab -> Quantized button -> Download my model

    5.4. Add files/models to the Flutter project

    • Labels.txt

This file contains all the class names you created during model creation.

     

    • *.tflite

The downloaded ZIP contains the model file along with its associated files.

    5.5. Load & Run ML-Model

• We import the model from assets; this line of code is crucial, as the model will serve as the project’s brain.
• Here, we configure the camera using a camera controller and obtain a live feed (cameras[0] selects the first available camera).

    6. Conclusion

As discussed in this blog, with TensorFlow Lite and a trained model, we can build a well-performing ML-based Flutter application with relatively little effort.

  • Demystifying UI Frameworks and Theming for React Apps

    Introduction:

    In this blog, we will be talking about design systems, diving into the different types of CSS frameworks/libraries, then looking into issues that come with choosing a framework that is not right for your type of project. Then we will be going over different business use cases where these different frameworks/libraries match their requirements.

    Let’s paint a scenario: when starting a project, you start by choosing a JS framework. Let’s say, for example, that you went with a popular framework like React. Depending on whether you want an isomorphic app, you will look at Next.js. Next, we choose a UI framework, and that’s when our next set of problems appears.

    WHICH ONE?

    It’s hard to go with even the popular ones because it might not be what you are looking for. There are different types of libraries handling different types of use cases, and there are so many similar ones that each handle stuff slightly differently.

    These frameworks come and go, so it’s essential to understand the fundamentals of CSS. These libraries and frameworks help you build faster; they don’t change how CSS works.

    But, continuing with our scenario, let’s say we choose a popular library like Bootstrap, or Material. Then, later on, as you’re working through the project, you notice issues like:

– Overriding default classes more than required 

– Ending up with ugly-looking code that’s hard to read

– Bloated CSS that reduces performance (flash-of-unstyled-content issues, worse CLS and FCP scores)

– Swiftly changing designs while you’re stuck with a rigid framework, so migrating is hard and requires a lot more effort

– Needing swift development but ending up building from scratch

– Ending up with a div soup with no semantic meaning

    To solve these problems and understand how these frameworks work, we have segregated them into the following category types. 

    We will dig into each category and look at how they work, their pros/cons and their business use case.

    Categorizing the available frameworks:

    Vanilla Libraries 

These libraries allow you to write vanilla CSS with some added benefits like vendor prefixing, component-level scoping, etc. You can use them as building blocks to create your own styling methodology. Essentially, CSS-in-JS libraries fall into this category. CSS Modules also qualifies, since you are writing CSS in a module file.

    Also, inline styles in React may look like a CSS-in-JS method, but they are different. With inline styles, you lose media queries, keyframe animations, and selectors like pseudo-classes, pseudo-elements, and attribute selectors; CSS-in-JS libraries support all of these.

    They also differ in how the CSS is output: inline styling results in inline CSS on the HTML element itself, whereas CSS-in-JS outputs internal styles with generated class names.

    Nowadays, these css-in-js types are popular for their optimized critical render path strategy for performance.

    Example:

    Emotion

    import styled from '@emotion/styled';
    const Button = styled.button`
        padding: 32px;
        background-color: hotpink;
        font-size: 24px;
        border-radius: 4px;
        color: black;
        font-weight: bold;
        &:hover {
            color: white;
        }
    `
    render(<Button>This is my button component.</Button>)

    Styled Components

    import styled, { css } from 'styled-components';

    const Button = styled.a`
        /* This renders the buttons above... Edit me! */
        display: inline-block;
        border-radius: 3px;
        padding: 0.5rem 0;
        margin: 0.5rem 1rem;
        width: 11rem;
        background: transparent;
        color: white;
        border: 2px solid white;

        /* The GitHub button is a primary button
         * edit this to target it specifically! */
        ${(props) => props.primary && css`
            background: white;
            color: black;
        `}
    `;

    List of example frameworks: 

       – Styled components

       – Emotion

       – Vanilla-extract

       – Stitches

       – CSS modules
    (CSS modules is not an official spec or a browser implementation; rather, it’s a build step, with the help of Webpack or Browserify, that rewrites class names and selectors to be scoped.)

    Pros:

    • Fully customizable—you can build on top of it
    • Doesn’t bloat CSS, only loads needed CSS
    • Performance
    • Little to no style collision

    Cons:

    • Requires effort and time to make components from scratch
    • Danger of writing smelly code
    • Have to handle accessibility on your own

    Where would you use these?

    • A website with an unconventional design that must be built from scratch.
    • Where performance and high Web Vitals scores are required; performance here means an optimized critical-render-path strategy that affects FCP and CLS.
    • Generally, it would be user-facing applications like B2C.

    Unstyled / Functional Libraries

    Before coming to the library, we would like to cover a bit on accessibility.

    Apart from a website’s visual side, there is also a functional aspect: accessibility.

    Many times, when we say accessibility in the context of web development, people automatically think of screen readers. But it doesn’t just mean making websites usable for people with disabilities; it means enabling as many people as possible to use your websites, including people facing situational or technical limitations:

    • Different age groups: font size settings on phones and in the browser should be reflected in the app

    • Situational limitations: dark mode and light mode

    • Different devices: mobile, desktop, tablet

    • Different screen sizes: ultrawide 21:9, standard 16:9 monitors

    • Interaction methods: websites should be usable with keyboard only, mouse, touch, etc.

    These libraries, however, mostly handle accessibility for people with disabilities, along with interaction methods and focus management. The rest, the more visual settings like screen sizes and light/dark mode, is left to developers.

    In general, ARIA attributes and roles are used to provide information about the interaction of a complex widget. The libraries here sprinkle this information onto their components before giving them to be styled.

    So, in short, these are low-level UI libraries that handle the functional part of the UI elements, like accessibility, keyboard navigation, or how they work. They come with little-to-no styling, which is meant to be overridden.
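The pattern can be sketched without any particular library (a toy, not Radix’s or React Aria’s actual API): the headless layer computes ARIA attributes and keyboard handling, and the consumer spreads them onto an element it styles itself:

```javascript
// Toy headless-UI pattern: the "library" computes the accessibility
// props; the consumer owns all styling. (Illustrative only -- Radix,
// React Aria, etc. expose richer, React-specific APIs.)
function getToggleButtonProps({ pressed, onToggle }) {
  return {
    role: 'button',
    tabIndex: 0,                      // keyboard focusable
    'aria-pressed': pressed,          // announced by screen readers
    onClick: onToggle,
    onKeyDown: (event) => {
      // Space and Enter activate a button for keyboard users
      if (event.key === ' ' || event.key === 'Enter') onToggle();
    },
  };
}

let pressed = false;
const props = getToggleButtonProps({
  pressed,
  onToggle: () => { pressed = !pressed; },
});
props.onKeyDown({ key: 'Enter' });    // keyboard interaction works
// `pressed` is now true; spread `props` onto any element you style.
```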

    Radix UI

    // Compose a Dialog with custom focus management
    import React from 'react';
    import * as Dialog from '@radix-ui/react-dialog';

    export const InfoDialog = ({ children }) => {
        const dialogCloseButton = React.useRef(null);
        return (
            <Dialog.Root>
                <Dialog.Trigger>View details</Dialog.Trigger>
                <Dialog.Portal>
                    <Dialog.Overlay />
                    <Dialog.Content
                        onOpenAutoFocus={(event) => {
                            // Focus the close button when dialog opens
                            dialogCloseButton.current?.focus();
                            event.preventDefault();
                        }}>
                        {children}
                        <Dialog.Close ref={dialogCloseButton}>
                            Close
                        </Dialog.Close>
                    </Dialog.Content>
                </Dialog.Portal>
            </Dialog.Root>
        )
    }

    React Aria

    import React from "react";
    import { useBreadcrumbs } from "react-aria";

    function Breadcrumbs(props) {
        let { navProps } = useBreadcrumbs(props);
        let children = React.Children.toArray(props.children);
        return (
            <nav {...navProps}>
                <ol style={{ display: 'flex', listStyle: 'none', margin: 0 }}>
                    {children.map((child, i) =>
                        React.cloneElement(child, { isCurrent: i === children.length - 1 })
                    )}
                </ol>
            </nav>
        )
    }

    List of the frameworks:

    • Radix UI
    • Reach UI
    • React Aria, React Stately (by Adobe)
    • Headless-UI

    Pros:

    • Gives perfect accessibility and functionality
    • Gives the flexibility to create composable elements
    • Unopinionated styling, free to override

    Cons:

    • Can’t be used for a rapid development project or prototyping
    • Have to understand the docs thoroughly to continue development at a normal pace

    Where would you use these?

    • Websites like news sites or article pages generally won’t require this.
    • Applications where accessibility is more important than styling and design (Government websites, banking, or even internal company apps).
    • Applications where importance is given to both accessibility and design, so customizability to these components is preferred (Teamflow, CodeSandbox, Vercel).
    • Can be paired with Vanilla libraries to provide performance with accessibility.
    • Can be paired with utility-style libraries to provide relatively faster development with accessibility.

    Utility Styled Library / Framework

    These types of libraries allow you to style your elements through their interfaces, either through class names or component props using composable individual CSS properties as per your requirements. The strongest point you have with such libraries is the flexibility of writing custom CSS properties. With these libraries, you would often require a “wrapper” class or components to be able to reuse them. 

    These libraries dump these utility classes into your HTML, impacting your performance. Though there is still an option to improve the performance by purging the unused CSS from your project in a build step, even with that, the performance won’t be as good as css-in-js. The purging would look at the class names throughout the whole project and remove them if there is no reference. So, when loading a page, it would still load CSS that is not being used on the current page but another one.
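The purge step described above can be sketched as a scan over project files for class-name tokens, dropping any rule whose class never appears (real purgers like PurgeCSS parse CSS properly; this string-based toy only shows the project-wide nature of the scan):

```javascript
// Toy CSS purge: keep a utility rule only if its class name appears
// somewhere in the project's markup. Because the scan is project-wide,
// a class used on *any* page survives into the shared stylesheet --
// which is why the current page may still download CSS it never uses.
function purgeCss(rules, projectFiles) {
  const content = projectFiles.join(' ');
  return rules.filter(({ className }) => content.includes(className));
}

const rules = [
  { className: 'py-4', css: '.py-4 { padding: 1rem 0; }' },
  { className: 'flex', css: '.flex { display: flex; }' },
  { className: 'rotate-45', css: '.rotate-45 { transform: rotate(45deg); }' },
];
const projectFiles = [
  '<li class="py-4 flex">A</li>',   // page A
  '<div class="flex">B</div>',      // page B
];
const kept = purgeCss(rules, projectFiles);
// 'rotate-45' is dropped; 'py-4' survives even for page B, which never uses it.
```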

    Tailwind

    const people = [
      {
        name: 'Calvin Hawkins',
        email: 'calvin.hawkins@example.com',
        image:
          'https://images.unsplash.com/photo-1491528323818-fdd1faba62cc?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&auto=format&fit=facearea&facepad=2&w=256&h=256&q=80',
      },
      {
        name: 'Kristen Ramos',
        email: 'kristen.ramos@example.com',
        image:
          'https://images.unsplash.com/photo-1550525811-e5869dd03032?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&auto=format&fit=facearea&facepad=2&w=256&h=256&q=80',
      },
      {
        name: 'Ted Fox',
        email: 'ted.fox@example.com',
        image:
          'https://images.unsplash.com/photo-1500648767791-00dcc994a43e?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&auto=format&fit=facearea&facepad=2&w=256&h=256&q=80',
      },
    ]
    
    export default function Example() {
      return (
        <ul className="divide-y divide-gray-200">
          {people.map((person) => (
            <li key={person.email} className="py-4 flex">
              <img className="h-10 w-10 rounded-full" src={person.image} alt="" />
              <div className="ml-3">
                <p className="text-sm font-medium text-gray-900">{person.name}</p>
                <p className="text-sm text-gray-500">{person.email}</p>
              </div>
            </li>
          ))}
        </ul>
      )
    }

    Chakra UI

    import { Badge, Box, Center, Flex, Image, Text } from "@chakra-ui/react";
    import { MdStar } from "react-icons/md";
    
    export default function Example() {
      return (
        <Center h="100vh">
          <Box p="5" maxW="320px" borderWidth="1px">
            <Image borderRadius="md" src="https://bit.ly/2k1H1t6" />
            <Flex align="baseline" mt={2}>
              <Badge colorScheme="pink">Plus</Badge>
              <Text
                ml={2}
                textTransform="uppercase"
                fontSize="sm"
                fontWeight="bold"
                color="pink.800"
              >
                Verified • Cape Town
              </Text>
            </Flex>
            <Text mt={2} fontSize="xl" fontWeight="semibold" lineHeight="short">
              Modern, Chic Penthouse with Mountain, City & Sea Views
            </Text>
            <Text mt={2}>$119/night</Text>
            <Flex mt={2} align="center">
              <Box as={MdStar} color="orange.400" />
              <Text ml={1} fontSize="sm">
                <b>4.84</b> (190)
              </Text>
            </Flex>
          </Box>
        </Center>
      );
    }

    List of the frameworks

    • Tailwind
    • Chakra UI (although it has some prebuilt components, its concept is derived from Tailwind)
    • Tachyons
    • xStyled

    Pros:

    • Rapid development and prototyping
    • Gives flexibility to styling
    • Enforces a little consistency; you don’t have to use magic numbers while creating the layout (spacing values, responsive variables like xs, sm, etc.)
    • Less context switching—you’ll write CSS in your HTML elements

    Cons:

    • Ending up with ugly-looking, hard-to-read markup
    • No focus on components; you have to handle accessibility yourself
    • Creates a global stylesheet that can carry unused classes

    Where would you use these?

    • Easier composition of simpler components to build large applications.
    • Modular applications where rapid customization is required, like font sizes, color palettes, themes, etc.
    • FinTech or healthcare applications where you need features like theme-based toggling in light/dark mode to be already present.
    • Applications where responsive design needs to be supported out of the box, along with accessible defaults and custom breakpoints.

    Pre-styled / All-In-One Framework 

    These are popular frameworks that come with pre-styled, ready-to-use components out of the box with little customization.

    These are heavy libraries that have fixed styling that can be overridden. However, generally speaking, overriding the classes would just load in extra CSS, which just clogs up the performance. These kinds of libraries are generally more useful for rapid prototyping and not in places with heavy customization and priority on performance.

    These are quite beginner-friendly as well, but if you are a beginner, it is best to learn the basics and fundamentals of CSS rather than relying on frameworks like these as a crutch. That said, these frameworks do have their strength in speed of development.

    Material UI

    // Note: `formik` and `keyUpHandler` are defined elsewhere in the component
    <Box
        component="form"
        className="lgn-form-content"
        id="loginForm"
        onSubmit={formik.handleSubmit}
    >
        <Input
            id="activationCode"
            placeholder="Enter 6 Digit Auth Code"
            className="lgn-form-input"
            type="text"
            onChange={formik.handleChange}
            value={formik.values.activationCode}
        />

        <Button
            sx={{ marginBottom: "24px", marginTop: "1rem" }}
            type="submit"
            className="lgn-form-submit"
            form="loginForm"
            onKeyUp={(e) =>
                keyUpHandler(e, formik.handleSubmit, formik.isSubmitting)
            }
        >
            <Typography className="lgn-form-submit-text">
                Activate & Sign In
            </Typography>
        </Button>
        {formik.errors.activationCode && formik.touched.activationCode ? (
            <Typography color="white">{formik.errors.activationCode}</Typography>
        ) : null}
    </Box>

    Bootstrap

    <Accordion isExpanded={true} useArrow={true}>
       <AccordionLabel className="editor-accordion-label">RULES</AccordionLabel>
       <AccordionSection>
         <div className="editor-detail-panel editor-detail-panel-column">
           <div className="label">Define conditional by adding a rule</div>
           <div className="rule-actions"></div>
         </div>
       </AccordionSection>
    </Accordion>

    List of the framework:

    • Bootstrap
    • Semantic UI
    • Material UI
    • Bulma
    • Mantine 

    Pros: 

    • Faster development, saves time since everything comes out of the box.
    • Helps avoid cross-browser bugs
    • Helps follow best practices (accessibility)

    Cons:

    • Low customization
    • Have to become familiar with the framework and its nuances
    • Bloated CSS, since it loads everything from the framework on top of the overridden styles

    Where would you use these?

    • Focus is not on nitty-gritty design but on development speed and functionality.
    • Enterprise apps where the UI structure of the application isn’t dynamic and doesn’t get altered a lot.
    • B2B apps mostly where the focus is on getting the functionality out fast—UX is mostly driven by ease of use of the functionality with a consistent UI design.
    • Applications where you want to focus more on cross-browser compatibility.

    Conclusion:

    This is not a hard and fast rule; there are still a bunch of parameters that aren’t covered in this blog, like developer preference, or legacy code that already uses a pre-existing framework. So, pick one that seems right for you, considering the parameters in and outside this blog and your judgment.

    To summarize a little on the pros and cons of the above categories, here is a TLDR diagram:

    Pictorial Representation of the Summary

  • Cube – An Innovative Framework to Build Embedded Analytics

    Historically, embedded analytics was thought of as an integral part of a comprehensive business intelligence (BI) system. However, when we considered our particular needs, we soon realized something more innovative was necessary. That is when we came across Cube (formerly CubeJS), a powerful platform that could revolutionize how we think about embedded analytics solutions.

    This new way of modularizing analytics solutions means businesses can access the exact services and features they require at any given time without purchasing a comprehensive suite of analytics services, which can often be more expensive and complex than necessary.

    Furthermore, Cube makes it very easy to link up data sources and start to get to grips with analytics, which provides clear and tangible benefits for businesses. This new tool has the potential to be a real game changer in the world of embedded analytics, and we are very excited to explore its potential.

    Understanding Embedded Analytics

    When you read a word like “embedded analytics” or something similar, you probably think of an HTML embed tag or an iFrame tag. This is because analytics was considered a separate application and not part of the SaaS application, so the market had tools specifically for analytics.

    “Embedded analytics is a digital workplace capability where data analysis occurs within a user’s natural workflow, without the need to toggle to another application. Moreover, embedded analytics tends to be narrowly deployed around specific processes such as marketing campaign optimization, sales lead conversions, inventory demand planning, and financial budgeting.” – Gartner

    Embedded Analytics is not just about importing data into an iFrame—it’s all about creating an optimal user experience where the analytics feel like they are an integral part of the native application. To ensure that the user experience is as seamless as possible, great attention must be paid to how the analytics are integrated into the application. This can be done with careful thought to design and by anticipating user needs and ensuring that the analytics are intuitive and easy to use. This way, users can get the most out of their analytics experience.

    Existing Solutions

    With the rising need for SaaS applications and the number of SaaS applications being built daily, analytics must be part of the SaaS application.

    We have identified three different categories of existing solutions available in the market.

    Traditional BI Platforms

    Many tools, such as GoodData, Tableau, Metabase, Looker, and Power BI, are part of the big and traditional BI platforms. Despite their wide range of features and capabilities, these platforms struggle with their big monolith architecture, limited customization, and less-than-intuitive user interfaces, making them difficult and time-consuming to work with.

    Here are a few reasons these are not suitable for us:

    • They lack customization, and their UI is not intuitive, so they won’t be able to match our UX needs.
    • They charge a hefty amount, which is unsuitable for startups or small-scale companies.
    • They have a big monolith architecture, making integrating with other solutions difficult.

    New Generation Tools

    The next experiment taking place in the market is the introduction of tools such as Hex, Observable, Streamlit, etc. These tools are better suited for embedded needs and customization, but they are designed for developers and data scientists. Although the go-to-market time is shorter, these tools cannot be integrated into SaaS applications.

    Here are a few reasons why these are not suitable for us:

    • They are not suitable for non-technical people and cannot integrate with Software-as-a-Service (SaaS) applications.
    • Since they are mainly built for developers and data scientists, they don’t provide a good user experience.
    • They are not capable of handling multiple data sources simultaneously.
    • They do not provide pre-aggregation and caching solutions.

    In House Tools

    Building everything in-house instead of paying other platforms is possible using API servers and GraphQL. However, there is a catch: analytics requirements are rarely straightforward, so building them takes significant expertise, creating a big hurdle to adoption and a longer time-to-market.

    Here are a few reasons why these are not suitable for us:

    • Building everything in-house requires a lot of expertise and time, thus resulting in a longer time to market.
    • It requires developing a secure authentication and authorization system, which adds to the complexity.
    • It requires the development of a caching system to improve the performance of analytics.
    • It requires the development of a real-time system for dynamic dashboards.
    • It requires the development of complex SQL queries to query multiple data sources.

    Typical Analytics Features

    If you want to build analytics features, the typical requirements look like this:

    Multi-Tenancy

    When developing software-as-a-service (SaaS) applications, it is often necessary to incorporate multi-tenancy into the architecture. This means multiple users will be accessing the same software application, but with a unique and individualized experience. To guarantee that this experience is not compromised, it is essential to ensure that the same multi-tenancy principles are carried over into the analytics solution that you are integrating into your SaaS application. It is important to remember that this will require additional configuration and setup on your part to ensure that all of your users have access to the same level of tools and insights.
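One way to enforce this kind of isolation, sketched here in the style of Cube’s queryRewrite configuration hook (the member name Orders.tenantId and field names are illustrative), is to append a tenant filter to every incoming query:

```javascript
// Sketch of tenant isolation in the style of Cube's `queryRewrite`
// extension point: every incoming query gets a filter on the caller's
// tenant id taken from the security context. Names are illustrative.
const queryRewrite = (query, { securityContext }) => {
  if (!securityContext || !securityContext.tenantId) {
    throw new Error('No tenant in security context');
  }
  query.filters = [
    ...(query.filters || []),
    {
      member: 'Orders.tenantId',
      operator: 'equals',
      values: [securityContext.tenantId],
    },
  ];
  return query;
};

const rewritten = queryRewrite(
  { measures: ['Orders.count'] },
  { securityContext: { tenantId: 'acme' } }
);
// Every tenant now sees only their own rows, regardless of the query sent.
```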

    Intuitive Charts

    If you look at some of the available analytics tools, they may have good charting features, but they may not be able to meet your specific UX needs. In today’s world, many advanced UI libraries and designs are available, which are often far more effective than the charting features of analytics tools. Integrating these solutions could help you create a more user-friendly experience tailored specifically to your business requirements.

    Security

    You want to have authentication and authorization for your analytics so that managers can get an overview of the analytics for their entire team, while individual users can only see their own analytics. Furthermore, you may want to grant users with certain roles access to certain analytics charts and other data to better understand how their team is performing. To ensure that your analytics are secure and that only the right people have access to the right information, it is vital to set up an authentication and authorization system.

    Caching

    Caching is an incredibly powerful tool for improving the performance and economics of serving your analytics. By implementing a good caching solution, you can see drastic improvements in the speed and efficiency of your analytics, while also providing an improved user experience. Additionally, the cost savings associated with this approach can be quite significant, providing you with a greater return on investment. Caching can be implemented in various ways, but the most effective approaches are tailored to the specific needs of your analytics. By leveraging the right caching solutions, you can maximize the benefits of your analytics and ensure that your users have an optimized experience.

    Real-time

    Nowadays, every successful SaaS company understands the importance of having dynamic and real-time dashboards; these dashboards provide users with the ability to access the latest data without requiring them to refresh the tab each and every time. By having real-time dashboards, companies can ensure their customers have access to the latest information, which can help them make more informed decisions. This is why it is becoming increasingly important for SaaS organizations to invest in robust, low-latency dashboard solutions that can deliver accurate, up-to-date data to their customers.

    Drilldowns

    Drilldown is an incredibly powerful analytics capability that enables users to rapidly transition from an aggregated, top-level overview of their data to a more granular, in-depth view. This can be achieved simply by clicking on a metric within a dashboard or report. With drill-down, users can gain a greater understanding of the data by uncovering deeper insights, allowing them to more effectively evaluate the data and gain a more accurate understanding of their data trends.

    Data Sources

    With the prevalence of software as a service (SaaS) applications, there could be a range of different data sources used, including PostgreSQL, DynamoDB, and other types of databases. As such, it is important for analytics solutions to be capable of accommodating multiple data sources at once to provide the most comprehensive insights. By leveraging the various sources of information, in conjunction with advanced analytics, businesses can gain a thorough understanding of their customers, as well as trends and behaviors. Additionally, accessing and combining data from multiple sources can allow for more precise predictions and recommendations, thereby optimizing the customer experience and improving overall performance.

    Budget

    Pricing is one of the most vital aspects to consider when selecting an analytics tool. There are various pricing models, such as AWS QuickSight’s, which can be quite complex, or per-user pricing, which can be very expensive for larger organizations. Additionally, there is custom pricing, which requires you to contact customer care to get the right quote; this can be quite a difficult process and a big barrier to adoption. Ultimately, it is important to understand the different pricing models available and how they may affect your budget before selecting an analytics tool.

    After examining all the requirements, we came across a solution like Cube, which is an innovative solution with the following features:

    • Open Source: Since it is open source, you can easily do a proof-of-concept (POC) and get good support, as any vulnerabilities will be fixed quickly.
    • Modular Architecture: It can provide good customizations, such as using Cube to use any custom charting library you prefer in your current framework.
    • Embedded Analytics-as-a-Code: You can easily replicate your analytics and version control it, as Cube is analytics in the form of code.
    • Cloud Deployments: It is a new-age tool, so it comes with good support with Docker or Kubernetes (K8s). Therefore, you can easily deploy it on the cloud.

    Cube Architecture

    Let’s look at the Cube architecture to understand why Cube is an innovative solution.

    • Cube supports multiple data sources simultaneously; your data may be stored in Postgres, Snowflake, and Redshift, and you can connect to all of them simultaneously. Additionally, they have a long list of data sources they can support.
    • Cube provides analytics over a REST API; very few analytics solutions provide chart data or metrics over REST APIs.
    • The security you might be using for your application can easily be mirrored for Cube. This helps simplify the security aspects, as you don’t need to maintain multiple tokens for the app and analytics tool.
    • Cube provides a unique way to model your data in JSON format; it’s more similar to an ORM. You don’t need to write complex SQL queries; once you model your data, Cube will generate the SQL to query the data source.
    • Cube has very good pre-aggregation and caching solutions.

    Cube Deep Dive

    Let’s look into different concepts that we just saw briefly in the architecture diagram.

    Data Modeling

    Cube

    A cube represents a table of data and is conceptually similar to a view in SQL. It’s like an ORM where you can define a schema, extend it, or define abstract cubes to make code reusable. For example, if you have a Customer table, you need to write a cube for it. Using cubes, you can build analytical queries.

    Each cube contains definitions of measures, dimensions, segments, and joins between cubes. Cube bifurcates columns into measures and dimensions. Similar to tables, every cube can be referenced in another cube. Even though a cube represents a table, you choose which columns to expose for analytics: you add only the columns you want exposed, and only the dimensions and measures a query actually uses get translated into SQL (the push-down mechanism).

    cube('Orders', {
      sql: `SELECT * FROM orders`,
    });

    Dimensions

    You can think about a dimension as an attribute related to a measure, for example, the measure userCount. This measure can have different dimensions, such as country, age, occupation, etc.

    Dimensions allow us to further subdivide and analyze the measure, providing a more detailed and comprehensive picture of the data.

    cube('Orders', {

      ...,

      dimensions: {
        status: {
          sql: `status`,
          type: `string`,
        },
      },
    });

    Measures

    These parameters/SQL columns allow you to define the aggregations for numeric or quantitative data. Measures can be used to perform calculations such as sum, minimum, maximum, average, and count on any set of data.

    Measures also help you define filters if you want to add some conditions for a metric calculation. For example, you can set thresholds to filter out any data that is not within the range of values you are looking for.

    Additionally, measures can be used to create additional metrics, such as the ratio between two different measures or the percentage of a measure. With these powerful tools, you can effectively analyze and interpret your data to gain valuable insights.

    cube('Orders', {
    
      ...,
    
      measures: {
        count: {
          type: `count`,
        },
      },
    });
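Putting the dimension and measure above together, the push-down mechanism can be sketched as a toy SQL generator that touches only the members a query names (a simplification; Cube’s real SQL generation is far richer):

```javascript
// Toy illustration of push-down: translate a cube definition plus a
// requested query into SQL, including only the members the query names.
const ordersCube = {
  sql: 'SELECT * FROM orders',
  dimensions: { status: { sql: 'status', type: 'string' } },
  measures: { count: { type: 'count' } },
};

function toSql(cube, { measures = [], dimensions = [] }) {
  const dimCols = dimensions.map((d) => cube.dimensions[d].sql);
  const measureCols = measures
    .map((m) => (cube.measures[m].type === 'count' ? 'COUNT(*) AS count' : null))
    .filter(Boolean);
  const select = [...dimCols, ...measureCols].join(', ');
  const groupBy = dimCols.length ? ` GROUP BY ${dimCols.join(', ')}` : '';
  return `SELECT ${select} FROM (${cube.sql}) AS base${groupBy}`;
}

const sql = toSql(ordersCube, { measures: ['count'], dimensions: ['status'] });
// -> SELECT status, COUNT(*) AS count FROM (SELECT * FROM orders) AS base GROUP BY status
```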

    Joins

    Joins define the relationships between cubes, which then allows accessing and comparing properties from two or more cubes at the same time. In Cube, all joins are LEFT JOINs. This also allows you to represent one-to-one, many-to-one relationships easily.

    cube('Orders', {
    
      ...,
    
      joins: {
        LineItems: {
          relationship: `belongsTo`,
          // Here we use the `CUBE` global to refer to the current cube,
          // so the following is equivalent to `Orders.id = LineItems.order_id`
          sql: `${CUBE}.id = ${LineItems}.order_id`,
        },
      },
    });

    There are three kinds of join relationships:

    • belongsTo
    • hasOne
    • hasMany

    Segments

    Segments are filters predefined in the schema instead of a Cube query. Segments help pre-build complex filtering logic, simplifying Cube queries and making it easy to re-use common filters across a variety of queries.

    To add a segment that limits results to completed orders, we can do the following:

    cube('Orders', {
      ...,
      segments: {
        onlyCompleted: {
          sql: `${CUBE}.status = 'completed'`,
        },
      },
    });

    Pre-Aggregations

    Pre-aggregations are a powerful way of caching frequently-used, expensive queries and keeping the cache up-to-date periodically. The most popular roll-up pre-aggregation is summarized data of the original cube grouped by any selected dimensions of interest. It works on “measure types” like count, sum, min, max, etc.

    When queries execute over a smaller dataset, the application works well and delivers responses within acceptable thresholds. However, as the size of the dataset grows, the time-to-response from a user’s perspective can often suffer quite heavily. A pre-aggregation specifies attributes from the source that Cube uses to condense (or crunch) the data; Cube then analyzes incoming queries against the defined set of pre-aggregation rules to choose the optimal one for creating the pre-aggregation table. This simple yet powerful optimization can reduce the size of the dataset by several orders of magnitude, and it ensures subsequent queries with matching attributes can be served from the same condensed dataset.

    Even granularity can be specified, which defines the granularity of data within the pre-aggregation. If set to week, for example, then Cube will pre-aggregate the data by week and persist it to Cube Store.

    Cube can also take care of keeping pre-aggregations up-to-date with the refreshKey property. By default, it is set to every: ‘1 hour’.

    cube('Orders', {
    
      ...,
    
      preAggregations: {
        main: {
          measures: [CUBE.count],
          dimensions: [CUBE.status],
          timeDimension: CUBE.createdAt,
          granularity: 'day',
        },
      },
    });
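To override the default refresh interval described above, a refreshKey can be added to the same pre-aggregation definition (the 30-minute value here is an arbitrary example):

```javascript
cube('Orders', {
  ...,
  preAggregations: {
    main: {
      measures: [CUBE.count],
      dimensions: [CUBE.status],
      timeDimension: CUBE.createdAt,
      granularity: 'day',
      // Rebuild this pre-aggregation every 30 minutes instead of the 1-hour default
      refreshKey: {
        every: '30 minutes',
      },
    },
  },
});
```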

    Additional Cube Concepts

    Let’s look into some of the additional concepts that Cube provides that make it a unique solution.

    Caching

    Cube provides a two-level caching system. The first level is in-memory cache, which is active by default. Cube in-memory cache acts as a buffer for your database when there is a burst of requests hitting the same data from multiple concurrent users, while pre-aggregations are designed to provide the right balance between time to insight and querying performance.

    The second level of caching is called pre-aggregations, and requires explicit configuration to activate.

    Drilldowns

    Drilldowns are a powerful feature to facilitate data exploration. They allow you to build an interface that lets users dive deeper into visualizations and data tables. See ResultSet.drillDown() on how to use this feature on the client side.

    A drilldown is defined on the measure level in your data schema. It is defined as a list of dimensions called drill members. Once defined, these drill members will always be used to show underlying data when drilling into that measure.
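A minimal sketch of a measure with drill members, assuming id, status, and createdAt dimensions exist on the cube:

```javascript
cube('Orders', {
  sql: `SELECT * FROM orders`,
  measures: {
    count: {
      type: `count`,
      // These dimensions are shown as the underlying data when drilling into count
      drillMembers: [CUBE.id, CUBE.status, CUBE.createdAt],
    },
  },
  dimensions: {
    id: { sql: `id`, type: `number`, primaryKey: true },
    status: { sql: `status`, type: `string` },
    createdAt: { sql: `created_at`, type: `time` },
  },
});
```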

    Subquery

    You can use subqueries within dimensions to reference measures from other cubes inside a dimension. Under the hood, it behaves as a correlated subquery, but is implemented via joins for optimal performance and portability.

    For example, the following SQL can be expressed in Cube using a subquery dimension:

    SELECT
      id,
      (SELECT SUM(amount) FROM deals WHERE deals.sales_manager_id = sales_managers.id) AS deals_amount
    FROM sales_managers
    GROUP BY 1

    Cube Representation

    cube(`Deals`, {
      sql: `SELECT * FROM deals`,
      measures: {
        amount: {
          sql: `amount`,
          type: `sum`,
        },
      },
    });
    
    cube(`SalesManagers`, {
      sql: `SELECT * FROM sales_managers`,
    
      joins: {
        Deals: {
          relationship: `hasMany`,
          sql: `${SalesManagers}.id = ${Deals}.sales_manager_id`,
        },
      },
    
      measures: {
        averageDealAmount: {
          sql: `${dealsAmount}`,
          type: `avg`,
        },
      },
    
      dimensions: {
        dealsAmount: {
          sql: `${Deals.amount}`,
          type: `number`,
          subQuery: true,
        },
      },
    });

    Apart from these, Cube also provides advanced concepts such as Export and Import, Extending Cubes, Data Blending, Dynamic Schema Creation, and Polymorphic Cubes. You can read more about them in the Cube documentation.

    Getting Started with Cube

    Getting started with Cube is easy. All you need to do is follow the instructions on the Cube documentation page.

    The quickest way to get started is with Docker, which lets you run Cube in a few easy steps:

    1. In a new folder for your project, run the following command:

    docker run -p 4000:4000 -p 3000:3000 \
      -v ${PWD}:/cube/conf \
      -e CUBEJS_DEV_MODE=true \
      cubejs/cube

    2. Head to http://localhost:4000 to open Developer Playground.

    The Developer Playground has a database connection wizard that loads when Cube is first started up and no .env file is found. After database credentials have been set up, a .env file will automatically be created and populated with the same credentials.

    Click on the type of database to connect to, and you’ll be able to enter credentials:

    After clicking Apply, you should see the available tables from the configured database. Select one to generate a data schema. Once the schema is generated, you can execute queries on the Build tab.

    Conclusion

    Cube is a revolutionary, open-source framework for building embedded analytics applications. It offers a unified API for connecting to any data source, comprehensive visualization libraries, and a data-driven user experience that makes it easy for developers to build interactive applications quickly. With Cube, developers can focus on the application logic and let the framework take care of the data, making it an ideal platform for creating data-driven applications that can be deployed on the web, mobile, and desktop. It is an invaluable tool for any developer interested in building sophisticated analytics applications quickly and easily.

  • How to deploy GitHub Actions Self-Hosted Runners on Kubernetes

    GitHub Actions jobs are run in the cloud by default; however, sometimes we want to run jobs in our own customized/private environment where we have full control. That is where self-hosted runners come in. 

    If you want a basic understanding of running self-hosted runners on a Kubernetes cluster, this blog is for you. 

    We’ll be focusing on running GitHub Actions on a self-hosted runner on Kubernetes. 

    An example use case would be to create an automation in GitHub Actions to execute MySQL queries on MySQL Database running in a private network (i.e., MySQL DB, which is not accessible publicly).

    A self-hosted runner normally requires the provisioning and configuration of a virtual machine instance; here, we are running it on Kubernetes instead. The actions-runner-controller makes it possible to run self-hosted runners on a Kubernetes cluster.

    This blog aims to try out self-hosted runners on Kubernetes and covers:

    1. Deploying a MySQL database on minikube that is accessible only within the Kubernetes cluster.
    2. Deploying self-hosted action runners on minikube.
    3. Running a GitHub Action on minikube to execute MySQL queries on the MySQL database.

    Steps for completing this tutorial:

    Create a GitHub repository

    1. Create a private repository on GitHub. I am creating it with the name velotio/action-runner-poc.

    Setup a Kubernetes cluster using minikube

    1. Install Docker.
    2. Install Minikube.
    3. Install Helm.
    4. Install kubectl.

    Install cert-manager on a Kubernetes cluster

    • By default, actions-runner-controller uses cert-manager for certificate management of its admission webhook, so we have to make sure cert-manager is installed on Kubernetes before we install actions-runner-controller. 
    • Run the below helm commands to install cert-manager on minikube.
    • Verify the installation using “kubectl --namespace cert-manager get all”. If everything is okay, the cert-manager pods will show as Running.
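For reference, a typical cert-manager installation via Helm looks like the following (jetstack is the official chart repository; check the cert-manager documentation for the current version and options):

```shell
helm repo add jetstack https://charts.jetstack.io
helm repo update
helm install cert-manager jetstack/cert-manager \
  --namespace cert-manager \
  --create-namespace \
  --set installCRDs=true
```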

    Setting Up Authentication for Hosted Runners

    There are two ways for actions-runner-controller to authenticate with the GitHub API (only 1 can be configured at a time, however):

    1. Using a GitHub App (not supported for enterprise-level runners due to lack of support from GitHub.)
    2. Using a PAT (personal access token)

    To keep this blog simple, we are going with PAT.

    To authenticate with the GitHub API, actions-runner-controller can use a PAT, which it also uses to register self-hosted runners.

    • Go to your account > Settings > Developer settings > Personal access tokens. Click on “Generate new token”. Under scopes, select “Full control of private repositories”.
    • Click on the “Generate token” button.
    • Copy the generated token and run the below commands to create a Kubernetes secret, which will be used by the actions-runner-controller deployment.
    export GITHUB_TOKEN=XXXxxxXXXxxxxXYAVNa 

    kubectl create ns actions-runner-system

    Create secret

    kubectl create secret generic controller-manager -n actions-runner-system \
    --from-literal=github_token=${GITHUB_TOKEN}

    Install action runner controller on the Kubernetes cluster

    • Run the below helm commands:
    helm repo add actions-runner-controller https://actions-runner-controller.github.io/actions-runner-controller
    helm repo update
    helm upgrade --install --namespace actions-runner-system \
      --create-namespace --wait actions-runner-controller \
      actions-runner-controller/actions-runner-controller \
      --set syncPeriod=1m

    • Verify that actions-runner-controller installed properly using the below command:
    kubectl --namespace actions-runner-system get all

     

    Create a Repository Runner

    • Create a RunnerDeployment Kubernetes object, which will create a self-hosted runner named k8s-action-runner for the GitHub repository velotio/action-runner-poc.
    • Update the repository name from “velotio/action-runner-poc” to “<Your-repo-name>”.
    • To create the RunnerDeployment object, create the file runner.yaml as follows:
    apiVersion: actions.summerwind.dev/v1alpha1
    kind: RunnerDeployment
    metadata:
     name: k8s-action-runner
     namespace: actions-runner-system
    spec:
     replicas: 2
     template:
       spec:
         repository: velotio/action-runner-poc

    • To create it, run this command:
    kubectl create -f runner.yaml

    Check that the pod is running using the below command:

    kubectl get pod -n actions-runner-system | grep -i "k8s-action-runner"

    • If everything goes well, you should see two action runners on Kubernetes, and the same runners registered on GitHub. Check under Settings > Actions > Runners of your repository.
    • Check the pod with kubectl get po -n actions-runner-system

    Install a MySQL Database on the Kubernetes cluster

    • Create PV and PVC for MySQL Database. 
    • Create mysql-pv.yaml with the below content.
    apiVersion: v1
    kind: PersistentVolume
    metadata:
     name: mysql-pv-volume
     labels:
       type: local
    spec:
     capacity:
       storage: 2Gi
     accessModes:
       - ReadWriteOnce
     hostPath:
       path: "/mnt/data"
    ---
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
     name: mysql-pv-claim
    spec:
     accessModes:
       - ReadWriteOnce
     resources:
       requests:
         storage: 2Gi

    • Create mysql namespace
    kubectl create ns mysql

    • Now apply mysql-pv.yaml to create PV and PVC 
    kubectl create -f mysql-pv.yaml -n mysql

    Create the file mysql-svc-deploy.yaml and add the below content.

    Here, we have used MYSQL_ROOT_PASSWORD as “password”.

    apiVersion: v1
    kind: Service
    metadata:
     name: mysql
    spec:
     ports:
       - port: 3306
     selector:
       app: mysql
     clusterIP: None
    ---
    apiVersion: apps/v1
    kind: Deployment
    metadata:
     name: mysql
    spec:
     selector:
       matchLabels:
         app: mysql
     strategy:
       type: Recreate
     template:
       metadata:
         labels:
           app: mysql
       spec:
         containers:
           - image: mysql:5.6
             name: mysql
             env:
                 # Use secret in real usage
               - name: MYSQL_ROOT_PASSWORD
                 value: password
             ports:
               - containerPort: 3306
                 name: mysql
             volumeMounts:
               - name: mysql-persistent-storage
                 mountPath: /var/lib/mysql
         volumes:
           - name: mysql-persistent-storage
             persistentVolumeClaim:
               claimName: mysql-pv-claim

    • Create the service and deployment
    kubectl create -f mysql-svc-deploy.yaml -n mysql

    • Verify that the MySQL database is running
    kubectl get po -n mysql

    Create a GitHub repository secret to store MySQL password

    We will use the MySQL password in the GitHub Actions workflow file, and as a good practice, we should not keep it there in plain text. So we will store the MySQL password in GitHub secrets and reference the secret in our workflow file.

    • Create a secret in the GitHub repository and give the name to the secret as “MYSQL_PASS”, and in the values, enter “password”. 

    Create a GitHub workflow file

    • YAML syntax is used to write GitHub workflows. Each workflow goes in a separate YAML file stored in the .github/workflows/ directory. So, create a .github/workflows/ directory in your repository and add a file .github/workflows/mysql_workflow.yaml as follows:
    ---
    name: Example 1
    on:
     push:
       branches: [ main ]
    jobs:
     build:
       name: Build-job
       runs-on: self-hosted
       steps:
       - name: Checkout
         uses: actions/checkout@v2
     
       - name: MySQLQuery
         env:
           PASS: ${{ secrets.MYSQL_PASS }}
         run: |
           docker run -v ${GITHUB_WORKSPACE}:/var/lib/docker --rm mysql:5.6 sh -c "mysql -u root -p$PASS -hmysql.mysql.svc.cluster.local </var/lib/docker/test.sql"

    • If you check the docker run command in the mysql_workflow.yaml file, we are referring to the .sql file, i.e., test.sql. So, create a test.sql file in your repository as follows:
    use mysql;
    CREATE TABLE IF NOT EXISTS Persons (
       PersonID int,
       LastName varchar(255),
       FirstName varchar(255),
       Address varchar(255),
       City varchar(255)
    );
     
    SHOW TABLES;

    • In test.sql, we are running MySQL queries, such as creating a table.
    • Push changes to your repository main branch.
    • If everything is fine, you will be able to see that the GitHub action is getting executed in a self-hosted runner pod. You can check it under the “Actions” tab of your repository.
    • You can check the workflow logs to see the output of the SHOW TABLES command we used in the test.sql file and verify whether the Persons table was created.

  • How to Setup HashiCorp Vault HA Cluster with Integrated Storage (Raft)

    As businesses move their data to the public cloud, one of the most pressing issues is how to keep it safe from unauthorized access.

    Using a tool like HashiCorp Vault gives you greater control over your sensitive credentials and fulfills cloud security regulations.

    In this blog, we’ll walk you through HashiCorp Vault High Availability Setup.

    HashiCorp Vault

    HashiCorp Vault is an open-source tool that provides a secure, reliable way to store and distribute sensitive information like API keys, access tokens, passwords, etc. Vault provides high-level policy management, secret leasing, audit logging, and automatic revocation to protect this information using the UI, CLI, or HTTP API.

    High Availability

    Vault can run in a High Availability mode to protect against outages by running multiple Vault servers. When running in HA mode, Vault servers have two additional states, i.e., active and standby. Within a Vault cluster, only a single instance will be active, handling all requests, and all standby instances redirect requests to the active instance.

    Integrated Storage Raft

    The Integrated Storage backend is used to maintain Vault’s data. Unlike other storage backends, Integrated Storage does not operate from a single source of data. Instead, all the nodes in a Vault cluster will have a replicated copy of Vault’s data. Data gets replicated across all the nodes via the Raft Consensus Algorithm.

    Raft is officially supported by HashiCorp.

    Architecture

    Prerequisites

    This setup requires Vault, sudo access on the machines, and the below configuration to create the cluster.

    • Install Vault v1.6.3+ent or later on all nodes in the Vault cluster 

    In this example, we have 3 CentOS VMs provisioned using VMware. 

    Setup

    1. Verify the Vault version on all the nodes using the below command (in this case, we have 3 nodes: node1, node2, and node3):

    vault --version

    2. Configure SSL certificates

    Note: Vault should always be used with TLS in production to provide secure communication between clients and the Vault server. It requires a certificate file and key file on each Vault host.

    We can generate SSL certs for the Vault cluster on the master node and copy them to the other nodes in the cluster.

    Refer to: https://developer.hashicorp.com/vault/tutorials/secrets-management/pki-engine#scenario-introduction for generating SSL certs.

    • Copy tls.crt, tls.key, and tls_ca.pem to /etc/vault.d/ssl/ 
    • Change ownership to `vault`
    [user@node1 ~]$ cd /etc/vault.d/ssl/           
    [user@node1 ssl]$ sudo chown vault. tls*

    • Copy tls* from /etc/vault.d/ssl to the other nodes in the cluster

    3. Configure the enterprise license. Copy the license to all nodes:

    cp /root/vault.hclic /etc/vault.d/vault.hclic
    chown root:vault /etc/vault.d/vault.hclic
    chmod 0640 /etc/vault.d/vault.hclic

    4. Create the storage directory for raft storage on all nodes:

    sudo mkdir --parents /opt/raft
    sudo chown --recursive vault:vault /opt/raft

    5. Set firewall rules on all nodes:

    sudo firewall-cmd --permanent --add-port=8200/tcp
    sudo firewall-cmd --permanent --add-port=8201/tcp
    sudo firewall-cmd --reload

    6. Create vault configuration file on all nodes:

    ### Node 1 ###
    [user@node1 vault.d]$ cat vault.hcl
    storage "raft" {
        path = "/opt/raft"
        node_id = "node1"
        retry_join 
        {
            leader_api_addr = "https://node2.int.us-west-1-dev.central.example.com:8200"
            leader_ca_cert_file = "/etc/vault.d/ssl/tls_ca.pem"
            leader_client_cert_file = "/etc/vault.d/ssl/tls.crt"
            leader_client_key_file = "/etc/vault.d/ssl/tls.key"
        }
        retry_join 
        {
            leader_api_addr = "https://node3.int.us-west-1-dev.central.example.com:8200"
            leader_ca_cert_file = "/etc/vault.d/ssl/tls_ca.pem"
            leader_client_cert_file = "/etc/vault.d/ssl/tls.crt"
            leader_client_key_file = "/etc/vault.d/ssl/tls.key"
        }
    }
    
    listener "tcp" {
       address = "0.0.0.0:8200"
       tls_disable = false
       tls_cert_file = "/etc/vault.d/ssl/tls.crt"
       tls_key_file = "/etc/vault.d/ssl/tls.key"
       tls_client_ca_file = "/etc/vault.d/ssl/tls_ca.pem"
       tls_cipher_suites = "TLS_TEST_128_GCM_SHA256,
                            TLS_TEST_128_GCM_SHA256,
                            TLS_TEST20_POLY1305,
                            TLS_TEST_256_GCM_SHA384,
                            TLS_TEST20_POLY1305,
                            TLS_TEST_256_GCM_SHA384"
    }
    api_addr = "https://node1.int.us-west-1-dev.central.example.com:8200"
    cluster_addr = "https://node1.int.us-west-1-dev.central.example.com:8201"
    disable_mlock = true
    ui = true
    log_level = "trace"
    disable_cache = true
    cluster_name = "POC"
    
    # Enterprise license_path
    # This will be required for enterprise as of v1.8
    license_path = "/etc/vault.d/vault.hclic"

    ### Node 2 ###
    [user@node2 vault.d]$ cat vault.hcl
    storage "raft" {
        path = "/opt/raft"
        node_id = "node2"
        retry_join 
        {
            leader_api_addr = "https://node1.int.us-west-1-dev.central.example.com:8200"
            leader_ca_cert_file = "/etc/vault.d/ssl/tls_ca.pem"
            leader_client_cert_file = "/etc/vault.d/ssl/tls.crt"
            leader_client_key_file = "/etc/vault.d/ssl/tls.key"
        }
        retry_join 
        {
            leader_api_addr = "https://node3.int.us-west-1-dev.central.example.com:8200"
            leader_ca_cert_file = "/etc/vault.d/ssl/tls_ca.pem"
            leader_client_cert_file = "/etc/vault.d/ssl/tls.crt"
            leader_client_key_file = "/etc/vault.d/ssl/tls.key"
        } 
    }
    
    listener "tcp" {
       address = "0.0.0.0:8200"
       tls_disable = false
       tls_cert_file = "/etc/vault.d/ssl/tls.crt"
       tls_key_file = "/etc/vault.d/ssl/tls.key"
       tls_client_ca_file = "/etc/vault.d/ssl/tls_ca.pem"
       tls_cipher_suites = "TLS_TEST_128_GCM_SHA256,
                            TLS_TEST_128_GCM_SHA256,
                            TLS_TEST20_POLY1305,
                            TLS_TEST_256_GCM_SHA384,
                            TLS_TEST20_POLY1305,
                            TLS_TEST_256_GCM_SHA384"
    }
    api_addr = "https://node2.int.us-west-1-dev.central.example.com:8200"
    cluster_addr = "https://node2.int.us-west-1-dev.central.example.com:8201"
    disable_mlock = true
    ui = true
    log_level = "trace"
    disable_cache = true
    cluster_name = "POC"
    
    # Enterprise license_path
    # This will be required for enterprise as of v1.8
    license_path = "/etc/vault.d/vault.hclic"

    ### Node 3 ###
    [user@node3 ~]$ cat /etc/vault.d/vault.hcl
    storage "raft" {
        path = "/opt/raft"
        node_id = "node3"
        retry_join 
        {
            leader_api_addr = "https://node1.int.us-west-1-dev.central.example.com:8200"
            leader_ca_cert_file = "/etc/vault.d/ssl/tls_ca.pem"
            leader_client_cert_file = "/etc/vault.d/ssl/tls.crt"
            leader_client_key_file = "/etc/vault.d/ssl/tls.key"
        }
        retry_join 
        {
            leader_api_addr = "https://node2.int.us-west-1-dev.central.example.com:8200"
            leader_ca_cert_file = "/etc/vault.d/ssl/tls_ca.pem"
            leader_client_cert_file = "/etc/vault.d/ssl/tls.crt"
            leader_client_key_file = "/etc/vault.d/ssl/tls.key"
        }
    }
    
    listener "tcp" {
       address = "0.0.0.0:8200"
       tls_disable = false
       tls_cert_file = "/etc/vault.d/ssl/tls.crt"
       tls_key_file = "/etc/vault.d/ssl/tls.key"
       tls_client_ca_file = "/etc/vault.d/ssl/tls_ca.pem"
       tls_cipher_suites = "TLS_TEST_128_GCM_SHA256,
                            TLS_TEST_128_GCM_SHA256,
                            TLS_TEST20_POLY1305,
                            TLS_TEST_256_GCM_SHA384,
                            TLS_TEST20_POLY1305,
                            TLS_TEST_256_GCM_SHA384"
    }
    api_addr = "https://node3.int.us-west-1-dev.central.example.com:8200"
    cluster_addr = "https://node3.int.us-west-1-dev.central.example.com:8201"
    disable_mlock = true
    ui = true
    log_level = "trace"
    disable_cache = true
    cluster_name = "POC"
    
    # Enterprise license_path
    # This will be required for enterprise as of v1.8
    license_path = "/etc/vault.d/vault.hclic"

    7. Set environment variables on all nodes:

    export VAULT_ADDR=https://$(hostname):8200
    export VAULT_CACERT=/etc/vault.d/ssl/tls_ca.pem
    export CA_CERT=`cat /etc/vault.d/ssl/tls_ca.pem`

    8. Start Vault as a service on all nodes:

    You can view the systemd unit file if interested, then enable and start the service: 

    cat /etc/systemd/system/vault.service
    systemctl enable vault.service
    systemctl start vault.service
    systemctl status vault.service

    9. Check Vault status on all nodes:

    vault status

    10. Initialize Vault with the following command on vault node 1 only. Store unseal keys securely.

    [user@node1 vault.d]$ vault operator init -key-shares=1 -key-threshold=1
    Unseal Key 1: HPY/g5OiT8ivD6L4Bqfjx9L1We2MVb4WZAqKZk6zFf8=
    Initial Root Token: hvs.j4qTq1IZP9nscILMtN2p9GE0
    Vault initialized with 1 key shares and a key threshold of 1.
    Please securely distribute the key shares printed above. 
    When the Vault is re-sealed, restarted, or stopped, you must supply at least 1 of these keys to unseal it
    before it can start servicing requests.
    Vault does not store the generated root key. 
    Without at least 1 keys to reconstruct the root key, Vault will remain permanently sealed!
    It is possible to generate new unseal keys, provided you have a
    quorum of existing unseal keys shares. See "vault operator rekey" for more information.

    11. Set the Vault token environment variable so the vault CLI can authenticate to the server. Use the following command, replacing <initial-root-token> with the value generated in the previous step.

    export VAULT_TOKEN=<initial-root-token>
    echo "export VAULT_TOKEN=$VAULT_TOKEN" >> /root/.bash_profile
    ### Repeat this step for the other 2 servers.

    12. Unseal Vault1 using the unseal key generated in step 10. Notice the Unseal Progress key-value change as you present each key. After meeting the key threshold, the status of the key value for Sealed should change from true to false.

    [user@node1 vault.d]$ vault operator unseal HPY/g5OiT8ivD6L4Bqfjx9L1We2MVb4WZAqKZk6zFf8=
    Key                         Value
    ---                         -----
    Seal Type                   shamir
    Initialized                 true
    Sealed                      false
    Total Shares                1
    Threshold                   1
    Version                     1.11.0
    Build Date                  2022-06-17T15:48:44Z
    Storage Type                raft
    Cluster Name                POC
    Cluster ID                  109658fe-36bd-7d28-bf92-f095c77e860c
    HA Enabled                  true
    HA Cluster                  https://node1.int.us-west-1-dev.central.example.com:8201
    HA Mode                     active
    Active Since                2022-06-29T12:50:46.992698336Z
    Raft Committed Index        36
    Raft Applied Index          36

    13. Unseal Vault2 (Use the same unseal key generated in step 10 for Vault1):

    [user@node2 vault.d]$ vault operator unseal HPY/g5OiT8ivD6L4Bqfjx9L1We2MVb4WZAqKZk6zFf8=
    Key                Value
    ---                -----
    Seal Type          shamir
    Initialized        true
    Sealed             true
    Total Shares       1
    Threshold          1
    Unseal Progress    0/1
    Unseal Nonce       n/a
    Version            1.11.0
    Build Date         2022-06-17T15:48:44Z
    Storage Type       raft
    HA Enabled         true
    
    [user@node2 vault.d]$ vault status
    Key                   Value
    ---                   -----
    Seal Type             shamir
    Initialized           true
    Sealed                true
    Total Shares          1
    Threshold             1
    Version               1.11.0
    Build Date            2022-06-17T15:48:44Z
    Storage Type          raft
    Cluster Name          POC
    Cluster ID            109658fe-36bd-7d28-bf92-f095c77e860c
    HA Enabled            true
    HA Cluster            https://node1.int.us-west-1-dev.central.example.com:8201
    HA Mode               standby
    Active Node Address   https://node1.int.us-west-1-dev.central.example.com:8200
    Raft Committed Index  37
    Raft Applied Index    37

    14. Unseal Vault3 (Use the same unseal key generated in step 10 for Vault1):

    [user@node3 ~]$ vault operator unseal HPY/g5OiT8ivD6L4Bqfjx9L1We2MVb4WZAqKZk6zFf8=
    Key                Value
    ---                -----
    Seal Type          shamir
    Initialized        true
    Sealed             true
    Total Shares       1
    Threshold          1
    Unseal Progress    0/1
    Unseal Nonce       n/a
    Version            1.11.0
    Build Date         2022-06-17T15:48:44Z
    Storage Type       raft
    HA Enabled         true
    
    [user@node3 ~]$ vault status
    Key                       Value
    ---                       -----
    Seal Type                 shamir
    Initialized               true
    Sealed                    false
    Total Shares              1
    Threshold                 1
    Version                   1.11.0
    Build Date                2022-06-17T15:48:44Z
    Storage Type              raft
    Cluster Name              POC
    Cluster ID                109658fe-36bd-7d28-bf92-f095c77e860c
    HA Enabled                true
    HA Cluster                https://node1.int.us-west-1-dev.central.example.com:8201
    HA Mode                   standby
    Active Node Address       https://node1.int.us-west-1-dev.central.example.com:8200
    Raft Committed Index      39
    Raft Applied Index        39

    15. Check the cluster’s raft status with the following command:

    [user@node3 ~]$ vault operator raft list-peers
    Node      Address                                            State       Voter
    ----      -------                                            -----       -----
    node1    node1.int.us-west-1-dev.central.example.com:8201    leader      true
    node2    node2.int.us-west-1-dev.central.example.com:8201    follower    true
    node3    node3.int.us-west-1-dev.central.example.com:8201    follower    true

    16. Currently, node1 is the active node. We can experiment to see what happens if node1 steps down from its active node duty.

    In the terminal where VAULT_ADDR is set to: https://node1.int.us-west-1-dev.central.example.com, execute the step-down command.

    $ vault operator step-down # equivalent of stopping the node or stopping the systemctl service
    Success! Stepped down: https://node2.int.us-west-1-dev.central.example.com:8200

    In the terminal, where VAULT_ADDR is set to https://node2.int.us-west-1-dev.central.example.com:8200, examine the raft peer set.

    [user@node1 ~]$ vault operator raft list-peers
    Node      Address                                            State       Voter
    ----      -------                                            -----       -----
    node1    node1.int.us-west-1-dev.central.example.com:8201    follower    true
    node2    node2.int.us-west-1-dev.central.example.com:8201    leader      true
    node3    node3.int.us-west-1-dev.central.example.com:8201    follower    true

    Conclusion 

    Vault servers are now operational in High Availability mode. We can test this by writing a secret from either the active or a standby Vault instance and seeing it succeed, which exercises request forwarding. We can also shut down the active Vault instance (sudo systemctl stop vault) to simulate a system failure and watch a standby instance assume leadership.
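As a quick smoke test (assuming a KV v2 secrets engine is enabled at secret/; the path and key names below are arbitrary examples), a write issued against any node should succeed via request forwarding:

```shell
# Enable a KV v2 secrets engine at secret/ (run once against the cluster)
vault secrets enable -path=secret kv-v2

# Write a secret from any node; standby nodes forward the request to the active node
vault kv put secret/smoke-test owner=platform

# Read it back from another node to confirm the data is replicated
vault kv get secret/smoke-test
```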

  • Modern Data Stack: The What, Why and How?

    This post will provide you with a comprehensive overview of the modern data stack (MDS), including its benefits, how its components differ from their predecessors, and what its future holds.

    “Modern” has the connotation of being up-to-date, of being better. This is true for MDS, but how exactly is MDS better than what was before?

    What was the data stack like?…

    A few decades back, the MapReduce technological breakthrough made it possible to efficiently process large amounts of data in parallel on multiple machines.

    It provided the backbone of a standard data pipeline.

    It was common to see HDFS used for storage, Spark for computing, and Hive to perform SQL queries on top.

    To run this, we had people handling the deployment and maintenance of Hadoop on their own.

    This self-managed, on-prem nature of the setup eventually became a pain point, making it complex and inefficient in the long run.

    Being on-prem while facing ever-heavier loads meant scalability became a huge concern.

    Hence, unlike today, the process was much more manual. Adding more RAM, increasing storage, and rolling out updates manually reduced productivity.

    Moreover,

    • The pipeline wasn’t modular; components were tightly coupled, causing failures when deciding to shift to something new.
    • Teams committed to specific vendors and found themselves locked in, by design, for years.
    • Setup was complex, and the infrastructure was not resilient. Random surges in data crashed the systems. (This randomness in demand has only increased since the early days of the internet, due to social media-triggered virality.)
    • Self-service was non-existent. If you wanted to do anything with your data, you needed data engineers.
    • Observability was a myth. Your pipeline would be failing without you being aware, and then you wouldn’t know why, where, or how. Your customers became your testers, knowing more about your system’s issues than you did.
    • Data protection laws weren’t as formalized, and policies within the organization were often lacking. These issues made the traditional setup inefficient at solving modern problems.

    For an upgraded, modern setup, we needed something that is scalable, has a smaller learning curve, and is feasible for both a seed-stage startup and a Fortune 500 company.

    Standing on the shoulders of tech innovations from the 2000s, data engineers started building a blueprint for MDS tooling with three core attributes: 

    Cloud Native (or the ocean)

Arguably the definitive change of the MDS era, the cloud removes the hassle of on-prem and enables automatic horizontal or vertical scaling in an era where virality and demand spikes are technical requirements.

    Modularity

    The M in MDS could stand for modular.

    You can integrate any MDS tool into your existing stack, like LEGO blocks.

    You can test out multiple tools, whether they’re open source or managed, choose the best fit, and iteratively build out your data infrastructure.

    This mindset helps instill a habit of avoiding vendor lock-in by continuously upgrading your architecture with relative ease.

    By moving away from the ancient, one-size-fits-all model, MDS recognizes the uniqueness of each company’s budget, domain, data types, and maturity—and provides the correct solution for a given use case.

    Ease of Use

    MDS tools are easier to set up. You can start playing with these tools within a day.

    Importantly, the ease of use is not limited to technical engineers.

Owing to the rise of self-serve and no-code tools like Tableau, data is finally democratized for all kinds of consumers. SQL remains crucial, but for basic metric calculations, PMs, Sales, Marketing, etc., can use a simple drag and drop in the UI (sometimes even simpler than Excel pivot tables).

    MDS also enables one to experiment with different architectural frameworks for their use case. For example, ELT vs. ETL (explained under Data Transformation).

But one might think such improvements mean MDS is merely v1.1 of the data stack: a tech upgrade that ultimately uses data to solve similar problems.

    Fortunately, that’s far from the case.

    MDS enables data to solve more human problems across the org—problems that employees have long been facing but could never systematically solve for, helping generate much more value from the data.

    Beyond these, employees want transparency and visibility into how any metric was calculated and which data source in Snowflake was used to build what specific tableau dashboard.

    Critically, with compliance finally being focused on, orgs need solutions for giving the right people the right access at the right time.

Lastly, as opposed to previous eras, these days even startups have varied data infrastructure components; if you’re a PM tasked with generating insights, how do you know where to start, or what data assets the organization has?

Besides tackling these problem statements, MDS builds a culture of upskilling employees in various data concepts.

    Data security, governance, and data lineage are important irrespective of department or persona in the organization.

    From designers to support executives, the need for a data-driven culture is a given.

    You’re probably bored of hearing how good the MDS is and want to deconstruct it into its components.

    Let’s dive in.

    SOURCES

In our modern era, every product is inevitably becoming a tech product.

    From a smart bulb to an orbiting satellite, each generates data in its own unique flavor of frequency of generation, data format, data size, etc.

    Social media, microservices, IoT devices, smart devices, DBs, CRMs, ERPs, flat files, and a lot more…

    INGESTION

Post creation of data, how does one “ingest” or take in that data for actual usage (the whole point of investing in data)?

    Roughly, there are three categories to help describe the ingestion solutions:

    Generic tools allow us to connect various data sources with data storages.

    E.g.: we can connect Google Ads or Salesforce to dump data into BigQuery or S3.

These generic tools highlight the modularity and the low-code/no-code aspects of MDS.

    Things are as easy as drag and drop, and one doesn’t need to be fluent in scripting.

Then we have programmable tools as well, where we get more control over how we ingest data through code.

    For example, we can write Apache Airflow DAGs in Python to load data from S3 and dump it to Redshift.
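The core of such a task is often just issuing a warehouse load statement. Below is a minimal sketch (not an actual Airflow DAG) of the Redshift COPY command a task might generate; the bucket, table, and IAM role names are made up for illustration.

```python
# Hypothetical helper an Airflow task might call: build the Redshift COPY
# statement that loads one S3 partition into a target table.
# All names below are illustrative, not real resources.
def build_copy_statement(table: str, bucket: str, prefix: str,
                         iam_role: str, fmt: str = "PARQUET") -> str:
    s3_path = f"s3://{bucket}/{prefix}"
    return (
        f"COPY {table} FROM '{s3_path}' "
        f"IAM_ROLE '{iam_role}' "
        f"FORMAT AS {fmt};"
    )

sql = build_copy_statement(
    table="analytics.events",
    bucket="example-raw-data",
    prefix="events/dt=2023-01-01/",
    iam_role="arn:aws:iam::123456789012:role/redshift-copy",
)
print(sql)
```

An Airflow operator would then execute this SQL against the Redshift connection on a schedule.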

    Intermediary – these tools cater to a specific use case or are coupled with the source itself.

E.g. – Snowpipe, part of Snowflake itself, allows us to load data from files as soon as they’re available at the source.

    DATA STORAGE‍

    Where do you ingest data into?

Here, we’ve expanded from HDFS & SQL DBs to a wider variety of formats (NoSQL, document DBs).

    Depending on the use case and the way you interact with data, you can choose from a DW, DB, DL, ObjectStores, etc.

You might need a standard relational DB for transactions in finance, or you might be collecting logs. You might be experimenting with your product at an early stage and be fine with NoSQL, without worrying about prescribing schemas.

One key feature to note: most are cloud-based, so there’s no more worrying about scalability, and we pay only for what we use.

PS: Do stick around till the end for the newer concepts of the lakehouse and reverse ETL (already prevalent in the industry).

    DATA TRANSFORMATION

    The stored raw data must be cleaned and restructured into the shape we deem best for actual usage. This slicing and dicing is different for every kind of data.

For example, we have tools for the ETL approach, which can be categorized into SaaS and frameworks (e.g., Fivetran and Spark, respectively).

Interestingly, the cloud era has given storage systems computational capability, so we sometimes don’t even need an external system for transformation.

With this rise of ELT, we leverage the processing capabilities of cloud data warehouses or lakehouses. Using tools like dbt, we write templated SQL queries to transform our data in the warehouse or lakehouse itself.

This enables analysts to perform the heavy lifting of traditional data engineering problems.

We also see stream processing, where applications process “micro” pieces of data in real time (analyzed as soon as they’re produced, as opposed to in large batches).
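To make the stateful flavor of stream processing concrete, here is a toy sketch: events are consumed one at a time, and a running average per key is maintained as state, the kind of bookkeeping a framework like Kafka Streams or Flink would manage for you. The sensor names and values are made up.

```python
# Toy stateful stream processor: keeps a running average per key as
# events arrive one at a time, emitting the updated average each time.
from collections import defaultdict

class RunningAverage:
    def __init__(self):
        self.count = defaultdict(int)    # events seen per key
        self.total = defaultdict(float)  # sum of values per key

    def process(self, key: str, value: float) -> float:
        # Update state for this key and emit the new running average.
        self.count[key] += 1
        self.total[key] += value
        return self.total[key] / self.count[key]

agg = RunningAverage()
stream = [("sensor-a", 10.0), ("sensor-a", 20.0), ("sensor-b", 5.0)]
averages = [agg.process(k, v) for k, v in stream]
print(averages)  # [10.0, 15.0, 5.0]
```

A stateless transformation, by contrast, would map each event independently with no `count`/`total` state at all.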

    DATA VISUALIZATION

    The ability to visually learn from data has only improved in the MDS era with advanced design, methodology, and integration.

    With Embedded analytics, one can integrate analytical capabilities and data visualizations into the software application itself.

External analytics tools, on the other hand, are standalone products built on top of your processed data. You choose your source, create a chart, and let it run.

    DATA SCIENCE, MACHINE LEARNING, MLOps

    Source: https://medium.com/vertexventures/thinking-data-the-modern-data-stack-d7d59e81e8c6

    In the last decade, we have moved beyond ad-hoc insight generation in Jupyter notebooks to

    production-ready, real-time ML workflows, like recommendation systems and price predictions. Any startup can and does integrate ML into its products.

    Most cloud service providers offer machine learning models and automated model building as a service.

MDS concepts like data observability are used to build tools for ML practitioners, whether it’s feature stores (a feature store is a central repository that provides entity values as of a certain time) or model monitoring (checking data drift, tracking model performance, and improving model accuracy).
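The “as of a certain time” part of the feature store definition is what prevents training data from leaking future information. A minimal sketch of that point-in-time lookup, with illustrative entities and values:

```python
# Point-in-time feature lookup: return the latest value of a feature for
# an entity that is not newer than the requested timestamp.
from datetime import datetime

# (entity_id, feature) -> list of (timestamp, value), sorted by timestamp
store = {
    ("user-1", "total_spend"): [
        (datetime(2023, 1, 1), 10.0),
        (datetime(2023, 2, 1), 25.0),
    ],
}

def get_feature(entity_id: str, feature: str, as_of: datetime):
    history = store.get((entity_id, feature), [])
    value = None
    for ts, v in history:
        if ts <= as_of:
            value = v  # keep the latest value not newer than as_of
        else:
            break
    return value

print(get_feature("user-1", "total_spend", datetime(2023, 1, 15)))  # 10.0
```

Real feature stores add online/offline serving and freshness guarantees on top of this basic lookup.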

This is extremely important, as statisticians can focus on the business problem, not the infrastructure.

This is an ever-expanding field where concepts such as MLOps (DevOps for ML pipelines: optimizing workflows, efficient transformations) and synthetic media (using AI to generate content itself) arrive and quickly become mainstream.

    ChatGPT is the current buzz, but by the time you’re reading this, I’m sure there’s going to be an updated one—such is the pace of development.

    DATA ORCHESTRATION

With a higher number of modularized tools and source systems comes added complexity.

    More steps, processes, connections, settings, and synchronization are required.

    Data orchestration in MDS needs to be Cron on steroids.

    Using a wide variety of products, MDS tools help bring the right data for the right purposes based on complex logic.

     

    DATA OBSERVABILITY

    Data observability is the ability to monitor and understand the state and behavior of data as it flows through an organization’s systems.

    In a traditional data stack, organizations often rely on reactive approaches to data management, only addressing issues as they arise. In contrast, data observability in an MDS involves adopting a proactive mindset, where organizations actively monitor and understand the state of their data pipelines to identify potential issues before they become critical.

    Monitoring – a dashboard that provides an operational view of your pipeline or system

    Alerting – both for expected events and anomalies 

    Tracking – ability to set and track specific events

    Analysis – automated issue detection that adapts to your pipeline and data health

    Logging – a record of an event in a standardized format for faster resolution

    SLA Tracking – Measure data quality against predefined standards (cost, performance, reliability)

    Data Lineage – graph representation of data assets showing upstream/downstream steps.
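Data lineage is naturally a graph problem: each asset records which assets it was built from, and walking that graph upstream answers “what does this dashboard depend on?”. A minimal sketch with made-up asset names:

```python
# Lineage as a graph: each asset maps to the assets it is built from.
lineage = {
    "dashboard.revenue": ["table.orders_clean"],
    "table.orders_clean": ["table.orders_raw", "table.fx_rates"],
    "table.orders_raw": [],
    "table.fx_rates": [],
}

def upstream(asset: str) -> set:
    # Depth-first walk collecting every transitive upstream dependency.
    seen = set()
    stack = [asset]
    while stack:
        for parent in lineage.get(stack.pop(), []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

print(sorted(upstream("dashboard.revenue")))
```

The reverse walk (downstream) is what observability tools use for impact analysis when a source table breaks.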

    DATA GOVERNANCE & SECURITY

    Data security is a critical consideration for organizations of all sizes and industries and needs to be prioritized to protect sensitive information, ensure compliance, and preserve business continuity. 

    The introduction of stricter data protection regulations, such as the General Data Protection Regulation (GDPR) and CCPA, introduced a huge need in the market for MDS tools, which efficiently and painlessly help organizations govern and secure their data.

    DATA CATALOG

Now that we have all the components of MDS, from ingestion to BI, we have so many sources, as well as things like dashboards, reports, views, other metadata, etc., that we need a Google-like search engine just to navigate our components.

    This is where a data catalog helps; it allows people to stitch the metadata (data about your data: the #rows in your table, the column names, types, etc.) across sources.

    This is necessary to help efficiently discover, understand, trust, and collaborate on data assets.

We don’t want PMs & GTM teams looking at different dashboards for adoption data.
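At its core, a catalog is metadata harvested from many tools made searchable in one place. A toy sketch, with illustrative entries, of how a keyword search might surface the canonical adoption assets:

```python
# A data catalog in miniature: metadata entries from several tools,
# searchable by keyword across asset names and column names.
catalog = [
    {"name": "adoption_daily", "type": "table", "source": "Snowflake",
     "columns": ["date", "active_users"]},
    {"name": "Adoption Overview", "type": "dashboard", "source": "Tableau",
     "columns": []},
    {"name": "orders_raw", "type": "table", "source": "Snowflake",
     "columns": ["order_id", "amount"]},
]

def search(keyword: str):
    kw = keyword.lower()
    return [e["name"] for e in catalog
            if kw in e["name"].lower()
            or any(kw in c.lower() for c in e["columns"])]

print(search("adoption"))  # ['adoption_daily', 'Adoption Overview']
```

Real catalogs enrich this with ownership, lineage, and trust signals so users know which asset is canonical.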

Take Netflix as an example. Previously, the sole purpose of its original data pipeline was to aggregate and upload events to Hadoop/Hive for batch processing. Chukwa collected events and wrote them to S3 in Hadoop sequence file format. In those days, end-to-end latency was up to 10 minutes, which was sufficient for batch jobs that usually scan data at daily or hourly frequency.

With the emergence of Kafka and Elasticsearch over the last decade, there has been a growing demand for real-time analytics at Netflix. By real-time, we mean sub-minute latency. Instead of starting from scratch, Netflix was able to iteratively grow its MDS as market requirements changed.

    Source: https://blog.transform.co/data-talks/the-metric-layer-why-you-need-it-examples-and-how-it-fits-into-your-modern-data-stack/

     

This is a snapshot of the MDS a data-mature company like Netflix had some years back, where instead of a few all-in-one tools, each data category was solved by a specialized tool.

    FUTURE COMPONENTS OF MDS?

    DATA MESH

    Source: https://martinfowler.com/articles/data-monolith-to-mesh.html

The top picture shows how teams currently operate: no matter the feature or product on the Y-axis, the data pipeline’s journey remains the same along the X-axis. But in an ideal world of data mesh, those who know the data should own its journey.

    As decentralization is the name of the game, data mesh is MDS’s response to this demand for an architecture shift where domain owners use self-service infrastructure to shape how their data is consumed.

    DATA LAKEHOUSE

    Source: https://www.altexsoft.com/blog/data-lakehouse/

    We have talked about data warehouses and data lakes being used for data storage.

    Initially, when we only needed structured data, data warehouses were used. Later, with big data, we started getting all kinds of data, structured and unstructured.

    So, we started using Data Lakes, where we just dumped everything.

The lakehouse tries to combine the best of both worlds by adding an intelligent metadata layer on top of the data lake. This layer classifies and categorizes data so that it can be interpreted in a structured manner.

Also, all the data in the lakehouse is open, meaning it can be utilized by all kinds of tools. Lakehouses are generally built on top of open data formats like Parquet so the data can be easily accessed by any tool.

End users can simply run their SQL queries as if they’re querying a DWH.
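The metadata layer described above is essentially a mapping from a table name and schema onto raw files in the lake. A toy sketch, with made-up table and file names:

```python
# A lakehouse metadata layer in miniature: raw Parquet files in the lake
# are just paths; a "table" is a metadata entry giving them a name and
# schema so SQL engines can treat them like warehouse tables.
metadata_layer = {
    "events": {
        "schema": {"user_id": "BIGINT", "event_type": "VARCHAR"},
        "files": ["s3://lake/events/part-000.parquet",
                  "s3://lake/events/part-001.parquet"],
    },
}

def describe(table: str) -> str:
    # Render a warehouse-style description from lake metadata.
    entry = metadata_layer[table]
    cols = ", ".join(f"{c} {t}" for c, t in entry["schema"].items())
    return f"{table}({cols}) -> {len(entry['files'])} parquet file(s)"

print(describe("events"))
```

Table formats like Delta Lake, Iceberg, and Hudi implement this idea for real, adding transactions and schema evolution on top.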

    REVERSE ETL

    Suppose you’re a salesperson using Salesforce and want to know if a lead you just got is warm or cold (warm indicating a higher chance of conversion).

The attributes of your lead, like salary and age, are fetched from your OLTP system into a DWH, analyzed, and then the “warm” flag is sent back to the Salesforce UI, ready to be used in live operations.
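A sketch of that round trip: score leads in the warehouse, then shape a payload for the CRM API. The scoring rule and all field names here are invented purely for illustration.

```python
# Hypothetical reverse-ETL step: derive a "warm"/"cold" flag in the
# warehouse and build the update payload a sync tool would push to the CRM.
def score_lead(lead: dict) -> str:
    # Made-up rule: high salary and working-age leads are "warm".
    return "warm" if lead["salary"] >= 50_000 and 25 <= lead["age"] <= 55 else "cold"

def to_crm_update(lead: dict) -> dict:
    # Payload shape the reverse-ETL tool would send back to the CRM record.
    return {"lead_id": lead["id"], "temperature": score_lead(lead)}

leads = [
    {"id": "L-1", "salary": 80_000, "age": 34},
    {"id": "L-2", "salary": 30_000, "age": 22},
]
updates = [to_crm_update(l) for l in leads]
print(updates)
```

A reverse-ETL product like Hightouch or Census automates exactly this warehouse-to-SaaS sync, minus the hand-rolled scoring.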

     METRICS LAYER

    The Metric layer will be all about consistency, accessibility, and trust in the calculations of metrics.

Earlier, metrics lived in v1 and v1.1 Excel files, with logic scattered around.

Currently, in the modern data stack world, each team’s calculation is isolated in the tool they use. For example, BI would store metrics in Tableau dashboards while data engineers would use code.

A metric layer exists to ensure global access to the metrics from every other tool in the data stack.

For example, the dbt metrics layer helps define these in the warehouse, making them accessible to both BI and engineers. Similarly, Looker, Mode, and others have their own unique approaches.
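The core idea can be sketched as: define a metric once, then render it consistently for any consumer. This is not the actual dbt syntax, just an illustrative model with made-up table and column names.

```python
# A metrics layer in miniature: one canonical definition per metric,
# rendered into SQL the same way for every consuming tool.
METRICS = {
    "monthly_active_users": {
        "expression": "COUNT(DISTINCT user_id)",
        "table": "analytics.events",
        "time_grain": "month",
    },
}

def render_sql(metric: str) -> str:
    m = METRICS[metric]
    return (f"SELECT DATE_TRUNC('{m['time_grain']}', event_ts) AS period, "
            f"{m['expression']} AS {metric} "
            f"FROM {m['table']} GROUP BY 1")

sql = render_sql("monthly_active_users")
print(sql)
```

Because every dashboard and notebook calls the same definition, the "whose MAU number is right?" debate disappears.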

    In summary, this blog post discussed the modern data stack and its advantages over older approaches. We examined the components of the modern data stack, including data sources, ingestion, transformation, and more, and how they work together to create an efficient and effective system for data management and analysis. We also highlighted the benefits of the modern data stack, including increased efficiency, scalability, and flexibility. 

    As technology continues to advance, the modern data stack will evolve and incorporate new components and capabilities.

  • Best Practices for Kafka Security

    Overview‍

We will cover the security concepts of Kafka and walk through the implementation of encryption, authentication, and authorization for a Kafka cluster.

This article will explain how to configure SASL_SSL security for your Kafka cluster and how to protect data in transit. SASL (Simple Authentication and Security Layer) over SSL is a communication type in which clients use authentication mechanisms like PLAIN, SCRAM, etc., and the server uses SSL certificates to establish secure communication. We will use the SCRAM authentication mechanism here to establish mutual authentication between the client and server. We’ll also discuss authorization and ACLs, which are important for securing your cluster.

    Prerequisites

A running Kafka cluster and a basic understanding of security components.

    Need for Kafka Security

The primary reason is to prevent unlawful activities such as misuse, modification, disruption, and disclosure of data. To understand security in a Kafka cluster, we need to know three terms:

• Authentication – A security method by which servers verify the identity of clients before granting access to their information.
• Authorization – Implemented alongside authentication, authorization lets servers determine what an identified client is allowed to access. Basically, it grants limited access, just enough for the client’s needs.
• Encryption – The process of transforming data so that it is unreadable without a decryption key. Encryption ensures that no other client can intercept and steal or read data.

    Here is the quick start guide by Apache Kafka, so check it out if you still need to set up Kafka.

    https://kafka.apache.org/quickstart

    We’ll not cover the theoretical aspects here, but you can find a ton of sources on how these three components work internally. For now, we’ll focus on the implementation part and how Kafka revolves around security.

    This image illustrates SSL communication between the Kafka client and server.

    We are going to implement the steps in the below order:

    • Create a Certificate Authority
    • Create a Truststore & Keystore

    Certificate Authority – It is a trusted entity that issues SSL certificates. As such, a CA is an independent entity that acts as a trusted third party, issuing certificates for use by others. A certificate authority validates the credentials of a person or organization that requests a certificate before issuing one.

    Truststore – A truststore contains certificates from other parties with which you want to communicate or certificate authorities that you trust to identify other parties. In simple words, a list of CAs that can validate the certificate signed by the trusted CA.

    KeyStore – A KeyStore contains private keys and certificates with their corresponding public keys. Keystores can have one or more CA certificates depending upon what’s needed.

For the Kafka server, we need a server certificate, and here the KeyStore comes into the picture, since it stores the server certificate. The server certificate should be signed by a Certificate Authority (CA). The KeyStore requests that the server certificate be signed, and in response, the CA sends a signed CRT back to the KeyStore.

    We will create our own certificate authority for demonstration purposes. If you don’t want to create a private certificate authority, there are many certificate providers you can go with, like IdenTrust and GoDaddy. Since we are creating one, we need to tell our Kafka client to trust our private certificate authority using the Trust Store.

    This block diagram shows you how all the components communicate with each other and their role to generate the final certificate.

    So, let’s create our Certificate Authority. Run the below command in your terminal:

"openssl req -new -x509 -keyout <private_key_name> -out <public_certificate_name> -days <validity_days>"

    It will ask for a passphrase, and keep it safe for future use cases. After successfully executing the command, we should have two files named private_key_name and public_certificate_name.

    Now, let’s create a KeyStore and trust store for brokers; we need both because brokers also interact internally with each other. Let’s understand with the help of an example: Broker A wants to connect with Broker B, so Broker A acts as a client and Broker B as a server. We are using the SASL_SSL protocol, so A needs SASL credentials, and B needs a certificate for authentication. The reverse is also possible where Broker B wants to connect with Broker A, so we need both a KeyStore and a trust store for authentication.

    Now let’s create a trust store. Execute the below command in the terminal, and it should ask for the password. Save the password for future use:

    “keytool -keystore <truststore_name.jks> -alias <alias name of the entry to process> -import -file <public_certificate_name>”

Here, we are using the .jks extension for the file, which stands for Java KeyStore. You can also use Public-Key Cryptography Standards #12 (PKCS12) instead of .jks, but that’s totally up to you. public_certificate_name is the same certificate we generated while creating the CA.

    For the KeyStore configuration, run the below command and store the password:

"keytool -keystore <keystore_name.jks> -alias <alias_name> -validity <number_of_days> -genkey -keyalg <key_algorithm_name> -storepass <store_password> -ext SAN=DNS:localhost"

    This action creates the KeyStore file in the current working directory. The question “First and Last Name” requires you to enter a fully qualified domain name because some certificate authorities, such as VeriSign, expect this property to be a fully qualified domain name. Not all CAs require a fully qualified domain name, but I recommend using a fully qualified domain name for portability. All other information should be valid. If the information cannot be verified, a certificate authority such as VeriSign will not sign the CSR generated for that record. I’m using localhost for the domain name here, as seen in the above command itself.

The KeyStore now has an entry under alias_name containing the private key and the information needed to generate a CSR. Now let’s create a certificate signing request (CSR), which will be used to get a signed certificate from the Certificate Authority.

    Execute the below command in your terminal:

    “keytool -keystore <keystore_name.jks> -alias <alias_name> -certreq -file <file_name.csr>”

    So, we have generated a signing certificate request using a KeyStore (the KeyStore name and alias name should be the same). It should ask for the KeyStore password, so enter the same one used while creating the KeyStore.

    Now, execute the below command. It will ask for the password, so enter the CA password, and now we have a signed certificate:

    “openssl x509 -req -CA <public_certificate_name> -CAkey <private_key_name> -in <csr file> -out <signed_file_name> -CAcreateserial”

    Finally, we need to add the public certificate of CA and signed certificate in the KeyStore, so run the below command. It will add the CA certificate to the KeyStore.

    “keytool -keystore <keystore_name.jks> -alias <public_certificate_name> -import -file <public_certificate_name>”

    Now, let’s run the below command; it will add the signed certificate to the KeyStore.

    “keytool -keystore <keystore_name.jks> -alias <alias_name> -import -file <signed_file_name>”

    As of now, we have generated all the security files for the broker. For internal broker communication, we are using SASL_SSL (see security.inter.broker.protocol in server.properties). Now we need to create a broker username and password using the SCRAM method. For more details, click here.

    Run the below command:

"kafka-configs.sh --zookeeper <host:port> --entity-type users --entity-name <username> --alter --add-config 'SCRAM-SHA-512=[password=<password>]'"

    NOTE: Credentials for inter-broker communication must be created before Kafka brokers are started.

    Now, we need to configure the Kafka broker property file, so update the file as given below:

    listeners=SASL_SSL://localhost:9092
    advertised.listeners=SASL_SSL://localhost:9092
    ssl.truststore.location={path/to/truststore_name.jks}
    ssl.truststore.password={truststore_password}
    ssl.keystore.location={/path/to/keystore_name.jks}
    ssl.keystore.password={keystore_password}
    security.inter.broker.protocol=SASL_SSL
    ssl.client.auth=none
    ssl.protocol=TLSv1.2
    sasl.enabled.mechanisms=SCRAM-SHA-512
    sasl.mechanism.inter.broker.protocol=SCRAM-SHA-512
listener.name.sasl_ssl.scram-sha-512.sasl.jaas.config=org.apache.kafka.common.security.scram.ScramLoginModule required username="{username}" password="{password}";
    super.users=User:{username}

NOTE: If you are using an external JAAS config file, then remove the ScramLoginModule line and set this environment variable before starting the broker: "export KAFKA_OPTS=-Djava.security.auth.login.config={path/to/broker.conf}"
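The JAAS value is just a formatted string, so it is easy to generate consistently for brokers, producers, and consumers. A small helper (a sketch, not part of any Kafka API) with placeholder credentials; note that JAAS string values must be wrapped in double quotes and the entry ends with a semicolon:

```python
# Build the inline SCRAM JAAS config string used in Kafka properties files.
# The username/password here are placeholders, not real credentials.
def scram_jaas_config(username: str, password: str) -> str:
    return (
        "org.apache.kafka.common.security.scram.ScramLoginModule required "
        f'username="{username}" password="{password}";'
    )

print(scram_jaas_config("broker-admin", "broker-secret"))
```

The same string works for the broker's listener config and for client `sasl.jaas.config` properties, with the respective user's credentials.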

    Now, if we run Kafka, the broker should be running on port 9092 without any failure, and if you have multiple brokers inside Kafka, the same config file can be replicated among them, but the port should be different for each broker.

    Producers and consumers need a username and a password to access the broker, so let’s create their credentials and update respective configurations.

    Create a producer user and update producer.properties inside the bin directory, so execute the below command in your terminal.

"bin/kafka-configs.sh --zookeeper <host:port> --entity-type users --entity-name <producer_name> --alter --add-config 'SCRAM-SHA-512=[password=<password>]'"

    We need a trust store file for our clients (producer and consumer), but as we already know how to create a trust store, this is a small task for you. It is suggested that producers and consumers should have separate trust stores because when we move Kafka to production, there could be multiple producers and consumers on different machines.

    security.protocol=SASL_SSL
    ssl.protocol=TLSv1.2
    ssl.truststore.location={path/to/client.truststore.jks}
    ssl.truststore.password={password}
    sasl.mechanism=SCRAM-SHA-512
sasl.jaas.config=org.apache.kafka.common.security.scram.ScramLoginModule required username="{producer_name}" password="{password}";

    The below command creates a consumer user, so now let’s update consumer.properties inside the bin directory:

"bin/kafka-configs.sh --zookeeper <host:port> --entity-type users --entity-name <consumer_name> --alter --add-config 'SCRAM-SHA-512=[password=<password>]'"

    security.protocol=SASL_SSL
    ssl.protocol=TLSv1.2
    ssl.truststore.location={path/to/client.truststore.jks}
    ssl.truststore.password={password}
    sasl.mechanism=SCRAM-SHA-512
sasl.jaas.config=org.apache.kafka.common.security.scram.ScramLoginModule required username="{consumer_name}" password="{password}";

    As of now, we have implemented encryption and authentication for Kafka brokers. To verify that our producer and consumer are working properly with SCRAM credentials, run the console producer and consumer on some topics.

    Authorization is not implemented yet. Kafka uses access control lists (ACLs) to specify which users can perform which actions on specific resources or groups of resources. Each ACL has a principal, a permission type, an operation, a resource type, and a name.

The default authorizer provided by Kafka is AclAuthorizer; Confluent also provides the Confluent Server Authorizer, which is quite different from AclAuthorizer. An authorizer is a server plugin used by Kafka to authorize actions; specifically, it controls whether operations should be authorized based on the principal and the resource being accessed.

    Format of ACLs – Principal P is [Allowed/Denied] Operation O from Host H on any Resource R matching ResourcePattern RP
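To make the rule format concrete, here is a toy model of the matching logic, with made-up principals and topics. Kafka's real AclAuthorizer is far richer (deny rules, hosts, wildcards, resource patterns); this only illustrates the allow-list idea.

```python
# Toy ACL check: principal P is allowed operation O on resource R
# only if a matching allow entry exists. Entries are illustrative.
acls = [
    {"principal": "User:producer1", "operation": "WRITE", "resource": "topic:orders"},
    {"principal": "User:consumer1", "operation": "READ",  "resource": "topic:orders"},
]

def is_authorized(principal: str, operation: str, resource: str) -> bool:
    return any(
        a["principal"] == principal
        and a["operation"] == operation
        and a["resource"] == resource
        for a in acls
    )

print(is_authorized("User:producer1", "WRITE", "topic:orders"))  # True
print(is_authorized("User:producer1", "READ", "topic:orders"))   # False
```

This "deny by default, allow on match" behavior is why a producer without a WRITE ACL is rejected even after authenticating successfully.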

    Execute the below command to create an ACL with writing permission for the producer:

"bin/kafka-acls.sh --authorizer-properties zookeeper.connect=<host:port> --add --allow-principal User:<producer_name> --operation WRITE --topic <topic_name>"

The above command creates an ACL granting the WRITE operation to producer_name on topic_name.

    Now, execute the below command to create an ACL with reading permission for the consumer:

"bin/kafka-acls.sh --authorizer-properties zookeeper.connect=<host:port> --add --allow-principal User:<consumer_name> --operation READ --topic <topic_name>"

Now we need to grant the consumer access to its consumer group, so the below command allows the consumer to read as part of a given consumer group ID.

"bin/kafka-acls.sh --authorizer-properties zookeeper.connect=<host:port> --add --allow-principal User:<consumer_name> --operation READ --group <consumer_group_name>"

Now, we need to add some configuration in two files: server.properties and consumer.properties.

    # Authorizer class
    authorizer.class.name=kafka.security.authorizer.AclAuthorizer

    The above line indicates that AclAuthorizer class is used for authorization.

    # consumer group id
    group.id=<consumer_group_name>

A consumer group ID is mandatory; if we do not specify a group, the consumer will not be able to access data from topics, so a group ID must be provided to start a consumer.

Let’s test the producer and consumer one by one: run the console producer, and run the console consumer in another terminal; both should run without errors.

    console-producer
    console-consumer

    Voila!! Your Kafka is secured.

    Summary

In a nutshell, we have implemented security in our Kafka cluster using the SASL_SSL mechanism and learned how to create ACLs and grant different permissions to different users.

    Apache Kafka is the wild west without security. By default, there is no encryption, authentication, or access control list. Any client can communicate with the Kafka broker using the PLAINTEXT port. Access using this port should be restricted to trusted clients only. You can use network segmentation and/or authentication ACLs to restrict access to trusted IP addresses in these cases. If none of these are used, the cluster is wide open and available to anyone. A basic knowledge of Kafka authentication, authorization, encryption, and audit trails is required to safely move a system into production.

  • Discover the Benefits of Android Clean Architecture

    All architectures have one common goal: to manage the complexity of our application. We may not need to worry about it on a smaller project, but it becomes a lifesaver on larger ones. The purpose of Clean Architecture is to minimize code complexity by preventing implementation complexity.

    We must first understand a few things to implement the Clean Architecture in an Android project.

    • Entities: Encapsulate enterprise-wide critical business rules. An entity can be an object with methods or data structures and functions.
• Use cases: Orchestrate the flow of data to and from the entities.
    • Controllers, gateways, presenters: A set of adapters that convert data from the use cases and entities format to the most convenient way to pass the data to the upper level (typically the UI).
    • UI, external interfaces, DB, web, devices: The outermost layer of the architecture, generally composed of frameworks such as database and web frameworks.

Here is one rule of thumb we need to follow. First, look at the direction of the arrows in the diagram. Entities do not depend on use cases, use cases do not depend on controllers, and so on. An inner, lower-level module should never rely on a higher-level, outer module. The dependencies between the layers must point inwards.

    Advantages of Clean Architecture:

    • Strict architecture—hard to make mistakes
• Business logic is encapsulated, easy to use, and easy to test
    • Enforcement of dependencies through encapsulation
    • Allows for parallel development
    • Highly scalable
    • Easy to understand and maintain
    • Testing is facilitated

    Let’s understand this using the small case study of the Android project, which gives more practical knowledge rather than theoretical.

    A pragmatic approach

A typical Android project needs to separate the concerns between the UI, the business logic, and the data model, so taking “the theory” into account, we decided to split the project into three modules:

    • Domain Layer: contains the definitions of the business logic of the app, the data models, the abstract definition of repositories, and the definition of the use cases.
    Domain Module
    • Data Layer: This layer provides the abstract definition of all the data sources. Any application can reuse this without modifications. It contains repositories and data sources implementations, the database definition and its DAOs, the network APIs definitions, some mappers to convert network API models to database models, and vice versa.
    Data Module
    • Presentation layer: This is the layer that mainly interacts with the UI. It’s Android-specific and contains fragments, view models, adapters, activities, composable, and so on. It also includes a service locator to manage dependencies.
    Presentation Module

    Marvel’s comic characters App

    To elaborate on all the above concepts related to Clean Architecture, we are creating an app that lists Marvel’s comic characters using Marvel’s developer API. The app shows a list of Marvel characters, and clicking on each character will show details of that character. Users can also bookmark their favorite characters. Nothing too complicated, right?

    Before proceeding further into the sample, it’s good to have an idea of the following frameworks because the example is wholly based on them.

    • Jetpack Compose – Android’s recommended modern toolkit for building native UI.
    • Retrofit 2 – A type-safe HTTP client for Android for Network calls.
    • ViewModel – A class responsible for preparing and managing the data for an activity or a fragment.
    • Kotlin – Kotlin is a cross-platform, statically typed, general-purpose programming language with type inference.

    To get the characters list, we used Marvel’s developer API, which returns the list of Marvel characters.

    http://gateway.marvel.com/v1/public/characters

    The domain layer

    In the domain layer, we define the data model, the use cases, and the abstract definition of the character repository. The API returns a list of characters, with some info like name, description, and image links.

    data class CharacterEntity(
        val id: Long,
        val name: String,
        val description: String,
        val imageUrl: String,
        val bookmarkStatus: Boolean
    )

    interface MarvelDataRepository {
        suspend fun getCharacters(dataSource: DataSource): Flow<List<CharacterEntity>>
        suspend fun getCharacter(characterId: Long): Flow<CharacterEntity>
        suspend fun toggleCharacterBookmarkStatus(characterId: Long): Boolean
        suspend fun getComics(dataSource: DataSource, characterId: Long): Flow<List<ComicsEntity>>
    }

    class GetCharactersUseCase(
        private val marvelDataRepository: MarvelDataRepository,
        private val ioDispatcher: CoroutineDispatcher = Dispatchers.IO
    ) {
        operator fun invoke(forceRefresh: Boolean = false): Flow<List<CharacterEntity>> {
            return flow {
                emitAll(
                    marvelDataRepository.getCharacters(
                        if (forceRefresh) {
                            DataSource.Network
                        } else {
                            DataSource.Cache
                        }
                    )
                )
            }
                .flowOn(ioDispatcher)
        }
    }
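    The `DataSource` type used above is not shown in the article. A minimal sketch consistent with how the repository pattern-matches on it might look like this (`sourceFor` is a hypothetical helper mirroring the use case’s branch):

```kotlin
// Hypothetical sketch: the article never shows DataSource, but the
// repository and use case branch on Cache and Network variants.
sealed class DataSource {
    object Cache : DataSource()
    object Network : DataSource()
}

// Mirrors the forceRefresh branch inside GetCharactersUseCase.
fun sourceFor(forceRefresh: Boolean): DataSource =
    if (forceRefresh) DataSource.Network else DataSource.Cache
```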

    The data layer

    As we said before, the data layer must implement the abstract definition of the domain layer, so we need to put the repository’s concrete implementation in this layer. To do so, we can define two data sources, a “local” data source to provide persistence and a “remote” data source to fetch the data from the API.

    class MarvelDataRepositoryImpl(
        private val marvelRemoteService: MarvelRemoteService,
        private val charactersDao: CharactersDao,
        private val comicsDao: ComicsDao,
        private val ioDispatcher: CoroutineDispatcher = Dispatchers.IO
    ) : MarvelDataRepository {
    
        override suspend fun getCharacters(dataSource: DataSource): Flow<List<CharacterEntity>> =
            flow {
                emitAll(
                    when (dataSource) {
                        is DataSource.Cache -> getCharactersCache().map { list ->
                            if (list.isEmpty()) {
                                getCharactersNetwork()
                            } else {
                                list.toDomain()
                            }
                        }
                            .flowOn(ioDispatcher)
    
                        is DataSource.Network -> flowOf(getCharactersNetwork())
                            .flowOn(ioDispatcher)
                    }
                )
            }
    
        private suspend fun getCharactersNetwork(): List<CharacterEntity> =
            marvelRemoteService.getCharacters().body()?.data?.results?.let { remoteData ->
                if (remoteData.isNotEmpty()) {
                    charactersDao.upsert(remoteData.toCache())
                }
                remoteData.toDomain()
            } ?: emptyList()
    
        private fun getCharactersCache(): Flow<List<CharacterCache>> =
            charactersDao.getCharacters()
    
        override suspend fun getCharacter(characterId: Long): Flow<CharacterEntity> =
            charactersDao.getCharacterFlow(id = characterId).map {
                it.toDomain()
            }
    
        override suspend fun toggleCharacterBookmarkStatus(characterId: Long): Boolean {
    
            val status = charactersDao.getCharacter(characterId)?.bookmarkStatus?.not() ?: false
    
            return charactersDao.toggleCharacterBookmarkStatus(id = characterId, status = status) > 0
        }
    
        override suspend fun getComics(
            dataSource: DataSource,
            characterId: Long
        ): Flow<List<ComicsEntity>> = flow {
            emitAll(
                when (dataSource) {
                    is DataSource.Cache -> getComicsCache(characterId = characterId).map { list ->
                        if (list.isEmpty()) {
                            getComicsNetwork(characterId = characterId)
                        } else {
                            list.toDomain()
                        }
                    }
                    is DataSource.Network -> flowOf(getComicsNetwork(characterId = characterId))
                        .flowOn(ioDispatcher)
                }
            )
        }
    
        private suspend fun getComicsNetwork(characterId: Long): List<ComicsEntity> =
            marvelRemoteService.getComics(characterId = characterId)
                .body()?.data?.results?.let { remoteData ->
                    if (remoteData.isNotEmpty()) {
                        comicsDao.upsert(remoteData.toCache(characterId = characterId))
                    }
                    remoteData.toDomain()
                } ?: emptyList()
    
        private fun getComicsCache(characterId: Long): Flow<List<ComicsCache>> =
            comicsDao.getComics(characterId = characterId)
    }

    Since we defined a data source to manage persistence, in this layer we also need to define the database, for which we are using the Room database. In addition, it’s good practice to create some mappers to map the API response to the corresponding database entity.

    fun List<Characters>.toCache() = map { character -> character.toCache() }
    
    fun Characters.toCache() = CharacterCache(
        id = id ?: 0,
        name = name ?: "",
        description = description ?: "",
        imageUrl = thumbnail?.let {
            "${it.path}.${it.extension}"
        } ?: ""
    )
    
    fun List<Characters>.toDomain() = map { character -> character.toDomain() }
    
    fun Characters.toDomain() = CharacterEntity(
        id = id ?: 0,
        name = name ?: "",
        description = description ?: "",
        imageUrl = thumbnail?.let {
            "${it.path}.${it.extension}"
        } ?: "",
        bookmarkStatus = false
    )
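    The Elvis-operator defaults in these mappers can be exercised in isolation. Below is a minimal, self-contained sketch with simplified stand-ins for the real Retrofit response models (`NetworkCharacter`, `Thumbnail`, and `DomainCharacter` are hypothetical names, not the article’s actual classes):

```kotlin
// Simplified stand-ins for the nullable Retrofit response models.
data class Thumbnail(val path: String?, val extension: String?)
data class NetworkCharacter(val id: Long?, val name: String?, val thumbnail: Thumbnail?)

// Simplified non-nullable domain model.
data class DomainCharacter(val id: Long, val name: String, val imageUrl: String)

// Same defaulting pattern as the article's mapper: nulls collapse to
// zero/empty values so the domain model stays non-nullable.
fun NetworkCharacter.toDomain() = DomainCharacter(
    id = id ?: 0,
    name = name ?: "",
    imageUrl = thumbnail?.let { "${it.path}.${it.extension}" } ?: ""
)
```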

    @Entity
    data class CharacterCache(
        @PrimaryKey
        val id: Long,
        val name: String,
        val description: String,
        val imageUrl: String,
        val bookmarkStatus: Boolean = false
    ) : BaseCache
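    The DAO itself is not shown in the article. Based on the calls made from the repository (method names taken from the code above; annotations and queries are assumptions), a Room DAO along these lines would fit:

```kotlin
// Hypothetical Room DAO implied by the repository code; the table name
// follows Room's default of using the entity class name.
@Dao
interface CharactersDao {
    @Query("SELECT * FROM CharacterCache")
    fun getCharacters(): Flow<List<CharacterCache>>

    @Query("SELECT * FROM CharacterCache WHERE id = :id")
    fun getCharacterFlow(id: Long): Flow<CharacterCache>

    @Query("SELECT * FROM CharacterCache WHERE id = :id")
    suspend fun getCharacter(id: Long): CharacterCache?

    // Returns the number of rows updated; the repository checks this with `> 0`.
    @Query("UPDATE CharacterCache SET bookmarkStatus = :status WHERE id = :id")
    suspend fun toggleCharacterBookmarkStatus(id: Long, status: Boolean): Int

    @Insert(onConflict = OnConflictStrategy.REPLACE)
    suspend fun upsert(characters: List<CharacterCache>)
}
```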

    The presentation layer

    In this layer, we need a UI component such as a fragment, activity, or composable to display the list of characters; here, we use the widely adopted MVVM approach. The view model takes the use cases in its constructor and invokes the corresponding use case according to user actions (get a character, characters & comics, etc.).

    Each use case will invoke the appropriate method in the repository.

    class CharactersListViewModel(
        private val getCharacters: GetCharactersUseCase,
        private val toggleCharacterBookmarkStatus: ToggleCharacterBookmarkStatus
    ) : ViewModel() {
    
        private val _characters = MutableStateFlow<UiState<List<CharacterViewState>>>(UiState.Loading())
        val characters: StateFlow<UiState<List<CharacterViewState>>> = _characters
    
        init {
            _characters.value = UiState.Loading()
            getAllCharacters()
        }
    
        private fun getAllCharacters(forceRefresh: Boolean = false) {
            getCharacters(forceRefresh)
                .catch { error ->
                    error.printStackTrace()
                    when (error) {
                        is UnknownHostException, is ConnectException, is SocketTimeoutException -> _characters.value =
                            UiState.NoInternetError(error)
                        else -> _characters.value = UiState.ApiError(error)
                    }
                }.map { list ->
                    _characters.value = UiState.Loaded(list.toViewState())
                }.launchIn(viewModelScope)
        }
    
        fun refresh(showLoader: Boolean = false) {
            if (showLoader) {
                _characters.value = UiState.Loading()
            }
            getAllCharacters(forceRefresh = true)
        }
    
        fun bookmarkCharacter(characterId: Long) {
            viewModelScope.launch {
                toggleCharacterBookmarkStatus(characterId = characterId)
            }
        }
    }
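    The `UiState` wrapper consumed by the ViewModel and the composables is also not shown in the article. A minimal sketch consistent with its usage above (four states, with `data` and `error` readable on the base class) could be:

```kotlin
// Hypothetical sketch of the UiState wrapper used by the ViewModel and
// composables: four states, with data/error exposed on the base class.
sealed class UiState<T>(val data: T? = null, val error: Throwable? = null) {
    class Loading<T> : UiState<T>()
    class Loaded<T>(data: T) : UiState<T>(data = data)
    class ApiError<T>(error: Throwable) : UiState<T>(error = error)
    class NoInternetError<T>(error: Throwable) : UiState<T>(error = error)
}
```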

    /*
     * Scaffold (layout) for the characters list page.
     */
    
    
    @SuppressLint("UnusedMaterialScaffoldPaddingParameter")
    @Composable
    fun CharactersListScaffold(
        showComics: (Long) -> Unit,
        closeAction: () -> Unit,
        modifier: Modifier = Modifier,
        charactersListViewModel: CharactersListViewModel = getViewModel()
    ) {
        Scaffold(
            modifier = modifier,
            topBar = {
                TopAppBar(
                    title = {
                        Text(text = stringResource(id = R.string.characters))
                    },
                    navigationIcon = {
                        IconButton(onClick = closeAction) {
                            Icon(
                                imageVector = Icons.Filled.Close,
                                contentDescription = stringResource(id = R.string.close_icon)
                            )
                        }
                    }
                )
            }
        ) {
            val state = charactersListViewModel.characters.collectAsState()
    
            when (state.value) {
    
                is UiState.Loading -> {
                    Loader()
                }
    
                is UiState.Loaded -> {
                    state.value.data?.let { characters ->
                        val isRefreshing = remember { mutableStateOf(false) }
                        SwipeRefresh(
                            state = rememberSwipeRefreshState(isRefreshing = isRefreshing.value),
                            onRefresh = {
                                isRefreshing.value = true
                                charactersListViewModel.refresh()
                            }
                        ) {
                            isRefreshing.value = false
    
                            if (characters.isNotEmpty()) {
    
                                LazyVerticalGrid(
                                    columns = GridCells.Fixed(2),
                                    modifier = Modifier
                                        .padding(5.dp)
                                        .fillMaxSize()
                                ) {
                                    items(characters) { state ->
                                        CharacterTile(
                                            state = state,
                                            characterSelectAction = {
                                                showComics(state.id)
                                            },
                                            bookmarkAction = {
                                                charactersListViewModel.bookmarkCharacter(state.id)
                                            },
                                            modifier = Modifier
                                                .padding(5.dp)
                                                .fillMaxHeight(fraction = 0.35f)
                                        )
                                    }
                                }
    
                            } else {
                                Info(
                                    messageResource = R.string.no_characters_available,
                                    iconResource = R.drawable.ic_no_data
                                )
                            }
                        }
                    }
                }
    
                is UiState.ApiError -> {
                    Info(
                        messageResource = R.string.api_error,
                        iconResource = R.drawable.ic_something_went_wrong
                    )
                }
    
                is UiState.NoInternetError -> {
                    Info(
                        messageResource = R.string.no_internet,
                        iconResource = R.drawable.ic_no_connection,
                        isInfoOnly = false,
                        buttonAction = {
                            charactersListViewModel.refresh(showLoader = true)
                        }
                    )
                }
            }
        }
    }
    
    @Preview
    @Composable
    private fun CharactersListScaffoldPreview() {
        MarvelComicTheme {
            CharactersListScaffold(showComics = {}, closeAction = {})
        }
    }

    Let’s see what the communication between the layers looks like.

    Source: Clean Architecture Tutorial for Android

    As you can see, each layer communicates only with the closest one, keeping inner layers independent of the outer layers. This way, we can quickly test each module separately, and the separation of concerns helps developers collaborate on the different modules of the project.

    Thank you so much!

  • Building a Collaborative Editor Using Quill and Yjs

    “Hope this email finds you well” is how 2020-2021 has been in a nutshell. Since we’ve all been working remotely since last year, actively collaborating with teammates has become a notch harder, from activities like brainstorming a topic on a whiteboard to building documentation.

    Tools powered by collaborative systems have become a necessity. To explore this space, following the principle of “build fast, fail fast,” I started building a collaborative editor out of existing open-source tools, one that can eventually be extended for needs across different projects.

    Conflicts, as they say, are inevitable when multiple users work on the same document and constantly modify it, especially the same block of content. Ultimately, the end-user experience is defined by how such conflicts are resolved.

    There are various conflict resolution mechanisms, but two of the most commonly discussed ones are Operational Transformation (OT) and Conflict-Free Replicated Data Type (CRDT). So, let’s briefly talk about those first.

    Operational Transformation

    The order of operations matters in OT: each user has their own local copy of the document, and mutations are atomic, such as “insert V at index 4” or “delete X at index 2.” If the order of these operations is changed, the end result will be different. That’s why all operations are synchronized through a central server, which can alter the indices of incoming operations before forwarding them to the clients. For example, in the image below, User2 performs a delete(0) operation, but because the OT server realizes that User1 has made a concurrent insert, User2’s operation needs to be changed to delete(1) before it is applied on User1’s side.

    OT with a central server is typically easier to implement. In its basic form, OT on plain text has only three defined operations: insert, delete, and apply.

    Source: Conclave

    “Fully distributed OT and adding rich text operations are very hard, and that’s why there’s a million papers.”
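    The index adjustment described above can be sketched as a tiny transform function (illustrative only, not from the article; it assumes User1’s insert landed at index 0, as in the figure):

```javascript
// Illustrative sketch: transform a delete operation against a concurrent
// insert so that both replicas end up applying it at the right position.
function transformDelete(deleteIndex, insertIndex) {
  // If the concurrent insert landed at or before the delete position,
  // the target character has shifted one place to the right.
  return insertIndex <= deleteIndex ? deleteIndex + 1 : deleteIndex;
}

// User1 inserts at index 0; User2's concurrent delete(0) becomes delete(1)
// before the server forwards it to User1.
const transformed = transformDelete(0, 0); // → 1
```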

    CRDT

    Instead of performing operations directly on character positions as in OT, a CRDT wraps the data in a richer structure to which it can add, update, or remove properties to signify transformations, creating room for commutativity and idempotency. CRDTs guarantee eventual consistency.

    There are different algorithms, but in general, a CRDT has two requirements: globally unique characters and globally ordered characters. Basically, each object gets a global reference instead of a positional index, and its ordering is derived from the neighboring objects. Fractional indices can be used to assign an index to an object.

    Source: Conclave

    As all the objects have their own unique reference, the delete operation becomes idempotent. And assigning fractional indices is one way to create unique references during insertion and update.
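    A small sketch of the fractional-index idea (illustrative values, not Yjs’s actual algorithm): inserting between two neighbors averages their indices, so existing objects never need to be renumbered:

```javascript
// Illustrative fractional indexing: a new object slots between two
// neighbors by taking the midpoint of their indices.
function indexBetween(left, right) {
  return (left + right) / 2;
}

// "C" inserted between "A" (0.25) and "B" (0.5) gets index 0.375;
// neither neighbor changes, so each object's reference stays stable.
const doc = [
  { ch: 'A', index: 0.25 },
  { ch: 'B', index: 0.5 },
];
const inserted = { ch: 'C', index: indexBetween(0.25, 0.5) }; // → 0.375
```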

    There are two types of CRDT. One is state-based, where the whole state (or a delta of it) is shared between the instances and merged continuously. The other is operation-based, where only individual operations are sent between replicas. If you want to dive deep into CRDT, here’s a nice resource.
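    To make the state-based flavor concrete, here is a minimal sketch (a classic textbook example, not from the article and unrelated to Yjs internals) of a grow-only counter, whose merge is commutative, associative, and idempotent:

```javascript
// G-Counter: a minimal state-based CRDT. Each replica tracks its own count.
function increment(state, replicaId) {
  return { ...state, [replicaId]: (state[replicaId] || 0) + 1 };
}

// Merging takes the per-replica maximum: merging the same state twice
// (idempotent) or in either order (commutative) yields the same result.
function merge(a, b) {
  const out = { ...a };
  for (const [id, n] of Object.entries(b)) {
    out[id] = Math.max(out[id] || 0, n);
  }
  return out;
}

// The counter's value is the sum over all replicas.
function value(state) {
  return Object.values(state).reduce((sum, n) => sum + n, 0);
}

// Two replicas increment independently, then exchange and merge states.
const a = increment(increment({}, 'A'), 'A'); // { A: 2 }
const b = increment({}, 'B');                 // { B: 1 }
const merged = merge(a, b);                   // { A: 2, B: 1 }
```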

    For our purposes, we chose CRDT since it can also support peer-to-peer networks. If you want to jump directly to the code, you can visit the repo here.

    Tools used for this project:

    As our goal was a quick implementation, we targeted off-the-shelf tools for the editor and for the backend that manages collaborative operations.

    • Quill.js is an API-driven WYSIWYG rich text editor built for compatibility and extensibility. We chose Quill as our editor because of how easily it plugs into an application and the availability of extensions.
    • Yjs is a framework that provides shared editing capabilities by exposing different shared data types (Array, Map, Text, etc.) that are synced automatically. It’s also network agnostic, so changes are synced whenever a client comes online. We used it because it’s a CRDT implementation and, conveniently, it has readily available bindings for Quill.js.

    Prerequisites:

    To keep it simple, we’ll set up a client and server both in the same code base. Initialize a project with npm init and install the below dependencies:

    npm i quill quill-cursors webpack webpack-cli webpack-dev-server y-quill y-websocket yjs

    • quill is the WYSIWYG rich text editor we will use.
    • quill-cursors is an extension that displays the cursors of other clients connected to the same editor room.
    • webpack, webpack-cli, and webpack-dev-server are developer utilities, webpack being the bundler that creates a deployable bundle of the application.
    • y-quill provides bindings between Yjs and Quill.js using the shared type Y.Text. For more information, you can check out the module’s source on GitHub.
    • y-websocket provides a WebsocketProvider to communicate with a Yjs server in a client-server manner and exchange awareness information and data.
    • yjs is the CRDT framework that orchestrates conflict resolution between multiple clients.

    Code to use

    const path = require('path');
    
    module.exports = {
      mode: 'development',
      devtool: 'source-map',
      entry: {
        index: './index.js'
      },
      output: {
        globalObject: 'self',
        path: path.resolve(__dirname, './dist/'),
        filename: '[name].bundle.js',
        publicPath: '/quill/dist'
      },
      devServer: {
        contentBase: path.join(__dirname),
        compress: true,
        publicPath: '/dist/'
      }
    }

    This is a basic webpack config where we have provided which file is the starting point of our frontend project, i.e., the index.js file. Webpack then uses that file to build the internal dependency graph of your project. The output property is to define where and how the generated bundles should be saved. And the devServer config defines necessary parameters for the local dev server, which runs when you execute “npm start”.

    We’ll first create an index.html file to define the basic skeleton:

    <!DOCTYPE html>
    <html>
      <head>
        <title>Yjs Quill Example</title>
        <script src="./dist/index.bundle.js" async defer></script>
        <link rel="stylesheet" href="//cdn.quilljs.com/1.3.6/quill.snow.css">
      </head>
      <body>
        <button type="button" id="connect-btn">Disconnect</button>
        <div id="editor" style="height: 500px;"></div>
      </body>
    </html>

    The index.html has a pretty basic structure. In <head>, we’ve provided the path of the bundled js file that will be created by webpack, and the css theme for the quill editor. And for the <body> part, we’ve just created a button to connect/disconnect from the backend and a placeholder div where the quill editor will be plugged.

    • Here, we’ve just made the imports, registered quill-cursors extension, and added an event listener for window load:
    import Quill from "quill";
    import * as Y from 'yjs';
    import { QuillBinding } from 'y-quill';
    import { WebsocketProvider } from 'y-websocket';
    import QuillCursors from "quill-cursors";
    
    // Register QuillCursors module to add the ability to show multiple cursors on the editor.
    Quill.register('modules/cursors', QuillCursors);
    
    window.addEventListener('load', () => {
      // We'll add more blocks as we continue
    });

    • Let’s initialize the Yjs document, socket provider, and load the document:
    window.addEventListener('load', () => {
      const ydoc = new Y.Doc();
      const provider = new WebsocketProvider('ws://localhost:3312', 'velotio-demo', ydoc);
      const type = ydoc.getText('Velotio-Blog');
    });

    • We’ll now initialize and plug the Quill editor with its bindings:
    window.addEventListener('load', () => {
      // ### ABOVE CODE HERE ###
    
      const editorContainer = document.getElementById('editor');
      const toolbarOptions = [
        ['bold', 'italic', 'underline', 'strike'],  // toggled buttons
        ['blockquote', 'code-block'],
        [{ 'header': 1 }, { 'header': 2 }],               // custom button values
        [{ 'list': 'ordered' }, { 'list': 'bullet' }],
        [{ 'script': 'sub' }, { 'script': 'super' }],      // superscript/subscript
        [{ 'indent': '-1' }, { 'indent': '+1' }],          // outdent/indent
        [{ 'direction': 'rtl' }],                         // text direction
        // array for drop-downs, empty array = defaults
        [{ 'size': [] }],
        [{ 'header': [1, 2, 3, 4, 5, 6, false] }],
        [{ 'color': [] }, { 'background': [] }],          // dropdown with defaults from theme
        [{ 'font': [] }],
        [{ 'align': [] }],
        ['image', 'video'],
        ['clean']                                         // remove formatting button
      ];
    
      const editor = new Quill(editorContainer, {
        modules: {
          cursors: true,
          toolbar: toolbarOptions,
          history: {
            userOnly: true  // only user changes will be undone or redone.
          }
        },
        placeholder: "collab-edit-test",
        theme: "snow"
      });
    
      const binding = new QuillBinding(type, editor, provider.awareness);
    });

    • Finally, let’s implement the Connect/Disconnect button and complete the callback:
    window.addEventListener('load', () => {
      // ### ABOVE CODE HERE ###
    
      const connectBtn = document.getElementById('connect-btn');
      connectBtn.addEventListener('click', () => {
        if (provider.shouldConnect) {
          provider.disconnect();
          connectBtn.textContent = 'Connect';
        } else {
          provider.connect();
          connectBtn.textContent = 'Disconnect';
        }
      });
    
      // Expose the collaboration objects for experimentation in the browser console.
      window.example = { provider, ydoc, type, binding, Y };
    });

    Steps to run:

    • Server:

    For simplicity, we’ll directly use the y-websocket-server out of the box.
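    The article doesn’t show the actual command. With y-websocket 1.x, the bundled demo server could be started along these lines (the port matches the `ws://localhost:3312` the client connects to; the exact invocation may differ between versions, so check the package’s README):

```shell
# Start the demo WebSocket server shipped with y-websocket 1.x on the
# port the client expects (3312).
PORT=3312 node ./node_modules/y-websocket/bin/server.js
```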

    NOTE: You can either let it run and open a new terminal for the next commands, or let it run in the background using `&` at the end of the command.

    • Client:

    Start the client with npm start. On successful compilation, it should open in your default browser, or you can just go to http://localhost:8080.

    Show me the repo

    You can find the repository here.

    Conclusion:

    Conflict resolution approaches are not new, but with the trend toward remote work culture, it is important to have good collaborative systems in place to enhance productivity.

    Although this example covered only rich text editing, we can extend existing resources to build more features and structures like tabular data, graphs, charts, etc. Yjs shared types can be used to define your own data format based on how your custom editor represents data internally.