
  • A Guide to End-to-End API Test Automation with Postman and GitHub Actions

    Objective

    • This blog provides a step-by-step guide to automating API testing with Postman and demonstrates how to create a pipeline that runs the test suite periodically.
    • It also explains how the report can be stored in a central S3 bucket and how the execution status can be sent to a designated Slack channel, informing stakeholders and enabling them to obtain detailed information about the quality of the API.

    Introduction to Postman

    • To speed up the API testing process and improve the accuracy of our APIs, we are going to automate the API functional tests using Postman.
    • Postman is a great tool for exploring and testing RESTful APIs.
    • It offers a sleek user interface for creating functional tests that validate our API’s functionality.
    • Furthermore, the collection of tests will be integrated with GitHub Actions to set up a CI/CD platform that will be used to automate this API testing workflow.

    Getting started with Postman

    Setting up the environment

    • Click on the “New” button on the top left corner. 
    • Select “Environment” as the building block.
    • Give the desired name to the environment file.

    Create a collection

    • Click on the “New” button on the top left corner.
    • Select “Collection” as the building block.
    • Give the desired name to the Collection.

    Adding requests to the collection

    • Organize the requests under test into folders as required.
    • Enter the API endpoint in the URL field.
    • Set the Auth credentials necessary to run the endpoint. 
    • Set the header values, if required.
    • Enter the request body, if applicable.
    • Send the request by clicking on the “Send” button.
    • Verify the response status and response body.

    Creating Tests

    • Click on the “Tests” tab.
    • Write the test scripts in JavaScript using the Postman test API.
    • Run the tests by clicking on the “Send” button and validate the execution of the tests written.
    • Alternatively, the prebuilt snippets given by Postman can also be used to create the tests.
    • If test data needs to be created beforehand, the “Pre-request Script” tab can be used.
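As a sketch, a script entered in the “Tests” tab might look like the following (the response fields checked are illustrative; the `pm` object is provided by Postman’s sandbox, so this fragment only runs inside Postman or Newman):

```javascript
// Verify the response status code (pm.test and pm.response are Postman APIs).
pm.test("Status code is 200", function () {
    pm.response.to.have.status(200);
});

// Verify the response body contains an expected field (field name is an assumption).
pm.test("Response contains an id", function () {
    const body = pm.response.json();
    pm.expect(body).to.have.property("id");
});
```

The same `pm.expect` (Chai) assertions are also what the prebuilt snippets generate.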

    Running the Collection

    • Click on the ellipsis (…) beside the collection you created.
    • Select the environment created earlier.
    • Click on the “Run Collection” button.
    • Alternatively, the collection and the environment file can be exported and run from the command line with Newman.
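For example, after exporting both files (the file names below are hypothetical):

```shell
# Run the exported collection against the exported environment with Newman.
newman run my_collection.json -e my_environment.json --reporters cli,json
```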

    Collaboration

    The original collection and the environment file can be exported and shared with others by clicking on the “Export” button. These collections and environments can be version controlled using a system such as Git.

    • While working in a team, members raise pull requests for their changes against the original collection and environment via forking: create a fork of the collection.
    • Make the necessary changes to the collection and click “Create Pull Request”.
    • Validate the changes, then approve and merge them into the main collection.

    Integrating with CI/CD

    Creating a pipeline with GitHub Actions

    GitHub Actions is a continuous integration and continuous delivery (CI/CD) platform that allows you to automate your build, test, and deployment pipeline.

    You can create workflows that build and test every pull request to your repository, or deploy merged pull requests to production. To create a pipeline, follow the steps below:

    • Create a .yml file inside the .github/workflows folder at the repository root.
    • The file can also be created directly through the GitHub web interface.
    • Configure the necessary actions/steps for the pipeline.

    Workflow File

    • Add a trigger to run the workflow.
    • The schedule event triggers the workflow at a specific time interval using a CRON expression.
    • The push and pull_request events trigger the workflow for each push and pull request on the develop branch.
    • The workflow_dispatch event allows the workflow to also be run manually from the GitHub Actions UI.
    • Create a job to run the Postman collection:
    • Check out the code from the current repository and create a directory to store the results.
    • Install Node.js.
    • Install Newman and the necessary dependencies.
    • Run the collection.
    • Upload the Newman report to the results directory.
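Put together, the triggers and job described above might look like the following workflow sketch (the cron schedule, branch name, file names, and action versions are assumptions):

```yaml
name: API Tests

on:
  schedule:
    - cron: "0 6 * * *"      # example: run daily at 06:00 UTC
  push:
    branches: [develop]
  pull_request:
    branches: [develop]
  workflow_dispatch:

jobs:
  run-postman-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Create results directory
        run: mkdir -p results
      - uses: actions/setup-node@v3
        with:
          node-version: 16
      - name: Install Newman and dependencies
        run: npm install -g newman newman-reporter-allure
      - name: Run the collection
        run: newman run collection.json -e environment.json -r cli,allure
      - uses: actions/upload-artifact@v3
        with:
          name: newman-report
          path: results
```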

    Generating an Allure report and hosting it on S3

    • Along with the default report that Newman provides, Allure reporting can be used to get a dashboard of the results.
    • To generate the Allure report, install the Allure dependencies given in the installation step above.
    • Once that is done, add the Allure generation step to your .yml file.
    • Create an S3 bucket that you will use for storing the reports.
    • Create an IAM role for the bucket.
    • The aws-actions/configure-aws-credentials@v1 action configures your AWS credentials in the workflow.
    • Allure generates two separate folders and eventually combines them to create the dashboard.

    Use a deploy step to upload the contents of the report folder to your S3 bucket.

    • Once done, you should be able to see the Allure dashboard hosted at your bucket’s static website URL.
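As a sketch, the credential, report-generation, and deploy steps might look like this (the bucket name, region, secret names, and output folders are assumptions):

```yaml
- uses: aws-actions/configure-aws-credentials@v1
  with:
    aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
    aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
    aws-region: us-east-1        # assumed region

- name: Generate Allure report
  run: allure generate allure-results --clean -o allure-report

- name: Deploy report to S3
  run: aws s3 sync allure-report s3://my-report-bucket
```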

    Sending a Slack notification with the status of the job

    • When a job is executed in a CI/CD pipeline, it’s important to keep the team members informed about the status of the job. 
    • A GitHub Actions step can send a notification to a Slack channel with the status of the job.
    • It uses the “notify-slack-action” GitHub Action, which is defined in the “ravsamhq/notify-slack-action” repository.
    • The “if: always()” condition indicates that this step should always be executed, regardless of whether the previous steps in the workflow succeeded or failed.
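A sketch of such a notification step (inputs taken from the action’s documented options; the version tag and webhook secret name are assumptions):

```yaml
- name: Notify Slack
  uses: ravsamhq/notify-slack-action@v2
  if: always()                     # run even when earlier steps failed
  with:
    status: ${{ job.status }}
    notify_when: "success,failure"
  env:
    SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK_URL }}
```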
  • Getting the Best Out of FLAC on ARMv7: Performance Optimization Tips

    Overview

    FLAC stands for Free Lossless Audio Codec, an audio format similar to MP3 but lossless. This means audio is compressed in FLAC without any loss in quality. It is generally used when we have to encode audio without compromising quality.

    FLAC is an open-source codec (software or hardware that compresses or decompresses digital audio) and is free to use.

    We chose to deploy the FLAC encoder on an ARMv7 embedded platform.

    ARMv7 is a version of the ARM processor architecture; it is used in a wide range of devices, including smartphones, tablets, and embedded systems.

    Let’s dive into how to optimize FLAC’s performance specifically for the ARMv7 architecture. This will give you valuable insight into why optimizing FLAC matters.

    So, tighten your seat belts, and let’s get started.

    Why Do We Need to Optimize FLAC?

    Optimizing FLAC’s performance will make it faster, so it can encode/decode (compress/decompress) audio more quickly. The points below explain why we need fast codecs.

    • Suppose you’re using one of your favorite music streaming apps, and suddenly, you encounter glitches or pauses in your listening experience.
    • How would you react to the above? A poor user experience will cause this app to lose users to the competition.
    • There can be many reasons for that glitch to happen, possibly a network problem, a server problem, or maybe the audio codec.
    • The app’s audio codec may not be fast enough for your device to deliver the music without any glitches. That’s the reason we need fast codecs. It is a critical component within our control.
    • FLAC is a widely used HiRes audio codec because of its lossless nature.

    Optimizing FLAC for ARMv7

    WHY Optimize for the ARM Platform?

    • Most music devices use ARM-based processors, like mobiles, tablets, car systems, FM radios, wireless headphones, and speakers. 
    • They use ARM because of the small chip size, low energy consumption (good for battery-powered devices), and lower heat output.

    Optimization Techniques

    FLAC source code is written in the C programming language. So, there are two ways to optimize.

    1. We can rearrange or rewrite the FLAC source code so that it executes faster. Since the source is C, let’s call this the C Optimization technique.
    2. We can convert some parts of the FLAC source code into machine-specific assembly language. Let’s call this technique ARM Assembly Optimization, as we are optimizing for ARMv7.

    In my experience, assembly optimization gives better results.

    To discuss optimization techniques, first, we need to identify where codec performance typically lags.

    • Usually, a general codec uses complex algorithms that involve many complex mathematical operations. 
    • Loops are also one of the parts where codecs generally spend more time.
    • Also, the above calculations require frequent access to main memory (RAM), which incurs a performance penalty.
    • Therefore, before optimizing FLAC, we have to keep the above things in mind. Our main goal should be to make mathematical calculations, loops, and memory access faster.

    C Optimization

    There are many ways in which we can approach C optimizations. Most methods are generalized and can be applied to any C source code.

    Loop Optimizations

    As discussed earlier, loops are one of the parts where a codec generally spends more time. We can optimize loops in C itself.

    There are two widely used methods to optimize the loop in C.

    Loop Unrolling – 
    • Loops have three parts: initialization, condition checking, and increment.
    • Every iteration must test the exit condition and increment the counter.
    • This condition check disrupts the flow of execution and imposes a significant performance penalty when working on a large data set.
    • Loop unrolling reduces branching overhead by working on a larger chunk of data between condition checks.

    Let’s try to understand by an example:

    /* Original loop with n iterations. Assuming n is a multiple of 4 */
    for (int i = 0; i < n; i++) {
        sum += a[i]*b[i];
    }
    
    
    /* Loop unrolled by 4: one condition check per 4 elements */
    for (int i = 0; i < n; i += 4) {
        sum += a[i]*b[i];
        sum += a[i+1]*b[i+1];
        sum += a[i+2]*b[i+2];
        sum += a[i+3]*b[i+3];
    }

    As you can see, after unrolling by 4, we test the exit condition and increment n/4 times instead of n times.

    Loop Fusion –

    When we use the same data structure in two loops, then instead of executing two loops, we can combine them. That removes the overhead of one loop, so the code executes faster. But we need to ensure the number of loop iterations is the same and that the operations are independent of each other.

    Let’s see an example.

    /* Loop 1 */
    for(i = 0; i < n; i++)
    {
      prod *= a[i]*5;
    }
    
    
    /* Loop 2 */
    for(i = 0; i < n; i++)
    {
      sum += a[i];
    }
    
    
    /* Merging two loops to remove the overhead of one loop */
    for(i = 0; i < n; i++)
    {
      prod *= a[i]*5;
      sum += a[i];
    }

    As you can see in the above code, we are using the array a[ ] in both loops, so we can merge them; the condition check and increment then execute n times instead of 2n.

    Memory Optimizations for the ARM Architecture

    Memory access can significantly impact performance in C since multiple processor cycles are consumed for memory accesses. ARM cannot operate on data stored in memory; it needs to be transferred to the register bank first. This highlights the need to streamline the flow of data to the ARM CPU for processing.

    We can also utilize cache memory, which is much faster than main memory, to help minimize this performance penalty.

    To make memory access faster, data can be rearranged so that accesses are sequential, which consumes fewer cycles. By optimizing memory access, we can improve FLAC’s overall performance.

    Fig-1 Cache memory lies between the main memory and the processor

    Below are some tips for using the data cache more efficiently.

    • Preload the frequently used data into the cache memory.
    • Group related data together, as sequential memory accesses are faster.
    • Similarly, try to access array values sequentially instead of randomly.
    • Use arrays instead of linked lists wherever possible for sequential memory access.

    Let’s understand the above by an example:

    for(i = 0; i < n; i++)
    {
      for(j = 0; j < m; j++)
      {
        /* Accessing a[j][i] here is inefficient because we are not accessing the array sequentially in memory */
      }
    }
    
    
    /* After Interchanging */
    for(j = 0; j < m; j++)
    {
      for(i = 0; i < n; i++)
      {
        /* Accessing a[j][i] is efficient now*/
      }
    }

    As we can see in the above example, loop interchange significantly reduces cache misses: the optimized code experienced only 0.1923% cache misses. This accumulates over time to a performance improvement of 20% on ARMv7 for an array a[1000][900].

    Assembly Optimizations

    First, we need to understand why assembly optimizations are required.

    • In C optimization, we can access limited hardware features.
    • In ARM Assembly, we can leverage the processor features to the full extent, which will further help in the fast execution of code.
    • We have a Neon Co-processor, Floating Point Unit, and EDSP unit in ARMv7, which accelerate mathematical operations. We can explicitly access such hardware only via assembly language.
    • Compilers convert C code to assembly code, but may not always generate efficient code for certain functions. Writing those functions directly in assembly can lead to further optimization.

    The below points explain why the compiler doesn’t generate efficient assembly for some functions.

    • The first obvious reason is that compilers are designed to convert any C code to assembly without changing the meaning of the code. The compiler does not understand the algorithms or calculations being used.
    • The person who understands the algorithm can, of course, write better assembly than the compiler.
    • An experienced assembly programmer can modify the code to leverage specific hardware features to speed up performance.

    Now let me explain the most widely used hardware units in ARM, which accelerate mathematical operations.

    NEON – 
    • The NEON co-processor is an additional computational unit to which the ARM processor can offload mathematical calculations.
    • It is just like a sub-conscious mind (co-processor) in our brain (processor), which helps ease the workload.
    • NEON does parallel processing; it can perform up to 16 additions, subtractions, etc., in a single instruction.
    Fig-2 Instead of adding 4 variables one by one, NEON adds them in parallel simultaneously
    • FLOATING POINT UNIT – This hardware unit performs operations on floating-point numbers. Typical operations it supports are addition, subtraction, multiplication, division, square roots, etc.
    • EDSP (Enhanced Digital Signal Processing) – This hardware unit supports fast multiplications, multiply-accumulate, and vector operations.
       Fig-3 ARMv7 CPU, NEON, EDSP, FPU, and Cache under ARM Core

    Approaching Optimizations

    First, we need to identify which functions to optimize. We can find this out by profiling FLAC.

    Profiling is a technique for learning which sections of code take more time to execute and which functions are called frequently. We can then optimize those sections or functions.
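As one possible approach on Linux (the tool choice and flags are assumptions, not what the author necessarily used):

```shell
# Record where time is spent while encoding a file with the FLAC CLI,
# then list the hottest functions. --best selects maximum compression.
perf record -g ./flac --best input.wav -o output.flac
perf report
```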

    Below are some tips you can follow for an idea of which optimization technique to use.

    • For performance-critical functions, ARM Assembly should be considered first, as it typically provides better performance than C optimization because we can directly leverage hardware features.
    • When there is no scope for using the hardware units that primarily deal with mathematical operations, we can fall back to C optimizations.
    • To determine whether assembly code can be improved, we can inspect the compiler’s assembly output.
    • If there is scope for improvement, we can write the code directly in assembly for better utilization of hardware features such as NEON and the FPU.

    Results 

    After applying the above techniques to the FLAC Encoder, we saw an improvement of 22.1% in encoding time. As you can see in the table below, we used a combination of assembly and C optimizations.

    Fig-4 Graphical visualization of average encoding time vs Sampling frequency before and after optimization.

    Conclusion

    FLAC is a lossless audio codec used to preserve quality for HiRes audio applications. Optimizations that target the platform on which the codec is deployed help in providing a great user experience by drastically improving the speed at which audio can be compressed or decompressed. The same techniques can apply to other codecs by identifying and optimizing performance-critical functions.

    The optimization techniques we have used are bit-exact, i.e., after the optimizations you get the same audio output as before.

    However, it is important to note that although we can trade bit-exactness for speed, it should be done judiciously, as it can negatively impact the perceived audio quality.

    Looking to the future, with ongoing research into new compression algorithms and hardware, we are likely to see new and innovative ways to optimize audio codecs for better performance and quality as these technologies continue to evolve.

  • The Art of Release Management: Keys to a Seamless Rollout

    Overview

    A little taste of philosophy: just as life is unpredictable, so are software releases. No matter the time and energy invested in planning a release, things go wrong unexpectedly, leaving us (the software team and the business) puzzled.

    Through this blog, I will walk you through:

    1. Cures: the actions (or reactions!) from the first touchpoint of a software release gone haywire, scrutinized per user role in the software team.
    2. Preventions: later, I will introduce a framework I devised after being part of numerous software release hiccups, which eventually led me to strategize and correct the methodology for executing smoother releases.

    Software release hiccups: cures

    Production issues are painful. They suck out the energy and impact the software teams and, eventually, the business on different levels. 

    No system has ever been built foolproof, and there will always be occasions when things go wrong. 

    “It’s not what happens to you but how you react to it that matters.”

    – Epictetus

    I have broken down the cures for a software release gone wrong into three phases: 

    1: Discovery phase

    Getting into the right mindset

    Just after the release, you start receiving alerts or user complaints about the issues they are facing with accessing the application. 

    This is the trickiest phase of them all. When a release goes wrong, it is a basic human emotion to find someone to blame or get defensive. But remember, the user is always right.

    And this is the time for acceptance that there indeed is a problem with the application.

    Keeping the focus on the problem that needs to be resolved helps achieve a quicker and more efficient resolution.

    As a Business Analyst/Product/Project Manager, you can:

    Handle the communications:

    • Keep the stakeholders updated at all the stages of problem-solving
    • Emails, root cause analysis [RCA] initiation
    • Product level executive decisions [rollback, feature flags, etc.]

    As an engineer, you can:

    • Check the logs, because logs don’t lie
    • If the logs data is insufficient, check at a code level 

    As a QA, you can:

    • Replicate the issue (obviously!)
    • See what test cases missed the scenario and why
    • Was it an edge case?
    • Was it an environment-specific issue?

    Even though I have stated separate actions per role above, most of these are interchangeable. More eyes and ears help ensure a swift recovery from a bad release.

    2: Mitigation phase

    Finding the most efficient solutions to the problem at hand

    Once you have discovered the whys and whats of the problem, it is time to move on to the how phase. This is a crucial phase, as the clock is ticking and the business is hurting. Everyone expects a resolution, and soon.

    As a Business Analyst/Product/Project Manager, you can:

    • Hold team sessions to come up with the best possible solutions.
    • Multiple solutions help gauge the trade-offs and make a wiser decision.
    • PMs can help make logical business decisions and analyze the impact from the business POV.
    • Communicate the solutions and trade-offs with stakeholders, if needed, for better visibility.

    As an engineer, you can:

    • Check technical feasibility vs. complexity, in terms of time and code repercussions, to help decide on a solution.
    • Raise red flags upfront, keeping in mind which parts of the current problem could reoccur.
    • Avoid quick fixes as much as possible, even under pressure to get a solution in place.

    As a QA, you can:

    • Focus on what might break with the proposed solution. 
    • Make sure to run the test cases or modify the existing ones to accommodate the new changes.
    • Replicate the final environment and scenarios in the sandbox as much as possible.

    3: Follow-ups and tollgates

    Stop, check and go 

    Tollgates help us identify slippages and seal them tight for the future. Every phase of the software release brings new learnings, and it is mostly about adapting and course-correcting, taking the best course of action as a team, for the team.

    Following are some of the tollgates within the release process: 

    Unit Tests

    • Are all the external dependencies accounted for within the test scenarios?
    • Maybe the root cause wasn’t considered at all, so it was never tested?
    • Was velocity too high, so unit tests were ignored to an extent?
    • Avoid quick fixes and workarounds as much as possible.

    User Acceptance Testing [UAT]

    • Is the sandbox environment different from the actual live environment?
    • Have similar server configurations so that we are not greeted by surprises after a release.
    • User error
    • Some issues may have slipped through due to human error.
    • Data quality issues
    • The data in the sandbox differs from the live environment, so issues are not caught in the sandbox.

    Software release hiccups: Preventions

    Prevention is better than cure; yes, for sure, that sounds cool! 

    Now that we have seen how to tackle the releases gone wild, let me take you through the prevention part of the process. 

    Though we understand the importance of having processes and tools to set us up for a smoother release, this is only highlighted when a release goes grim. That’s when checklists get their spotlight, and the team is reminded to adhere to its set processes.

    Well, the following is not a checklist, per se, but a framework for us to identify the problems early in the software release and minimize them to some degree. 

    The D.I.A.P.E.R Framework

    So that you don’t have to do a clean-up later!

    This essentially is a set of six activities that should be in place as you are designing your software.

    Design

    This is not about the UI/UX of the app; it relates to how the application logs should be maintained.

    Structured logs

    • Logs in a readable and consistent format that can be monitored for errors.

    Centralized logging

    • Logs in one place, accessible to all the devs, and easily queryable for advanced metrics.
    • This removes the dependency on specific people within the team. Not everyone needs the logs, but having multiple people with access to them helps the team.

    Invest

    • Invest time in setting up processes
    • Software development
    • Release process/checklist
    • QA/UAT sign-offs
    • Invest money in getting the right tools which would cater to the needs
    • Monitoring
    • Alerting
    • Task management

    Alerts

    Setting up an alert mechanism automatically raises flags for the team. Not everyone needs to be on these alerts, so make a logical decision about who would benefit from the alert system.

    • Setup alerts
    • Email
    • Incident management software
    • Identify stakeholders who need to receive these alerts

    Prepare

    • Defining strategies: who takes action when things go wrong. This helps avoid chaotic situations, and the rest of the team can work on the solution instead.
    • Ex: Identifying color codes for different severities (just like we have in hospitals)
    • Plan of Action for each severity
    • Not all situations are as severe as we think. Hence, it is important to define what action is needed for each severity.
    • Ops and dev teams should be tightly intertwined.

    Evaluate

    Whenever we see a problem, we usually tend to jump to solutions. In my experience, it has always helped me to take some time and identify the answers to the following: 

    • What is the issue?
    • The focus: problem
    • How severe?
    • The severity level, as defined in the previous step
    • Who needs to be involved?
    • Not everyone within the team needs to be involved immediately to fix the problem; identifying who needs to be involved saves time for the rest of us. 

    Resolve

    There is a problem at hand, and the business and stakeholders expect a solution. As previously mentioned, keeping a cool head in this phase is of utmost importance.

    • Propose the best possible solution based on
    • Technical feasibility
    • Time
    • Cost
    • Business impact

    Always have multiple solutions to gauge the trade-offs; some take less time but involve rework in the future. Make a logical decision based on the application and the nature of the problem.

    Takeaways

    • In the discovery phase, keep the focus on the problem.
    • Keep a crisp communication with the stakeholders, making them aware of the severity of the problem and assuring them about a steady solution.
    • In the mitigation phase, identify who needs to be involved in the problem resolution.
    • Come up with multiple solutions to pick the most logical and efficient solution out of the lot.
    • Have tollgates in places to catch slippages at multiple levels. 
    • D.I.A.P.E.R framework
    • Design structured and centralized logs.
    • Invest time in setting up the process and invest money in getting the right tools for the team.
    • Alerts: Have a notification system in place, which shall raise flags when things go beyond a certain benchmark.
    • Prepare strategies for different severity levels and assign color codes for the course of action for each level of threat.
    • Evaluate the problem and the required action: who, what, and how.
    • Resolve the problem in a way that is cost- and time-efficient and aligns with the business goals/needs.

    Remember that we are building the software for the people with the help of people within the team. Things go wrong even in the most elite systems with sophisticated setups. 

    Do not be harsh on yourself or others within the team. Adapt, learn, and keep shipping!

  • Why Signals Could Be the Future for Modern Web Frameworks?

    Introduction

    When React was introduced, it had an edge over the other libraries and frameworks of that era because of a very interesting concept called one-way data binding: in simpler words, a unidirectional flow of data, introduced as part of the Virtual DOM.

    It made for a fantastic developer experience where one didn’t have to think about how the updates flow in the UI when data (”state” to be more technical) changes.

    However, as more and more hooks were introduced, syntactical rules appeared to make sure they perform optimally: essentially, a deviation from the original purpose of React, which is a unidirectional flow with explicit mutations.

    To call out a few:

    • Filling out the dependency arrays correctly
    • Memoizing the right values or callbacks for rendering optimization
    • Consciously avoiding prop drilling

    And possibly a few more that, if done the wrong way, could cause serious performance issues, i.e., everything just re-renders: a slight deviation from the original purpose of just writing components to build UIs.

    The use of signals is a good example of how adopting Reactive programming primitives can help remove all this complexity and help improve developer experience by shifting focus on the right things without having to explicitly follow a set of syntactical rules for gaining performance.

    What Is a Signal?

    A signal is one of the key primitives of Reactive programming. Syntactically, they are very similar to states in React. However, the reactive capabilities of a signal are what give it the edge.

    const [state, setState] = useState(0);
    // state -> value
    // setState -> setter
    const [signal, setSignal] = createSignal(0);
    // signal -> getter 
    // setSignal -> setter

    At this point, they look pretty much the same, except that useState returns a value while createSignal returns a getter function.

    How is a signal better than a state?

    Once useState returns a value, the library generally doesn’t concern itself with how that value is used. The developer has to decide where to use it, explicitly ensure that any effects, memos, or callbacks subscribing to its changes list it in their dependency arrays, and, on top of that, memoize it to avoid unnecessary re-renders. A lot of additional effort.

    function ParentComponent() {
      const [state, setState] = useState(0);
      const stateVal = useMemo(() => {
        return doSomeExpensiveStateCalculation(state);
      }, [state]); // Explicitly memoize and make sure dependencies are accurate
      
      useEffect(() => {
        sendDataToServer(state);
      }, [state]); // Explicitly call out the subscription to state
      
      return (
        <div>
          <ChildComponent stateVal={stateVal} />
        </div>
      );
    }

    A createSignal, however, returns a getter function since signals are reactive in nature. To break it down further, signals keep track of who is interested in the state’s changes, and if the changes occur, it notifies these subscribers.

    To gain this subscriber information, signals keep track of the context in which these state getters, which are essentially functions, are called. Invoking the getter creates a subscription.

    This is super helpful because the library itself now tracks the subscribers to the state’s changes and notifies them, without the developer having to explicitly call anything out.

    createEffect(() => {
      updateDataElsewhere(state());
    }); // effect only runs when `state` changes - an automatic subscription

    The contexts (not to be confused with the React Context API) that invoke the getter are the only ones the library will notify, which means memoizing, explicitly filling out large dependency arrays, and fixing unnecessary re-renders can all be avoided. This removes the need for many of the hooks meant for this purpose, such as useRef, useCallback, and useMemo, and eliminates a lot of re-renders.

    This greatly enhances the developer experience and shifts focus back on building components for the UI rather than spending that extra 10% of developer efforts in abiding by strict syntactical rules for performance optimization.

    function ParentComponent() {
      const [state, setState] = createSignal(0);
      const stateVal = createMemo(() => doSomeExpensiveStateCalculation(state())); // createMemo tracks dependencies automatically - no array to maintain
    
      createEffect(() => {
        sendDataToServer(state());
      }); // will only be fired if state changes - the effect is automatically added as a subscriber
    
      return (
        <div>
          <ChildComponent stateVal={stateVal} />
        </div>
      );
    }

    Conclusion

    It might look like there’s a very biased stance toward using signals and reactive programming in general. However, that’s not the case.

    React is a high-performance, optimized library—even though there are some gaps or misses in using your state in an optimum way, which leads to unnecessary re-renders, it’s still really fast. After years of using React a certain way, frontend developers are used to visualizing a certain flow of data and re-rendering, and replacing that entirely with a reactive programming mindset is not natural. React is still the de facto choice for building user interfaces, and it will continue to be with every iteration and new feature added.

    Reactive programming, in addition to performance enhancements, also simplifies the developer experience by boiling state management down to three major primitives: signals, memos, and effects. This helps developers focus on building UI components rather than dealing explicitly with performance optimization.

    Signals are becoming increasingly popular and are part of many modern web frameworks, such as Solid.js, Preact, Qwik, and Vue.js.

  • Apache Flink – A Solution for Real-Time Analytics

    In today’s world, data is being generated at an unprecedented rate. Every click, every tap, every swipe, every tweet, every post, every like, every share, every search, and every view generates a trail of data. Businesses are struggling to keep up with the speed and volume of this data, and traditional batch-processing systems cannot handle the scale and complexity of this data in real-time.

    This is where streaming analytics comes into play, providing faster insights and more timely decision-making. Streaming analytics is particularly useful for scenarios that require quick reactions to events, such as financial fraud detection or IoT data processing. It can handle large volumes of data and provide continuous monitoring and alerts in real-time, allowing for immediate action to be taken when necessary.

    Stream processing or real-time analytics is a method of analyzing and processing data as it is generated, rather than in batches. It allows for faster insights and more timely decision-making. Popular open-source stream processing engines include Apache Flink, Apache Spark Streaming, and Apache Kafka Streams. In this blog, we are going to talk about Apache Flink and its fundamentals and how it can be useful for streaming analytics. 

    Introduction

    Apache Flink is an open-source stream processing framework first introduced in 2014. Flink has been designed to process large amounts of data streams in real-time, and it supports both batch and stream processing. It is built on top of the Java Virtual Machine (JVM) and is written in Java and Scala.

    Flink is a distributed system that can run on a cluster of machines, and it has been designed to be highly available, fault-tolerant, and scalable. It supports a wide range of data sources and provides a unified API for batch and stream processing, which makes it easy to build complex data processing applications.

    Advantages of Apache Flink

    Real-time analytics is the process of analyzing data as it is generated. It requires a system that can handle large volumes of data in real-time and provide insights into the data as soon as possible. Apache Flink has been designed to meet these requirements and has several advantages over other real-time data processing systems.

    1. Low Latency: Flink processes data streams in real-time, which means it can provide insights into the data almost immediately. This makes it an ideal solution for applications that require low latency, such as fraud detection and real-time recommendations.
    2. High Throughput: Flink has been designed to handle large volumes of data and can scale horizontally to handle increasing volumes of data. This makes it an ideal solution for applications that require high throughput, such as log processing and IoT applications.
    3. Flexible Windowing: Flink provides a flexible windowing API that enables the creation of complex windows for processing data streams. This enables the creation of windows based on time, count, or custom triggers, which makes it easy to create complex data processing applications.
    4. Fault Tolerance: Flink is designed to be highly available and fault-tolerant. It can recover from failures quickly and can continue processing data even if some of the nodes in the cluster fail.
    5. Compatibility: Flink is compatible with a wide range of data sources, including Kafka, Hadoop, and Elasticsearch. This makes it easy to integrate with existing data processing systems.

    Flink Architecture

    Apache Flink processes data streams in a distributed manner. A Flink cluster consists of several nodes, each of which is responsible for processing a portion of the data. The nodes communicate over Flink’s own RPC and data-exchange layer, while data typically enters and leaves the cluster through a messaging system such as Apache Kafka.

    The Flink cluster processes data streams in parallel by dividing the data into small chunks, or partitions, and processing them independently. Each partition is processed by a task, which is a unit of work that runs on a node in the cluster.

    Flink provides several APIs for building data processing applications, including the DataStream API, the DataSet API, and the Table API. The below diagram illustrates what a Flink cluster looks like.

    Apache Flink Cluster
    • Flink application runs on a cluster.
    • A Flink cluster has a job manager and a bunch of task managers.
    • A job manager is responsible for effective allocation and management of computing resources. 
    • Task managers are responsible for the execution of a job.

    Flink Job Execution

    1. Client system submits job graph to the job manager
    • A client system prepares and sends a dataflow/job graph to the job manager.
    • It can be your Java/Scala/Python Flink application or the CLI.
    • The client is not part of the runtime and program execution.
    • After submitting the job, the client can either disconnect and operate in detached mode or remain connected to receive progress reports in attached mode.

    Given below is an illustration of what the job graph converted from code looks like:

    Job Graph
    2. The job graph is converted to an execution graph by the job manager
    • The execution graph is a parallel version of the job graph. 
    • For each job vertex, it contains an execution vertex per parallel subtask. 
    • An operator that exhibits a parallelism level of 100 will consist of a single job vertex and 100 execution vertices.

    Given below is an illustration of what an execution graph looks like:

    Execution Graph
    3. Job manager submits the parallel instances of the execution graph to task managers
    • Execution resources in Flink are defined through task slots. 
    • Each task manager will have one or more task slots, each of which can run one pipeline of parallel tasks. 
    • A pipeline consists of multiple successive tasks.
    Parallel instances of execution graph being submitted to task slots

    Flink Program

    Flink programs look like regular programs that transform DataStreams. Each program consists of the same basic parts:

    • Obtain an execution environment 

    ExecutionEnvironment is the context in which a program is executed. This is how an execution environment is set up in Flink code:

    ExecutionEnvironment env = ExecutionEnvironment.createLocalEnvironment(); // if program is running on local machine
    ExecutionEnvironment env = new CollectionEnvironment(); // if source is collections
    ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment(); // will do the right thing based on context

    • Connect to data stream

    We can use an instance of the execution environment to connect to a data source, which can be a file system, a streaming application, or a collection. This is how we can connect to a data source in Flink:

    DataSet<String> data = env.readTextFile("file:///path/to/file"); // to read from file
    DataSet<User> users = env.fromCollection( /* get elements from a Java Collection */); // to read from collections
    DataStream<User> users = env.addSource(/*streaming application or database*/); // to read from a streaming source (requires a StreamExecutionEnvironment)

    • Perform Transformations

    We can perform transformations on the events/data that we receive from the data sources.
    A few of the data transformation operations are map, filter, keyBy, flatMap, etc.

    • Specify where to send the data

    Once we have performed the transformation/analytics on the data that is flowing through the stream, we can specify where we will send the data.
    The destination can be a filesystem, database, or data streams.

     dataStream.sinkTo(/*streaming application or database api */);

    Flink Transformations

    1. Map: Takes one element at a time from the stream, performs some transformation on it, and emits one element of any type as output.

      Given below is an example of Flink’s map operator:

    stream.map(new MapFunction<Integer, String>() {
        @Override
        public String map(Integer integer) {
            // numberToWords converts the number's digits to words
            return "input -> " + integer + " : output -> "
                    + numberToWords(integer.toString().toCharArray());
        }
    }).print();

    2. Filter: Evaluates a boolean function for each element and retains those for which the function returns true.

    Given below is an example of Flink’s filter operator:

    stream.filter(new FilterFunction<Integer>() {
        @Override
        public boolean filter(Integer integer) throws Exception {
            return integer % 2 != 0; // keep odd numbers only
        }
    }).print();

    3. Reduce: A “rolling” reduce on a keyed data stream. Combines the current element with the last reduced value and emits the new value.

    Given below is an example of Flink’s reduce operator:

    DataStream<Integer> stream = env.fromCollection(data);
    stream.countWindowAll(3)
          .reduce(new ReduceFunction<Integer>() {
              @Override
              public Integer reduce(Integer integer, Integer t1) throws Exception {
                  return integer + t1;
              }
          }).print();


    4. KeyBy: 
    • Logically partitions a stream into disjoint partitions. 
    • All records with the same key are assigned to the same partition. 
    • Internally, keyBy() is implemented with hash partitioning.

    The figure below illustrates how the keyBy operator works in Flink.
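    A simplified model of this routing can be sketched in plain Java (a hypothetical helper, not Flink's actual key-group hashing):

    ```java
    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Simplified model of keyBy(): route each record to a partition derived
    // from its key's hash, so records with equal keys always land together.
    // (Flink's real implementation hashes keys into "key groups" first.)
    public class KeyByModel {
        public static int partitionFor(String key, int parallelism) {
            // floorMod guards against negative hashCode values
            return Math.floorMod(key.hashCode(), parallelism);
        }

        public static Map<Integer, List<String>> keyBy(List<String> records, int parallelism) {
            Map<Integer, List<String>> partitions = new HashMap<>();
            for (String element : records) {
                String key = element.split(":")[0]; // key is the part before ':'
                partitions.computeIfAbsent(partitionFor(key, parallelism),
                                           p -> new ArrayList<>()).add(element);
            }
            return partitions;
        }
    }
    ```

    The property keyBy() guarantees is exactly the one this toy preserves: every record with the same key maps to the same partition index.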

    Fault Tolerance

    • Flink combines stream replay and checkpointing to achieve fault tolerance. 
    • At a checkpoint, each operator’s corresponding state and the specific point in each input stream are marked.
    • Whenever checkpointing is done, a snapshot of the state of all the operators is saved in the state backend, which is generally the job manager’s memory or configurable durable storage.
    • Whenever there is a failure, operators are reset to the most recent state in the state backend, and event processing is resumed.
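    This checkpoint-and-restore cycle can be sketched with a toy stateful operator (plain Java, not Flink's actual API):

    ```java
    // Toy stateful operator: keeps a running sum, snapshots it at
    // checkpoints, and restores the last snapshot after a failure.
    public class CheckpointDemo {
        public long runningSum = 0;          // operator state
        public long lastCheckpoint = 0;      // what the "state backend" holds
        public long checkpointedOffset = -1; // position in the input stream

        public void process(long value) { runningSum += value; }

        public void checkpoint(long offset) { // snapshot state + stream position
            lastCheckpoint = runningSum;
            checkpointedOffset = offset;
        }

        public long recover() {              // reset to the most recent snapshot...
            runningSum = lastCheckpoint;
            return checkpointedOffset;       // ...and replay the input from this offset
        }
    }
    ```

    After a failure, the source re-reads events after the returned offset, so in this toy every event ends up reflected in the restored sum exactly once.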

    Checkpointing

    • Checkpointing is implemented using stream barriers.
    • Barriers are injected into the data stream at the source, e.g., Kafka, Kinesis, etc.
    • Barriers flow with the records as part of the data stream.

    Refer to the diagram below to understand how checkpoint barriers flow with the events:

    Checkpoint Barriers
    Saving Snapshots
    • Operators snapshot their state at the point in time when they have received all snapshot barriers from their input streams, and before emitting the barriers to their output streams.
    • Once a sink operator (the end of a streaming DAG) has received the barrier n from all of its input streams, it acknowledges that snapshot n to the checkpoint coordinator. 
    • After all sinks have acknowledged a snapshot, it is considered completed.

    The below diagram illustrates how checkpointing is achieved in Flink with the help of barrier events, state backends, and checkpoint table.

    Checkpointing

    Recovery

    • Flink selects the latest completed checkpoint upon failure. 
    • The system then re-deploys the entire distributed dataflow.
    • Gives each operator the state that was snapshotted as part of the checkpoint.
    • The sources are set to start reading the stream from the position given in the snapshot.
    • For example, in Apache Kafka, that means telling the consumer to start fetching messages from an offset given in the snapshot.
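    Under the assumption that a plain in-memory list stands in for a Kafka partition, resuming from a snapshotted offset can be sketched as:

    ```java
    import java.util.List;

    // Replays a partition's records starting at the offset stored in the
    // latest completed checkpoint, mimicking a consumer seeking to an offset.
    public class OffsetReplay {
        public static long sumFrom(List<Integer> partitionLog, int snapshotOffset) {
            long sum = 0;
            for (int offset = snapshotOffset; offset < partitionLog.size(); offset++) {
                sum += partitionLog.get(offset); // reprocess everything after the snapshot
            }
            return sum;
        }
    }
    ```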

    Scalability  

    A Flink job can be scaled up and scaled down as per requirement.

    This can be done manually by:

    1. Triggering a savepoint (manually triggered checkpoint)
    2. Adding/Removing nodes to/from the cluster
    3. Restarting the job from savepoint

    OR 

    Automatically by Reactive Scaling

    • The configuration of a job in Reactive Mode ensures that it utilizes all available resources in the cluster at all times.
    • Adding a Task Manager will scale up your job, and removing resources will scale it down. 
    • Reactive Mode restarts a job on a rescaling event, restoring it from the latest completed checkpoint.
    • The only downside is that it works only in standalone mode.

    Alternatives  

    • Spark Streaming: It is an open-source distributed computing engine that has added streaming capabilities, but Flink is optimized for low-latency processing of real-time data streams and supports more complex processing scenarios.
    • Apache Storm: It is another open-source stream processing system that has a steeper learning curve than Flink and uses a different architecture based on spouts and bolts.
    • Apache Kafka Streams: It is a lightweight stream processing library built on top of Kafka, but it is not as feature-rich as Flink or Spark, and is better suited for simpler stream processing tasks.

    Conclusion  

    In conclusion, Apache Flink is a powerful solution for real-time analytics. With its ability to process data in real-time and support for streaming data sources, it enables businesses to make data-driven decisions with minimal delay. The Flink ecosystem also provides a variety of tools and libraries that make it easy for developers to build scalable and fault-tolerant data processing pipelines.

    One of the key advantages of Apache Flink is its support for event-time processing, which allows it to handle delayed or out-of-order data in a way that accurately reflects the sequence of events. This makes it particularly useful for use cases such as fraud detection, where timely and accurate data processing is critical.

    Additionally, Flink’s support for multiple programming languages, including Java, Scala, and Python, makes it accessible to a broad range of developers. And with its seamless integration with popular big data platforms like Hadoop and Apache Kafka, it is easy to incorporate Flink into existing data infrastructure.

    In summary, Apache Flink is a powerful and flexible solution for real-time analytics, capable of handling a wide range of use cases and delivering timely insights that drive business value.


  • An Introduction to Stream Processing & Analytics

    What is Stream Processing and Analytics?

    Stream processing is a technology used to process large amounts of data in real-time as it is generated rather than storing it and processing it later.

    Think of it like a conveyor belt in a factory. The conveyor belt constantly moves, bringing in new products that need to be processed. Similarly, stream processing deals with data that is constantly flowing, like a stream of water. Just like the factory worker needs to process each product as it moves along the conveyor belt, stream processing technology processes each piece of data as it arrives.

    Stateful and stateless processing are two different approaches to stream processing, and the right choice depends on the specific requirements and needs of the application. 

    Stateful processing is useful in scenarios where the processing of an event or data point depends on the state of previous events or data points. For example, it can be used to maintain a running total or average across multiple events or data points.

    Stateless processing, on the other hand, is useful in scenarios where the processing of an event or data point does not depend on the state of previous events or data points. For example, in a simple data transformation application, stateless processing can be used to transform each event or data point independently without the need to maintain state.
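    The difference can be sketched in plain Java (illustrative helpers, not any particular framework's API):

    ```java
    import java.util.ArrayList;
    import java.util.List;

    public class ProcessingModes {
        // Stateless: each event is transformed independently of all others.
        public static String toUpper(String event) {
            return event.toUpperCase();
        }

        // Stateful: the output for each event depends on everything seen so far
        // (here, a running average maintained across events).
        public static List<Double> runningAverage(List<Integer> events) {
            List<Double> averages = new ArrayList<>();
            long sum = 0;
            for (int i = 0; i < events.size(); i++) {
                sum += events.get(i);
                averages.add(sum / (double) (i + 1));
            }
            return averages;
        }
    }
    ```

    The stateless transform can be applied to events in any order and on any machine; the stateful one requires the running sum to be kept (and, in a real system, checkpointed) somewhere.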

    Streaming analytics refers to the process of analyzing and processing data in real time as it is generated, enabling applications to react to events and make decisions in near real time.

    Why Stream Processing and Analytics?

    Stream processing is important because it allows organizations to make real-time decisions based on the data they are receiving. This is particularly useful in situations where timely information is critical, such as in financial transactions, network security, and real-time monitoring of industrial processes.

    For example, in financial trading, stream processing can be used to analyze stock market data in real time and make split-second decisions to buy or sell stocks. In network security, it can be used to detect and respond to cyber-attacks in real time. And in industrial processes, it can be used to monitor production line efficiency and quickly identify and resolve any issues.

    Stream processing is also important because it can process massive amounts of data, making it ideal for big data applications. With the growth of the Internet of Things (IoT), the amount of data being generated is growing rapidly, and stream processing provides a way to process this data in real time and derive valuable insights.

    Collectively, stream processing provides organizations with the ability to make real-time decisions based on the data they are receiving, allowing them to respond quickly to changing conditions and improve their operations.

    How is it different from Batch Processing?

    Batch Data Processing:

    Batch Data Processing is a method of processing where a group of transactions or data is collected over a period of time and is then processed all at once in a “batch”. The process begins with the extraction of data from its sources, such as IoT devices or web/application logs. This data is then transformed and integrated into a data warehouse. The process is generally called the Extract, Transform, Load (ETL) process. The data warehouse is then used as the foundation for an analytical layer, which is where the data is analyzed, and insights are generated.

    Stream/Real-time Data Processing:

    Real-time data streaming involves the continuous flow of data generated in real time, typically from multiple sources such as IoT devices or web/application logs. A message broker manages the flow of data between the stream processors, the analytical layer, and the data sink, ensuring that the data is delivered in the correct order and is not lost. Stream processors are used to perform data ingestion and processing: they take in the data streams and process them in real time. The processed data is then sent to an analytical layer, where it is analyzed and insights are generated.

    Processes involved in Stream processing and Analytics:

    The process of stream processing can be broken down into the following steps:

    • Data Collection: The first step in stream processing is collecting data from various sources, such as sensors, social media, and transactional systems. The data is then fed into a stream processing system in real time.
    • Data Ingestion: Once the data is collected, it needs to be ingested or taken into the stream processing system. This involves converting the data into a standard format that can be processed by the system.
    • Data Processing: The next step is to process the data as it arrives. This involves applying various processing algorithms and rules to the data, such as filtering, aggregating, and transforming the data. The processing algorithms can be applied to individual events in the stream or to the entire stream of data.
    • Data Storage: After the data has been processed, it is stored in a database or data warehouse for later analysis. The storage can be configured to retain the data for a specific amount of time or to retain all the data.
    • Data Analysis: The final step is to analyze the processed data and derive insights from it. This can be done using data visualization tools or by running reports and queries on the stored data. The insights can be used to make informed decisions or to trigger actions, such as sending notifications or triggering alerts.
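    The five steps above can be sketched as one miniature pipeline (plain Java; the helper names are hypothetical):

    ```java
    import java.util.ArrayList;
    import java.util.List;

    // A miniature stream pipeline: ingest raw events, process them
    // (filter + transform), store the results, then analyze the store.
    public class MiniPipeline {
        public final List<Integer> store = new ArrayList<>(); // "data storage" step

        public void onEvent(String raw) {     // ingestion: parse to a standard form
            int value = Integer.parseInt(raw.trim());
            if (value >= 0) {                 // processing: filter out bad readings
                store.add(value * 2);         // processing: transform
            }
        }

        public double analyze() {             // analysis: derive an insight (the mean)
            return store.stream().mapToInt(Integer::intValue).average().orElse(0.0);
        }
    }
    ```

    In a real system each stage would be a separate, independently scalable component; collapsing them into one class just makes the sequence of steps visible.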

    It’s important to note that stream processing is an ongoing process, with data constantly being collected, processed, and analyzed in real time. The visual representation of this process can be represented as a continuous cycle of data flowing through the system, being processed and analyzed at each step along the way.

    Stream Processing Platforms & Frameworks:

    Stream Processing Platforms & Tools are software systems that enable the collection, processing, and analysis of real-time data streams.

    Stream Processing Frameworks:

    A stream processing framework is a software library or framework that provides a set of tools and APIs for developers to build custom stream processing applications. Frameworks typically require more development effort and configuration to set up and use. They provide more flexibility and control over the stream processing pipeline but also require more development and maintenance resources. 

    Examples: Apache Spark Streaming, Apache Flink, Apache Beam, Apache Storm, Apache Samza

    Let’s first look into the most commonly used stream processing frameworks: Apache Flink & Apache Spark Streaming.

    Apache Flink : 

    Flink is an open-source, unified stream-processing and batch-processing framework. Flink executes arbitrary dataflow programs in a data-parallel and pipelined manner, making it ideal for processing huge amounts of data in real-time.

    • Flink provides out-of-the-box checkpointing and state management, two features that make it possible to manage enormous amounts of data with relative ease.
    • The event processing, filter, and mapping functions are other features that make handling large amounts of data easy.

    Flink also comes with real-time indicators and alerts, which make a big difference when it comes to data processing and analysis.

    Note: We have discussed stream processing and analytics in detail in Stream Processing and Analytics with Apache Flink

    Apache Spark Streaming : 

    Apache Spark Streaming is a scalable fault-tolerant streaming processing system that natively supports both batch and streaming workloads. Spark Streaming is an extension of the core Spark API that allows data engineers and data scientists to process real-time data.

    • Great for solving complicated transformation logic
    • Easy to program
    • Runs at blazing speeds
    • Processes large volumes of data within a fraction of a second

    Stream Processing Platforms:

    A stream processing platform is an end-to-end solution for processing real-time data streams. Platforms typically require less development effort and maintenance as they provide pre-built tools and functionality for processing, analyzing, and visualizing data. 

    Examples: Apache Kafka, Amazon Kinesis, Google Cloud Pub/Sub

    Let’s look into the most commonly used stream processing platforms: Apache Kafka & AWS Kinesis.

    Apache Kafka: 

    Apache Kafka is an open-source distributed event streaming platform for high-performance data pipelines, streaming analytics, data integration, and mission-critical applications.

    • Because it is open-source, Kafka generally requires a higher skill set to operate and manage, so it is often used for development and testing.
    • APIs allow “producers” to publish data streams to “topics”; a “topic” is a partitioned log of records; a “partition” is ordered and immutable; “consumers” subscribe to “topics”.
    • It can run on a cluster of “brokers”, with partitions split across cluster nodes. 
    • Messages can be effectively unlimited in size (the broker-side limit is configurable up to roughly 2 GB, though the default maximum is about 1 MB).

    AWS Kinesis:

    Amazon Kinesis is a cloud-based service on Amazon Web Services (AWS) that allows you to ingest real-time data such as application logs, website clickstreams, and IoT telemetry data for machine learning and analytics, as well as video and audio. 

    • Amazon Kinesis is a SaaS offering, reducing the complexities in the design, build, and manage stages compared to open-source Apache Kafka. It’s ideally suited for building microservices architectures. 
    • “Producers” push data onto the stream, and it is available to consumers as soon as it arrives. Kinesis splits the stream across “shards” (which are like partitions). 
    • Shards have a hard limit on the number of transactions and data volume per second. If you need more volume, you must subscribe to more shards. You pay for what you use.
    • Most maintenance and configuration is hidden from the user. Scaling is easy (adding shards) compared to Kafka. 
    • Maximum message size is 1MB.

    Three Characteristics of an Event Stream Processing Platform:

    Publish and Subscribe:

    In a publish-subscribe model, producers publish events or messages to streams or topics, and consumers subscribe to streams or topics to receive the events or messages. This is similar to a message queue or enterprise messaging system. It allows for the decoupling of the producer and consumer, enabling them to operate independently and asynchronously. 
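    A minimal sketch of this decoupling in plain Java (a toy in-process hub, not a real broker):

    ```java
    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.function.Consumer;

    // Minimal publish-subscribe hub: producers publish to a topic without
    // knowing who consumes; consumers subscribe without knowing who produces.
    public class PubSub {
        private final Map<String, List<Consumer<String>>> subscribers = new HashMap<>();

        public void subscribe(String topic, Consumer<String> consumer) {
            subscribers.computeIfAbsent(topic, t -> new ArrayList<>()).add(consumer);
        }

        public void publish(String topic, String event) {
            for (Consumer<String> c : subscribers.getOrDefault(topic, List.of())) {
                c.accept(event); // each subscriber receives every event on the topic
            }
        }
    }
    ```

    A real platform additionally persists the events (so consumers can operate asynchronously and replay the log), which this toy deliberately omits.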

    Store streams of events in a fault-tolerant way

    This means that the platform is able to store and manage events in a reliable and resilient manner, even in the face of failures or errors. To achieve fault tolerance, event streaming platforms typically use a variety of techniques, such as replicating data across multiple nodes, and implementing data recovery and failover mechanisms.

    Process streams of events in real-time, as they occur

    This means that the platform can process and analyze data as it is generated rather than waiting for data to be batch-processed or stored for later processing.

    Challenges when designing the stream processing and analytics solution:

    Stream processing is a powerful technology, but there are also several challenges associated with it, including:

    • Late arriving data: Data that is delayed or arrives out of order can disrupt the processing pipeline and lead to inaccurate results. Stream processing systems need to be able to handle out-of-order data and reconcile it with the data that has already been processed.
    • Missing data: If data is missing or lost, it can impact the accuracy of the processing results. Stream processing systems need to be able to identify missing data and handle it appropriately, whether by skipping it, buffering it, or alerting a human operator.
    • Duplicate data: Duplicate data can lead to over-counting and skewed results. Stream processing systems need to be able to identify and de-duplicate data to ensure accurate results.
    • Data skew: Data skew occurs when there is a disproportionate amount of data for certain key fields or time periods. This can lead to performance issues, processing delays, and inaccurate results. Stream processing systems need to be able to handle data skew by load balancing and scaling resources appropriately.
    • Fault tolerance: Stream processing systems need to be able to handle hardware and software failures without disrupting the processing pipeline. This requires fault-tolerant design, redundancy, and failover mechanisms.
    • Data security and privacy: Real-time data processing often involves sensitive data, such as personal information, financial data, or intellectual property. Stream processing systems need to ensure that data is securely transmitted, stored, and processed in compliance with regulatory requirements.
    • Latency: Another challenge with stream processing is latency or the amount of time it takes for data to be processed and analyzed. In many applications, the results of the analysis need to be produced in real-time, which puts pressure on the stream processing system to process the data quickly.
    • Scalability: Stream processing systems must be able to scale to handle large amounts of data as the amount of data being generated continues to grow. This can be a challenge because the systems must be designed to handle data in real-time while also ensuring that the results of the analysis are accurate and reliable.
    • Maintenance: Maintaining a stream processing system can also be challenging, as the systems are complex and require specialized knowledge to operate effectively. In addition, the systems must be able to evolve and adapt to changing requirements over time.
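    As one concrete illustration, the duplicate-data challenge is commonly handled by de-duplicating on a stable event ID; a minimal sketch, assuming each event carries a unique ID:

    ```java
    import java.util.HashSet;
    import java.util.Set;

    // De-duplicates events by ID: the first occurrence is processed;
    // replays and retries of the same event are dropped.
    public class Deduplicator {
        private final Set<String> seenIds = new HashSet<>();

        /** Returns true if the event is new and should be processed. */
        public boolean accept(String eventId) {
            return seenIds.add(eventId); // add() returns false when the ID was seen before
        }
    }
    ```

    In production the seen-ID set is usually bounded, for example by a time window, so it does not grow without limit.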

    Despite these challenges, stream processing remains an important technology for organizations that need to process data in real time and make informed decisions based on that data. By understanding these challenges and designing the systems to overcome them, organizations can realize the full potential of stream processing and improve their operations.

    Key benefits of stream processing and analytics:

    • Real-time processing keeps you in sync all the time:

    For example: Suppose an online retailer uses a distributed system to process orders. The system might include multiple components, such as a web server, a database server, and an application server. Real-time processing keeps these components in sync: orders are processed as they are received and the database is updated accordingly, so orders are accurate and processed efficiently against a consistent view of the data.

    • Real-time data processing is more accurate and timely:

    For example, a financial trading system that processes data in real time can help ensure that trades are executed at the best possible prices, improving the accuracy and timeliness of the trades. 

    • Real-time processing helps meet deadlines:

    For example: In a control system, it may be necessary to respond to changing conditions within a certain time frame in order to maintain the stability of the system. 

    • Real-time processing is quite reactive:

    For example, a real-time processing system might be used to monitor a manufacturing process and trigger an alert if it detects a problem or to analyze sensor data from a power plant and adjust the plant’s operation in response to changing conditions.

    • Real-time processing involves multitasking:

    For example, consider a real-time monitoring system that is used to track the performance of a large manufacturing plant. The system might receive data from multiple sensors and sources, including machine sensors, temperature sensors, and video cameras. In this case, the system would need to be able to multitask in order to process and analyze data from all of these sources in real time and to trigger alerts or take other actions as needed. 

    • Real-time processing rarely works in isolation:

    For example, a real-time processing system may rely on a database or message queue to store and retrieve data, or it may rely on external APIs or services to access additional data or functionality.
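    The monitoring and alerting ideas in the bullets above can be made concrete with a small sketch. The following is illustrative only — the sensor IDs, threshold, and function names are invented — but it shows the core loop: consume events one at a time, keep an incremental per-sensor aggregate without storing history, and react the moment a reading crosses a threshold.

```javascript
// Minimal stream-processing sketch: events are handled one at a time as
// they arrive, a per-sensor running average is updated incrementally
// (no history is stored), and alerts fire the moment a threshold is crossed.
function createStreamProcessor({ threshold, onAlert }) {
  const stats = new Map(); // sensorId -> { count, mean }
  return {
    process(event) {
      const s = stats.get(event.sensorId) ?? { count: 0, mean: 0 };
      s.count += 1;
      s.mean += (event.value - s.mean) / s.count; // incremental mean update
      stats.set(event.sensorId, s);
      if (event.value > threshold) onAlert(event); // react in real time
      return s.mean;
    },
  };
}

const alerts = [];
const proc = createStreamProcessor({ threshold: 90, onAlert: (e) => alerts.push(e) });
proc.process({ sensorId: 'temp-1', value: 70 });
proc.process({ sensorId: 'temp-1', value: 95 }); // crosses 90: alert fires
console.log(alerts.length); // 1
```

    Real stream processors (Kafka Streams, Flink, and the like) add partitioning, windowing, and fault tolerance on top of this same consume-update-react loop.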

    Use case studies:

    There are many real-life examples of stream processing in different industries that demonstrate the benefits of this technology. Here are a few examples:

    • Financial Trading: In the financial industry, stream processing is used to analyze stock market data in real time and make split-second decisions to buy or sell stocks. This allows traders to respond to market conditions in real time and improve their chances of making a profit.
    • Network Security: Stream processing is also used in network security to detect and respond to cyber-attacks in real-time. By processing network data in real time, security systems can quickly identify and respond to threats, reducing the risk of a data breach.
    • Industrial Monitoring: In the industrial sector, stream processing is used to monitor production line efficiency and quickly identify and resolve any issues. For example, it can be used to monitor the performance of machinery and identify any potential problems before they cause a production shutdown.
    • Social Media Analysis: Stream processing is also used to analyze social media data in real time. This allows organizations to monitor brand reputation, track customer sentiment, and respond to customer complaints in real time.
    • Healthcare: In the healthcare industry, stream processing is used to monitor patient data in real time and quickly identify any potential health issues. For example, it can be used to monitor vital signs and alert healthcare providers if a patient’s condition worsens.

    These are just a few examples of the many real-life applications of stream processing. Across all industries, stream processing provides organizations with the ability to process data in real time and make informed decisions based on the data they are receiving.

    How to start stream analytics?

    • If you are building a dedicated platform, our recommendation is to focus on choosing a versatile stream processor to pair with your existing analytical interface.
    • Or, keep an eye on vendors who offer both stream processing and BI as a service.


    Conclusion:

    Real-time data analysis and decision-making require stream processing and analytics in diverse industries, including finance, healthcare, and e-commerce. Organizations can improve operational efficiency, customer satisfaction, and revenue growth by processing data in real time. A robust infrastructure, skilled personnel, and efficient algorithms are required for stream processing and analytics. Businesses need stream processing and analytics to stay competitive and agile in today’s fast-paced world as data volumes and complexity continue to increase.

  • Machine Learning in Flutter using TensorFlow

    Machine learning has become part of day-to-day life. Small tasks like searching for songs on YouTube and product suggestions on Amazon use ML in the background. This is a well-developed field of technology with immense possibilities. But how can we use it?

    This blog is aimed at explaining how easy it is to use machine learning models (which will act as a brain) to build powerful ML-based Flutter applications. We will briefly touch on the following points.

    1. Definitions

    Let’s jump to the part where most people are confused. A person who is not exposed to the IT industry might think AI, ML, & DL are all the same. So, let’s understand the difference.  

    Figure 01

    1.1. Artificial Intelligence (AI): 

    AI, i.e. artificial intelligence, is a concept of machines being able to carry out tasks in a smarter way. You all must have used YouTube. In the search bar, you can type the lyrics of any song, even lyrics that are not necessarily the starting part of the song or title of songs, and get almost perfect results every time. This is the work of a very powerful AI.
    Artificial intelligence is the ability of a machine to do tasks that are usually done by humans. This ability is special because the task we are talking about requires human intelligence and discernment.

    1.2. Machine Learning (ML):

    Machine learning is a subset of artificial intelligence. It is based on the idea that we expose machines to new data, which can be complete or partial, and let the machine determine the output. We can also say it is a sub-field of AI that deals with the extraction of patterns from data sets. By processing new data and comparing it against previous results, the machine gradually converges on the expected result. This means that the machine can find rules for optimal behavior to produce new outputs. It can also adapt itself to changing data, just like humans.

    1.3. Deep Learning (DL): 

    Deep learning is again a smaller subset of machine learning, which is essentially a neural network with multiple layers. These neural networks attempt to simulate the behavior of the human brain, so you can say we are trying to create an artificial human brain. With one layer of a neural network, we can still make approximate predictions, and additional layers can help to optimize and refine for accuracy.

    2. Types of ML

    Before starting the implementation, we need to know the types of machine learning because it is very important to know which type is more suitable for our expected functionality.

    Figure 02

    2.1. Supervised Learning

    As the name suggests, in supervised learning, the learning happens under supervision. Supervision means the data provided to the machine is already classified, i.e., each piece of data has a fixed label, and inputs are already mapped to outputs.
    Once the machine is trained, it is ready to classify new data.
    This learning is useful for tasks like fraud detection, spam filtering, etc.

    2.2. Unsupervised Learning

    In unsupervised learning, the data given to machines to learn is purely raw, with no tags or labels. Here, the machine is the one that will create new classes by extracting patterns.
    This learning can be used for clustering, association, etc.

    2.3. Semi-Supervised Learning

    Both supervised and unsupervised learning have their own limitations — one requires labeled data and the other does not — so this learning combines the behavior of both, and with that, we can overcome those limitations.
    In this learning, we feed both raw data and categorized data to the machine so it can classify the raw data and, if necessary, create new clusters.

    2.4. Reinforcement Learning

    For this learning, we feed feedback on the previous output along with new incoming data to the machine so it can learn from its mistakes. This feedback loop continues until the machine reaches the desired output. The feedback is given by humans in the form of punishment or reward. For example, when a search algorithm returns a list of results and users only ever click the first one, that click behavior acts as feedback. It is like a human child who learns from every available option and grows by correcting its mistakes.
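    As a rough illustration of that reward/punishment loop — a deliberately simplified sketch, not a real reinforcement-learning algorithm or any library's API — a learner can keep a score per action, nudge the score toward each reward it receives, and always pick the best-scoring action:

```javascript
// Tiny reinforcement-style learner (illustrative only): keep a score per
// action, nudge it toward the reward received, and pick the best score.
function createLearner(actions, learningRate = 0.5) {
  const scores = Object.fromEntries(actions.map((a) => [a, 0]));
  return {
    // Exploit: choose the action with the highest learned score
    choose: () => actions.reduce((best, a) => (scores[a] > scores[best] ? a : best)),
    // Feedback loop: reward > 0 reinforces the action, reward < 0 punishes it
    feedback(action, reward) {
      scores[action] += learningRate * (reward - scores[action]);
    },
    scores,
  };
}

const learner = createLearner(['showResultA', 'showResultB']);
learner.feedback('showResultB', 1);  // user clicked this result: reward
learner.feedback('showResultA', -1); // user ignored this result: punishment
console.log(learner.choose()); // "showResultB"
```

    Real reinforcement learning adds exploration, state, and discounted future rewards, but the feedback loop is the same.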

    3. TensorFlow

    Machine learning is a complex process where we need to perform multiple activities, like acquiring and processing data, training models, serving predictions, and refining future results.

    To perform such operations, Google released a framework in November 2015 called TensorFlow. All the above-mentioned processes become easier if we use the TensorFlow framework.

    For this project, we are not going to use the complete TensorFlow framework but a smaller tool called TensorFlow Lite.

    3.1. TensorFlow Lite

    TensorFlow Lite allows us to run machine learning models on devices with limited resources, like limited memory or processing power.

    3.2. TensorFlow Lite Features

    • Optimized for on-device ML by addressing five key constraints: 
      • Latency: because there’s no round-trip to a server 
      • Privacy: because no personal data leaves the device 
      • Connectivity: because internet connectivity is not required 
      • Size: because of a reduced model and binary size
      • Power consumption: because of efficient inference and a lack of network connections
    • Support for Android and iOS devices, embedded Linux, and microcontrollers
    • Support for Java, Swift, Objective-C, C++, and Python programming languages
    • High performance, with hardware acceleration and model optimization
    • End-to-end examples for common machine learning tasks such as image classification, object detection, pose estimation, question answering, text classification, etc., on multiple platforms

    4. What is Flutter?

    Flutter is an open-source, cross-platform development framework. With Flutter, using a single code base, we can create applications for Android, iOS, the web, and desktop. It was created by Google and uses Dart as its development language. The first stable version of Flutter was released in December 2018, and since then, there have been many improvements. 

    5. Building an ML-Flutter Application

    We are now going to build a Flutter application that can determine a person’s state of mind from their facial expressions. The steps below explain the updates we need to make for an Android application. For iOS, please refer to the links provided in the steps.

    5.1. TensorFlow Lite – Native setup (Android)

    • In android/app/build.gradle, add the following setting in the android block:
    aaptOptions {
        noCompress 'tflite'
        noCompress 'lite'
    }

    5.2. TensorFlow Lite – Flutter setup (Dart)

    • Create an assets folder and place your label file and model file in it. (We will create these files shortly.) In pubspec.yaml, add:
    assets:
       - assets/labels.txt
       - assets/<file_name>.tflite

     

    Figure 02

    • Run this command to install the TensorFlow Lite package (it adds the following line under dependencies in your package’s pubspec.yaml and runs an implicit flutter pub get): 
    $ flutter pub add tflite

    dependencies:
         tflite: ^0.9.0

    • Now in your Dart code, you can use:
    import 'package:tflite/tflite.dart';

    • Add camera dependencies to your package’s pubspec.yaml (optional):
    dependencies:
         camera: ^0.10.0+1

    • Now in your Dart code, you can use:
    import 'package:camera/camera.dart';

    • As the camera is a hardware feature, there are a few updates we need to make in the native code for both Android & iOS. To learn more, visit:
    https://pub.dev/packages/camera
    • The following is the code that will appear under dependencies in pubspec.yaml once the setup is complete.
    Figure 03
    • Flutter will automatically download the most recent version if you omit the version number of a package.
    • Do not forget to add the assets folder in the root directory.

    5.3. Generate model (using website)

    • Click on Get Started

    • Select Image project
    • There are three different categories of ML projects available. We’ll choose an image project since we’re going to develop a project that analyzes a person’s facial expression to determine their emotional condition.
    • The other two types, audio project and pose project, will be useful for creating projects that involve audio operation and human pose indication, respectively.

    • Select Standard Image model
    • Again, there are two distinct groups of image machine learning projects. Since we are creating a project for an Android smartphone, we will select the standard image model.
    • The other type, the Embedded Image Model, is designed for hardware with relatively little memory and computing power.

    • Upload images for training the classes
    • We will create new classes by clicking on “Add a class.”
    • We must upload photographs to these classes as we are developing a project that analyzes a person’s emotional state from their facial expression.
    • The more photographs we upload, the more precise our result will be.
    • Click on Train Model and wait until training is complete
    • Click on Export model
    • Select TensorFlow Lite Tab -> Quantized  button -> Download my model

    5.4. Add files/models to the Flutter project

    • Labels.txt

    This file contains all the class names you created during model creation.

    • *.tflite

    The downloaded ZIP contains the model file along with its associated files.

    5.5. Load & Run ML-Model

    • We import the model from assets (with the tflite plugin, typically via Tflite.loadModel, pointing at the model and label files). This model will serve as the project’s brain.
    • We then configure the camera using a CameraController and obtain a live feed (cameras[0] is the first camera returned by availableCameras()).

    6. Conclusion

    As this blog shows, with TensorFlow Lite and a model trained through the website, we can add powerful machine learning capabilities to a Flutter app with surprisingly little effort.

  • Demystifying UI Frameworks and Theming for React Apps

    Introduction:

    In this blog, we will talk about design systems, dive into the different types of CSS frameworks/libraries, and then look into the issues that come with choosing a framework that is not right for your type of project. We will then go over the different business use cases that each of these frameworks/libraries matches.

    Let’s paint a scenario: when starting a project, you start by choosing a JS framework. Let’s say, for example, that you went with a popular framework like React. Depending on whether you want an isomorphic app, you will look at Next.js. Next, we choose a UI framework, and that’s when our next set of problems appears.

    WHICH ONE?

    It’s hard to go with even the popular ones because it might not be what you are looking for. There are different types of libraries handling different types of use cases, and there are so many similar ones that each handle stuff slightly differently.

    These frameworks come and go, so it’s essential to understand the fundamentals of CSS. These libraries and frameworks help you build faster; they don’t change how CSS works.

    But, continuing with our scenario, let’s say we choose a popular library like Bootstrap, or Material. Then, later on, as you’re working through the project, you notice issues like:

    – Overriding default classes more than required 

    – Ending up with ugly-looking code that’s hard to read

    – Bloated CSS that reduces performance (flash-of-unstyled-content issues, poor CLS and FCP scores)

    – Swiftly changing designs while you’re stuck with a rigid framework, so migrating is hard and requires a lot more effort

    – Requiring swift development but ending up building from scratch

    – Ending up with a div soup with no semantic meaning

    To solve these problems and understand how these frameworks work, we have segregated them into the following category types. 

    We will dig into each category and look at how they work, their pros/cons and their business use case.

    Categorizing the available frameworks:

    Vanilla Libraries 

    These libraries allow you to write vanilla CSS with some added benefits, like vendor prefixing and component-level scoping. You can use them as building blocks to create your own styling methodology. Essentially, it is mainly css-in-js-type libraries that fall into this category. CSS modules would also come under this since you are writing CSS in a module file.

    Also, inline styles in React seem to resemble a css-in-js type method, but they are different. For inline styles, you would lose out on media queries, keyframe animations, and selectors like pseudo-class, pseudo-element, and attribute selectors. But css-in-js type libraries have these abilities.  

    They also differ in how the CSS is output: inline styling results in inline CSS on the HTML element itself, whereas css-in-js outputs internal styles with generated class names.

    Nowadays, these css-in-js types are popular for their optimized critical render path strategy for performance.

    Example:

    Emotion

    import styled from '@emotion/styled';
    const Button = styled.button`
        padding: 32px;
        background-color: hotpink;
        font-size: 24px;
        border-radius: 4px;
        color: black;
        font-weight: bold;
        &:hover {
            color: white;
        }
    `
    render(<Button>This is my button component.</Button>)

    Styled Components

    import styled, { css } from 'styled-components';
    const Button = styled.a`
    /* This renders the buttons above... Edit me! */
    display: inline-block;
    border-radius: 3px;
    padding: 0.5rem 0;
    margin: 0.5rem 1rem;
    width: 11rem;
    background: transparent;
    color: white;
    border: 2px solid white;
    /* The GitHub button is a primary button
    * edit this to target it specifically! */
    ${props => props.primary && css`
    background: white;
    color: black;
    `}
    `

    List of example frameworks: 

       – Styled components

       – Emotion

       – Vanilla-extract

       – Stitches

       – CSS modules
    (CSS modules is not an official spec or an implementation in the browser, but rather, it’s a process in a build step (with the help of Webpack or Browserify) that changes class names and selectors to be scoped.)
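    To make that build step concrete: the tool rewrites each class name in a `.module.css` file to a unique scoped name and hands your JavaScript a lookup object. The snippet below only simulates that mapping — the naming scheme and hash suffix are hypothetical; real tools derive the suffix from a content hash:

```javascript
// Simulation of what a CSS Modules build step does: each class name in
// Button.module.css is rewritten to a unique scoped name, and your code
// receives a lookup object mapping authored names to generated ones.
function scopeClassNames(moduleName, classNames) {
  const hash = 'x7Gq2'; // stand-in for a build-time content hash
  const styles = {};
  for (const name of classNames) {
    styles[name] = `${moduleName}_${name}__${hash}`;
  }
  return styles;
}

// `import styles from './Button.module.css'` would give you roughly:
const styles = scopeClassNames('Button', ['primary', 'disabled']);
console.log(styles.primary); // "Button_primary__x7Gq2"
```

    This rewriting is why two components can each define a `.primary` class without colliding.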

    Pros:

    • Fully customizable—you can build on top of it
    • Doesn’t bloat CSS, only loads needed CSS
    • Performance
    • Little to no style collision

    Cons:

    • Requires effort and time to make components from scratch
    • Danger of writing smelly code
    • Have to handle accessibility on your own

    Where would you use these?

    • A website with an unconventional design that must be built from scratch.
    • Where performance and high webvital scores are required—the performance, in this case, refers to an optimized critical render path strategy that affects FCP and CLS.
    • Generally, it would be user-facing applications like B2C.

    Unstyled / Functional Libraries

    Before coming to the library, we would like to cover a bit on accessibility.

    Apart from a website’s visual stuff, there is also a functional aspect, accessibility.

    And many times, when we say accessibility in the context of web development, people automatically think of screen readers. But it doesn’t just mean making websites accessible to people with a disability; it means enabling as many people as possible to use your websites, whether they have a disability or are simply limited by their situation or device. 

    Different age groups

    Font size settings on phones and browser settings should be reflected on the app

    Situational limitation

    Dark mode and light mode

    Different devices

    Mobile, desktop, tablet

    Different screen sizes

    Ultra wide 21:9, normal monitor screen size 16:9 

    Interaction method

    Websites can be accessible with keyboard only, mouse, touch, etc.

    But these libraries mostly handle accessibility for users with disabilities, along with interaction methods and focus management. The rest is left to developers, including the settings that are more visual in nature, like screen sizes and light/dark mode.

    In general, ARIA attributes and roles are used to provide information about the interaction of a complex widget. The libraries here sprinkle this information onto their components before giving them to be styled.

    So, in short, these are low-level UI libraries that handle the functional part of the UI elements, like accessibility, keyboard navigation, or how they work. They come with little-to-no styling, which is meant to be overridden.
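    As a sketch of what "sprinkling ARIA information" looks like in practice — this mirrors the prop-getter pattern that headless libraries use, but it is not any particular library's API — here is the kind of attribute bundle such a library computes for a toggle button, leaving every visual decision to your styling layer:

```javascript
// Sketch of the "prop getter" a headless UI library provides for a toggle
// button: it computes the ARIA attributes and keyboard wiring, and leaves
// all styling to you. (Illustrative only, not a real library API.)
function getToggleButtonProps({ pressed, disabled = false, onToggle }) {
  return {
    role: 'button',
    tabIndex: disabled ? -1 : 0,       // keep enabled buttons in the tab order
    'aria-pressed': pressed,           // announces on/off state to screen readers
    'aria-disabled': disabled || undefined,
    onClick: disabled ? undefined : onToggle,
    onKeyDown: (event) => {
      // Space and Enter must activate a button for keyboard users
      if (!disabled && (event.key === ' ' || event.key === 'Enter')) onToggle();
    },
  };
}

const props = getToggleButtonProps({ pressed: true, onToggle: () => {} });
console.log(props.role, props['aria-pressed']); // button true
```

    Spreading such props onto an element is what gives keyboard and screen-reader support; the library examples below ship far more complete versions of this wiring.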

    Radix UI

    // Compose a Dialog with custom focus management
    import React from 'react';
    import * as Dialog from '@radix-ui/react-dialog';
    // DialogContent is assumed to be a styled wrapper around Dialog.Content
    export const InfoDialog = ({ children }) => {
        const dialogCloseButton = React.useRef(null);
        return (
            <Dialog.Root>
                <Dialog.Trigger>View details</Dialog.Trigger>
                <Dialog.Overlay />
                <Dialog.Portal>
                    <DialogContent
                        onOpenAutoFocus={(event) => {
                        // Focus the close button when dialog opens
                            dialogCloseButton.current?.focus();
                            event.preventDefault();
                        }}>
                        {children}
                        <Dialog.Close ref={dialogCloseButton}>
                            Close
                        </Dialog.Close>
                    </DialogContent>
                </Dialog.Portal>
            </Dialog.Root>
        )
    } 

    React Aria

    import React from "react";
    import { useBreadcrumbs } from "react-aria";
    function Breadcrumbs(props) {
        let { navProps } = useBreadcrumbs(props);
        let children = React.Children.toArray(props.children);
        return (
            <nav {...navProps}>
                <ol style={{ display: 'flex', listStyle: 'none', margin: 0 }}>
                    {children.map((child, i) =>
                        React.cloneElement(child, { isCurrent: i === children.length - 1 })
                    )}
                </ol>
            </nav>
        )
    }

    List of the frameworks:

    • Radix UI
    • Reach UI
    • React Aria, React Stately (by Adobe)
    • Headless-UI

    Pros:

    • Gives perfect accessibility and functionality
    • Gives the flexibility to create composable elements
    • Unopinionated styling, free to override

    Cons:

    • Can’t be used for a rapid development project or prototyping
    • Have to understand the docs thoroughly to continue development at a normal pace

    Where would you use these?

    • Simple content sites, like news or article pages, generally won’t require these.
    • Applications where accessibility is more important than styling and design (Government websites, banking, or even internal company apps).
    • Applications where importance is given to both accessibility and design, so customizability to these components is preferred (Teamflow, CodeSandbox, Vercel).
    • Can be paired with Vanilla libraries to provide performance with accessibility.
    • Can be paired with utility-style libraries to provide relatively faster development with accessibility.

    Utility Styled Library / Framework

    These types of libraries allow you to style your elements through their interfaces, either through class names or component props using composable individual CSS properties as per your requirements. The strongest point you have with such libraries is the flexibility of writing custom CSS properties. With these libraries, you would often require a “wrapper” class or components to be able to reuse them. 

    These libraries put many utility classes into your HTML and generate a global stylesheet, which can impact performance. There is an option to improve this by purging the unused CSS from your project in a build step, but even with that, the performance won’t be as good as css-in-js: the purge looks at class names throughout the whole project and removes only those with no reference anywhere. So, when loading a page, you would still load CSS that is used on other pages rather than the current one.
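    As one concrete example of such a purge step, Tailwind (v3 and later) drives it from the `content` globs in its config file, which tell the scanner which files to search for class names. A standard config sketch — the paths here are hypothetical; use your project's real source layout:

```javascript
// tailwind.config.js (sketch) — classes not found in the files matched by
// these globs are dropped from the generated stylesheet at build time.
// The paths are hypothetical; substitute your project's real layout.
module.exports = {
  content: [
    './src/**/*.{js,jsx,ts,tsx}',
    './public/index.html',
  ],
  theme: {
    extend: {},
  },
  plugins: [],
};
```

    Note that the limitation described above still applies: the scan is project-wide, so a class used on only one page still ships in the CSS loaded by every page.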

    Tailwind

    const people = [
      {
        name: 'Calvin Hawkins',
        email: 'calvin.hawkins@example.com',
        image:
          'https://images.unsplash.com/photo-1491528323818-fdd1faba62cc?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&auto=format&fit=facearea&facepad=2&w=256&h=256&q=80',
      },
      {
        name: 'Kristen Ramos',
        email: 'kristen.ramos@example.com',
        image:
          'https://images.unsplash.com/photo-1550525811-e5869dd03032?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&auto=format&fit=facearea&facepad=2&w=256&h=256&q=80',
      },
      {
        name: 'Ted Fox',
        email: 'ted.fox@example.com',
        image:
          'https://images.unsplash.com/photo-1500648767791-00dcc994a43e?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&auto=format&fit=facearea&facepad=2&w=256&h=256&q=80',
      },
    ]
    
    export default function Example() {
      return (
        <ul className="divide-y divide-gray-200">
          {people.map((person) => (
            <li key={person.email} className="py-4 flex">
              <img className="h-10 w-10 rounded-full" src={person.image} alt="" />
              <div className="ml-3">
                <p className="text-sm font-medium text-gray-900">{person.name}</p>
                <p className="text-sm text-gray-500">{person.email}</p>
              </div>
            </li>
          ))}
        </ul>
      )
    }

    Chakra UI

    import { MdStar } from "react-icons/md";
    import { Badge, Box, Center, Flex, Image, Text } from "@chakra-ui/react";
    
    export default function Example() {
      return (
        <Center h="100vh">
          <Box p="5" maxW="320px" borderWidth="1px">
            <Image borderRadius="md" src="https://bit.ly/2k1H1t6" />
            <Flex align="baseline" mt={2}>
              <Badge colorScheme="pink">Plus</Badge>
              <Text
                ml={2}
                textTransform="uppercase"
                fontSize="sm"
                fontWeight="bold"
                color="pink.800"
              >
                Verified • Cape Town
              </Text>
            </Flex>
            <Text mt={2} fontSize="xl" fontWeight="semibold" lineHeight="short">
              Modern, Chic Penthouse with Mountain, City & Sea Views
            </Text>
            <Text mt={2}>$119/night</Text>
            <Flex mt={2} align="center">
              <Box as={MdStar} color="orange.400" />
              <Text ml={1} fontSize="sm">
                <b>4.84</b> (190)
              </Text>
            </Flex>
          </Box>
        </Center>
      );
    }

    List of the frameworks

    • Tailwind
    • Chakra UI (although it has some prebuilt components, its concept is derived from Tailwind)
    • Tachyons
    • xStyled

    Pros:

    • Rapid development and prototyping
    • Gives flexibility to styling
    • Enforces a little consistency; you don’t have to use magic numbers while creating the layout (spacing values, responsive variables like xs, sm, etc.)
    • Less context switching—you’ll write CSS in your HTML elements

    Cons:

    • End up with ugly-looking/hard-to-read code
    • Little emphasis on prebuilt components, so you have to handle accessibility yourself
    • Creates a global stylesheet that can carry unused classes

    Where would you use these?

    • Easier composition of simpler components to build large applications.
    • Modular applications where rapid customization is required, like font sizes, color palettes, themes, etc.
    • FinTech or healthcare applications where you need features like theme-based toggling in light/dark mode to be already present.
    • Application where responsive design is supported out of the box, along with ease of accessibility and custom breakpoints for responsiveness.

    Pre-styled / All-In-One Framework 

    These are popular frameworks that come with pre-styled, ready-to-use components out of the box with little customization.

    These are heavy libraries that have fixed styling that can be overridden. However, generally speaking, overriding the classes would just load in extra CSS, which just clogs up the performance. These kinds of libraries are generally more useful for rapid prototyping and not in places with heavy customization and priority on performance.

    These are quite beginner-friendly as well, but if you are a beginner, it is best to understand the basics and fundamentals of CSS rather than fully relying on frameworks like these as a crutch. That said, these frameworks do have their pros when it comes to speed of development.

    Material UI

    import { Box, Button, Input, Typography } from "@mui/material";

    <Box
            component="form"
            className="lgn-form-content"
            id="loginForm"
            onSubmit={formik.handleSubmit}
          >
            <Input
              id="activationCode"
              placeholder="Enter 6 Digit Auth Code"
              className="lgn-form-input"
              type="text"
              onChange={formik.handleChange}
              value={formik.values.activationCode}
            />
    
            <Button
              sx={{ marginBottom: "24px", marginTop: "1rem" }}
              type="submit"
              className="lgn-form-submit"
              form="loginForm"
              onKeyUp={(e) =>
                keyUpHandler(e, formik.handleSubmit, formik.isSubmitting)
              }
            >
              <Typography className="lgn-form-submit-text">
                Activate & Sign In
              </Typography>
            </Button>
            {formik.errors.activationCode && formik.touched.activationCode ? (
              <Typography color="white">{formik.errors.activationCode}</Typography>
            ) : null}
    </Box>

    Bootstrap

    // Example using the react-bootstrap Accordion API
    <Accordion defaultActiveKey="0">
       <Accordion.Item eventKey="0">
         <Accordion.Header className="editor-accordion-label">RULES</Accordion.Header>
         <Accordion.Body>
           <div className="editor-detail-panel editor-detail-panel-column">
             <div className="label">Define conditional by adding a rule</div>
             <div className="rule-actions"></div>
           </div>
         </Accordion.Body>
       </Accordion.Item>
    </Accordion>

    List of the framework:

    • Bootstrap
    • Semantic UI
    • Material UI
    • Bulma
    • Mantine 

    Pros: 

    • Faster development, saves time since everything comes out of the box.
    • Helps avoid cross-browser bugs
    • Helps follow best practices (accessibility)

    Cons:

    • Low customization
    • Have to become familiar with the framework and its nuances
    • Bloated CSS, since it loads everything from the framework on top of your overridden styles

    Where would you use these?

    • Focus is not on nitty-gritty design but on development speed and functionality.
    • Enterprise apps where the UI structure of the application isn’t dynamic and doesn’t get altered a lot.
    • B2B apps mostly where the focus is on getting the functionality out fast—UX is mostly driven by ease of use of the functionality with a consistent UI design.
    • Applications where you want to focus more on cross-browser compatibility.

    Conclusion:

    This is not a hard and fast rule; there are still a bunch of parameters that aren’t covered in this blog, like developer preference, or legacy code that already uses a pre-existing framework. So, pick one that seems right for you, considering the parameters in and outside this blog and your judgment.

    To summarize a little on the pros and cons of the above categories, here is a TLDR diagram:

    Pictorial Representation of the Summary

  • Cube – An Innovative Framework to Build Embedded Analytics

    Historically, embedded analytics was thought of as an integral part of a comprehensive business intelligence (BI) system. However, when we considered our particular needs, we soon realized something more innovative was necessary. That is when we came across Cube (formerly CubeJS), a powerful platform that could revolutionize how we think about embedded analytics solutions.

    This new way of modularizing analytics solutions means businesses can access the exact services and features they require at any given time without purchasing a comprehensive suite of analytics services, which can often be more expensive and complex than necessary.

    Furthermore, Cube makes it very easy to link up data sources and start to get to grips with analytics, which provides clear and tangible benefits for businesses. This new tool has the potential to be a real game changer in the world of embedded analytics, and we are very excited to explore its potential.

    Understanding Embedded Analytics

    When you read a word like “embedded analytics” or something similar, you probably think of an HTML embed tag or an iFrame tag. This is because analytics was considered a separate application and not part of the SaaS application, so the market had tools specifically for analytics.

    “Embedded analytics is a digital workplace capability where data analysis occurs within a user’s natural workflow, without the need to toggle to another application. Moreover, embedded analytics tends to be narrowly deployed around specific processes such as marketing campaign optimization, sales lead conversions, inventory demand planning, and financial budgeting.” – Gartner

    Embedded Analytics is not just about importing data into an iFrame—it’s all about creating an optimal user experience where the analytics feel like they are an integral part of the native application. To ensure that the user experience is as seamless as possible, great attention must be paid to how the analytics are integrated into the application. This can be done with careful thought to design and by anticipating user needs and ensuring that the analytics are intuitive and easy to use. This way, users can get the most out of their analytics experience.

    Existing Solutions

    With the rising demand for SaaS applications and the number of SaaS applications being built daily, analytics must become part of the SaaS application itself.

    We have identified three different categories of existing solutions available in the market.

    Traditional BI Platforms

    Many tools, such as GoodData, Tableau, Metabase, Looker, and Power BI, are part of the big and traditional BI platforms. Despite their wide range of features and capabilities, these platforms are held back by their big monolith architecture, limited customization, and less-than-intuitive user interfaces, which make them difficult and time-consuming to integrate.

    Here are a few reasons these are not suitable for us:

    • They lack customization, and their UI is not intuitive, so they won’t be able to match our UX needs.
    • They charge a hefty amount, which is unsuitable for startups or small-scale companies.
    • They have a big monolith architecture, making integrating with other solutions difficult.

    New Generation Tools

    The next experiment taking place in the market is the introduction of tools such as Hex, Observable, Streamlit, etc. These tools are better suited for embedded needs and customization, but they are designed for developers and data scientists. Although the go-to-market time is shorter, these tools cannot be integrated into SaaS applications.

    Here are a few reasons why these are not suitable for us:

    • They are not suitable for non-technical people and cannot integrate with Software-as-a-Service (SaaS) applications.
    • Since they are mainly built for developers and data scientists, they don’t provide a good user experience.
    • They are not capable of handling multiple data sources simultaneously.
    • They do not provide pre-aggregation and caching solutions.

    In-House Tools

    Instead of paying other platforms, it is possible to build everything in-house from scratch using API servers and GraphQL. However, there is a catch: analytics requirements are not straightforward and take a lot of expertise to build, creating a big hurdle in adoption and resulting in a longer time-to-market.

    Here are a few reasons why these are not suitable for us:

    • Building everything in-house requires a lot of expertise and time, thus resulting in a longer time to market.
    • It requires developing a secure authentication and authorization system, which adds to the complexity.
    • It requires the development of a caching system to improve the performance of analytics.
    • It requires the development of a real-time system for dynamic dashboards.
    • It requires the development of complex SQL queries to query multiple data sources.

    Typical Analytics Features

    If you want to build analytics features, the typical requirements look like this:

    Multi-Tenancy

    When developing software-as-a-service (SaaS) applications, it is often necessary to incorporate multi-tenancy into the architecture. This means multiple users will be accessing the same software application, but with a unique and individualized experience. To guarantee that this experience is not compromised, it is essential to ensure that the same multi-tenancy principles are carried over into the analytics solution that you are integrating into your SaaS application. It is important to remember that this will require additional configuration and setup on your part to ensure that all of your users have access to the same level of tools and insights.

    Intuitive Charts

    If you look at some of the available analytics tools, they may have good charting features, but they may not be able to meet your specific UX needs. In today’s world, many advanced UI libraries and designs are available, which are often far more effective than the charting features of analytics tools. Integrating these solutions could help you create a more user-friendly experience tailored specifically to your business requirements.

    Security

    You want to have authentication and authorization for your analytics so that managers can get an overview of the analytics for their entire team, while individual users can only see their own analytics. Furthermore, you may want to grant users with certain roles access to certain analytics charts and other data to better understand how their team is performing. To ensure that your analytics are secure and that only the right people have access to the right information, it is vital to set up an authentication and authorization system.

    Caching

    Caching is an incredibly powerful tool for improving the performance and economics of serving your analytics. By implementing a good caching solution, you can see drastic improvements in the speed and efficiency of your analytics, while also providing an improved user experience. Additionally, the cost savings associated with this approach can be quite significant, providing you with a greater return on investment. Caching can be implemented in various ways, but the most effective approaches are tailored to the specific needs of your analytics. By leveraging the right caching solutions, you can maximize the benefits of your analytics and ensure that your users have an optimized experience.

    Real-time

    Nowadays, every successful SaaS company understands the importance of having dynamic and real-time dashboards; these dashboards provide users with the ability to access the latest data without requiring them to refresh the tab each and every time. By having real-time dashboards, companies can ensure their customers have access to the latest information, which can help them make more informed decisions. This is why it is becoming increasingly important for SaaS organizations to invest in robust, low-latency dashboard solutions that can deliver accurate, up-to-date data to their customers.

    Drilldowns

    Drilldown is an incredibly powerful analytics capability that enables users to rapidly transition from an aggregated, top-level overview of their data to a more granular, in-depth view. This can be achieved simply by clicking on a metric within a dashboard or report. With drill-down, users can gain a greater understanding of the data by uncovering deeper insights, allowing them to more effectively evaluate the data and gain a more accurate understanding of their data trends.

    Data Sources

    With the prevalence of software as a service (SaaS) applications, there could be a range of different data sources used, including PostgreSQL, DynamoDB, and other types of databases. As such, it is important for analytics solutions to be capable of accommodating multiple data sources at once to provide the most comprehensive insights. By leveraging the various sources of information, in conjunction with advanced analytics, businesses can gain a thorough understanding of their customers, as well as trends and behaviors. Additionally, accessing and combining data from multiple sources can allow for more precise predictions and recommendations, thereby optimizing the customer experience and improving overall performance.

    Budget

    Pricing is one of the most vital aspects to consider when selecting an analytics tool. Pricing models vary widely: some, such as Amazon QuickSight's, can be quite complex, while per-user pricing can be very expensive for larger organizations. Additionally, there is custom pricing, which requires you to contact customer care to get the right price; this can be quite a difficult process and may create a big barrier to adoption. Ultimately, it is important to understand the different pricing models available and how they may affect your budget before selecting an analytics tool.

    After examining all the requirements, we came across a solution like Cube, which is an innovative solution with the following features:

    • Open Source: Since it is open source, you can easily do a proof-of-concept (POC) and get good support, as any vulnerabilities will be fixed quickly.
    • Modular Architecture: It can provide good customizations, such as using Cube to use any custom charting library you prefer in your current framework.
    • Embedded Analytics-as-a-Code: You can easily replicate your analytics and version control it, as Cube is analytics in the form of code.
    • Cloud Deployments: It is a new-age tool, so it comes with good support with Docker or Kubernetes (K8s). Therefore, you can easily deploy it on the cloud.

    Cube Architecture

    Let’s look at the Cube architecture to understand why Cube is an innovative solution.

    • Cube supports multiple data sources simultaneously; your data may be stored in Postgres, Snowflake, and Redshift, and you can connect to all of them simultaneously. Additionally, they have a long list of data sources they can support.
    • Cube provides analytics over a REST API; very few analytics solutions provide chart data or metrics over REST APIs.
    • The security you might be using for your application can easily be mirrored for Cube. This helps simplify the security aspects, as you don’t need to maintain multiple tokens for the app and analytics tool.
    • Cube provides a unique way to model your data in JSON format; it’s more similar to an ORM. You don’t need to write complex SQL queries; once you model your data, Cube will generate the SQL to query the data source.
    • Cube has very good pre-aggregation and caching solutions.
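    To make the REST API point concrete, below is a sketch of the JSON query format Cube accepts. The /cubejs-api/v1/load path is Cube's standard load endpoint; the cube and member names (Orders.count, Orders.status, Orders.createdAt) are illustrative assumptions.

    ```javascript
    // Sketch of a Cube REST API query (cube/member names are illustrative).
    const query = {
      measures: ['Orders.count'],
      dimensions: ['Orders.status'],
      timeDimensions: [
        { dimension: 'Orders.createdAt', granularity: 'day', dateRange: 'last 7 days' },
      ],
    };

    // The query object is passed URL-encoded to the /load endpoint:
    const url =
      'http://localhost:4000/cubejs-api/v1/load?query=' +
      encodeURIComponent(JSON.stringify(query));

    console.log(url);
    ```

    Cube responds with rows for the requested measures and dimensions, which the frontend can feed into any charting library.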

    Cube Deep Dive

    Let’s look into different concepts that we just saw briefly in the architecture diagram.

    Data Modeling

    Cube

    A cube represents a table of data and is conceptually similar to a view in SQL. It's like an ORM where you can define a schema, extend it, or define abstract cubes to make code reusable. For example, if you have a Customer table, you need to write a Cube for it. Using cubes, you can build analytical queries.

    Each cube contains definitions of measures, dimensions, segments, and joins between cubes. Cube bifurcates columns into measures and dimensions. Similar to tables, every cube can be referenced in another cube. Even though a cube is a table representation, you can choose which columns you want to expose for analytics. You add only the columns you want to expose; Cube then translates only the dimensions and measures referenced in a query into the generated SQL (a push-down mechanism).

    cube('Orders', {
      sql: `SELECT * FROM orders`,
    });

    Dimensions

    You can think about a dimension as an attribute related to a measure, for example, the measure userCount. This measure can have different dimensions, such as country, age, occupation, etc.

    Dimensions allow us to further subdivide and analyze the measure, providing a more detailed and comprehensive picture of the data.

    cube('Orders', {
    
      ...,
    
      dimensions: {
        status: {
          sql: `status`,
          type: `string`,
        },
      },
    });

    Measures

    These parameters/SQL columns allow you to define the aggregations for numeric or quantitative data. Measures can be used to perform calculations such as sum, minimum, maximum, average, and count on any set of data.

    Measures also help you define filters if you want to add some conditions for a metric calculation. For example, you can set thresholds to filter out any data that is not within the range of values you are looking for.

    Additionally, measures can be used to create additional metrics, such as the ratio between two different measures or the percentage of a measure. With these powerful tools, you can effectively analyze and interpret your data to gain valuable insights.

    cube('Orders', {
    
      ...,
    
      measures: {
        count: {
          type: `count`,
        },
      },
    });
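    The filters mentioned above can be attached directly to a measure. This is a sketch following the document's Orders cube, where only rows matching the condition are counted:

    ```javascript
    cube('Orders', {

      ...,

      measures: {
        completedCount: {
          type: `count`,
          // only rows matching this condition contribute to the count
          filters: [{ sql: `${CUBE}.status = 'completed'` }],
        },
      },
    });
    ```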

    Joins

    Joins define the relationships between cubes, which then allows accessing and comparing properties from two or more cubes at the same time. In Cube, all joins are LEFT JOINs. This also allows you to represent one-to-one, one-to-many, and many-to-one relationships easily.

    cube('Orders', {
    
      ...,
    
      joins: {
        LineItems: {
          relationship: `belongsTo`,
          // Here we use the `CUBE` global to refer to the current cube,
          // so the following is equivalent to `Orders.id = LineItems.order_id`
          sql: `${CUBE}.id = ${LineItems}.order_id`,
        },
      },
    });

    There are three kinds of join relationships:

    • belongsTo
    • hasOne
    • hasMany

    Segments

    Segments are filters predefined in the schema instead of a Cube query. Segments help pre-build complex filtering logic, simplifying Cube queries and making it easy to re-use common filters across a variety of queries.

    To add a segment that limits results to completed orders, we can do the following:

    cube('Orders', {
      ...,
      segments: {
        onlyCompleted: {
          sql: `${CUBE}.status = 'completed'`,
        },
      },
    });
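    Once defined, a segment is referenced by name in a query, alongside measures and dimensions. A sketch of the Cube JSON query format, using the segment from the example above:

    ```javascript
    // Querying with a predefined segment (names follow the Orders cube above).
    const query = {
      measures: ['Orders.count'],
      segments: ['Orders.onlyCompleted'],
    };

    console.log(JSON.stringify(query));
    ```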

    Pre-Aggregations

    Pre-aggregations are a powerful way of caching frequently-used, expensive queries and keeping the cache up-to-date periodically. The most popular roll-up pre-aggregation is summarized data of the original cube grouped by any selected dimensions of interest. It works on “measure types” like count, sum, min, max, etc.

    When queries execute over a smaller dataset, the application works well and delivers responses within acceptable thresholds. However, as the size of the dataset grows, the time-to-response from a user's perspective can often suffer quite heavily. Cube analyzes queries against a defined set of pre-aggregation rules to choose the optimal one, which is then used to create the pre-aggregation table. A pre-aggregation specifies attributes from the source, which Cube uses to condense (or crunch) the data. This simple yet powerful optimization can reduce the size of the dataset by several orders of magnitude, and it ensures that subsequent queries can be served by the same condensed dataset if any matching attributes are found.

    A granularity can also be specified, which defines the granularity of data within the pre-aggregation. If it is set to week, for example, then Cube will pre-aggregate the data by week and persist it to Cube Store.

    Cube can also take care of keeping pre-aggregations up-to-date with the refreshKey property. By default, it is set to every: ‘1 hour’.

    cube('Orders', {
    
      ...,
    
      preAggregations: {
        main: {
          measures: [CUBE.count],
          dimensions: [CUBE.status],
          timeDimension: CUBE.createdAt,
          granularity: 'day',
        },
      },
    });
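    To override the default one-hour refresh mentioned above, a refreshKey can be declared on the pre-aggregation itself. A sketch extending the example (the 30-minute interval is an arbitrary illustration):

    ```javascript
    cube('Orders', {

      ...,

      preAggregations: {
        main: {
          measures: [CUBE.count],
          dimensions: [CUBE.status],
          timeDimension: CUBE.createdAt,
          granularity: 'day',
          // rebuild this rollup every 30 minutes instead of the default 1 hour
          refreshKey: {
            every: '30 minutes',
          },
        },
      },
    });
    ```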

    Additional Cube Concepts

    Let’s look into some of the additional concepts that Cube provides that make it a very unique solution.

    Caching

    Cube provides a two-level caching system. The first level is in-memory cache, which is active by default. Cube in-memory cache acts as a buffer for your database when there is a burst of requests hitting the same data from multiple concurrent users, while pre-aggregations are designed to provide the right balance between time to insight and querying performance.

    The second level of caching is called pre-aggregations, and requires explicit configuration to activate.

    Drilldowns

    Drilldowns are a powerful feature to facilitate data exploration. They allow you to build an interface that lets users dive deeper into visualizations and data tables. See ResultSet.drillDown() on how to use this feature on the client side.

    A drilldown is defined on the measure level in your data schema. It is defined as a list of dimensions called drill members. Once defined, these drill members will always be used to show underlying data when drilling into that measure.
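    In the schema, drill members are declared on the measure via the drillMembers property. A sketch following the Orders cube (the id and createdAt dimensions are assumed to exist on the cube):

    ```javascript
    cube('Orders', {

      ...,

      measures: {
        count: {
          type: `count`,
          // dimensions shown as the underlying rows when drilling into this measure
          drillMembers: [CUBE.id, CUBE.status, CUBE.createdAt],
        },
      },
    });
    ```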

    Subquery

    You can use subqueries within dimensions to reference measures from other cubes inside a dimension. Under the hood, it behaves as a correlated subquery, but is implemented via joins for optimal performance and portability.

    For example, the following SQL can be written using a subquery in cubes as:

    SELECT
      id,
      (SELECT SUM(amount) FROM deals WHERE deals.sales_manager_id = sales_managers.id) AS deals_amount
    FROM sales_managers
    GROUP BY 1

    Cube Representation

    cube(`Deals`, {
      sql: `SELECT * FROM deals`,
      measures: {
        amount: {
          sql: `amount`,
          type: `sum`,
        },
      },
    });
    
    cube(`SalesManagers`, {
      sql: `SELECT * FROM sales_managers`,
    
      joins: {
        Deals: {
          relationship: `hasMany`,
          sql: `${SalesManagers}.id = ${Deals}.sales_manager_id`,
        },
      },
    
      measures: {
        averageDealAmount: {
          sql: `${dealsAmount}`,
          type: `avg`,
        },
      },
    
      dimensions: {
        dealsAmount: {
          sql: `${Deals.amount}`,
          type: `number`,
          subQuery: true,
        },
      },
    });

    Apart from these, Cube also provides advanced concepts such as Export and Import, Extending Cubes, Data Blending, Dynamic Schema Creation, and Polymorphic Cubes. You can read more about them in the Cube documentation.

    Getting Started with Cube

    Getting started with Cube is very easy. All you need to do is follow the instructions on the Cube documentation page.

    The quickest way is to use Docker. With Docker, you can install Cube in a few easy steps:

    1. In a new folder for your project, run the following command:

    docker run -p 4000:4000 -p 3000:3000 \
      -v ${PWD}:/cube/conf \
      -e CUBEJS_DEV_MODE=true \
      cubejs/cube

    2. Head to http://localhost:4000 to open Developer Playground.

    The Developer Playground has a database connection wizard that loads when Cube is first started up and no .env file is found. After database credentials have been set up, a .env file will automatically be created and populated with the same credentials.

    Click on the type of database to connect to, and you’ll be able to enter credentials:

    After clicking Apply, you should see available tables from the configured database. Select one to generate a data schema. Once the schema is generated, you can execute queries on the Build tab.

    Conclusion

    Cube is a revolutionary, open-source framework for building embedded analytics applications. It offers a unified API for connecting to any data source, comprehensive visualization libraries, and a data-driven user experience that makes it easy for developers to build interactive applications quickly. With Cube, developers can focus on the application logic and let the framework take care of the data, making it an ideal platform for creating data-driven applications that can be deployed on the web, mobile, and desktop. It is an invaluable tool for any developer interested in building sophisticated analytics applications quickly and easily.

  • How to deploy GitHub Actions Self-Hosted Runners on Kubernetes

    GitHub Actions jobs are run in the cloud by default; however, sometimes we want to run jobs in our own customized/private environment where we have full control. That is where a self-hosted runner saves us from this problem. 

    To get a basic understanding of running self-hosted runners on the Kubernetes cluster, this blog is perfect for you. 

    We’ll be focusing on running GitHub Actions on a self-hosted runner on Kubernetes. 

    An example use case would be to create an automation in GitHub Actions to execute MySQL queries on MySQL Database running in a private network (i.e., MySQL DB, which is not accessible publicly).

    A self-hosted runner normally requires the provisioning and configuration of a virtual machine instance; here, we are running it on Kubernetes instead. The actions-runner-controller makes it possible to run self-hosted runners on a Kubernetes cluster.

    This blog aims to try out self-hosted runners on Kubernetes and covers:

    1. Deploying MySQL Database on minikube, which is accessible only within Kubernetes Cluster.
    2. Deploying self-hosted action runners on the minikube.
    3. Running GitHub Action on minikube to execute MySQL queries on MySQL Database.

    Steps for completing this tutorial:

    Create a GitHub repository

    1. Create a private repository on GitHub. I am creating it with the name velotio/action-runner-poc.

    Setup a Kubernetes cluster using minikube

    1. Install Docker.
    2. Install Minikube.
    3. Install Helm.
    4. Install kubectl.

    Install cert-manager on a Kubernetes cluster

    • By default, actions-runner-controller uses cert-manager for certificate management of its admission webhook, so we have to make sure cert-manager is installed on Kubernetes before we install actions-runner-controller.
    • Install cert-manager on minikube using its Helm chart (see the cert-manager Helm installation documentation for the commands).
    • Verify the installation using “kubectl --namespace cert-manager get all”. If everything is okay, you will see the cert-manager pods, services, and deployments listed.

    Setting Up Authentication for Hosted Runners‍

    There are two ways for actions-runner-controller to authenticate with the GitHub API (only one can be configured at a time):

    1. Using a GitHub App (not supported for enterprise-level runners due to lack of support from GitHub.)
    2. Using a PAT (personal access token)

    To keep this blog simple, we are going with PAT.

    To authenticate actions-runner-controller with the GitHub API, we can use a PAT, which the controller uses to register self-hosted runners.

    • Go to account > Settings > Developer settings > Personal access tokens. Click on “Generate new token”. Under scopes, select “Full control of private repositories”.
    • Click on the “Generate token” button.
    • Copy the generated token and run the below commands to create a Kubernetes secret, which will be used by the actions-runner-controller deployment.
    export GITHUB_TOKEN=XXXxxxXXXxxxxXYAVNa 

    kubectl create ns actions-runner-system

    Create secret

    kubectl create secret generic controller-manager -n actions-runner-system \
      --from-literal=github_token=${GITHUB_TOKEN}

    Install actions-runner-controller on the Kubernetes cluster

    • Run the below helm commands
    helm repo add actions-runner-controller https://actions-runner-controller.github.io/actions-runner-controller
    helm repo update
    helm upgrade --install --namespace actions-runner-system \
      --create-namespace --wait actions-runner-controller \
      actions-runner-controller/actions-runner-controller \
      --set syncPeriod=1m

    • Verify that actions-runner-controller installed properly using the below command:
    kubectl --namespace actions-runner-system get all

     

    Create a Repository Runner

    • Create a RunnerDeployment Kubernetes object, which will create a self-hosted runner named k8s-action-runner for the GitHub repository velotio/action-runner-poc.
    • Update the repo name from “velotio/action-runner-poc” to “<Your-repo-name>”.
    • To create the RunnerDeployment object, create the file runner.yaml as follows:
    apiVersion: actions.summerwind.dev/v1alpha1
    kind: RunnerDeployment
    metadata:
     name: k8s-action-runner
     namespace: actions-runner-system
    spec:
     replicas: 2
     template:
       spec:
         repository: velotio/action-runner-poc

    • To create, run this command:
    kubectl create -f runner.yaml

    Check that the pod is running using the below command:

    kubectl get pod -n actions-runner-system | grep -i "k8s-action-runner"

    • If everything goes well, you should see two action runners on the Kubernetes cluster, and the same registered on GitHub. Check under Settings > Actions > Runners of your repository.
    • Check the pod with kubectl get po -n actions-runner-system

    Install a MySQL Database on the Kubernetes cluster

    • Create PV and PVC for MySQL Database. 
    • Create mysql-pv.yaml with the below content.
    apiVersion: v1
    kind: PersistentVolume
    metadata:
     name: mysql-pv-volume
     labels:
       type: local
    spec:
     capacity:
       storage: 2Gi
     accessModes:
       - ReadWriteOnce
     hostPath:
       path: "/mnt/data"
    ---
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
     name: mysql-pv-claim
    spec:
     accessModes:
       - ReadWriteOnce
     resources:
       requests:
         storage: 2Gi

    • Create mysql namespace
    kubectl create ns mysql

    • Now apply mysql-pv.yaml to create PV and PVC 
    kubectl create -f mysql-pv.yaml -n mysql

    Create the file mysql-svc-deploy.yaml and add the below content:

    Here, we have used MYSQL_ROOT_PASSWORD as “password”.

    apiVersion: v1
    kind: Service
    metadata:
     name: mysql
    spec:
     ports:
       - port: 3306
     selector:
       app: mysql
     clusterIP: None
    ---
    apiVersion: apps/v1
    kind: Deployment
    metadata:
     name: mysql
    spec:
     selector:
       matchLabels:
         app: mysql
     strategy:
       type: Recreate
     template:
       metadata:
         labels:
           app: mysql
       spec:
         containers:
           - image: mysql:5.6
             name: mysql
             env:
                 # Use secret in real usage
               - name: MYSQL_ROOT_PASSWORD
                 value: password
             ports:
               - containerPort: 3306
                 name: mysql
             volumeMounts:
               - name: mysql-persistent-storage
                 mountPath: /var/lib/mysql
         volumes:
           - name: mysql-persistent-storage
             persistentVolumeClaim:
               claimName: mysql-pv-claim

    • Create the service and deployment
    kubectl create -f mysql-svc-deploy.yaml -n mysql

    • Verify that the MySQL database is running
    kubectl get po -n mysql

    Create a GitHub repository secret to store MySQL password

    We will use the MySQL password in the GitHub Actions workflow file; as a good practice, we should not keep it in plain text. So we will store the MySQL password in GitHub secrets and reference the secret in our workflow file.

    • Create a secret in the GitHub repository, name it “MYSQL_PASS”, and enter “password” as the value.

    Create a GitHub workflow file

    • GitHub workflows are written in YAML syntax. Each workflow lives in a separate YAML file stored in the .github/workflows/ directory. So, create a .github/workflows/ directory in your repository and create the file .github/workflows/mysql_workflow.yaml as follows:
    ---
    name: Example 1
    on:
     push:
       branches: [ main ]
    jobs:
     build:
       name: Build-job
       runs-on: self-hosted
       steps:
       - name: Checkout
         uses: actions/checkout@v2
     
       - name: MySQLQuery
         env:
           PASS: ${{ secrets.MYSQL_PASS }}
         run: |
           docker run -v ${GITHUB_WORKSPACE}:/var/lib/docker --rm mysql:5.6 sh -c "mysql -u root -p$PASS -hmysql.mysql.svc.cluster.local </var/lib/docker/test.sql"

    • If you check the docker run command in the mysql_workflow.yaml file, we are referring to the .sql file, i.e., test.sql. So, create a test.sql file in your repository as follows:
    use mysql;
    CREATE TABLE IF NOT EXISTS Persons (
       PersonID int,
       LastName varchar(255),
       FirstName varchar(255),
       Address varchar(255),
       City varchar(255)
    );
     
    SHOW TABLES;

    • In test.sql, we are running MySQL queries, such as creating a table.
    • Push changes to your repository main branch.
    • If everything is fine, you will be able to see that the GitHub action is getting executed in a self-hosted runner pod. You can check it under the “Actions” tab of your repository.
    • You can check the workflow logs to see the output of the SHOW TABLES command used in the test.sql file and verify that the Persons table was created.

    References