Category: Engineering blogs

  • Iceberg – Introduction and Setup (Part – 1)

    As discussed in our previous Delta Lake blog, several table formats are already in wide use, each with its own strengths. Iceberg is one of them, and in this blog we will take a closer look at it.

    What is Apache Iceberg?

    Apache Iceberg is an open-source table format for handling large amounts of data stored locally or on various cloud storage platforms. Netflix originally developed Iceberg to solve its big data problems, then donated it to the Apache Software Foundation, where it became open source in 2018. Iceberg now has a large number of contributors worldwide on GitHub and is one of the most widely used table formats.

    Iceberg addresses the key problems teams faced when using the Hive table format with data stored on cloud storage such as S3.

    Iceberg tables offer features and capabilities similar to SQL tables. Because it is open source, multiple engines such as Spark can operate on the same table to perform transformations and queries, and it provides full ACID guarantees. This post is a quick introduction to Iceberg, covering its features and initial setup.

    Why Go with Iceberg?

    The main reason to use Iceberg is performance when loading data, or reading metadata, from cloud storage such as S3. Unlike Hive, which tracks data at the folder level and pays a listing penalty for it, Iceberg tracks data at the file level, which is why it performs better on object stores. Here is the hierarchy Iceberg uses when saving data into its tables; each Iceberg table is backed by four kinds of files: the snapshot metadata file, the manifest list, manifest files, and data files.

    1. Snapshot Metadata File: Holds metadata about the table, such as the schema, partition spec, and the location of the manifest list.
    2. Manifest List: Records each manifest file along with its path and summary metadata. At this level, Iceberg decides which manifest files to skip and which to read.
    3. Manifest File: Contains the paths to the actual data files, along with per-file metadata such as column-level statistics.
    4. Data File: The actual Parquet, ORC, or Avro file holding the table's data.
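To make the hierarchy concrete, here is a simplified, purely illustrative Python model of how a read walks these four layers. This is not Iceberg's actual implementation; the file names and statistics are made up to show where manifest-level pruning happens:

```python
# Toy model of Iceberg's metadata hierarchy: snapshot -> manifest list
# -> manifest files -> data files. Real Iceberg stores these as JSON/Avro
# files on object storage; here plain dicts stand in for them.

snapshot_metadata = {
    "schema": ["id", "date"],
    "partitions": ["date"],
    "manifest_list": [
        # Manifest-list entries carry partition-range stats that let
        # Iceberg skip whole manifests without ever opening them.
        {"path": "manifest-1.avro", "min_date": 20230101, "max_date": 20230131},
        {"path": "manifest-2.avro", "min_date": 20230201, "max_date": 20230228},
    ],
}

manifests = {
    "manifest-1.avro": ["data/jan-0.parquet", "data/jan-1.parquet"],
    "manifest-2.avro": ["data/feb-0.parquet"],
}

def plan_files(query_date):
    """Return only the data files whose manifest could contain query_date."""
    files = []
    for entry in snapshot_metadata["manifest_list"]:
        if entry["min_date"] <= query_date <= entry["max_date"]:  # prune here
            files.extend(manifests[entry["path"]])
    return files

print(plan_files(20230115))  # only January's data files are planned for reading
```

Because pruning happens against small metadata files rather than by listing folders, a query touching one partition never has to enumerate the whole table.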

    Features of Iceberg:

    Some Iceberg features include:

    • Schema Evolution: Iceberg allows you to evolve your schema without having to rewrite your data. This means you can easily add, drop, or rename columns, providing flexibility to adapt to changing data requirements without impacting existing queries.
    • Partition Evolution: Iceberg supports partition evolution, enabling you to modify the partitioning scheme as your data and query patterns evolve. This feature helps maintain query performance and optimize data layout over time.
    • Time Travel: Iceberg’s time travel feature allows you to query historical versions of your data. This is particularly useful for debugging, auditing, and recreating analyses based on past data states.
    • Multiple Query Engine Support: Iceberg supports multiple query engines, including Trino, Presto, Hive, and Amazon Athena. This interoperability ensures that you can read and write data across different tools seamlessly, facilitating a more versatile and integrated data ecosystem.
    • AWS Support: Iceberg is well-integrated with AWS services, making it easy to use with Amazon S3 for storage and other AWS analytics services. This integration helps leverage the scalability and reliability of AWS infrastructure for your data lake.
    • ACID Compliance: Iceberg ensures ACID (Atomicity, Consistency, Isolation, Durability) transactions, providing reliable data consistency and integrity. This makes it suitable for complex data operations and concurrent workloads, ensuring data reliability and accuracy.
    • Hidden Partitioning: Iceberg’s hidden partitioning abstracts the complexity of managing partitions from the user, automatically handling partition management to improve query performance without manual intervention.
    • Snapshot Isolation: Iceberg supports snapshot isolation, enabling concurrent read and write operations without conflicts. This isolation ensures that users can work with consistent views of the data, even as it is being updated.
    • Support for Large Tables: Designed for high scalability, Iceberg can efficiently handle petabyte-scale tables, making it ideal for large datasets typical in big data environments.
    • Compatibility with Modern Data Lakes: Iceberg’s design is tailored for modern data lake architectures, supporting efficient data organization, metadata management, and performance optimization, aligning well with contemporary data management practices.

    These features make Iceberg a powerful and flexible table format for managing data lakes, ensuring efficient data processing, robust performance, and seamless integration with various tools and platforms. By leveraging Iceberg, organizations can achieve greater data agility, reliability, and efficiency, enhancing their data analytics capabilities and driving better business outcomes.
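As an illustration of how three of these features surface in practice, here are representative Spark SQL statements. The table name demo.db.events and the column names are hypothetical; with a SparkSession configured for an Iceberg catalog, each string would be executed via spark.sql(...). Here we only assemble them:

```python
# Hedged, illustrative Iceberg statements in Spark SQL (table/columns hypothetical).

# Schema evolution: add a column without rewriting existing data files.
schema_evolution = "ALTER TABLE demo.db.events ADD COLUMN country STRING"

# Partition evolution: change the partition spec for data written from now on.
partition_evolution = "ALTER TABLE demo.db.events ADD PARTITION FIELD days(ts)"

# Time travel: query the table as of an earlier snapshot id.
time_travel = "SELECT * FROM demo.db.events VERSION AS OF 4348509021898745757"

for stmt in (schema_evolution, partition_evolution, time_travel):
    print(stmt)
```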

    Prerequisites:

    • PySpark: Ensure that you have PySpark installed and properly configured. PySpark provides the Python API for Spark, enabling you to harness the power of distributed computing with Spark using Python.
    • Python: Make sure you have Python installed on your system. Python is essential for writing and running your PySpark scripts. It’s recommended to use a virtual environment to manage your dependencies effectively.
    • Iceberg-Spark JAR: Download the appropriate Iceberg-Spark JAR file that corresponds to your Spark version. This JAR file is necessary to integrate Iceberg with Spark, allowing you to utilize Iceberg’s advanced table format capabilities within your Spark jobs.
    • Jars to Configure Cloud Storage: Obtain and configure the necessary JAR files for your specific cloud storage provider. For example, if you are using Amazon S3, you will need the hadoop-aws JAR and its dependencies. For Google Cloud Storage, you need the gcs-connector JAR. These JARs enable Spark to read from and write to cloud storage systems.
    • Spark and Hadoop Configuration: Ensure your Spark and Hadoop configurations are correctly set up to integrate with your cloud storage. This might include setting the appropriate access keys, secret keys, and endpoint configurations in your spark-defaults.conf and core-site.xml.
    • Iceberg Configuration: Configure Iceberg settings specific to your environment. This might include catalog configurations (e.g., Hive, Hadoop, AWS Glue) and other Iceberg properties that optimize performance and compatibility.
    • Development Environment: Set up a development environment with an IDE or text editor that supports Python and Spark development, such as PyCharm, IntelliJ IDEA with the Python plugin, Visual Studio Code, or Jupyter Notebooks.
    • Data Source Access: Ensure you have access to the data sources you will be working with, whether they are in cloud storage, relational databases, or other data repositories. Proper permissions and network configurations are necessary for seamless data integration.
    • Basic Understanding of Data Lakes: A foundational understanding of data lake concepts and architectures will help effectively utilize Iceberg. Knowledge of how data lakes differ from traditional data warehouses and their benefits will also be helpful.
    • Version Control System: Use a version control system like Git to manage your codebase. This helps in tracking changes, collaborating with team members, and maintaining code quality.
    • Documentation and Resources: Familiarize yourself with Iceberg documentation and other relevant resources. This will help you troubleshoot issues, understand best practices, and leverage advanced features effectively.

    You can download the runtime JAR from here, according to the Spark version installed on your machine or cluster; the process is the same as the Delta Lake setup. You can either download these JAR files to your machine or cluster and reference them in the spark-submit command, or fetch them while initializing the Spark session by passing them (with the appropriate versions) as packages in the Spark config.

    To use cloud storage, we pass these JARs and point Spark at an S3 bucket for reading and writing Iceberg tables. Here is a basic example of a Spark session:

    from pyspark.sql import SparkSession

    AWS_ACCESS_KEY_ID = "XXXXXXXXXXXXXX"
    AWS_SECRET_ACCESS_KEY = "XXXXXXXXXXXXXXXXXXXXXXXXXXXX"

    spark_jars_packages = "com.amazonaws:aws-java-sdk:1.12.246,org.apache.hadoop:hadoop-aws:3.2.2,org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:1.1.0"

    spark = SparkSession.builder \
       .config("spark.jars.packages", spark_jars_packages) \
       .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions") \
       .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog") \
       .config("spark.sql.catalog.demo.warehouse", "s3a://abhishek-test-01012023/iceberg-sample-data/") \
       .config("spark.sql.catalog.demo.type", "hadoop") \
       .config("spark.hadoop.fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider") \
       .config("spark.driver.memory", "20g") \
       .config("spark.memory.offHeap.enabled", "true") \
       .config("spark.memory.offHeap.size", "8g") \
       .getOrCreate()

    spark.sparkContext._jsc.hadoopConfiguration().set("fs.s3a.access.key", AWS_ACCESS_KEY_ID)
    spark.sparkContext._jsc.hadoopConfiguration().set("fs.s3a.secret.key", AWS_SECRET_ACCESS_KEY)

    Iceberg Setup Using Docker

    You can set and configure AWS credentials, as well as database- or stream-related configs, inside the docker-compose file.

    version: "3"
    
    services:
      spark-iceberg:
        image: tabulario/spark-iceberg
        container_name: spark-iceberg
        build: spark/
        depends_on:
          - rest
          - minio
        volumes:
          - ./warehouse:/home/iceberg/warehouse
          - ./notebooks:/home/iceberg/notebooks/notebooks
          - ./data:/home/iceberg/data
        environment:
          - AWS_ACCESS_KEY_ID=admin
          - AWS_SECRET_ACCESS_KEY=password
          - AWS_REGION=us-east-1
        ports:
          - 8888:8888
          - 8080:8080
        links:
          - rest:rest
          - minio:minio
      rest:
        image: tabulario/iceberg-rest:0.1.0
        ports:
          - 8181:8181
        environment:
          - AWS_ACCESS_KEY_ID=admin
          - AWS_SECRET_ACCESS_KEY=password
          - AWS_REGION=us-east-1
          - CATALOG_WAREHOUSE=s3a://warehouse/wh/
          - CATALOG_IO__IMPL=org.apache.iceberg.aws.s3.S3FileIO
          - CATALOG_S3_ENDPOINT=http://minio:9000
      minio:
        image: minio/minio
        container_name: minio
        environment:
          - MINIO_ROOT_USER=admin
          - MINIO_ROOT_PASSWORD=password
        ports:
          - 9001:9001
          - 9000:9000
        command: ["server", "/data", "--console-address", ":9001"]
      mc:
        depends_on:
          - minio
        image: minio/mc
        container_name: mc
        environment:
          - AWS_ACCESS_KEY_ID=admin
          - AWS_SECRET_ACCESS_KEY=password
          - AWS_REGION=us-east-1
        entrypoint: >
          /bin/sh -c "
          until (/usr/bin/mc config host add minio http://minio:9000 admin password) do echo '...waiting...' && sleep 1; done;
          /usr/bin/mc rm -r --force minio/warehouse;
          /usr/bin/mc mb minio/warehouse;
          /usr/bin/mc policy set public minio/warehouse;
          exit 0;
          " 

    Save this file as docker-compose.yaml and run the command docker compose up. Now, you can log into your container using this command:

    docker exec -it <container-id> bash

    You can mount the sample data directory into the container, or copy it from your local machine into the container using the docker cp command:

    docker cp input-data <Container ID>:/home/iceberg/data 

    Setup S3 As a Warehouse in Iceberg, Read Data from the S3, and Write Iceberg Tables in the S3 Again Using an EC2 Instance  

    We have generated 90 GB of data here using a Spark job, stored in the S3 bucket. The Spark session is configured the same way as before:

    from pyspark.sql import SparkSession

    AWS_ACCESS_KEY_ID = "XXXXXXXXXXX"
    AWS_SECRET_ACCESS_KEY = "XXXXXXXXXXX+XXXXXXXXXXX"

    spark_jars_packages = "com.amazonaws:aws-java-sdk:1.12.246,org.apache.hadoop:hadoop-aws:3.2.2,org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:1.1.0"

    spark = SparkSession.builder \
       .config("spark.jars.packages", spark_jars_packages) \
       .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions") \
       .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog") \
       .config("spark.sql.catalog.demo.warehouse", "s3a://abhishek-test-01012023/iceberg-sample-data/") \
       .config("spark.sql.catalog.demo.type", "hadoop") \
       .config("spark.hadoop.fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider") \
       .config("spark.driver.memory", "20g") \
       .config("spark.memory.offHeap.enabled", "true") \
       .config("spark.memory.offHeap.size", "8g") \
       .getOrCreate()

    spark.sparkContext._jsc.hadoopConfiguration().set("fs.s3a.access.key", AWS_ACCESS_KEY_ID)
    spark.sparkContext._jsc.hadoopConfiguration().set("fs.s3a.secret.key", AWS_SECRET_ACCESS_KEY)

    Step 1

    We read the data in Spark and create an Iceberg table from it, storing the Iceberg tables back in the same S3 bucket.

    Some Iceberg functionality won’t work if we haven’t installed the appropriate JAR file for our Iceberg version. The Iceberg runtime must be compatible with the Spark version you are using; otherwise, some features, such as partitioning, will fail with a NoSuchMethodError. Take care of this while setting things up on either EC2 or EMR.
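A small helper makes the compatibility rule concrete. The Maven coordinate pattern below matches the one used in the Spark config earlier in this post; the exact versions you plug in are assumptions you must verify against Iceberg's release matrix:

```python
def iceberg_runtime_package(spark_version, scala_version, iceberg_version):
    """Build the Maven coordinate for the Iceberg Spark runtime JAR.

    Iceberg publishes one runtime artifact per Spark minor version and
    Scala version, e.g. iceberg-spark-runtime-3.3_2.12. Pulling a runtime
    built for a different Spark minor version is what leads to
    NoSuchMethodError at query time.
    """
    spark_minor = ".".join(spark_version.split(".")[:2])  # "3.3.2" -> "3.3"
    return (f"org.apache.iceberg:iceberg-spark-runtime-"
            f"{spark_minor}_{scala_version}:{iceberg_version}")

print(iceberg_runtime_package("3.3.2", "2.12", "1.1.0"))
# org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:1.1.0
```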

    Create an Iceberg table on S3 and write data into it. The sample data is the same data (and schema) we generated with a Spark job for the Delta tables.

    Step 2

    We created Iceberg tables at the S3 bucket location and wrote the data, partitioned by the date column, into the same bucket.

    import logging

    spark.sql("""CREATE TABLE IF NOT EXISTS demo.db.iceberg_data_2(id INT, first_name String,
    last_name String, address String, pincode INT, net_income INT, source_of_income String,
    state String, email_id String, description String, population INT, population_1 String,
    population_2 String, population_3 String, population_4 String, population_5 String, population_6 String,
    population_7 String, date INT)
    USING iceberg
    TBLPROPERTIES ('format'='parquet', 'format-version' = '2')
    PARTITIONED BY (`date`)
    LOCATION 's3a://abhishek-test-01012023/iceberg_v2/db/iceberg_data_2'""")

    # Read the data that needs to be written
    # (the source is the Delta Lake sample data, read into a Spark DataFrame)
    df = spark.read.parquet("s3a://abhishek-test-01012023/delta-lake-sample-data/")

    logging.info("Starting writing the data")

    df.sortWithinPartitions("date").writeTo("demo.db.iceberg_data_2").partitionedBy("date").createOrReplace()

    logging.info("Writing has finished")

    logging.info("Querying the data from Iceberg using Spark SQL")

    spark.sql("DESCRIBE TABLE demo.db.iceberg_data_2").show()
    spark.sql("SELECT * FROM demo.db.iceberg_data_2 LIMIT 10").show()
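Beyond the table itself, Iceberg exposes metadata tables that can be queried with the same SparkSession. The query strings below illustrate the pattern for the table created above (the .snapshots, .files, and .history suffixes are Iceberg metadata tables); with the session from this post, each would be run as spark.sql(query).show(). Here we only build the strings:

```python
# Illustrative metadata-table queries against the table created above.
table = "demo.db.iceberg_data_2"
metadata_queries = {
    # every snapshot (commit) made to the table, with its snapshot_id
    "snapshots": f"SELECT * FROM {table}.snapshots",
    # one row per live data file, including its path and record count
    "files": f"SELECT file_path, record_count FROM {table}.files",
    # the lineage of current and parent snapshots over time
    "history": f"SELECT * FROM {table}.history",
}

for name, query in metadata_queries.items():
    print(name, "->", query)
```

These are the same snapshots that the time-travel feature queries with VERSION AS OF.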

    This is how we can use Iceberg over S3. There is another option: we can also create Iceberg tables in the AWS Glue catalog. Most tables created in the Glue catalog using Athena are external tables that we use after generating the manifest files, as with Delta Lake.

    Step 3

    We print the Iceberg table’s data along with the table descriptions. 

    Using Iceberg, we can directly create the table in the Glue catalog using Athena, and it supports all read and write operations on the available data. These are the configurations that need to be set in Spark when using the Glue catalog:

    {
        "conf": {
            "spark.sql.catalog.glue_catalog1": "org.apache.iceberg.spark.SparkCatalog",
            "spark.sql.catalog.glue_catalog1.warehouse": "s3://YOUR-BUCKET-NAME/iceberg/glue_catalog1/tables/",
            "spark.sql.catalog.glue_catalog1.catalog-impl": "org.apache.iceberg.aws.glue.GlueCatalog",
            "spark.sql.catalog.glue_catalog1.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
            "spark.sql.catalog.glue_catalog1.lock-impl": "org.apache.iceberg.aws.glue.DynamoLockManager",
            "spark.sql.catalog.glue_catalog1.lock.table": "myGlueLockTable",
            "spark.sql.extensions": "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions"
        }
    }
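Equivalently, the same settings can be applied when building the Spark session in Python. The bucket name, catalog name, and lock-table name below are placeholders to adapt to your environment:

```python
# The Glue-catalog settings from the JSON above, expressed as
# SparkSession config key/value pairs. Bucket, catalog name, and
# lock-table name are placeholders.
glue_conf = {
    "spark.sql.catalog.glue_catalog1": "org.apache.iceberg.spark.SparkCatalog",
    "spark.sql.catalog.glue_catalog1.warehouse":
        "s3://YOUR-BUCKET-NAME/iceberg/glue_catalog1/tables/",
    "spark.sql.catalog.glue_catalog1.catalog-impl":
        "org.apache.iceberg.aws.glue.GlueCatalog",
    "spark.sql.catalog.glue_catalog1.io-impl":
        "org.apache.iceberg.aws.s3.S3FileIO",
    "spark.sql.catalog.glue_catalog1.lock-impl":
        "org.apache.iceberg.aws.glue.DynamoLockManager",
    "spark.sql.catalog.glue_catalog1.lock.table": "myGlueLockTable",
    "spark.sql.extensions":
        "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
}

# With pyspark installed, these would be applied one by one:
# builder = SparkSession.builder
# for key, value in glue_conf.items():
#     builder = builder.config(key, value)
# spark = builder.getOrCreate()
```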

    Now, we can easily create the Iceberg table using Spark or Athena, and it will be accessible via Athena. We can perform upserts, too.

    Conclusion

    We’ve learned the basics of the Iceberg table format, its features, and the reasons for choosing Iceberg. We discussed how Iceberg provides significant advantages such as schema evolution, partition evolution, hidden partitioning, and ACID compliance, making it a robust choice for managing large-scale data. We also delved into the fundamental setup required to implement this table format, including configuration and integration with data processing engines like Apache Spark and query engines like Presto and Trino. By leveraging Iceberg, organizations can ensure efficient data management and analytics, facilitating better performance and scalability. With this knowledge, you are well-equipped to start using Iceberg for your data lake needs, ensuring a more organized, scalable, and efficient data infrastructure.

  • Key Considerations for Picking Up the Right BI Tool

    Business Intelligence (BI) tools have become a cornerstone of modern data analysis by transcending the limitations of traditional methods like Excel and databases.

    With plenty of options, selecting the right BI tool is crucial for unlocking the full potential of your organization’s data. In this blog, we will explore some popular BI tools, their features, and key considerations to help you make an informed decision.

    Here are some of the leading tools at the forefront of our discussion.

    Key Considerations for Choosing the Ideal BI Tool

    1. Business Objectives 

    Your selected BI tool must align with your business objectives and user expertise:

    • Identify the specific goals and outcomes you want to achieve from the BI tool. It could be improving sales, optimizing operations, or enhancing competitive insights. 
    • Be sure to also assess the technical proficiency of your users and choose a BI tool that matches the skill level of your team to achieve optimal utilization and efficiency.

    After solidifying the objectives, dive into the additional considerations explained below to craft your ultimate decision.

    2. Factors Related to Installation

    When choosing a BI tool from an installation and deployment perspective, various factors come into play. A selection of these considerations is outlined in the table below.

    Based on these points, we can summarize that:

    • Smaller businesses might prefer user-friendly options like Power BI or Qlik Sense. 
    • Larger enterprises with extensive IT support might opt for Tableau or SAP BI for their comprehensive features. 
    • Open-source enthusiasts might find Apache Superset appealing, but it requires a solid understanding of software deployment.

    3. Ease of Use & Learning Curve 

    To ensure widespread adoption within your organization, choose a BI tool that prioritizes ease of use and has a manageable learning curve.

    • Power BI and Tableau offer user-friendly interfaces, making them accessible to a wide range of users, with moderate learning curves.
    • SAP BI is ideal for organizations already familiar with SAP products, leveraging existing expertise for seamless integration.
    • Superset and Qlik Sense provide a balanced approach, accommodating users with different levels of technical proficiency while ensuring accessibility and usability.

    4. Integration with Existing Infrastructure

    You must also consider how well the BI tool aligns with existing IT infrastructure, applications, and databases:

    Power BI: Integrates well with Microsoft products, providing seamless connectivity and robust integration. It is well-suited for businesses leveraging Microsoft technologies.

    Tableau: A leading BI and data visualization tool with robust integration capabilities. Like many other BI platforms, it supports a wide range of data sources, cloud platforms, and big data technologies like Spark and Hadoop. This makes it suitable for organizations with a diverse tech stack.

    SAP BI: Integrates well with SAP products. For third-party applications, the Business Connector is used for integration, which can be challenging and requires additional configuration. Best suited for organizations heavily invested in SAP products.

    Apache Superset: Provides integration options with a wide range of technologies thanks to its open-source nature and active community support. However, additional setup and configuration are required for specific technologies. It is best suited for small-scale use, as running it for a large organization can become a complex and tedious task.

    Qlik Sense: Known for its strong integration capabilities and real-time data analysis. Much like Tableau, it seamlessly connects with various data sources, big data technologies like Hadoop and Spark, and major cloud platforms like GCP, AWS, and Azure.

    5. Cost Estimation 

    BI platforms can vary significantly in their pricing models and associated costs. So, you need to evaluate costs against your current and future usage and team size. Here, I’ve mentioned some key points to consider when comparing BI tools with a focus on budget constraints:

    • If an organization possesses the expertise to manage its cloud infrastructure and has a dedicated team to oversee resource scaling and monitoring, Apache Superset stands out as an excellent choice. This minimizes your licensing costs.
    • However, if building a cloud infrastructure isn’t your preference and you need a Software as a Service (SaaS) solution, Power BI Premium could be suitable for small teams focused on analysis.
    • SAP BI presents a viable option for large organizations needing customized pricing plans tailored to specific requirements. 
    • Alternatively, if you require both cloud and on-premise options, Qlik Sense and Tableau offer versatile solutions, catering well to the needs of small and medium-sized businesses.

    Summary

    So, in a nutshell, when choosing a BI tool, carefully assess your organization’s individual needs, technical infrastructure, budget limitations, and technical proficiency. Each tool has its strengths, so tailor your choice to match your specific requirements, enabling you to maximize your data’s potential.

    References:

    1. Power BI
      https://learn.microsoft.com/en-us/power-bi/connect-data/desktop-quickstart-connect-to-data
      https://community.fabric.microsoft.com/t5/Microsoft-Power-BI-Community/ct-p/powerbi
      https://powerbi.microsoft.com/en-us/pricing/
    2. Tableau
      https://help.tableau.com/current/pro/desktop/en-us/basicconnectoverview.htm
      https://www.tableau.com/blog/community
      https://www.tableau.com/pricing/teams-orgs
    3. SAP BI
      https://www.sap.com/india/products/technology-platform/cloud-analytics/pricing.html
    4. Qlik Sense
      https://www.qlik.com/us/products/data-sources?category=ProductOrServiceQlikSense
      https://www.qlik.com/us/pricing
    5. Apache Superset
      https://superset.apache.org/docs/databases/installing-database-drivers/
  • Exploring Marvels of Webpack: Ep 1 – React Project without CRA

    Hands up if you’ve ever built a React project with Create-React-App (CRA)—and that’s all of us, isn’t it? Now, how about we pull back the curtain and see what’s actually going on behind the scenes? Buckle up, it’s time to understand what CRA really is and explore the wild, untamed world of creating a React project without it. Sounds exciting, huh?

    What is CRA?

    CRA, or Create React App (https://create-react-app.dev/), is a command-line utility provided by Facebook for creating React apps with a preconfigured setup. CRA provides an abstraction layer over the nitty-gritty details of configuring tools like Babel and Webpack, allowing us to focus on writing code. In short, it comes with everything preconfigured, so developers don’t need to worry about anything but code.

    That’s all well and good, but why do we need to learn about manual configuration? At some point in your career, you’ll likely have to adjust webpack configurations. And if that’s not a convincing reason, how about satisfying your curiosity?  🙂

    Let’s begin our journey.

    Webpack

    As per the official docs (https://webpack.js.org/concepts/):

    “At its core, webpack is a static module bundler for modern JavaScript applications.”

    But what does that actually mean? Let’s break it down:

    static: Refers to the static assets (HTML, CSS, JS, images) of our application.

    module: Refers to a piece of code in one of our files. In a large application, it’s usually not possible to write everything in a single file, so we have multiple modules working together.

    bundler: The tool (webpack, in our case) that bundles up everything we have used in our project and converts it to native, browser-understandable JS, CSS, and HTML (static assets).

    Source: https://webpack.js.org/

    So, in essence, webpack takes our application’s static assets (like JavaScript modules, CSS files, and more) and bundles them together, resolving dependencies and optimizing the final output.

    Webpack is preconfigured in our Create-React-App (CRA), and for most use cases, we don’t need to adjust it. You’ll find that many tutorials begin a React project with CRA. However, to truly understand webpack and its functionalities, we need to configure it ourselves. In this guide, we’ll attempt to do just that.

    Let’s break this whole process into multiple steps:

    Step 1: Let us name our new project

    Create a new project folder and navigate into it:

    mkdir react-webpack-way
    cd react-webpack-way

    Step 2: Initialize npm

    Run the following command to initialize a new npm project. Answer the prompts or press Enter to accept the default values.

    npm init # if you are patient enough to answer the prompts :)
    Or
    npm init -y

    This will generate a package.json for us.

    Step 3: Install React and ReactDOM

    Install React and ReactDOM as dependencies:

    npm install react react-dom

    Step 4: Create project structure

    You can create any folder structure that you are used to. But for the sake of simplicity, let’s stick to the following structure:

    |- src
      |- index.js
    |- public
      |- index.html

    Step 5: Set up React components

    Let’s populate our index.js:

    // src/index.js
    import React from 'react';
    import { createRoot } from 'react-dom/client';
    
    const App = () => {
      return <h1>Hello, React with Webpack!</h1>;
    };
    
    // React 18+: create a root and render into it (ReactDOM.render is deprecated)
    createRoot(document.getElementById('root')).render(<App />);

    Step 6: Let’s deal with the HTML file

    Add the following content to index.html:

    <!-- public/index.html -->
    <!DOCTYPE html>
    <html lang="en">
      <head>
        <meta charset="utf-8" />
        <title>React with Webpack</title>
      </head>
      <body>
        <div id="root"></div> <!-- Do not miss this one -->
      </body>
    </html>

    Step 7: Install Webpack and Babel

    Install Webpack, Babel, and html-webpack-plugin as development dependencies:

    npm install --save-dev webpack webpack-cli webpack-dev-server @babel/core @babel/preset-react @babel/preset-env babel-loader html-webpack-plugin

    Or

    If this looks verbose to you, you can install them in steps:

    npm install --save-dev webpack webpack-cli webpack-dev-server # webpack
    npm install --save-dev @babel/core @babel/preset-react @babel/preset-env babel-loader #babel
    npm install --save-dev html-webpack-plugin

    Why Babel? Read more: https://babeljs.io/docs/

    In a nutshell, some of the reasons we use Babel are:

    1. JavaScript ECMAScript Compatibility:
      • Babel allows developers to use the latest ECMAScript (ES) features in their code, even if the browser or Node.js environment doesn’t yet support them. This is achieved through the process of transpiling, where Babel converts modern JavaScript code (ES6 and beyond) into a version that is compatible with a wider range of browsers and environments.

    2. JSX Transformation:
      • JSX (JavaScript XML) is a syntax extension for JavaScript used with React. Babel is required to transform JSX syntax into plain JavaScript, as browsers do not understand JSX directly. This transformation is necessary for React components to be properly rendered in the browser.

    3. Module System Transformation:
      • Babel helps in transforming the module system used in JavaScript. It can convert code written using the ES6 module syntax (import and export) into the CommonJS or AMD syntax that browsers and older environments understand.

    4. Polyfilling:
      • Babel can include polyfills for features not present in the target environment. This ensures your application can use newer language features or APIs even if they are not supported natively.

    5. Browser Compatibility:
      • Different browsers have varying levels of support for JavaScript features. Babel helps address these compatibility issues by allowing developers to write code using the latest features and then automatically transforming it to a version that works across different browsers.

    Why html-webpack-plugin? Read more: https://webpack.js.org/plugins/html-webpack-plugin/

    The html-webpack-plugin is a popular webpack plugin that simplifies the process of creating an HTML file to serve your bundled JavaScript files. It automatically injects the bundled script(s) into the HTML file, saving you from having to manually update the script tags every time your bundle changes. To put it in perspective, if you don’t have this plugin, you won’t see your React index file injected into the HTML file.

    Step 8: Configure Babel

    Create a .babelrc file in the project root and add the following configuration:

    // .babelrc
    {
      "presets": ["@babel/preset-react", "@babel/preset-env"]
    }

    Step 9: Configure Webpack

    Create a webpack.config.js file in the project root:

    // webpack.config.js
    const path = require('path');
    const HtmlWebpackPlugin = require('html-webpack-plugin');
    
    module.exports = {
      entry: './src/index.js',
      output: {
        path: path.resolve(__dirname, 'dist'),
        filename: 'bundle.js',
      },
      module: {
        rules: [
          {
            test: /\.(js|jsx)$/,
            exclude: /node_modules/,
            use: 'babel-loader',
          },
        ],
      },
      plugins: [
        new HtmlWebpackPlugin({
          template: 'public/index.html',
        }),
      ],
      devServer: {
        static: path.resolve(__dirname, 'public'),
        port: 3000,
      },
    };

    Step 10: Update package.json scripts

    Update the “scripts” section in your package.json file:

    "scripts": {
      "start": "webpack serve --mode development --open",
      "build": "webpack --mode production"
    }

    Note: Do not replace the contents of package.json here. Just update the scripts section.

    Step 11: This is where our hard work pays off

    Now you can run your React project using the following command:

    npm start

    Visit http://localhost:3000 in your browser, and you should see your React app up and running.

    This is it. This is a very basic version of our CRA.

    There’s more

    Stick around if you want to understand what we exactly did in the webpack.config.js.

    At this point, our webpack config looks like this:

    // webpack.config.js
    const path = require('path');
    const HtmlWebpackPlugin = require('html-webpack-plugin');
    
    module.exports = {
      entry: './src/index.js',
      output: {
        path: path.resolve(__dirname, 'dist'),
        filename: 'bundle.js',
      },
      module: {
        rules: [
          {
            test: /\.(js|jsx)$/,
            exclude: /node_modules/,
            use: 'babel-loader',
          },
        ],
      },
      plugins: [
        new HtmlWebpackPlugin({
          template: 'public/index.html',
        }),
      ],
      devServer: {
        static: path.resolve(__dirname, 'public'),
        port: 3000,
      },
    };

    Let’s go through each section of the provided webpack.config.js file and explain what each keyword means:

    1. const path = require('path');
      • This line imports the Node.js path module, which provides utilities for working with file and directory paths. Using it in our webpack configuration ensures that file paths are resolved correctly and consistently across different operating systems.

    2. const HtmlWebpackPlugin = require('html-webpack-plugin');
      • This line imports the HtmlWebpackPlugin module. This webpack plugin simplifies the process of creating an HTML file to include the bundled JavaScript files. It’s a convenient way of automatically generating an HTML file that includes the correct script tags for our React application.

    3. module.exports = { ... };
      • This line exports a JavaScript object, which contains the configuration for webpack. It specifies how webpack should bundle and process your code.

    4. entry: './src/index.js',
      • This configuration tells webpack the entry point of your application, which is the main JavaScript file where the bundling process begins. In this case, it’s ./src/index.js.

    5. output: { path: path.resolve(__dirname, 'dist'), filename: 'bundle.js', },
      • This configuration specifies where the bundled JavaScript file should be output: path is the directory, and filename is the name of the output file. In this case, it will be placed in the dist directory with the name bundle.js.

    6. module: { rules: [ ... ], },
      • This section defines rules for how webpack should process different types of files. In this case, it specifies a rule for JavaScript and JSX files (those ending with .js or .jsx). The babel-loader is used to transpile these files using Babel, excluding files in the node_modules directory.

    7. plugins: [ new HtmlWebpackPlugin({ template: 'public/index.html', }), ],
      • This section includes an array of webpack plugins. In particular, it adds the HtmlWebpackPlugin, configured to use the public/index.html file as a template. This plugin will automatically generate an HTML file with the correct script tags for the bundled JavaScript.

    8. devServer: { static: path.resolve(__dirname, 'public'), port: 3000, },
      • This configuration is for the webpack development server. It specifies the base directory for serving static files (public in this case) and the port number (3000) on which the development server will run. The development server provides features like hot-reloading during development.

    And there you have it! We’ve just scratched the surface of the wild world of webpack. But don’t worry, this is just the opening act. Grab your gear, because in the upcoming articles, we’re going to plunge into the deep end, exploring the advanced terrains of webpack. Stay tuned!

  • Strategies for Cost Optimization Across Amazon EKS Clusters

    Fast-growing tech companies rely heavily on Amazon EKS clusters to host a variety of microservices and applications. The pairing of Amazon EKS for managing the Kubernetes Control Plane and Amazon EC2 for flexible Kubernetes nodes creates an optimal environment for running containerized workloads. 

    With the increasing scale of operations, optimizing costs across multiple EKS clusters has become a critical priority. This blog will demonstrate how we can leverage various tools and strategies to analyze, optimize, and manage EKS costs effectively while maintaining performance and reliability. 

    Cost Analysis:

    Cost analysis is a necessary first step for any cost optimization effort. Data plays an important role here, so trust your data. The total cost of operating an EKS cluster encompasses several components. The EKS Control Plane (or Master Node) incurs a fixed cost of $0.20 per hour, offering straightforward pricing.

    Meanwhile, EC2 instances, serving as the cluster’s nodes, introduce various cost factors, such as block storage and data transfer, which can vary significantly based on workload characteristics. For this discussion, we’ll focus primarily on two aspects of EC2 cost: instance hours and instance pricing. Let’s look at how to do the cost analysis on your EKS cluster.

    • Tool Selection: We can begin our cost analysis journey by selecting Kubecost, a powerful tool specifically designed for Kubernetes cost analysis. Kubecost provides granular insights into resource utilization and costs across our EKS clusters.
    • Deployment and Usage: Deploying Kubecost is straightforward. We can integrate it with our Kubernetes clusters following the provided documentation. Kubecost’s intuitive dashboard allows us to visualize resource usage, cost breakdowns, and cost allocation by namespace, pod, or label. Once deployed, you can see the Kubecost overview page in your browser by port-forwarding the Kubecost k8s service. It might take 5-10 minutes for Kubecost to gather metrics. You can see your Amazon EKS spend, including cumulative cluster costs, associated Kubernetes asset costs, and monthly aggregated spend.
    • Cluster Level Cost Analysis: For multi-cluster cost analysis and cluster-level scoping, consider using the AWS tagging strategy and tag your EKS clusters. Learn more about tagging strategies in the AWS documentation. You can then view your cost analysis in AWS Cost Explorer, which provides additional insights into your AWS usage and spending trends. By analyzing cost and usage data at a granular level, we can identify areas for further optimization and cost reduction.
    • Multi-Cluster Cost Analysis using Kubecost and Prometheus: The Kubecost deployment ships with a Prometheus server to which cost analysis metrics are sent. For multiple EKS clusters, we can enable a remote Prometheus server, either AWS-managed or self-managed. To get cost analysis metrics from multiple clusters, we need to run Kubecost with an additional SigV4 pod that sends individual and combined cluster metrics to a common Prometheus server. You can follow the AWS documentation for Multi-Cluster Cost Analysis using Kubecost and Prometheus.

    Cost Optimization Strategies:

    Based on the cost analysis, the next step is to plan your cost optimization strategies. As explained in the previous section, the Control Plane has a fixed cost and straightforward pricing model. So, we will focus mainly on optimizing the data nodes and optimizing the application configuration. Let’s look at the following strategies when optimizing the cost of the EKS cluster and supporting AWS services:

    • Right Sizing: On the cost optimization pillar of the AWS Well-Architected Framework, we find a section on Cost-Effective Resources, which describes Right Sizing as:

    “… using the lowest cost resource that still meets the technical specifications of a specific workload.”

    • Application Right Sizing: Right-sizing is the strategy of optimizing pod resources by allocating the appropriate CPU and memory to pods. Take care to set requests that align as closely as possible to the actual utilization of these resources. If the value is too low, the containers may experience resource throttling, which impacts performance. However, if the value is too high, there is waste, since those unused resources remain reserved for that single container. When actual utilization is lower than the requested value, the difference is called slack cost. A tool like kube-resource-report is valuable for visualizing the slack cost and right-sizing the requests for the containers in a pod. Its installation instructions show how to install it via the included Helm chart.

      helm upgrade --install kube-resource-report chart/kube-resource-report


      You can also consider tools like VPA recommender with Goldilocks to get an insight into your pod resource consumption and utilization.


    • Compute Right Sizing: Application right sizing and Kubecost analysis are required to right-size EKS compute. Here are several strategies for compute right sizing:
      • Mixed Instance Auto Scaling group: Employ a mixed instance policy to create a diversified pool of instances within your auto scaling group. This mix can include both spot and on-demand instances. However, it’s advisable not to mix instances of different sizes within the same Node group.
      • Node Groups, Taints, and Tolerations: Utilize separate Node Groups with varying instance sizes for different application requirements. For example, use distinct node groups for GPU-intensive and CPU-intensive applications. Use taints and tolerations to ensure applications are deployed on the appropriate node group.
      • Graviton Instances: Explore the adoption of Graviton Instances, which offer up to 40% better price performance compared to traditional instances. Consider migrating to Graviton Instances to optimize costs and enhance application performance.
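    As a concrete sketch of the node-group, taints, and tolerations approach above, the following manifest is a hypothetical example (the workload-type taint key, the gpu value, the deployment name, and the image are all assumptions, not from this post). The idea is that the GPU node group is tainted with workload-type=gpu:NoSchedule, so only pods carrying a matching toleration are scheduled onto it:

    ```yaml
    # Hypothetical GPU workload that tolerates the taint applied to the
    # GPU node group (workload-type=gpu:NoSchedule).
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: gpu-trainer
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: gpu-trainer
      template:
        metadata:
          labels:
            app: gpu-trainer
        spec:
          tolerations:
            - key: workload-type
              operator: Equal
              value: gpu
              effect: NoSchedule
          nodeSelector:
            workload-type: gpu   # assumes the GPU nodes are labeled as well
          containers:
            - name: trainer
              image: gpu-trainer:latest   # placeholder image
    ```

    Note that the toleration only permits scheduling on the tainted nodes; the nodeSelector is what actually steers the pod there.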
    • Purchase Options: Another part of the cost optimization pillar of the AWS Well-Architected Framework that we can apply comes from the Purchasing Options section, which says:

    “Spot Instances allow you to use spare compute capacity at a significantly lower cost than On-Demand EC2 instances (up to 90%).”

    Understanding purchase options for Amazon EC2 is crucial for cost optimization. The Amazon EKS data plane consists of worker nodes or serverless compute resources responsible for running Kubernetes application workloads. These nodes can utilize different capacity types and purchase options, including  On-Demand, Spot Instances, Savings Plans, and Reserved Instances.

    On-Demand and Spot capacity offer flexibility without spending commitments. On-Demand instances are billed based on runtime and guarantee availability at On-Demand rates, while Spot instances offer discounted rates but are preemptible. Both options are suitable for temporary or bursty workloads, with Spot instances being particularly cost-effective for applications tolerant of compute availability fluctuations. 

    Reserved Instances involve upfront spending commitments over one or three years for discounted rates. Once a steady-state resource consumption profile is established, Reserved Instances or Savings Plans become effective. Savings Plans, introduced as a more flexible alternative to Reserved Instances, allow for commitments based on a “US Dollar spend amount,” irrespective of provisioned resources. There are two types: Compute Savings Plans, offering flexibility across instance types, Fargate, and Lambda charges, and EC2 Instance Savings Plans, providing deeper discounts but restricting compute choice to an instance family.

    Tailoring your approach to your workload can significantly impact cost optimization within your EKS cluster. For non-production environments, leveraging Spot Instances exclusively can yield substantial savings. Meanwhile, implementing Mixed-Instances Auto Scaling Groups for production workloads allows for dynamic scaling and cost optimization. Additionally, for steady workloads, investing in a Savings Plan for EC2 instances can provide long-term cost benefits. By strategically planning and optimizing your EC2 instances, you can achieve a notable reduction in your overall EKS compute costs, potentially reaching savings of approximately 60-70%.

    “… this (matching supply and demand) is accomplished using Auto Scaling, which helps you to scale your EC2 instances and Spot Fleet capacity up or down automatically according to conditions you define.”

    • Cluster Autoscaling: Therefore, a prerequisite to cost optimization on a Kubernetes cluster is to ensure you have Cluster Autoscaler running. This tool performs two critical functions in the cluster. First, it will monitor the cluster for pods that are unable to run due to insufficient resources. Whenever this occurs, the Cluster Autoscaler will update the Amazon EC2 Auto Scaling group to increase the desired count, resulting in additional nodes in the cluster. Additionally, the Cluster Autoscaler will detect nodes that have been underutilized and reschedule pods onto other nodes. Cluster Autoscaler will then decrease the desired count for the Auto Scaling group to scale in the number of nodes.

    The Amazon EKS User Guide has a great section on the configuration of the Cluster Autoscaler. There are a couple of things to pay attention to when configuring the Cluster Autoscaler:

    IAM Roles for Service Account – Cluster Autoscaler will require access to update the desired capacity in the Auto Scaling group. The recommended approach is to create a new IAM role with the required policies and a trust policy that restricts access to the service account used by Cluster Autoscaler. The role name must then be provided as an annotation on the service account:

    apiVersion: v1
    kind: ServiceAccount
    metadata:
      name: cluster-autoscaler
      annotations:
        eks.amazonaws.com/role-arn: arn:aws:iam::000000000000:role/my_role_name

    Auto-Discovery Setup – Set up your Cluster Autoscaler in auto-discovery mode by enabling the --node-group-auto-discovery flag as an argument. Also, make sure to tag your EKS nodes’ Auto Scaling groups with the following tags:

    k8s.io/cluster-autoscaler/enabled,
    k8s.io/cluster-autoscaler/<cluster-name>


    Auto Scaling Group per AZ – When Cluster Autoscaler scales out, it simply increases the desired count for the Auto Scaling group, leaving the responsibility for launching new EC2 instances to the AWS Auto Scaling service. If an Auto Scaling group is configured for multiple availability zones, then the new instance may be provisioned in any of those availability zones.

    For deployments that use persistent volumes, you will need to provision a separate Auto Scaling group for each availability zone. This way, when Cluster Autoscaler detects the need to scale out in response to a given pod, it can target the correct availability zone for the scale-out based on persistent volume claims that already exist in a given availability zone.

    When using multiple Auto Scaling groups, be sure to include the following argument in the pod specification for Cluster Autoscaler:

    --balance-similar-node-groups=true

    • Pod Autoscaling: Now that Cluster Autoscaler is running in the cluster, you can be confident that instance hours will align closely with the demand from pods within the cluster. Next up is the Horizontal Pod Autoscaler (HPA), which scales the number of pods in a deployment out or in based on specific pod metrics, optimizing pod hours and, in turn, further optimizing instance hours.

    The HPA controller is included with Kubernetes, so all that is required to configure HPA is to ensure that the Kubernetes metrics server is deployed in your cluster and then to define HPA resources for your deployments. For example, the following HPA resource is configured to monitor the CPU utilization of a deployment named nginx-ingress-controller. HPA will then scale the number of pods out or in between 1 and 5 to target an average CPU utilization of 80% across all the pods:

    apiVersion: autoscaling/v1
    kind: HorizontalPodAutoscaler
    metadata:
      name: nginx-ingress-controller
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: nginx-ingress-controller
      minReplicas: 1
      maxReplicas: 5
      targetCPUUtilizationPercentage: 80

    The combination of Cluster Autoscaler and Horizontal Pod Autoscaler is an effective way to keep EC2 instance hours tied as close as possible to the actual utilization of the workloads running in the cluster.

    “Systems can be scheduled to scale out or in at defined times, such as the start of business hours, thus ensuring that resources are available when users arrive.”

    There are many deployments that only need to be available during business hours. A tool named kube-downscaler can be deployed to the cluster to scale in and out the deployments based on time of day. 

    Some example use cases of kube-downscaler are:

    • Deploy the downscaler to a test (non-prod) cluster with a default uptime or downtime time range to scale down all deployments during the night and weekend.
    • Deploy the downscaler to a production cluster without any default uptime/downtime setting and scale down specific deployments by setting the downscaler/uptime (or downscaler/downtime) annotation. This might be useful for internal tooling front ends, which are only needed during work time.
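    The second use case above can be sketched with kube-downscaler’s downscaler/uptime annotation. The annotation key and its schedule syntax come from the kube-downscaler project, while the deployment name, image, and the specific time window below are hypothetical:

    ```yaml
    # Hypothetical internal tool kept running only during work hours;
    # kube-downscaler scales it down outside the annotated uptime window.
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: internal-dashboard
      annotations:
        downscaler/uptime: Mon-Fri 08:00-19:00 Europe/Berlin
    spec:
      replicas: 2
      selector:
        matchLabels:
          app: internal-dashboard
      template:
        metadata:
          labels:
            app: internal-dashboard
        spec:
          containers:
            - name: dashboard
              image: internal-dashboard:latest   # placeholder image
    ```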
    • AWS Fargate with EKS: You can run Kubernetes without managing clusters of K8s servers with AWS Fargate, a serverless compute service.

    AWS Fargate pricing is based on usage (pay-per-use), with no upfront charges. There is, however, a one-minute minimum charge, and all charges are rounded up to the nearest second. You will also be charged for any additional services you use, such as CloudWatch utilization charges and data transfer fees. Fargate can also reduce your management costs by reducing the number of DevOps professionals and tools you need to run Kubernetes on Amazon EKS.

    Conclusion:

    Effectively managing costs across multiple Amazon EKS clusters is essential for optimizing operations. By utilizing tools like Kubecost and AWS Cost Explorer, coupled with strategies such as right-sizing, mixed instance policies, and Spot Instances, organizations can streamline cost analysis and optimize resource allocation. Additionally, implementing auto-scaling mechanisms like Cluster Autoscaler ensures dynamic resource scaling based on demand, further optimizing costs. Leveraging AWS Fargate with EKS can eliminate the need to manage Kubernetes clusters, reducing management costs. Overall, by combining these strategies, organizations can achieve significant cost savings while maintaining performance and reliability in their containerized environments.

  • Unveiling the Magic of Golang Interfaces: A Comprehensive Exploration

    Go interfaces are powerful tools for designing flexible and adaptable code. However, their inner workings can often seem hidden behind the simple syntax.

    This blog post aims to peel back the layers and explore the internals of Go interfaces, providing you with a deeper understanding of their power and capabilities.

    1. Interfaces: Not Just Method Signatures

    While interfaces appear as mere collections of method signatures, they run deeper than that. An interface defines a contract: any type that implements the interface guarantees the ability to perform specific actions through those methods. This contract-based approach promotes loose coupling and enhances code reusability.

    // Interface defining a "printable" behavior
    type Printable interface {
        String() string
    }
    
    // Struct types implementing the Printable interface
    type Book struct {
        Title string
    }
    
    type Article struct {
        Title string
        Content string
    }
    
    // Implement String() method to fulfill the contract
    func (b Book) String() string {
        return b.Title
    }
    
    // Implement String() method to fulfill the contract
    func (a Article) String() string {
        return a.Title
    }

    Here, both Book and Article types implement the Printable interface by providing a String() method. This allows us to treat them interchangeably in functions expecting Printable values.

    2. Interface Values and Dynamic Typing

    An interface variable does not hold a value of some abstract “interface type” itself. Instead, it holds a value of an underlying concrete type that implements the interface, along with that type’s identity. Go uses this dynamic type information to determine the actual type at runtime. This allows for flexible operations like:

    func printAll(printables []Printable) {
        for _, p := range printables {
            fmt.Println(p.String()) // Calls the appropriate String() based on concrete type
        }
    }
    
    book := Book{Title: "Go for Beginners"}
    article := Article{Title: "The power of interfaces"}
    
    printables := []Printable{book, article}
    printAll(printables)

    The printAll function takes a slice of Printable and iterates over it. Go dynamically invokes the correct String() method based on the concrete type of each element (Book or Article) within the slice.

    3. Embedded Interfaces and Interface Inheritance

    Go interfaces support embedding existing interfaces to create more complex contracts. This allows for code reuse and hierarchical relationships, further enhancing the flexibility of your code:

    type Writer interface {
        Write(data []byte) (int, error)
    }
    
    type ReadWriter interface {
        Writer
        Read([]byte) (int, error)
    }
    
    type MyFile struct {
        // ... file data and methods
    }
    
    // *MyFile satisfies both Writer and ReadWriter by providing these methods
    func (f *MyFile) Write(data []byte) (int, error) {
        // ... write data to file
    }
    
    func (f *MyFile) Read(data []byte) (int, error) {
        // ... read data from file
    }

    Here, ReadWriter inherits all methods from the embedded Writer interface, effectively creating a more specific “read-write” contract.

    4. The Empty Interface and Its Power

    The special interface{} represents the empty interface, meaning it requires no specific methods. This seemingly simple concept unlocks powerful capabilities:

    // Function accepting any type using the empty interface
    func PrintAnything(value interface{}) {
        fmt.Println(reflect.TypeOf(value), value)
    }
    
    PrintAnything(42)  // Output: int 42
    PrintAnything("Hello") // Output: string Hello
    PrintAnything(MyFile{}) // Output: main.MyFile {}

    This function can accept any type because interface{} has no requirements. Internally, Go uses reflection to extract the actual type and value at runtime, enabling generic operations.

    5. Understanding Interface Equality and Comparisons

    Equality checks on interface values involve both the dynamic type and underlying value:

    book1 := Book{Title: "Go for Beginners"}
    book2 := Book{Title: "Go for Beginners"}
    
    // Same type and value, so equal
    fmt.Println(book1 == book2) // True
    
    differentBook := Book{Title: "Go for Dummies"}
    
    // Same type, different value, so not equal
    fmt.Println(book1 == differentBook) // False
    
    article := Article{Title: "Go for Beginners"}
    
    // This will cause a compilation error
    fmt.Println(book1 == article) // Error: invalid operation: book1 == article (mismatched types Book and Article)

    However, it’s essential to remember that interface values can be compared directly with the == operator: two interface values are equal only when they hold the same dynamic type and equal underlying values. Be aware that if the dynamic type is not comparable (a slice or map, for example), the comparison panics at runtime.
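    To make the dynamic-type-and-value rule concrete, here is a small runnable sketch. It uses condensed versions of the Book and Article types from earlier, both reduced to a Title field:

    ```go
    package main

    import "fmt"

    type Printable interface{ String() string }

    type Book struct{ Title string }

    func (b Book) String() string { return b.Title }

    type Article struct{ Title string }

    func (a Article) String() string { return a.Title }

    func main() {
    	var p1 Printable = Book{Title: "Go"}
    	var p2 Printable = Book{Title: "Go"}
    	var p3 Printable = Article{Title: "Go"}

    	fmt.Println(p1 == p2) // true: same dynamic type, equal underlying values
    	fmt.Println(p1 == p3) // false: different dynamic types, no compile error
    }
    ```

    Unlike comparing the concrete structs directly, comparing two interface values of different dynamic types compiles fine and simply evaluates to false.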

    To compare interface values effectively, you can utilize two main approaches:

    1. Type Assertions:
    These allow you to safely access the underlying value and perform comparisons if you’re certain about the actual type:

    func getBookTitleFromPrintable(p Printable) (string, bool) {
        book, ok := p.(Book) // Check if p is a Book
        if ok {
            return book.Title, true
        }
        return "", false // Return empty string and false if not a Book
    }
    
    bookTitle, ok := getBookTitleFromPrintable(article)
    if ok {
        fmt.Println("Extracted book title:", bookTitle)
    } else {
        fmt.Println("Article is not a Book")
    }

    2. Custom Comparison Functions:
    You can also create dedicated functions to compare interface values based on specific criteria:

    func comparePrintablesByTitle(p1, p2 Printable) bool {
        return p1.String() == p2.String()
    }
    
    fmt.Println(comparePrintablesByTitle(book1, article)) // Compares titles regardless of types

    Understanding these limitations and adopting appropriate comparison techniques ensures accurate and meaningful comparisons with Go interfaces.

    6. Interface Methods and Pointer Receivers

    Methods can be declared with a pointer receiver, which enables them to modify the state of the value they are called on. Note that when a method uses a pointer receiver, only the pointer type (here, *MyCounter) satisfies the interface:

    type Counter interface {
        Increment() int
    }
    
    type MyCounter struct {
        count int
    }
    
    func (c *MyCounter) Increment() int {
        c.count++
        return c.count
    }
    
    counter := MyCounter{count: 5}
    fmt.Println(counter.Increment()) // Output: 6

    The Increment method receives a pointer to MyCounter, allowing it to directly modify the count field.

    7. Error Handling and Interfaces

    Go interfaces play a crucial role in error handling. The built-in error interface defines a single method, Error() string, used to represent errors:

    type error interface {
        Error() string
    }
    
    // Custom error type implementing the error interface
    type MyError struct {
        message string
    }
    
    func (e MyError) Error() string {
        return e.message
    }
    
    func myFunction() error {
        // ... some operation
        return MyError{"Something went wrong"}
    }
    
    if err := myFunction(); err != nil {
        fmt.Println("Error:", err.Error()) // Prints "Something went wrong"
    }

    By adhering to the error interface, custom errors can be seamlessly integrated into Go’s error-handling mechanisms.

    8. Interface Values and Nil

    Interface values can be nil, indicating they don’t hold any concrete value. However, attempting to call methods on a nil interface value results in a panic.

    var printable Printable // nil interface value
    fmt.Println(printable.String()) // Panics!

    Always check for nil before calling methods on interface values.
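    A minimal defensive sketch of that rule, reusing the Printable interface from earlier (the describe helper is a hypothetical name, not part of any library):

    ```go
    package main

    import "fmt"

    type Printable interface{ String() string }

    type Book struct{ Title string }

    func (b Book) String() string { return b.Title }

    // describe guards against a nil interface value before calling String().
    func describe(p Printable) string {
    	if p == nil {
    		return "nothing to print"
    	}
    	return p.String()
    }

    func main() {
    	var p Printable                          // nil interface value
    	fmt.Println(describe(p))                 // safe: prints "nothing to print"
    	fmt.Println(describe(Book{Title: "Go"})) // prints "Go"
    }
    ```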

    However, it’s important to understand that an interface{} value doesn’t simply hold a reference to the underlying data. Internally, Go creates a special structure to store both the type information and the actual value. This hidden structure is often referred to as “boxing” the value.

    Imagine a small container holding both a label indicating the type (e.g., int, string) and the actual data itself. In the runtime, it looks something like this:

    type iface struct {
         tab   *itab
         data  unsafe.Pointer
    }

    Technically, this structure involves two components:

    • tab: This type descriptor carries details like the interface’s method set, the underlying type, and the methods of the underlying type that implement the interface.
    • data pointer: This pointer directly points to the memory location where the actual value resides.

    When you retrieve a value from an interface{}, Go performs “unboxing.” It reads the type information and data pointer and then creates a new variable of the appropriate type based on this information.

    This internal mechanism might seem complex, but the Go runtime handles it seamlessly. However, understanding this concept can give you deeper insights into how Go interfaces work under the hood.
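    You can observe this unboxing in action with a type switch, which consults the stored type information and hands you back a variable of the concrete type. A small sketch (the classify helper is a hypothetical name for illustration):

    ```go
    package main

    import "fmt"

    // classify "unboxes" an empty-interface value: the type switch reads the
    // dynamic type stored in the interface and binds x to the concrete type.
    func classify(v interface{}) string {
    	switch x := v.(type) {
    	case int:
    		return fmt.Sprintf("int: %d", x)
    	case string:
    		return fmt.Sprintf("string: %s", x)
    	default:
    		return fmt.Sprintf("other: %T", x)
    	}
    }

    func main() {
    	fmt.Println(classify(42))      // int: 42
    	fmt.Println(classify("hello")) // string: hello
    	fmt.Println(classify(3.14))    // other: float64
    }
    ```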

    9. Conclusion

    This journey through the magic of Go interfaces has hopefully given you a deeper understanding of their capabilities and how they work. We’ve explored how they go beyond simple method signatures to define contracts, enable dynamic behavior, and make code far more flexible.

    Remember, interfaces are not just tools for code reuse, but also powerful mechanisms for designing adaptable and maintainable applications.

    Here are some key takeaways to keep in mind:

    • Interfaces define contracts, not just method signatures.
    • Interfaces enable dynamic typing and flexible operations.
    • Embedded interfaces allow for hierarchical relationships and code reuse.
    • The empty interface unlocks powerful generic capabilities.
    • Understand the nuances of interface equality and comparisons.
    • Interfaces play a crucial role in Go’s error-handling mechanisms.
    • Be mindful of nil interface values and potential panics.


  • Mastering Prow: A Guide to Developing Your Own Plugin for Kubernetes CI/CD Workflow

    Continuous Integration and Continuous Delivery (CI/CD) pipelines are essential components of modern software development, especially in the world of Kubernetes and containerized applications. To facilitate these pipelines, many organizations use Prow, a CI/CD system built specifically for Kubernetes. While Prow offers a rich set of features out of the box, you may need to develop your own plugins to tailor the system to your organization’s requirements. In this guide, we’ll explore the world of Prow plugin development and show you how to get started.

    Prerequisites

    Before diving into Prow plugin development, ensure you have the following prerequisites:

    • Basic Knowledge of Kubernetes and CI/CD Concepts: Familiarity with Kubernetes concepts such as Pods, Deployments, and Services, as well as understanding CI/CD principles, will be beneficial for understanding Prow plugin development.
    • Access to a Kubernetes Cluster: You’ll need access to a Kubernetes cluster for testing your plugins. If you don’t have one already, you can set up a local cluster using tools like Minikube or use a cloud provider’s managed Kubernetes service.
    • Prow Setup: Install and configure Prow in your Kubernetes cluster. You can follow the Velotio Technologies guide, Getting Started with Prow: A Kubernetes-Native CI/CD Framework, for setup instructions.
    • Development Environment Setup: Ensure you have Git, Go, and Docker installed on your local machine for developing and testing Prow plugins. You’ll also need to configure your environment to interact with your organization’s Prow setup.

    The Need for Custom Prow Plugins

    While Prow provides a wide range of built-in plugins, your organization’s Kubernetes workflow may have specific requirements that aren’t covered by these defaults. This is where developing custom Prow plugins comes into play. Custom plugins allow you to extend Prow’s functionality to cater to your needs. Whether automating workflows, integrating with other tools, or enforcing custom policies, developing your own Prow plugins gives you the power to tailor your CI/CD pipeline precisely.

    Getting Started with Prow Plugin Development

    Developing a custom Prow plugin may seem daunting, but with the right approach and tools, it can be a rewarding experience. Here’s a step-by-step guide to get you started:

    1. Set Up Your Development Environment

    Before diving into plugin development, you need to set up your development environment. You will need Git, Go, and access to a Kubernetes cluster for testing your plugins. Ensure you have the necessary permissions to make changes to your organization’s Prow setup.

    2. Choose a Plugin Type

    Prow supports various plugin types, including postsubmits, presubmits, triggers, and utilities. Choose the type that best fits your use case.

    • Postsubmits: These plugins are executed after the code is merged and are often used for tasks like publishing artifacts or creating release notes.
    • Presubmits: Presubmit plugins run before code is merged, typically used for running tests and ensuring code quality.
    • Triggers: Trigger plugins allow you to trigger custom jobs based on specific events or criteria.
    • Utilities: Utility plugins offer reusable functions and utilities for other plugins.

    3. Create Your Plugin

    Once you’ve chosen a plugin type, it’s time to create it. Below is an example of a simple Prow plugin written in Go, named comment-plugin.go. It will create a comment on a pull request each time an event is received.

    This code sets up a basic HTTP server that listens for GitHub events and handles them by creating a comment using the GitHub API. Customize this code to fit your specific use case.

    package main
    
    import (
        "encoding/json"
        "flag"
        "net/http"
        "os"
        "strconv"
        "time"
    
        "github.com/sirupsen/logrus"
        "k8s.io/test-infra/pkg/flagutil"
        "k8s.io/test-infra/prow/config"
        "k8s.io/test-infra/prow/config/secret"
        prowflagutil "k8s.io/test-infra/prow/flagutil"
        configflagutil "k8s.io/test-infra/prow/flagutil/config"
        "k8s.io/test-infra/prow/github"
        "k8s.io/test-infra/prow/interrupts"
        "k8s.io/test-infra/prow/logrusutil"
        "k8s.io/test-infra/prow/pjutil"
        "k8s.io/test-infra/prow/pluginhelp"
        "k8s.io/test-infra/prow/pluginhelp/externalplugins"
    )
    
    const pluginName = "comment-plugin"
    
    type options struct {
        port int
    
        config                 configflagutil.ConfigOptions
        dryRun                 bool
        github                 prowflagutil.GitHubOptions
        instrumentationOptions prowflagutil.InstrumentationOptions
    
        webhookSecretFile string
    }
    
    type server struct {
        tokenGenerator func() []byte
        botUser        *github.UserData
        email          string
        ghc            github.Client
        log            *logrus.Entry
        repos          []github.Repo
    }
    
    func helpProvider(_ []config.OrgRepo) (*pluginhelp.PluginHelp, error) {
        pluginHelp := &pluginhelp.PluginHelp{
           Description: `The sample plugin`,
        }
        return pluginHelp, nil
    }
    
    func (o *options) Validate() error {
        return nil
    }
    
    func gatherOptions() options {
        o := options{config: configflagutil.ConfigOptions{ConfigPath: "./config.yaml"}}
        fs := flag.NewFlagSet(os.Args[0], flag.ExitOnError)
        fs.IntVar(&o.port, "port", 8888, "Port to listen on.")
        fs.BoolVar(&o.dryRun, "dry-run", false, "Dry run for testing. Uses API tokens but does not mutate.")
        fs.StringVar(&o.webhookSecretFile, "hmac-secret-file", "/etc/hmac", "Path to the file containing GitHub HMAC secret.")
        for _, group := range []flagutil.OptionGroup{&o.github} {
           group.AddFlags(fs)
        }
        fs.Parse(os.Args[1:])
        return o
    }
    
    func main() {
        o := gatherOptions()
        if err := o.Validate(); err != nil {
           logrus.Fatalf("Invalid options: %v", err)
        }
    
        logrusutil.ComponentInit()
        log := logrus.StandardLogger().WithField("plugin", pluginName)
    
        if err := secret.Add(o.webhookSecretFile); err != nil {
           logrus.WithError(err).Fatal("Error starting secrets agent.")
        }
    
        gitHubClient, err := o.github.GitHubClient(o.dryRun)
        if err != nil {
           logrus.WithError(err).Fatal("Error getting GitHub client.")
        }
    
        email, err := gitHubClient.Email()
        if err != nil {
           log.WithError(err).Fatal("Error getting bot e-mail.")
        }
    
        botUser, err := gitHubClient.BotUser()
        if err != nil {
           logrus.WithError(err).Fatal("Error getting bot name.")
        }
        repos, err := gitHubClient.GetRepos(botUser.Login, true)
        if err != nil {
           log.WithError(err).Fatal("Error listing bot repositories.")
        }
        serv := &server{
           tokenGenerator: secret.GetTokenGenerator(o.webhookSecretFile),
           botUser:        botUser,
           email:          email,
           ghc:            gitHubClient,
           log:            log,
           repos:          repos,
        }
    
        health := pjutil.NewHealthOnPort(o.instrumentationOptions.HealthPort)
        health.ServeReady()
    
        mux := http.NewServeMux()
        mux.Handle("/", serv)
        externalplugins.ServeExternalPluginHelp(mux, log, helpProvider)
        logrus.Info("starting server " + strconv.Itoa(o.port))
        httpServer := &http.Server{Addr: ":" + strconv.Itoa(o.port), Handler: mux}
        defer interrupts.WaitForGracefulShutdown()
        interrupts.ListenAndServe(httpServer, 5*time.Second)
    }
    
    func (s *server) ServeHTTP(w http.ResponseWriter, r *http.Request) {
        logrus.Info("inside http server")
        _, _, payload, ok, _ := github.ValidateWebhook(w, r, s.tokenGenerator)
        if !ok {
           // Invalid signature; ValidateWebhook has already written the error response.
           return
        }
        logrus.Info(string(payload))
        logrus.Info("Event received. Have a nice day.")
        if err := s.handleEvent(payload); err != nil {
           logrus.WithError(err).Error("Error parsing event.")
        }
    }
    
    func (s *server) handleEvent(payload []byte) error {
        logrus.Info("inside handler")
        var pr github.PullRequestEvent
        if err := json.Unmarshal(payload, &pr); err != nil {
           return err
        }
        logrus.Info(pr.Number)
        if err := s.ghc.CreateComment(pr.PullRequest.Base.Repo.Owner.Login, pr.PullRequest.Base.Repo.Name, pr.Number, "comment from comment-plugin"); err != nil {
           return err
        }
        return nil
    }
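For context, the `github.ValidateWebhook` call in `ServeHTTP` verifies the request's HMAC signature header against a digest of the payload computed with the shared webhook secret (the file passed via `--hmac-secret-file`). Below is a simplified Python sketch of that scheme — an illustration of the idea, not Prow's actual implementation:

```python
import hashlib
import hmac

def valid_signature(secret: bytes, payload: bytes, signature_header: str) -> bool:
    """Check a GitHub-style webhook signature header.

    GitHub sends "sha256=<hex digest>", where the digest is the
    HMAC-SHA256 of the raw request body keyed with the shared secret.
    """
    expected = "sha256=" + hmac.new(secret, payload, hashlib.sha256).hexdigest()
    # Constant-time comparison prevents timing attacks on the signature.
    return hmac.compare_digest(expected, signature_header)
```

When the check fails, the handler must stop processing the request, which is exactly what the early return on `!ok` does in the Go code above.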

    4. Deploy Your Plugin

    To deploy your custom Prow plugin, you will need to create a Docker image and deploy it into your Prow cluster.

    FROM golang as app-builder
    WORKDIR /app
    RUN apt-get update && apt-get install -y git
    COPY . .
    RUN CGO_ENABLED=0 go build -o main
    
    FROM alpine:3.9
    RUN apk add ca-certificates git
    COPY --from=app-builder /app/main /app/custom-plugin
    ENTRYPOINT ["/app/custom-plugin"]

    docker build -t jainbhavya65/custom-plugin:v1 .

    docker push jainbhavya65/custom-plugin:v1

    Deploy the Docker image using Kubernetes deployment:
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: comment-plugin
    spec:
      progressDeadlineSeconds: 600
      replicas: 1
      revisionHistoryLimit: 10
      selector:
        matchLabels:
          app: comment-plugin
      strategy:
        rollingUpdate:
          maxSurge: 25%
          maxUnavailable: 25%
        type: RollingUpdate
      template:
        metadata:
          creationTimestamp: null
          labels:
            app: comment-plugin
        spec:
          containers:
          - args:
            - --github-token-path=/etc/github/oauth
            - --hmac-secret-file=/etc/hmac-token/hmac
            - --port=80
            image: <IMAGE>
            imagePullPolicy: Always
            name: comment-plugin
            ports:
            - containerPort: 80
              protocol: TCP
            volumeMounts:
            - mountPath: /etc/github
              name: oauth
              readOnly: true
            - mountPath: /etc/hmac-token
              name: hmac
              readOnly: true
          volumes:
          - name: oauth
            secret:
              defaultMode: 420
              secretName: oauth-token
          - name: hmac
            secret:
              defaultMode: 420
              secretName: hmac-token

    Create a service for deployment:
    apiVersion: v1
    kind: Service
    metadata:
      name: comment-plugin
    spec:
      ports:
      - port: 80
        protocol: TCP
        targetPort: 80
      selector:
        app: comment-plugin
      sessionAffinity: None
      type: ClusterIP

    After creating the deployment and service, integrate it into your organization’s Prow configuration. This involves updating your Prow plugin.yaml files to include your plugin and specify when it should run.

    external_plugins: 
    - name: comment-plugin
      # No endpoint specified implies "http://{{name}}", since the plugin is deployed in the same cluster.
      # If the plugin runs outside the cluster, specify its endpoint here.
      events:
      # Only pull_request and issue_comment events are sent to our plugin.
      - pull_request
      - issue_comment

    Conclusion

    Mastering Prow plugin development opens up a world of possibilities for tailoring your Kubernetes CI/CD workflow to meet your organization’s needs. While the initial learning curve may be steep, the benefits of custom plugins in terms of automation, efficiency, and control are well worth the effort.

    Remember that the key to successful Prow plugin development lies in clear documentation, thorough testing, and collaboration with your team to ensure that your custom plugins enhance your CI/CD pipeline’s functionality and reliability. As Kubernetes and containerized applications continue to evolve, Prow will remain a valuable tool for managing your CI/CD processes, and your custom plugins will be the secret sauce that sets your workflow apart from the rest.

  • The Ultimate Guide to Disaster Recovery for Your Kubernetes Clusters

    Kubernetes allows us to run a containerized application at scale without drowning in the details of application load balancing. You can ensure high availability for your applications running on Kubernetes by running multiple replicas (pods) of the application. All the complexity of container orchestration is hidden away safely so that you can focus on developing applications instead of deploying them. Learn more about high availability of Kubernetes clusters and how you can use kubeadm for high availability in Kubernetes here.

    But using Kubernetes has its own challenges and getting Kubernetes up and running takes some real work. If you are not familiar with getting Kubernetes up and running, you might want to take a look here.

    Kubernetes allows us to have a zero downtime deployment, yet service interrupting events are inevitable and can occur at any time. Your network can go down, your latest application push can introduce a critical bug, or in the rarest case, you might even have to face a natural disaster.

    When you are using Kubernetes, sooner or later, you need to set up a backup. In case your cluster goes into an unrecoverable state, you will need a backup to go back to the previous stable state of the Kubernetes cluster.

    Why Backup and Recovery?

    There are three reasons why you need a backup and recovery mechanism in place for your Kubernetes cluster. These are:

    1. To recover from disasters: for example, someone accidentally deleted the namespace where your deployments reside.
    2. To replicate the environment: you want to replicate your production environment to staging before a major upgrade.
    3. To migrate your Kubernetes cluster: let's say you want to move your Kubernetes cluster from one environment to another.

    What to Backup?

    Now that you know why, let's see exactly what you need to back up. The two things you need to back up are:

    1. The state of your Kubernetes control plane, which is stored in etcd; you need to back up the etcd state to capture all the Kubernetes resources.
    2. If you have stateful containers (which you will in the real world), you need to back up the persistent volumes as well.

    How to Backup?

    There have been various tools, like Heptio Ark (now Velero) and kube-backup, to back up and restore Kubernetes clusters on cloud providers. But what if you are not using a managed Kubernetes cluster? You might have to get your hands dirty if you are running Kubernetes on bare metal, just like we are.

    We are running a 3-master Kubernetes cluster with one etcd member on each master. If we lose one master, we can still recover it because the etcd quorum is intact. But if we lose two masters, we need a mechanism to recover from such situations as well for production-grade clusters.
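The quorum arithmetic behind that statement is straightforward: an etcd cluster of n members needs a strict majority (⌊n/2⌋ + 1) to stay writable, so it can tolerate n minus that majority in failures. A quick sketch:

```python
def etcd_fault_tolerance(members: int) -> int:
    """How many member failures an etcd cluster survives.

    Raft requires a strict majority (quorum) of members to commit
    writes, so the tolerance is the member count minus the quorum.
    """
    quorum = members // 2 + 1
    return members - quorum
```

With 3 members the quorum is 2: losing one master is recoverable, but losing two drops the cluster below quorum, which is exactly when a snapshot restore becomes necessary.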

    Want to know how to set up multi-master Kubernetes cluster? Keep reading!

    Taking etcd backup:

    The mechanism for taking an etcd backup depends on how the etcd cluster is set up in your Kubernetes environment.

    There are two ways to set up an etcd cluster in a Kubernetes environment:

    1. Internal etcd cluster: etcd runs as containers/pods inside the Kubernetes cluster, and it is the responsibility of Kubernetes to manage those pods.
    2. External etcd cluster: etcd runs outside the Kubernetes cluster, usually as Linux services, with its endpoints provided to the Kubernetes cluster to write to.

    Backup Strategy for Internal Etcd Cluster:

    To take a backup from inside an etcd pod, we will use the Kubernetes CronJob functionality, which does not require any etcdctl client to be installed on the host.

    Following is the definition of Kubernetes CronJob which will take etcd backup every minute:

    apiVersion: batch/v1beta1
    kind: CronJob
    metadata:
      name: backup
      namespace: kube-system
    spec:
      # activeDeadlineSeconds: 100
      schedule: "*/1 * * * *"
      jobTemplate:
        spec:
          template:
            spec:
              containers:
              - name: backup
                # Same image as in /etc/kubernetes/manifests/etcd.yaml
                image: k8s.gcr.io/etcd:3.2.24
                env:
                - name: ETCDCTL_API
                  value: "3"
                command: ["/bin/sh"]
                args: ["-c", "etcdctl --endpoints=https://127.0.0.1:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt --key=/etc/kubernetes/pki/etcd/healthcheck-client.key snapshot save /backup/etcd-snapshot-$(date +%Y-%m-%d_%H:%M:%S_%Z).db"]
                volumeMounts:
                - mountPath: /etc/kubernetes/pki/etcd
                  name: etcd-certs
                  readOnly: true
                - mountPath: /backup
                  name: backup
              restartPolicy: OnFailure
              hostNetwork: true
              volumes:
              - name: etcd-certs
                hostPath:
                  path: /etc/kubernetes/pki/etcd
                  type: DirectoryOrCreate
              - name: backup
                hostPath:
                  path: /data/backup
                  type: DirectoryOrCreate

    Backup Strategy for External Etcd Cluster:

    If you are running your etcd cluster as a service on Linux hosts, you should set up a Linux cron job to back up the cluster.

    Run the following command to save an etcd snapshot:

    ETCDCTL_API=3 etcdctl --endpoints $ENDPOINT snapshot save /path/for/backup/snapshot.db

    Disaster Recovery

    Now, let's say the Kubernetes cluster went completely down and we need to recover it from the etcd snapshot.

    First, start the etcd cluster, then run kubeadm init on the master node with the etcd endpoints.

    Make sure you put the backed-up certificates into the /etc/kubernetes/pki folder before running kubeadm init, so it picks up the same certificates.

    Restore Strategy for Internal Etcd Cluster:

    docker run --rm \
        -v '/data/backup:/backup' \
        -v '/var/lib/etcd:/var/lib/etcd' \
        --env ETCDCTL_API=3 \
        k8s.gcr.io/etcd:3.2.24 \
        /bin/sh -c "etcdctl snapshot restore '/backup/etcd-snapshot-2018-12-09_11:12:05_UTC.db' ; mv /default.etcd/member/ /var/lib/etcd/"
    
    kubeadm init --ignore-preflight-errors=DirAvailable--var-lib-etcd

    Restore Strategy for External Etcd Cluster

    Restore the etcd on 3 nodes using following commands:

    ETCDCTL_API=3 etcdctl snapshot restore snapshot-188.db \
    --name master-0 \
    --initial-cluster master-0=http://10.0.1.188:2380,master-1=http://10.0.1.136:2380,master-2=http://10.0.1.155:2380 \
    --initial-cluster-token my-etcd-token \
    --initial-advertise-peer-urls http://10.0.1.188:2380

    ETCDCTL_API=3 etcdctl snapshot restore snapshot-136.db \
    --name master-1 \
    --initial-cluster master-0=http://10.0.1.188:2380,master-1=http://10.0.1.136:2380,master-2=http://10.0.1.155:2380 \
    --initial-cluster-token my-etcd-token \
    --initial-advertise-peer-urls http://10.0.1.136:2380

    ETCDCTL_API=3 etcdctl snapshot restore snapshot-155.db \
    --name master-2 \
    --initial-cluster master-0=http://10.0.1.188:2380,master-1=http://10.0.1.136:2380,master-2=http://10.0.1.155:2380 \
    --initial-cluster-token my-etcd-token \
    --initial-advertise-peer-urls http://10.0.1.155:2380

    The above three commands will give you three restored folders on the three nodes, named master-0.etcd, master-1.etcd, and master-2.etcd.

    Now, stop the etcd service on all the nodes, replace each node's data folder with its restored folder, and start the etcd service again. You will see all the nodes at first, but after some time only the master node stays Ready while the other nodes go into the NotReady state. You need to join those two nodes again using the existing ca.crt file (you should have a backup of it).

    Run the following command on master node:

    kubeadm token create --print-join-command

    It will give you the kubeadm join command; add an --ignore-preflight-errors flag and run that command on the other two nodes for them to come back into the Ready state.

    Conclusion

    One way to deal with master failure is to set up a multi-master Kubernetes cluster, but even that does not let you skip etcd backup and restore entirely: it is still possible to accidentally destroy data in an HA environment.

    Need help with disaster recovery for your Kubernetes Cluster? Connect with the experts at Velotio!

    For more insights into Kubernetes disaster recovery, check out here.

  • Simplifying MySQL Sharding with ProxySQL: A Step-by-Step Guide

    Introduction:

    ProxySQL is a powerful SQL-aware proxy designed to sit between database servers and client applications, optimizing database traffic with features like load balancing, query routing, and failover. This article focuses on simplifying the setup of ProxySQL, especially for users implementing data-based sharding in a MySQL database.

    What is Sharding?

    Sharding involves partitioning a database into smaller, more manageable pieces called shards based on certain criteria, such as data attributes. ProxySQL supports data-based sharding, allowing users to distribute data across different shards based on specific conditions.
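Conceptually, data-based sharding is just a routing function from an attribute value to a shard. As a toy illustration in Python (the shard names and continent layout here are illustrative, chosen to mirror the example later in this guide):

```python
# Toy illustration of attribute-based sharding: each continent's rows
# live on exactly one shard, and a router picks the shard per query.
SHARD_MAP = {
    "Asia": "shard_1",
    "North America": "shard_1",
    "Africa": "shard_2",
    "South America": "shard_2",
    "Europe": "shard_3",
}

def route(continent: str, default: str = "shard_1") -> str:
    """Return the shard holding rows for the given continent."""
    return SHARD_MAP.get(continent, default)
```

ProxySQL does the same thing declaratively: its query rules match an attribute in the SQL text and forward the statement to the hostgroup that owns that slice of the data.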

    Understanding the Need for ProxySQL:

    ProxySQL is an intermediary layer that enhances database management, monitoring, and optimization. With features like data-based sharding, ProxySQL is an ideal solution for scenarios where databases need to be distributed based on specific data attributes, such as geographic regions.

    Installation & Setup:

    ProxySQL can be installed in two ways: via packages or by running it in a Docker container. For this guide, we will focus on the Docker installation.

    1. Install ProxySQL and MySQL Docker Images:

    To start, pull the necessary Docker images for ProxySQL and MySQL using the following commands:

    docker pull mysql:latest
    docker pull proxysql/proxysql

    2. Create Docker Network:

    Create a Docker network for communication between MySQL containers:

    docker network create multi-tenant-network

    Note: The ProxySQL setup needs connections to multiple MySQL servers, so we will set up multiple MySQL servers inside a Docker network.

    Containers within the same Docker network can communicate with each other using their container names or IP addresses.

    You can check the list of all the Docker networks currently present by running the following command:

    docker network ls

    3. Set Up MySQL Containers:

    Now, create three MySQL containers within the network:

    Note: We can create any number of MySQL containers.

    docker run -d --name mysql_host_1 --network=multi-tenant-network -p 3307:3306 -e MYSQL_ROOT_PASSWORD=pass123 mysql:latest 
    docker run -d --name mysql_host_2 --network=multi-tenant-network -p 3308:3306 -e MYSQL_ROOT_PASSWORD=pass123 mysql:latest 
    docker run -d --name mysql_host_3 --network=multi-tenant-network -p 3309:3306 -e MYSQL_ROOT_PASSWORD=pass123 mysql:latest

    Note: Adjust port numbers as necessary. 

    The default MySQL protocol port is 3306, but since we cannot expose all three MySQL containers on the same host port, we map them to 3307, 3308, and 3309. Internally, each MySQL container still listens on port 3306.

    --network=multi-tenant-network specifies that the container should be created inside the specified network.

    We have also specified the root password of the MySQL container to log into it, where the username is “root” and the password is “pass123” for all three of them.

    After running the above three commands, three MySQL containers will start running inside the network. You can connect to these three hosts using host = localhost or 127.0.0.1 and port = 3307 / 3308 / 3309.

    To ping the port, use the following command:

    for macOS:

    nc -zv 127.0.0.1 3307

    for Windows (PowerShell): 

    Test-NetConnection 127.0.0.1 -Port 3307

    for Linux: 

    telnet 127.0.0.1 3307
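As a cross-platform alternative to the per-OS commands above, a few lines of Python can perform the same TCP reachability check (the host and port below are the example values from this setup):

```python
import socket

def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns an error code instead of raising; 0 means success.
        return sock.connect_ex((host, port)) == 0

# Example: check the first MySQL container's published port.
# port_open("127.0.0.1", 3307)
```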


    4. Create Users in MySQL Containers:

    Create “user_shard” and “monitor” users in each MySQL container.

    The “user_shard” user will be used by the proxy to make queries to the DB.

    The “monitor” user will be used by the proxy to monitor the DB.

    Note: To access the MySQL container mysql_host_1, use the command:

    docker exec -it mysql_host_1 mysql -uroot -ppass123

    Use the following commands inside the MySQL container to create the user:

    CREATE USER 'user_shard'@'%' IDENTIFIED BY 'pass123'; 
    GRANT ALL PRIVILEGES ON *.* TO 'user_shard'@'%' WITH GRANT OPTION; 
    FLUSH PRIVILEGES;
    
    CREATE USER monitor@'%' IDENTIFIED BY 'pass123'; 
    GRANT ALL PRIVILEGES ON *.* TO monitor@'%' WITH GRANT OPTION; 
    FLUSH PRIVILEGES;

    Repeat the above steps for mysql_host_2 & mysql_host_3.

    If, at any point, you need to drop the user, you can use the following command:

    DROP USER monitor@'%';

    5. Prepare ProxySQL Configuration:

    To prepare the configuration, we will need the IP addresses of the MySQL containers. To find those, we can use the following command:

    docker inspect mysql_host_1;
    docker inspect mysql_host_2; 
    docker inspect mysql_host_3;

    By running these commands, you will get all the details of the MySQL Docker container under a field named “IPAddress” inside your network. That is the IP address of that particular MySQL container.

    Example:
    mysql_host_1: 172.19.0.2

    mysql_host_2: 172.19.0.3

    mysql_host_3: 172.19.0.4


    Now, create a ProxySQL configuration file named proxysql.cnf. Include details such as IP addresses of MySQL containers, administrative credentials, and MySQL users.

    Below is the content that needs to be added to the proxysql.cnf file:

    datadir="/var/lib/proxysql"
    
    admin_variables=
    {
        admin_credentials="admin:admin;radmin:radmin"
        mysql_ifaces="0.0.0.0:6032"
        refresh_interval=2000
        hash_passwords=false
    }
    
    mysql_variables=
    {
        threads=4
        max_connections=2048
        default_query_delay=0
        default_query_timeout=36000000
        have_compress=true
        poll_timeout=2000
        interfaces="0.0.0.0:6033;/tmp/proxysql.sock"
        default_schema="information_schema"
        stacksize=1048576
        server_version="5.1.30"
        connect_timeout_server=10000
        monitor_history=60000
        monitor_connect_interval=200000
        monitor_ping_interval=200000
        ping_interval_server_msec=10000
        ping_timeout_server=200
        commands_stats=true
        sessions_sort=true
        monitor_username="monitor"
        monitor_password="pass123"
    }
    
    mysql_servers =
    (
        { address="172.19.0.2" , port=3306 , hostgroup=10, max_connections=100 },
        { address="172.19.0.3" , port=3306 , hostgroup=20, max_connections=100 },
        { address="172.19.0.4" , port=3306 , hostgroup=30, max_connections=100 }
    )
    
    
    mysql_users =
    (
        { username = "user_shard" , password = "pass123" , default_hostgroup = 10 , active = 1 },
        { username = "user_shard" , password = "pass123" , default_hostgroup = 20 , active = 1 },
        { username = "user_shard" , password = "pass123" , default_hostgroup = 30 , active = 1 }
    )

    Most of the settings are default; we won’t go into much detail for each setting. 

    admin_variables: These variables are used for ProxySQL’s administrative interface. It allows you to connect to ProxySQL and perform administrative tasks such as configuring runtime settings, managing servers, and monitoring performance.

    Within mysql_variables, monitor_username and monitor_password specify the credentials ProxySQL uses when connecting to the MySQL servers for monitoring. This monitoring user executes queries and gathers statistics about the health and performance of the MySQL servers. It is the user we created during step 4.

    mysql_servers contains all the MySQL servers we want ProxySQL to connect to. Each entry has the IP address of the MySQL container, the port, a hostgroup, and max_connections. mysql_users holds all the users we created during step 4.

    6. Run ProxySQL Container:

    Inside the same directory where the proxysql.cnf file is located, run the following command to start ProxySQL:

    docker run -d --rm -p 6032:6032 -p 6033:6033 -p 6080:6080 --name=proxysql --network=multi-tenant-network -v $PWD/proxysql.cnf:/etc/proxysql.cnf proxysql/proxysql

    Here, port 6032 is used for ProxySQL’s administrative interface. It allows you to connect to ProxySQL and perform administrative tasks such as configuring runtime settings, managing servers, and monitoring performance.

    Port 6033 is the default port for ProxySQL’s MySQL protocol interface. It is used for handling MySQL client connections. Our application will use it to access the ProxySQL db and make SQL queries.

    The above command will make ProxySQL run on our Docker with the configuration provided in the proxysql.cnf file.

    Inside ProxySQL Container:

    7. Access ProxySQL Admin Console:

    Now, to access the ProxySQL Docker container, use the following command:

    docker exec -it proxysql bash

    Now, once you’re inside the ProxySQL Docker container, you can access the ProxySQL admin console using the command:

    mysql -u admin -padmin -h 127.0.0.1 -P 6032

    You can run the following queries to get insights into your ProxySQL server:

    i) To get the list of all the connected MySQL servers:

    SELECT * FROM mysql_servers;

    ii) Verify the status of the MySQL backends in the monitor database tables in ProxySQL admin using the following command:

    SHOW TABLES FROM monitor;


    If this returns an empty set, it means that the monitor username and password are not set correctly. You can set them using the following commands:

    UPDATE global_variables SET variable_value='monitor' WHERE variable_name='mysql-monitor_username'; 
    UPDATE global_variables SET variable_value='pass123' WHERE variable_name='mysql-monitor_password';
    LOAD MYSQL VARIABLES TO RUNTIME; 
    SAVE MYSQL VARIABLES TO DISK;

    And then restart the ProxySQL container (e.g., docker restart proxysql).

    iii) Check the status of DBs connected to ProxySQL using the following command:

    SELECT * FROM monitor.mysql_server_connect_log ORDER BY time_start_us DESC;

    iv) To get a list of all the ProxySQL global variables, use the following command:

    SELECT * FROM global_variables; 

    v) To get all the queries made on ProxySQL, use the following command:

    SELECT * FROM stats_mysql_query_digest;

    Note: Whenever you change any configuration rows, use the commands below to load them to runtime and persist them to disk:

    Change in variables:

    LOAD MYSQL VARIABLES TO RUNTIME; 
    SAVE MYSQL VARIABLES TO DISK;
    
    Change in mysql_servers:
    LOAD MYSQL SERVERS TO RUNTIME;
    SAVE MYSQL SERVERS TO DISK;
    
    Change in mysql_query_rules:
    LOAD MYSQL QUERY RULES TO RUNTIME;
    SAVE MYSQL QUERY RULES TO DISK;

    And then restart the proxy docker container.

    IMPORTANT:

    To connect to ProxySQL’s admin console, first get into the Docker container using the following command:

    docker exec -it proxysql bash

    Then, to access the ProxySQL admin console, use the following command:

    mysql -u admin -padmin -h 127.0.0.1 -P6032

    To access the ProxySQL MySQL console, we can directly access it using the following command without going inside the Docker ProxySQL container:

    mysql -u user_shard -ppass123 -h 127.0.0.1 -P6033

    To make queries to the database, we make use of ProxySQL’s 6033 port, where MySQL is being accessed.

    8. Define Query Rules:

    We can add custom query rules inside the mysql_query_rules table to redirect queries to specific databases based on defined patterns. Load the rules to runtime and save to disk.

    9. Sharding Example:

    Now, let’s illustrate how to leverage ProxySQL’s data-based sharding capabilities through a practical example. We’ll create three MySQL containers, each containing data from different continents in the “world” database, specifically within the “countries” table.

    Step 1: Create 3 MySQL containers named mysql_host_1, mysql_host_2 & mysql_host_3.

    Inside all containers, create a database named “world” with a table named “countries”.

    i) Inside mysql_host_1: Insert countries using the following query:

    INSERT INTO `countries` VALUES (1,'India','Asia'),(2,'Japan','Asia'),(3,'China','Asia'),(4,'USA','North America'),(5,'Cuba','North America'),(6,'Honduras','North America');

    ii) Inside mysql_host_2: Insert countries using the following query:

    INSERT INTO `countries` VALUES (1,'Kenya','Africa'),(2,'Ghana','Africa'),(3,'Morocco','Africa'),(4,'Brazil','South America'),(5,'Chile','South America'),(6,'Argentina','South America');

    iii) Inside mysql_host_3: Insert countries using the following query:

    INSERT INTO `countries` VALUES (1,'Italy','Europe'),(2,'Germany','Europe'),(3,'France','Europe');

    Now, we have distinct data sets: Asia & North America in mysql_host_1, Africa & South America in mysql_host_2, and Europe in mysql_host_3.

    Step 2: Define Query Rules for Sharding

    Let’s create custom query rules to redirect queries based on the continent specified in the SQL statement.

    For example, if the query contains the continent “Asia,” we want it to be directed to mysql_host_1.

    -- Query Rule for Asia and North America 

    INSERT INTO mysql_query_rules (rule_id, active, username, match_pattern, destination_hostgroup, apply) VALUES (10, 1, 'user_shard', '\s*continent\s*=\s*.*?(Asia|North America).*?\s*', 10, 0);

    -- Query Rule for Africa and South America

    INSERT INTO mysql_query_rules (rule_id, active, username, match_pattern, destination_hostgroup, apply) VALUES (20, 1, 'user_shard', '\s*continent\s*=\s*.*?(Africa|South America).*?\s*', 20, 0);

    -- Query Rule for Europe 

    INSERT INTO mysql_query_rules (rule_id, active, username, match_pattern, destination_hostgroup, apply) VALUES (30, 1, 'user_shard', '\s*continent\s*=\s*.*?(Europe).*?\s*', 30, 0);
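Since match_pattern is a plain regular expression that ProxySQL applies to the incoming SQL text, you can sanity-check the patterns before loading them, for example in Python. Note the `\s` escapes, which are easy to lose when copy-pasting through a shell; `destination_hostgroup` below is a hypothetical helper that mimics the rule table, not a ProxySQL API:

```python
import re

# The three rule patterns and their target hostgroups.
RULES = [
    (r"\s*continent\s*=\s*.*?(Asia|North America)", 10),
    (r"\s*continent\s*=\s*.*?(Africa|South America)", 20),
    (r"\s*continent\s*=\s*.*?(Europe)", 30),
]

def destination_hostgroup(query: str, default: int = 10) -> int:
    """Return the hostgroup the first matching rule routes the query to."""
    for pattern, hostgroup in RULES:
        if re.search(pattern, query):
            return hostgroup
    return default  # ProxySQL falls back to the user's default_hostgroup
```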

    Step 3: Apply and Save Query Rules

    After adding the query rules, ensure they take effect by running the following commands:

    LOAD MYSQL QUERY RULES TO RUNTIME; 
    SAVE MYSQL QUERY RULES TO DISK;

    Step 4: Test Sharding

    Now, access the MySQL server using the ProxySQL port and execute queries:

    mysql -u user_shard -ppass123 -h 127.0.0.1 -P 6033

    use world;

    -- Example queries:

    SELECT * FROM countries WHERE id = 1 AND continent = 'Asia';

    -- Returns id=1, name=India, continent=Asia (served by mysql_host_1)

    SELECT * FROM countries WHERE id = 1 AND continent = 'Africa';

    -- Returns id=1, name=Kenya, continent=Africa (served by mysql_host_2)

    Based on the defined query rules, the queries will be redirected to the specified MySQL host groups. If no rule matches, the default host group specified for the user in mysql_users inside proxysql.cnf will be used.
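    The routing behavior can be sketched in plain JavaScript. This is a hypothetical simulation of how ProxySQL evaluates its rule chain, not ProxySQL's actual implementation; the patterns assume the properly escaped `\s` form of the rules:

```javascript
// Hypothetical sketch of ProxySQL's rule evaluation: rules are checked in
// ascending rule_id order, and the first matching pattern decides the
// destination hostgroup; if nothing matches, the default hostgroup is used.
const rules = [
  { ruleId: 10, pattern: /\s*continent\s*=\s*.*?(Asia|North America)/i, hostgroup: 10 },
  { ruleId: 20, pattern: /\s*continent\s*=\s*.*?(Africa|South America)/i, hostgroup: 20 },
  { ruleId: 30, pattern: /\s*continent\s*=\s*.*?(Europe)/i, hostgroup: 30 },
];

function routeQuery(sql, defaultHostgroup = 0) {
  const ordered = [...rules].sort((a, b) => a.ruleId - b.ruleId);
  for (const rule of ordered) {
    if (rule.pattern.test(sql)) return rule.hostgroup;
  }
  return defaultHostgroup; // falls back to the hostgroup from mysql_users
}

console.log(routeQuery("SELECT * FROM countries WHERE continent = 'Asia'"));   // 10
console.log(routeQuery("SELECT * FROM countries WHERE continent = 'Europe'")); // 30
console.log(routeQuery("SELECT 1"));                                           // 0
```

    The `apply = 0` flag in the real rules means ProxySQL keeps evaluating subsequent rules; here the simulation simply stops at the first match for clarity.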

    Conclusion:

    ProxySQL simplifies access to distributed data through effective sharding strategies. Its query rules, built on regex patterns and host group definitions, offer significant flexibility with relative simplicity.

    By following this step-by-step guide, users can quickly set up ProxySQL and leverage its capabilities to optimize database performance and achieve efficient data distribution.

    References:

    Download and Install ProxySQL – ProxySQL

    How to configure ProxySQL for the first time – ProxySQL

    Admin Variables – ProxySQL

  • A Comprehensive Guide to Unlock Kafka MirrorMaker 2.0

    Overview

    In this post, we cover how Kafka MirrorMaker operates, how to set it up, and how to verify mirrored data.

    MirrorMaker 2.0 is the new replication feature of Kafka 2.4, defined in KIP-382 (Kafka Improvement Proposal 382). Kafka MirrorMaker 2 is designed to replicate or mirror topics from one Kafka cluster to another. It uses the Kafka Connect framework to simplify configuration and scaling. MirrorMaker dynamically detects changes to source topics and keeps source and target topic properties synchronized, including topic data, offsets, and partitions. When a new topic is created in the source cluster, it is replicated to the target cluster together with its data, offsets, and partitions.

    Use Cases

    Disaster Recovery

    Though Kafka is highly distributed and provides a high level of fault tolerance, disasters can still happen, and data can still become temporarily unavailable—or lost altogether. The best way to mitigate the risks is to have a copy of your data in another Kafka cluster in a different data center. MirrorMaker translates and syncs consumer offsets to the target cluster. That way, we can switch clients to it relatively seamlessly, moving to an alternative deployment on the fly with minor or no service interruptions.

    Closer Read / Writes

    Kafka producer clients often prefer to write locally to achieve low latency, but business requirements may demand that the data be read by different consumers, often deployed in multiple regions. This can easily make deployments complex due to VPC peering. MirrorMaker can handle the cross-region replication, so clients keep simple local write and read paths.

    Data Analytics

    Aggregation is also a factor in data pipelines, which might require the consolidation of data from regional Kafka clusters into a single one. That aggregate cluster then broadcasts that data to other clusters and/or data systems for analysis and visualization.

    Supported Topologies

    • Active/Passive or Active/Standby high availability deployments – (ClusterA => ClusterB)
    • Active/Active HA Deployment – (ClusterA => ClusterB and ClusterB => ClusterA)
    • Aggregation (e.g., from many clusters to one): (ClusterA => ClusterK, ClusterB => ClusterK, ClusterC => ClusterK)
    • Fan-out (opposite of Aggregation): (ClusterK => ClusterA, ClusterK => ClusterB, ClusterK => ClusterC)
    • Forwarding: (ClusterA => ClusterB, ClusterB => ClusterC, ClusterC => ClusterD)

    Salient Features of MirrorMaker 2

    • Mirrors Topic and Topic Configuration – Detects and mirrors new topics and config changes automatically, including the number of partitions and replication factors.
    • Mirrors ACLs – Mirrors Topic ACLs as well, though we found issues in replicating WRITE permission. Also, replicated topics often contain source cluster names as a prefix, which means existing ACLs need to be tweaked, or ACL replication may need to be managed externally if the topologies are more complex.
    • Mirrors Consumer Groups and Offsets – Seamlessly translates and syncs Consumer Group Offsets to target clusters to make it easier to switch from one cluster to another in case of disaster.
    • Ability to Update MM2 Config Dynamically – MirrorMaker is backed by Kafka Connect Framework, which provides REST APIs through which MirrorMaker configurations like replicating new topics, stopping replicating certain topics, etc. can be updated without restarting the cluster.
    • Fault-Tolerant and Horizontally Scalable Operations – The number of processes can be scaled horizontally to increase performance.

    How Kafka MirrorMaker 2 Works

    MirrorMaker uses a set of standard Kafka connectors. Each connector has its own role. The listing of connectors and their functions is provided below.

    • MirrorSourceConnector: Replicates topics, topic ACLs, and configs from the source cluster to the target cluster.
    • MirrorCheckpointConnector: Syncs consumer offsets, emits checkpoints, and enables failover.
    • MirrorHeartbeatConnector: Emits heartbeats to check connectivity between the source and target clusters.

    MirrorMaker Running Modes

    There are three ways to run MirrorMaker:

    • As a dedicated MirrorMaker cluster (can be distributed with multiple replicas having the same config): In this mode, MirrorMaker does not require an existing Connect cluster. Instead, a high-level driver manages a collection of Connect workers.
    • As a standalone Connect worker: In this mode, a single Connect worker runs MirrorSourceConnector. This does not support multi-clusters, but it’s useful for small workloads or for testing.
    • In legacy mode, using existing MirrorMaker scripts: once legacy MirrorMaker is deprecated, the existing ./bin/kafka-mirror-maker.sh scripts will be updated to run MM2 in legacy mode.

    Setting up MirrorMaker 2

    We recommend running MirrorMaker as a dedicated MirrorMaker cluster since it does not require an existing Connect cluster. Instead, a high-level driver manages a collection of Connect workers. The cluster can be easily converted to a distributed cluster just by adding multiple replicas of the same configuration. A distributed cluster is required to reduce the load on a single node cluster and also to increase MirrorMaker throughput.

    Prerequisites

    • Docker
    • Docker Compose

    Steps to Set Up MirrorMaker 2

    Set up a single-node source cluster, a single-node target cluster, and a MirrorMaker node to run MirrorMaker 2.

    1. Clone repository:

    git clone https://gitlab.com/velotio/kafka-mirror-maker.git

    2. Run the below command to start the Kafka clusters and the MirrorMaker Docker container:

    docker-compose up -d

    3. Log in to the mirror-maker Docker container:

    docker exec -it $(docker ps | grep "mirror-maker-node-1" | awk '{print $1}') bash

    4. Start MirrorMaker:

    connect-mirror-maker.sh ./mirror-maker-config.properties

    5. Monitor the logs of the MirrorMaker container—it should be something like this: 

    • [2024-02-05 04:07:39,450] INFO [MirrorCheckpointConnector|task-0] sync idle consumer group offset from source to target took 0 ms (org.apache.kafka.connect.mirror.Scheduler:95)
    • [2024-02-05 04:07:49,246] INFO [MirrorCheckpointConnector|worker] refreshing consumer groups took 1 ms (org.apache.kafka.connect.mirror.Scheduler:95)
    • [2024-02-05 04:07:49,337] INFO [MirrorSourceConnector|worker] refreshing topics took 3 ms (org.apache.kafka.connect.mirror.Scheduler:95)
    • [2024-02-05 04:07:49,450] INFO [MirrorCheckpointConnector|task-0] refreshing idle consumers group offsets at target cluster took 2 ms (org.apache.kafka.connect.mirror.Scheduler:95)

    6. Create a topic at the source cluster: 

    kafka-topics.sh --create --bootstrap-server source-kafka:9092 --topic test-topic --partitions 1 --replication-factor 1

    7. List topics and validate the topic: 

    kafka-topics.sh --list --bootstrap-server source-kafka:9092

    8. Produce 100 messages on the topic:

    for x in {1..100}; do echo "message $x"; done | kafka-console-producer.sh --broker-list source-kafka:9092 --topic test-topic

    9. Check whether the topic is mirrored in the target cluster.

    Note: The mirrored topic will have a source cluster name prefix to be able to identify which source cluster the topic is mirrored from.

    kafka-topics.sh --list --bootstrap-server target-kafka:9092
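    The prefix comes from MirrorMaker 2's DefaultReplicationPolicy, which derives the remote topic name from the source cluster alias plus a separator (a dot by default). A minimal sketch of that naming scheme:

```javascript
// Sketch of DefaultReplicationPolicy's remote topic naming:
// <source cluster alias><separator><original topic>, separator defaults to ".".
function remoteTopicName(sourceClusterAlias, topic, separator = ".") {
  return `${sourceClusterAlias}${separator}${topic}`;
}

console.log(remoteTopicName("source-kafka", "test-topic")); // "source-kafka.test-topic"
```

    This is why the mirrored topic in our setup shows up as source-kafka.test-topic on the target cluster.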

    10. Consume 5 messages from the source Kafka cluster:

    kafka-console-consumer.sh --bootstrap-server source-kafka:9092 --topic test-topic --max-messages 5 --consumer-property enable.auto.commit=true --consumer-property group.id=test-group --from-beginning

    11. Describe the consumer group at the source and destination to verify that consumer offsets are also mirrored:

    kafka-consumer-groups.sh --bootstrap-server source-kafka:9092 --group test-group --describe

    kafka-consumer-groups.sh --bootstrap-server target-kafka:9092 --group test-group --describe

    12. Consume five messages from the target Kafka cluster. Consumption should start from the offset committed in the source cluster; since the messages at offsets 0 through 4 were already consumed there, it resumes at offset 5.

    kafka-console-consumer.sh --bootstrap-server target-kafka:9092 --topic source-kafka.test-topic --max-messages 5 --consumer-property enable.auto.commit=true --consumer-property group.id=test-group --from-beginning
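    Conceptually, this failover works because MirrorCheckpointConnector emits checkpoints pairing upstream (source) offsets with downstream (target) offsets for each consumer group. A simplified, hypothetical model of that translation (not the actual connector API):

```javascript
// Hypothetical model of checkpoint-based offset translation. Each checkpoint
// pairs an upstream (source) offset with the corresponding downstream
// (target) offset for a consumer group.
const checkpoints = [
  { upstream: 0, downstream: 0 },
  { upstream: 5, downstream: 5 },
];

// Resume from the downstream offset of the latest checkpoint whose upstream
// offset does not exceed the group's committed upstream offset.
function translateOffset(committedUpstream) {
  let result = 0;
  for (const cp of checkpoints) {
    if (cp.upstream <= committedUpstream) result = cp.downstream;
  }
  return result;
}

// After consuming messages 0-4 at the source, the committed offset is 5,
// so consumption on the target resumes at offset 5.
console.log(translateOffset(5)); // 5
```

    In a fresh mirror like ours the mapping is effectively one-to-one, but checkpoints also handle the general case where offsets diverge between clusters.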

    Conclusion

    We’ve seen how to set up MirrorMaker 2.0 in a dedicated instance. This running mode does not need a running Connect cluster as it leverages a high-level driver that creates a set of Connect workers based on the MirrorMaker properties configuration file.

  • Beginner’s Guide for Writing Unit Test Cases with Jest Framework

    Prerequisite

    Basic JavaScript, TypeScript

    Objective

    To make the reader understand the use/effect of test cases in software development.

    What’s in it for you?

    In the world of coding, we’re often in a rush to complete work before a deadline hits. And let’s be honest, writing test cases isn’t usually at the top of our priority list. We get it—they seem tedious, so we’d rather skip this extra step. But here’s the thing: those seemingly boring lines of code have superhero potential. Don’t believe me? You will.

    In this blog, we’re going to break down the mystery around test cases. No jargon, just simple talk. We’ll chat about what they are, explore a handy tool called Jest, and uncover why these little lines are actually the unsung heroes of coding. So, let’s ditch the complications and discover why giving some attention to test cases can level up our coding game. Ready? Let’s dive in!

    What are test cases?

    A test case is a detailed document specifying conditions under which a developer assesses whether a software application aligns with customer requirements. It includes preconditions, the case name, input conditions, and expected results. Derived from test scenarios, test cases cover both positive and negative inputs, providing a roadmap for test execution. This one-time effort aids future regression testing.

    Test cases offer insights into testing strategy, process, preconditions, and expected outputs. Executed during testing, they ensure the software performs its intended tasks. Linking defects to test case IDs facilitates efficient defect reporting. The comprehensive documentation acts as a safeguard, catching any oversights during test case execution and reinforcing the development team’s efforts.

    Different types of test cases exist, including integration, functional, non-functional, and unit.
    For this blog, we will talk about unit test cases.

    What are unit test cases?

    Unit testing is the process of testing the smallest functional unit of code. A functional unit could be a class member or simply a function that does something to your input and provides an output. Test cases around those functional units are called unit test cases.

    Purpose of unit test cases

    • To validate that each unit of the software works as intended and meets the requirements:
      For example, if your requirement is that the function returns an object with specific properties, a unit test will detect whether the code is written accordingly.
    • To check the robustness of code:
      Unit tests are automated and run each time the code is changed to ensure that new code does not break existing functionality.
    • To check the errors and bugs beforehand:
      If a case fails or doesn’t fulfill the requirement, it helps the developer isolate the area and recheck it for bugs before testing on demo/UAT/staging.

    Different frameworks for writing unit test cases

    There are various frameworks for unit test cases, including:

    • Mocha
    • Storybook
    • Cypress
    • Jasmine
    • Puppeteer
    • Jest
    Source: https://raygun.com/blog/javascript-unit-testing-frameworks/

    Why Jest?

    Jest is used and recommended by Facebook and officially supported by the React dev team.

    It has a great community and active support, so if you run into a problem and can’t find a solution in the comprehensive documentation, there are thousands of developers out there who could help you figure it out within hours.

    1. Performance: Ideal for larger projects with continuous deployment needs, Jest delivers enhanced performance.

    2. Compatibility: While Jest is widely used for testing React applications, it seamlessly integrates with other frameworks like Angular, Node, Vue, and Babel-based projects.

    3. Auto Mocking: Jest automatically mocks imported libraries in test files, reducing boilerplate and facilitating smoother testing workflows.

    4. Extended API: Jest comes with a comprehensive API, eliminating the necessity for additional libraries in most cases.

    5. Timer Mocks: Featuring a Time mocking system, Jest accelerates timeout processes, saving valuable testing time.

    6. Active Development & Community: Jest undergoes continuous improvement, boasting the most active community support for rapid issue resolution and updates.

    Components of a test case in Jest

    Describe

    • As the name indicates, a describe block describes the module we are going to test.
    • It should only describe the module, not the tests; the describe block itself is not a test, since Jest runs assertions only inside it blocks.

    It

    • Here, the actual code is executed and its output verified against real or fake (spy, mock) values.
      We can nest multiple it blocks under one describe block.
    • It’s good practice to state what the test does or doesn’t do in the it block’s description.

    Matchers

    • Matchers compare the actual output with the expected (real or fake) output.
    • A test case without a matcher asserts nothing, so it always passes and is therefore trivial.
    // For each unit test you write,
    // answer these questions:
    
    describe('What component aspect are you testing?', () => {
        it('What should the feature do?', () => {
            const actual = 'What is the actual output?'
            const expected = 'What is the expected output?'
    
            expect(actual).toEqual(expected) // matcher
    
        })
      })

    Mocks and spies in Jest

    Mocks: They are objects or functions that simulate the behavior of real components. They are used to create controlled environments for testing by replacing actual components with simulated ones. Mocks are employed to isolate the code being tested, ensuring that the test focuses solely on the unit or component under examination without interference from external dependencies.

    jest.mock is mainly used for mocking a library or function that is used frequently throughout the file under test.

    Let Code.ts be the file you want to test.

    import { v4 as uuidv4 } from 'uuid';
    
    export const functionToTest = () => {
    
        const id = uuidv4();
        // rest of the code
        return id;
    
    };

    As this is a unit test, we won’t be testing the uuidv4 function itself, so we will mock the whole uuid module using jest.mock.

    import { functionToTest } from './Code';
    
    // Mock the uuid module: its uuidv4 export now returns a fixed value.
    jest.mock('uuid', () => ({ uuidv4: () => 'random id value' }));
    
    describe('testing Code.ts', () => {
        it('i have mocked uuid module', () => {
    
            const res = functionToTest();
            expect(res).toEqual('random id value');
        });
    
    });

    And that’s it. You have mocked the entire uuid module: whenever the code under test imports uuid during the test, it gets the fake uuidv4, and calling it returns 'random id value'.

    Spies: They are functions or objects that “spy” on other functions by tracking calls made to them. They allow you to observe and verify the behavior of functions during testing. Spies are useful for checking if certain functions are called, how many times they are called, and with what arguments. They help ensure that functions are interacting as expected.

    This is by far the most used method, as it works on object values and thus can be used to spy on class methods efficiently.

    class DataService {
        fetchData() {
            // code to fetch data
            return 'real data';
        }
    }

    describe('DataService Class', () => {
    
        it('should spy on the fetchData method with mockImplementation', () => {
            const dataServiceInstance = new DataService();
            // spy on the method via the prototype, where class methods live
            const fetchDataSpy = jest.spyOn(DataService.prototype, 'fetchData');
            fetchDataSpy.mockImplementation(() => 'Mocked Data'); // returned whenever the method is called
    
            const result = dataServiceInstance.fetchData(); // 'Mocked Data'
            expect(fetchDataSpy).toHaveBeenCalledTimes(1);
            expect(result).toBe('Mocked Data');
        });
    
    });

    Mocking a database call

    One of the best uses of Jest is mocking a database call, i.e., mocking the create, read, update, and delete calls for a database table.

    We can achieve this with Jest spies alone.

    Let us suppose we have a database object called db containing many tables, among them a Student table, and we want to mock the call that creates a Student.

    async function AddStudent(student: Student) {
        await db.Student.create(student); // the call we want to mock
    }

    Now, since the Jest spy method only works on objects, we first replace db.Student with an object whose create method is a jest.fn() (a mock function that records calls without invoking any real implementation).

    describe('mocking database call', () => {
            it('mocking create function', async () => {
                db.Student = {
                    create: jest.fn()
                };
    
                const tempStudent = {
                    name: 'john',
                    age: '12',
                    Rollno: 12
                };
    
                const mock = jest.spyOn(db.Student, 'create')
                    .mockResolvedValue('Student has been created successfully');
    
                await AddStudent(tempStudent);
                expect(mock).toHaveBeenCalledWith(tempStudent);
    
            });
    
        });

    Testing private methods

    Sometimes, in development, we write private methods in classes that can only be used within the class itself. But when writing test cases, we call functions through a class instance, so the private methods are not accessible to us and seemingly cannot be tested.

    However, core JavaScript has no concept of private and public members; the private modifier is introduced by TypeScript. So we can actually test a private method like a normal public one by placing a //@ts-ignore comment just above the call.

    class Test {
    
        private private_fun() {
            console.log("i am in private function");
            return "i am in private function";
        }
    
    }
    
    describe('Testing test class', () => {
            it('testing private function', () => {
                const test = new Test();
    
                // calling the private method with a ts-ignore comment
    
                //@ts-ignore
                const res = test.private_fun(); // output ->> "i am in private function"
                expect(res).toEqual("i am in private function");
    
            });
        });

    P.S. One thing to note is that this trick only works with TypeScript’s private modifier, which exists only at compile time; at runtime the method is callable like any other.

    The importance of test cases in software development

    Makes code agile:

    In software development, you may have to change the structure or design of your code to add new features. Changing already-tested code can be risky and costly. With unit tests in place, you only need to add tests for the new code instead of re-verifying the entire program by hand.

    Improves code quality:

    A lot of bugs in software development occur due to unforeseen edge cases. If you forget to handle a single input, you may face a major bug in your application. Writing unit tests forces you to think carefully about the edge cases of every function in your application.

    Provides Documentation:

    Unit tests give a basic idea of what the code does and which use cases the program covers. This makes documentation easier and increases the readability and understandability of the code. Other developers can read the unit tests at any time to understand the program better and start working on it quickly.

    Easy Debugging:

    Unit testing makes debugging a lot easier and quicker. If a test fails, you only need to debug the latest changes made to the code instead of the entire program. This early isolation also pays off at the next stage, integration testing, where failures are easier to localize.

    Conclusion

    So, if you made it to the end, you must have some understanding of the importance of test cases in your code.

    We’ve covered why Jest is a strong framework choice and how to write your first test cases with it. And now, you can be more confident in providing bug-free, robust, clean, documented, and tested code in your next MR/PR.