Category: Blogs

  • How to Setup HashiCorp Vault HA Cluster with Integrated Storage (Raft)

    As businesses move their data to the public cloud, one of the most pressing concerns is keeping it safe from unauthorized access.

    Using a tool like HashiCorp Vault gives you greater control over your sensitive credentials and helps you meet cloud security compliance requirements.

    In this blog, we’ll walk you through a HashiCorp Vault High Availability setup.

    HashiCorp Vault

    HashiCorp Vault is an open-source tool that provides a secure, reliable way to store and distribute sensitive information such as API keys, access tokens, and passwords. Vault protects this information with high-level policy management, secret leasing, audit logging, and automatic revocation, accessible through its UI, CLI, or HTTP API.

    High Availability

    Vault can run in High Availability (HA) mode to protect against outages by running multiple Vault servers. In HA mode, Vault servers have two additional states: active and standby. Within a Vault cluster, only a single instance is active and handles all requests; the standby instances forward requests to the active instance.

    Integrated Storage Raft

    The Integrated Storage backend is used to maintain Vault’s data. Unlike other storage backends, Integrated Storage does not operate from a single source of data. Instead, all the nodes in a Vault cluster will have a replicated copy of Vault’s data. Data gets replicated across all the nodes via the Raft Consensus Algorithm.

    Raft is officially supported by HashiCorp.

    Architecture

    Prerequisites

    This setup requires Vault, sudo access on the machines, and the configuration below to create the cluster.

    • Install Vault v1.6.3+ent or later on all nodes in the Vault cluster 

    In this example, we have 3 CentOS VMs provisioned using VMware.

    Setup

    1. Verify the Vault version on all the nodes using the below command (in this case, we have 3 nodes node1, node2, node3):

    vault --version

    2. Configure SSL certificates

    Note: Vault should always be used with TLS in production to provide secure communication between clients and the Vault server. It requires a certificate file and key file on each Vault host.

    We can generate SSL certs for the Vault cluster on the master node and copy them to the other nodes in the cluster.

    Refer to: https://developer.hashicorp.com/vault/tutorials/secrets-management/pki-engine#scenario-introduction for generating SSL certs.
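    If you just need certificates for a lab environment, a self-signed CA and server certificate can also be generated with openssl. This is a lab-only sketch, not the tutorial's PKI-engine method: the CN values, key size, and validity below are placeholders.

```shell
# Lab-only example: generate a self-signed CA, then sign a server cert with it.
# CN values and validity are placeholders; adjust for your own hostnames.
openssl req -x509 -newkey rsa:4096 -nodes -days 365 \
  -keyout tls_ca.key -out tls_ca.pem -subj "/CN=vault-lab-ca"

# Create a private key and CSR for the Vault node
openssl req -newkey rsa:4096 -nodes \
  -keyout tls.key -out tls.csr \
  -subj "/CN=node1.int.us-west-1-dev.central.example.com"

# Sign the CSR with the lab CA
openssl x509 -req -in tls.csr -CA tls_ca.pem -CAkey tls_ca.key \
  -CAcreateserial -days 365 -out tls.crt

# Confirm the chain validates
openssl verify -CAfile tls_ca.pem tls.crt
```

    For production, prefer certificates issued by a real CA or by Vault's own PKI engine as in the linked tutorial.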

    • Copy tls.crt, tls.key, and tls_ca.pem to /etc/vault.d/ssl/ 
    • Change ownership to `vault`
    [user@node1 ~]$ cd /etc/vault.d/ssl/           
    [user@node1 ssl]$ sudo chown vault:vault tls*

    • Copy tls* from /etc/vault.d/ssl to the other nodes in the cluster

    3. Configure the enterprise license. Copy license on all nodes:

    cp /root/vault.hclic /etc/vault.d/vault.hclic
    chown root:vault /etc/vault.d/vault.hclic
    chmod 0640 /etc/vault.d/vault.hclic

    4. Create the storage directory for raft storage on all nodes:

    sudo mkdir --parents /opt/raft
    sudo chown --recursive vault:vault /opt/raft

    5. Set firewall rules on all nodes:

    sudo firewall-cmd --permanent --add-port=8200/tcp
    sudo firewall-cmd --permanent --add-port=8201/tcp
    sudo firewall-cmd --reload

    6. Create vault configuration file on all nodes:

    ### Node 1 ###
    [user@node1 vault.d]$ cat vault.hcl
    storage "raft" {
        path = "/opt/raft"
        node_id = "node1"
        retry_join {
            leader_api_addr = "https://node2.int.us-west-1-dev.central.example.com:8200"
            leader_ca_cert_file = "/etc/vault.d/ssl/tls_ca.pem"
            leader_client_cert_file = "/etc/vault.d/ssl/tls.crt"
            leader_client_key_file = "/etc/vault.d/ssl/tls.key"
        }
        retry_join {
            leader_api_addr = "https://node3.int.us-west-1-dev.central.example.com:8200"
            leader_ca_cert_file = "/etc/vault.d/ssl/tls_ca.pem"
            leader_client_cert_file = "/etc/vault.d/ssl/tls.crt"
            leader_client_key_file = "/etc/vault.d/ssl/tls.key"
        }
    }
    
    listener "tcp" {
       address = "0.0.0.0:8200"
       tls_disable = false
       tls_cert_file = "/etc/vault.d/ssl/tls.crt"
       tls_key_file = "/etc/vault.d/ssl/tls.key"
       tls_client_ca_file = "/etc/vault.d/ssl/tls_ca.pem"
       tls_cipher_suites = "TLS_TEST_128_GCM_SHA256,
                            TLS_TEST_128_GCM_SHA256,
                            TLS_TEST20_POLY1305,
                            TLS_TEST_256_GCM_SHA384,
                            TLS_TEST20_POLY1305,
                            TLS_TEST_256_GCM_SHA384"
    }
    api_addr = "https://node1.int.us-west-1-dev.central.example.com:8200"
    cluster_addr = "https://node1.int.us-west-1-dev.central.example.com:8201"
    disable_mlock = true
    ui = true
    log_level = "trace"
    disable_cache = true
    cluster_name = "POC"
    
    # Enterprise license_path
    # This will be required for enterprise as of v1.8
    license_path = "/etc/vault.d/vault.hclic"

    ### Node 2 ###
    [user@node2 vault.d]$ cat vault.hcl
    storage "raft" {
        path = "/opt/raft"
        node_id = "node2"
        retry_join {
            leader_api_addr = "https://node1.int.us-west-1-dev.central.example.com:8200"
            leader_ca_cert_file = "/etc/vault.d/ssl/tls_ca.pem"
            leader_client_cert_file = "/etc/vault.d/ssl/tls.crt"
            leader_client_key_file = "/etc/vault.d/ssl/tls.key"
        }
        retry_join {
            leader_api_addr = "https://node3.int.us-west-1-dev.central.example.com:8200"
            leader_ca_cert_file = "/etc/vault.d/ssl/tls_ca.pem"
            leader_client_cert_file = "/etc/vault.d/ssl/tls.crt"
            leader_client_key_file = "/etc/vault.d/ssl/tls.key"
        }
    }
    
    listener "tcp" {
       address = "0.0.0.0:8200"
       tls_disable = false
       tls_cert_file = "/etc/vault.d/ssl/tls.crt"
       tls_key_file = "/etc/vault.d/ssl/tls.key"
       tls_client_ca_file = "/etc/vault.d/ssl/tls_ca.pem"
       tls_cipher_suites = "TLS_TEST_128_GCM_SHA256,
                            TLS_TEST_128_GCM_SHA256,
                            TLS_TEST20_POLY1305,
                            TLS_TEST_256_GCM_SHA384,
                            TLS_TEST20_POLY1305,
                            TLS_TEST_256_GCM_SHA384"
    }
    api_addr = "https://node2.int.us-west-1-dev.central.example.com:8200"
    cluster_addr = "https://node2.int.us-west-1-dev.central.example.com:8201"
    disable_mlock = true
    ui = true
    log_level = "trace"
    disable_cache = true
    cluster_name = "POC"
    
    # Enterprise license_path
    # This will be required for enterprise as of v1.8
    license_path = "/etc/vault.d/vault.hclic"

    ### Node 3 ###
    [user@node3 ~]$ cat /etc/vault.d/vault.hcl
    storage "raft" {
        path = "/opt/raft"
        node_id = "node3"
        retry_join {
            leader_api_addr = "https://node1.int.us-west-1-dev.central.example.com:8200"
            leader_ca_cert_file = "/etc/vault.d/ssl/tls_ca.pem"
            leader_client_cert_file = "/etc/vault.d/ssl/tls.crt"
            leader_client_key_file = "/etc/vault.d/ssl/tls.key"
        }
        retry_join {
            leader_api_addr = "https://node2.int.us-west-1-dev.central.example.com:8200"
            leader_ca_cert_file = "/etc/vault.d/ssl/tls_ca.pem"
            leader_client_cert_file = "/etc/vault.d/ssl/tls.crt"
            leader_client_key_file = "/etc/vault.d/ssl/tls.key"
        }
    }
    
    listener "tcp" {
       address = "0.0.0.0:8200"
       tls_disable = false
       tls_cert_file = "/etc/vault.d/ssl/tls.crt"
       tls_key_file = "/etc/vault.d/ssl/tls.key"
       tls_client_ca_file = "/etc/vault.d/ssl/tls_ca.pem"
       tls_cipher_suites = "TLS_TEST_128_GCM_SHA256,
                            TLS_TEST_128_GCM_SHA256,
                            TLS_TEST20_POLY1305,
                            TLS_TEST_256_GCM_SHA384,
                            TLS_TEST20_POLY1305,
                            TLS_TEST_256_GCM_SHA384"
    }
    api_addr = "https://node3.int.us-west-1-dev.central.example.com:8200"
    cluster_addr = "https://node3.int.us-west-1-dev.central.example.com:8201"
    disable_mlock = true
    ui = true
    log_level = "trace"
    disable_cache = true
    cluster_name = "POC"
    
    # Enterprise license_path
    # This will be required for enterprise as of v1.8
    license_path = "/etc/vault.d/vault.hclic"

    7. Set environment variables on all nodes:

    export VAULT_ADDR=https://$(hostname):8200
    export VAULT_CACERT=/etc/vault.d/ssl/tls_ca.pem
    export CA_CERT=$(cat /etc/vault.d/ssl/tls_ca.pem)

    8. Start Vault as a service on all nodes:

    You can view the systemd unit file with `cat /etc/systemd/system/vault.service`. Then enable and start the service:

    systemctl enable vault.service
    systemctl start vault.service
    systemctl status vault.service
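    If you are curious what that unit file typically contains, a minimal sketch follows. The binary path and hardening directives vary by installation, so treat this as illustrative rather than the exact file your package installs.

```ini
# /etc/systemd/system/vault.service (illustrative; your package may differ)
[Unit]
Description=HashiCorp Vault
Documentation=https://developer.hashicorp.com/vault/docs
Requires=network-online.target
After=network-online.target

[Service]
User=vault
Group=vault
ExecStart=/usr/bin/vault server -config=/etc/vault.d/vault.hcl
ExecReload=/bin/kill --signal HUP $MAINPID
Restart=on-failure
LimitNOFILE=65536

[Install]
WantedBy=multi-user.target
```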

    9. Check Vault status on all nodes:

    vault status

    10. Initialize Vault with the following command on vault node 1 only. Store unseal keys securely.

    [user@node1 vault.d]$ vault operator init -key-shares=1 -key-threshold=1
    Unseal Key 1: HPY/g5OiT8ivD6L4Bqfjx9L1We2MVb4WZAqKZk6zFf8=
    Initial Root Token: hvs.j4qTq1IZP9nscILMtN2p9GE0
    Vault initialized with 1 key shares and a key threshold of 1.
    Please securely distribute the key shares printed above. 
    When the Vault is re-sealed, restarted, or stopped, you must supply at least 1 of these keys to unseal it
    before it can start servicing requests.
    Vault does not store the generated root key. 
    Without at least 1 keys to reconstruct the root key, Vault will remain permanently sealed!
    It is possible to generate new unseal keys, provided you have a
    quorum of existing unseal keys shares. See "vault operator rekey" for more information.

    11. Set the Vault token environment variable so the vault CLI can authenticate to the server. Use the following command, replacing <initial-root-token> with the value generated in the previous step.

    export VAULT_TOKEN=<initial-root-token>
    echo "export VAULT_TOKEN=$VAULT_TOKEN" >> /root/.bash_profile
    ### Repeat this step for the other 2 servers.

    12. Unseal Vault1 using the unseal key generated in step 10. Notice the Unseal Progress value change as you present each key. After the key threshold is met, the Sealed value changes from true to false.

    [user@node1 vault.d]$ vault operator unseal HPY/g5OiT8ivD6L4Bqfjx9L1We2MVb4WZAqKZk6zFf8=
    Key                         Value
    ---                         -----
    Seal Type                   shamir
    Initialized                 true
    Sealed                      false
    Total Shares                1
    Threshold                   1
    Version                     1.11.0
    Build Date                  2022-06-17T15:48:44Z
    Storage Type                raft
    Cluster Name                POC
    Cluster ID                  109658fe-36bd-7d28-bf92-f095c77e860c
    HA Enabled                  true
    HA Cluster                  https://node1.int.us-west-1-dev.central.example.com:8201
    HA Mode                     active
    Active Since                2022-06-29T12:50:46.992698336Z
    Raft Committed Index        36
    Raft Applied Index          36

    13. Unseal Vault2 (Use the same unseal key generated in step 10 for Vault1):

    [user@node2 vault.d]$ vault operator unseal HPY/g5OiT8ivD6L4Bqfjx9L1We2MVb4WZAqKZk6zFf8=
    Key                Value
    ---                -----
    Seal Type          shamir
    Initialized        true
    Sealed             true
    Total Shares       1
    Threshold          1
    Unseal Progress    0/1
    Unseal Nonce       n/a
    Version            1.11.0
    Build Date         2022-06-17T15:48:44Z
    Storage Type       raft
    HA Enabled         true
    
    [user@node2 vault.d]$ vault status
    Key                   Value
    ---                   -----
    Seal Type             shamir
    Initialized           true
    Sealed                true
    Total Shares          1
    Threshold             1
    Version               1.11.0
    Build Date            2022-06-17T15:48:44Z
    Storage Type          raft
    Cluster Name          POC
    Cluster ID            109658fe-36bd-7d28-bf92-f095c77e860c
    HA Enabled            true
    HA Cluster            https://node1.int.us-west-1-dev.central.example.com:8201
    HA Mode               standby
    Active Node Address   https://node1.int.us-west-1-dev.central.example.com:8200
    Raft Committed Index  37
    Raft Applied Index    37

    14. Unseal Vault3 (Use the same unseal key generated in step 10 for Vault1):

    [user@node3 ~]$ vault operator unseal HPY/g5OiT8ivD6L4Bqfjx9L1We2MVb4WZAqKZk6zFf8=
    Key                Value
    ---                -----
    Seal Type          shamir
    Initialized        true
    Sealed             true
    Total Shares       1
    Threshold          1
    Unseal Progress    0/1
    Unseal Nonce       n/a
    Version            1.11.0
    Build Date         2022-06-17T15:48:44Z
    Storage Type       raft
    HA Enabled         true
    
    [user@node3 ~]$ vault status
    Key                       Value
    ---                       -----
    Seal Type                 shamir
    Initialized               true
    Sealed                    false
    Total Shares              1
    Threshold                 1
    Version                   1.11.0
    Build Date                2022-06-17T15:48:44Z
    Storage Type              raft
    Cluster Name              POC
    Cluster ID                109658fe-36bd-7d28-bf92-f095c77e860c
    HA Enabled                true
    HA Cluster                https://node1.int.us-west-1-dev.central.example.com:8201
    HA Mode                   standby
    Active Node Address       https://node1.int.us-west-1-dev.central.example.com:8200
    Raft Committed Index      39
    Raft Applied Index        39

    15. Check the cluster’s raft status with the following command:

    [user@node3 ~]$ vault operator raft list-peers
    Node      Address                                            State       Voter
    ----      -------                                            -----       -----
    node1    node1.int.us-west-1-dev.central.example.com:8201    leader      true
    node2    node2.int.us-west-1-dev.central.example.com:8201    follower    true
    node3    node3.int.us-west-1-dev.central.example.com:8201    follower    true

    16. Currently, node1 is the active node. We can experiment to see what happens when node1 steps down from its active-node duty.

    In the terminal where VAULT_ADDR is set to: https://node1.int.us-west-1-dev.central.example.com, execute the step-down command.

    $ vault operator step-down # gracefully yields active duty (similar effect to stopping the node or its service)
    Success! Stepped down: https://node1.int.us-west-1-dev.central.example.com:8200

    In the terminal where VAULT_ADDR is set to https://node2.int.us-west-1-dev.central.example.com:8200, examine the raft peer set.

    [user@node2 ~]$ vault operator raft list-peers
    Node      Address                                            State       Voter
    ----      -------                                            -----       -----
    node1    node1.int.us-west-1-dev.central.example.com:8201    follower    true
    node2    node2.int.us-west-1-dev.central.example.com:8201    leader      true
    node3    node3.int.us-west-1-dev.central.example.com:8201    follower    true

    Conclusion 

    The Vault servers are now operational in High Availability mode. We can test this by writing a secret from either the active or a standby Vault instance and seeing it succeed, which exercises request forwarding. We can also shut down the active Vault instance (sudo systemctl stop vault) to simulate a system failure and watch a standby instance assume leadership.
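    A minimal smoke test of request forwarding, run once the cluster is unsealed, could look like the following; the mount path and secret name here are arbitrary examples.

```
# Run from a standby node: the request is forwarded to the active node.
vault secrets enable -path=kv kv-v2
vault kv put kv/smoke-test owner=ops env=poc
vault kv get kv/smoke-test
```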

  • Benefits of Intelligent Revenue Cycle Management (iRCM) in Healthcare

    Highlights

    Intelligent Revenue Cycle Management (iRCM) is driving a dramatic revolution in the healthcare industry. This innovative approach incorporates advanced technologies like Artificial Intelligence (AI), Machine Learning (ML), and other critical tools to revolutionize revenue cycles. AI and ML are at the forefront of this transformation, playing critical roles in optimizing Revenue Cycles. Beyond the commonly employed AI applications for eligibility/benefits verification and payment estimations, there is a broader scope for AI in RCM.

    A research study by Change Healthcare found that approximately two-thirds of healthcare facilities and health systems are leveraging AI in their revenue cycle management (RCM) processes. Among these organizations, 72% use AI for eligibility/benefits verification, and 64% employ it for payment estimations (likely due to the No Surprises Act). However, the role of AI in RCM extends beyond these areas.

    The 2022 State of Revenue Integrity survey by the National Association of Healthcare Revenue Integrity highlights other AI-enabled functions in RCM, including charge description master (CDM) maintenance, charge capture, denials management, payer contract management, physician credentialing, and claim auditing, among others.

    Shedding Light on the Power of AI and Advanced Technologies in iRCM

    In addition to AI, technologies such as natural language processing (NLP), robotic process automation (RPA), and advanced analytics improve Intelligent Revenue Cycle Management (iRCM). NLP enables the extraction of essential insights from unstructured data sources, hence assisting with exact coding, documentation, and compliance. RPA automates repetitive and rule-based processes, liberating resources and lowering administrative burdens. Advanced analytics provides data-driven insights, empowering healthcare organizations to make informed decisions, identify areas for improvement, and drive continuous optimization.

    The adoption of artificial intelligence and other technologies in iRCM represents a paradigm shift in healthcare financial management. It opens new avenues for healthcare organizations to increase revenue capture, save costs, streamline processes, and improve patient satisfaction.

    iRCM: Unleashing the Advantages of Advanced Technologies for A Healthcare System’s Financial Success

    iRCM leverages modern technologies to enhance financial success in healthcare. iRCM streamlines and optimizes the revenue cycle processes, leading to improved revenue capture, reduced costs, and enhanced financial outcomes for healthcare organizations. Here’s how Intelligent Revenue Cycle Management unleashes the advantages of advanced technologies like AI, ML, and RPA for healthcare’s financial success:

    AI in iRCM:
    • Claims Processing: AI algorithms can analyze and handle enormous amounts of claims data, automate claim validation, and detect trends or abnormalities that may result in denials or mistakes. This can save manual work, enhance accuracy, and shorten the claims processing cycle.
    • Prioritization and Workflow Optimization: Artificial intelligence can intelligently prioritize tasks based on urgency, complexity, or revenue impact. It can automate work and optimize workflow management, ensuring that high-priority matters are addressed quickly and efficiently.
    • Intelligent Coding and Documentation: AI techniques like Natural Language Processing (NLP) are vital in analyzing medical records, extracting relevant information, identifying medical concepts, and assisting in coding. AI models trained on large datasets provide intelligent code recommendations, streamlining workflows in revenue cycle management for healthcare organizations.
    ML in iRCM:
    • Predictive Analytics: ML algorithms can analyze historical data to identify trends, patterns, and correlations related to revenue cycle performance. By leveraging this information, organizations can predict potential issues, optimize processes, and make data-driven decisions to improve financial outcomes.
    • Denial Management: ML can help learn from past denials and identify patterns contributing to claim rejections. This enables proactive measures to be taken, such as improving the documentation or addressing common billing errors, to minimize future denials.
    RPA in iRCM:
    • Automated Data Entry and Extraction: RPA can automate repetitive manual tasks, like data entry from paper documents or extracting information from records. This reduces errors, saves time, and improves efficiency in revenue cycle processes.
    • Claim Status and Follow-up: RPA bots can automatically retrieve claim status from payer portals, update the system with real-time information, and initiate follow-up actions based on predefined rules. This streamlines the claims management process and enhances revenue capture.
    • Eligibility Verification: RPA can automate the verification of patient insurance eligibility by extracting relevant information from various sources and validating it against payer databases. This ensures accurate billing and minimizes claim denials.

    Revolutionize Your Revenue Cycle: Embrace Intelligent Revenue Cycle Management Powered by Advanced Technologies

    iRCM (Intelligent Revenue Cycle Management) represents a cutting-edge approach to optimizing financial success in the healthcare industry. By harnessing the power of advanced technologies like AI, ML, and RPA, iRCM revolutionizes revenue cycle management processes. R Systems, a renowned provider of intelligent Revenue Cycle Management Services, offers comprehensive solutions integrating AI algorithms for intelligent coding and documentation, ML models for predictive analytics, and RPA for streamlined automation.

    With R Systems’ innovative approach, healthcare organizations can unlock the full potential of their revenue cycle, ensuring accurate coding, minimizing claim denials, and maximizing revenue capture. By embracing R Systems’ expertise and leveraging intelligent technologies, healthcare providers can experience efficient operations, improved financial outcomes, and elevated standards of patient care in their revenue cycle management.

  • Modern Data Stack: The What, Why and How?

    This post will provide you with a comprehensive overview of the modern data stack (MDS), including its benefits, how its components differ from their predecessors, and what its future holds.

    “Modern” has the connotation of being up-to-date, of being better. This is true for MDS, but how exactly is MDS better than what was before?

    What was the data stack like?…

    A few decades back, the MapReduce technological breakthrough made it possible to efficiently process large amounts of data in parallel across multiple machines.

    It provided the backbone of a standard pipeline: it was common to see HDFS used for storage, Spark for compute, and Hive to run SQL queries on top.

    To run this, teams handled the deployment and maintenance of Hadoop on their own.

    This self-managed setup eventually became a pain point, making the stack complex and inefficient in the long run.

    Being on-prem while facing growing heavier loads meant scalability became a huge concern.

    Hence, unlike today, the process was much more manual. Adding more RAM, increasing storage, and rolling out updates by hand reduced productivity.

    Moreover,

    • The pipeline wasn’t modular; components were tightly coupled, causing failures when deciding to shift to something new.
    • Teams committed to specific vendors and found themselves locked in, by design, for years.
    • Setup was complex, and the infrastructure was not resilient. Random surges in data crashed the systems. (This randomness in demand has only increased since the early internet era, due to social media-triggered virality.)
    • Self-service was non-existent. If you wanted to do anything with your data, you needed data engineers.
    • Observability was a myth. Your pipeline is failing, but you’re unaware, and then you don’t know why, where, how…Your customers become your testers, knowing more about your system’s issues.
    • Data protection laws weren’t as formalized, and policies within organizations were often lacking. Together, these issues made the traditional setup inefficient at solving modern problems.

    For an upgraded, modern setup, we needed something scalable, with a gentler learning curve, that is feasible for both a seed-stage startup and a Fortune 500.

    Standing on the shoulders of tech innovations from the 2000s, data engineers started building a blueprint for MDS tooling with three core attributes: 

    Cloud Native (or the ocean)

    Arguably the defining change of the MDS era, the cloud removes the hassle of on-prem infrastructure and enables automatic horizontal or vertical scaling in an era where virality and traffic spikes are routine technical requirements.

    Modularity

    The M in MDS could stand for modular.

    You can integrate any MDS tool into your existing stack, like LEGO blocks.

    You can test out multiple tools, whether they’re open source or managed, choose the best fit, and iteratively build out your data infrastructure.

    This mindset helps instill a habit of avoiding vendor lock-in by continuously upgrading your architecture with relative ease.

    By moving away from the ancient, one-size-fits-all model, MDS recognizes the uniqueness of each company’s budget, domain, data types, and maturity—and provides the correct solution for a given use case.

    Ease of Use

    MDS tools are easier to set up. You can start playing with these tools within a day.

    Importantly, the ease of use is not limited to technical engineers.

    Owing to the rise of self-serve and no-code tools like Tableau, data is finally democratized for all kinds of consumers. SQL remains crucial, but for basic metric calculations, PMs, Sales, Marketing, etc., can use simple drag and drop in the UI (sometimes even simpler than Excel pivot tables).

    MDS also enables one to experiment with different architectural frameworks for their use case. For example, ELT vs. ETL (explained under Data Transformation).

    But, one might think such improvements mean MDS is the v1.1 of Data Stack, a tech upgrade that ultimately uses data to solve similar problems.

    Fortunately, that’s far from the case.

    MDS enables data to solve more human problems across the org—problems that employees have long been facing but could never systematically solve for, helping generate much more value from the data.

    Beyond these, employees want transparency and visibility into how any metric was calculated and which data source in Snowflake was used to build what specific tableau dashboard.

    Critically, with compliance finally being focused on, orgs need solutions for giving the right people the right access at the right time.

    Lastly, as opposed to previous eras, these days, even startups have varied infrastructure components with data; if you’re a PM tasked with bringing insights, how do you know where to start? What data assets the organization has?

    Besides these problem statements being tackled, MDS builds a culture of upskilling employees in various data concepts.

    Data security, governance, and data lineage are important irrespective of department or persona in the organization.

    From designers to support executives, the need for a data-driven culture is a given.

    You’re probably bored of hearing how good the MDS is and want to deconstruct it into its components.

    Let’s dive in.

    SOURCES

    In our modern era, every product is inevitably becoming a tech product.

    From a smart bulb to an orbiting satellite, each generates data in its own unique flavor: frequency of generation, data format, data size, etc.

    Social media, microservices, IoT devices, smart devices, DBs, CRMs, ERPs, flat files, and a lot more…

    INGESTION

    Once data is created, how does one “ingest” or take in that data for actual usage? (That is the whole point of ingestion.)

    Roughly, there are three categories to help describe the ingestion solutions:

    Generic tools allow us to connect various data sources with data stores.

    E.g.: we can connect Google Ads or Salesforce to dump data into BigQuery or S3.

    These generic tools highlight the modularity and low or no code barrier aspect in MDS.

    Things are as easy as drag and drop, and one doesn’t need to be fluent in scripting.

    Then we have programmable tools as well, where we get more control over how we ingest data through code.

    For example, we can write Apache Airflow DAGs in Python to load data from S3 and dump it to Redshift.

    Intermediary – these tools cater to a specific use case or are coupled with the source itself.

    E.g. – Snowpipe, part of Snowflake itself, allows us to load data from files as soon as they’re available at the source.

    DATA STORAGE

    Where do you ingest data into?

    Here, we’ve expanded from HDFS & SQL DBs to a wider variety of formats (NoSQL, document DBs).

    Depending on the use case and the way you interact with data, you can choose from a DW, DB, DL, ObjectStores, etc.

    You might need a standard relational DB for transactions in finance, or you might be collecting logs. You might be experimenting with your product at an early stage and be fine with noSQL without worrying about prescribing schemas.

    One key feature to note: most are cloud-based, so there’s no more worrying about scalability, and we pay only for what we use.

    PS: Do stick around till the end for new concepts of Lake House and reverse ETL (already prevalent in the industry).

    DATA TRANSFORMATION

    The stored raw data must be cleaned and restructured into the shape we deem best for actual usage. This slicing and dicing is different for every kind of data.

    For example, we have tools for the E-T-L way, which can be categorized into SaaS and Frameworks, e.g., Fivetran and Spark respectively.

    Interestingly, the cloud era has given storage computational capability such that we don’t even need an external system for transformation, sometimes.

    With this rise of ELT, we leverage the processing capabilities of cloud data warehouses or lakehouses. Using tools like dbt, we write templated SQL queries to transform our data in the warehouse or lakehouse itself.

    This enables analysts to perform the heavy lifting of traditional data engineering problems.

    We also see stream processing where we work with applications where “micro” data is processed in real time (analyzed as soon as it’s produced, as opposed to large batches).

    DATA VISUALIZATION

    The ability to visually learn from data has only improved in the MDS era with advanced design, methodology, and integration.

    With Embedded analytics, one can integrate analytical capabilities and data visualizations into the software application itself.

    External analytics, on the other hand, are used to build using your processed data. You choose your source, create a chart, and let it run.

    DATA SCIENCE, MACHINE LEARNING, MLOps

    Source: https://medium.com/vertexventures/thinking-data-the-modern-data-stack-d7d59e81e8c6

    In the last decade, we have moved beyond ad-hoc insight generation in Jupyter notebooks to production-ready, real-time ML workflows, like recommendation systems and price predictions. Any startup can and does integrate ML into its products.

    Most cloud service providers offer machine learning models and automated model building as a service.

    MDS concepts like data observability are used to build tools for ML practitioners, whether it’s feature stores (a feature store is a central repository that provides entity values as of a certain time) or model monitoring (checking data drift, tracking model performance, and improving model accuracy).

This is extremely important, as statisticians can focus on the business problem, not the infrastructure.

This is an ever-expanding field where concepts such as MLOps (DevOps for ML pipelines: optimizing workflows and efficient transformations) and synthetic media (using AI to generate content itself) arrive and quickly become mainstream.

    ChatGPT is the current buzz, but by the time you’re reading this, I’m sure there’s going to be an updated one—such is the pace of development.

    DATA ORCHESTRATION

With a higher number of modularized tools and source systems comes increased complexity.

    More steps, processes, connections, settings, and synchronization are required.

    Data orchestration in MDS needs to be Cron on steroids.

Working across a wide variety of products, MDS orchestration tools bring the right data to the right purpose based on complex logic.

     

    DATA OBSERVABILITY

    Data observability is the ability to monitor and understand the state and behavior of data as it flows through an organization’s systems.

    In a traditional data stack, organizations often rely on reactive approaches to data management, only addressing issues as they arise. In contrast, data observability in an MDS involves adopting a proactive mindset, where organizations actively monitor and understand the state of their data pipelines to identify potential issues before they become critical.

    Monitoring – a dashboard that provides an operational view of your pipeline or system

    Alerting – both for expected events and anomalies 

    Tracking – ability to set and track specific events

    Analysis – automated issue detection that adapts to your pipeline and data health

    Logging – a record of an event in a standardized format for faster resolution

    SLA Tracking – Measure data quality against predefined standards (cost, performance, reliability)

    Data Lineage – graph representation of data assets showing upstream/downstream steps.

    DATA GOVERNANCE & SECURITY

    Data security is a critical consideration for organizations of all sizes and industries and needs to be prioritized to protect sensitive information, ensure compliance, and preserve business continuity. 

The arrival of stricter data protection regulations, such as the General Data Protection Regulation (GDPR) and CCPA, created a huge need in the market for MDS tools that efficiently and painlessly help organizations govern and secure their data.

    DATA CATALOG

Now that we have all the components of MDS, from ingestion to BI, we have so many sources, as well as dashboards, reports, views, and other metadata, that we need a Google-like search engine just to navigate our components.

    This is where a data catalog helps; it allows people to stitch the metadata (data about your data: the #rows in your table, the column names, types, etc.) across sources.

    This is necessary to help efficiently discover, understand, trust, and collaborate on data assets.

    We don’t want PMs & GTM to look at different dashboards for adoption data.

Take Netflix as an example. Previously, the sole purpose of its original data pipeline was to aggregate and upload events to Hadoop/Hive for batch processing. Chukwa collected events and wrote them to S3 in Hadoop sequence file format. In those days, end-to-end latency was up to 10 minutes. That was sufficient for batch jobs, which usually scan data at daily or hourly frequency.

With the emergence of Kafka and Elasticsearch over the last decade, there has been a growing demand for real-time analytics at Netflix. By real-time, we mean sub-minute latency. Instead of starting from scratch, Netflix was able to iteratively grow its MDS as market requirements changed.

    Source: https://blog.transform.co/data-talks/the-metric-layer-why-you-need-it-examples-and-how-it-fits-into-your-modern-data-stack/

     

This is a snapshot of the modern data stack a data-mature company like Netflix had some years back, where, instead of a few all-in-one tools, each data category was solved by a specialized tool.

    FUTURE COMPONENTS OF MDS?

    DATA MESH

    Source: https://martinfowler.com/articles/data-monolith-to-mesh.html

    The top picture shows how teams currently operate, where no matter the feature or product on the Y axis, the data pipeline’s journey remains the same moving along the X. But in an ideal world of data mesh, those who know the data should own its journey.

    As decentralization is the name of the game, data mesh is MDS’s response to this demand for an architecture shift where domain owners use self-service infrastructure to shape how their data is consumed.

    DATA LAKEHOUSE

    Source: https://www.altexsoft.com/blog/data-lakehouse/

    We have talked about data warehouses and data lakes being used for data storage.

    Initially, when we only needed structured data, data warehouses were used. Later, with big data, we started getting all kinds of data, structured and unstructured.

    So, we started using Data Lakes, where we just dumped everything.

    The lakehouse tries to combine the best of both worlds by adding an intelligent metadata layer on top of the data lake. This layer basically classifies and categorizes data such that it can be interpreted in a structured manner.

Also, all the data in the lakehouse is open, meaning it can be utilized by all kinds of tools. Lakehouses are generally built on top of open data formats like Parquet so that the data can be easily accessed by any tool.

End users can simply run their SQL queries as if they’re querying a DWH.

    REVERSE ETL

    Suppose you’re a salesperson using Salesforce and want to know if a lead you just got is warm or cold (warm indicating a higher chance of conversion).

The attributes of your lead, like salary and age, are fetched from your OLTP into a DWH, analyzed, and then the flag “warm” is sent back to the Salesforce UI, ready to be used in live operations.

     METRICS LAYER

    The Metric layer will be all about consistency, accessibility, and trust in the calculations of metrics.

Earlier, metrics lived in v1 and v1.1 Excel files, with the logic scattered around.

Currently, in the modern data stack world, each team’s calculations are isolated in the tool it is used to. For example, BI would define metrics in Tableau dashboards while DEs would define them in code.

    A metric layer would exist to ensure global access of the metrics to every other tool in the data stack.

For example, the DBT metrics layer helps define these in the warehouse, making them accessible to both BI and engineers. Similarly, Looker, Mode, and others each have their own approach to it.

    In summary, this blog post discussed the modern data stack and its advantages over older approaches. We examined the components of the modern data stack, including data sources, ingestion, transformation, and more, and how they work together to create an efficient and effective system for data management and analysis. We also highlighted the benefits of the modern data stack, including increased efficiency, scalability, and flexibility. 

    As technology continues to advance, the modern data stack will evolve and incorporate new components and capabilities.

  • Best Practices for Kafka Security

Overview

We will cover the security concepts of Kafka and walk through the implementation of encryption, authentication, and authorization for the Kafka cluster.

This article will explain how to configure SASL_SSL security for your Kafka cluster and how to protect data in transit. SASL_SSL is a communication type in which clients authenticate using SASL (Simple Authentication and Security Layer) mechanisms like PLAIN or SCRAM, while the server uses SSL certificates to establish secure communication. We will use the SCRAM authentication mechanism here to establish mutual authentication between the client and server. We’ll also discuss authorization and ACLs, which are important for securing your cluster.

    Prerequisites

A running Kafka cluster and a basic understanding of security components.

    Need for Kafka Security

The primary reason is to prevent unlawful internet activities such as misuse, modification, disruption, and disclosure. So, to understand security in a Kafka cluster, we need to know three terms:

    • Authentication – A security method by which the server verifies the identity of the clients connecting to it.
    • Authorization – Implemented alongside authentication, authorization lets the server determine what an identified client is allowed to do. Basically, it grants the limited access that is sufficient for the client.
    • Encryption – The process of transforming data so that it is distorted and unreadable without a decryption key. Encryption ensures that no other party can intercept and steal or read data in transit.

    Here is the quick start guide by Apache Kafka, so check it out if you still need to set up Kafka.

    https://kafka.apache.org/quickstart

    We’ll not cover the theoretical aspects here, but you can find a ton of sources on how these three components work internally. For now, we’ll focus on the implementation part and how Kafka revolves around security.

    This image illustrates SSL communication between the Kafka client and server.

    We are going to implement the steps in the below order:

    • Create a Certificate Authority
    • Create a Truststore & Keystore

    Certificate Authority – It is a trusted entity that issues SSL certificates. As such, a CA is an independent entity that acts as a trusted third party, issuing certificates for use by others. A certificate authority validates the credentials of a person or organization that requests a certificate before issuing one.

    Truststore – A truststore contains certificates from other parties with which you want to communicate or certificate authorities that you trust to identify other parties. In simple words, a list of CAs that can validate the certificate signed by the trusted CA.

    KeyStore – A KeyStore contains private keys and certificates with their corresponding public keys. Keystores can have one or more CA certificates depending upon what’s needed.

For the Kafka server, we need a server certificate, and here the KeyStore comes into the picture, since it stores the server certificate. The server certificate should be signed by the Certificate Authority (CA): a signing request is generated from the KeyStore, and in response, the CA sends back a signed certificate (CRT) that is imported into the KeyStore.

    We will create our own certificate authority for demonstration purposes. If you don’t want to create a private certificate authority, there are many certificate providers you can go with, like IdenTrust and GoDaddy. Since we are creating one, we need to tell our Kafka client to trust our private certificate authority using the Trust Store.

    This block diagram shows you how all the components communicate with each other and their role to generate the final certificate.

    So, let’s create our Certificate Authority. Run the below command in your terminal:

"openssl req -new -x509 -keyout <private_key_name> -out <public_certificate_name>"

    It will ask for a passphrase, and keep it safe for future use cases. After successfully executing the command, we should have two files named private_key_name and public_certificate_name.
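If you prefer a non-interactive run, the CA-creation step can be sketched like this (the file names ca-key and ca-cert, the subject, and the passphrase are placeholders for illustration, not values the blog prescribes):

```shell
# Create a self-signed CA certificate plus its private key in one shot.
# -x509 makes the output a certificate rather than a CSR; -subj and
# -passout skip the interactive prompts.
openssl req -new -x509 -days 365 \
  -keyout ca-key -out ca-cert \
  -subj "/CN=demo-ca" \
  -passout pass:ca-secret

# Inspect the certificate's subject to confirm it was created.
openssl x509 -in ca-cert -noout -subject
```

The second command should print a subject containing demo-ca, confirming the certificate was generated.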

    Now, let’s create a KeyStore and trust store for brokers; we need both because brokers also interact internally with each other. Let’s understand with the help of an example: Broker A wants to connect with Broker B, so Broker A acts as a client and Broker B as a server. We are using the SASL_SSL protocol, so A needs SASL credentials, and B needs a certificate for authentication. The reverse is also possible where Broker B wants to connect with Broker A, so we need both a KeyStore and a trust store for authentication.

    Now let’s create a trust store. Execute the below command in the terminal, and it should ask for the password. Save the password for future use:

"keytool -keystore <truststore_name.jks> -alias <alias_name> -import -file <public_certificate_name>"

Here, we are using the .jks extension for the file, which stands for Java KeyStore. You can also use Public-Key Cryptography Standards #12 (PKCS12) instead of .jks; that’s totally up to you. public_certificate_name is the same certificate we generated while creating the CA.
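As a rough, openssl-only sketch of the PKCS#12 alternative (all file names and passwords here are assumptions for illustration; the blog itself sticks with keytool and .jks):

```shell
# Throwaway CA certificate to bundle (stand-in for public_certificate_name)
openssl req -new -x509 -days 365 -keyout p12-ca-key -out p12-ca-cert \
  -subj "/CN=demo-ca" -passout pass:capass

# Bundle the certificate only (-nokeys, so no private key is included)
# into a password-protected PKCS#12 truststore
openssl pkcs12 -export -nokeys -in p12-ca-cert \
  -out truststore.p12 -passout pass:tspass
```

The resulting truststore.p12 plays the same role as the .jks truststore created with keytool.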

    For the KeyStore configuration, run the below command and store the password:

"keytool -genkey -keystore <keystore_name.jks> -validity <number_of_days> -storepass <store_password> -alias <alias_name> -keyalg <key_algorithm_name> -ext SAN=DNS:localhost"

    This action creates the KeyStore file in the current working directory. The question “First and Last Name” requires you to enter a fully qualified domain name because some certificate authorities, such as VeriSign, expect this property to be a fully qualified domain name. Not all CAs require a fully qualified domain name, but I recommend using a fully qualified domain name for portability. All other information should be valid. If the information cannot be verified, a certificate authority such as VeriSign will not sign the CSR generated for that record. I’m using localhost for the domain name here, as seen in the above command itself.

The KeyStore now has an entry under alias_name containing the private key and the information needed to generate a certificate signing request (CSR). Next, let’s create the CSR, which will be used to get a signed certificate from the Certificate Authority.

    Execute the below command in your terminal:

"keytool -keystore <keystore_name.jks> -alias <alias_name> -certreq -file <file_name.csr>"

    So, we have generated a signing certificate request using a KeyStore (the KeyStore name and alias name should be the same). It should ask for the KeyStore password, so enter the same one used while creating the KeyStore.

    Now, execute the below command. It will ask for the password, so enter the CA password, and now we have a signed certificate:

"openssl x509 -req -CA <public_certificate_name> -CAkey <private_key_name> -in <csr_file> -out <signed_file_name> -CAcreateserial"
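Putting the CSR and signing steps together, here is a self-contained sketch of the flow using only openssl (file names like demo-ca-cert and server.csr are placeholders standing in for the keytool-generated files above):

```shell
# 1. A throwaway CA, equivalent to the one created earlier
openssl req -new -x509 -days 365 -keyout demo-ca-key -out demo-ca-cert \
  -subj "/CN=demo-ca" -passout pass:capass

# 2. A server key pair and CSR, standing in for the keytool CSR
openssl req -new -newkey rsa:2048 -keyout server-key -out server.csr \
  -subj "/CN=localhost" -passout pass:srvpass

# 3. The signing step from the blog: the CA signs the CSR
openssl x509 -req -CA demo-ca-cert -CAkey demo-ca-key \
  -in server.csr -out server-signed.crt -CAcreateserial \
  -passin pass:capass

# 4. Verify that the signed certificate chains back to the CA
openssl verify -CAfile demo-ca-cert server-signed.crt
```

The final command confirms the signed certificate is trusted by the CA that issued it.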

    Finally, we need to add the public certificate of CA and signed certificate in the KeyStore, so run the below command. It will add the CA certificate to the KeyStore.

"keytool -keystore <keystore_name.jks> -alias <public_certificate_name> -import -file <public_certificate_name>"

    Now, let’s run the below command; it will add the signed certificate to the KeyStore.

"keytool -keystore <keystore_name.jks> -alias <alias_name> -import -file <signed_file_name>"

    As of now, we have generated all the security files for the broker. For internal broker communication, we are using SASL_SSL (see security.inter.broker.protocol in server.properties). Now we need to create a broker username and password using the SCRAM method. For more details, click here.

    Run the below command:

"kafka-configs.sh --zookeeper <host:port> --entity-type users --entity-name <username> --alter --add-config 'SCRAM-SHA-512=[password=<password>]'"

    NOTE: Credentials for inter-broker communication must be created before Kafka brokers are started.

    Now, we need to configure the Kafka broker property file, so update the file as given below:

    listeners=SASL_SSL://localhost:9092
    advertised.listeners=SASL_SSL://localhost:9092
    ssl.truststore.location={path/to/truststore_name.jks}
    ssl.truststore.password={truststore_password}
    ssl.keystore.location={/path/to/keystore_name.jks}
    ssl.keystore.password={keystore_password}
    security.inter.broker.protocol=SASL_SSL
    ssl.client.auth=none
    ssl.protocol=TLSv1.2
    sasl.enabled.mechanisms=SCRAM-SHA-512
    sasl.mechanism.inter.broker.protocol=SCRAM-SHA-512
    listener.name.sasl_ssl.scram-sha-512.sasl.jaas.config=org.apache.kafka.common.security.scram.ScramLoginModule required username={username} password={password};
    super.users=User:{username}

NOTE: If you are using an external JAAS config file, then remove the ScramLoginModule line and set this environment variable before starting the broker: "export KAFKA_OPTS=-Djava.security.auth.login.config={path/to/broker.conf}"

Now, if we run Kafka, the broker should come up on port 9092 without any failure. If you have multiple brokers in the cluster, the same config file can be replicated among them, but the port (and broker.id) must be different for each broker.
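For example, a second broker’s copy of the file might differ only in lines like these (the id and port values here are illustrative assumptions):

```properties
# Each broker needs a unique id and its own listener port
broker.id=1
listeners=SASL_SSL://localhost:9093
advertised.listeners=SASL_SSL://localhost:9093
```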

    Producers and consumers need a username and a password to access the broker, so let’s create their credentials and update respective configurations.

Create a producer user and update producer.properties inside the config directory, so execute the below command in your terminal.

"bin/kafka-configs.sh --zookeeper <host:port> --entity-type users --entity-name <producer_name> --alter --add-config 'SCRAM-SHA-512=[password=<password>]'"

    We need a trust store file for our clients (producer and consumer), but as we already know how to create a trust store, this is a small task for you. It is suggested that producers and consumers should have separate trust stores because when we move Kafka to production, there could be multiple producers and consumers on different machines.

    security.protocol=SASL_SSL
    ssl.protocol=TLSv1.2
    ssl.truststore.location={path/to/client.truststore.jks}
    ssl.truststore.password={password}
    sasl.mechanism=SCRAM-SHA-512
    sasl.jaas.config=org.apache.kafka.common.security.scram.ScramLoginModule required username={producer_name} password={password};

The below command creates a consumer user, so now let’s update consumer.properties inside the config directory:

"bin/kafka-configs.sh --zookeeper <host:port> --entity-type users --entity-name <consumer_name> --alter --add-config 'SCRAM-SHA-512=[password=<password>]'"

    security.protocol=SASL_SSL
    ssl.protocol=TLSv1.2
    ssl.truststore.location={path/to/client.truststore.jks}
    ssl.truststore.password={password}
    sasl.mechanism=SCRAM-SHA-512
    sasl.jaas.config=org.apache.kafka.common.security.scram.ScramLoginModule required username={consumer_name} password={password};

    As of now, we have implemented encryption and authentication for Kafka brokers. To verify that our producer and consumer are working properly with SCRAM credentials, run the console producer and consumer on some topics.

    Authorization is not implemented yet. Kafka uses access control lists (ACLs) to specify which users can perform which actions on specific resources or groups of resources. Each ACL has a principal, a permission type, an operation, a resource type, and a name.

The default authorizer is AclAuthorizer, provided by Kafka; Confluent also provides the Confluent Server Authorizer, which is totally different from AclAuthorizer. An authorizer is a server plugin used by Kafka to authorize actions. Specifically, the authorizer controls whether operations should be authorized based on the principal and the resource being accessed.

Format of ACLs – Principal P is [Allowed/Denied] Operation O from Host H on any Resource R matching ResourcePattern RP. For example: User:producer_name is Allowed Operation WRITE from Host * on any Topic matching LITERAL:topic_name.

    Execute the below command to create an ACL with writing permission for the producer:

"bin/kafka-acls.sh --authorizer-properties zookeeper.connect=<host:port> --add --allow-principal User:<producer_name> --operation WRITE --topic <topic_name>"

The above command creates a WRITE ACL for producer_name on topic_name.

    Now, execute the below command to create an ACL with reading permission for the consumer:

"bin/kafka-acls.sh --authorizer-properties zookeeper.connect=<host:port> --add --allow-principal User:<consumer_name> --operation READ --topic <topic_name>"

    Now we need to define the consumer group ID for this consumer, so the below command associates a consumer with a given consumer group ID.

"bin/kafka-acls.sh --authorizer-properties zookeeper.connect=<host:port> --add --allow-principal User:<consumer_name> --operation READ --group <consumer_group_name>"

Now, we need to add some configuration in two files: the broker’s server.properties and consumer.properties.

    # Authorizer class
    authorizer.class.name=kafka.security.authorizer.AclAuthorizer

    The above line indicates that AclAuthorizer class is used for authorization.

    # consumer group id
    group.id=<consumer_group_name>

The consumer group id is mandatory: without it, a consumer will not be able to access data from topics once ACLs are enforced, so a group id must be provided to start a consumer.

Let’s test the producer and consumer one by one: run the console producer, and run the console consumer in another terminal; both should run without error.

    console-producer
    console-consumer

    Voila!! Your Kafka is secured.

    Summary

In a nutshell, we have implemented security in our Kafka cluster using the SASL_SSL mechanism and learned how to create ACLs and grant different permissions to different users.

    Apache Kafka is the wild west without security. By default, there is no encryption, authentication, or access control list. Any client can communicate with the Kafka broker using the PLAINTEXT port. Access using this port should be restricted to trusted clients only. You can use network segmentation and/or authentication ACLs to restrict access to trusted IP addresses in these cases. If none of these are used, the cluster is wide open and available to anyone. A basic knowledge of Kafka authentication, authorization, encryption, and audit trails is required to safely move a system into production.

  • Discover the Benefits of Android Clean Architecture

    All architectures have one common goal: to manage the complexity of our application. We may not need to worry about it on a smaller project, but it becomes a lifesaver on larger ones. The purpose of Clean Architecture is to minimize code complexity by preventing implementation complexity.

    We must first understand a few things to implement the Clean Architecture in an Android project.

    • Entities: Encapsulate enterprise-wide critical business rules. An entity can be an object with methods or data structures and functions.
    • Use cases: Orchestrate the flow of data to and from the entities.
    • Controllers, gateways, presenters: A set of adapters that convert data from the use cases and entities format to the most convenient way to pass the data to the upper level (typically the UI).
    • UI, external interfaces, DB, web, devices: The outermost layer of the architecture, generally composed of frameworks such as database and web frameworks.

Here is one rule of thumb we need to follow: look at the direction of the arrows in the diagram. Entities do not depend on use cases, use cases do not depend on controllers, and so on. An outer layer may depend on an inner layer, but never the other way around; all dependencies between the layers must point inwards.

    Advantages of Clean Architecture:

    • Strict architecture—hard to make mistakes
    • Business logic is encapsulated, easy to use, and easy to test
    • Enforcement of dependencies through encapsulation
    • Allows for parallel development
    • Highly scalable
    • Easy to understand and maintain
    • Testing is facilitated

Let’s understand this using a small Android case study, which gives more practical than theoretical knowledge.

    A pragmatic approach

A typical Android project needs to separate the concerns between the UI, the business logic, and the data model, so taking “the theory” into account, we decided to split the project into three modules:

    • Domain Layer: contains the definitions of the business logic of the app, the data models, the abstract definition of repositories, and the definition of the use cases.
    Domain Module
    • Data Layer: This layer provides the abstract definition of all the data sources. Any application can reuse this without modifications. It contains repositories and data sources implementations, the database definition and its DAOs, the network APIs definitions, some mappers to convert network API models to database models, and vice versa.
    Data Module
    • Presentation layer: This is the layer that mainly interacts with the UI. It’s Android-specific and contains fragments, view models, adapters, activities, composable, and so on. It also includes a service locator to manage dependencies.
    Presentation Module
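The inward dependency rule can be expressed directly in the Gradle module definitions. This is a sketch with assumed module names (:domain, :data, :presentation), not the project’s actual build files:

```kotlin
// presentation/build.gradle.kts -- the outermost module depends inwards on both
dependencies {
    implementation(project(":domain"))
    implementation(project(":data"))
}

// data/build.gradle.kts -- data implements the domain's abstract repositories
dependencies {
    implementation(project(":domain"))
}

// domain/build.gradle.kts -- the innermost module declares no dependency
// on data or presentation; dependencies only point inwards
```

Because the domain module cannot even see the outer modules at compile time, the architecture rule is enforced by the build system rather than by convention alone.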

    Marvel’s comic characters App

    To elaborate on all the above concepts related to Clean Architecture, we are creating an app that lists Marvel’s comic characters using Marvel’s developer API. The app shows a list of Marvel characters, and clicking on each character will show details of that character. Users can also bookmark their favorite characters. It seems like nothing complicated, right?

    Before proceeding further into the sample, it’s good to have an idea of the following frameworks because the example is wholly based on them.

    • Jetpack Compose – Android’s recommended modern toolkit for building native UI.
    • Retrofit 2 – A type-safe HTTP client for Android for Network calls.
    • ViewModel – A class responsible for preparing and managing the data for an activity or a fragment.
    • Kotlin – Kotlin is a cross-platform, statically typed, general-purpose programming language with type inference.

To get a character list, we have used Marvel’s developer API, which returns the list of Marvel characters.

    http://gateway.marvel.com/v1/public/characters

    The domain layer

    In the domain layer, we define the data model, the use cases, and the abstract definition of the character repository. The API returns a list of characters, with some info like name, description, and image links.

    data class CharacterEntity(
        val id: Long,
        val name: String,
        val description: String,
        val imageUrl: String,
        val bookmarkStatus: Boolean
    )

    interface MarvelDataRepository {
        suspend fun getCharacters(dataSource: DataSource): Flow<List<CharacterEntity>>
        suspend fun getCharacter(characterId: Long): Flow<CharacterEntity>
        suspend fun toggleCharacterBookmarkStatus(characterId: Long): Boolean
        suspend fun getComics(dataSource: DataSource, characterId: Long): Flow<List<ComicsEntity>>
    }

    class GetCharactersUseCase(
        private val marvelDataRepository: MarvelDataRepository,
        private val ioDispatcher: CoroutineDispatcher = Dispatchers.IO
    ) {
        operator fun invoke(forceRefresh: Boolean = false): Flow<List<CharacterEntity>> {
            return flow {
                emitAll(
                    marvelDataRepository.getCharacters(
                        if (forceRefresh) {
                            DataSource.Network
                        } else {
                            DataSource.Cache
                        }
                    )
                )
            }
                .flowOn(ioDispatcher)
        }
    }

    The data layer

    As we said before, the data layer must implement the abstract definition of the domain layer, so we need to put the repository’s concrete implementation in this layer. To do so, we can define two data sources, a “local” data source to provide persistence and a “remote” data source to fetch the data from the API.

    class MarvelDataRepositoryImpl(
        private val marvelRemoteService: MarvelRemoteService,
        private val charactersDao: CharactersDao,
        private val comicsDao: ComicsDao,
        private val ioDispatcher: CoroutineDispatcher = Dispatchers.IO
    ) : MarvelDataRepository {
    
        override suspend fun getCharacters(dataSource: DataSource): Flow<List<CharacterEntity>> =
            flow {
                emitAll(
                    when (dataSource) {
                        is DataSource.Cache -> getCharactersCache().map { list ->
                            if (list.isEmpty()) {
                                getCharactersNetwork()
                            } else {
                                list.toDomain()
                            }
                        }
                            .flowOn(ioDispatcher)
    
                        is DataSource.Network -> flowOf(getCharactersNetwork())
                            .flowOn(ioDispatcher)
                    }
                )
            }
    
        private suspend fun getCharactersNetwork(): List<CharacterEntity> =
            marvelRemoteService.getCharacters().body()?.data?.results?.let { remoteData ->
                if (remoteData.isNotEmpty()) {
                    charactersDao.upsert(remoteData.toCache())
                }
                remoteData.toDomain()
            } ?: emptyList()
    
        private fun getCharactersCache(): Flow<List<CharacterCache>> =
            charactersDao.getCharacters()
    
        override suspend fun getCharacter(characterId: Long): Flow<CharacterEntity> =
            charactersDao.getCharacterFlow(id = characterId).map {
                it.toDomain()
            }
    
        override suspend fun toggleCharacterBookmarkStatus(characterId: Long): Boolean {
    
            val status = charactersDao.getCharacter(characterId)?.bookmarkStatus?.not() ?: false
    
            return charactersDao.toggleCharacterBookmarkStatus(id = characterId, status = status) > 0
        }
    
        override suspend fun getComics(
            dataSource: DataSource,
            characterId: Long
        ): Flow<List<ComicsEntity>> = flow {
            emitAll(
                when (dataSource) {
                    is DataSource.Cache -> getComicsCache(characterId = characterId).map { list ->
                        if (list.isEmpty()) {
                            getComicsNetwork(characterId = characterId)
                        } else {
                            list.toDomain()
                        }
                    }
                    is DataSource.Network -> flowOf(getComicsNetwork(characterId = characterId))
                        .flowOn(ioDispatcher)
                }
            )
        }
    
        private suspend fun getComicsNetwork(characterId: Long): List<ComicsEntity> =
            marvelRemoteService.getComics(characterId = characterId)
                .body()?.data?.results?.let { remoteData ->
                    if (remoteData.isNotEmpty()) {
                        comicsDao.upsert(remoteData.toCache(characterId = characterId))
                    }
                    remoteData.toDomain()
                } ?: emptyList()
    
        private fun getComicsCache(characterId: Long): Flow<List<ComicsCache>> =
            comicsDao.getComics(characterId = characterId)
    }

    Since we defined the data source to manage persistence, in this layer, we also need to determine the database for which we are using the room database. In addition, it’s good practice to create some mappers to map the API response to the corresponding database entity.

    fun List<Characters>.toCache() = map { character -> character.toCache() }
    
    fun Characters.toCache() = CharacterCache(
        id = id ?: 0,
        name = name ?: "",
        description = description ?: "",
        imageUrl = thumbnail?.let {
            "${it.path}.${it.extension}"
        } ?: ""
    )
    
    fun List<Characters>.toDomain() = map { character -> character.toDomain() }
    
    fun Characters.toDomain() = CharacterEntity(
        id = id ?: 0,
        name = name ?: "",
        description = description ?: "",
        imageUrl = thumbnail?.let {
            "${it.path}.${it.extension}"
        } ?: "",
        bookmarkStatus = false
    )

    @Entity
    data class CharacterCache(
        @PrimaryKey
        val id: Long,
        val name: String,
        val description: String,
        val imageUrl: String,
        val bookmarkStatus: Boolean = false
    ) : BaseCache
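
    The `comicsDao` used by the repository above is a Room DAO. A minimal sketch of what it might look like (the `ComicsCache` entity and its `characterId` column are assumed to mirror `CharacterCache`, so treat the names as illustrative):

    ```kotlin
    @Dao
    interface ComicsDao {

        // Emits the cached comics for a character and re-emits whenever the table changes
        @Query("SELECT * FROM ComicsCache WHERE characterId = :characterId")
        fun getComics(characterId: Long): Flow<List<ComicsCache>>

        // Inserts new rows, replacing existing ones that share a primary key
        @Insert(onConflict = OnConflictStrategy.REPLACE)
        suspend fun upsert(comics: List<ComicsCache>)
    }
    ```

    Returning a `Flow` from the query is what lets the data layer observe the cache reactively instead of polling it.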

    The presentation layer

    In this layer, we need a UI component such as a fragment, activity, or composable to display the list of characters; here we can use the widely adopted MVVM approach. The view model takes the use cases in its constructor and invokes the corresponding use case according to user actions (get a character, characters & comics, etc.).

    Each use case will invoke the appropriate method in the repository.

    class CharactersListViewModel(
        private val getCharacters: GetCharactersUseCase,
        private val toggleCharacterBookmarkStatus: ToggleCharacterBookmarkStatus
    ) : ViewModel() {
    
        private val _characters = MutableStateFlow<UiState<List<CharacterViewState>>>(UiState.Loading())
        val characters: StateFlow<UiState<List<CharacterViewState>>> = _characters
    
        init {
            _characters.value = UiState.Loading()
            getAllCharacters()
        }
    
        private fun getAllCharacters(forceRefresh: Boolean = false) {
            getCharacters(forceRefresh)
                .catch { error ->
                    error.printStackTrace()
                    when (error) {
                        is UnknownHostException, is ConnectException, is SocketTimeoutException -> _characters.value =
                            UiState.NoInternetError(error)
                        else -> _characters.value = UiState.ApiError(error)
                    }
                }.map { list ->
                    _characters.value = UiState.Loaded(list.toViewState())
                }.launchIn(viewModelScope)
        }
    
        fun refresh(showLoader: Boolean = false) {
            if (showLoader) {
                _characters.value = UiState.Loading()
            }
            getAllCharacters(forceRefresh = true)
        }
    
        fun bookmarkCharacter(characterId: Long) {
            viewModelScope.launch {
                toggleCharacterBookmarkStatus(characterId = characterId)
            }
        }
    }

    /*
    * Scaffold(Layout) for Characters list page
    * */
    
    
    @SuppressLint("UnusedMaterialScaffoldPaddingParameter")
    @Composable
    fun CharactersListScaffold(
        showComics: (Long) -> Unit,
        closeAction: () -> Unit,
        modifier: Modifier = Modifier,
        charactersListViewModel: CharactersListViewModel = getViewModel()
    ) {
        Scaffold(
            modifier = modifier,
            topBar = {
                TopAppBar(
                    title = {
                        Text(text = stringResource(id = R.string.characters))
                    },
                    navigationIcon = {
                        IconButton(onClick = closeAction) {
                            Icon(
                                imageVector = Icons.Filled.Close,
                                contentDescription = stringResource(id = R.string.close_icon)
                            )
                        }
                    }
                )
            }
        ) {
            val state = charactersListViewModel.characters.collectAsState()
    
            when (state.value) {
    
                is UiState.Loading -> {
                    Loader()
                }
    
                is UiState.Loaded -> {
                    state.value.data?.let { characters ->
                        val isRefreshing = remember { mutableStateOf(false) }
                        SwipeRefresh(
                            state = rememberSwipeRefreshState(isRefreshing = isRefreshing.value),
                            onRefresh = {
                                isRefreshing.value = true
                                charactersListViewModel.refresh()
                            }
                        ) {
                            isRefreshing.value = false
    
                            if (characters.isNotEmpty()) {
    
                                LazyVerticalGrid(
                                    columns = GridCells.Fixed(2),
                                    modifier = Modifier
                                        .padding(5.dp)
                                        .fillMaxSize()
                                ) {
                                    items(characters) { state ->
                                        CharacterTile(
                                            state = state,
                                            characterSelectAction = {
                                                showComics(state.id)
                                            },
                                            bookmarkAction = {
                                                charactersListViewModel.bookmarkCharacter(state.id)
                                            },
                                            modifier = Modifier
                                                .padding(5.dp)
                                                .fillMaxHeight(fraction = 0.35f)
                                        )
                                    }
                                }
    
                            } else {
                                Info(
                                    messageResource = R.string.no_characters_available,
                                    iconResource = R.drawable.ic_no_data
                                )
                            }
                        }
                    }
                }
    
                is UiState.ApiError -> {
                    Info(
                        messageResource = R.string.api_error,
                        iconResource = R.drawable.ic_something_went_wrong
                    )
                }
    
                is UiState.NoInternetError -> {
                    Info(
                        messageResource = R.string.no_internet,
                        iconResource = R.drawable.ic_no_connection,
                        isInfoOnly = false,
                        buttonAction = {
                            charactersListViewModel.refresh(showLoader = true)
                        }
                    )
                }
            }
        }
    }
    
    @Preview
    @Composable
    private fun CharactersListScaffoldPreview() {
        MarvelComicTheme {
            CharactersListScaffold(showComics = {}, closeAction = {})
        }
    }

    Let’s see what the communication between the layers looks like.

    Source: Clean Architecture Tutorial for Android

    As you can see, each layer communicates only with the closest one, keeping the inner layers independent of the outer ones. This way, we can quickly test each module separately, and the separation of concerns helps developers collaborate on the different modules of the project.

    Thank you so much!

  • Building a Progressive Web Application in React [With Live Code Examples]

    What is PWA:

    A Progressive Web Application, or PWA, is a web application built to look and behave like a native app: it operates offline-first and is optimized for a variety of viewports, ranging from mobile and tablet to FHD desktop monitors and beyond. PWAs are built using front-end technologies such as HTML, CSS, and JavaScript and bring a native-like user experience to the web platform. PWAs can also be installed on devices just like native apps.

    For an application to be classified as a PWA, it must tick all of these boxes:

    • PWAs must implement service workers. Service workers act as a proxy between the web browsers and API servers. This allows web apps to manage and cache network requests and assets
    • PWAs must be served over a secure network, i.e. the application must be served over HTTPS
    • PWAs must have a web manifest definition, which is a JSON file that provides basic information about the PWA, such as name, different icons, look and feel of the app, splash screen, version of the app, description, author, etc

    Why build a PWA?

    Businesses and engineering teams should consider building a progressive web app instead of a traditional web app. Here are some of the most prominent arguments in favor of PWAs:

    • PWAs are responsive. The mobile-first design approach enables PWAs to support a variety of viewports and orientation
    • PWAs can work on slow Internet or no Internet environment. App developers can choose how a PWA will behave when there’s no Internet connectivity, whereas traditional web apps or websites simply stop working without an active Internet connection
    • PWAs are secure because they are always served over HTTPS
    • PWAs can be installed on the home screen, making the application more accessible
    • PWAs bring in rich features, such as push notification, application updates and more

    PWA and React

    There are various ways to build a progressive web application. One can just use Vanilla JS, HTML and CSS or pick up a framework or library. Some of the popular choices in 2020 are Ionic, Vue, Angular, Polymer, and of course React, which happens to be my favorite front-end library.

    Building PWAs with React

    To get started, let’s create a PWA which lists all the users in a system.

    npm init react-app users
    cd users
    yarn add react-router-dom
    yarn run start

    Next, we will replace the default App.js file with our own implementation.

    import React from "react";
    import { BrowserRouter, Route } from "react-router-dom";
    import "./App.css";
    const Users = () => {
     // state
     const [users, setUsers] = React.useState([]);
     // effects
     React.useEffect(() => {
       fetch("https://jsonplaceholder.typicode.com/users")
         .then((res) => res.json())
         .then((users) => {
           setUsers(users);
         })
         .catch((err) => {});
     }, []);
     // render
     return (
       <div>
         <h2>Users</h2>
         <ul>
           {users.map((user) => (
             <li key={user.id}>
               {user.name} ({user.email})
             </li>
           ))}
         </ul>
       </div>
     );
    };
    const App = () => (
     <BrowserRouter>
       <Route path="/" exact component={Users} />
     </BrowserRouter>
    );
    export default App;

    This displays a list of users fetched from the server.

    Let’s also remove the logo.svg file inside the src directory and truncate the App.css file that is populated as a part of the boilerplate code.

    To make this app a PWA, we need to follow these steps:

    1. Register service worker

    • In the file /src/index.js, replace serviceWorker.unregister() with serviceWorker.register().
    import React from 'react';
    import ReactDOM from 'react-dom';
    import './index.css';
    import App from './App';
    import * as serviceWorker from './serviceWorker';
    ReactDOM.render(
     <React.StrictMode>
       <App />
     </React.StrictMode>,
     document.getElementById('root')
    );
    serviceWorker.register();

    • The default behavior here is to not set up a service worker, i.e. the CRA boilerplate allows the users to opt-in for the offline-first experience.

    2. Update the manifest file

    • The CRA boilerplate provides a manifest file out of the box. This file is located at /public/manifest.json and needs to be modified to include the name of the PWA, description, splash screen configuration and much more. You can read more about available configuration options in the manifest file here.

    Our modified manifest file looks like this:

    {
     "short_name": "User Mgmt.",
     "name": "User Management",
     "icons": [
       {
         "src": "favicon.ico",
         "sizes": "64x64 32x32 24x24 16x16",
         "type": "image/x-icon"
       },
       {
         "src": "logo192.png",
         "type": "image/png",
         "sizes": "192x192"
       },
       {
         "src": "logo512.png",
         "type": "image/png",
         "sizes": "512x512"
       }
     ],
     "start_url": ".",
     "display": "standalone",
     "theme_color": "#aaffaa",
     "background_color": "#ffffff"
    }

    PWA Splash Screen

    Here the display mode selected is “standalone,” which tells web browsers to give this PWA the same look and feel as a standalone app. Other display options include “browser,” which is the default mode and launches the PWA like a traditional web app, and “fullscreen,” which opens the PWA in fullscreen mode, hiding all other elements such as the navigation, the address bar, and the status bar.

    The manifest can be inspected using Chrome dev tools > Application tab > Manifest.

    3. Test the PWA:

    • To test a progressive web app, build it completely first. This is because PWA features such as caching aren’t enabled in dev mode, to ensure hassle-free development
    • Create a production build with: npm run build
    • Change into the build directory: cd build
    • Host the app locally: http-server or python3 -m http.server 8080
    • Test the application by navigating to http://localhost:8080

    4. Audit the PWA: If you are testing the app for the first time on a desktop or laptop browser, the PWA may look like just another website. To test and audit various aspects of the PWA, let’s use Lighthouse, which is a tool built by Google specifically for this purpose.

    PWA on mobile

    At this point, we already have a simple PWA which can be published on the Internet and made available to billions of devices. Now let’s try to enhance the app by improving its offline viewing experience.

    1. Offline indication: Since service workers can operate without the Internet as well, let’s add an offline indicator banner to let users know the current state of the application. We will use navigator.onLine along with the “online” and “offline” window events to detect the connection status.

      // state
      const [offline, setOffline] = React.useState(!navigator.onLine);
      // effects
      React.useEffect(() => {
        const offlineListener = () => setOffline(true);
        const onlineListener = () => setOffline(false);
        window.addEventListener("offline", offlineListener);
        window.addEventListener("online", onlineListener);
        return () => {
          window.removeEventListener("offline", offlineListener);
          window.removeEventListener("online", onlineListener);
        };
      }, []);
      
      {/* add to jsx */}
      {offline ? (
        <div className="banner-offline">The app is currently offline</div>
      ) : null}

    The easiest way to test this is to just turn off the Wi-Fi on your dev machine. Chrome dev tools also provide an option to test this without actually going offline. Head over to Dev tools > Network and then select “Offline” from the dropdown in the top section. This should bring up the banner when the app is offline.

    2. Let’s cache a network request using service worker

    CRA comes with its own service-worker.js file which caches all static assets, such as the JavaScript and CSS files that are part of the application bundle. To put custom logic into the service worker, let’s create a new file called ‘service-worker-custom.js’ and combine the two.

    • Install react-app-rewired and update package.json:
    1. yarn add react-app-rewired
    2. Update the package.json as follows:
    "scripts": {
       "start": "react-app-rewired start",
       "build": "react-app-rewired build",
       "test": "react-app-rewired test",
       "eject": "react-app-rewired eject"
    },

    • Create a config file to override how CRA generates service workers and inject our custom service worker, i.e. combine the two service worker files.
    const WorkboxWebpackPlugin = require("workbox-webpack-plugin");
    module.exports = function override(config, env) {
      config.plugins = config.plugins.map((plugin) => {
        if (plugin.constructor.name === "GenerateSW") {
          return new WorkboxWebpackPlugin.InjectManifest({
           swSrc: "./src/service-worker-custom.js",
           swDest: "service-worker.js"
          });
        }
        return plugin;
      });
      return config;
    };

    • Create service-worker-custom.js file and cache network request in there:
    workbox.core.skipWaiting();
    workbox.core.clientsClaim();
    workbox.routing.registerRoute(
      new RegExp("/users"),
      new workbox.strategies.NetworkFirst()
    );
    workbox.precaching.precacheAndRoute(self.__precacheManifest || []);

    Your app should now work correctly in offline mode.

    Distributing and publishing a PWA

    PWAs can be published just like any other website, with only one additional requirement: they must be served over HTTPS. When a user visits a PWA from a mobile or tablet, a pop-up is displayed asking the user if they’d like to install the app to their home screen.

    Conclusion

    Building PWAs with React enables engineering teams to develop, deploy, and publish progressive web apps for billions of devices using technologies they’re already familiar with. Existing React apps can also be converted to PWAs. PWAs are fun to build, easy to ship and distribute, and add a lot of value for customers by providing a native-like experience and better engagement via features such as add to home screen and push notifications, all without any installation process.

  • Amazon Lex + AWS Lambda: Beyond Hello World

    In my previous blog, I explained how to get started with Amazon Lex and build simple bots. This blog aims at exploring the Lambda functions used by Amazon Lex for code validation and fulfillment. We will go along with the same example we created in the first blog, i.e. purchasing a book, and will see in detail how the dots are connected.

    This blog is divided into the following sections:

    1. Lambda function input format
    2. Response format
    3. Managing conversation context
    4. An example (demonstration to understand better how context is maintained to make data flow between two different intents)

    NOTE: The input to a Lambda function changes according to the language you use to create the function. Since we have used NodeJS for our example, everything will be explained using it.

    Section 1:  Lambda function input format

    When communication is started with a bot, Amazon Lex passes control to the Lambda function we defined while creating the bot.

    There are three arguments that Amazon Lex passes to a Lambda function:

    1. Event:

    event is a JSON object containing all the details of a bot conversation. Every time the Lambda function is invoked, Amazon Lex sends an event JSON containing the details of the latest message sent by the user to the bot.

    Below is a sample event JSON:

    {
      currentIntent: {
        name: 'orderBook',
        slots: {
          bookType: null,
          bookName: null
        },
        confirmationStatus: 'None'
      },
      bot: {
        name: 'PurchaseBook',
        alias: '$LATEST',
        version: '$LATEST'
      },
      userId: 'user-1',
      inputTranscript: 'buy me a book',
      invocationSource: 'DialogCodeHook',
      outputDialogMode: 'Text',
      messageVersion: '1.0'
    };

    The format of the event JSON is explained below:

    • currentIntent: It contains information regarding the intent of the message sent by the user to the bot. It contains the following keys:
    • name: the intent name (e.g. orderBook, the intent we defined in our previous blog).
    • slots: A map of the slot names configured for that particular intent, populated with the values recognized by Amazon Lex during the conversation. The default values are null.
    • confirmationStatus: Provides the user’s response to a confirmation prompt, if there is one. Possible values for this variable are:
    • None: the default value.
    • Confirmed: the user responded with a confirmation to the confirmation prompt.
    • Denied: the user denied the confirmation prompt.
    • inputTranscript: The text input by the user for processing. In the case of audio input, the text is extracted from the audio. This is the text that is actually processed to recognize intents and slot values.
    • invocationSource: Its value indicates the reason for invoking the Lambda function. It can have the following two values:
    • DialogCodeHook: This value directs the Lambda function to initialize and validate the user’s data input. If the intent is not clear, Amazon Lex can’t invoke the Lambda function.
    • FulfillmentCodeHook: This value is set to fulfill the intent. If the intent is configured to invoke a Lambda function as a fulfillment code hook, Amazon Lex sets invocationSource to this value only after it has all the slot data to fulfill the intent.
    • bot: Details of bot that processed the request. It consists of below information:
    • name:  name of the bot.
    • alias: alias of the bot version.
    • version: the version of the bot.
    • userId: Its value is defined by the client application. Amazon Lex passes it to the Lambda function.
    • outputDialogMode:  Its value depends on how you have configured your bot. Its value can be Text / Voice.
    • messageVersion: The version of the message that identifies the format of the event data going into the Lambda function and the expected format of the response from a Lambda function. In the current implementation, only message version 1.0 is supported. Therefore, the console assumes the default value of 1.0 and doesn’t show the message version.
    • sessionAttributes:  Application-specific session attributes that the client sent in the request. It is optional.
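
    Putting the two invocationSource values to work, a minimal, self-contained NodeJS handler might look like the sketch below (the Delegate and Close response shapes are covered in Section 2; a real handler would add slot validation in the DialogCodeHook branch):

    ```javascript
    exports.handler = (event, context, callback) => {
      const intent = event.currentIntent;
      if (event.invocationSource === "DialogCodeHook") {
        // Validation would go here; for now, let Amazon Lex drive the dialog
        callback(null, {
          sessionAttributes: event.sessionAttributes || {},
          dialogAction: { type: "Delegate", slots: intent.slots }
        });
      } else {
        // FulfillmentCodeHook: all slots are filled, so close the intent
        callback(null, {
          sessionAttributes: event.sessionAttributes || {},
          dialogAction: {
            type: "Close",
            fulfillmentState: "Fulfilled",
            message: {
              contentType: "PlainText",
              content: "Your order for " + intent.slots.bookName + " is placed."
            }
          }
        });
      }
    };
    ```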

    2. Context:

    AWS Lambda uses this parameter to provide runtime information about the Lambda function that is executing. Some useful information we can get from the context object:

    • The time remaining before AWS Lambda terminates the Lambda function.
    • The CloudWatch log stream associated with the Lambda function that is executing.
    • The AWS request ID returned to the client that invoked the Lambda function which can be used for any follow-up inquiry with AWS support.

    Section 2: Response Format

    Amazon Lex expects a response from a Lambda function in the following format:

    {
      sessionAttributes: {},
      dialogAction: {
        type: "ElicitIntent / ElicitSlot / ConfirmIntent / Delegate / Close",
        <structure based on type>
      }
    }

    The response consists of two fields. The sessionAttributes field is optional; the dialogAction field is required. The contents of the dialogAction field depend on the value of the type field.

    • sessionAttributes: This is an optional field and can be empty. If the function has to send something back to the client, it should be passed under sessionAttributes. We will see its use case in Section 4.
    • dialogAction (Required): The type of this field defines the next course of action. The five types of dialogAction are explained below:

    1) Close: Informs Amazon Lex not to expect a response from the user. This is the case when all slots get filled. If you don’t specify a message, Amazon Lex uses the goodbye message or the follow-up message configured for the intent.

    dialogAction: {
      type: "Close",
      fulfillmentState: "Fulfilled / Failed", // (required)
      message: { // (optional)
        contentType: "PlainText or SSML",
        content: "Message to convey to the user"
      }
    }

    2) ConfirmIntent: Informs Amazon Lex that the user is expected to give a yes or no answer to confirm or deny the current intent. The slots field must contain an entry for each of the slots configured for the specified intent. If the value of a slot is unknown, you must set it to null. The message and responseCard fields are optional.

    dialogAction: {
      type: "ConfirmIntent",
      intentName: "orderBook",
      slots: {
        bookName: "value",
        bookType: "value"
      },
      message: { // (optional)
        contentType: "PlainText or SSML",
        content: "Message to convey to the user"
      }
    }

    3) Delegate: Directs Amazon Lex to choose the next course of action based on the bot configuration. The response must include any session attributes, and the slots field must include all of the slots specified for the requested intent. If the value of a slot is unknown, you must set it to null. You will get a DependencyFailedException if your fulfillment function returns the Delegate dialog action without removing any of the slots.

    dialogAction: {
      type: "Delegate",
      slots: {
        slot1: "value",
        slot2: "value"
      }
    }

    4) ElicitIntent: Informs Amazon Lex that the user is expected to respond with an utterance that includes an intent. For example, “I want to buy a book,” which indicates the orderBook intent. The utterance “book,” on the other hand, is not sufficient for Amazon Lex to infer the user’s intent.

    dialogAction: {
      type: "ElicitIntent",
      message: { // (optional)
        contentType: "PlainText or SSML",
        content: "Message to convey to the user"
      }
    }

    5) ElicitSlot: Informs Amazon Lex that the user is expected to provide a slot value in the response. In the structure below, we are informing Amazon Lex that the user’s response should provide a value for the slot named ‘bookName’.

    dialogAction: {
      type: "ElicitSlot",
      intentName: "orderBook",
      slots: {
        bookName: "",
        bookType: "fiction"
      },
      slotToElicit: "bookName",
      message: { // (optional)
        contentType: "PlainText or SSML",
        content: "Message to convey to the user"
      }
    }
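
    Since these five response shapes are easy to mistype, Lambda code for Lex bots commonly wraps them in small helper functions. A sketch (the helper names are my own, not an AWS API):

    ```javascript
    // Build a Close response: the conversation ends with the given fulfillment state
    function close(sessionAttributes, fulfillmentState, message) {
      return {
        sessionAttributes,
        dialogAction: { type: "Close", fulfillmentState, message }
      };
    }

    // Build an ElicitSlot response: ask the user for one specific slot value
    function elicitSlot(sessionAttributes, intentName, slots, slotToElicit, message) {
      return {
        sessionAttributes,
        dialogAction: { type: "ElicitSlot", intentName, slots, slotToElicit, message }
      };
    }

    // Build a Delegate response: let Amazon Lex decide the next action
    function delegate(sessionAttributes, slots) {
      return { sessionAttributes, dialogAction: { type: "Delegate", slots } };
    }
    ```

    With these in place, a validation branch can simply return elicitSlot(...) when a slot value is rejected.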

    Section 3: Managing Conversation Context

    Conversation context is the information that a user, your application, or a Lambda function provides to an Amazon Lex bot to fulfill an intent. Conversation context includes slot data that the user provides, request attributes set by the client application, and session attributes that the client application and Lambda functions create.

    1. Setting session timeout

    Session timeout is the length of time a conversation session lasts. For in-progress conversations, Amazon Lex retains the context information, slot data, and session attributes until the session ends. The default session duration is 5 minutes, but it can be increased up to 24 hours while creating the bot in the Amazon Lex console.

    2. Setting session attributes

    Session attributes contain application-specific information that is passed between a bot and a client application during a session. Amazon Lex passes session attributes to all Lambda functions configured for a bot. If a Lambda function adds or updates session attributes, Amazon Lex passes the new information back to the client application.

    Session attributes persist for the duration of the session. Amazon Lex stores them in an encrypted data store until the session ends.

    3. Sharing information between intents

    If you have created a bot with more than one intent, information can be shared between them using session attributes. Attributes defined while fulfilling one intent can be used in the other intents.

    For example, a user of the book-ordering bot starts by ordering a book. The bot engages in a conversation with the user, gathering slot data such as the book name and quantity. When the user places the order, the Lambda function that fulfills it sets the lastConfirmedReservation session attribute, containing information about the ordered book, and currentReservationPrice, containing the price of the book. Later, when the user fulfills the orderMagazine intent, the final price is calculated on the basis of currentReservationPrice.

    Section 4:  Example

    The details of the example bot are below:

    Bot Name: PurchaseBot

    Intents :

    • orderBook – bookName, bookType
    • orderMagazine – magazineName, issueMonth

    Session attributes set while fulfilling the intent “orderBook” are:

    1. lastConfirmedReservation: In this variable, we store the slot values corresponding to the intent orderBook.
    2. currentReservationPrice: The book price is calculated and stored in this variable.

    When the intent orderBook gets fulfilled, we ask the user if they also want to order a magazine. If the user responds with a confirmation, the bot starts fulfilling the intent “orderMagazine”.
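
    As a sketch of how the orderBook fulfillment could set those two attributes (the price calculation is invented for illustration; note that session attribute values are passed as strings):

    ```javascript
    function fulfillOrderBook(event) {
      const slots = event.currentIntent.slots;
      const sessionAttributes = event.sessionAttributes || {};
      // Store the confirmed reservation and its price for later use by orderMagazine
      sessionAttributes.lastConfirmedReservation = JSON.stringify(slots);
      sessionAttributes.currentReservationPrice = String(calculatePrice(slots.bookType));
      return {
        sessionAttributes,
        dialogAction: {
          type: "Close",
          fulfillmentState: "Fulfilled",
          message: {
            contentType: "PlainText",
            content: "Your order for " + slots.bookName +
              " is placed. Would you also like to order a magazine?"
          }
        }
      };
    }

    function calculatePrice(bookType) {
      // Hypothetical flat pricing per book type
      return bookType === "fiction" ? 15 : 20;
    }
    ```

    The orderMagazine fulfillment would then read event.sessionAttributes.currentReservationPrice to compute the final total.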

    Conclusion

    AWS Lambda functions are used as code hooks for your Amazon Lex bot. You can configure Lambda functions to perform initialization and validation, fulfillment, or both in your intent configuration. This blog brought more technical insight into how Amazon Lex works and how it communicates with Lambda functions, and explained how conversation context is maintained using session attributes. I hope you find the information useful.

  • Ensure Continuous Delivery On Kubernetes With GitOps’ Argo CD

    What is GitOps?

    GitOps is a Continuous Deployment model for cloud-native applications. In GitOps, the Git repositories containing the declarative descriptions of the infrastructure are considered the single source of truth for the desired state of the system, and we need an automated way to ensure that the deployed state of the system always matches the state defined in the Git repository. All changes (deployment/upgrade/rollback) to the environment are triggered by changes (commits) made to the Git repository.

    The application artifacts that we run in any environment always have corresponding code in some Git repository. Can we say the same about our infrastructure?

    Infrastructure-as-code tools and completely declarative orchestration tools like Kubernetes allow us to represent the entire state of our system declaratively. GitOps intends to make use of this ability and make infrastructure-related operations more developer-centric.

    Role of Infrastructure as Code (IaC) in GitOps

    The ability to represent the infrastructure as code is at the core of GitOps. But just having version-controlled infrastructure as code isn’t GitOps; we also need a mechanism in place to keep (or try to keep) our deployed state in sync with the state we define in the Git repository.

    Infrastructure as Code is necessary but not sufficient to achieve GitOps

    GitOps does pull-based deployments

    Most of the deployment pipelines we see currently push changes to the deployed environment. For example, say we need to upgrade our application to a newer version: we update its Docker image tag in some repository, which triggers a deployment pipeline and updates the deployed application. Here, the changes were pushed to the environment. In GitOps, we just update the image tag in the Git repository for that environment, and the changes are pulled into the environment to match the updated state in the Git repository. The magic of keeping the deployed state in sync with the state defined in Git is achieved with the help of operators/agents. The operator is like a control loop that identifies differences between the deployed state and the desired state and makes sure they are the same.
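
    The control loop can be sketched as a pure function that diffs the desired state against the deployed state and yields corrective actions (a conceptual sketch in plain JavaScript; real operators do this against the cluster API, not plain objects):

    ```javascript
    function reconcile(desired, deployed) {
      const actions = [];
      // Anything in Git but missing or different in the cluster gets created/updated
      for (const [name, spec] of Object.entries(desired)) {
        if (!(name in deployed)) {
          actions.push({ op: "create", name, spec });
        } else if (JSON.stringify(deployed[name]) !== JSON.stringify(spec)) {
          actions.push({ op: "update", name, spec });
        }
      }
      // Anything in the cluster but not in Git gets deleted
      for (const name of Object.keys(deployed)) {
        if (!(name in desired)) {
          actions.push({ op: "delete", name });
        }
      }
      return actions; // an empty array means the states are in sync
    }
    ```

    An operator runs this loop continuously, so any manual drift in the environment is detected and corrected back toward the Git-defined state.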

    Key benefits of GitOps:

    1. All the changes are verifiable and auditable as they make their way into the system through Git repositories.
    2. Easy and consistent replication of environments, as the Git repository is the single source of truth. This makes disaster recovery much quicker and simpler.
    3. More developer-centric experience for operating infrastructure. Also a smaller learning curve for deploying dev environments.
    4. Consistent rollback of application as well as infrastructure state.

    Introduction to Argo CD

    Argo CD is a continuous delivery tool that works on the principles of GitOps and is built specifically for Kubernetes. The product was developed and open-sourced by Intuit and is currently a part of CNCF.

    Key components of Argo CD:

    1. API Server: Just like K8s, Argo CD also has an API server that exposes APIs that other systems can interact with. The API server is responsible for managing the application, repository and cluster credentials, enforcing authentication and authorization, etc.
    2. Repository server: The repository server keeps a local cache of the Git repository, which holds the K8s manifest files for the application. This service is called by other services to get the K8s manifests.  
    3. Application controller: The application controller continuously watches the deployed state of the application and compares it with the desired state, reports to the API server whenever the two are out of sync, and can take corrective action as well. It is also responsible for executing user-defined hooks for various lifecycle events of the application.

    Key objects/resources in Argo CD:

    1. Application: Argo CD allows us to represent the instance of the application which we want to deploy in an environment by creating Kubernetes objects of a custom resource definition(CRD) named Application. In the specification of Application type objects, we specify the source (repository) of our application’s K8s manifest files, the K8s server where we want to deploy those manifests, namespace, and other information.
    2. AppProject: Just like Application, Argo CD provides another CRD named AppProject. AppProjects are used to logically group related applications.
    3. Repo Credentials: In the case of private repositories, we need to provide access credentials. For credentials, Argo CD uses K8s Secrets and a ConfigMap. First, we create objects of the Secret type, and then we update a special-purpose ConfigMap named argocd-cm with the repository URL and the Secret which contains the credentials.
    4. Cluster Credentials: Along with Git repository credentials, we also need to provide the K8s cluster credentials. These credentials are also managed using K8s Secrets; we are required to add the label argocd.argoproj.io/secret-type: cluster to these Secrets.
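    For reference, a cluster-credentials Secret looks roughly like the sketch below (the Secret name, server URL, and token values are placeholders, not from this setup):

    ```yaml
    apiVersion: v1
    kind: Secret
    metadata:
      name: staging-cluster-secret
      namespace: argocd
      labels:
        argocd.argoproj.io/secret-type: cluster
    type: Opaque
    stringData:
      name: staging-cluster
      server: https://<cluster-api-endpoint>
      config: |
        {
          "bearerToken": "<service-account-token>",
          "tlsClientConfig": {
            "insecure": false,
            "caData": "<base64-encoded-ca-certificate>"
          }
        }
    ```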

    Demo:

    Enough of theory, let’s try out the things we discussed above. For the demo, I have created a simple app named message-app. This app reads a message set in the environment variable named MESSAGE. We will populate the values of this environment variable using a K8s config map. I have kept the K8s manifest files for the app in a separate repository. We have the application and the K8s manifest files ready. Now we are all set to install Argo CD and deploy our application.
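    The app’s source isn’t shown in this post, but a minimal sketch of such a message-app could look like this (the port and fallback text are assumptions):

    ```javascript
    const http = require('http');

    // The MESSAGE env var is populated from the K8s ConfigMap.
    function handler(req, res) {
      const message = process.env.MESSAGE || 'MESSAGE not set';
      res.writeHead(200, { 'Content-Type': 'text/plain' });
      res.end(message);
    }

    const server = http.createServer(handler);
    // server.listen(8080); // uncomment to serve on port 8080
    ```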

    Installing Argo CD:

    For installing Argo CD, we first need to create a namespace named argocd.

    kubectl create namespace argocd
    kubectl apply -n argocd -f https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/install.yaml

    Applying the files from the argo-cd repo directly is fine for demo purposes, but in real environments, you should copy the files into your own repository before applying them. 

    We can see that this command has created the core components and CRDs we discussed earlier in the blog. There are some additional resources as well but we can ignore them for the time being.

    Accessing the Argo CD GUI

    We now have Argo CD running in our cluster. Argo CD also provides a GUI that gives us a graphical representation of our K8s objects and allows us to view events, pod logs, and other configuration.

    By default, the GUI service is not exposed outside the cluster. Let us update its service type to LoadBalancer so that we can access it from outside.

    kubectl patch svc argocd-server -n argocd -p '{"spec": {"type": "LoadBalancer"}}'

    After this, we can use the external IP of the argocd-server service and access the GUI. 

    The initial username is admin and, in the Argo CD version used here, the initial password is the name of the api-server pod (newer releases store it in the argocd-initial-admin-secret Secret instead). The password can be obtained by listing the pods in the argocd namespace or directly with this command.

    kubectl get pods -n argocd -l app.kubernetes.io/name=argocd-server -o name | cut -d'/' -f 2 

    Deploy the app:

    Now let’s go ahead and create our application for the staging environment for our message app.

    apiVersion: argoproj.io/v1alpha1
    kind: Application
    metadata:
      name: message-app-staging
      namespace: argocd
      labels:
        environment: staging
      finalizers:
        - resources-finalizer.argocd.argoproj.io
    spec:
      project: default
    
      # Source of the application manifests
      source:
        repoURL: https://github.com/akash-gautam/message-app-manifests.git
        targetRevision: HEAD
        path: manifests/staging
    
      # Destination cluster and namespace to deploy the application
      destination:
        server: https://kubernetes.default.svc
        namespace: staging
    
      syncPolicy:
        automated:
          prune: false
          selfHeal: false

    In the application spec, we have specified the repository where our manifest files are stored, and also the path of the files within the repository. 

    We want to deploy our app in the same k8s cluster where ArgoCD is running so we have specified the local k8s service URL in the destination. We want the resources to be deployed in the staging namespace, so we have set it accordingly.

    In the sync policy, we have enabled automated sync. We have kept the project as default. 

    Adding the resources-finalizer.argocd.argoproj.io ensures that all the resources created for the application are deleted when the Application is deleted. This is fine for our demo setup but might not always be desirable in real-life scenarios.

    Our git repos are public so we don’t need to create secrets for git repo credentials.

    We are deploying in the same cluster where Argo CD itself is running. As this is a demo setup, we can use the admin user created by Argo CD, so we don’t need to create secrets for cluster credentials either.

    Now let’s go ahead and create the application and see the magic happen.

    kubectl apply -f message-app-staging.yaml

    As soon as the application is created, we can see it on the GUI. 

    By clicking on the application, we can see all the Kubernetes objects created for it.

    It also shows the objects which are indirectly created by the objects we create. In the above image, we can see the replica set and endpoint object which are created as a result of creating the deployment and service respectively.

    We can also click on the individual objects and see their configuration. For pods, we can see events and logs as well.

    As our app is deployed now, we can grab the public IP of the message-app service and access it in the browser.

    We can see that our app is deployed and accessible.

    Updating the app

    For updating our application, all we need to do is commit our changes to the GitHub repository. We know message-app just displays the message we pass to it via a ConfigMap, so let’s update the message and push the change to the repository.

    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: message-configmap
      labels:
        app: message-app
    data:
      MESSAGE: "This too shall pass" #Put the message you want to display here.

    Once the commit is done, Argo CD will start to sync again.

    Once the sync is done, we will restart our message-app pod so that it picks up the latest values in the ConfigMap. Then we refresh the browser to see the updated value.

    As we discussed earlier, for making any changes to the environment, we just need to update the repo which is being used as the source for the environment and then the changes will get pulled in the environment. 

    We can follow exactly the same approach to deploy the application to the production environment as well. We just need to create a new Application object and set the manifest path and deployment namespace accordingly.

    Conclusion: 

    It’s still early days for GitOps, but it has already been successfully implemented at scale by many organizations. As GitOps tools mature along with the ever-growing adoption of Kubernetes, I think many organizations will consider adopting GitOps soon. GitOps is not limited to Kubernetes, but the completely declarative nature of Kubernetes makes GitOps simpler to achieve. Argo CD is a deployment tool tailored for Kubernetes that lets us do deployments in a Kubernetes-native way while following the principles of GitOps. I hope this blog helped you understand the how, what, and why of GitOps and gave you some insight into Argo CD.

  • Building a WebSocket Service with AWS Lambda & DynamoDB

    WebSocket is an effective way for full-duplex, real-time communication between a web server and a client. It is widely used for building real-time web applications along with helper libraries that offer better features. Implementing WebSockets requires a persistent connection between two parties. Serverless functions are known for short execution time and non-persistent behavior. However, with the API Gateway support for WebSocket endpoints, it is possible to implement a Serverless service built on AWS Lambda, API Gateway, and DynamoDB.

    Prerequisites

    A basic understanding of real-time web applications will help with this implementation. Throughout this article, we will be using Serverless Framework for developing and deploying the WebSocket service. Also, Node.js is used to write the business logic. 

    Behind the scenes, Serverless uses Cloudformation to create various required resources, like API Gateway APIs, AWS Lambda functions, IAM roles and policies, etc.

    Why Serverless?

    Serverless Framework abstracts the complex syntax needed for creating the Cloudformation stacks and helps us focus on the business logic of the services. Along with that, there is a variety of plugins available that make developing serverless applications easier.

    Why DynamoDB?

    We need persistent storage for WebSocket connection data, along with AWS Lambda. DynamoDB, a serverless key-value database from AWS, offers low latency, making it a great fit for storing and retrieving WebSocket connection details.

    Overview

    In this application, we’ll be creating an AWS Lambda service that accepts the WebSocket connections coming via API Gateway. The connections and subscriptions to topics are persisted using DynamoDB. We will be using ws to implement basic WebSocket clients for the demonstration. The implementation has a single WebSocket-consuming Lambda that receives the connections and handles the communication. 

    Base Setup

    We will be using the default Node.js boilerplate offered by Serverless as a starting point.

    serverless create --template aws-nodejs

    A few of the Serverless plugins are installed and used to speed up the development and deployment of the Serverless stack. We also add the webpack config given here to support the latest JS syntax.

    Adding Lambda role and policies:

    The lambda function requires a role attached to it with enough permissions to access DynamoDB and the API Gateway execute-api. These are the links for the configuration files:

    Link to dynamoDB.yaml

    Link to lambdaRoles.yaml

    Adding custom config for plugins:

    The plugins used for local development must have the custom config added in the yaml file.

    This is how our serverless.yaml file should look after the base serverless configuration:

    service: websocket-app
    frameworkVersion: '2'
    custom:
      dynamodb:
        stages:
          - dev
        start:
          port: 8000
          inMemory: true
          heapInitial: 200m
          heapMax: 1g
          migrate: true
          convertEmptyValues: true
      webpack:
        keepOutputDirectory: true
        packager: 'npm'
        includeModules:
          forceExclude:
            - aws-sdk

    provider:
      name: aws
      runtime: nodejs12.x
      lambdaHashingVersion: 20201221
    plugins:
      - serverless-dynamodb-local
      - serverless-plugin-existing-s3
      - serverless-dotenv-plugin
      - serverless-webpack
      - serverless-offline
    resources:
      - Resources: ${file(./config/dynamoDB.yaml)}
      - Resources: ${file(./config/lambdaRoles.yaml)}
    functions:
      hello:
        handler: handler.hello

    Add WebSocket Lambda:

    We need to create a lambda function that accepts WebSocket events from API Gateway. As you can see, we’ve defined 3 WebSocket events for the lambda function.

    • $connect
    • $disconnect
    • $default

    These 3 events stand for the default routes that come with the WebSocket API Gateway offering. $connect and $disconnect are used for initialization and termination of the socket connection, while the $default route is for data transfer.

    functions:
      websocket:
        handler: lambda/websocket.handler
        events:
          - websocket:
              route: $connect
          - websocket:
              route: $disconnect
          - websocket:
              route: $default

    We can go ahead and update how data is sent and add custom WebSocket routes to the application.

    The lambda needs to establish a connection with the client and handle the subscriptions. The logic for updating DynamoDB is written in a utility class, Client. Whenever a connection is received, we create a record in the topics table.

    console.log(`Received socket connectionId: ${event.requestContext && event.requestContext.connectionId}`);
    if (!(event.requestContext && event.requestContext.connectionId)) {
      throw new Error('Invalid event. Missing `connectionId` parameter.');
    }
    const connectionId = event.requestContext.connectionId;
    const route = event.requestContext.routeKey;
    console.log(`data from ${connectionId} ${event.body}`);
    const connection = new Client(connectionId);
    const response = { statusCode: 200, body: '' };

    if (route === '$connect') {
      console.log(`Route ${route} - Socket connected: ${connectionId}`);
      await new Client(connectionId).connect();
      return response;
    }

    The Client utility class internally creates a record for the given connectionId in the DynamoDB topics table.

    async subscribe({ topic, ttl }) {
      return dynamoDBClient
        .put({
          Item: {
            topic,
            connectionId: this.connectionId,
            // Default TTL: expire the subscription two hours from now.
            ttl: typeof ttl === 'number' ? ttl : Math.floor(Date.now() / 1000) + 60 * 60 * 2,
          },
          TableName: process.env.TOPICS_TABLE,
        }).promise();
    }

    Similarly, for the $disconnect route, we remove the client’s topic records (including the initial connection entry) when it disconnects.

    else if (route === '$disconnect') {
      console.log(`Route ${route} - Socket disconnected: ${event.requestContext.connectionId}`);
      await new Client(connectionId).unsubscribe();
      return response;
    }

    The client.unsubscribe method internally removes the connection entry from the DynamoDB table. Here, the getTopics method fetches all the topics the particular client has subscribed to.

    async unsubscribe() {
       const topics = await this.getTopics();
       if (!topics) {
         throw Error(`Topics got undefined`);
       }
       return this.removeTopics({
         [process.env.TOPICS_TABLE]: topics.map(({ topic, connectionId }) => ({
           DeleteRequest: { Key: { topic, connectionId } },
         })),
       });
     }

    Now comes the $default route part of the lambda, where we customize message handling. In this implementation, we route message handling based on event.body.type, which indicates what kind of message was received from the client. The subscribe type is used to subscribe to new topics. Similarly, the message type is used to receive a message from one client and publish it to the other clients who have subscribed to the same topic as the sender.

    console.log(`Route ${route} - data from ${connectionId}`);
    if (!event.body) {
      return response;
    }
    const body = JSON.parse(event.body);
    const topic = body.topic;
    if (body.type === 'subscribe') {
      await connection.subscribe({ topic });
      console.log(`Client subscribing for topic: ${topic}`);
    }
    if (body.type === 'message') {
      await new Topic(topic).publishMessage({ data: body.message });
      console.log(`Published messages to subscribers`);
      return response;
    }
    return response;

    Similar to $connect, the subscribe type of payload, when received, creates a new subscription for the mentioned topic.

    Publishing the messages

    Here is the interesting part of this lambda. When a client sends a payload with type message, the lambda calls the publishMessage method with the data received. The method gets the active subscribers for the topic and publishes messages to them using the Client utility’s sendMessage method.

    async publishMessage(data) {
      const subscribers = await this.getSubscribers();
      const promises = subscribers.map(async ({ connectionId, subscriptionId }) => {
        const TopicSubscriber = new Client(connectionId);
        const res = await TopicSubscriber.sendMessage({
          id: subscriptionId,
          payload: { data },
          type: 'data',
        });
        return res;
      });
      return Promise.all(promises);
    }

    sendMessage calls the publish endpoint, which after deployment is the API Gateway URL. As we’re using serverless-offline for local development, the IS_OFFLINE env variable is set automatically.

    const endpoint = process.env.IS_OFFLINE ? 'http://localhost:3001' : process.env.PUBLISH_ENDPOINT;
    console.log('publish endpoint', endpoint);
    const gatewayClient = new ApiGatewayManagementApi({
      apiVersion: '2018-11-29',
      credentials: config,
      endpoint,
    });
    return gatewayClient
      .postToConnection({
        ConnectionId: this.connectionId,
        Data: JSON.stringify(message),
      })
      .promise();

    Instead of manually invoking the API endpoint, we can also use DynamoDB streams to trigger a lambda asynchronously and publish messages to topics.
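    With the Serverless Framework, a stream-triggered publisher could be wired up roughly like this (the function name, handler path, and table resource name are assumptions; the table must have streams enabled):

    ```yaml
    functions:
      streamPublisher:
        handler: lambda/streamPublisher.handler
        events:
          - stream:
              type: dynamodb
              arn:
                Fn::GetAtt: [TopicsTable, StreamArn]
    ```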

    Implementing the client

    For testing the socket implementation, we will be using a Node.js script, ws-client.js. This creates two ws clients: one that sends the data and another that receives it.

    const WebSocket = require('ws');
    const socketEndpoint = 'http://0.0.0.0:3001';
    const ws1 = new WebSocket(socketEndpoint, {
      perMessageDeflate: false
    });
    const ws2 = new WebSocket(socketEndpoint, {
      perMessageDeflate: false
    });

    The first client, on connect, sends data every 15 seconds to a topic named general. The count increments on each send.

    ws1.on('open', () => {
      console.log('WS1 connected');
      let count = 0;
      setInterval(() => {
        const data = {
          type: 'message',
          message: `count is ${count}`,
          topic: 'general'
        };
        const message = JSON.stringify(data);
        ws1.send(message, (err) => {
          if (err) {
            console.log(`Error occurred while sending data ${err.message}`);
            return;
          }
          console.log(`WS1 OUT ${message}`);
        });
        count++;
      }, 15000);
    });

    The second client on connect will first subscribe to the general topic and then attach a handler for receiving data.

    ws2.on('open', () => {
      console.log('WS2 connected');
      const data = {
        type: 'subscribe',
        topic: 'general'
      };
      ws2.send(JSON.stringify(data), (err) => {
        if (err) {
          console.log(`Error occurred while sending data ${err.message}`);
        }
      });
    });
    ws2.on('message', (message) => {
      console.log(`ws2 IN ${message}`);
    });

    Once the service is running, we should be able to see the following output, with the two clients successfully sending and receiving messages through our socket server.

    Conclusion

    With API Gateway WebSocket support and DynamoDB, we’re able to implement persistent socket connections using serverless functions. The implementation can be improved and can be as complex as needed.

  • Elasticsearch – Basic and Advanced Concepts

    What is Elasticsearch?

    In our previous blog, we saw that Elasticsearch is a highly scalable open-source full-text search and analytics engine, built on top of Apache Lucene. Elasticsearch allows you to store, search, and analyze huge volumes of data quickly and in near real-time.

    Basic Concepts –

    • Index – A large collection of JSON documents; comparable to a database in relational databases. Every document must reside in an index.
    • Shards – Since there is no limit on the number of documents that reside in an index, indices are often horizontally partitioned into shards that reside on nodes in the cluster. 
      Max documents allowed in a shard = 2,147,483,519 (as of now)
    • Type – A logical partition of an index. Similar to a table in relational databases. 
    • Fields – Similar to a column in relational databases. 
    • Analyzers – Used while indexing/searching documents. These contain “tokenizers” that split phrases/text into tokens and “token filters” that filter/modify tokens during indexing & searching.
    • Mappings – The combination of fields + analyzers. A mapping defines how your fields are stored & indexed.

    Inverted Index

    ES uses inverted indexes under the hood. An inverted index is an index that maps terms to the documents containing them.

    Let’s say, we have 3 documents :

    1. Food is great
    2. It is raining
    3. Wind is strong

    An inverted index for these documents can be constructed as –

    The terms in the dictionary are stored in a sorted order to find them quickly.

    Searching for multiple terms is done by performing a lookup on each term in the index, then taking either the UNION or the INTERSECTION of the resulting document sets to fetch the relevant matching documents.
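    As a toy illustration (not how Lucene stores it on disk), building and querying an inverted index for the three documents above could look like this:

    ```javascript
    const docs = {
      1: 'Food is great',
      2: 'It is raining',
      3: 'Wind is strong',
    };

    // Build: lowercase each document, split it into terms, and map every
    // term to the set of document ids containing it.
    const index = new Map();
    for (const [id, text] of Object.entries(docs)) {
      for (const term of text.toLowerCase().split(/\s+/)) {
        if (!index.has(term)) index.set(term, new Set());
        index.get(term).add(Number(id));
      }
    }

    // Multi-term search: look up each term, then INTERSECT the doc-id sets.
    function searchAll(...terms) {
      return terms
        .map((t) => index.get(t.toLowerCase()) || new Set())
        .reduce((a, b) => new Set([...a].filter((id) => b.has(id))));
    }

    console.log([...index.get('is')]);         // "is" appears in all three docs
    console.log([...searchAll('is', 'wind')]); // only document 3 matches both
    ```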

    An ES index is spanned across multiple shards; while indexing, each document is routed to a shard based on a hash of its routing value (the document ID by default). We can customize which shard a document is routed to, and which shards search requests are sent to.

    An ES index is made up of multiple Lucene indexes, which in turn are made up of index segments. These are write-once, read-many indices, i.e., the index files Lucene writes are immutable (except for deletions).

    Analyzers –

    Analysis is the process of converting text into tokens or terms which are added to the inverted index for searching. Analysis is performed by an analyzer, which can be either built-in or custom. 

    We can define a single analyzer for both indexing & searching, or a separate search-analyzer and index-analyzer for a mapping.
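    In mapping terms, that looks roughly like the snippet below (the field and analyzer names are illustrative; ngram_analyzer would be defined in the index settings):

    ```json
    {
      "mappings": {
        "properties": {
          "full_name": {
            "type": "text",
            "analyzer": "ngram_analyzer",
            "search_analyzer": "standard"
          }
        }
      }
    }
    ```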

    Building blocks of analyzer- 

    • Character filters – receive the original text as a stream of characters and can transform the stream by adding, removing, or changing characters.
    • Tokenizers – receive a stream of characters and break it up into individual tokens. 
    • Token filters – receive the token stream and may add, remove, or change tokens.

    Some Commonly used built-in analyzers –

    1. Standard –

    Divides text into terms on word boundaries. Lower-cases all terms. Removes punctuation and stopwords (if specified, default = None).

    Text:  The 2 QUICK Brown-Foxes jumped over the lazy dog’s bone.

    Output: [the, 2, quick, brown, foxes, jumped, over, the, lazy, dog’s, bone]

    2. Simple/Lowercase –

    Divides text into terms whenever it encounters a non-letter character. Lower-cases all terms.

    Text: The 2 QUICK Brown-Foxes jumped over the lazy dog’s bone.

    Output: [ the, quick, brown, foxes, jumped, over, the, lazy, dog, s, bone ]

    3. Whitespace –

    Divides text into terms whenever it encounters a white-space character.

    Text: The 2 QUICK Brown-Foxes jumped over the lazy dog’s bone.

    Output: [ The, 2, QUICK, Brown-Foxes, jumped, over, the, lazy, dog’s, bone.]

    4. Stopword –

    Same as simple-analyzer with stop word removal by default.

    Text: The 2 QUICK Brown-Foxes jumped over the lazy dog’s bone.

    Output: [ quick, brown, foxes, jumped, over, lazy, dog, s, bone]

    5. Keyword / NOOP –

    Returns the entire input string as it is.

    Text: The 2 QUICK Brown-Foxes jumped over the lazy dog’s bone.

    Output: [The 2 QUICK Brown-Foxes jumped over the lazy dog’s bone.]

    Some Commonly used built-in tokenizers –

    1. Standard –

    Divides text into terms on word boundaries, removes most punctuation.

    2. Letter –

    Divides text into terms whenever it encounters a non-letter character.

    3. Lowercase –

    Letter tokenizer which lowercases all tokens.

    4. Whitespace –

    Divides text into terms whenever it encounters any white-space character.

    5. UAX-URL-EMAIL –

    Standard tokenizer which recognizes URLs and email addresses as single tokens.

    6. N-Gram –

    Divides text into terms when it encounters anything from a list of specified characters (e.g. whitespace or punctuation), and returns n-grams of each word: a sliding window of continuous letters, e.g. quick → [qu, ui, ic, ck, qui, quic, quick, uic, uick, ick].

    7. Edge-N-Gram –

    It is similar to N-Gram tokenizer with n-grams anchored to the start of the word (prefix- based NGrams). e.g. quick → [q, qu, qui, quic, quick].
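    The two expansions can be sketched with a few lines of JavaScript (the gram lengths are illustrative; ES defaults differ):

    ```javascript
    // All substrings ("sliding windows") of length min..max.
    function ngrams(token, min, max) {
      const out = [];
      for (let len = min; len <= max; len++) {
        for (let i = 0; i + len <= token.length; i++) {
          out.push(token.slice(i, i + len));
        }
      }
      return out;
    }

    // Only the prefixes of length min..max.
    function edgeNgrams(token, min, max) {
      const out = [];
      for (let len = min; len <= Math.min(max, token.length); len++) {
        out.push(token.slice(0, len));
      }
      return out;
    }

    console.log(ngrams('quick', 2, 5));
    // [ 'qu', 'ui', 'ic', 'ck', 'qui', 'uic', 'ick', 'quic', 'uick', 'quick' ]
    console.log(edgeNgrams('quick', 1, 5));
    // [ 'q', 'qu', 'qui', 'quic', 'quick' ]
    ```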

    8. Keyword –

    Emits exact same text as a single term.

    Make your mappings right –

    Analyzers, if not designed right, can increase your search time drastically. 

    Avoid using regular expressions in queries as much as possible. Let your analyzers handle them.

    ES provides multiple tokenizers (standard, whitespace, ngram, edge-ngram, etc) which can be directly used, or you can create your own tokenizer. 

    Consider a simple use-case where we had to search for a user who either has “brad” in their name or “brad_pitt” in their email (substring-based search). Without proper analyzers written for this mapping, one would simply go and write a regex for this query.

    {
      "query": {
        "bool": {
          "should": [
            {
              "regexp": {
                "email.raw": ".*brad_pitt.*"
              }
            },
            {
              "regexp": {
                "name.raw": ".*brad.*"
              }
            }
          ]
        }
      }
    }

    This took 16s for us to fetch 1 lakh out of 60 million documents.

    Instead, we created an n-gram analyzer with lower-case filter which would generate all relevant tokens while indexing.
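    An n-gram analyzer with a lowercase filter can be defined in the index settings roughly as below (the analyzer/tokenizer names and gram lengths here are illustrative, not our production values):

    ```json
    {
      "settings": {
        "analysis": {
          "tokenizer": {
            "ngram_tokenizer": {
              "type": "ngram",
              "min_gram": 3,
              "max_gram": 4
            }
          },
          "analyzer": {
            "ngram_analyzer": {
              "type": "custom",
              "tokenizer": "ngram_tokenizer",
              "filter": ["lowercase"]
            }
          }
        }
      }
    }
    ```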

    The above regex query was updated to –

    {
      "query": {
        "multi_match": {
          "query": "brad",
          "fields": [
            "email.suggestion",
            "full_name.suggestion"
          ]
        }
      },
      "size": 25
    }

    This took 109ms for us to fetch 1 lakh out of 60 million documents.

    Thus, the previous search query, which took more than 10-25s, got reduced to less than 800-900ms for the same set of records.

    Had the use-case been to search for results where the name starts with “brad” or the email starts with “brad_pitt” (prefix-based search), it would be better to go for an edge-n-gram analyzer or suggesters.

    Performance Improvement with Filter Queries –

    Use Filter queries whenever possible. 

    ES usually scores documents and returns them in sorted order as per their scores. This may hurt performance if scoring of documents is not relevant to our use-case. In such scenarios, use “filter” clauses, which skip scoring (documents simply match or don’t) and whose results can be cached.

    {
      "query": {
        "multi_match": {
          "query": "brad",
          "fields": [
            "email.suggestion",
            "full_name.suggestion"
          ]
        }
      },
      "size": 25
    }

    Above query can now be written as –

    {
      "query": {
        "bool": {
          "filter": [
            {
              "multi_match": {
                "query": "brad",
                "fields": [
                  "email.suggestion",
                  "full_name.suggestion"
                ]
              }
            }
          ]
        }
      },
      "size": 25
    }

    This will reduce query-time by a few milliseconds.

    Re-indexing made faster –

    Before creating any mappings, know your use-case well.

    ES does not allow us to alter existing mappings, unlike the “ALTER” command in relational databases, although we can keep adding new fields to the index. 

    The only way to change existing mappings is to create a new index, re-index the existing documents into it, and alias the new index with the required name, achieving zero downtime in production. Note – this process can take days if you have millions of records to re-index.

    To re-index faster, we can change a few settings  –

    1. Disable swapping – Since no requests will be directed to the new index till indexing is done, we can safely disable swap.
    Command for Linux machines –

    sudo swapoff -a

    2. Disable refresh_interval for ES – Default refresh_interval is 1s which can safely be disabled while documents are getting re-indexed.

    3. Change the bulk size while indexing – ES usually indexes documents in chunks of 1k. It is preferred to increase this default to approximately 5-10k, though we need to find the sweet spot while re-indexing to avoid load on the current index.

    4. Set the replica count to 0 – ES creates at least 1 replica per shard by default. We can set this to 0 while indexing & reset it to the required value post indexing.
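    Points 2 and 4 can be applied together with a single settings update before re-indexing (and reverted afterwards, e.g. refresh_interval back to "1s" and replicas to your original count). This is the request body for PUT /<new-index>/_settings:

    ```json
    {
      "index": {
        "refresh_interval": "-1",
        "number_of_replicas": 0
      }
    }
    ```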

    Conclusion

    Elasticsearch is a very powerful database for text-based searches. The Elastic ecosystem is widely used for reporting, alerting, machine learning, etc. This article gives an overview of Elasticsearch mappings and how creating the right mappings can improve your query performance & accuracy. Giving the right mappings and the right resources to your Elasticsearch cluster can do wonders.