Category: Services

  • OPA On Kubernetes: An Introduction For Beginners

    Introduction:

    More often than not, organizations need to apply various kinds of policies to the environments where they run their applications. These policies might be required to meet compliance requirements, achieve a higher degree of security, or standardize configuration across multiple environments. This calls for an automated, declarative way to define and enforce these policies, and policy engines like OPA help us do exactly that.

    Motivation behind Open Policy Agent (OPA)

    When we run an application, it generally comprises multiple subsystems. Even in the simplest of cases, we will have an API gateway or load balancer, one or two applications, and a database. All of these subsystems typically have different mechanisms for authorizing requests: the application might use JWT tokens, while the database uses grants. The application may also access third-party APIs or cloud services, which have yet another way of authorizing requests. Add to this your CI/CD servers, your log server, and so on, and you can see how many different authorization mechanisms can exist even in a small system.

    The existence of so many authorization models in our system makes life difficult when we need to meet compliance or information security requirements, or even self-imposed organizational policies. For example, if we need to adhere to a new compliance requirement, we have to understand and implement it for every component that performs authorization in our system.

    “The main motivation behind OPA is to achieve unified policy enforcement across the stack.”

    What are Open Policy Agent (OPA) and OPA Gatekeeper?

    OPA is an open-source, general-purpose policy engine that can be used to enforce policies on various types of software systems: microservices, CI/CD pipelines, gateways, Kubernetes, etc. OPA was developed by Styra and is currently a CNCF project.

    OPA provides REST APIs which our system can call to check whether the policies are met for a given request payload. It also provides a high-level declarative language, Rego, which allows us to specify the policies we want to enforce as code. This gives us a lot of flexibility when defining policies.
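To give a feel for Rego, here is a minimal, hypothetical policy; the package name and input fields are illustrative assumptions, not part of this article's setup. It denies everything by default and allows only GET requests to a public path:

```rego
package httpapi.authz

# Deny by default; a request is allowed only if some allow rule matches.
default allow = false

# Allow GET requests to /api/public.
allow {
    input.method == "GET"
    input.path == ["api", "public"]
}
```

A service would POST its request attributes (method, path, user, etc.) to OPA's Data API and act on the boolean decision returned.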

    The above image shows the architecture of OPA. OPA exposes APIs that any service needing an authorization or policy decision can call (a policy query); OPA makes a decision based on the Rego code for the policy and returns it to the service, which then processes the request accordingly. The enforcement is done by the service itself; OPA is responsible only for making the decision. This is what makes OPA a general-purpose policy engine that supports a large number of services.

    The Gatekeeper project is a Kubernetes-specific implementation of OPA. Gatekeeper allows us to use OPA in a Kubernetes-native way to enforce the desired policies.

    How Gatekeeper enforces policies

    On a Kubernetes cluster, Gatekeeper is installed as a ValidatingAdmissionWebhook. Admission controllers intercept requests after they have been authenticated and authorized by the Kubernetes API server, but before the objects are persisted to etcd. If any admission controller rejects a request, the overall request is rejected. The limitation of built-in admission controllers is that they must be compiled into the kube-apiserver and can be enabled only when the apiserver starts up.

    To overcome this rigidity, admission webhooks were introduced. Once admission webhooks are enabled in the cluster, the API server can send admission requests to external HTTP callbacks and receive admission responses. Admission webhooks come in two types: MutatingAdmissionWebhook and ValidatingAdmissionWebhook. The difference is that mutating webhooks can modify the objects they receive, while validating webhooks cannot. The image below roughly shows the flow of an API request once both mutating and validating admission controllers are enabled.


    The role of Gatekeeper is simply to check whether a request meets the defined policy, which is why it is installed as a validating webhook.

    Demo:

    Install Gatekeeper:

    kubectl apply -f https://raw.githubusercontent.com/open-policy-agent/gatekeeper/master/deploy/gatekeeper.yaml

    Now we have Gatekeeper up and running in our cluster. The installation also created a CRD named `constrainttemplates.templates.gatekeeper.sh`. This CRD lets us create constraint templates for the policies we want to enforce. In a constraint template, we define the constraint logic as Rego code along with its schema. Once the constraint template is created, we can create constraints, which are instances of the template created for specific resources. Think of it as functions and function calls: constraint templates are like functions, and constraints invoke them with specific parameter values (resource kinds and other values).

    To get a better understanding, let’s go ahead and create a constraint template and a constraint.

    The policy we want to enforce is to prevent developers from creating a Service of type LoadBalancer in the `dev` namespace of the cluster, which they use to verify their code. Creating LoadBalancer Services in the dev environment adds unnecessary cost.

    Below is the constraint template for the same.

    apiVersion: templates.gatekeeper.sh/v1beta1
    kind: ConstraintTemplate
    metadata:
      name: lbtypesvcnotallowed
    spec:
      crd:
        spec:
          names:
            kind: LBTypeSvcNotAllowed
            listKind: LBTypeSvcNotAllowedList
            plural: lbtypesvcnotallowed
            singular: lbtypesvcnotallowed
      targets:
        - target: admission.k8s.gatekeeper.sh
          rego: |
            package kubernetes.admission

            violation[{"msg": msg}] {
              input.review.kind.kind == "Service"
              input.review.operation == "CREATE"
              input.review.object.spec.type == "LoadBalancer"
              msg := "LoadBalancer Services are not permitted"
            }

    In the constraint template spec, we define a new object kind which we will use when creating constraints; then, under targets, we specify the Rego code that verifies whether a request meets the policy. The Rego code declares a violation rule that fires when a request creates a Service of type LoadBalancer, causing the request to be denied.
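To make the decision logic concrete, the same check can be mirrored in plain Python. This is only an illustration of what the Rego rule evaluates, not Gatekeeper code; the `admission_violations` helper and the trimmed payload shape are assumptions modeled on the AdmissionReview structure above:

```python
def admission_violations(review: dict) -> list:
    """Return violation messages for an AdmissionReview-style payload.

    Mirrors the Rego rule: a CREATE of a Service whose spec.type is
    LoadBalancer produces a violation; anything else passes.
    """
    violations = []
    kind = review.get("kind", {}).get("kind")
    operation = review.get("operation")
    svc_type = review.get("object", {}).get("spec", {}).get("type")
    if kind == "Service" and operation == "CREATE" and svc_type == "LoadBalancer":
        violations.append("LoadBalancer Services are not permitted")
    return violations

# A request creating a LoadBalancer Service is flagged...
bad = {"kind": {"kind": "Service"}, "operation": "CREATE",
       "object": {"spec": {"type": "LoadBalancer"}}}
# ...while a ClusterIP Service passes untouched.
ok = {"kind": {"kind": "Service"}, "operation": "CREATE",
      "object": {"spec": {"type": "ClusterIP"}}}
```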

    Using the above template, we can now define constraints:

    apiVersion: constraints.gatekeeper.sh/v1beta1
    kind: LBTypeSvcNotAllowed
    metadata:
      name: deny-lb-type-svc-dev-ns
    spec:
      match:
        kinds:
          - apiGroups: [""]
            kinds: ["Service"]
        namespaces:
          - "dev"

    Here we have specified the kind of Kubernetes object (Service) to which we want to apply the constraint, and we have set the namespace to dev because we want the constraint enforced only in the dev namespace.

    Let’s go ahead and create the constraint template and constraint:

    Note: After creating the constraint template, check its status to confirm it was created successfully; otherwise you will get an error while creating the constraints. It is also advisable to verify the Rego code snippet before using it in the constraint template.

    Now let’s try to create a service of type LoadBalancer in the dev namespace:

    kind: Service
    apiVersion: v1
    metadata:
      name: opa-service
    spec:
      type: LoadBalancer
      selector:
        app: opa-app
      ports:
      - protocol: TCP
        port: 80
        targetPort: 8080

    When we try to create a service of type LoadBalancer in the dev namespace, we get an error saying it was denied by the admission webhook due to the `deny-lb-type-svc-dev-ns` constraint; creating the same service in the default namespace, however, succeeds.

    Here we are not passing any parameters from our constraint to the Rego policy, but we certainly can, to make the policy more generic. For example, we can add a field named servicetype to the constraint template and, in the policy code, deny all requests where the servicetype value defined in the constraint matches that of the request. With this, we could deny services of other types as well, in any namespace of our cluster.
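A parameterized version could look roughly like the following sketch. The serviceTypes field name is an illustrative assumption, but the mechanism is real: parameters are declared under the template's openAPIV3Schema and read in the Rego code via input.parameters.

```yaml
# Constraint template fragment declaring a parameter schema (sketch).
spec:
  crd:
    spec:
      validation:
        openAPIV3Schema:
          properties:
            serviceTypes:
              type: array
              items:
                type: string
---
# Constraint fragment passing the parameter (sketch).
spec:
  match:
    kinds:
      - apiGroups: [""]
        kinds: ["Service"]
  parameters:
    serviceTypes: ["LoadBalancer", "NodePort"]
```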

    Gatekeeper also provides auditing for resources that were created before the constraint was applied. The information is available in the status of the constraint objects, which helps us identify which objects in our cluster are not compliant with our constraints.

    Conclusion:

    OPA allows us to apply fine-grained policies in our Kubernetes clusters and can be instrumental in improving their overall security, which has always been a concern for organizations adopting or migrating to Kubernetes. It also makes meeting compliance and audit requirements much simpler. There is some learning curve, as we need to get familiar with Rego to code our policies, but the language is simple and there are quite a few good examples to help you get started.

  • How to Write Jenkinsfile for Angular and .Net Based Applications

    If you landed here directly and want to know how to set up a Jenkins master-slave architecture, please visit this post on Setting-up the Jenkins Master-Slave Architecture.

    The source code that we are using here is also a continuation of the code that was written in this GitHub Packer-Terraform-Jenkins repository.

    Creating Jenkinsfile

    We will create Jenkinsfiles to execute jobs from our Jenkins master.

    Here I will create two Jenkinsfiles. Ideally, your Jenkinsfile should live in the source code repo, but it can also be passed directly in the job.

    There are two ways of writing a Jenkinsfile: Scripted and Declarative. You can find numerous articles online describing their differences. We will write one of each to do a build, so that we can get a hang of both.

    Jenkinsfile for Angular App (Scripted)

    As mentioned before, we will be highlighting both formats of writing a Jenkinsfile. For the Angular app we will write a scripted one, but it could easily be written in declarative format too.

    We will be running this inside a Docker container; thus, the tests will also be executed in a headless manner.

    Here is the Jenkinsfile for reference.
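As a rough sketch, a scripted Jenkinsfile covering these stages might look like the following. The image tags, paths, and shell commands are illustrative assumptions, not the exact pipeline from the repository:

```groovy
node {
    stage('Clean Workspace') {
        cleanWs()  // requires the Workspace Cleanup plugin
    }
    stage('Main Build') {
        // Run the build inside a Node container, mounting the checked-out
        // workspace so source and artifacts stay on the host via a volume.
        docker.image('node:14').inside("-v ${env.WORKSPACE}:/app") {
            stage('Checkout SCM') { checkout scm }
            stage('Install & Lint') {
                sh 'cd /app && npm install && npm run lint'
            }
            stage('Test (headless)') {
                sh 'cd /app && npm test -- --watch=false --browsers=ChromeHeadless'
            }
            stage('Build') {
                sh 'cd /app && npm run build'
            }
        }
    }
    stage('Deploy') {
        // Serve the dist folder with an Nginx container (sketch only;
        // the container-exists / user-input logic is omitted here).
        sh 'docker run -d --name angular-app -p 80:80 ' +
           '-v $WORKSPACE/dist:/usr/share/nginx/html nginx'
    }
}
```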

    Here we leverage a Docker volume to keep the source code updated on the host machine while using Docker containers for the build environments.

    Dissecting Node App’s Jenkinsfile

    1. We are using CleanWs() to clear the workspace.
    2. Next is the Main build in which we define our complete build process.
    3. We are pulling the required images.
    4. Highlighting the steps that we will be executing.
    5. Checkout SCM: Checking out our code from Git
    6. We are now starting the node container inside of which we will be running npm install and npm run lint.
    7. Get test dependency: Here we are downloading chrome.json which will be used in the next step when starting the container.
    8. Here we test our app. Specific changes for running the test are mentioned below.
    9. Build: Finally we build the app.
    10. Deploy: Once CI is completed we need to start with CD. CD could be a blog post of its own, but we wanted to highlight what a basic deployment would do.
    11. Here we are using Nginx container to host our application.
    12. If the container does not exist it will create a container and use the “dist” folder for deployment.
    13. If Nginx container exists, then it will ask for user input to recreate a container or not.
    14. If you select not to create, don’t worry as we are using Nginx it will do a hot reload with new changes.

    The Angular application used here was created using the standard generate command provided by the CLI itself. Although install and build give no trouble on bare metal, some tweaks are required to run the tests in a container.

    In karma.conf.js, update `browsers` with `ChromeHeadless`.

    Next, in protractor.conf.js, update `browserName` with `chrome` and add:

    'chromeOptions': {
      'args': ['--headless', '--disable-gpu', '--window-size=800x600']
    },

    That’s it! We now have our CI pipeline set up for an Angular-based application.

    Jenkinsfile for .Net App (Declarative)

    For a .Net application, we have to set up MSBuild and MSDeploy. In the blog post mentioned above, we have already set up MSBuild, and we will shortly discuss how to set up MSDeploy.

    To do the Windows deployment we have two options: either set up MSBuild in Jenkins Global Tool Configuration, or use the full path of MSBuild on the slave machine.

    Passing the path is fairly simple, so here we will discuss how to use the global tool configuration in a Jenkinsfile.

    First, get the path of MSBuild from your server. On the latest versions the binaries live under a `Current` directory; on older versions they are under a versioned directory such as `15.0`.

    As we are using MSBuild 2017. Our MSBuild path is:

    C:\Program Files (x86)\Microsoft Visual Studio\2017\BuildTools\MSBuild\15.0\Bin

    Place this in Jenkins under Global Tool Configuration (/configureTools/) -> MSBuild.

    Now you have your configuration ready to be used in Jenkinsfile.

    Jenkinsfile to build and test the app is given below.
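As a condensed sketch, a declarative Jenkinsfile for these stages might look like this. The tool name, solution file name, and MSDeploy arguments are illustrative assumptions and must match your own configuration:

```groovy
pipeline {
    agent { label 'windows' }
    stages {
        stage('Clean Workspace') {
            steps { cleanWs() }
        }
        stage('Checkout') {
            steps { checkout scm }
        }
        stage('Nuget Restore') {
            steps { bat 'nuget restore PrimeService.sln' }
        }
        stage('Build') {
            steps {
                // 'MSBuild' must match the tool name configured in
                // Jenkins Global Tool Configuration.
                bat "\"${tool 'MSBuild'}\\MSBuild.exe\" PrimeService.sln /p:Configuration=Release"
            }
        }
        stage('UnitTest') {
            steps { bat 'dotnet test PrimeService.Tests' }
        }
        stage('Deploy') {
            steps {
                // MSDeploy sync of the build output to IIS; arguments
                // depend on your site name and credentials.
                bat 'msdeploy -verb:sync -source:contentPath="%WORKSPACE%\\publish" -dest:auto'
            }
        }
    }
}
```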

    As seen above, the structure of the Declarative syntax is almost the same as that of the Scripted one. You should opt for whichever syntax you find easier to read.

    Dissecting Dotnet App’s Jenkinsfile

    1. In this case too we are cleaning the workspace as the first step.
    2. Checkout: This is also the same as before.
    3. Nuget Restore: We are downloading dependent required packages for both PrimeService and PrimeService.Tests
    4. Build: Building the Dotnet app using MSBuild tool which we had configured earlier before writing the Jenkinsfile.
    5. UnitTest: Here we used `dotnet test`, although we could have used MSTest as well; we just wanted to highlight how easy the dotnet utility makes it. We could even use `dotnet build` for the build step.
    6. Deploy: Deploying on the IIS server. We cover the creation of the IIS server below.

    From the above examples, you get the hang of what a Jenkinsfile looks like and how it can be used for creating jobs. The files above highlight basic job creation, but they can be extended to do everything that old-style job creation could.

    Creating IIS Server

    Unlike our Angular application, where we just had to pull another image and we were good to go, here we will have to use Packer to create our IIS server. We will automate the creation process and use the server to host applications.

    Here is a Powershell script for IIS for reference.

    # To list all Windows Features: dism /online /Get-Features
    # Get-WindowsOptionalFeature -Online 
    # LIST All IIS FEATURES: 
    # Get-WindowsOptionalFeature -Online | where FeatureName -like 'IIS-*'
    
    # NetFx dependencies
    dism /online /Enable-Feature /FeatureName:NetFx4 /All
    
    # ASP dependencies
    dism /online /enable-feature /all /featurename:IIS-ASPNET45
    
    Enable-WindowsOptionalFeature -Online -FeatureName IIS-WebServerRole
    Enable-WindowsOptionalFeature -Online -FeatureName IIS-WebServer 
    Enable-WindowsOptionalFeature -Online -FeatureName IIS-CommonHttpFeatures
    Enable-WindowsOptionalFeature -Online -FeatureName IIS-Security 
    Enable-WindowsOptionalFeature -Online -FeatureName IIS-RequestFiltering 
    Enable-WindowsOptionalFeature -Online -FeatureName IIS-StaticContent
    Enable-WindowsOptionalFeature -Online -FeatureName IIS-DefaultDocument
    Enable-WindowsOptionalFeature -Online -FeatureName IIS-DirectoryBrowsing
    Enable-WindowsOptionalFeature -Online -FeatureName IIS-HttpErrors 
    Enable-WindowsOptionalFeature -Online -FeatureName IIS-ApplicationDevelopment
    Enable-WindowsOptionalFeature -Online -FeatureName IIS-WebSockets 
    Enable-WindowsOptionalFeature -Online -FeatureName IIS-ApplicationInit
    Enable-WindowsOptionalFeature -Online -FeatureName IIS-NetFxExtensibility45
    Enable-WindowsOptionalFeature -Online -FeatureName IIS-ISAPIExtensions
    Enable-WindowsOptionalFeature -Online -FeatureName IIS-ISAPIFilter
    Enable-WindowsOptionalFeature -Online -FeatureName IIS-ASP
    Enable-WindowsOptionalFeature -Online -FeatureName IIS-ASPNET45
    Enable-WindowsOptionalFeature -Online -FeatureName IIS-ServerSideIncludes
    Enable-WindowsOptionalFeature -Online -FeatureName IIS-HealthAndDiagnostics
    Enable-WindowsOptionalFeature -Online -FeatureName IIS-HttpLogging 
    Enable-WindowsOptionalFeature -Online -FeatureName IIS-Performance
    Enable-WindowsOptionalFeature -Online -FeatureName IIS-HttpCompressionStatic
    Enable-WindowsOptionalFeature -Online -FeatureName IIS-WebServerManagementTools
    Enable-WindowsOptionalFeature -Online -FeatureName IIS-ManagementConsole 
    Enable-WindowsOptionalFeature -Online -FeatureName IIS-ManagementService
    
    # Install Chocolatey
    Set-ExecutionPolicy Bypass -Scope Process -Force; iex ((New-Object System.Net.WebClient).DownloadString('https://chocolatey.org/install.ps1'))
    
    # Install WebDeploy (It will deploy 3.6)
    choco install webdeploy -y

    We won’t be deploying an application on it, as we created a sample PrimeNumber app. But in the real world you might be deploying a web-based application, and you will need IIS. We have covered the basic idea of how to install IIS along with any dependencies that might be required.

    Conclusion

    In this post, we covered deploying Windows- and Linux-based applications using Jenkinsfiles in both scripted and declarative formats.

    Thanks for Reading! Till next time…!!

  • Lessons Learnt While Building an ETL Pipeline for MongoDB & Amazon Redshift Using Apache Airflow

    Recently, I was involved in building an ETL (Extract-Transform-Load) pipeline. It involved extracting data from MongoDB collections, performing transformations, and then loading the results into Redshift tables. Many ETL solutions on the market solve parts of the problem, but the key part of an ETL process lies in its ability to transform or process raw data before it is pushed to its destination.

    Each ETL pipeline comes with specific business requirements around processing data which are hard to achieve using off-the-shelf ETL solutions. This is why a majority of ETL solutions are custom built from scratch. In this blog, I am going to talk about my learnings from building a custom ETL solution that moved data from MongoDB to Redshift using Apache Airflow.

    Background:

    I began by writing a Python-based command line tool which supported different phases of ETL, like extracting data from MongoDB, processing extracted data locally, uploading the processed data to S3, loading data from S3 to Redshift, post-processing and cleanup. I used the PyMongo library to interact with MongoDB and the Boto library for interacting with Redshift and S3.

    I kept each operation atomic so that multiple instances of each operation could run independently of one another, which helps achieve parallelism. One of the major challenges was achieving parallelism while running the ETL tasks. One option was to develop our own framework based on threads, or to build a distributed task scheduler using a task queue like Celery combined with a message broker like RabbitMQ. After doing some research, I settled on Apache Airflow. Airflow is a Python-based scheduler where you can define DAGs (Directed Acyclic Graphs), which run on a given schedule and execute tasks in parallel within each phase of your ETL. You define a DAG as Python code, and Airflow also lets you manage the state of your DAG runs using variables. Features like task retries on failure are a plus.

    We faced several challenges while making the above ETL workflow near real-time and fault tolerant. The challenges and their solutions are discussed below:

    Keeping your ETL code changes in sync with Redshift schema

    While building the ETL tool, you may start fetching a new field from MongoDB, but at the same time you have to add the corresponding column to the Redshift table. If you fail to do so, the ETL pipeline will start failing. To tackle this, I created a database migration tool which became the first step in my ETL workflow.

    The migration tool would:

    • keep the migration status in a Redshift table and
    • would track all migration scripts in a code directory.

    In each ETL run, it would fetch the most recently run migrations from Redshift and search for any new migration scripts in the code directory. If found, it would run the new migration scripts, after which the regular ETL tasks would run. This puts the onus on developers to add a migration script whenever they add or remove a field fetched from MongoDB.
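The core of such a migration step can be sketched in a few lines of Python. The function below is a hypothetical helper, not the actual tool: it compares migrations recorded in Redshift against the scripts present in the code directory and returns the ones that still need to run:

```python
def pending_migrations(applied: list, available: list) -> list:
    """Return migration scripts present in the code directory but not yet
    recorded as applied in the Redshift status table.

    Scripts are named with a sortable prefix (e.g. 001_add_col.sql) so
    the pending ones run in order.
    """
    applied_set = set(applied)
    return sorted(m for m in available if m not in applied_set)

# Example: two scripts already ran, one new script is pending.
applied = ["001_create_users.sql", "002_add_email.sql"]
available = ["001_create_users.sql", "002_add_email.sql", "003_add_phone.sql"]
```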

    Maintaining data consistency

    While extracting data from MongoDB, one needs to ensure all the collections are extracted at a specific point in time else there can be data inconsistency issues. We need to solve this problem at multiple levels:

    • While extracting data from MongoDB, define a parameter like a modified date and extract data from the different collections with a filter of records less than or equal to that date. This ensures you fetch point-in-time data from MongoDB.
    • While loading data into Redshift tables, don’t load directly into the master table; instead, load into a staging table. Once data for all related collections is staged, load it from staging into master within a single transaction. This way, data is either updated in all related tables or in none of them.
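In Redshift SQL, the staging-to-master hand-off might look like the sketch below; the table names are illustrative. Because the promotion statements sit inside one transaction, readers never see a partially loaded state:

```sql
BEGIN;

-- Remove master rows that have fresher copies in staging (sketch for one table).
DELETE FROM orders USING orders_staging
WHERE orders.id = orders_staging.id;

-- Promote staged rows for every related table in the same transaction.
INSERT INTO orders SELECT * FROM orders_staging;
INSERT INTO order_items SELECT * FROM order_items_staging;

COMMIT;

-- TRUNCATE commits implicitly in Redshift, so clear staging afterwards.
TRUNCATE orders_staging;
TRUNCATE order_items_staging;
```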

    A single bad record can break your ETL

    While moving data across the ETL pipeline into Redshift, one needs to take care of field formats. For example, the date field in the incoming data can be formatted differently than in the Redshift schema. Another example is incoming data exceeding the length of a field in the schema. Redshift’s COPY command, which is used to load data from files into Redshift tables, is very vulnerable to such variations. Even a single incorrectly formatted record will lead to all your data getting rejected, effectively breaking the ETL pipeline.

    There are multiple ways to solve this problem: either handle it in one of the transform jobs in the pipeline, or put the onus on Redshift to handle these variances. Redshift’s COPY command has many options which can help. Some of the most useful are:

    • ACCEPTANYDATE: Allows any date format, including invalid formats such as 00/00/00 00:00:00, to be loaded without generating an error.
    • ACCEPTINVCHARS: Enables loading of data into VARCHAR columns even if the data contains invalid UTF-8 characters.
    • TRUNCATECOLUMNS: Truncates data in columns to the appropriate number of characters so that it fits the column specification.
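A COPY invocation combining these options might look like the following; the bucket, prefix, table, and IAM role are placeholders:

```sql
COPY orders_staging
FROM 's3://my-etl-bucket/orders/part_'
IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy'
FORMAT AS CSV
GZIP
ACCEPTANYDATE
ACCEPTINVCHARS
TRUNCATECOLUMNS;
```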

    Redshift going out of storage

    Redshift is based on PostgreSQL, and one of the common problems is that when you delete records from Redshift tables, the space is not actually freed up. So if your ETL process deletes and creates records frequently, you may run out of Redshift storage. The VACUUM operation is the solution to this problem. Instead of making VACUUM part of your main ETL flow, define a separate workflow on its own schedule to run it. VACUUM reclaims space and re-sorts rows in either a specified table or all tables in the current database, and can be run as FULL, SORT ONLY, DELETE ONLY, or REINDEX. More information on VACUUM can be found here.

    ETL instance going out of storage

    Your ETL will generate a lot of files while extracting data from MongoDB onto your ETL instance. It is very important to periodically delete those files, otherwise you are very likely to run out of storage on your ETL server. If your MongoDB data is huge, you might end up creating very large files. Again, I would recommend defining a separate workflow on a different schedule to run the cleanup.

    Making ETL Near Real Time

    Processing only the delta rather than doing a full load in each ETL run

    ETL is faster if you keep track of already processed data and process only the new data. If you do a full load in each ETL run, the solution will not scale as your data grows. As a solution, we made it mandatory for every collection in our MongoDB to have a created and a modified date. Our ETL checks the maximum value of the modified date for a given collection in the Redshift table, then generates a filter query to fetch only those records from MongoDB whose modified date is greater than that maximum. It may be difficult to make such changes in your product, but it’s worth the effort!
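Building the delta filter is then a one-liner using a PyMongo-style query dictionary; the field name `modified` is an assumption from our own schema convention:

```python
from datetime import datetime

def delta_filter(last_modified: datetime) -> dict:
    """Build a MongoDB query fetching only records changed since the
    newest `modified` value already present in Redshift."""
    return {"modified": {"$gt": last_modified}}

# Example: only records modified after the last ETL run are fetched,
# e.g. collection.find(delta_filter(checkpoint)) in PyMongo.
checkpoint = datetime(2021, 6, 1, 0, 0, 0)
query = delta_filter(checkpoint)
```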

    Compressing and splitting files while loading

    A good approach is to write files in a compressed format. It saves storage space on the ETL server and also helps when loading data into Redshift: the Redshift COPY documentation suggests providing compressed files as input. Also, instead of a single huge file, split your data into parts and give all the files to a single COPY command. This enables Redshift to use its computing resources across the cluster to copy in parallel, leading to faster loads.
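A sketch of writing the extract as several gzipped part files, ready to be referenced by a single COPY with a common prefix; the part size and naming scheme are arbitrary choices for illustration:

```python
import gzip
import json

def write_gzip_parts(records, part_size):
    """Split records into gzipped, newline-delimited JSON parts.

    Returns a list of (name, bytes) pairs; in the real pipeline these
    would be uploaded to S3 under a common prefix so one COPY command
    can load them all in parallel.
    """
    parts = []
    for i in range(0, len(records), part_size):
        chunk = records[i:i + part_size]
        payload = "\n".join(json.dumps(r) for r in chunk).encode()
        parts.append((f"part_{i // part_size:04d}.json.gz", gzip.compress(payload)))
    return parts

# Example: 5 records split into parts of 2 gives 3 files.
parts = write_gzip_parts([{"id": n} for n in range(5)], part_size=2)
```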

    Streaming mongo data directly to S3 instead of writing it to ETL server

    One of the major overheads in the ETL process is writing data to the ETL server first and then uploading it to S3. To reduce disk IO, don’t store the data on the ETL server at all; instead, use MongoDB’s handy stream API. In the MongoDB Node driver, both collection.find() and collection.aggregate() return cursors, and the stream method accepts a transform function as a parameter, where all your custom transform logic can go. The AWS S3 Node library’s upload() function also accepts readable streams. Take the stream from the MongoDB stream method, pipe it into zlib to gzip it, and feed the resulting readable stream into the S3 library. Simple! This small change yields a large improvement in the ETL process.
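The same idea can be sketched in Python: compress documents as they stream off the cursor into an in-memory buffer, so nothing touches the ETL server's disk. Here `fake_cursor` stands in for a PyMongo cursor, and the resulting file-like object is what you would hand to boto3's `upload_fileobj`:

```python
import gzip
import io
import json

def stream_to_gzip_buffer(cursor):
    """Consume an iterable of documents and return a file-like object of
    gzipped, newline-delimited JSON, ready for S3's upload_fileobj."""
    buf = io.BytesIO()
    with gzip.GzipFile(fileobj=buf, mode="wb") as gz:
        for doc in cursor:
            gz.write((json.dumps(doc) + "\n").encode())
    buf.seek(0)
    return buf

# fake_cursor stands in for collection.find(...) streaming documents.
fake_cursor = iter([{"_id": 1, "name": "a"}, {"_id": 2, "name": "b"}])
buf = stream_to_gzip_buffer(fake_cursor)
```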

    Optimizing Redshift Queries

    Optimizing Redshift queries helps make the ETL system highly scalable and efficient, and also reduces cost. Let’s look at some of the approaches:

    Add a distribution key

    The Redshift database is clustered, meaning your data is stored across cluster nodes. When you query for a certain set of records, Redshift has to search for those records on each node, leading to slow queries. A distribution key is a single column which decides how all data records are distributed across your tables. If you have a column that is present in all your data, you can specify it as the distribution key. When loading data into Redshift, all rows with a given value of the distribution key are placed on a single node of the cluster, so when you query for those records, Redshift knows exactly where to look. This only helps when you also use the distribution key in your queries.
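Declaring a distribution key is part of the table DDL; here customer_id is an illustrative choice of column:

```sql
CREATE TABLE orders (
    id          BIGINT,
    customer_id BIGINT,
    amount      DECIMAL(10, 2)
)
DISTSTYLE KEY
DISTKEY (customer_id)   -- rows sharing a customer_id land on one node
SORTKEY (customer_id);
```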

    Source: Slideshare


    Generating a numeric primary key for string primary key

    In MongoDB, you can have any type of field as your primary key. If your Mongo collections have a non-numeric primary key and you use those same keys in Redshift, your joins will end up being on string keys, which are slower. Instead, generate numeric keys for your string keys and join on them, which makes queries run much faster. Redshift supports marking a column with the IDENTITY attribute, which auto-generates a unique numeric value for the column that you can use as your primary key.
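In DDL this looks like the following sketch (table and column names are illustrative); the string Mongo _id is kept for traceability while joins use the generated numeric key:

```sql
CREATE TABLE users (
    user_key BIGINT IDENTITY(1, 1),  -- auto-generated numeric surrogate key
    mongo_id VARCHAR(24),            -- original string _id from MongoDB
    name     VARCHAR(256)
);
```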

    Conclusion:

    In this blog, I have covered best practices for building ETL pipelines for Redshift based on my learnings. There are many more recommended practices which can easily be found in the Redshift and MongoDB documentation.

  • Prow + Kubernetes – A Perfect Combination To Execute CI/CD At Scale

    Intro

    Kubernetes is currently the hottest and most standard way of deploying workloads in the cloud. It’s well suited for companies and vendors that need self-healing, high availability, cloud-agnostic characteristics, and easy extensibility.

    Now, on another front, a problem has arisen within the CI/CD domain. Since people are using Kubernetes as the underlying orchestrator, they need a robust CI/CD tool that is entirely Kubernetes-native.

    Enter Prow

    Prow complements the Kubernetes family in the realm of automation and CI/CD.

    In fact, it is the project that best exemplifies why and how Kubernetes is such a superb platform for executing CI/CD at scale.

    Prow (meaning: the portion of a ship’s bow, its front end, that is above water) is a Kubernetes-native CI/CD system, and it has been used by many projects over the past few years, such as Kyma, Istio, Kubeflow, and OpenShift.

    Where did it come from?

    Kubernetes is one of the largest and most successful open-source projects on GitHub. At the time of Prow’s conception, the Kubernetes community was trying hard to keep its head above water in matters of CI/CD. Their needs included executing more than 10k CI/CD jobs per day, spanning 100+ repositories across various GitHub organizations, and other automation stacks were simply not capable of handling everything at this scale.

    So, the Kubernetes Testing SIG created their own tools to complement Prow. Because Prow currently resides under the Kubernetes test-infra project, one might underestimate its true capabilities. I would personally like to see Prow receive a dedicated repo, out from under the umbrella of test-infra.

    What is Prow?

    Prow is not too complex to understand, yet it is vast in a subtle way. It is designed and built on a distributed microservice architecture native to Kubernetes.

    It has many components that integrate with one another (plank, hook, etc.) and a bunch of standalone ones that are more of a plug-n-play nature (trigger, config-updater, etc.).

    For the context of this blog, I will not be covering Prow’s entire architecture, but feel free to dive into it on your own later. 

    Just to name the main building blocks of Prow:

    • Hook – acts as an API gateway that intercepts all requests from GitHub; it then creates a Prow job custom resource by reading the job configuration, and calls any specific plugin if needed.
    • Plank – the Prow job controller; after Hook creates a Prow job, Plank processes it and creates a Kubernetes pod to run the tests.
    • Deck – serves as the UI for the history of jobs that ran in the past or are currently running.
    • Horologium – the component that processes periodic jobs only.
    • Sinker – responsible for cleaning up old jobs and pods from the cluster.

    More can be found here: Prow Architecture. Note that this link is not the official Kubernetes doc but comes from another great open-source project that uses Prow extensively day in, day out – Kyma.

    This is how Prow can be pictured:


     

     

    Here is a list of things Prow can do and why it was conceived in the first place.

    • GitHub Automation on a wide range

      – ChatOps via slash commands like “/foo”
      – Fine-tuned policies and permission management in GitHub via OWNERS files
      – tide – PR/merge automation
      – ghProxy – caches GitHub API requests to avoid hitting API rate limits
      – label plugin – labels management 
      – branchprotector – branch protection configuration 
      – releasenote – release notes management
    • Job execution engine – Plank
    • Job status reporting to the CI/CD dashboard – crier
    • Dashboards for comprehensive job/PR history, merge status, real-time logs, and other statuses – Deck
    • Plug-n-play service to interact with GCS and show job artifacts on the dashboard – Spyglass
    • Super easy pluggable Prometheus stack for observability – metrics
    • Config-as-Code for Prow itself – updateconfig
    • And many more, like sinker, branchprotector, etc.

    Possible Jobs in Prow

    Here, a job means any “task that is executed over a trigger.” This trigger can be anything from a GitHub commit to a new PR or a periodic cron trigger. Possible jobs in Prow include:

    • Presubmit – these jobs are triggered when a new GitHub PR is created.
    • Postsubmit – triggered when a new commit is pushed to a branch (for example, when a PR merges).
    • Periodic – triggered on a specific cron time trigger.
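    For illustration, a periodic job config follows the same schema as the presubmit config shown later in this post. A sketch (the job name is hypothetical; `interval` can be swapped for a `cron` expression):

```yaml
periodics:
- name: periodic-community-nightly   # hypothetical name
  interval: 24h                      # or, e.g., cron: "0 2 * * *"
  decorate: true
  spec:
    containers:
    - image: golang:1.12.5
      command:
      - /bin/bash
      args:
      - -c
      - "make verify"
```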

    Possible states for a job

    • triggered – a new Prow-job custom resource is created reading the job configs
    • pending – a pod is created in response to the Prow-job to run the scripts/tests; Prow-job will be marked pending while the pod is getting created and running 
    • success – if a pod succeeds, the Prow-job status will change to success 
    • failure – if a pod fails, the Prow-job status will be marked failure
    • aborted – when a job is running and the same one is retriggered, the first ProwJob execution is aborted (its status changes to aborted) and the new one is marked pending
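    These states live in the status of the ProwJob custom resource itself. A simplified sketch of such a resource (field values hypothetical):

```yaml
apiVersion: prow.k8s.io/v1
kind: ProwJob
metadata:
  name: 32456927-35d9-11e9-9fee-0a580a6c0269   # hypothetical generated name
spec:
  type: presubmit
  job: pull-community-verify
status:
  state: pending   # triggered | pending | success | failure | aborted
```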

    What a job config looks like:

    presubmits:
      kubernetes/community:
      - name: pull-community-verify  # convention: (job type)-(repo name)-(suite name)
        branches:
        - master
        decorate: true
        always_run: true
        spec:
          containers:
          - image: golang:1.12.5
            command:
            - /bin/bash
            args:
            - -c
            - "export PATH=$GOPATH/bin:$PATH && make verify"

    • Here, this job is a “presubmit” type, meaning it will be executed when a PR is created against the “master” branch in the repo “kubernetes/community”.
    • As shown in the spec, a pod will be created from the golang:1.12.5 image, the repo will be cloned into it, and the given command will be executed at the start of the container.
    • The output of that command will decide if the pod has succeeded or failed, which will, in turn, decide if the Prow job has successfully completed.

    More job configs used by Kubernetes itself can be found here – Jobs

    Getting a minimal Prow cluster up and running on your local system in minutes

    Pre-reqs:

    • Knowledge of Kubernetes 
    • Knowledge of Google Cloud and IAM

    For the context of this blog, I have created a sample GitHub repo containing all the basic manifest and config files. Basic CI has also been configured for this repo. Feel free to clone/fork it and use it as a getting-started guide.

    Let’s look at the directory structure for the repo:

    .
    ├── docker/     # Contains docker image in which all the CI jobs will run
    ├── hack/       # Contains small hack scripts used in a wide range of jobs 
    ├── hello.go
    ├── hello_test.go
    ├── Dockerfile
    ├── Makefile
    ├── prow
    │   ├── cluster/       # Install prow on k8s cluster
    │   ├── jobs/          # CI jobs config
    │   ├── labels.yaml    # Prow label config for managing github labels
    │   ├── config.yaml    # Prow config
    │   └── plugins.yaml   # Prow plugins config
    └── README.md

    1. Create a bot account. For info, look here. Add this bot as a collaborator in your repo. 

    2. Create an OAuth2 token from the GitHub GUI for the bot account.

    $ echo "PUT_TOKEN_HERE" > oauth
    $ kubectl create secret generic oauth --from-file=oauth=oauth

    3. Generate an hmac token using OpenSSL; Hook uses it to validate incoming GitHub webhooks.

    $ openssl rand -hex 20 > hmac
    $ kubectl create secret generic hmac --from-file=hmac=hmac

    4. Install all the Prow components mentioned in prow-starter.yaml.

    $ make deploy-prow

    5. Update all the jobs and plugins needed for the CI (rules are defined in the Makefile):

    • Updates in plugins.yaml and presubmits.yaml:
      – Change the repo name (velotio-tech/k8s-prow-guide) for the jobs to be configured
    • Updates in config.yaml:
      – Create a GCS bucket
      – Update the name of the GCS bucket (GCS_BUCKET_NAME) in config.yaml
      – Create a service_account.json with GCS storage permission and download the JSON file
      – Create a secret from the above service_account.json:

    $ kubectl create secret generic gcs-sa --from-file=service-account.json=service-account.json

      – Update the secret name (GCS_SERVICE_ACC) in config.yaml
    • Apply the updated configs:

    $ make update-config
    $ make update-plugins
    $ make update-jobs

    6. To expose a webhook from the GitHub repo and point it to the local machine, install and use Ultrahook. This will give you a publicly accessible endpoint. In my case, the result looked like this: http://github.sanster23.ultrahook.com.

    $ echo "api_key: <API_KEY_ULTRAHOOK>" > ~/.ultrahook
    $ ultrahook github http://<MINIKUBE_IP>:<HOOK_NODE_PORT>/hook

    7. Create a webhook in your repo so that all events can be published to Hook via the public URL above:

    • Set the webhook URL from Step 6
    • Set Content Type as application/json
    • Set the value of the secret token the same as the hmac token secret created in Step 3
    • Check the “Send me everything” box

    8. Create a new PR and see the magic.

    9. The Prow dashboard will be accessible at http://<MINIKUBE_IP>:<DECK_NODE_PORT>

    • MINIKUBE_IP : 192.168.99.100  ( Run “minikube ip”)
    • DECK_NODE_PORT :  32710 ( Run “kubectl get svc deck” )

    I will leave you with an official reference of the Prow dashboard:

    What’s Next

    Above is an effort to give you a taste of what Prow can do and how easy it is to set up for infrastructure of any scale and a project of any complexity.

    P.S. – Content about Prow is scarce, making it a bit unexplored in certain ways, but I found the #prow channel on the Kubernetes Slack helpful. Hopefully, this helps you explore the uncharted waters of Kubernetes-native CI/CD.

  • Setting Up A Robust Authentication Environment For OpenSSH Using QR Code PAM

    Do you like WhatsApp Web authentication? WhatsApp Web has always fascinated me with the simplicity of its QR-code-based authentication. Though similar authentication UIs exist, I always wondered whether a remote secure shell (SSH) session could be authenticated with a QR code with this kind of simplicity while keeping the auth process secure. In this guide, we will see how to write and implement a bare-bones PAM module for OpenSSH on a Linux-based system.

    “OpenSSH is the premier connectivity tool for remote login with the SSH protocol. It encrypts all traffic to eliminate eavesdropping, connection hijacking, and other attacks. In addition, OpenSSH provides a large suite of secure tunneling capabilities, several authentication methods, and sophisticated configuration options.”

    openssh.com

    Meet PAM!

    PAM, short for “Pluggable Authentication Module,” is a middleware that abstracts authentication features on Linux and UNIX-like operating systems. PAM has been around for more than two decades. The authentication process could be cumbersome with each service authenticating users against a different set of hardware and software, such as username-password, fingerprint modules, face recognition, two-factor authentication, LDAP, etc. But the underlying process remains the same, i.e., users must be authenticated as who they say they are. This is where PAM comes into the picture: it provides an API to the application layer, along with built-in functions to implement and extend authentication capability.

    Source: Redhat

    Understand how OpenSSH interacts with PAM

    The Linux host's OpenSSH (sshd daemon) begins by reading the configuration defined in /etc/pam.conf or, alternatively, in the /etc/pam.d configuration files. The config files are defined per service name, with various realms (auth, account, session, password). The “auth” realm is what takes care of authenticating users as who they say they are. A typical sshd PAM service file on Ubuntu can be seen below; compare it with your own flavor of Linux:

    @include common-auth
    account    required     pam_nologin.so
    @include common-account
    session [success=ok ignore=ignore module_unknown=ignore default=bad]        pam_selinux.so close
    session    required     pam_loginuid.so
    session    optional     pam_keyinit.so force revoke
    @include common-session
    session    optional     pam_motd.so  motd=/run/motd.dynamic
    session    optional     pam_motd.so noupdate
    session    optional     pam_mail.so standard noenv # [1]
    session    required     pam_limits.so
    session    required     pam_env.so # [1]
    session    required     pam_env.so user_readenv=1 envfile=/etc/default/locale
    session [success=ok ignore=ignore module_unknown=ignore default=bad]        pam_selinux.so open
    @include common-password

    The common-auth file has an “auth” realm with the pam_unix.so PAM module, which is responsible for authenticating the user with a password. Our goal is to write a PAM module that replaces pam_unix.so with our own version.

    When OpenSSH makes calls to the PAM module, the very first function it looks for is “pam_sm_authenticate,” along with some other mandatory functions such as pam_sm_setcred. Thus, we will be implementing the pam_sm_authenticate function, which will be the entry point to our shared object library. The module should return PAM_SUCCESS (0) as the return code for successful authentication.

    Application Architecture

    The project architecture has four main applications. The backend is hosted on an AWS cloud with minimal and low-cost infrastructure resources.

    1. PAM Module: Provides QR-Code auth prompt to client SSH Login

    2. Android Mobile App: Authenticates SSH login by scanning a QR code

    3. QR Auth Server API: Backend application to which our Android app connects to communicate and share the authentication payload along with some other meta information

    4. WebSocket Server (API Gateway WebSocket and NodeJS) App: the PAM module and the server-side app share auth message payloads in real time

    When a user connects to the remote server via SSH, the PAM module is triggered, offering a QR code for authentication. Information is exchanged over the API Gateway WebSocket, which in turn saves temporary auth data in DynamoDB. The user then uses an Android mobile app (written in react-native) to scan the QR code.

    Upon scanning, the app connects to the API gateway. The API call is first authenticated by AWS Cognito to avoid any intrusion. The request is then proxied to the Lambda function, which authenticates the input payload by comparing it against the information available in DynamoDB. Upon successful authentication, the Lambda function calls the API Gateway WebSocket to inform the PAM module to authenticate the user.

    Framework and Toolchains

    PAM modules are shared object libraries that must be written in C (although other languages can be used via wrappers or cross-language calls, like python-pam or pam_exec). Below are the frameworks and tools I am using for this project:

    1. gcc, make, automake, autoreconf, libpam (GNU dev tools on Ubuntu OS)

    2. libqrencode, libwebsockets, libpam, libssl, libcrypto (C libraries)

    3. NodeJS, express (for server-side app)

    4. API Gateway HTTP API and API Gateway WebSocket, AWS Lambda (AWS cloud services for hosting the serverless server-side app)

    5. Serverless framework (for easily deploying infrastructure)

    6. react-native, react-native-qrcode-scanner (for Android mobile app)

    7. AWS Cognito (for authentication)

    8. AWS Amplify Library

    This guide assumes you have a basic understanding of the Linux OS, C programming language, pointers, and gcc code compilation. For the backend APIs, I prefer to use NodeJS as a primary programming language, but you may opt for the language of your choice for designing HTTP APIs.

    Authentication with QR Code PAM Module

    When the module initializes, we first generate a random string with the help of the “/dev/urandom” character device. The byte string obtained from this device contains non-printable characters, so we encode it with Base64. Let’s call this string the auth verification string.

    void get_random_string(char *random_str, int length)
    {
        FILE *fp = fopen("/dev/urandom", "r");
        if (!fp) {
            perror("Unable to open urandom device");
            exit(EXIT_FAILURE);
        }
        fread(random_str, length, 1, fp);
        fclose(fp);
    }

    char random_string[11];

    // get a random string
    get_random_string(random_string, 10);
    // Base64-encode it: the input comes from /dev/urandom and may contain binary chars
    const int encoded_length = Base64encode_len(10);
    base64_string = (char *)malloc(encoded_length + 1);
    Base64encode(base64_string, random_string, 10);
    base64_string[encoded_length] = '\0';

    We then initiate a WebSocket connection with the help of the libwebsockets library and connect to our API Gateway WebSocket endpoint. Once the connection is established, we inform the server that a user may try to authenticate with the auth verification string. The API Gateway WebSocket returns a unique connection ID to our PAM module.

    static void connect_client(struct lws_sorted_usec_list *sul)
    {
       struct vhd_minimal_client_echo *vhd =
           lws_container_of(sul, struct vhd_minimal_client_echo, sul);
       struct lws_client_connect_info i;
       char host[128];
       lws_snprintf(host, sizeof(host), "%s:%u", *vhd->ads, *vhd->port);
       memset(&i, 0, sizeof(i));
       i.context = vhd->context;
      //i.port = *vhd->port;
       i.port = *vhd->port;
       i.address = *vhd->ads;
       i.path = *vhd->url;
       i.host = host;
       i.origin = host;
       i.ssl_connection = LCCSCF_USE_SSL | LCCSCF_ALLOW_SELFSIGNED | LCCSCF_SKIP_SERVER_CERT_HOSTNAME_CHECK | LCCSCF_PIPELINE;
      //i.ssl_connection = 0;
       if ((*vhd->options) & 2)
           i.ssl_connection |= LCCSCF_USE_SSL;
       i.vhost = vhd->vhost;
       i.iface = *vhd->iface;
      //i.protocol = ;
       i.pwsi = &vhd->client_wsi;
      //lwsl_user("connecting to %s:%d/%s\n", i.address, i.port, i.path);
       log_message(LOG_INFO,ws_applogic.pamh,"About to create connection %s",host);
      //return !lws_client_connect_via_info(&i);
       if (!lws_client_connect_via_info(&i))
           lws_sul_schedule(vhd->context, 0, &vhd->sul,
                    connect_client, 10 * LWS_US_PER_SEC);
    }

    Upon receiving the connection ID from the server, the PAM module converts it to a SHA1 hash string and composes a unique string for generating the QR code. This string consists of three parts separated by colons (:), i.e.,

    “qrauth:BASE64(AUTH_VERIFY_STRING):SHA1(CONNECTION_ID).” For example, let’s say a random Base64 encoded string is “UX6t4PcS5doEeA==” and connection id is “KZlfidYvBcwCFFw=”

    Then the final encoded string is “qrauth:UX6t4PcS5doEeA==:2fc58b0cc3b13c3f2db49a5b4660ad47c873b81a”.

    This string is then encoded into a UTF-8 QR code with the help of the libqrencode library, and the authentication screen is prompted by the PAM module.

    char *con_id=strstr(msg,ws_com_strings[READ_WS_CONNECTION_ID]);
               int length = strlen(ws_com_strings[READ_WS_CONNECTION_ID]);
              
               if(!con_id){
                   pam_login_status=PAM_AUTH_ERR;
                   interrupted=1;
                   return;
               }
               con_id+=length;
               log_message(LOG_DEBUG,ws_applogic.pamh,"strstr is %s",con_id);
               string_crypt(ws_applogic.sha_code_hex, con_id);
               sprintf(temp_text,"qrauth:%s:%s",ws_applogic.authkey,ws_applogic.sha_code_hex);
               char *qr_encoded_text=get_qrcode_string(temp_text);
               ws_applogic.qr_encoded_text=qr_encoded_text;
               conv_info(ws_applogic.pamh,"\nSSH Auth via QR Code\n\n");
               conv_info(ws_applogic.pamh, ws_applogic.qr_encoded_text);
               log_message(LOG_INFO,ws_applogic.pamh,"Use Mobile App to Scan \n %s",ws_applogic.qr_encoded_text);
               log_message(LOG_INFO,ws_applogic.pamh,"%s",temp_text);
               ws_applogic.current_action=READ_WS_AUTH_VERIFIED;
               sprintf(temp_text,ws_com_strings[SEND_WS_EXPECT_AUTH],ws_applogic.authkey,ws_applogic.username);
               websocket_write_back(wsi,temp_text,-1);
               conv_read(ws_applogic.pamh,"\n\nUse Mobile SSH QR Auth App to Authenticate SSH Login and Press Enter\n\n",PAM_PROMPT_ECHO_ON);

    API Gateway WebSocket App

    We used the Serverless Framework to easily create and deploy our infrastructure resources. With the serverless CLI, we use the aws-nodejs template (serverless create --template aws-nodejs). You can find a detailed guide on Serverless, API Gateway WebSocket, and DynamoDB here. Below is the template YAML definition. Note that the DynamoDB resource has TTL set on the expires_at property, which holds a UNIX epoch timestamp.

    What this means is that any record we store is automatically deleted at the epoch time set. We keep each record for only 5 minutes, which also means the user must authenticate within 5 minutes of the authentication request to the remote SSH server.
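    The expiry computation itself is a one-liner; a sketch of how a handler might derive expires_at:

```javascript
// expires_at: current UNIX epoch time in seconds plus a 5-minute (300 s) TTL window
const TTL_SECONDS = 300;
const expiresAt = Math.floor(Date.now() / 1000) + TTL_SECONDS;

console.log(expiresAt);
```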

    service: ssh-qrapp-websocket
    frameworkVersion: '2'
    useDotenv: true
    provider:
     name: aws
     runtime: nodejs12.x
     lambdaHashingVersion: 20201221
     websocketsApiName: ssh-qrapp-websocket
     websocketsApiRouteSelectionExpression: $request.body.action
     region: ap-south-1
     iam:
       role:
         statements:
           - Effect: Allow
             Action:
               - "dynamodb:query"
               - "dynamodb:GetItem"
               - "dynamodb:PutItem"
             Resource:
               - Fn::GetAtt: [ SSHAuthDB, Arn ]
     environment:
       REGION: ${env:REGION}
       DYNAMODB_TABLE: SSHAuthDB
       WEBSOCKET_ENDPOINT: ${env:WEBSOCKET_ENDPOINT}
       NODE_ENV: ${env:NODE_ENV}
    package:
     patterns:
       - '!node_modules/**'
       - handler.js
       - '!package.json'
       - '!package-lock.json'
    plugins:
     - serverless-dotenv-plugin
    layers:
     sshQRAPPLibs:
       path: layer
       compatibleRuntimes:
         - nodejs12.x
    functions:
     connectionHandler:
       handler: handler.connectHandler
       timeout: 60
       memorySize: 256
       layers:
         - {Ref: SshQRAPPLibsLambdaLayer}
       events:
         - websocket:
            route: $connect
            routeResponseSelectionExpression: $default
     disconnectHandler:
       handler: handler.disconnectHandler
       memorySize: 256
       timeout: 60
       layers:
         - {Ref: SshQRAPPLibsLambdaLayer}
       events:
         - websocket: $disconnect
     defaultHandler:
       handler: handler.defaultHandler
       memorySize: 256
       timeout: 60
       layers:
         - {Ref: SshQRAPPLibsLambdaLayer}
       events:
         - websocket: $default
     customQueryHandler:
       handler: handler.queryHandler
       memorySize: 256
       timeout: 60
       layers:
         - {Ref: SshQRAPPLibsLambdaLayer}
       events:
         - websocket:
            route: expectauth
            routeResponseSelectionExpression: $default
         - websocket:
            route: getconid
            routeResponseSelectionExpression: $default
         - websocket:
            route: verifyauth
            routeResponseSelectionExpression: $default
    resources:
     Resources:
       SSHAuthDB:
         Type: AWS::DynamoDB::Table
         Properties:
           TableName: ${env:DYNAMODB_TABLE}
           AttributeDefinitions:
             - AttributeName: authkey
               AttributeType: S
           KeySchema:
             - AttributeName: authkey
               KeyType: HASH
           TimeToLiveSpecification:
             AttributeName: expires_at
             Enabled: true
           ProvisionedThroughput:
             ReadCapacityUnits: 2
             WriteCapacityUnits: 2

    The API Gateway WebSocket has three custom events, which arrive as an argument to the Lambda function in “event.body.action.” API Gateway WebSocket refers to these as route selection expressions. These custom events are:

    • The “expectauth” event is sent by the PAM module to the WebSocket server, informing it that a client has asked for authentication and that the mobile application may try to authenticate by scanning the QR code. During this event, the WebSocket handler stores the connection ID along with the auth verification string, which acts as the primary key of our DynamoDB table.
    • The “getconid” event is sent to retrieve the current connection ID so that the PAM module can generate a SHA1 sum and present the QR code prompt.
    • The “verifyauth” event is sent by the PAM module to confirm and verify authentication. During this event, the WebSocket server also expects a random challenge-response text. It retrieves the data payload from DynamoDB, using the auth verification string as the primary key, and checks that the key “authVerified” is marked “true” (more on this later).
    // Excerpt from handler.js; assumes `const { DynamoDB } = require("aws-sdk");` at the top of the file
    queryHandler: async (event,context) => {
       const payload = JSON.parse(event.body);
       const documentClient = new DynamoDB.DocumentClient({
         region : process.env.REGION
       });
       try {
         switch(payload.action){
           case 'expectauth':
            
             const expires_at = parseInt(new Date().getTime() / 1000) + 300;
      
             await documentClient.put({
               TableName : process.env.DYNAMODB_TABLE,
               Item: {
                 authkey : payload.authkey,
                 connectionId : event.requestContext.connectionId,
                 username : payload.username,
                 expires_at : expires_at,
                 authVerified: false
               }
             }).promise();
             return {
               statusCode: 200,
               body : "OK"
             };
           case 'getconid':
             return {
               statusCode: 200,
               body: `connectionid:${event.requestContext.connectionId}`
             };
           case 'verifyauth':
             const data = await documentClient.get({
               TableName : process.env.DYNAMODB_TABLE,
               Key : {
                 authkey : payload.authkey
               }
             }).promise();
             if(!("Item" in data)){
               throw "Failed to query data";
             }
             if(data.Item.authVerified === true){
               return {
                 statusCode: 200,
                 body: `authverified:${payload.challengeText}`
               }
             }
             throw "auth verification failed";
         }
       } catch (error) {
         console.log(error);
       }
       return {
         statusCode:  200,
         body : "ok"
        };
      
     }

    Android App: SSH QR Code Auth

     

    The Android app consists of two parts: app login and scanning the QR code for authentication. The AWS Cognito and Amplify libraries ease the process of a secure login; just by wrapping your react-native app with the “withAuthenticator” component, you get a ready-to-use login screen. We then use the react-native-qrcode-scanner component to scan the QR code.

    This component returns the decoded string on a successful scan. Application logic then splits the string and checks its validity. If the decoded string is a valid application string, an API call is made to the server with the appropriate payload.

    render(){
       return (
         <View style={styles.container}>
           {this.state.authQRCode ?
           <AuthQRCode
            hideAuthQRCode = {this.hideAuthQRCode}
            qrScanData = {this.qrScanData}
           />
           :
           <View style={{marginVertical: 10}}>
           <Button title="Auth SSH Login" onPress={this.showAuthQRCode} />
           <View style={{margin:10}} />
           <Button title="Sign Out" onPress={this.signout} />
           </View>
          
           }
         </View>
       );
     }
         // Inside the QR-scan success handler:
         const scanCode = e.data.split(':');
         if(scanCode.length <3){
           throw "invalid qr code";
         }
         const [appstring,authcode,shacode] = scanCode;
         if(appstring !== "qrauth"){
           throw "Not a valid app qr code";
         }
         const authsession = await Auth.currentSession();
         const jwtToken = authsession.getIdToken().jwtToken;
         const response = await axios({
           url : "https://API_GATEWAY_URL/v1/app/sshqrauth/qrauth",
           method : "post",
           headers : {
             Authorization : jwtToken,
             'Content-Type' : 'application/json'
           },
           responseType: "json",
           data : {
             authcode,
             shacode
           }
         });
         if(response.data.status === 200){
           rescanQRCode=false;
           setTimeout(this.hideAuthQRCode, 1000);
         }

    This guide does not cover how to deploy react-native Android applications. You may refer to the official react-native guide to deploy your application to the Android mobile device.

    QR Auth API

    The QR Auth API is built using a serverless framework with aws-nodejs template. It uses API Gateway as HTTP API and AWS Cognito for authorizing input requests. The serverless YAML definition is defined below.

    service: ssh-qrauth-server
    frameworkVersion: '2 || 3'
    useDotenv: true
    provider:
     name: aws
     runtime: nodejs12.x
     lambdaHashingVersion: 20201221
     deploymentBucket:
       name: ${env:DEPLOYMENT_BUCKET_NAME}
     httpApi:
       authorizers:
         cognitoJWTAuth:
           identitySource: $request.header.Authorization
           issuerUrl: ${env:COGNITO_ISSUER}
           audience:
             - ${env:COGNITO_AUDIENCE}
     region: ap-south-1
     iam:
       role:
         statements:
         - Effect: "Allow"
           Action:
             - "dynamodb:Query"
             - "dynamodb:PutItem"
             - "dynamodb:GetItem"
           Resource:
             - ${env:DYNAMO_DB_ARN}
         - Effect: "Allow"
           Action:
             - "execute-api:Invoke"
             - "execute-api:ManageConnections"
           Resource:
             - ${env:API_GATEWAY_WEBSOCKET_API_ARN}/*
     environment:
       REGION: ${env:REGION}
       COGNITO_ISSUER: ${env:COGNITO_ISSUER}
       DYNAMODB_TABLE: ${env:DYNAMODB_TABLE}
       COGNITO_AUDIENCE: ${env:COGNITO_AUDIENCE}
       POOLID: ${env:POOLID}
       COGNITOIDP: ${env:COGNITOIDP}
       WEBSOCKET_ENDPOINT: ${env:WEBSOCKET_ENDPOINT}
    package:
     patterns:
       - '!node_modules/**'
       - handler.js
       - '!package.json'
       - '!package-lock.json'
       - '!.env'
       - '!test.http'
    plugins:
     - serverless-deployment-bucket
     - serverless-dotenv-plugin
    layers:
     qrauthLibs:
       path: layer
       compatibleRuntimes:
         - nodejs12.x
    functions:
     sshauthqrcode:
       handler: handler.authqrcode
       memorySize: 256
       timeout: 30
       layers:
         - {Ref: QrauthLibsLambdaLayer}
       events:
         - httpApi:
             path: /v1/app/sshqrauth/qrauth
             method: post
             authorizer:
               name: cognitoJWTAuth

    Once API Gateway authenticates the incoming request, control is handed over to the serverless-express router. At this stage, we verify the payload for the auth verification string scanned by the Android mobile app. This string must be present in the DynamoDB table. Upon retrieving the record keyed by the auth verification string, we read the connection ID property and convert it to a SHA1 hash. If the hash matches the hash available in the request payload, we mark the record “authVerified” as “true” and inform the PAM module via the API Gateway WebSocket API. The PAM module then takes care of further validation via the challenge-response text.

    The entire authentication flow is depicted in a flow diagram, and the architecture is depicted in the cover post of this blog.

     

    Compiling and Installing PAM module

    Unlike ordinary C programs, PAM modules are shared libraries, so the compiled code may be loaded at an arbitrary address in memory. Thus, the module must be compiled as position-independent code: with gcc, we must pass the -fPIC option while compiling, and use the -shared flag while linking to generate the shared object binary.

    gcc -I$PWD -fPIC -c $(ls *.c)
    gcc -shared -o pam_qrapp_auth.so $(ls *.o) -lpam -lqrencode -lssl -lcrypto -lpthread -lwebsockets
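    A minimal Makefile capturing these two steps (assuming all the module's .c files live in the current directory) could look like:

```makefile
# Minimal sketch: compile every .c as position-independent code,
# then link the objects into a PAM shared object.
SRCS := $(wildcard *.c)
OBJS := $(SRCS:.c=.o)
LIBS := -lpam -lqrencode -lssl -lcrypto -lpthread -lwebsockets

pam_qrapp_auth.so: $(OBJS)
	gcc -shared -o $@ $(OBJS) $(LIBS)

%.o: %.c
	gcc -I$(CURDIR) -fPIC -c $< -o $@

clean:
	rm -f $(OBJS) pam_qrapp_auth.so
```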

    To ease this process of compiling and validating libraries, I prefer to use the autoconf tool. The entire project, along with the autoconf scripts, is checked in at my GitHub repository.

    Once the shared object file is generated (pam_qrapp_auth.so), copy it to the “/usr/lib64/security/” directory and run the ldconfig command to inform the OS that a new shared library is available. In /etc/pam.d/sshd, remove “@include common-auth” (if applicable) or any line that uses the “auth” realm with the pam_unix.so module, since pam_unix.so enforces password or private key authentication. We then add our module to the auth realm (“auth required pam_qrapp_auth.so”). Depending upon your Linux flavor, your /etc/pam.d/sshd file may look similar to the below:

    auth       required     pam_qrapp_auth.so
    account    required     pam_nologin.so
    @include common-account
    session [success=ok ignore=ignore module_unknown=ignore default=bad]        pam_selinux.so close
    session    required     pam_loginuid.so
    session    optional     pam_keyinit.so force revoke
    @include common-session
    session    optional     pam_motd.so  motd=/run/motd.dynamic
    session    optional     pam_motd.so noupdate
    session    optional     pam_mail.so standard noenv # [1]
    session    required     pam_limits.so
    session    required     pam_env.so # [1]
    session    required     pam_env.so user_readenv=1 envfile=/etc/default/locale
    session [success=ok ignore=ignore module_unknown=ignore default=bad]        pam_selinux.so open
    @include common-password

    Finally, we need to configure the sshd daemon to allow challenge-response authentication. Open /etc/ssh/sshd_config and add “ChallengeResponseAuthentication yes” if it is missing, commented out, or set to “no.” Reload the sshd service by issuing the command “systemctl reload sshd.” Voila, we are done here.

    Conclusion

    This guide was a bare-bones tutorial and is not meant for production use. There are certain flaws in this PAM module. For example, the module should prompt for a password change if the password has expired, and login should be denied if an account is locked, along with similar security-hardening features. Also, the Android mobile app should be bound to the SSH username so that only the AWS Cognito user bound to that SSH username can authenticate.

One known limitation of this PAM module is that we always have to hit enter after scanning the QR Code via the Android mobile app. This limitation comes from how OpenSSH itself is implemented: the OpenSSH server suppresses all informational text unless user input is required. In our case, the informational text is the UTF8 QR Code itself.

However, no such input is actually required from the interactive device, as the authentication event comes from the WebSocket to the PAM module. But if we do not ask the user to explicitly press enter after scanning the QR Code, the QR Code will never be displayed. Thus the input here is a dummy. This is a known issue with OpenSSH’s handling of PAM_TEXT_INFO. Find more about the issue here.

    References

    Pluggable authentication module

    An introduction to Pluggable Authentication Modules (PAM) in Linux

    Custom PAM for SSHD in C

    google-authenticator-libpam

    PAM_TEXT_INFO and PAM_ERROR_MSG conversation not honoured during PAM authentication

  • Building Dynamic Forms in React Using Formik

Every day we see a huge number of web applications that allow customization. They involve drag & drop or metadata-driven UI interfaces to support multiple layouts while keeping a single backend. A feedback-collection system is one of the simplest examples of such products: on the admin side, one can manage the layout, and on the consumer side, users are shown that layout to capture the data. This post focuses on building a microframework to support such use cases with the help of React and Formik.

Building big forms in React can be extremely time-consuming and tedious when structural changes are requested. Handling their validations also eats too much time in the development life cycle. If we use Redux-based solutions to simplify this, like Redux-Form, we see a lot of performance bottlenecks. So here comes Formik!

    Why Formik?

    “Why” is one of the most important questions while solving any problem. There are quite a few reasons to lean towards Formik for the implementation of such systems, such as:

    • Simplicity
    • Advanced validation support with Yup
    • Good community support with a lot of people helping on Github

That being said, it’s one of the easiest frameworks for quick form-building activities. Formik’s clean API lets us use it without worrying about a lot of state management.

Yup is probably the best library out there for validation, and Formik provides out-of-the-box support for Yup validations, which makes it even more programmer-friendly!

    API Responses:

    We need to follow certain API structures to let our React code understand which component to render where.

    Let’s assume we will be getting responses from the backend API in the following fashion.

[{
   "type": "text",
   "field": "name",
   "name": "User's name",
   "style": {
         "width": "50%"
    }
}]

We can have any number of fields, but each one will have two mandatory unique properties: type and field. We will use those properties to build the UI as well as the response.
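For instance, to seed Formik’s initialValues from this metadata, a small helper (my own sketch, not from the article; the field names are illustrative) can reduce over the mandatory field property:

```javascript
// Sketch: derive Formik initialValues from the API metadata.
// The entries below are illustrative, not from a real backend.
const fields = [
  { type: 'text', field: 'name', name: "User's name" },
  { type: 'number', field: 'age', name: "User's age" },
];

// Every field starts out empty; the keys come from the "field" property.
const buildInitialValues = (config) =>
  config.reduce((values, { field }) => ({ ...values, [field]: '' }), {});

console.log(buildInitialValues(fields)); // { name: '', age: '' }
```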

    So let’s start with building the simplest form with React and Formik.

    import React from 'react';
    import { useFormik } from 'formik';
    
    const SignupForm = () => {
      const formik = useFormik({
        initialValues: {
          email: '',
        },
        onSubmit: values => {
          alert(JSON.stringify(values, null, 2));
        },
      });
      return (
        <form onSubmit={formik.handleSubmit}>
          <label htmlFor="email">Email Address</label>
          <input
            id="email"
            name="email"
            type="email"
            onChange={formik.handleChange}
            value={formik.values.email}
          />
          <button type="submit">Submit</button>
        </form>
      );
    };
    
    export default SignupForm;

    <div id="root"></div>

    import React, { Component } from 'react';
    import { render } from 'react-dom';
    import Basic from './Basic';
    import './style.css';
    
    class App extends Component {
      constructor() {
        super();
        this.state = {
          name: 'React'
        };
      }
    
      render() {
        return (
          <div>
            <Basic />
          </div>
        );
      }
    }
    
    render(<App />, document.getElementById('root'));

    {
      "name": "react",
      "version": "0.0.0",
      "private": true,
      "dependencies": {
        "react": "^16.12.0",
        "react-dom": "^16.12.0",
        "formik": "latest"
      },
      "scripts": {
        "start": "react-scripts start",
        "build": "react-scripts build",
        "test": "react-scripts test --env=jsdom",
        "eject": "react-scripts eject"
      },
      "devDependencies": {
        "react-scripts": "latest"
      }
    }

    h1, p {
      font-family: Lato;
    }

    You can view the fiddle of above code here to see the live demo.

We will go with the latest functional components to build this form. You can find more information on the useFormik hook in the useFormik Hook documentation.

    It’s nothing more than just a wrapper for Formik functionality.

    Adding dynamic nature

    So let’s first create and import the mocked API response to build the UI dynamically.

    import React from 'react';
    import { useFormik } from 'formik';
    import response from "./apiresponse"
    
    const SignupForm = () => {
      const formik = useFormik({
        initialValues: {
          email: '',
        },
        onSubmit: values => {
          alert(JSON.stringify(values, null, 2));
        },
      });
      return (
        <form onSubmit={formik.handleSubmit}>
          <label htmlFor="email">Email Address</label>
          <input
            id="email"
            name="email"
            type="email"
            onChange={formik.handleChange}
            value={formik.values.email}
          />
          <button type="submit">Submit</button>
        </form>
      );
    };
    
    export default SignupForm;

    You can view the fiddle here.

We simply imported the file and made it available for processing. Now we need to write the logic to build components dynamically.
So let’s visualize the possible DOM hierarchy of components:

<Container>
	<TextField />
	<NumberField />
	<Container>
		<TextField />
		<BooleanField />
	</Container>
</Container>

We can have containers nested within containers recursively, so let’s address this by adding a children attribute in the API response.

    export default [
      {
        "type": "text",
        "field": "name",
        "label": "User's name"
      },
      {
        "type": "number",
        "field": "number",
        "label": "User's age",
      },
      {
        "type": "none",
        "field": "none",
        "children": [
          {
            "type": "text",
            "field": "user.hobbies",
            "label": "User's hobbies"
          }
        ]
      }
    ]

    You can see the fiddle with response processing here with live demo.

    To process the recursive nature, we will create a separate component.

    import React, { useMemo } from 'react';
    
    const RecursiveContainer = ({config, formik}) => {
      const builder = (individualConfig) => {
        switch (individualConfig.type) {
          case 'text':
            return (
                    <>
                    <div>
                      <label htmlFor={individualConfig.field}>{individualConfig.label}</label>
                      <input type='text' 
                        name={individualConfig.field} 
                        onChange={formik.handleChange} style={{...individualConfig.style}} />
                      </div>
                    </>
                  );
          case 'number':
            return (
              <>
                <div>
                  <label htmlFor={individualConfig.field}>{individualConfig.label}</label>
                      <input type='number' 
                        name={individualConfig.field} 
                        onChange={formik.handleChange} style={{...individualConfig.style}} />
                </div>
              </>
            )
          case 'array':
            return (
              <RecursiveContainer config={individualConfig.children || []} formik={formik} />
            );
          default:
            return <div>Unsupported field</div>
        }
      }
    
      return (
        <>
          {config.map((c) => {
            return builder(c);
          })}
        </>
      );
    };
    
    export default RecursiveContainer;

    You can view the complete fiddle of the recursive component here.

    So what we do in this is pretty simple. We pass config which is a JSON object that is retrieved from the API response. We simply iterate through config and build the component based on type. When the type is an array, we create the same component RecursiveContainer which is basic recursion.

We can make it more robust by passing the current depth and restricting rendering to an nth possible depth to avoid stack-overflow errors at runtime. Specifying a depth limit ultimately makes the component less prone to runtime errors. There is no standard limit; it varies from use case to use case. If you are planning to build a system based on a compliance questionnaire, it can go to a max depth of 5 to 7, while for a basic signup form, it’s often only 2.
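One way to enforce such a limit, using a helper of my own making (configDepth is not part of the article’s code), is to measure the nesting of the config before rendering:

```javascript
// Sketch: compute how deeply a config nests via its "children" arrays,
// so rendering can be refused beyond a chosen limit.
const MAX_DEPTH = 5; // illustrative; pick a limit that fits your forms

const configDepth = (config) =>
  1 + Math.max(0, ...config.map((c) => (c.children ? configDepth(c.children) : 0)));

const isRenderable = (config) => configDepth(config) <= MAX_DEPTH;
```

RecursiveContainer could call isRenderable once at the top, or alternatively thread a depth prop through each recursive call and stop rendering when it exceeds MAX_DEPTH.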

    So we generated the forms but how do we validate them? How do we enforce required, min, max checks on the form?

For this, Yup is very helpful. Yup is an object schema validation library that helps us validate an object and gives us the results back. Its chaining syntax makes it much easier to build incremental validation functions.

    Yup provides us with a vast variety of existing validations. We can combine them, specify error or warning messages to be thrown and much more.

    You can find more information on Yup at Yup Official Documentation

    To build a validation function, we need to pass a Yup schema to Formik.

    Here is a simple example: 

    import React from 'react';
    import { useFormik } from 'formik';
    import response from "./apiresponse"
    import RecursiveContainer from './RecursiveContainer';
    import * as yup from 'yup';
    
    const SignupForm = () => {
      const signupSchema = yup.object().shape({
          name: yup.string().required()
      });
    
      const formik = useFormik({
        initialValues: {
        },
        onSubmit: values => {
          alert(JSON.stringify(values, null, 2));
        },
        validationSchema: signupSchema
      });
      console.log(formik, response)
      return (
        <form onSubmit={formik.handleSubmit}>
          <RecursiveContainer config={response} formik={formik} />
          <button type="submit">Submit</button>
        </form>
      );
    };
    
    export default SignupForm;

    You can see the schema usage example here.

    In this example, we simply created a schema and passed it to useFormik hook. You can notice now unless and until the user enters the name field, the form submission is not working.

    Here is a simple hack to make the button disabled until all necessary fields are filled.

    import React from 'react';
    import { useFormik } from 'formik';
    import response from "./apiresponse"
    import RecursiveContainer from './RecursiveContainer';
    import * as yup from 'yup';
    
    const SignupForm = () => {
      const signupSchema = yup.object().shape({
          name: yup.string().required()
      });
    
      const formik = useFormik({
        initialValues: {
        },
        onSubmit: values => {
          alert(JSON.stringify(values, null, 2));
        },
        validationSchema: signupSchema
      });
      console.log(formik, response)
      return (
        <form onSubmit={formik.handleSubmit}>
          <RecursiveContainer config={response} formik={formik} />
          <button type="submit" disabled={!formik.isValid}>Submit</button>
        </form>
      );
    };
    
    export default SignupForm;

    You can see how to use submit validation with live fiddle here

Formik exposes a wide variety of state while the form is being rendered, and we can use it in whatever way suits us. You can find the full API of Formik at the Formik Official Documentation.

    So existing validations are fine but we often get into cases where we would like to build our own validations. How do we write them and integrate them with Yup validations?

For this, there are 2 different ways with Formik + Yup. Either we can extend Yup to support the additional validation, or we can pass a validation function to Formik. The validation function approach is much simpler: you just need to write a function that gives an error object back to Formik. As simple as it sounds, it does get messy at times.
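For completeness, here is what the plain validate-function approach looks like. Formik calls the function with the current values and expects back an object keyed by field names, where an empty object means the form is valid (the rules below are illustrative):

```javascript
// Sketch of a Formik `validate` function: return an errors object whose keys
// mirror the field names; an empty object means everything is valid.
const validate = (values) => {
  const errors = {};
  if (!values.name) {
    errors.name = 'Name is required';
  } else if (values.name.length > 50) {
    errors.name = 'Must be 50 characters or less';
  }
  return errors;
};
// It would then be passed to useFormik as `validate` instead of `validationSchema`.
```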

    So we will see an example of adding custom validation to Yup. Yup provides us an addMethod interface to add our own user-defined validations in the application.

Let’s say we want to create an alias for an existing validation to support different casing, because that’s the most common mistake we see: url is written as Url, or trim comes from the backend as Trim. These method names are case sensitive, so yup.Url will fail while yup.url gives us a function. These are just some examples; you can also alias methods under other names, for example aliasing required as the more readable NotEmpty.

    The usage is very simple and straightforward as follows: 

yup.addMethod(yup.string, "URL", function(...args) {
    return this.url(...args);
});

    This will create an alias for url as URL.

    Here is an example of custom method validation which takes Y and N as boolean values.

    const validator = function (message) {
        return this.test('is-string-boolean', message, function (value) {
          if (isEmpty(value)) {
            return true;
          }
    
          if (['Y', 'N'].indexOf(value) !== -1) {
            return true;
          } else {
            return false;
          }
        });
      };

Once this validator is registered via yup.addMethod under both casings, we will be able to execute yup.string().stringBoolean() and yup.string().StringBoolean().

    It’s a pretty handy syntax that lets users create their own validations. You can create many more validations in your project to be used with Yup and reuse them wherever required.

Writing a schema by hand is also a cumbersome task, and it is useless if the form is dynamic. When the form is dynamic, the validations also need to be dynamic. Yup’s chaining syntax lets us achieve this very easily.

    We will consider that the backend sends us additional following things with metadata.

[{
   "type": "text",
   "field": "name",
   "name": "User's name",
   "style": {
         "width": "50%"
    },
   "validationType": "string",
   "validations": [{
          "type": "required",
          "params": ["Name is required"]
    }]
}]

validationType will hold Yup’s data types like string, number, date, etc., and validations will hold the validations that need to be applied to that field.

    So let’s have a look at the following snippet which utilizes the above structure and generates dynamic validation.

    import * as yup from 'yup';
    
    /** Adding just additional methods here */
    
    yup.addMethod(yup.string, "URL", function(...args) {
        return this.url(...args);
    });
    
    
    const validator = function (message) {
        return this.test('is-string-boolean', message, function (value) {
          if (isEmpty(value)) {
            return true;
          }
    
          if (['Y', 'N'].indexOf(value) !== -1) {
            return true;
          } else {
            return false;
          }
        });
      };
    
    yup.addMethod(yup.string, "stringBoolean", validator);
    yup.addMethod(yup.string, "StringBoolean", validator);
    
    
    
    
    export function createYupSchema(schema, config) {
      const { field, validationType, validations = [] } = config;
      if (!yup[validationType]) {
        return schema;
      }
      let validator = yup[validationType]();
      validations.forEach((validation) => {
        const { params, type } = validation;
        if (!validator[type]) {
          return;
        }
        validator = validator[type](...params);
      });
      if (field.indexOf('.') !== -1) {
    // nested fields are not covered in this example but are easy to handle though
      } else {
        schema[field] = validator;
      }
    
      return schema;
    }
    
    export const getYupSchemaFromMetaData = (
      metadata,
      additionalValidations,
      forceRemove
    ) => {
  const yupSchema = metadata.reduce(createYupSchema, {});
  const mergedSchema = {
    ...yupSchema,
    ...additionalValidations,
      };
    
      forceRemove.forEach((field) => {
        delete mergedSchema[field];
      });
    
      const validateSchema = yup.object().shape(mergedSchema);
    
      return validateSchema;
    };

    You can see the complete live fiddle with dynamic validations with formik here.

    Here we have added the above code snippets to show how easily we can add a new method to Yup. Along with it, there are two functions createYupSchema and getYupSchemaFromMetaData which drive the whole logic for building dynamic schema. We are passing the validations in response and building the validation from it.

    createYupSchema simply builds Yup validation based on the validation array and validationType. getYupSchemaFromMetaData basically iterates over the response array and builds Yup validation for each field and at the end, it wraps it in the Object schema. In this way, we can generate dynamic validations. One can even go further and create nested validations with recursion.
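To see the chaining mechanics in isolation, without pulling in Yup, here is a toy stand-in driven by the same metadata shape (toyYup and its rules array are hypothetical names, for illustration only):

```javascript
// Toy stand-in for Yup that just records which rules were chained,
// driven by the same { validationType, validations } metadata shape.
const toyYup = {
  string: () => ({
    rules: [],
    required(message) { this.rules.push(['required', message]); return this; },
  }),
};

const config = {
  field: 'name',
  validationType: 'string',
  validations: [{ type: 'required', params: ['Name is required'] }],
};

// Same pattern as createYupSchema: start from the type factory, then chain.
let validator = toyYup[config.validationType]();
config.validations.forEach(({ type, params }) => {
  if (validator[type]) validator = validator[type](...params);
});
```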

    Conclusion

In the traditional approach of writing large boilerplate for forms, adding just one more field is time-consuming; this approach eliminates the need for hardcoding fields and allows them to be backend-driven.

    Formik provides very optimized state management which reduces performance issues that we generally see when Redux is used and updated quite frequently.

    As we see above, it’s very easy to build dynamic forms with Formik. We can save the templates and even create template libraries that are very common with question and answer systems. If utilized correctly, we can simply have the templates saved in some NoSQL databases, like MongoDB and can generate a vast number of forms quickly with ease along with validations.

To learn more and build optimized solutions, you can also refer to the FastField and Field APIs in the official documentation. Thanks for reading!

  • Building Google Photos Alternative Using AWS Serverless

    Being an avid Google Photos user, I really love some of its features, such as album, face search, and unlimited storage. However, when Google announced the end of unlimited storage on June 1st, 2021, I started thinking about how I could create a cheaper solution that would meet my photo backup requirement.

    “Taking an image, freezing a moment, reveals how rich reality truly is.”

    – Anonymous

Google offers 100 GB of storage for 130 INR. This storage can be used across various Google applications. However, I don’t use all that space in one go. I snap photos randomly. Sometimes, I visit places and take random snaps with my DSLR and smartphone. So, in general, I upload approximately 200 photos monthly. The size of these photos varies from 4MB to 30MB. On average, I may be using 4GB of storage monthly, backing up raw photos, even the bad ones, on my external hard drive. Photos backed up on the cloud should be visually high-quality, and it’s good to have a raw copy available at the same time so that you can make some Lightroom changes (although I never touch them 😛). So, here are my minimal requirements:

    • Should support social authentication (Google sign-in preferred).
    • Photos should be stored securely in raw format.
    • Storage should be scaled with usage.
    • Uploading and downloading photos should be easy.
    • Web view for preview would be a plus.
    • Should have almost no operations headache and solution should be as cheap as possible 😉.

    Selecting Tech Stack

To avoid operational headaches with servers going down, scaling, application crashes, and overall monitoring, I opted for a serverless solution on AWS. AWS S3 is infinitely scalable storage, and you only pay for the amount of storage you use. On top of that, you can opt for an S3 storage class that is efficient and cost-effective.

    – Infrastructure Stack

    1. AWS API Gateway (http api)
    2. AWS Lambda (for processing images and API gateway queries)
    3. Dynamodb (for storing image metadata)
    4. AWS Cognito (for authentication)
    5. AWS S3 Bucket (for storage and web application hosting)
    6. AWS Certificate Manager (to use SSL certificate for a custom domain with API gateway)

    – Software Stack

    1. NodeJS
    2. ReactJS and Material-UI (front-end framework and UI)
    3. AWS Amplify (for simplifying auth flow with cognito)
    4. Sharp (high-speed nodejs library for converting images)
5. Express and serverless-http
    6. Infinite Scroller (for gallery view)
    7. Serverless Framework (for ease of deployment and Infrastructure as Code)

    Create S3 Buckets:

We will create three S3 buckets. The first one is for hosting the frontend application (refer to the architecture diagram; more on this later in the build and hosting part). The second one is for temporarily uploading images. The third one is for actual backup and storage (enable server-side encryption on this bucket). Images uploaded to the temporary bucket will be pre-processed there.

During pre-processing, we will resize the original image into two different sizes: one for thumbnail purposes (400px width), and another for viewing purposes but with reduced quality (webp format). Once the images are resized, we upload all three variants (raw, thumbnail, and webview) to the third S3 bucket and create a record in dynamodb. Set an object expiry policy of 1 day on the temporary bucket so that uploaded objects are automatically deleted from it.
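As a rough sketch of the naming scheme (the raw/, thumbnail/, and webview/ prefixes are my own convention, not mandated by the article), the processing Lambda could derive the three destination keys from the uploaded object key like this:

```javascript
// Sketch: derive destination keys for the three variants from the uploaded
// key. identityId is the user's federated identity id (see authorization).
const destinationKeys = (identityId, key) => {
  const base = key.replace(/\.jpe?g$/i, ''); // strip the .jpg/.jpeg extension
  return {
    raw: `${identityId}/raw/${key}`,                   // untouched original
    thumbnail: `${identityId}/thumbnail/${base}.webp`, // 400px-wide resize
    webview: `${identityId}/webview/${base}.webp`,     // reduced-quality webp
  };
};
```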

    Setup trigger on the temporary bucket for uploaded images:

We will need to set up an S3 PUT event, which will trigger our Lambda function to download and process images. We will filter on the suffixes .jpg and .jpeg for the event trigger, meaning that any file with extension .jpg or .jpeg uploaded to our temporary bucket will automatically invoke the lambda function with the event payload. Using the event payload, the lambda function will download the uploaded file and process it. Your serverless function definition would look like:

    functions:
     lambda:
       handler: index.handler
       memorySize: 512
       timeout: 60
       layers:
         - {Ref: PhotoParserLibsLambdaLayer}
       events:
         - s3:
             bucket: your-temporary-bucket-name
             event: s3:ObjectCreated:*
             rules:
               - suffix: .jpg
             existing: true
         - s3:
             bucket: your-temporary-bucket-name
             event: s3:ObjectCreated:*
             rules:
               - suffix: .jpeg
             existing: true

Notice that in the YAML events section, we set “existing: true”. This ensures that the bucket will not be created during the serverless deployment. However, if you do not plan to create your s3 bucket manually, you can let the framework create the bucket for you.

    DynamoDB as metadatadb:

AWS dynamodb is a key-value document db that is suitable for our use case. Dynamodb will help us retrieve the list of photos as a time series. Dynamodb uses a primary key to uniquely identify each record. A primary key is composed of a hash key (partition key) and an optional range key (also called a sort key). We will use the federated identity ID (discussed in the authorization setup) as the hash key, naming the attribute username with type string. We will use an attribute named timestamp with type number as the range key. The range key will let us query results as a time series (Unix epoch). We could also use dynamodb secondary indexes to sort results more specifically; however, to keep the application simple, we’re going to skip this feature for now. Your serverless resource definition would look like:

    resources:
     Resources:
       MetaDataDB:
         Type: AWS::DynamoDB::Table
         Properties:
           TableName: your-dynamodb-table-name
           AttributeDefinitions:
             - AttributeName: username
               AttributeType: S
             - AttributeName: timestamp
               AttributeType: N
           KeySchema:
             - AttributeName: username
               KeyType: HASH
             - AttributeName: timestamp
               KeyType: RANGE
           BillingMode: PAY_PER_REQUEST
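With this key schema in place, the backend’s time-series query can be expressed as DynamoDB Query parameters like the following (a sketch; the table name and the page size of 10 are assumptions to match the gallery view described later):

```javascript
// Sketch: DocumentClient-style Query params fetching a user's 10 most recent
// photos. "timestamp" clashes with expression keywords, so it is aliased via
// ExpressionAttributeNames.
const listPhotosParams = (username, beforeEpoch) => ({
  TableName: 'your-dynamodb-table-name',
  KeyConditionExpression: 'username = :u AND #ts < :t',
  ExpressionAttributeNames: { '#ts': 'timestamp' },
  ExpressionAttributeValues: { ':u': username, ':t': beforeEpoch },
  ScanIndexForward: false, // newest first, thanks to the timestamp range key
  Limit: 10,
});
```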

    Finally, you also need to set up the IAM role so that the process image lambda function would have access to the S3 bucket and dynamodb. Here is the serverless definition for the IAM role.

    # you can add statements to the Lambda function's IAM Role here
     iam:
       role:
         statements:
         - Effect: "Allow"
           Action:
             - "s3:ListBucket"
           Resource:
             - arn:aws:s3:::your-temporary-bucket-name
             - arn:aws:s3:::your-actual-photo-bucket-name
         - Effect: "Allow"
           Action:
             - "s3:GetObject"
             - "s3:DeleteObject"
           Resource: arn:aws:s3:::your-temporary-bucket-name/*
         - Effect: "Allow"
           Action:
             - "s3:PutObject"
           Resource: arn:aws:s3:::your-actual-photo-bucket-name/*
         - Effect: "Allow"
           Action:
             - "dynamodb:PutItem"
           Resource:
             - Fn::GetAtt: [ MetaDataDB, Arn ]

    Setup Authentication:

    Okay, to set up a Cognito user pool, head to the Cognito console and create a user pool with below config:

    1. Pool Name: photobucket-users

    2. How do you want your end-users to sign in?

    • Select: Email Address or Phone Number
    • Select: Allow Email Addresses
    • Check: (Recommended) Enable case insensitivity for username input

    3. Which standard attributes are required?

    • email

    4. Keep the defaults for “Policies”

    5. MFA and Verification:

    • I opted to manually reset the password for each user (since this is internal app)
    • Disabled user verification

    6. Keep the default for Message Customizations, tags, and devices.

    7. App Clients :

    • App client name: myappclient
    • Let the refresh token, access token, and id token be default
    • Check all “Auth flow configurations”
    • Check enable token revocation

    8. Skip Triggers

    9. Review and create the pool

Once created, go to App Integration -> Domain Name. Create a Cognito subdomain of your choice and note it down. Next, I plan to use the Google sign-in feature with Cognito Federated Identity Providers. Use this guide to set up a Google social identity with Cognito.

    Setup Authorization:

    Once the user identity is verified, we need to allow them to access the s3 bucket with limited permissions. Head to the Cognito console, select federated identities, and create a new identity pool. Follow these steps to configure:

    1. Identity pool name: photobucket_auth

    2. Keep Unauthenticated and Authentication flow settings unchecked.

    3. Authentication providers:

• User Pool ID: Enter the user pool ID obtained during authentication setup
• App Client ID: Enter the app client ID generated during the authentication setup. (Cognito user pool -> App Clients -> App client ID)

    4. Setup permissions:

    • Expand view details (Role Summary)
    • For authenticated identities: edit policy document and use the below JSON policy and skip unauthenticated identities with the default configuration.
    {
       "Version": "2012-10-17",
       "Statement": [
           {
               "Effect": "Allow",
               "Action": [
                   "mobileanalytics:PutEvents",
                   "cognito-sync:*",
                   "cognito-identity:*"
               ],
               "Resource": [
                   "*"
               ]
           },
           {
               "Sid": "ListYourObjects",
               "Effect": "Allow",
               "Action": "s3:ListBucket",
               "Resource": [
                   "arn:aws:s3:::your-actual-photo-bucket-name"
               ],
               "Condition": {
                   "StringLike": {
                       "s3:prefix": [
                           "${cognito-identity.amazonaws.com:sub}/",
                           "${cognito-identity.amazonaws.com:sub}/*"
                       ]
                   }
               }
           },
           {
               "Sid": "ReadYourObjects",
               "Effect": "Allow",
               "Action": [
                   "s3:GetObject"
               ],
               "Resource": [
                   "arn:aws:s3:::your-actual-photo-bucket-name/${cognito-identity.amazonaws.com:sub}",
                   "arn:aws:s3:::your-actual-photo-bucket-name/${cognito-identity.amazonaws.com:sub}/*"
               ]
           }
       ]
    }

${cognito-identity.amazonaws.com:sub} is a special AWS policy variable. When a user is authenticated with a federated identity, they are assigned a unique identity ID. The above policy means that any authenticated user has access only to objects prefixed by their own identity ID. This is how we give each user authorization to a limited area within the S3 bucket.

    Copy the Identity Pool ID (from sample code section). You will need this in your backend to get the identity id of the authenticated user via JWT token.
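Server-side, before exchanging the JWT for an identity id, the token’s claims can be read once the token has been verified. The decoder below is a bare sketch of that reading step only; it does not verify the signature, and in production you must verify against the Cognito JWKS before trusting any claim:

```javascript
// Sketch: decode (NOT verify) a JWT payload to read its claims server-side.
// A JWT is three base64url segments: header.payload.signature.
const decodeJwtPayload = (token) => {
  const payload = token.split('.')[1];
  return JSON.parse(Buffer.from(payload, 'base64url').toString('utf8'));
};
```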

    Amplify configuration for the frontend UI sign-in:

    This object helps you set up the minimal configuration for your application. This is all that we need to sign in via Cognito and access the S3 photo bucket.

    const awsconfig = {
       Auth : {
       identityPoolId: "identity pool id created during authorization setup",
           region : "your aws region",
           identityPoolRegion: "same as above if cognito is in same region",
           userPoolId : "cognito user pool id created during authentication setup",
           userPoolWebClientId : "cognito app client id",
           cookieStorage : {
               domain : "https://your-app-domain-name", //this is very important
               secure: true
           },
           oauth: {
               domain : "{cognito domain name}.auth.{cognito region name}.amazoncognito.com",
               scope : ["profile","email","openid"],
               redirectSignIn: 'https://your-app-domain-name',
               redirectSignOut: 'https://your-app-domain-name',
               responseType : "token"
           }
       },
       Storage: {
           AWSS3 : {
               bucket: "your-actual-bucket-name",
               region: "region-of-your-bucket"
           }
       }
    };
    export default awsconfig;

    You can then use the below code to configure and sign in via social authentication.

    import Amplify, {Auth} from 'aws-amplify';
    import awsconfig from './aws-config';
    Amplify.configure(awsconfig);
    //once the amplify is configured you can use below call with onClick event of buttons or any other visual component to sign in.
    //Example
    <Button startIcon={<img alt="Sigin in With Google" src={logo} />} fullWidth variant="outlined" color="primary" onClick={() => Auth.federatedSignIn({provider: 'Google'})}>
       Sign in with Google
    </Button>

    Gallery View:

When the application is loaded, we use the PhotoGallery component to load photos and show thumbnails on the page. The PhotoGallery component is a wrapper around the InfiniteScroll component, which keeps loading images as the user scrolls. The idea is that we query a maximum of 10 images in one go. Our backend returns a list of 10 images (just the mapping and metadata pointing to the S3 bucket). We then load these images from the S3 bucket and show thumbnails on-screen as a gallery view. When the user reaches the bottom of the screen or there is empty space left, the InfiniteScroll component loads 10 more images. This continues until our backend replies with a stop marker.

The key point here is that we need to send the JWT token as a header to our backend service via an AJAX call. The JWT token is obtained from the Amplify framework after sign-in. An example of obtaining a JWT token:

    let authsession = await Auth.currentSession();
    let jwtToken = authsession.getIdToken().jwtToken;
    let photoList = await axios.get(url,{
       headers : {
           Authorization: jwtToken
       },
       responseType : "json"
    });

An example of the infinite scroller component usage is given below. Note that “gallery” is a JSX-composed array of photo thumbnails. The “loadMore” method calls our AJAX function to the server-side backend, updates the “gallery” variable, and sets the “hasMore” variable to true/false so that the infinite scroller component can stop querying when there are no photos left to display on the screen.

    <InfiniteScroll
       loadMore={this.fetchPhotos}
       hasMore={this.state.hasMore}
       loader={<div style={{padding:"70px"}} key={0}><LinearProgress color="secondary" /></div>}
    >
       <div style={{ marginTop: "80px", position: "relative", textAlign: "center" }}>
           <div className="image-grid" style={{ marginTop: "30px" }}>
               {gallery}
           </div>
           {this.state.openLightBox ?
           <LightBox src={this.state.lightBoxImg} callback={this.closeLightBox} />
           : null}
       </div>
    </InfiniteScroll>

The LightBox component gives a zoom effect to the thumbnail. When a thumbnail is clicked, a higher-resolution picture (webp version) is downloaded from the S3 bucket and shown on the screen. We use the Storage object from the Amplify library. The downloaded content is a blob and must be converted into image data. To do so, we use the native JavaScript method createObjectURL. Below is the sample code that downloads the object from the S3 bucket and converts it into a viewable image for the HTML IMG tag.

    thumbClick = (index) => {
       const urlCreator = window.URL || window.webkitURL;
       this.setState({
           openLightBox: true
       });
       Storage.get(this.state.photoList[index].src, {download: true})
           .then(data => {
               //convert the downloaded blob into an object URL the IMG tag can render
               let image = urlCreator.createObjectURL(data.Body);
               this.setState({
                   lightBoxImg : image
               });
           })
           .catch(error => {
               //a surrounding try/catch would not catch async rejections, so handle them here
               console.log(error);
               this.setState({
                   openLightBox: false,
                   lightBoxImg : null
               });
           });
    };

    Uploading Photos:

The S3 SDK lets you generate a pre-signed POST URL. Anyone who has this URL can upload objects to the S3 bucket directly, without needing credentials. Of course, we can set some boundaries, like a maximum object size, the key of the uploaded object, etc. Refer to this AWS blog for more on pre-signed URLs. Here is the sample code to generate a pre-signed URL.

    let s3Params = {
   Bucket: "your-temporary-bucket-name",
       Conditions : [
           ["content-length-range",1,31457280]
       ],
       Fields : {
           key: "path/to/your/object"
       },
       Expires: 300 //in seconds
    };
    const s3 = new S3({region : process.env.AWSREGION });
    s3.createPresignedPost(s3Params, (err, presignedPost) => {
       if (err) throw err;
       //presignedPost.url and presignedPost.fields are returned to the frontend
    });

For a better UX, we can allow our users to upload more than one photo at a time. However, a pre-signed URL lets you upload only a single object. To overcome this, we generate multiple pre-signed URLs. When the user selects photos to upload, the frontend sends a request to our backend with the expected object keys, and the backend generates one pre-signed URL per key. Our React frontend then uploads the photos in parallel, providing the illusion that all photos are being uploaded as a whole.
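One way to sketch this batching (the `buildUploadKeys` helper, the key layout, and the timestamp scheme below are illustrative assumptions, not part of the original app):

```javascript
// Hedged sketch: derive one S3 object key per selected file so the backend can
// return one pre-signed POST per key. Prefixing keys with the caller's Cognito
// identity ID keeps every upload inside the user's own "folder" in the bucket.
function buildUploadKeys(identityId, fileNames, timestamp) {
  return fileNames.map(
    (name, i) => `${identityId}/${timestamp}-${i}-${name}`
  );
}

// The frontend can then fire all uploads in parallel and await them together,
// which is what makes the batch feel like a single upload (postToPresignedUrl
// is a hypothetical helper that POSTs one file to one pre-signed URL):
//
//   const results = await Promise.all(
//     files.map((file, i) => postToPresignedUrl(presignedPosts[i], file))
//   );

console.log(buildUploadKeys("us-east-1:abc", ["a.jpg", "b.jpg"], 1700000000));
```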

    When the upload is successful, the S3 PUT event is triggered, which we discussed earlier. The complete flow of the application is given in a sequence diagram. You can find the complete source code here in my GitHub repository.

    React Build Steps and Hosting:

The ideal way to build the React app is to run npm run build. However, we take a slightly different approach. We are not using an S3 static website for serving the frontend UI; for one, S3 static websites are non-SSL unless we use CloudFront. Therefore, we will make the API gateway our application’s entry point, so the UI will also be served from the API gateway. However, we want to reduce the number of calls made to the API gateway. For this reason, we will deliver only the index.html file via API Gateway/Lambda, and serve the rest of the static files (React supporting JS files) from the S3 bucket.

Your index.html should have all reference paths pointing to the S3 bucket. The build must explicitly specify that static files are located somewhere other than relative to the index.html file. Your S3 bucket needs to be public, with the right bucket policy and CORS set, so that end-users can only retrieve files and not upload nasty objects. Those who are confused about how an S3 static website and an S3 public bucket differ may refer to here. Below are the React build steps, bucket policy, and CORS.

    PUBLIC_URL=https://{your-static-bucket-name}.s3.{aws_region}.amazonaws.com/ npm run build
    //Bucket Policy
    {
       "Version": "2012-10-17",
       "Id": "http referer from your domain only",
       "Statement": [
           {
               "Sid": "Allow get requests originating from",
               "Effect": "Allow",
               "Principal": "*",
               "Action": "s3:GetObject",
               "Resource": "arn:aws:s3:::{your-static-bucket-name}/static/*",
               "Condition": {
                   "StringLike": {
                       "aws:Referer": [
                           "https://your-app-domain-name"
                       ]
                   }
               }
           }
       ]
    }
    //CORS
    [
       {
           "AllowedHeaders": [
               "*"
           ],
           "AllowedMethods": [
               "GET"
           ],
           "AllowedOrigins": [
               "https://your-app-domain-name"
           ],
           "ExposeHeaders": []
       }
    ]

Once the build is complete, upload index.html to the Lambda that serves your UI. Run the below shell commands to compress the static content and host it in our static S3 bucket.

    #assuming you are in your react app directory
    mkdir /tmp/s3uploads
    cp -ar build/static /tmp/s3uploads/
    cd /tmp/s3uploads
    #add gzip encoding to all the files
    gzip -9 `find ./ -type f`
    #remove .gz extension from compressed files
    for i in `find ./ -type f`
    do
       mv $i ${i%.*}
    done
    #sync your files to s3 static bucket and mention that these files are compressed with gzip encoding
    #so that browser will not treat them as regular files
    aws s3 --region $AWSREGION sync . s3://${S3_STATIC_BUCKET}/static/ --content-encoding gzip --delete --sse
    cd -
    rm -rf /tmp/s3uploads

Our backend uses the Node.js Express framework. Since this is a serverless application, we need to wrap Express with the serverless-http framework to work with Lambda. Sample source code is given below, along with the Serverless Framework resource definition. Notice that, except for the UI home endpoint ( “/” ), the rest of the API endpoints are authenticated with Cognito on the API gateway itself.

    const serverless = require("serverless-http");
    const express = require("express");
    const app = express();
    .
    .
    .
    .
    .
    .
    app.get("/",(req,res)=> {
     res.sendFile(path.join(__dirname + "/index.html"));
    });
    module.exports.uihome = serverless(app);

    provider:
     name: aws
     runtime: nodejs12.x
     lambdaHashingVersion: 20201221
     httpApi:
       authorizers:
         cognitoJWTAuth:
           identitySource: $request.header.Authorization
           issuerUrl: https://cognito-idp.{AWS_REGION}.amazonaws.com/{COGNITO_USER_POOL_ID}
           audience:
             - COGNITO_APP_CLIENT_ID
    .
    .
    .
    .
    .
    .
    .
    functions:
     react-serve-ui:
       handler: handler.uihome
       memorySize: 256
       timeout: 29
       layers:
         - {Ref: CommonLibsLambdaLayer}
       events:
         - httpApi:
             path: /prep/photoupload
             method: post
             authorizer:
               name: cognitoJWTAuth
         - httpApi:
             path: /list/photos
             method: get
             authorizer:
               name: cognitoJWTAuth
         - httpApi:
             path: /
             method: get

Final Steps:

Lastly, we will set up a custom domain so that we don’t need to use the gibberish domain name generated by the API gateway, along with a certificate for our custom domain. You don’t need to use Route 53 for this part; if you have an existing domain, you can create a subdomain and point it to the API gateway. First things first: head to the AWS ACM console and generate a certificate for the domain name. Once the request is generated, you need to validate your domain by creating the DNS record shown in the ACM console. ACM is a free service. Domain verification may take a few minutes to several hours. Once you have the certificate ready, head back to the API gateway console, navigate to “custom domain names”, and click create.

    1. Enter your application domain name
    2. Check TLS 1.2 as TLS version
    3. Select Endpoint type as Regional
    4. Select ACM certificate from dropdown list
    5. Create domain name

Select the newly created custom domain. Note the API gateway domain name from the Domain Details -> Configuration tab; you will need it to create a CNAME/ALIAS record with your DNS provider. Click on the API mappings tab, then Configure API mappings. From the dropdown, select your API gateway, select the default stage, and click save. You are done here.

Future Scope and Improvements:

To improve application latency, we can use CloudFront as a CDN. This way, our entry point could be S3, and we would no longer need the API gateway regional endpoint. We can also add AWS WAF in front of our API gateway for added security, inspecting incoming requests and payloads. We can use DynamoDB secondary indexes to search metadata in the table efficiently. Finally, we can add a lifecycle rule so that raw photos not accessed for more than a year are transitioned to the S3 Glacier storage class, and further add a Glacier Deep Archive transition to save more on storage costs.

  • ClickHouse – The Newest Data Store in Your Big Data Arsenal

    ClickHouse

ClickHouse is an open-source, column-oriented data warehouse for online analytical processing of queries (OLAP). It is fast, scalable, flexible, cost-efficient, and easy to run. It delivers best-in-class query performance while significantly reducing storage requirements through innovative use of columnar storage and compression.

ClickHouse’s performance exceeds that of comparable column-oriented database management systems on the market. ClickHouse is a database management system, not a single database: it allows creating tables and databases at runtime, loading data, and running queries without reconfiguring and restarting the server.

ClickHouse processes from hundreds of millions to over a billion rows of data across clusters of hundreds of nodes. It utilizes all available hardware to process queries as fast as possible. The peak processing performance for a single query stands at more than two terabytes per second.

    What makes ClickHouse unique?

• Data Storage & Compression: ClickHouse is designed to work on regular hard drives but uses SSDs and additional RAM if available. Data compression plays a crucial role in achieving ClickHouse’s excellent performance. It provides general-purpose compression codecs and specialized codecs for specific kinds of data. These codecs differ in CPU consumption and disk-space savings and help ClickHouse outperform other databases.
• High Performance: Using vectorized computation, data is processed in vectors (chunks of columns), achieving high CPU efficiency. ClickHouse supports parallel processing across multiple cores, so large queries are parallelized naturally. It also supports distributed query processing: data resides across shards, which are used for parallel execution of the query.
• Primary & Secondary Index: Data is physically sorted by the primary key, allowing low-latency extraction of specific values or ranges. Secondary indexes in ClickHouse let the database know that a query’s filtering conditions would skip some data parts entirely; therefore, they are also called data-skipping indexes.
• Support for Approximate Calculations: ClickHouse can trade accuracy for performance via approximate calculations. It provides aggregate functions for approximate counts of distinct values, medians, and quantiles, and can run queries on a sample of the data, reading proportionally less data from disk to produce an approximate result.
• Data Replication and Data Integrity Support: ClickHouse keeps identical data on several replicas. After data is written to any available replica, the remaining replicas retrieve their copy in the background. Most failures are recovered automatically, or semi-automatically in complex scenarios.

But it can’t be all good, can it? There are some disadvantages to ClickHouse as well:

    • No full-fledged transactions.
• Inability to efficiently and precisely change or remove previously inserted data. For example, to comply with GDPR, data has to be cleaned up or modified using batch deletes and updates.
    • ClickHouse is less efficient for point queries that retrieve individual rows by their keys due to the sparse index.

    ClickHouse against its contemporaries

So, with all these distinctive features, how does ClickHouse compare with other industry-leading data storage tools? ClickHouse, being general-purpose, has a variety of use cases with its own pros and cons, so here is a high-level comparison against the best tools in their respective domains. Each tool has unique traits depending on the use case, and a comparison around those would not be fair; what we care about most is performance, scalability, cost, and other key attributes that can be compared irrespective of domain. So here we go:

    ClickHouse vs Snowflake:

• With its decoupled storage & compute approach, Snowflake is able to segregate workloads and enhance performance. Snowflake’s search optimization service further improves point lookups, but has additional costs attached. ClickHouse, on the other hand, with its local runtime and built-in support for multiple forms of indexing, drastically improves query performance.
• Regarding scalability, ClickHouse being on-prem makes it slightly more challenging to scale than Snowflake, which is cloud-based. Managing hardware manually by provisioning clusters and migrating is doable but tedious. One solution is to deploy ClickHouse on cloud infrastructure, which is often cheaper and, frankly, the most viable option.

    ClickHouse vs Redshift:

• Redshift is a managed, scalable cloud data warehouse offering both provisioned and serverless options. Its RA3 nodes scale compute and cache the necessary data. Even so, it cannot isolate different workloads operating on the same data, putting it on the lower end of decoupled compute & storage cloud architectures. ClickHouse’s local runtime is one of the fastest.
• Both Redshift and ClickHouse are columnar and sort data, allowing reads to touch only the needed data. But deploying ClickHouse is cheaper, and although Redshift is tailored to be a ready-to-use tool, ClickHouse is the better choice if you are not dependent on Redshift’s managed features like configuration, backup & monitoring.

    ClickHouse vs InfluxDB:

• InfluxDB, written in Go, is an open-source NoSQL database and one of the most popular choices for time-series data and analysis. Despite being a general-purpose analytical DB, ClickHouse provides competitive write performance.
• ClickHouse’s data structures like AggregatingMergeTree allow real-time data to be stored in a pre-aggregated format, which puts it on par with dedicated time-series databases: it is significantly faster for heavy queries and comparable for light queries.

    ClickHouse vs PostgreSQL:

• Postgres is another very versatile DB and thus widely used for various use cases, just like ClickHouse. Postgres, however, is an OLTP DB, so unlike ClickHouse, analytics is not its primary aim, though it is still used for analytics to a certain extent.
• For transactional data, ClickHouse’s columnar nature puts it below Postgres; but for analytical workloads, even after tuning Postgres to its maximum potential (e.g., with materialized views, indexing, cache sizes, buffers, etc.), ClickHouse is ahead.

    ClickHouse vs Apache Druid:

    • Apache Druid is an open-source data store that is primarily used for OLAP. Both Druid & ClickHouse are very similar in terms of their approaches and use cases but differ in terms of their architecture. Druid is mainly used for real-time analytics with heavy ingestions and high uptime.
• Unlike Druid, ClickHouse has a much simpler deployment: it can run on a single server, while a Druid setup needs multiple types of nodes (master, broker, ingestion, etc.). ClickHouse, with its SQL-like query language, provides better flexibility, and it is more performant when the deployment is small.


    ClickHouse Engines

Depending on the type of your table (internal or external), ClickHouse provides an array of engines that help us connect to different data stores and determine how data is stored, accessed, and otherwise interacted with.

    These engines are mainly categorized into two types:

    Database Engines:

These allow us to work with different databases & tables.
By default, ClickHouse uses the Atomic database engine; it also provides database engines such as PostgreSQL and MySQL for working with external databases.

    Table Engines:

    These determine 

    • how and where data is stored
    • where to read/write it from/to
    • which queries it supports
    • use of indexes
    • concurrent data access and so on.

    These engines are further classified into families based on the above parameters:

    MergeTree Engines:

This is the most universal and functional family of table engines for high-load tasks. Engines of this family support quick data insertion with subsequent background data processing. They also support data replication, partitioning, secondary data-skipping indexes, and other features. The following are some of the popular engines in this family:

    • MergeTree
    • SummingMergeTree
    • AggregatingMergeTree

    MergeTree engines with indexing and partitioning support allow data to be processed at a tremendous speed. These can also be leveraged to form materialized views that store aggregated data further improving the performance.
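To make the partitioning and sort-key ideas concrete, here is a sketch of a MergeTree table definition. The page-views schema, table name, and the use of Node's global fetch against ClickHouse's HTTP interface (default port 8123) are illustrative assumptions, not part of the article:

```javascript
// Hedged sketch: compose a MergeTree DDL statement. ORDER BY defines the
// primary/sort key; PARTITION BY groups rows into monthly data parts.
function pageViewsDDL(table) {
  return [
    `CREATE TABLE IF NOT EXISTS ${table} (`,
    "  event_date Date,",
    "  user_id UInt64,",
    "  url String",
    ") ENGINE = MergeTree()",
    "PARTITION BY toYYYYMM(event_date)",
    "ORDER BY (event_date, user_id)"
  ].join("\n");
}

// ClickHouse's HTTP interface accepts the query as a POST body, e.g.
// (requires a running server; Node 18+ for global fetch):
//
//   await fetch("http://localhost:8123/", { method: "POST", body: pageViewsDDL("page_views") });

console.log(pageViewsDDL("page_views"));
```

Because the data is sorted by `(event_date, user_id)`, range filters on those columns can skip whole parts, which is where the speed comes from.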

    Log Engines:

    These are lightweight engines with minimum functionality. These work the best when the requirement is to quickly write into many small tables and read them later as a whole. This family consists of:

    • Log
    • StripeLog
    • TinyLog

    These engines append data to the disk in a sequential fashion and support concurrent reading. They do not support indexing, updating, or deleting and hence are only useful when the data is small, sequential, and immutable.

    Integration Engines:

These are used for communicating with other data storage and processing systems. This family supports:

    • JDBC
    • MongoDB
    • HDFS
    • S3
    • Kafka and so on.

    Using these engines we can import and export data from external sources. With engines like Kafka we can ingest data directly from a topic to a table in ClickHouse and with the S3 engine, we work directly with S3 objects.
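For instance, a Kafka-backed table is declared with engine settings rather than application code. The broker address, topic, consumer group, and format below are placeholder assumptions:

```javascript
// Hedged sketch: DDL for a table backed by the Kafka engine. Rows consumed from
// the topic can then be moved into a MergeTree table via a materialized view.
const kafkaTableDDL = `
CREATE TABLE queue (
  ts DateTime,
  message String
) ENGINE = Kafka
SETTINGS kafka_broker_list = 'localhost:9092',
         kafka_topic_list = 'events',
         kafka_group_name = 'clickhouse-consumer',
         kafka_format = 'JSONEachRow'
`.trim();

// As with the MergeTree example, this string would be POSTed to the ClickHouse
// HTTP interface (assuming a server at localhost:8123):
//
//   await fetch("http://localhost:8123/", { method: "POST", body: kafkaTableDDL });

console.log(kafkaTableDDL);
```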

    Special Engines:

    ClickHouse offers some special engines that are specific to the use case. For example:

    • MaterializedView
    • Distributed
    • Merge
    • File and so on.

These special engines have their own quirks; for example, with the File engine we can export data to a file, update data in the table by updating the file, etc.

    Summary

We learned that ClickHouse is a very powerful and versatile tool: one with stellar performance that is feature-packed, very cost-efficient, and open source. We saw a high-level comparison of ClickHouse with some of the best choices across an array of use cases. Although it ultimately comes down to how specific and intense your use case is, ClickHouse and its generic nature measure up pretty well on many occasions.

ClickHouse’s applicability in web analytics, network management, log analysis, time-series analysis, asset valuation in financial markets, and security threat identification makes it tremendously versatile. Consistently solving business problems with low-latency responses over petabytes of data, ClickHouse is indeed one of the fastest data warehouses out there.


  • Getting Started With Kubernetes Operators (Helm Based) – Part 1

    Introduction

The concept of operators was introduced by CoreOS in the last quarter of 2016, and since the introduction of the Operator Framework last year, operators are rapidly becoming the standard way of managing applications on Kubernetes, especially those that are stateful in nature. In this blog post, we will learn what an operator is, why operators are needed, and what problems they solve. We will also create a Helm-based operator as an example.

    This is the first part of our Kubernetes Operator Series. In the second part, getting started with Kubernetes operators (Ansible based), and the third part, getting started with Kubernetes operators (Golang based), you can learn how to build Ansible and Golang based operators.

    What is an Operator?

    Whenever we deploy our application on Kubernetes we leverage multiple Kubernetes objects like deployment, service, role, ingress, config map, etc. As our application gets complex and our requirements become non-generic, managing our application only with the help of native Kubernetes objects becomes difficult and we often need to introduce manual intervention or some other form of automation to make up for it.

Operators solve this problem by making our application a first-class Kubernetes object: we no longer deploy the application as a set of native Kubernetes objects but as a custom resource of its own kind, with a more domain-specific schema. We then bake the “operational intelligence” or “domain-specific knowledge” into a controller responsible for maintaining the desired state of this object. For example, the etcd operator makes the etcd cluster a first-class object, and to deploy a cluster we create an object of the EtcdCluster kind. With operators, we can extend Kubernetes for custom use cases and manage our applications in a Kubernetes-native way, leveraging Kubernetes APIs and kubectl tooling.

Operators combine CRDs and custom controllers, and aim to eliminate the need for manual intervention (a human operator) when performing tasks like upgrades, failure recovery, or scaling of complex (often stateful) applications, making them more resilient and self-sufficient.

    How to Build Operators ?

For building and managing operators, we mostly leverage the Operator Framework, an open-source toolkit that allows us to build operators in a highly automated, scalable, and effective way. The Operator Framework comprises three subcomponents:

1. Operator SDK: The Operator SDK is the most important component of the Operator Framework. It allows us to bootstrap an operator project in minutes. It exposes higher-level APIs and abstractions, saving developers from digging deep into Kubernetes APIs so they can focus on building the operational logic. It performs common tasks, like getting the controller to watch the custom resource (CR) for changes, as part of the project setup process.
2. Operator Lifecycle Manager: Operators run on the same Kubernetes clusters in which they manage applications, and more often than not we create multiple operators for multiple applications. The Operator Lifecycle Manager (OLM) provides a declarative way to install, upgrade, and manage all the operators and their dependencies in our cluster.
3. Operator Metering: Operator Metering is currently an alpha project. It records historical cluster usage and can generate usage reports showing a breakdown by pod or namespace over arbitrary time periods.

    Types of Operators

Currently, there are three different types of operators we can build:

1. Helm-based operators: These allow us to build operators from our existing Helm charts. They are quite easy to build and are preferred for deploying stateless applications using the operator pattern.
2. Ansible-based operators: These allow us to build operators from our existing Ansible playbooks and roles. These are also easy to build and generally preferred for stateless applications.
3. Go-based operators: These are built to solve the most complex use cases and are generally preferred for stateful applications. In the case of a Go-based operator, we build the controller logic ourselves, providing it with all our custom requirements. These operators are also relatively complex to build.

    Building a Helm based operator

    1. Let’s first install the operator sdk

    go get -d github.com/operator-framework/operator-sdk
    cd $GOPATH/src/github.com/operator-framework/operator-sdk
    git checkout master
    make dep
    make install

    Now we will have the operator-sdk binary in the $GOPATH/bin folder.      

    2.  Setup the project

For building a Helm-based operator, we can use an existing Helm chart. We will be using the book-store Helm chart, which deploys a simple Python app and MongoDB instances. This app allows us to perform CRUD operations via REST endpoints.

    Now we will use the operator-sdk to create our Helm based bookstore-operator project.

    operator-sdk new bookstore-operator --api-version=velotio.com/v1alpha1 --kind=BookStore --type=helm \
      --helm-chart=book-store --helm-chart-repo=https://akash-gautam.github.io/helmcharts/

In the above command, bookstore-operator is the name of our operator/project, --kind specifies the kind of objects this operator will watch, and --api-version is used for versioning this object. The Operator SDK takes only this much information and creates the custom resource definition (CRD) and also a custom resource (CR) of its type for us (remember the high-level abstraction the Operator SDK provides). The above command bootstraps a project with the below folder structure.

    bookstore-operator/
    |
    |- build/ # Contains the Dockerfile to build the operator image
    |- deploy/ # Contains the crd,cr and manifest files for deploying operator
    |- helm-charts/ # Contains the helm chart we used while creating the project
    |- watches.yaml # Specifies the resource the operator watches (maintains the state of)

As discussed, the operator-sdk automates setting up operator projects, and that is exactly what we observe here. Under the build folder, we have the Dockerfile to build our operator image. Under the deploy folder, we have a crds folder containing both the CRD and the CR. The deploy folder also has the operator.yaml file, which we will use to run the operator in our cluster, along with manifest files for the role, role binding, and service account to be used while deploying the operator. We have our book-store Helm chart under helm-charts. In the watches.yaml file:

    ---
    - version: v1alpha1
      group: velotio.com
      kind: BookStore
      chart: /opt/helm/helm-charts/book-store

    We can see that the bookstore-operator watches events related to BookStore kind objects and executes the helm chart specified.

    If we take a look at the cr file under deploy/crds (velotio_v1alpha1_bookstore_cr.yaml) folder then we can see that it looks just like the values.yaml file of our book-store helm chart.

    apiVersion: velotio.com/v1alpha1
    kind: BookStore
    metadata:
      name: example-bookstore
    spec:
      # Default values copied from <project_dir>/helm-charts/book-store/values.yaml
      
      # Default values for book-store.
      # This is a YAML-formatted file.
      # Declare variables to be passed into your templates.
      
      replicaCount: 1
      
      image:
        app:
          repository: akash125/pyapp
          tag: latest
          pullPolicy: IfNotPresent
        mongodb:
          repository: mongo
          tag: latest
          pullPolicy: IfNotPresent
          
      service:
        app:
          type: LoadBalancer
          port: 80
          targetPort: 3000
        mongodb:
          type: ClusterIP
          port: 27017
          targetPort: 27017
      
      
      resources: {}
        # We usually recommend not to specify default resources and to leave this as a conscious
        # choice for the user. This also increases chances charts run on environments with little
        # resources, such as Minikube. If you do want to specify resources, uncomment the following
        # lines, adjust them as necessary, and remove the curly braces after 'resources:'.
        # limits:
        #  cpu: 100m
        #  memory: 128Mi
        # requests:
        #  cpu: 100m
        #  memory: 128Mi
      
      nodeSelector: {}
      
      tolerations: []
      
      affinity: {}

In the case of Helm charts, we use the values.yaml file to pass parameters to our Helm releases; a Helm-based operator converts all these configurable parameters into the spec of our custom resource. This allows us to express values.yaml as a custom resource (CR) which, as a native Kubernetes object, gains the benefits of RBAC and an audit trail. When we want to update our deployed app, we can simply modify the CR and apply it, and the operator will ensure that the changes are reflected in our app.

For each object of `BookStore` kind, the bookstore-operator will perform the following actions:

    1. Create the bookstore app deployment if it doesn’t exists.
    2. Create the bookstore app service if it doesn’t exists.
    3. Create the mongodb deployment if it doesn’t exists.
    4. Create the mongodb service if it doesn’t exists.
    5. Ensure deployments and services match their desired configurations like the replica count, image tag, service port etc.  
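To make this concrete, a hypothetical BookStore CR driving these actions might look like the sketch below. The spec fields mirror values.yaml, but the apiVersion group and exact field names here are assumptions and depend on the CRD generated for your project.

```yaml
# Hypothetical BookStore custom resource; spec mirrors values.yaml.
# Group/version and field names are assumptions based on the chart.
apiVersion: velotio.com/v1alpha1
kind: BookStore
metadata:
  name: example-bookstore
spec:
  replicaCount: 1
  image:
    app:
      repository: akash125/pyapp
      tag: latest
  service:
    app:
      type: LoadBalancer
      port: 80
      targetPort: 3000
```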

    3. Build the Bookstore-operator Image

The Dockerfile for building the operator image is already in our build folder. We need to run the below command from the root folder of our operator project to build the image.

    operator-sdk build akash125/bookstore-operator:v0.0.1

    4. Run the Bookstore-operator

As we have our operator image ready, we can now go ahead and run it. The deployment file (operator.yaml under the deploy folder) for the operator was created as a part of our project setup; we just need to set the image for this deployment to the one we built in the previous step.

    After updating the image in the operator.yaml we are ready to deploy the operator.
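The relevant part of operator.yaml after the edit might look roughly like the fragment below; the surrounding Deployment fields follow the standard operator-sdk scaffold and may differ slightly in your generated file.

```yaml
# Sketch of the container section of deploy/operator.yaml (assumed
# scaffold layout) with the image set to the one we just built.
spec:
  template:
    spec:
      containers:
        - name: bookstore-operator
          image: akash125/bookstore-operator:v0.0.1
          imagePullPolicy: Always
```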

    kubectl create -f deploy/service_account.yaml
    kubectl create -f deploy/role.yaml
    kubectl create -f deploy/role_binding.yaml
    kubectl create -f deploy/operator.yaml

Note: The role created might have more permissions than actually required for the operator, so it is always a good idea to review it and trim down the permissions in production setups.

Verify that the operator pod is in the Running state.

    5. Deploy the Bookstore App

Now that we have the bookstore-operator running in our cluster, we just need to create the custom resource for deploying our bookstore app.

Before we can create the bookstore CR, we need to register its CRD.

    kubectl apply -f deploy/crds/velotio_v1alpha1_bookstore_crd.yaml

    Now we can create the bookstore object.

    kubectl apply -f deploy/crds/velotio_v1alpha1_bookstore_cr.yaml

Now we can see that our operator has deployed our book-store app.

    Now let’s grab the external IP of the app and make some requests to store details of books.

    Let’s hit the external IP on the browser and see if it lists the books we just stored:

    The bookstore operator build is available here.

    Conclusion

Since its early days, Kubernetes has been considered a great tool for managing stateless applications, but managing stateful applications on Kubernetes was always considered difficult. Operators are a big leap towards managing stateful applications and other complex, distributed, multi (poly) cloud workloads with the same ease with which we manage stateless applications. In this blog post, we learned the basics of Kubernetes operators and built a simple Helm-based operator. In the next installment of this blog series, we will build an Ansible-based Kubernetes operator, and then in the last blog, we will build a full-fledged Golang-based operator for managing stateful workloads.

    Related Reads:

  • How to Make Your Terminal More Productive with Z-Shell (ZSH)

    When working with servers or command-line-based applications, we spend most of our time on the command line. A good-looking and productive terminal is better in many aspects than a GUI (Graphical User Interface) environment since the command line takes less time for most use cases. Today, we’ll look at some of the features that make a terminal cool and productive.

You can use the following steps on Ubuntu 20.04. If you are using a different operating system, your commands will likely differ. If you’re using Windows, you can choose between Cygwin, WSL, and Git Bash.

    Prerequisites

    Let’s upgrade the system and install some basic tools needed.

    sudo apt update && sudo apt upgrade
    sudo apt install build-essential curl wget git

    Z-Shell (ZSH)

    Zsh is an extended Bourne shell with many improvements, including some features of Bash and other shells.

    Let’s install Z-Shell:

    sudo apt install zsh

    Make it our default shell for our terminal:

    chsh -s $(which zsh)

Now restart the system and open the terminal again to be welcomed by ZSH. Unlike other shells like Bash, ZSH requires some initial configuration, so it asks for some configuration options the first time we start it and saves them in a file called .zshrc in the current user’s home directory (e.g., /home/user).

    For now, we’ll skip the manual work and get a head start with the default configuration. Press 2, and ZSH will populate the .zshrc file with some default options. We can change these later.  
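For reference, the generated file is a plain zsh script; a hypothetical minimal .zshrc after accepting the defaults looks roughly like this (the exact lines vary by zsh version):

```shell
# Sketch of what the default setup writes to ~/.zshrc; history
# settings and completion initialization are the core of it.
HISTFILE=~/.histfile
HISTSIZE=1000
SAVEHIST=1000
autoload -Uz compinit
compinit
```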

The initial configuration setup can be run again, as shown in the below image.

    Oh-My-ZSH

    Oh-My-ZSH is a community-driven, open-source framework to manage your ZSH configuration. It comes with many plugins and helpers. It can be installed with one single command as below.

    Installation

    sh -c "$(wget https://raw.github.com/ohmyzsh/ohmyzsh/master/tools/install.sh -O -)"

It takes a backup of our existing .zshrc in a file called .zshrc.pre-oh-my-zsh, so whenever you uninstall it, the backup will be restored automatically.

    Font

A good terminal needs some good fonts. We’ll use the Terminess nerd font to make our terminal look awesome, which can be downloaded here. Once downloaded, extract the fonts and move them to ~/.local/share/fonts to make them available for the current user, or to /usr/share/fonts to make them available for all users.

    tar -xvf Terminess.zip
    mv *.ttf ~/.local/share/fonts 

    Once the font is installed, it will look like:

Among all the things Oh-My-ZSH provides, two are community favorites: plugins and themes.

    Theme

    My go-to ZSH theme is powerlevel10k because it’s flexible, provides everything out of the box, and is easy to install with one command as shown below:

    git clone --depth=1 https://github.com/romkatv/powerlevel10k.git ${ZSH_CUSTOM:-$HOME/.oh-my-zsh/custom}/themes/powerlevel10k

To set this theme, set ZSH_THEME="powerlevel10k/powerlevel10k" in .zshrc.

    Close the terminal and start it again. Powerlevel10k will welcome you with the initial setup, go through the setup with the options you want. You can run this setup again by executing the below command:

    p10k configure

    Tools and plugins we can’t live without

Plugins can be added to the plugins array in the .zshrc file. For each plugin you want to use from the below list, add it to the plugins array (the complete array is shown at the end of this section).

    ZSH-Syntax-Highlighting

    This enables the highlighting of commands as you type and helps you catch syntax errors before you execute them:

    As you can see, “ls” is in green but “lss” is in red.

Execute the below command to install it:

    git clone https://github.com/zsh-users/zsh-syntax-highlighting.git ${ZSH_CUSTOM:-~/.oh-my-zsh/custom}/plugins/zsh-syntax-highlighting

    ZSH Autosuggestions

    This suggests commands as you type based on your history:

Install it by cloning the git repo with the below command:

    git clone https://github.com/zsh-users/zsh-autosuggestions ${ZSH_CUSTOM:-~/.oh-my-zsh/custom}/plugins/zsh-autosuggestions

    ZSH Completions

For some extra ZSH completion scripts, execute the below command:

git clone https://github.com/zsh-users/zsh-completions ${ZSH_CUSTOM:-~/.oh-my-zsh/custom}/plugins/zsh-completions

    autojump

It’s a faster way of navigating the file system; it works by maintaining a database of the directories you visit the most. Once installed, you can jump to a frequently visited directory with j <pattern>. More details can be found here.

    sudo apt install autojump 

    You can also use the plugin Z as an alternative if you’re not able to install autojump or for any other reason.

    Internal Plugins

Some plugins come installed with Oh-My-ZSH, and they can be included directly in the .zshrc file without any installation.

    copyfile

    It copies the content of a file to the clipboard.

    copyfile test.txt

    copypath

    It copies the absolute path of the current directory to the clipboard.

    copybuffer

    This plugin copies the command that is currently typed in the command prompt to the clipboard. It works with the keyboard shortcut CTRL + o.

    sudo

Sometimes we forget to prefix a command with sudo, but this plugin fixes that in a second: hit the ESC key twice, and it will prefix the command you’ve typed in the terminal with sudo.

    web-search

    This adds some aliases for searching with Google, Wikipedia, etc. For example, if you want to web-search with Google, you can execute the below command:

    google oh my zsh

    Doing so will open this search in Google:

    More details can be found here.

Remember, you’d have to add each of these plugins to the .zshrc file as well. So, in the end, this is how the plugins array in the .zshrc file should look:

    plugins=(
            zsh-autosuggestions
            zsh-syntax-highlighting
            zsh-completions
            autojump
            copyfile
copypath
            copybuffer
            history
            dirhistory
            sudo
            web-search
            git
    ) 

    You can add more plugins, like docker, heroku, kubectl, npm, jsontools, etc., if you’re a developer. There are plugins for system admins as well or for anything else you need. You can explore them here.

    Enhancd

Enhancd is a next-gen way to navigate the file system from the CLI. It works with a fuzzy finder, so we’ll install fzf for this purpose.

    sudo apt install fzf

    Enhancd can be installed with the zplug plugin manager for ZSH, so first we’ll install zplug with the below command:

curl -sL --proto-redir -all,https https://raw.githubusercontent.com/zplug/installer/master/installer.zsh | zsh

    Append the following to .zshrc:

    source ~/.zplug/init.zsh
    zplug load

Now close your terminal and open it again. To install enhancd with zplug, add the following line to .zshrc (above the `zplug load` line) and then run `zplug install`:

    zplug "b4b4r07/enhancd", use:init.sh

    Aliases

As a developer, I need to execute git commands many times a day, and typing each command in full every time is too cumbersome, so we can use aliases for them. Aliases need to be added to .zshrc; here’s how we can add them:

    alias gs='git status'
    alias ga='git add .'
    alias gf='git fetch'
    alias gr='git rebase'
    alias gp='git push'
    alias gd='git diff'
    alias gc='git commit'
    alias gh='git checkout'
    alias gst='git stash'
    alias gl='git log --oneline --graph'

    You can add these anywhere in the .zshrc file.
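If you want to sanity-check the expansions before relying on them, here is a quick sketch in a bash script (a swapped-in setting compared to interactive zsh: non-interactive shells need expand_aliases turned on for aliases to expand at all):

```shell
# Minimal sketch: verify alias definitions in a bash script before
# pasting them into .zshrc.
shopt -s expand_aliases
alias gs='git status'
alias gl='git log --oneline --graph'
type gl   # reports what the alias expands to without invoking git
```

In an interactive zsh session, `type gl` (or `which gl`) does the same job.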

    Colorls

    Another tool that makes you say wow is Colorls. This tool colorizes the output of the ls command. This is how it looks once you install it:

It works with Ruby; below is how you can install both Ruby and colorls:

    sudo apt install ruby ruby-dev ruby-colorize
    sudo gem install colorls

Now, restart your terminal and execute the command colorls to see the magic!

Bonus – We can add some aliases as well if we want the same output as Colorls when we execute the command ls. Note that we’re aliasing cl to `command ls` so the original ls remains available (a plain `alias cl='ls'` would expand recursively into colorls).

alias cl='command ls'
alias ls='colorls'
alias la='colorls -a'
alias ll='colorls -l'
alias lla='colorls -la'

These are the tools and plugins I can’t live without now. Let me know if I’ve missed anything.

    Automation

Would you want to repeat this whole process again if, say, you bought a new laptop and wanted the same setup?

If your answer is no, you can automate all of this, and that’s why I’ve created Project Automator. This project does a lot more than just setting up a terminal: it works with Arch Linux as of now, but you can take the parts you need and make it work with almost any *nix system you like.

    Explaining how it works is beyond the scope of this article, so I’ll have to leave you guys here to explore it on your own.

    Conclusion

We need to perform many tasks on our systems, and using a GUI (Graphical User Interface) tool can consume a lot of your time, especially if you repeat the same task on a daily basis, like converting a media stream, setting up tools on a system, etc.

Using a command-line tool can save you a lot of time, and you can automate repetitive tasks with scripting. It can be a great addition to your arsenal.