Category: Industry

  • A Step Towards Machine Learning Algorithms: Univariate Linear Regression

    These days the concept of Machine Learning is evolving rapidly. The field is so vast and open that everyone has their own take on it; here is mine. This blog is about my experience with learning algorithms. In it, we will get to know the basic differences between Artificial Intelligence, Machine Learning, and Deep Learning. We will also get to know the foundational Machine Learning algorithm, i.e., Univariate Linear Regression.

    Intermediate knowledge of Python and its libraries (NumPy, Pandas, Matplotlib) is good to start with. For mathematics, a little knowledge of algebra, calculus, and graph theory will help in understanding the trick of the algorithm.

    A way to Artificial intelligence, Machine Learning, and Deep Learning

    These are the three buzzwords of today's Internet world, where we are seeing the future of programming. Specifically, we can say this is where the science domain meets programming: we use scientific concepts and mathematics with a programming language to simulate the decision-making process. Artificial Intelligence is a program, or the ability of a machine, to make decisions more as humans do. Machine Learning is another program that supports Artificial Intelligence: it helps the machine observe patterns and learn from them to make a decision. Here, programming helps in observing the patterns, not in making the decisions. Machine Learning requires more and more information from various sources to observe all of the variables for any given pattern and make more accurate decisions. Deep Learning supports Machine Learning by creating a network (a neural network) to fetch all the required information and provide it to the Machine Learning algorithms.

    What is Machine Learning

    Definition: Machine Learning provides machines with the ability to learn autonomously, based on experiences, observations, and analyzing patterns within a given data set, without being explicitly programmed.

    This is a two-part process. In the first part, it observes and analyzes the patterns in the given data and makes a shrewd guess at a mathematical function that will be very close to the pattern. There are various methods for this; a few of them are linear, non-linear, logistic, etc. We then calculate an error function using the guessed mathematical function and the given data. In the second part, we minimize the error function. The minimized function is used for predicting the pattern.

    Here are the general steps to understand the process of Machine Learning:

    1. Plot the given dataset on the x-y axes
    2. By looking at the graph, guess the closest mathematical function
    3. Derive the error function from the given dataset and the guessed mathematical function
    4. Minimize the error function using some algorithm
    5. The minimized error function gives us a more accurate mathematical function for the given patterns.

    Getting Started with the First Algorithm: Univariate Linear Regression

    Linear Regression is a very basic algorithm – we can call it the first and foundational algorithm for understanding the concepts of ML. We will try to understand it with an example: given data of prices of plots for given areas.

    (Sample dataset: plot areas, in multiples of 10 sq mtr, with their corresponding prices in Lakhs INR.)

    With this data, we can easily read off the price of a plot for any listed area. But what if we want the price of a plot with area 5.0 * 10 sq mtr? There is no direct price for this in our given dataset. So how can we get the price of a plot whose area is not in the dataset? This is what Linear Regression lets us do.

    So at first, we will plot this data on a graph.

    The graph below shows the areas of the plots (in multiples of 10 sq mtr) on the x-axis and their prices (in Lakhs INR) on the y-axis.

    Definition of Linear Regression

    The objective of a linear regression model is to find a relationship between one or more features (independent variables) and a continuous target variable (dependent variable). When there is only one feature, it is called Univariate Linear Regression, and if there are multiple features, it is called Multiple Linear Regression.

    Hypothesis function:

    Here we will try to find the relation between the price and the area of a plot. As this is a univariate example, the price depends only on the area of the plot.

    By observing this pattern we can have our hypothesis function as below:

    f(x) = w * x + b

    where w is the weight and b is the bias.

    For different value sets of (w,b) there are multiple possible lines, but one set of values will produce the line closest to this pattern.

    When we generalize this function to multiple variables, there will be a whole set of values of w; these constants are also termed model parameters (model params).

    Note: There is a range of mathematical functions that could fit this pattern, and the selection of the function is totally up to us. But take care that the function neither underfits nor overfits, and that it is continuous, so that we can easily differentiate it and find its global minimum or maximum.
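
    To make the hypothesis concrete, here is a minimal NumPy sketch; the areas and the (w,b) values are illustrative, not taken from the dataset:

    import numpy as np

    def hypothesis(x, w, b):
        # f(x) = w * x + b, vectorized over all area points at once
        return w * x + b

    areas = np.array([1.0, 2.0, 3.0])        # illustrative areas (10 sq mtr)
    print(hypothesis(areas, w=2.0, b=0.5))   # predicted prices for (w,b) = (2.0, 0.5)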

    Error for a point

    As our hypothesis function is continuous, for every Xi (area point) there will be one predicted price Yi = F(Xi), while Y will be the actual price.

    So the error at any point,

    Ei = Yi – Y = F(Xi) – Y

    These errors are also called residuals. A residual can be positive (if the actual point lies below the predicted line) or negative (if the actual point lies above the predicted line). Our motive is to minimize this residual for each of the points.

    Note: While observing the patterns, it is possible that a few points lie very far from the pattern. For these far points the residuals will be much larger, so if such points are few in number we can ignore them, considering them errors in the dataset. Such points are termed outliers.

    Energy Functions

    As there are m training points, we can calculate the average energy function below:

    E(w,b) = (1/m) Σ (i = 1 to m) Ei

    and

    our motive is to minimize the energy function:

    min E(w,b) over all values of (w,b)

    Little Calculus: For any continuous, differentiable function, the points where the first derivative is zero are points of either minima or maxima. If the second derivative is negative there, it is a point of maxima, and if it is positive, it is a point of minima.

    Here we will do a trick – we will convert our energy function into an upward-opening parabola by squaring the error function. This ensures that our energy function has only one global minimum (the point of our concern). It also simplifies the calculation: the point where the first derivative of the energy function is zero is the point we need, and the value of (w,b) at that point is our required point.

    So our final Energy function is

    E(w,b) = (1/2m) Σ (i = 1 to m) (Ei)²

    Dividing by 2 doesn't affect our result, and at the time of differentiation it cancels out, since the first derivative of x² is 2x.
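
    As a quick check, here is a minimal NumPy sketch of this energy function; the data points and the candidate (w,b) are illustrative:

    import numpy as np

    def energy(x, y, w, b):
        # E(w,b) = 1/(2m) * sum over i of (f(x_i) - y_i)^2
        m = len(y)
        residuals = (w * x + b) - y          # E_i = f(x_i) - y_i for every point
        return np.sum(np.square(residuals)) / (2 * m)

    x = np.array([1.0, 2.0, 3.0])            # illustrative areas
    y = np.array([2.4, 4.6, 6.5])            # illustrative prices
    print(energy(x, y, w=2.0, b=0.5))        # energy for one candidate (w,b)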

    Gradient Descent Method

    Gradient descent is a generic optimization algorithm. It iteratively adjusts the parameters of the model by trial and error in order to minimize the energy function.

    In the above picture, on the right side, we can see:

    1. (w0, w1) is the random initialization, and by following gradient descent it moves towards the global minimum.
    2. The number of turns of the black line is the number of iterations, so it must be neither too many nor too few.
    3. The distance between the turns is alpha, i.e., the learning parameter.

    By solving the equation on the left side, we will be able to get the model params at the global minimum of the energy function.

    Points to consider at the time of Gradient Descent calculations:

    1. Random initialization: We start this algorithm at a random point, that is, a set of random (w, b) values. As it moves along, the algorithm decides in which direction the next trial has to be taken. As we know the energy function is an upward-opening parabola, by moving in the right direction (towards the global minimum) we get a smaller value compared to the previous point.
    2. Number of iterations: The number of iterations must be neither too large nor too small. If it is too small, we will not reach the global minimum, and if it is too large, we waste extra calculations around the global minimum.
    3. Alpha, the learning parameter: When alpha is too small, gradient descent will be slow, as it takes unnecessarily many steps to reach the global minimum. If alpha is too big, it might overshoot the global minimum; in this case it may fail to converge, or even diverge.

    Implementation of Gradient Descent in Python

    """ Method to read the csv file using Pandas and later use this data for linear regression. """
    """ Better run with Python 3+. """
    
    # Library to read csv file effectively
    import pandas
    import matplotlib.pyplot as plt
    import numpy as np
    
    # Method to read the csv file
    def load_data(file_name):
    	column_names = ['area', 'price']
    	# To read columns
    	io = pandas.read_csv(file_name,names=column_names, header=None)
    	x_val = (io.values[1:, 0])
    	y_val = (io.values[1:, 1])
    	size_array = len(y_val)
    	for i in range(size_array):
    		x_val[i] = float(x_val[i])
    		y_val[i] = float(y_val[i])
    		return x_val, y_val
    
    # Call the method for a specific file
    x_raw, y_raw = load_data('area-price.csv')
    x_raw = x_raw.astype(np.float)
    y_raw = y_raw.astype(np.float)
    y = y_raw
    
    # Modeling
    w, b = 0.1, 0.1
    num_epoch = 100
    converge_rate = np.zeros([num_epoch , 1], dtype=float)
    learning_rate = 1e-3
    for e in range(num_epoch):
    	# Calculate the gradient of the loss function with respect to arguments (model parameters) manually.
    	y_predicted = w * x_raw + b
    	grad_w, grad_b = (y_predicted - y).dot(x_raw), (y_predicted - y).sum()
    	# Update parameters.
    	w, b = w - learning_rate * grad_w, b - learning_rate * grad_b
    	converge_rate[e] = np.mean(np.square(y_predicted-y))
    
    print(w, b)
    print(f"predicted function f(x) = x * {w} + {b}" )
    calculatedprice = (10 * w) + b
    print(f"price of plot with area 10 sqmtr = 10 * {w} + {b} = {calculatedprice}")

    This is a basic implementation of the Gradient Descent algorithm using NumPy and Pandas. It reads the area-price.csv file. We took (w,b) = (0.1, 0.1) as the random initialization, 100 as the number of iterations, and 0.001 as the learning rate. (If desired, the x-values can also be normalized for better readability of the data points on a graph.)

    In every iteration, we calculate the w and b values and watch the convergence rate.

    We can repeat this calculation of (w,b) for different values of the random initialization, number of iterations, and learning rate (alpha).

    Note: There is another Python library, TensorFlow, which is preferable for such calculations, as it has built-in support for Gradient Descent. But for better understanding, we have used NumPy and Pandas here.
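
    For reference, here is a minimal sketch of the same univariate regression in TensorFlow 2; the data values and learning rate are illustrative:

    import tensorflow as tf

    x = tf.constant([1.0, 2.0, 3.0, 4.0])    # illustrative areas
    y = tf.constant([2.4, 4.6, 6.5, 8.4])    # illustrative prices

    w = tf.Variable(0.1)                     # random initialization of model params
    b = tf.Variable(0.1)
    optimizer = tf.keras.optimizers.SGD(learning_rate=1e-2)

    for epoch in range(100):
        with tf.GradientTape() as tape:
            y_predicted = w * x + b
            loss = tf.reduce_mean(tf.square(y_predicted - y))  # same energy function
        grads = tape.gradient(loss, [w, b])  # gradients computed automatically
        optimizer.apply_gradients(zip(grads, [w, b]))

    print(w.numpy(), b.numpy())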

    RMSE (Root Mean Square Error)

    RMSE: This is the method to verify to what extent our calculation of (w,b) is accurate. Below is the basic formula for RMSE, where f is the predicted value and o is the observed value:

    RMSE = sqrt( (1/m) Σ (i = 1 to m) (fi − oi)² )

    Note: There is no absolute good or bad threshold for RMSE; we judge it relative to the range of the observed values. For observed values ranging from 0 to 1000, an RMSE of 0.7 is small, but if the range goes from 0 to 1, it is not that small.
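
    A minimal sketch of the RMSE calculation for our trained model, reusing x_raw, y, w, and b from the gradient descent code above:

    import numpy as np

    def rmse(predicted, observed):
        # RMSE = sqrt( (1/m) * sum over i of (f_i - o_i)^2 )
        return np.sqrt(np.mean(np.square(predicted - observed)))

    print(rmse(w * x_raw + b, y))            # accuracy of the fitted (w,b)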

    Conclusion

    As part of this article, we had a short introduction to Machine Learning and the need for it. Then, with the help of a very basic example, we learned about Linear Regression (univariate only), which can also be generalized to the multivariate case. We then used the Gradient Descent method, one of the various optimization algorithms, to calculate the predicted data model in Linear Regression, and we learned the basic flow of Gradient Descent. Finally, there is a Python example demonstrating Linear Regression via Gradient Descent.

  • Publish APIs For Your Customers: Deploy Serverless Developer Portal For Amazon API Gateway

    Amazon API Gateway is a fully managed service that allows you to create, secure, publish, test and monitor your APIs. We often come across scenarios where customers of these APIs expect a platform to learn and discover APIs that are available to them (often with examples).

    The Serverless Developer Portal is one such application that is used for developer engagement by making your APIs available to your customers. Further, your customers can use the developer portal to subscribe to an API, browse API documentation, test published APIs, monitor their API usage, and submit their feedback.

    This blog is a detailed step-by-step guide for deploying the Serverless Developer Portal for APIs that are managed via Amazon API Gateway.

    Advantages

    The users of Amazon API Gateway can be broadly categorized as –

    API Publishers – They can use the Serverless Developer Portal to expose and secure their APIs for customers which can be integrated with AWS Marketplace for monetary benefits. Furthermore, they can customize the developer portal, including content, styling, logos, custom domains, etc. 

    API Consumers – They could be Frontend/Backend developers, third party customers, or simply students. They can explore available APIs, invoke the APIs, and go through the documentation to get an insight into how each API works with different requests. 

    Developer Portal Architecture

    We first need a basic understanding of how the developer portal works. The Serverless Developer Portal is a serverless application built on a microservices architecture using Amazon API Gateway, Amazon Cognito, AWS Lambda, Amazon Simple Storage Service (S3) and Amazon CloudFront.

    The developer portal comprises multiple microservices and components as described in the following figure.

    Source: AWS

    There are a few key pieces in the above architecture –

    1. Identity Management: Amazon Cognito is basically the secure user directory of the developer portal responsible for user management. It allows you to configure triggers for registration, authentication, and confirmation, thereby giving you more control over the authentication process. 
    2. Business Logic: Amazon CloudFront is configured to serve your static content hosted in a private S3 bucket. The static content is built using the React framework, which interacts with backend APIs dictating the business logic for various events. 
    3. Catalog Management: The developer portal uses a catalog for rendering the APIs with Swagger specifications on the APIs page. The catalog file (catalog.json in the S3 artifacts bucket) is updated whenever an API is published or removed. This is achieved by an S3 trigger on an AWS Lambda function responsible for scanning the content of the catalog directory and generating the catalog for the developer portal.  
    4. API Key Creation: An API key is created for each consumer at the time of registration. Whenever you subscribe to an API, your API key is added to the associated usage plans, giving you access to those APIs as defined by the usage plan. The Cognito user to API key mapping is stored in a DynamoDB table along with other registration-related details.
    5. Static Asset Uploader: AWS Lambda (Static-Asset-Uploader) is responsible for updating/deploying static assets for the developer portal. Static assets include content, logos, icons, CSS, JavaScript, and other media files.

    Let’s move forward to building and deploying a simple Serverless Developer Portal.

    Building Your API

    Start with deploying an API which can be accessed using API Gateway from 

    https://<api-id>.execute-api.region.amazonaws.com/stage

    If you do not have such an API available, create a simple application by jumping to the section, “API Performance Across the Globe,” in this blog.

    Setup custom domain name

    For professional projects, I recommend creating a custom domain name, as it provides simpler and more intuitive URLs that you can give to your API users.

    Make sure your API Gateway domain name is updated in the Route53 record set created after you set up your custom domain name. 

    See more on Setting up custom domain names for REST APIs – Amazon API Gateway

    Enable CORS for an API Resource

    There are two ways you can enable CORS on a resource:

    1. Enable CORS Using the Console
    2. Enable CORS on a resource using the import API from Amazon API Gateway

    Let’s discuss the easiest way: using the console.

    1. Open API Gateway console.
    2. Select the API Gateway for your API from the list.
    3. Choose a resource to enable CORS for all the methods under that resource.
      Alternatively, you could choose a method under the resource to enable CORS for just this method.
    4. Select Enable CORS from the Actions drop-down menu.
    5. In the Enable CORS form, do the following:
      – Leave the Access-Control-Allow-Headers and Access-Control-Allow-Origin headers at their default values.
      – Click on Enable CORS and replace existing CORS headers.
    6. Review the changes in the Confirm method changes popup and choose Yes, overwrite existing values to apply your CORS settings.

    Once enabled, you can see a mock integration on the OPTIONS method for the selected resource. You must enable CORS for {proxy+} resources too. 

    To verify that CORS is enabled on the API resource, try curl on the OPTIONS method:

    curl -v -X OPTIONS -H "Access-Control-Request-Method: POST" -H "Origin: http://example.com" https://api-id.execute-api.region.amazonaws.com/stage
    

    You should see a 200 OK response with the CORS headers:

    < HTTP/1.1 200 OK
    < Content-Type: application/json
    < Content-Length: 0
    < Connection: keep-alive
    < Date: Mon, 13 Apr 2020 16:27:44 GMT
    < x-amzn-RequestId: a50b97b5-2437-436c-b99c-22e00bbe9430
    < Access-Control-Allow-Origin: *
    < Access-Control-Allow-Headers: Content-Type,X-Amz-Date,Authorization,X-Api-Key,X-Amz-Security-Token
    < x-amz-apigw-id: K7voBHDZIAMFu9g=
    < Access-Control-Allow-Methods: DELETE,GET,HEAD,OPTIONS,PATCH,POST,PUT
    < X-Cache: Miss from cloudfront
    < Via: 1.1 1c8c957c4a5bf1213bd57bd7d0ec6570.cloudfront.net (CloudFront)
    < X-Amz-Cf-Pop: BOM50-C1
    < X-Amz-Cf-Id: OmxFzV2-TH2BWPVyOohNrhNlJ-s1ZhYVKyoJaIrA_zyE9i0mRTYxOQ==

    Deploy Developer Portal

    There are two ways to deploy the developer portal for your API. 

    Using SAR

    An easy way will be to deploy api-gateway-dev-portal directly from AWS Serverless Application Repository. 

    Note: If you intend to upgrade your developer portal to a major version, refer to the Upgrading Instructions, which are currently under development.

    Using AWS SAM

    1. Ensure that you have the latest AWS CLI and AWS SAM CLI installed and configured.
    2. Download or clone the API Gateway Serverless Developer Portal repository.
    3. Update the CloudFormation template file – cloudformation/template.yaml.

    Parameters you must configure and verify include: 

    • ArtifactsS3BucketName
    • DevPortalSiteS3BucketName
    • DevPortalCustomersTableName
    • DevPortalPreLoginAccountsTableName
    • DevPortalAdminEmail
    • DevPortalFeedbackTableName
    • CognitoIdentityPoolName
    • CognitoDomainNameOrPrefix
    • CustomDomainName
    • CustomDomainNameAcmCertArn
    • UseRoute53Nameservers
    • AccountRegistrationMode

    You can view your template file in AWS CloudFormation Designer to get a better idea of all the components/services involved and how they are connected.

    See Developer portal settings for more information about parameters.

    4. Replace the static files in your project with the ones you would like to use.
      dev-portal/public/custom-content
      lambdas/static-asset-uploader/build
      api-logo contains the logos you would like to show on the API page (in PNG format). The portal checks for an api-id_stage.png file when rendering the API page; if not found, it falls back to the default logo – default.png.
      content-fragments includes various markdown files comprising the content of the different pages in the portal. 
      Other static assets, including favicon.ico, home-image.png and nav-logo.png, appear on your portal. 
    5. Create a ZIP file of your code and dependencies and upload it to Amazon S3. Running the below command creates an AWS SAM template, packaged.yaml, replacing references to local artifacts with the Amazon S3 locations where the command uploaded them:
    sam package --template-file ./cloudformation/template.yaml --output-template-file ./cloudformation/packaged.yaml --s3-bucket {your-lambda-artifacts-bucket-name}

    6. Run the following command from the project root to deploy your portal, replacing:
      – {your-template-bucket-name}
      with the name of your Amazon S3 bucket.
      – {custom-prefix}
      with a prefix that is globally unique.
      – {cognito-domain-or-prefix}
      with a unique string.
    sam deploy --template-file ./cloudformation/packaged.yaml --s3-bucket {your-template-bucket-name} --stack-name "{custom-prefix}-dev-portal" --capabilities CAPABILITY_NAMED_IAM

    Note: Ensure that you have the required privileges to make deployments, as, during the deployment process, it attempts to create various resources such as AWS Lambda functions, a Cognito User Pool, IAM roles, an API Gateway, a CloudFront distribution, etc. 

    After your developer portal has been fully deployed, you can get its URL as follows:

    1. Open the AWS CloudFormation console.
    2. Select the stack you created above.
    3. Open the Outputs section. The URL for the developer portal is specified in the WebSiteURL property.
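
    If you prefer to script this lookup, here is a minimal boto3 sketch; the stack name is a placeholder for whatever you chose during sam deploy:

    import boto3

    cloudformation = boto3.client("cloudformation")

    # "my-prefix-dev-portal" is a placeholder; use your actual stack name
    stacks = cloudformation.describe_stacks(StackName="my-prefix-dev-portal")
    for output in stacks["Stacks"][0]["Outputs"]:
        if output["OutputKey"] == "WebSiteURL":
            print(output["OutputValue"])     # the developer portal URL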

    Create Usage Plan

    To list your API under the subscribable APIs category, create a usage plan; consumers can then access the API using their API keys in the developer portal. Ensure that the API Gateway stage is configured for the usage plan.
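
    If you would rather create the usage plan from code, a hedged boto3 sketch follows; the API id, stage name, and limits are placeholders:

    import boto3

    apigateway = boto3.client("apigateway")

    # "abc123" and "prod" are placeholders for your API id and deployed stage
    plan = apigateway.create_usage_plan(
        name="basic-plan",
        description="Usage plan for the developer portal",
        apiStages=[{"apiId": "abc123", "stage": "prod"}],
        throttle={"rateLimit": 10.0, "burstLimit": 20},
        quota={"limit": 1000, "period": "MONTH"},
    )
    print(plan["id"])                        # the usage plan consumers subscribe to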

    Publishing an API

    Only Administrators have permission to publish an API. To create an Administrator account for your developer portal –

    1. Go to the WebSiteURL obtained after the successful deployment. 

    2. On the top right of the home page click on Register.

    Source: Github

    3. Fill in the registration form and hit Sign up.

    4. Enter the confirmation code received on your email address provided in the previous step.

    5. Promote the user to Administrator by adding it to the AdminGroup (a scripted alternative is sketched after these steps). 

    • Open Amazon Cognito User Pool console.
    • Select the User Pool created for your developer portal.
    • From the General Settings > Users and Groups page, select the User you want to promote as Administrator.
    • Click on Add to group and then select the Admin group from the dropdown and confirm.

    6. You will need to log out and log in again to act as an Administrator. Click on the Admin Panel and choose the API you wish to publish from the APIs list.
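
    For step 5, if you prefer scripting over the console, here is a minimal boto3 sketch; the user pool id and username are placeholders:

    import boto3

    cognito = boto3.client("cognito-idp")

    # The pool id and username are placeholders; check the Cognito console
    # for the User Pool created for your developer portal
    cognito.admin_add_user_to_group(
        UserPoolId="us-east-1_XXXXXXXXX",
        Username="admin@example.com",
        GroupName="AdminGroup",
    )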

    Setting up an account

    The signup process depends on the registration mode selected for the developer portal. 

    For request registration mode, you need to wait for the Administrator to approve your registration request.

    For invite registration mode, you can only register on the portal when invited by the portal administrator. 

    Subscribing to an API

    1. Sign in to the developer portal.
    2. Navigate to the Dashboard page and Copy your API Key.
    3. Go to APIs Page to see a list of published APIs.
    4. Select an API you wish to subscribe to and hit the Subscribe button.

    Tips

    1. When a user subscribes to an API, all the APIs under that usage plan become accessible, whether or not they are published in the portal.
    2. Whenever you subscribe to an API, the catalog is exported from the API Gateway resource documentation. You can customize the workflow or override the catalog Swagger definition JSON in the S3 bucket defined by ArtifactsS3BucketName, under /catalog/<api-id>_<stage>.json.
    3. For backend APIs, CORS requests are allowed only from the custom domain names selected for your developer portal.
    4. Ensure you set the CORS response headers on the published APIs in order to invoke them from the developer portal.

    Summary

    You’ve seen how to deploy a Serverless Developer Portal and publish an API. If you are creating a serverless application for the first time, you might want to read more on Serverless Computing and Amazon API Gateway before you get started. 

    Start building your own developer portal. To learn more about distributing your API Gateway APIs to your customers, follow this AWS guide.

  • Using Packer and Terraform to Setup Jenkins Master-Slave Architecture

    Automation is everywhere, and it is better to adopt it as soon as possible. In this blog post, we are going to discuss creating the infrastructure for a deployment pipeline hosted on AWS. Packer will be used to create AMIs, and Terraform will be used for creating the master/slaves. We will discuss different ways of connecting the slaves and will also run a sample application through the pipeline.

    Please remember the intent of the blog is to bring all the different components together; this means some code that would normally live in a development code repo is also included here. Now that we have highlighted the required tools, the 10,000 ft view, and the intent of the blog, let's begin.

    Using Packer to Create AMIs for Jenkins Master and Linux Slave

    HashiCorp has given us some of the most amazing tools for simplifying our lives, and Packer is one of them. Packer can be used to create a custom AMI from already available AMIs. We just need to create a JSON file and pass an installation script as part of the build, and it will take care of producing the AMI for us. Install Packer per your requirements from the Packer downloads page. For simplicity, we will be using a Linux machine for creating the Jenkins master and the Linux slave. The JSON file for both of them will be the same, but they can be separated if needed.

    Note: the user-data passed from Terraform will be different, which is what will eventually differentiate their usage.

    We are using Amazon Linux 2 – here is the JSON file for it.

    {
      "builders": [
      {
        "ami_description": "{{user `ami-description`}}",
        "ami_name": "{{user `ami-name`}}",
        "ami_regions": [
          "us-east-1"
        ],
        "ami_users": [
          "XXXXXXXXXX"
        ],
        "ena_support": "true",
        "instance_type": "t2.medium",
        "region": "us-east-1",
        "source_ami_filter": {
          "filters": {
            "name": "amzn2-ami-hvm-2.0*x86_64*",
            "root-device-type": "ebs",
            "virtualization-type": "hvm"
          },
          "most_recent": true,
          "owners": [
            "amazon"
          ]
        },
        "sriov_support": "true",
        "ssh_username": "ec2-user",
        "tags": {
          "Name": "{{user `ami-name`}}"
        },
        "type": "amazon-ebs"
      }
    ],
    "post-processors": [
      {
        "inline": [
          "echo AMI Name {{user `ami-name`}}",
          "date",
          "exit 0"
        ],
        "type": "shell-local"
      }
    ],
    "provisioners": [
      {
        "script": "install_amazon.bash",
        "type": "shell"
      }
    ],
      "variables": {
        "ami-description": "Amazon Linux for Jenkins Master and Slave ({{isotime \"2006-01-02-15-04-05\"}})",
        "ami-name": "amazon-linux-for-jenkins-{{isotime \"2006-01-02-15-04-05\"}}",
        "aws_access_key": "",
        "aws_secret_key": ""
      }
    }

    As you can see, the file is pretty simple. The only thing of interest here is the install_amazon.bash script. In this blog post, we will deploy a Node-based application running inside a Docker container. The content of the bash file is as follows:

    #!/bin/bash
    
    set -x
    
    # For Node
    curl -sL https://rpm.nodesource.com/setup_10.x | sudo -E bash -
    
    # For xmlstarlet
    sudo yum install -y https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm
    
    sudo yum update -y
    
    sleep 10
    
    # Setting up Docker
    sudo yum install -y docker
    sudo usermod -a -G docker ec2-user
    
    # Just to be safe removing previously available java if present
    sudo yum remove -y java
    
    sudo yum install -y python2-pip jq unzip vim tree biosdevname nc mariadb bind-utils at screen tmux xmlstarlet git java-1.8.0-openjdk nc gcc-c++ make nodejs
    
    sudo -H pip install awscli bcrypt
    sudo -H pip install --upgrade awscli
    sudo -H pip install --upgrade aws-ec2-assign-elastic-ip
    
    sudo npm install -g @angular/cli
    
    sudo systemctl enable docker
    sudo systemctl enable atd
    
    sudo yum clean all
    sudo rm -rf /var/cache/yum/
    exit 0

    A lot of things are mentioned there, so let's check them out. As mentioned earlier, we will be discussing different ways of connecting to a slave, and for one of them we need xmlstarlet. The rest are packages that we might need in one way or another.

    Update ami_users with your actual AWS account ID. This can be found in the AWS console under Support > Support Center.

    Validate the template by running packer validate amazon.json.

    Once confirmed, build the packer image by running packer build amazon.json.

    After completion, check your AWS console and you will find a new AMI created under “My AMIs”.
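
    If you'd rather verify from code than from the console, here is a small boto3 sketch; the name filter matches the ami-name variable in the Packer template above:

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    # Look for AMIs owned by this account whose name matches the Packer template
    images = ec2.describe_images(
        Owners=["self"],
        Filters=[{"Name": "name", "Values": ["amazon-linux-for-jenkins-*"]}],
    )
    for image in images["Images"]:
        print(image["ImageId"], image["Name"], image["CreationDate"])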

    It’s now time to start using Terraform to create the machines. 

    Prerequisites:

    1. Please make sure you create a provider.tf file.

    provider "aws" {
      region                  = "us-east-1"
      shared_credentials_file = "~/.aws/credentials"
      profile                 = "dev"
    }

    The ‘credentials file’ will contain aws_access_key_id and aws_secret_access_key.

    2. Keep SSH keys handy for the server/slave machines. Here is a nice article highlighting how to create them, or else create them beforehand in the AWS console and reference them in the code.

    3. VPC:

    # lookup for the "default" VPC
    data "aws_vpc" "default_vpc" {
      default = true
    }
    
    # subnet list in the "default" VPC
    # The "default" VPC has all "public subnets"
    data "aws_subnet_ids" "default_public" {
      vpc_id = "${data.aws_vpc.default_vpc.id}"
    }

    Creating Terraform Script for Spinning up Jenkins Master

    Get Terraform from the Terraform downloads page.

    We will need to set up the Security Group before setting up the instance.

    # Security Group:
    resource "aws_security_group" "jenkins_server" {
      name        = "jenkins_server"
      description = "Jenkins Server: created by Terraform for [dev]"
    
      # legacy name of VPC ID
      vpc_id = "${data.aws_vpc.default_vpc.id}"
    
      tags {
        Name = "jenkins_server"
        env  = "dev"
      }
    }
    
    ###############################################################################
    # ALL INBOUND
    ###############################################################################
    
    # ssh
    resource "aws_security_group_rule" "jenkins_server_from_source_ingress_ssh" {
      type              = "ingress"
      from_port         = 22
      to_port           = 22
      protocol          = "tcp"
      security_group_id = "${aws_security_group.jenkins_server.id}"
      cidr_blocks       = ["<Your Public IP>/32", "172.0.0.0/8"]
      description       = "ssh to jenkins_server"
    }
    
    # web
    resource "aws_security_group_rule" "jenkins_server_from_source_ingress_webui" {
      type              = "ingress"
      from_port         = 8080
      to_port           = 8080
      protocol          = "tcp"
      security_group_id = "${aws_security_group.jenkins_server.id}"
      cidr_blocks       = ["0.0.0.0/0"]
      description       = "jenkins server web"
    }
    
    # JNLP
    resource "aws_security_group_rule" "jenkins_server_from_source_ingress_jnlp" {
      type              = "ingress"
      from_port         = 33453
      to_port           = 33453
      protocol          = "tcp"
      security_group_id = "${aws_security_group.jenkins_server.id}"
      cidr_blocks       = ["172.31.0.0/16"]
      description       = "jenkins server JNLP Connection"
    }
    
    ###############################################################################
    # ALL OUTBOUND
    ###############################################################################
    
    resource "aws_security_group_rule" "jenkins_server_to_other_machines_ssh" {
      type              = "egress"
      from_port         = 22
      to_port           = 22
      protocol          = "tcp"
      security_group_id = "${aws_security_group.jenkins_server.id}"
      cidr_blocks       = ["0.0.0.0/0"]
      description       = "allow jenkins servers to ssh to other machines"
    }
    
    resource "aws_security_group_rule" "jenkins_server_outbound_all_80" {
      type              = "egress"
      from_port         = 80
      to_port           = 80
      protocol          = "tcp"
      security_group_id = "${aws_security_group.jenkins_server.id}"
      cidr_blocks       = ["0.0.0.0/0"]
      description       = "allow jenkins servers for outbound yum"
    }
    
    resource "aws_security_group_rule" "jenkins_server_outbound_all_443" {
      type              = "egress"
      from_port         = 443
      to_port           = 443
      protocol          = "tcp"
      security_group_id = "${aws_security_group.jenkins_server.id}"
      cidr_blocks       = ["0.0.0.0/0"]
      description       = "allow jenkins servers for outbound yum"
    }

    Now that we have a custom AMI and security groups for ourselves let’s use them to create a terraform instance.

    # AMI lookup for this Jenkins Server
    data "aws_ami" "jenkins_server" {
      most_recent      = true
      owners           = ["self"]
    
      filter {
        name   = "name"
        values = ["amazon-linux-for-jenkins*"]
      }
    }
    
    resource "aws_key_pair" "jenkins_server" {
      key_name   = "jenkins_server"
      public_key = "${file("jenkins_server.pub")}"
    }
    
    # lookup the security group of the Jenkins Server
    data "aws_security_group" "jenkins_server" {
      filter {
        name   = "group-name"
        values = ["jenkins_server"]
      }
    }
    
    # userdata for the Jenkins server ...
    data "template_file" "jenkins_server" {
      template = "${file("scripts/jenkins_server.sh")}"
    
      vars {
        env = "dev"
        jenkins_admin_password = "mysupersecretpassword"
      }
    }
    
    # the Jenkins server itself
    resource "aws_instance" "jenkins_server" {
      ami                    		= "${data.aws_ami.jenkins_server.image_id}"
      instance_type          		= "t3.medium"
      key_name               		= "${aws_key_pair.jenkins_server.key_name}"
      subnet_id              		= "${data.aws_subnet_ids.default_public.ids[0]}"
      vpc_security_group_ids 		= ["${data.aws_security_group.jenkins_server.id}"]
      iam_instance_profile   		= "dev_jenkins_server"
      user_data              		= "${data.template_file.jenkins_server.rendered}"
    
      tags {
        "Name" = "jenkins_server"
      }
    
      root_block_device {
        delete_on_termination = true
      }
    }
    
    output "jenkins_server_ami_name" {
        value = "${data.aws_ami.jenkins_server.name}"
    }
    
    output "jenkins_server_ami_id" {
        value = "${data.aws_ami.jenkins_server.id}"
    }
    
    output "jenkins_server_public_ip" {
      value = "${aws_instance.jenkins_server.public_ip}"
    }
    
    output "jenkins_server_private_ip" {
      value = "${aws_instance.jenkins_server.private_ip}"
    }

    As mentioned before, we will be discussing multiple ways in which we can connect the slaves to the Jenkins master. It is already known that every time a new Jenkins instance comes up, it generates a unique password. There are two ways to deal with this: wait for Jenkins to spin up and retrieve that password, or directly set the admin password while creating the Jenkins master. Here we will discuss how to change the password when configuring Jenkins. (If you need the script to retrieve the Jenkins password as soon as it gets created, leave a comment and I will share that with you as well.)

    Below is the user data to install the Jenkins master, configure its password, and install the required packages.

    #!/bin/bash
    
    set -x
    
    function wait_for_jenkins()
    {
      while (( 1 )); do
          echo "waiting for Jenkins to launch on port [8080] ..."
          
          nc -zv 127.0.0.1 8080
          if (( $? == 0 )); then
              break
          fi
    
          sleep 10
      done
    
      echo "Jenkins launched"
    }
    
    function updating_jenkins_master_password ()
    {
      cat > /tmp/jenkinsHash.py <<EOF
    import bcrypt
    import sys
    if not sys.argv[1]:
      sys.exit(10)
    plaintext_pwd=sys.argv[1]
    encrypted_pwd=bcrypt.hashpw(sys.argv[1], bcrypt.gensalt(rounds=10, prefix=b"2a"))
    isCorrect=bcrypt.checkpw(plaintext_pwd, encrypted_pwd)
    if not isCorrect:
      sys.exit(20);
    print "{}".format(encrypted_pwd)
    EOF
    
      chmod +x /tmp/jenkinsHash.py
      
      # Wait till /var/lib/jenkins/users/admin* folder gets created
      sleep 10
    
      cd /var/lib/jenkins/users/admin*
      pwd
      while (( 1 )); do
          echo "Waiting for Jenkins to generate admin user's config file ..."
    
          if [[ -f "./config.xml" ]]; then
              break
          fi
    
          sleep 10
      done
    
      echo "Admin config file created"
    
      admin_password=$(python /tmp/jenkinsHash.py ${jenkins_admin_password} 2>&1)
      
      # Please do not remove the single quotes, as they keep the hash syntax intact; otherwise, during substitution, $<character> would be replaced by null
      xmlstarlet -q ed --inplace -u "/user/properties/hudson.security.HudsonPrivateSecurityRealm_-Details/passwordHash" -v '#jbcrypt:'"$admin_password" config.xml
    
      # Restart
      systemctl restart jenkins
      sleep 10
    }
    
    function install_packages ()
    {
    
      wget -O /etc/yum.repos.d/jenkins.repo http://pkg.jenkins-ci.org/redhat-stable/jenkins.repo
      rpm --import https://jenkins-ci.org/redhat/jenkins-ci.org.key
      yum install -y jenkins
    
      # firewall
      #firewall-cmd --permanent --new-service=jenkins
      #firewall-cmd --permanent --service=jenkins --set-short="Jenkins Service Ports"
      #firewall-cmd --permanent --service=jenkins --set-description="Jenkins Service firewalld port exceptions"
      #firewall-cmd --permanent --service=jenkins --add-port=8080/tcp
      #firewall-cmd --permanent --add-service=jenkins
      #firewall-cmd --zone=public --add-service=http --permanent
      #firewall-cmd --reload
      systemctl enable jenkins
      systemctl restart jenkins
      sleep 10
    }
    
    function configure_jenkins_server ()
    {
      # Jenkins cli
      echo "installing the Jenkins cli ..."
      cp /var/cache/jenkins/war/WEB-INF/jenkins-cli.jar /var/lib/jenkins/jenkins-cli.jar
    
      # Getting initial password
      # PASSWORD=$(cat /var/lib/jenkins/secrets/initialAdminPassword)
      PASSWORD="${jenkins_admin_password}"
      sleep 10
    
      jenkins_dir="/var/lib/jenkins"
      plugins_dir="$jenkins_dir/plugins"
    
      cd $jenkins_dir
    
      # Open JNLP port
      xmlstarlet -q ed --inplace -u "/hudson/slaveAgentPort" -v 33453 config.xml
    
      cd $plugins_dir || { echo "unable to chdir to [$plugins_dir]"; exit 1; }
    
      # List of plugins that are needed to be installed 
      plugin_list="git-client git github-api github-oauth github MSBuild ssh-slaves workflow-aggregator ws-cleanup"
    
      # remove existing plugins, if any ...
      rm -rfv $plugin_list
    
      for plugin in $plugin_list; do
          echo "installing plugin [$plugin] ..."
          java -jar $jenkins_dir/jenkins-cli.jar -s http://127.0.0.1:8080/ -auth admin:$PASSWORD install-plugin $plugin
      done
    
      # Restart jenkins after installing plugins
      java -jar $jenkins_dir/jenkins-cli.jar -s http://127.0.0.1:8080 -auth admin:$PASSWORD safe-restart
    }
    
    ### script starts here ###
    
    install_packages
    
    wait_for_jenkins
    
    updating_jenkins_master_password
    
    wait_for_jenkins
    
    configure_jenkins_server
    
    echo "Done"
    exit 0
    

    There is a lot of stuff covered here, but the trickiest bit is changing the Jenkins password. We use a Python script, which uses bcrypt to hash the plain text in the encryption format Jenkins expects, and xmlstarlet to replace that password in the actual location. We also use xmlstarlet to edit the JNLP port for the Windows slave. Do remember that the initial username for Jenkins is admin.
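
    For reference, the hashing step from the embedded jenkinsHash.py as a standalone Python 3 sketch; the password here is a placeholder:

    import bcrypt

    # Hash a plaintext password in the #jbcrypt format Jenkins stores in config.xml
    plaintext = b"mysupersecretpassword"
    hashed = bcrypt.hashpw(plaintext, bcrypt.gensalt(rounds=10, prefix=b"2a"))
    assert bcrypt.checkpw(plaintext, hashed)   # sanity check of the hash
    print("#jbcrypt:" + hashed.decode())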

    Commands to run: initialize Terraform with terraform init, then check and apply with terraform plan -> terraform apply.

    After the apply command succeeds, go to the AWS console and check for a new instance coming up. Hit <public-ip>:8080, enter the credentials you passed, and you will have your Jenkins master ready to be used.

    Note: I will be providing the terraform script and permission list of IAM roles for the user at the end of the blog.

    Creating Terraform Script for Spinning up Linux Slave and Connecting It to Master

    We won’t be creating a new image here; rather, we will use the same one that we used for the Jenkins master.

    The VPC will be the same; the updated security groups for the slave are below:

    resource "aws_security_group" "dev_jenkins_worker_linux" {
      name        = "dev_jenkins_worker_linux"
      description = "Jenkins Server: created by Terraform for [dev]"
    
    # legacy name of VPC ID
      vpc_id = "${data.aws_vpc.default_vpc.id}"
    
      tags {
        Name = "dev_jenkins_worker_linux"
        env  = "dev"
      }
    }
    
    ###############################################################################
    # ALL INBOUND
    ###############################################################################
    
    # ssh
    resource "aws_security_group_rule" "jenkins_worker_linux_from_source_ingress_ssh" {
      type              = "ingress"
      from_port         = 22
      to_port           = 22
      protocol          = "tcp"
      security_group_id = "${aws_security_group.dev_jenkins_worker_linux.id}"
      cidr_blocks       = ["<Your Public IP>/32"]
      description       = "ssh to jenkins_worker_linux"
    }
    
    # web
    resource "aws_security_group_rule" "jenkins_worker_linux_from_source_ingress_webui" {
      type              = "ingress"
      from_port         = 8080
      to_port           = 8080
      protocol          = "tcp"
      security_group_id = "${aws_security_group.dev_jenkins_worker_linux.id}"
      cidr_blocks       = ["0.0.0.0/0"]
      description       = "ssh to jenkins_worker_linux"
    }
    
    
    ###############################################################################
    # ALL OUTBOUND
    ###############################################################################
    
    resource "aws_security_group_rule" "jenkins_worker_linux_to_all_80" {
      type              = "egress"
      from_port         = 80
      to_port           = 80
      protocol          = "tcp"
      security_group_id = "${aws_security_group.dev_jenkins_worker_linux.id}"
      cidr_blocks       = ["0.0.0.0/0"]
      description       = "allow jenkins worker to all 80"
    }
    
    resource "aws_security_group_rule" "jenkins_worker_linux_to_all_443" {
      type              = "egress"
      from_port         = 443
      to_port           = 443
      protocol          = "tcp"
      security_group_id = "${aws_security_group.dev_jenkins_worker_linux.id}"
      cidr_blocks       = ["0.0.0.0/0"]
      description       = "allow jenkins worker to all 443"
    }
    
    resource "aws_security_group_rule" "jenkins_worker_linux_to_other_machines_ssh" {
      type              = "egress"
      from_port         = 22
      to_port           = 22
      protocol          = "tcp"
      security_group_id = "${aws_security_group.dev_jenkins_worker_linux.id}"
      cidr_blocks       = ["0.0.0.0/0"]
      description       = "allow jenkins worker linux to jenkins server"
    }
    
    resource "aws_security_group_rule" "jenkins_worker_linux_to_jenkins_server_8080" {
      type                     = "egress"
      from_port                = 8080
      to_port                  = 8080
      protocol                 = "tcp"
      security_group_id        = "${aws_security_group.dev_jenkins_worker_linux.id}"
      source_security_group_id = "${aws_security_group.jenkins_server.id}"
      description              = "allow jenkins workers linux to jenkins server"
    }

    Now that we have the required security groups in place, it is time to walk through the Terraform script for the Linux slave.

    data "aws_ami" "jenkins_worker_linux" {
      most_recent      = true
      owners           = ["self"]
    
      filter {
        name   = "name"
        values = ["amazon-linux-for-jenkins*"]
      }
    }
    
    resource "aws_key_pair" "jenkins_worker_linux" {
      key_name   = "jenkins_worker_linux"
      public_key = "${file("jenkins_worker.pub")}"
    }
    
    data "local_file" "jenkins_worker_pem" {
      filename = "${path.module}/jenkins_worker.pem"
    }
    
    data "template_file" "userdata_jenkins_worker_linux" {
      template = "${file("scripts/jenkins_worker_linux.sh")}"
    
      vars {
        env         = "dev"
        region      = "us-east-1"
        datacenter  = "dev-us-east-1"
        node_name   = "us-east-1-jenkins_worker_linux"
        domain      = ""
        device_name = "eth0"
        server_ip   = "${aws_instance.jenkins_server.private_ip}"
        worker_pem  = "${data.local_file.jenkins_worker_pem.content}"
        jenkins_username = "admin"
        jenkins_password = "mysupersecretpassword"
      }
    }
    
    # lookup the security group of the Jenkins Server
    data "aws_security_group" "jenkins_worker_linux" {
      filter {
        name   = "group-name"
        values = ["dev_jenkins_worker_linux"]
      }
    }
    
    resource "aws_launch_configuration" "jenkins_worker_linux" {
      name_prefix                 = "dev-jenkins-worker-linux"
      image_id                    = "${data.aws_ami.jenkins_worker_linux.image_id}"
      instance_type               = "t3.medium"
      iam_instance_profile        = "dev_jenkins_worker_linux"
      key_name                    = "${aws_key_pair.jenkins_worker_linux.key_name}"
      security_groups             = ["${data.aws_security_group.jenkins_worker_linux.id}"]
      user_data                   = "${data.template_file.userdata_jenkins_worker_linux.rendered}"
      associate_public_ip_address = false
    
      root_block_device {
        delete_on_termination = true
        volume_size = 100
      }
    
      lifecycle {
        create_before_destroy = true
      }
    }
    
    resource "aws_autoscaling_group" "jenkins_worker_linux" {
      name                      = "dev-jenkins-worker-linux"
      min_size                  = "1"
      max_size                  = "2"
      desired_capacity          = "2"
      health_check_grace_period = 60
      health_check_type         = "EC2"
      vpc_zone_identifier       = ["${data.aws_subnet_ids.default_public.ids}"]
      launch_configuration      = "${aws_launch_configuration.jenkins_worker_linux.name}"
      termination_policies      = ["OldestLaunchConfiguration"]
      wait_for_capacity_timeout = "10m"
      default_cooldown          = 60
    
      tags = [
        {
          key                 = "Name"
          value               = "dev_jenkins_worker_linux"
          propagate_at_launch = true
        },
        {
          key                 = "class"
          value               = "dev_jenkins_worker_linux"
          propagate_at_launch = true
        },
      ]
    }

    And now the final piece of code: the user data for the slave machine.

    #!/bin/bash
    
    set -x
    
    function wait_for_jenkins ()
    {
        echo "Waiting jenkins to launch on 8080..."
    
        while (( 1 )); do
            echo "Waiting for Jenkins"
    
            nc -zv ${server_ip} 8080
            if (( $? == 0 )); then
                break
            fi
    
            sleep 10
        done
    
        echo "Jenkins launched"
    }
    
    function slave_setup()
    {
        # Wait till jar file gets available
        ret=1
        while (( $ret != 0 )); do
            wget -O /opt/jenkins-cli.jar http://${server_ip}:8080/jnlpJars/jenkins-cli.jar
            ret=$?
    
            echo "jenkins cli ret [$ret]"
        done
    
        ret=1
        while (( $ret != 0 )); do
            wget -O /opt/slave.jar http://${server_ip}:8080/jnlpJars/slave.jar
            ret=$?
    
            echo "jenkins slave ret [$ret]"
        done
        
        mkdir -p /opt/jenkins-slave
        chown -R ec2-user:ec2-user /opt/jenkins-slave
    
        # Register_slave
        JENKINS_URL="http://${server_ip}:8080"
    
        USERNAME="${jenkins_username}"
        
        # PASSWORD=$(cat /tmp/secret)
        PASSWORD="${jenkins_password}"
    
        SLAVE_IP=$(ip -o -4 addr list ${device_name} | head -n1 | awk '{print $4}' | cut -d/ -f1)
        NODE_NAME=$(echo "jenkins-slave-linux-$SLAVE_IP" | tr '.' '-')
        NODE_SLAVE_HOME="/opt/jenkins-slave"
        EXECUTORS=2
        SSH_PORT=22
    
        CRED_ID="$NODE_NAME"
        LABELS="build linux docker"
        USERID="ec2-user"
    
        cd /opt
        
        # Creating CMD utility for jenkins-cli commands
        jenkins_cmd="java -jar /opt/jenkins-cli.jar -s $JENKINS_URL -auth $USERNAME:$PASSWORD"
    
        # Waiting for Jenkins to load all plugins
        while (( 1 )); do
    
          count=$($jenkins_cmd list-plugins 2>/dev/null | wc -l)
          ret=$?
    
          echo "count [$count] ret [$ret]"
    
          if (( $count > 0 )); then
              break
          fi
    
          sleep 30
        done
    
        # Delete Credentials if present for respective slave machines
        $jenkins_cmd delete-credentials system::system::jenkins _ $CRED_ID
    
        # Generating cred.xml for creating credentials on Jenkins server
        cat > /tmp/cred.xml <<EOF
    <com.cloudbees.jenkins.plugins.sshcredentials.impl.BasicSSHUserPrivateKey plugin="ssh-credentials@1.16">
      <scope>GLOBAL</scope>
      <id>$CRED_ID</id>
      <description>Generated via Terraform for $SLAVE_IP</description>
      <username>$USERID</username>
      <privateKeySource class="com.cloudbees.jenkins.plugins.sshcredentials.impl.BasicSSHUserPrivateKey\$DirectEntryPrivateKeySource">
        <privateKey>${worker_pem}</privateKey>
      </privateKeySource>
    </com.cloudbees.jenkins.plugins.sshcredentials.impl.BasicSSHUserPrivateKey>
    EOF
    
        # Creating credential using cred.xml
        cat /tmp/cred.xml | $jenkins_cmd create-credentials-by-xml system::system::jenkins _
    
        # For Deleting Node, used when testing
        $jenkins_cmd delete-node $NODE_NAME
        
        # Generating node.xml for creating node on Jenkins server
        cat > /tmp/node.xml <<EOF
    <slave>
      <name>$NODE_NAME</name>
      <description>Linux Slave</description>
      <remoteFS>$NODE_SLAVE_HOME</remoteFS>
      <numExecutors>$EXECUTORS</numExecutors>
      <mode>NORMAL</mode>
      <retentionStrategy class="hudson.slaves.RetentionStrategy\$Always"/>
      <launcher class="hudson.plugins.sshslaves.SSHLauncher" plugin="ssh-slaves@1.5">
        <host>$SLAVE_IP</host>
        <port>$SSH_PORT</port>
        <credentialsId>$CRED_ID</credentialsId>
      </launcher>
      <label>$LABELS</label>
      <nodeProperties/>
      <userId>$USERID</userId>
    </slave>
    EOF
    
      sleep 10
      
      # Creating node using node.xml
      cat /tmp/node.xml | $jenkins_cmd create-node $NODE_NAME
    }
    
    ### script begins here ###
    
    wait_for_jenkins
    
    slave_setup
    
    echo "Done"
    exit 0

    This will not only create a node on Jenkins master but also attach it.

    Commands to run: initialize Terraform with terraform init, then check and apply with terraform plan -> terraform apply.

    One drawback of this approach: if the slave gets disconnected or goes down, it will remain on the Jenkins master as offline, and it will not automatically reattach itself to the Jenkins master.

    Some solutions for this are:

    1. Create a cron job on the slave which will run user-data after a certain interval.

    2. Use swarm plugin.

    3. As we are on AWS, we can even use Amazon EC2 Plugin.

    Maybe in a future blog, we will cover using these plugins as well.

    Using Packer to Create AMIs for Windows Slave

    The Windows AMI will also be created using Packer. All the pointers for Windows remain as they were for Linux.

    {
      "variables": {
        "ami-description": "Windows Server for Jenkins Slave ({{isotime \"2006-01-02-15-04-05\"}})",
        "ami-name": "windows-slave-for-jenkins-{{isotime \"2006-01-02-15-04-05\"}}",
        "aws_access_key": "",
        "aws_secret_key": ""
      },
    
      "builders": [
        {
          "ami_description": "{{user `ami-description`}}",
          "ami_name": "{{user `ami-name`}}",
          "ami_regions": [
            "us-east-1"
          ],
          "ami_users": [
            "XXXXXXXXXX"
          ],
          "ena_support": "true",
          "instance_type": "t3.medium",
          "region": "us-east-1",
          "source_ami_filter": {
            "filters": {
              "name": "Windows_Server-2016-English-Full-Containers-*",
              "root-device-type": "ebs",
              "virtualization-type": "hvm"
            },
            "most_recent": true,
            "owners": [
              "amazon"
            ]
          },
          "sriov_support": "true",
          "user_data_file": "scripts/SetUpWinRM.ps1",
          "communicator": "winrm",
          "winrm_username": "Administrator",
          "winrm_insecure": true,
          "winrm_use_ssl": true,
          "tags": {
            "Name": "{{user `ami-name`}}"
          },
          "type": "amazon-ebs"
        }
      ],
      "post-processors": [
      {
        "inline": [
          "echo AMI Name {{user `ami-name`}}",
          "date",
          "exit 0"
        ],
        "type": "shell-local"
      }
      ],
      "provisioners": [
        {
          "type": "powershell",
          "valid_exit_codes": [ 0, 3010 ],
          "scripts": [
            "scripts/disable-uac.ps1",
            "scripts/enable-rdp.ps1",
            "install_windows.ps1"
          ]
        },
        {
          "type": "windows-restart",
          "restart_check_command": "powershell -command \"& {Write-Output 'restarted.'}\""
        },
        {
          "type": "powershell",
          "inline": [
            "C:\\ProgramData\\Amazon\\EC2-Windows\\Launch\\Scripts\\InitializeInstance.ps1 -Schedule",
            "C:\\ProgramData\\Amazon\\EC2-Windows\\Launch\\Scripts\\SysprepInstance.ps1 -NoShutdown"
          ]
        }
      ]
    }

    Now, when it comes to Windows, one should know that it does not behave the same way Linux does. For us to be able to communicate with this image, an essential component is WinRM; we set it up at the very beginning as part of user_data_file. Also, Windows requires user input for a lot of things, and while automating it is not possible to provide it, as that would break the flow of execution, so we disable UAC and enable RDP so that we can connect to the machine from our local desktop for debugging if needed. At last, we execute the install_windows.ps1 file, which sets up our slave. Please note that at the end we are calling two PowerShell scripts (InitializeInstance.ps1 and SysprepInstance.ps1) so that a random password is generated every time a new machine is created. It is mandatory to have them, or you will never be able to log in to your machines.

    There are multiple user-data scripts referenced in the above code; let's go through them in their order of appearance.

    SetUpWinRM.ps1:

    <powershell>
    
    write-output "Running User Data Script"
    write-host "(host) Running User Data Script"
    
    Set-ExecutionPolicy Unrestricted -Scope LocalMachine -Force -ErrorAction Ignore
    
    # Don't set this before Set-ExecutionPolicy as it throws an error
    $ErrorActionPreference = "stop"
    
    # Remove HTTP listener
    Remove-Item -Path WSMan:\Localhost\listener\listener* -Recurse
    
    $Cert = New-SelfSignedCertificate -CertstoreLocation Cert:\LocalMachine\My -DnsName "packer"
    New-Item -Path WSMan:\LocalHost\Listener -Transport HTTPS -Address * -CertificateThumbPrint $Cert.Thumbprint -Force
    
    # WinRM
    write-output "Setting up WinRM"
    write-host "(host) setting up WinRM"
    
    cmd.exe /c winrm quickconfig -q
    cmd.exe /c winrm set "winrm/config" '@{MaxTimeoutms="1800000"}'
    cmd.exe /c winrm set "winrm/config/winrs" '@{MaxMemoryPerShellMB="1024"}'
    cmd.exe /c winrm set "winrm/config/service" '@{AllowUnencrypted="true"}'
    cmd.exe /c winrm set "winrm/config/client" '@{AllowUnencrypted="true"}'
    cmd.exe /c winrm set "winrm/config/service/auth" '@{Basic="true"}'
    cmd.exe /c winrm set "winrm/config/client/auth" '@{Basic="true"}'
    cmd.exe /c winrm set "winrm/config/service/auth" '@{CredSSP="true"}'
    cmd.exe /c winrm set "winrm/config/listener?Address=*+Transport=HTTPS" "@{Port=`"5986`";Hostname=`"packer`";CertificateThumbprint=`"$($Cert.Thumbprint)`"}"
    cmd.exe /c netsh advfirewall firewall set rule group="remote administration" new enable=yes
    cmd.exe /c netsh firewall add portopening TCP 5986 "Port 5986"
    cmd.exe /c net stop winrm
    cmd.exe /c sc config winrm start= auto
    cmd.exe /c net start winrm
    
    </powershell>

    The content is pretty straightforward, as it is just setting up WinRM. The only thing that really matters here is the <powershell> and </powershell> tags; they are mandatory, as without them Packer will not be able to tell the type of the script. Next, we come across disable-uac.ps1 & enable-rdp.ps1, whose purpose we discussed before. The last user-data script is the actual one we need to install all the required packages in the AMI.

    Chocolatey: a blessing in disguise – Installing required applications on Windows by scripting is a real headache, as you have to write a lot of code just to install a single application. Luckily for us, we have chocolatey. It works as a package manager for Windows and lets us install applications the way we install packages on Linux. install_windows.ps1 has the installation steps for chocolatey and shows how it can be used to install other applications on Windows.

    See, such a small script and you can get all the components to run your Windows application in no time. (Kidding… this script actually takes around 20 minutes to run :P)

    Remaining user-data can be found here.

    Now that we have the image for ourselves let’s start with terraform script to make this machine a slave of your Jenkins master.

    Creating Terraform Script for Spinning up Windows Slave and Connect it to Master

    This time, too, we will first create the security groups and then create the slave machine from the AMI we built above.

    resource "aws_security_group" "dev_jenkins_worker_windows" {
      name        = "dev_jenkins_worker_windows"
      description = "Jenkins Server: created by Terraform for [dev]"
    
      # legacy name of VPC ID
      vpc_id = "${data.aws_vpc.default_vpc.id}"
    
      tags {
        Name = "dev_jenkins_worker_windows"
        env  = "dev"
      }
    }
    
    ###############################################################################
    # ALL INBOUND
    ###############################################################################
    
    # web ui
    resource "aws_security_group_rule" "jenkins_worker_windows_from_source_ingress_webui" {
      type              = "ingress"
      from_port         = 8080
      to_port           = 8080
      protocol          = "tcp"
      security_group_id = "${aws_security_group.dev_jenkins_worker_windows.id}"
      cidr_blocks       = ["0.0.0.0/0"]
      description       = "webui to jenkins_worker_windows"
    }
    
    # rdp
    resource "aws_security_group_rule" "jenkins_worker_windows_from_rdp" {
      type              = "ingress"
      from_port         = 3389
      to_port           = 3389
      protocol          = "tcp"
      security_group_id = "${aws_security_group.dev_jenkins_worker_windows.id}"
      cidr_blocks       = ["<Your Public IP>/32"]
      description       = "rdp to jenkins_worker_windows"
    }
    
    ###############################################################################
    # ALL OUTBOUND
    ###############################################################################
    
    resource "aws_security_group_rule" "jenkins_worker_windows_to_all_80" {
      type              = "egress"
      from_port         = 80
      to_port           = 80
      protocol          = "tcp"
      security_group_id = "${aws_security_group.dev_jenkins_worker_windows.id}"
      cidr_blocks       = ["0.0.0.0/0"]
      description       = "allow jenkins worker to all 80"
    }
    
    resource "aws_security_group_rule" "jenkins_worker_windows_to_all_443" {
      type              = "egress"
      from_port         = 443
      to_port           = 443
      protocol          = "tcp"
      security_group_id = "${aws_security_group.dev_jenkins_worker_windows.id}"
      cidr_blocks       = ["0.0.0.0/0"]
      description       = "allow jenkins worker to all 443"
    }
    
    resource "aws_security_group_rule" "jenkins_worker_windows_to_jenkins_server_33453" {
      type              = "egress"
      from_port         = 33453
      to_port           = 33453
      protocol          = "tcp"
      security_group_id = "${aws_security_group.dev_jenkins_worker_windows.id}"
      cidr_blocks       = ["172.31.0.0/16"]
      description       = "allow jenkins worker windows to jenkins server"
    }
    
    resource "aws_security_group_rule" "jenkins_worker_windows_to_jenkins_server_8080" {
      type                     = "egress"
      from_port                = 8080
      to_port                  = 8080
      protocol                 = "tcp"
      security_group_id        = "${aws_security_group.dev_jenkins_worker_windows.id}"
      source_security_group_id = "${aws_security_group.jenkins_server.id}"
      description              = "allow jenkins workers windows to jenkins server"
    }
    
    resource "aws_security_group_rule" "jenkins_worker_windows_to_all_22" {
      type              = "egress"
      from_port         = 22
      to_port           = 22
      protocol          = "tcp"
      security_group_id = "${aws_security_group.dev_jenkins_worker_windows.id}"
      cidr_blocks       = ["0.0.0.0/0"]
      description       = "allow jenkins worker windows to connect outbound from 22"
    }

    Once the security groups are in place, we move towards creating the Terraform file for the Windows machine itself. Windows can't connect to the Jenkins master using SSH, the method we used while connecting the Linux slave; instead, we have to use JNLP. A quick recap: when creating the Jenkins master, we used xmlstarlet to modify the JNLP port and added rules in the security group to allow connections over JNLP. We have also opened the port for RDP, so that if any issue occurs, you can get into the machine and debug it.

    Terraform file:

    # Setting Up Windows Slave 
    data "aws_ami" "jenkins_worker_windows" {
      most_recent      = true
      owners           = ["self"]
    
      filter {
        name   = "name"
        values = ["windows-slave-for-jenkins*"]
      }
    }
    
    resource "aws_key_pair" "jenkins_worker_windows" {
      key_name   = "jenkins_worker_windows"
      public_key = "${file("jenkins_worker.pub")}"
    }
    
    data "template_file" "userdata_jenkins_worker_windows" {
      template = "${file("scripts/jenkins_worker_windows.ps1")}"
    
      vars {
        env         = "dev"
        region      = "us-east-1"
        datacenter  = "dev-us-east-1"
        node_name   = "us-east-1-jenkins_worker_windows"
        domain      = ""
        device_name = "eth0"
        server_ip   = "${aws_instance.jenkins_server.private_ip}"
        worker_pem  = "${data.local_file.jenkins_worker_pem.content}"
        jenkins_username = "admin"
        jenkins_password = "mysupersecretpassword"
      }
    }
    
    # lookup the security group of the Jenkins worker created above
    data "aws_security_group" "jenkins_worker_windows" {
      filter {
        name   = "group-name"
        values = ["dev_jenkins_worker_windows"]
      }
    }
    
    resource "aws_launch_configuration" "jenkins_worker_windows" {
      name_prefix                 = "dev-jenkins-worker-"
      image_id                    = "${data.aws_ami.jenkins_worker_windows.image_id}"
      instance_type               = "t3.medium"
      iam_instance_profile        = "dev_jenkins_worker_windows"
      key_name                    = "${aws_key_pair.jenkins_worker_windows.key_name}"
      security_groups             = ["${data.aws_security_group.jenkins_worker_windows.id}"]
      user_data                   = "${data.template_file.userdata_jenkins_worker_windows.rendered}"
      associate_public_ip_address = false
    
      root_block_device {
        delete_on_termination = true
        volume_size = 100
      }
    
      lifecycle {
        create_before_destroy = true
      }
    }
    
    resource "aws_autoscaling_group" "jenkins_worker_windows" {
      name                      = "dev-jenkins-worker-windows"
      min_size                  = "1"
      max_size                  = "2"
      desired_capacity          = "2"
      health_check_grace_period = 60
      health_check_type         = "EC2"
      vpc_zone_identifier       = ["${data.aws_subnet_ids.default_public.ids}"]
      launch_configuration      = "${aws_launch_configuration.jenkins_worker_windows.name}"
      termination_policies      = ["OldestLaunchConfiguration"]
      wait_for_capacity_timeout = "10m"
      default_cooldown          = 60
    
      #lifecycle {
      #  create_before_destroy = true
      #}
    
    
      ## on replacement, gives new service time to spin up before moving on to destroy
      #provisioner "local-exec" {
      #  command = "sleep 60"
      #}
    
      tags = [
        {
          key                 = "Name"
          value               = "dev_jenkins_worker_windows"
          propagate_at_launch = true
        },
        {
          key                 = "class"
          value               = "dev_jenkins_worker_windows"
          propagate_at_launch = true
        },
      ]
    }

    Finally, we reach the user-data for the Terraform plan. It will download the required jar files, create a node on the Jenkins master, and register the machine as a slave.

    <powershell>
    
    function Wait-For-Jenkins {
    
      Write-Host "Waiting jenkins to launch on 8080..."
    
      Do {
      Write-Host "Waiting for Jenkins"
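       # 'Nc' (netcat) below is assumed to be present on this image (e.g., installed via chocolatey)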
    
       Nc -zv ${server_ip} 8080
       If( $? -eq $true ) {
         Break
       }
       Sleep 10
    
      } While (1)
    
      Do {
       Write-Host "Waiting for JNLP"
          
       Nc -zv ${server_ip} 33453
       If( $? -eq $true ) {
        Break
       }
       Sleep 10
    
      } While (1)      
    
      Write-Host "Jenkins launched"
    }
    
    function Slave-Setup()
    {
      # Register_slave
      $JENKINS_URL="http://${server_ip}:8080"
    
      $USERNAME="${jenkins_username}"
      
      $PASSWORD="${jenkins_password}"
    
      $AUTH = -join ("$USERNAME", ":", "$PASSWORD")
      echo $AUTH
    
      # Below IP collection logic works for Windows Server 2016 edition and needs testing for windows server 2008 edition
      $SLAVE_IP=(ipconfig | findstr /r "[0-9][0-9]*\.[0-9][0-9]*\.[0-9][0-9]*\.[0-9][0-9]*" | findstr "IPv4 Address").substring(39) | findstr /B "172.31"
      
      $NODE_NAME="jenkins-slave-windows-$SLAVE_IP"
      
      $NODE_SLAVE_HOME="C:\Jenkins\"
      $EXECUTORS=2
      $JNLP_PORT=33453
    
      $CRED_ID="$NODE_NAME"
      $LABELS="build windows"
      
      # Creating CMD utility for jenkins-cli commands
      # This is not working in windows therefore specify full path
      $jenkins_cmd = "java -jar C:\Jenkins\jenkins-cli.jar -s $JENKINS_URL -auth admin:$PASSWORD"
    
      Sleep 20
    
      Write-Host "Downloading jenkins-cli.jar file"
      (New-Object System.Net.WebClient).DownloadFile("$JENKINS_URL/jnlpJars/jenkins-cli.jar", "C:\Jenkins\jenkins-cli.jar")
    
      Write-Host "Downloading slave.jar file"
      (New-Object System.Net.WebClient).DownloadFile("$JENKINS_URL/jnlpJars/slave.jar", "C:\Jenkins\slave.jar")
    
      Sleep 10
    
      # Waiting for Jenkins to load all plugins
      Do {
      
        $count=(java -jar C:\Jenkins\jenkins-cli.jar -s $JENKINS_URL -auth $AUTH list-plugins | Measure-Object -line).Lines
        $ret=$?
    
        Write-Host "count [$count] ret [$ret]"
    
        If ( $count -gt 0 ) {
            Break
        }
    
        sleep 30
      } While ( 1 )
    
      # For Deleting Node, used when testing
      Write-Host "Deleting Node $NODE_NAME if present"
      java -jar C:\Jenkins\jenkins-cli.jar -s $JENKINS_URL -auth $AUTH delete-node $NODE_NAME
      
      # Generating node.xml for creating node on Jenkins server
      $NodeXml = @"
    <slave>
    <name>$NODE_NAME</name>
    <description>Windows Slave</description>
    <remoteFS>$NODE_SLAVE_HOME</remoteFS>
    <numExecutors>$EXECUTORS</numExecutors>
    <mode>NORMAL</mode>
    <retentionStrategy class="hudson.slaves.RetentionStrategy`$Always`"/>
    <launcher class="hudson.slaves.JNLPLauncher">
      <workDirSettings>
        <disabled>false</disabled>
        <internalDir>remoting</internalDir>
        <failIfWorkDirIsMissing>false</failIfWorkDirIsMissing>
      </workDirSettings>
    </launcher>
    <label>$LABELS</label>
    <nodeProperties/>
    </slave>
    "@
      $NodeXml | Out-File -FilePath C:\Jenkins\node.xml 
    
      type C:\Jenkins\node.xml
    
      # Creating node using node.xml
      Write-Host "Creating $NODE_NAME"
      Get-Content -Path C:\Jenkins\node.xml | java -jar C:\Jenkins\jenkins-cli.jar -s $JENKINS_URL -auth $AUTH create-node $NODE_NAME
    
      Write-Host "Registering Node $NODE_NAME via JNLP"
      Start-Process java -ArgumentList "-jar C:\Jenkins\slave.jar -jnlpCredentials $AUTH -jnlpUrl $JENKINS_URL/computer/$NODE_NAME/slave-agent.jnlp"
    }
    
    ### script begins here ###
    
    Wait-For-Jenkins
    
    Slave-Setup
    
    echo "Done"
    </powershell>
    <persist>true</persist>

    Commands to run: initialize Terraform with terraform init, then check and apply with terraform plan followed by terraform apply.

    The same drawbacks are applicable here, and the same solutions will work as well.

    Congratulations! You now have a Jenkins master with Windows and Linux slaves attached to it.

    IAM roles for reference

    Jenkins Master

    Linux Slave

    Windows Slave

    Bonus:

    If you want to associate IAM permissions to the user but cannot assign FULL ACCESS here is a curated list below for reference:

    Packer Policy

    Terraform Policy

    Conclusion:

    This blog highlights one of the ways in which we can use Packer and Terraform to create AMIs that serve as Jenkins master and slaves. We not only covered their creation but also looked at how to associate security groups, and checked some of the basic IAM roles that can be applied. Although we have covered most of the common scenarios, the changes required for your particular use case should be small, and this can serve as boilerplate code when you begin planning your infrastructure on the cloud.

  • Web Scraping: Introduction, Best Practices & Caveats

    Web scraping is a process to crawl various websites and extract the required data using spiders. This data is processed in a data pipeline and stored in a structured format. Today, web scraping is widely used and has many use cases:

    • Using web scraping, Marketing & Sales companies can fetch lead-related information.
    • Web scraping is useful for Real Estate businesses to get the data of new projects, resale properties, etc.
    • Price comparison portals, like Trivago, extensively use web scraping to get product and price information from various e-commerce sites.

    The process of web scraping usually involves spiders, which fetch the HTML documents from relevant websites, extract the needed content based on the business logic, and finally store it in a specific format. This blog is a primer to build highly scalable scrapers. We will cover the following items:

    1. Ways to scrape: We’ll see basic ways to scrape data using techniques and frameworks in Python with some code snippets.
    2. Scraping at scale: Scraping a single page is straightforward, but there are challenges in scraping millions of websites, including managing the spider code, collecting data, and maintaining a data warehouse. We’ll explore such challenges and their solutions to make scraping easy and accurate.
    3. Scraping Guidelines: Scraping data from websites without the owner’s permission can be deemed malicious. Certain guidelines need to be followed to ensure our scrapers are not blacklisted. We’ll look at some of the best practices one should follow for crawling.

    So let’s start scraping. 

    Different Techniques for Scraping

    Here, we will discuss how to scrape a page and the different libraries available in Python.

    Note: Python is one of the most popular languages for scraping.

    1. Requests – HTTP Library in Python: To scrape a website or a page, first fetch the content of the HTML page as an HTTP response object. The requests library from Python is pretty handy and easy to use. It uses urllib3 under the hood. I like ‘requests’ as it’s easy to use and keeps the code readable too.

    #Example showing how to use the requests library
    import requests
    r = requests.get("https://velotio.com") #Fetch HTML Page

    2. BeautifulSoup: Once you get the webpage, the next step is to extract the data. BeautifulSoup is a powerful Python library that helps you extract data from a page. It’s easy to use and has a wide range of APIs that help you extract data. We use the requests library to fetch the HTML page and then use BeautifulSoup to parse it. In this example, we can easily fetch the page title and all the links on the page. Check out the documentation for all the possible ways in which we can use BeautifulSoup.

    from bs4 import BeautifulSoup
    import requests
    r = requests.get("https://velotio.com") #Fetch HTML Page
    soup = BeautifulSoup(r.text, "html.parser") #Parse HTML Page
    print("Webpage Title: " + soup.title.string)
    print("All Links:", soup.find_all('a'))

    3. Python Scrapy Framework:

    Scrapy is a Python-based web scraping framework that allows you to create different kinds of spiders to fetch the source code of the target website. Scrapy starts crawling the web pages present on a certain website, and then you can write the extraction logic to get the required data. Scrapy is built on top of Twisted, a Python-based asynchronous networking library that performs the requests in an async fashion to boost the spider’s performance. Scrapy is faster than BeautifulSoup. Moreover, it is a framework to write scrapers, as opposed to BeautifulSoup, which is just a library to parse HTML pages.

    Here is a simple example of how to use Scrapy. Install Scrapy via pip. Scrapy gives a shell after parsing a website:

    $ pip install scrapy #Install Scrapy
    $ scrapy shell https://velotio.com
    In [1]: response.xpath("//a").extract() #Fetch all a hrefs

    Now, let’s write a custom spider to parse a website.

    $ cat > myspider.py << EOF
    import scrapy

    class BlogSpider(scrapy.Spider):
        name = 'blogspider'
        start_urls = ['https://blog.scrapinghub.com']

        def parse(self, response):
            for title in response.css('h2.entry-title'):
                yield {'title': title.css('a ::text').extract_first()}
    EOF
    $ scrapy runspider myspider.py

    That’s it. Your first custom spider is created. Now, let’s understand the code.

    • name: Name of the spider. In this case, it’s “blogspider”.
    • start_urls: A list of URLs where the spider will begin to crawl from.
    • parse(self, response): This function is called whenever the crawler successfully crawls a URL. The response object used earlier in the Scrapy shell is the same response object that is passed to the parse(..).

    When you run this, Scrapy will request the start URL, give you all the elements matching the h2.entry-title selector, and extract the associated text from them. You can write your extraction logic directly in the parse method, or create a separate class for extraction and call its object from the parse method.

    You’ve seen how to extract simple items from a website using Scrapy, but this is just the surface. Scrapy provides a lot of powerful features for making scraping easy and efficient. Here is a tutorial for Scrapy and the additional documentation for LinkExtractor by which you can instruct Scrapy to extract links from a web page.

    4. Python lxml.html library: This is another Python library, just like BeautifulSoup; in fact, Scrapy uses lxml internally. It comes with a list of APIs you can use for data extraction. Why would you use it when Scrapy itself can extract the data? Say you want to iterate over every ‘div’ tag and perform some operation on each tag present under “div”; this library gives you a list of ‘div’ tags, which you can iterate over using the iter() function, traversing each child tag inside the parent div tag. Such traversal operations are awkward with plain scraping APIs. Here is the documentation for this library.
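
    As a short sketch of that traversal (reusing the requests fetch from earlier; the URL is just the example used above):

    from lxml import html
    import requests

    r = requests.get("https://velotio.com")
    tree = html.fromstring(r.content)  # Parse the HTML document

    for div in tree.iter('div'):       # Iterate over every <div> tag
        for child in div:              # Traverse the direct children of each div
            print(child.tag, child.attrib)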

    Challenges while Scraping at Scale

    Let’s look at the challenges and solutions while scraping at large scale, i.e., scraping 100-200 websites regularly:

    1. Data warehousing: Data extraction at a large scale generates vast volumes of information. Fault tolerance, scalability, security, and high availability are must-have features for a data warehouse. If your data warehouse is not stable or accessible, then operations like search and filter over the data become an overhead. To achieve this, instead of maintaining your own database or infrastructure, you can use Amazon Web Services (AWS): RDS (Relational Database Service) for a structured database and DynamoDB for a non-relational database. AWS takes care of backing up the data, automatically takes snapshots of the database, and gives you database error logs as well. This blog explains how to set up infrastructure in the cloud for scraping.

    2. Pattern Changes: Scraping heavily relies on the user interface and its structure, i.e., CSS and XPath. If the target website changes its layout, our scraper may crash completely or return random data that we don’t want. This is a common scenario, and that’s why maintaining scrapers is usually harder than writing them. To handle this case, we can write test cases for the extraction logic and run them daily, either manually or from CI tools like Jenkins, to track whether the target website has changed, as in the sketch below.
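
    Here is a minimal sketch of such a test, runnable with pytest; the extract_title helper stands in for your real extraction logic, and the target URL is just an example:

    # test_extraction.py -- run daily (e.g., from Jenkins) to catch pattern changes.
    import requests
    from bs4 import BeautifulSoup

    def extract_title(html_text):
        soup = BeautifulSoup(html_text, "html.parser")
        return soup.title.string if soup.title else None

    def test_title_still_extractable():
        r = requests.get("https://velotio.com")
        assert r.status_code == 200
        # Fails (and alerts us) if the site's structure has changed
        assert extract_title(r.text), "Title selector no longer matches!"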

    3. Anti-scraping Technologies: Web scraping is a common thing these days, and most website owners want to protect their data from being scraped; anti-scraping technologies help them do that. For example, if you hit a particular website from the same IP address at a regular interval, the target website can block your IP. Adding a captcha to a website also helps. There are methods by which we can bypass these anti-scraping measures. For example, we can use proxy servers to hide our original IP; several proxy services keep rotating the IP before each request. It is also easy to add proxy support in code; in Python, the Scrapy framework supports it out of the box, as shown below.
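
    As a minimal sketch, routing a Scrapy request through a proxy is a one-line change via the request’s meta; the proxy endpoint below is a placeholder you would get from your proxy service:

    import scrapy

    class ProxiedSpider(scrapy.Spider):
        name = 'proxied'

        def start_requests(self):
            yield scrapy.Request(
                'https://velotio.com',
                # Placeholder endpoint; Scrapy's built-in HttpProxyMiddleware picks this up
                meta={'proxy': 'http://my-proxy.example.com:8080'},
                callback=self.parse,
            )

        def parse(self, response):
            self.logger.info("Fetched %s via proxy", response.url)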

    4. JavaScript-based dynamic content: Websites that heavily rely on JavaScript and Ajax to render dynamic content make data extraction difficult. Scrapy and related frameworks/libraries will only extract what they find in the HTML document; Ajax calls and JavaScript execute at runtime, so a plain scraper can’t see that content. This can be handled by rendering the web page in a headless browser, such as Headless Chrome, which essentially allows running Chrome in a server environment. You can also use PhantomJS, which provides a headless WebKit-based environment.

    5. Honeypot traps: Some websites place honeypot traps on their pages to detect web crawlers. These are hard to detect, as most such links are blended with the background color or have their CSS display property set to none. Implementing this requires a large coding effort on both the server and the crawler side, hence the method is not frequently used.

    6. Quality of data: Currently, AI and ML projects are in high demand, and these projects need data at a large scale. Data integrity is also important, as one fault can cause serious problems in AI/ML algorithms. So, in scraping, it is very important to not just scrape the data but verify its integrity as well. Doing this in real time is not always possible, so I would prefer to write test cases for the extraction logic to make sure that whatever your spiders extract is correct and that they are not scraping any bad data.

    7. More Data, More Time: This one is obvious. The larger a website is, the more data it contains, and the longer it takes to scrape that site. This may be fine if your purpose for scanning the site isn’t time-sensitive, but that often isn’t the case. Stock prices don’t stay the same over hours. Sales listings, currency exchange rates, media trends, and market prices are just a few examples of time-sensitive data. What to do in this case, then? Well, one solution is to design your spiders carefully. If you’re using a framework like Scrapy, apply proper LinkExtractor rules so that the spider doesn’t waste time scraping unrelated URLs.

    You may use multithreading scraping packages available in Python, such as Frontera and Scrapy Redis. Frontera lets you send out only one request per domain at a time, but can hit multiple domains at once, making it great for parallel scraping. Scrapy Redis lets you send out multiple requests to one domain. The right combination of these can result in a very powerful web spider that can handle both the bulk and variation for large websites.

    8. Captchas: Captchas are a good way of keeping crawlers away from a website, and they are used by many website hosts. In order to scrape data from such websites, we need a mechanism to solve the captchas. There are packages and services that can solve captchas and act as middleware between the target website and your spider. You can also use libraries like Pillow and Tesseract in Python to solve simple image-based captchas, as sketched below.
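
    For the simple image-based case, a rough sketch with Pillow and pytesseract (the captcha.png file name is a placeholder, and real captchas usually need extra pre-processing):

    from PIL import Image
    import pytesseract  # Requires the tesseract binary to be installed

    img = Image.open('captcha.png').convert('L')  # Grayscale often helps OCR
    text = pytesseract.image_to_string(img)
    print("Captcha text:", text.strip())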

    9. Maintaining Deployment: Normally, we don’t want to limit ourselves to scraping just a few websites. We want the maximum amount of data present on the Internet, and that may mean scraping millions of websites. Now you can imagine the size of the code and the deployment. We can’t run spiders at this scale from a single machine. What I prefer here is to dockerize the scrapers and take advantage of technologies like AWS ECS and Kubernetes to run our scraper containers. This keeps our scrapers highly available and easy to maintain, and we can schedule them to run at regular intervals.

    Scraping Guidelines/ Best Practices

    1. Respect the robots.txt file: robots.txt is a text file that webmasters create to instruct search engine robots on how to crawl and index pages on the website. It generally contains instructions for crawlers, so before even planning the extraction logic, you should check this file. You can usually find it at the root of the website (e.g., example.com/robots.txt). This file has all the rules for how crawlers should interact with the website. For example, if a website has a link to download critical information, the owners probably don’t want to expose that to crawlers. Another important setting is the crawl frequency interval, which means crawlers should only hit the website at the specified interval. If someone has asked us not to crawl their website, we had better not do it: if they catch your crawlers, it can lead to serious legal issues. Python makes this check easy, as shown below.
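
    Python’s standard library ships urllib.robotparser for exactly this check; a quick sketch (the bot name is hypothetical):

    from urllib import robotparser

    rp = robotparser.RobotFileParser()
    rp.set_url("https://velotio.com/robots.txt")
    rp.read()

    # May our bot fetch this URL at all?
    print(rp.can_fetch("MyCrawlerBot", "https://velotio.com/blog"))
    # Crawl-delay, if the site specifies one for our user agent
    print(rp.crawl_delay("MyCrawlerBot"))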

    2. Do not hit the servers too frequently: As mentioned above, some websites specify a crawl frequency interval. We should use it wisely, because not every website is tested against high load. Hitting a server at a constant, rapid interval creates huge traffic on the server side, and it may crash or fail to serve other requests. This has a high impact on user experience, and users are more important than bots. So, we should make requests according to the interval specified in robots.txt, or use a standard delay of around 10 seconds (see the snippet below). This also helps you avoid getting blocked by the target website.
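
    In Scrapy, for example, this is just a couple of lines in settings.py (the values are illustrative):

    # settings.py (snippet)
    DOWNLOAD_DELAY = 10              # Standard delay of 10 seconds between requests
    RANDOMIZE_DOWNLOAD_DELAY = True  # Add jitter so the interval isn't perfectly constant
    AUTOTHROTTLE_ENABLED = True      # Back off automatically when the server slows down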

    3. User Agent Rotation and Spoofing: Every request carries a User-Agent string in its headers. This string identifies the browser you are using, its version, and the platform. If we use the same User-Agent in every request, it is easy for the target website to detect that the requests come from a crawler. To avoid this, rotate the User-Agent between requests, as sketched below. You can find examples of genuine User-Agent strings on the Internet very easily; try them out. If you’re using Scrapy, you can set the USER_AGENT property in settings.py.
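
    A bare-bones sketch of User-Agent rotation with requests (the strings below are truncated examples; collect genuine, current ones):

    import random
    import requests

    USER_AGENTS = [
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
        "Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101 Firefox/78.0",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/605.1.15",
    ]

    headers = {"User-Agent": random.choice(USER_AGENTS)}  # New pick per request
    r = requests.get("https://velotio.com", headers=headers)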

    4. Disguise your requests by rotating IPs and proxy services: We’ve discussed this in the challenges above. It’s always better to use rotating IPs and a proxy service so that your spider won’t get blocked.

    5. Do not follow the same crawling pattern: As you know, many websites use anti-scraping technologies, so it’s easy for them to detect your spider if it crawls in the same pattern every time. A human would normally not follow a fixed pattern on a particular website. To have your spiders run smoothly, we can introduce actions like mouse movements and clicking a random link, which give the impression that your spider is a human.

    6. Scrape during off-peak hours: Off-peak hours are suitable for bots/crawlers as the traffic on the website is considerably less. These hours can be identified by the geolocation from where the site’s traffic originates. This also helps to improve the crawling rate and avoid the extra load from spider requests. Thus, it is advisable to schedule the crawlers to run in the off-peak hours.

    7. Use the scraped data responsibly: We should always take responsibility for the scraped data. It is not acceptable to scrape data and then republish it somewhere else; this can be considered a breach of copyright laws and may lead to legal issues. So, it is advisable to check the target website’s Terms of Service page before scraping.

    8. Use Canonical URLs: When we scrape, we tend to scrape duplicate URLs, and hence duplicate data, which is the last thing we want. Within a single website, we may get multiple URLs serving the same data. In this situation, the duplicate URLs will have a canonical URL declared, which points to the parent or original URL. By honoring it, we make sure we don’t scrape duplicate content. In frameworks like Scrapy, duplicate URLs are handled by default.

    9. Be transparent: Don’t misrepresent your purpose or use deceptive methods to gain access. If you have a login and a password that identifies you to gain access to a source, use it.  Don’t hide who you are. If possible, share your credentials.

    Conclusion

    We’ve seen the basics of scraping, frameworks, how to crawl, and the best practices of scraping. To conclude:

    • Follow the target website’s rules while scraping. Don’t make them block your spider.
    • Maintaining data and spiders at scale is difficult. Use Docker/Kubernetes and public cloud providers, like AWS, to easily scale your web-scraping backend.
    • Always respect the rules of the websites you plan to crawl. If APIs are available, always use them first.
  • Building High-performance Apps: A Checklist To Get It Right

    An app is only as good as the problem it solves. But your app’s performance can be extremely critical to its success as well. A slow-loading web app can make users quit and try out an alternative in no time. Testing an app’s performance should thus be an integral part of your development process and not an afterthought.

    In this article, we will talk about how you can proactively monitor and boost your app’s performance as well as fix common issues that are slowing down the performance of your app.

    I’ll use the following tools for this blog.

    • Lighthouse – A performance audit tool, developed by Google
    • Webpack – A JavaScript bundler

    You can find similar tools online, both free and paid. So let’s give our Vue a new Angular perspective to make our apps React faster.

    Performance Metrics

    First, we need to understand which metrics play an important role in determining an app’s performance. Lighthouse calculates its score as a weighted average of the following metrics:

    1. First Contentful Paint (FCP) – 15%
    2. Speed Index (SI) – 15%
    3. Largest Contentful Paint (LCP) – 25%
    4. Time to Interactive (TTI) – 15%
    5. Total Blocking Time (TBT) – 25%
    6. Cumulative Layout Shift (CLS) – 5%

    By taking the above stats into account, Lighthouse gauges your app’s performance as such:

    • 0 to 49 (slow): Red
    • 50 to 89 (moderate): Orange
    • 90 to 100 (fast): Green
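
    To make the weighting concrete, here is a toy Python sketch of the averaging step. The per-metric scores are made up, and in reality Lighthouse first maps each raw metric value onto a 0-100 scale using log-normal scoring curves before applying these weights:

    # Toy sketch of Lighthouse's weighted average; metric scores are hypothetical.
    WEIGHTS = {"FCP": 0.15, "SI": 0.15, "LCP": 0.25, "TTI": 0.15, "TBT": 0.25, "CLS": 0.05}
    scores  = {"FCP": 80,   "SI": 70,   "LCP": 55,   "TTI": 65,   "TBT": 40,   "CLS": 90}

    performance = sum(WEIGHTS[m] * scores[m] for m in WEIGHTS)
    print("Performance score: %.0f" % performance)  # ~60 -> "moderate" (orange)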

    I would recommend going through Lighthouse performance scoring to learn more. Once you understand Lighthouse, you can audit websites of your choosing.

    I gathered audit scores for a few websites, including Walmart, Zomato, Reddit, and British Airways. Almost all of them had a performance score below 30; a few even scored in the single digits.

    To attract more customers, businesses fill their apps with many attractive features. But they ignore the most important thing: performance, which degrades with the addition of each such feature.

    As I said earlier, it’s all about the user experience. You can read more about why performance matters and how it impacts the overall experience.

    Now, with that being said, I want to challenge you to conduct a performance test on your favorite app. Let me know if it receives a good score. If not, then don’t feel bad.

    Follow along with me. 

    Let’s get your app fixed!


    Exploring Opportunities

    If you’re still reading this blog, I expect that your app received a low score, or maybe, you’re just curious.


    Whatever the reason, let’s get started.

    Below your scores are the possible opportunities suggested by Lighthouse. Fixing these affects the performance metrics above and eventually boosts your app’s performance. So let’s check them out one-by-one.

    Here are all the possible opportunities listed by Lighthouse:

    1. Eliminate render-blocking resources
    2. Properly size images
    3. Defer offscreen images
    4. Minify CSS & JavaScript
    5. Serve images in the next-gen formats
    6. Enable text compression
    7. Preconnect to required origins
    8. Avoid multiple page redirects
    9. Use video formats for animated content

    A few other opportunities won’t be covered in this blog, but they are just an extension of the above points. Feel free to read them under the further reading section.

    Eliminate Render-blocking Resources


    This section lists down all the render-blocking resources. The main goal is to reduce their impact by:

    • removing unnecessary resources,
    • deferring non-critical resources, and
    • in-lining critical resources.

    To do that, we need to understand what a render-blocking resource is.

    Render-blocking resources and how to identify them

    As the name suggests, it’s a resource that prevents a browser from rendering processed content. Lighthouse identifies the following as render-blocking resources:

    • A <script></script> tag in <head></head> that doesn’t have a defer or async attribute
    • A <link rel="stylesheet"> tag that doesn’t have a media attribute matching the user’s device, or a disabled attribute hinting the browser not to download it when unnecessary
    • A <link rel="import"> that doesn’t have an async attribute

    To reduce the impact, you need to identify what’s critical and what’s not. You can read how to identify critical resources using the Chrome dev tool.

    Classify Resources

    Classify resources as critical and non-critical based on the following color code:

    • Green (critical): Needed for the first paint.
    • Red (non-critical): Not needed for the first paint but will be needed later.

    Solution

    Now, to eliminate render-blocking resources:

    Extract the critical part into an inline resource and add the correct attributes to the non-critical resources. These attributes will indicate to the browser what to download asynchronously. This can be done manually or by using a JS bundler.

    Webpack users can use the libraries below to do it in a few easy steps:

    • For extracting critical CSS, you can use html-critical-webpack-plugin or critters-webpack-plugin. It’ll generate an inline <style></style> tag in the <head></head> with the critical CSS stripped out of the main CSS chunk, and preload the main file
    • For extracting CSS depending on media queries, use media-query-splitting-plugin or media-query-plugin
    • The first paint doesn’t need to be dependent on the JavaScript files. Use lazy loading and code splitting techniques to achieve lazy loading resources (downloading only when requested by the browser). The magic comments in lazy loading make it easy
    • And finally, for the main chunk, vendor chunk, or any other external scripts (included in index.html), you can defer them using script-ext-html-webpack-plugin

    There are many more libraries for inlining CSS and deferring external scripts. Feel free to use as per the use case.

    Use Properly Sized Images

    This section lists all the images used in a page that aren’t properly sized, along with the stats on potential savings for each image.

    How Does Lighthouse Calculate Oversized Images?

    Lighthouse calculates potential savings by comparing the rendered size of each image on the page with its actual size. The rendered size varies based on the device pixel ratio. If the size difference is at least 25 KB, the image fails the audit.

    Solution 

    DO NOT serve images that are larger than their rendered versions! The wasted size just hampers the load time. 

    Alternatively,

    • Use responsive images. With this technique, create multiple versions of the images to be used in the application and serve them depending on the media queries, viewport dimensions, etc
    • Use image CDNs to optimize images. These are like a web service API for transforming images
    • Use vector images, like SVG. These are built on simple primitives and can scale without losing data or change in the file size

    You can resize images online or on your system using tools. Learn how to serve responsive images.

    Learn more about replacing complex icons with SVG. For browsers that don’t support SVG format, here’s A Complete Guide to SVG fallbacks.

    Defer Offscreen Images

    An offscreen image is an image located outside of the visible browser viewport. 

    The audit fails if the page has offscreen images. Lighthouse lists all offscreen or hidden images in your page, along with the potential savings. 

    Solution 

    Load offscreen images only when the user focuses on that part of the viewport. To achieve this, lazy-load these images after loading all critical resources.

    There are many libraries available online that will load images depending on the visible viewport. Feel free to use them as per the use case.

    Minify CSS and JavaScript

    Lighthouse identifies all the CSS and JS files that are not minified. It will list all of them along with potential savings.

    Solution 

    Do as the heading says!


    Minifiers can do it for you. Webpack users can use mini-css-extract-plugin and terser-webpack-plugin for minifying CSS and JS, respectively.

    Serve Images in Next-gen Formats

    Following are the next-gen image formats:

    • WebP
    • JPEG 2000
    • JPEG XR

    The image formats we use regularly (i.e., JPEG and PNG) have inferior compression and quality characteristics compared to next-gen formats. Encoding images in these formats can load your website faster and consume less cellular data.

    Lighthouse converts each image in an older format to WebP and reports the ones with potential savings of 8 KB or more.

    Solution 

    Convert all, or at least the images Lighthouse recommends, into the above formats. Use your converted images with the fallback technique below to support all browsers.

    <picture>
      <source type="image/jp2" srcset="my-image.jp2">
      <source type="image/jxr" srcset="my-image.jxr">
      <source type="image/webp" srcset="my-image.webp">
      <source type="image/jpeg" srcset="my-image.jpg">
      <img src="my-image.jpg" alt="">
    </picture>

    Enable Text Compression


    This technique of compressing the original textual information uses compression algorithms to find repeated sequences and replace them with shorter representations. It’s done to further minimize the total network bytes.

    Lighthouse lists all the text-based resources that are not compressed. 

    It computes the potential savings by identifying text-based resources that do not include a Content-Encoding header set to br, gzip, or deflate, and compressing each of them with gzip.

    If the potential compression savings are more than 10% of the original size, the file fails the audit.
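
    You can approximate this check yourself; a small Python sketch that measures gzip savings for a file (bundle.js is a placeholder name, and the 10% threshold mirrors the audit):

    import gzip

    def gzip_savings(path):
        data = open(path, 'rb').read()
        compressed = gzip.compress(data)
        return 1 - len(compressed) / len(data)  # Fraction of bytes saved

    savings = gzip_savings('bundle.js')  # Placeholder asset
    print("Potential savings: %.0f%%" % (savings * 100))
    if savings > 0.10:
        print("This asset would fail the text-compression audit.")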

    Solution

    Webpack users can use compression-webpack-plugin for text compression. 

    The best part about this plugin is that it supports Google’s Brotli compression algorithm which is superior to gzip. Alternatively, you can also use brotli-webpack-plugin. All you need to do is configure your server to return Content-Encoding as br.

    Brotli compresses faster than gzip and produces smaller files (up to 20% smaller). As of June 2020, Brotli is supported by all major browsers except Safari on iOS and desktop and Internet Explorer.

    Don’t worry. You can still use gzip as a fallback.

    Preconnect to Required Origins

    This section lists all the key fetch requests that are not yet prioritized using <link rel="preconnect">.

    Establishing connections often takes significant time, especially secure ones, as the browser goes through DNS lookups, redirects, and several round trips to the final server handling the user’s request.

    Solution

    Establish an early connection to required origins. Doing so will improve the user experience without affecting bandwidth usage. 

    To achieve this connection, use preconnect or dns-prefetch. This informs the browser that the app wants to establish a connection to the third-party origin as soon as possible.

    Use preconnect for most critical connections. For non-critical connections, use dns-prefetch. Check out the browser support for preconnect. You can use dns-prefetch as the fallback.

    Avoid Multiple Page Redirects


    This section focuses on requested resources that have been redirected multiple times. One must avoid multiple redirects before the final landing page.

    In the case of an HTTP redirect, the browser receives a response like this from the server:

    HTTP/1.1 301 Moved Permanently
    Location: /path/to/new/location

    A typical example of a redirect looks like this:

    example.com → www.example.com → m.example.com – a very slow mobile experience.

    This eventually makes your page load more slowly.

    Solution

    Don’t leave them hanging!


    Point all your flagged resources to their current location. It’ll help you optimize your pages’ Critical Rendering Path.

    Use Video Formats for Animated Content

    This section lists all the animated GIFs on your page, along with the potential savings. 

    Large GIFs are inefficient when delivering animated content. You can save a significant amount of bandwidth by using videos over GIFs.

    Solution

    Consider using MPEG4 or WebM videos instead of GIFs. Many tools can convert a GIF into a video, such as FFmpeg.

    Use the code below to replicate a GIF’s behavior using MPEG4 and WebM. It’ll play silently and automatically in an endless loop, just like a GIF. The code also ensures that an unsupported format has a fallback.

    <video autoplay loop muted playsinline>  
      <source src="my-funny-animation.webm" type="video/webm">
      <source src="my-funny-animation.mp4" type="video/mp4">
    </video>

    Note: Do not bother using video formats for a small batch of GIF animations; it’s not worth it. The technique comes in handy when your website makes heavy use of animated content.

    Final Thoughts

    I saw a great improvement in my app’s performance after trying out the techniques above.


    While they may not all fit your app, try them out and see what works and what doesn’t. I have compiled a list of resources that will help you enhance performance. Hopefully, they help.

    Do share your starting and final audit scores with me.

    Happy optimized coding!


    Further Reading

    Learn more – web.dev

    Other opportunities to explore:

    1. Remove unused CSS
    2. Efficiently encode images
    3. Reduce server response times (TTFB)
    4. Preload key requests
    5. Reduce the impact of third-party code

  • A Comprehensive Tutorial to Implementing OpenTracing With Jaeger

    Introduction

    Recently, there has been a lot of discussion around OpenTracing. We’ll start this blog by introducing OpenTracing, explaining what it is and why it is gaining attention. Next, we will discuss distributed tracing system Jaeger and how it helps in troubleshooting microservices-based distributed systems. We will also set up Jaeger and learn to use it for monitoring and troubleshooting purposes.

    Drift to Microservice Architecture

    Microservice architecture has now become the obvious choice for application developers. In a microservice architecture, a monolithic application is broken down into a group of independently deployed services; in simple words, an application becomes a collection of microservices. When we have a large number of such intertwined microservices working together, it’s almost impossible to map their inter-dependencies and understand the execution path of a request.

    If a monolithic application fails, it is feasible to do root cause analysis and trace the path of a transaction using a logging framework. But in a microservice architecture, logging alone fails to deliver the complete picture.

    Is this service the first one called in the chain? How do I span all these services to get insight into the application? With questions like these, debugging a set of interdependent distributed services becomes a significantly larger problem than debugging a single monolithic application, which is making OpenTracing more and more popular.

    OpenTracing

    What is Distributed Tracing?

    Distributed tracing is a method used to monitor applications, mostly those built using the microservices architecture. Distributed tracing helps to highlight what causes poor performance and where failures occur.

    How OpenTracing Fits Into This?

    The OpenTracing API provides a standard, vendor-neutral framework for instrumentation. This means that if a developer wants to try out a different distributed tracing system, then instead of repeating the whole instrumentation process for the new system, the developer can simply change the configuration of the tracer.

    OpenTracing uses basic terminologies, such as Span and Trace. You can read about them in detail here.

    OpenTracing is a way for services to “describe and propagate distributed traces without knowledge of the underlying OpenTracing implementation.”

    Let us take the example of renting a movie on a rental service like iTunes. A service like this requires many other microservices to check that the movie is available, that proper payment credentials were received, and that enough space exists on the viewer’s device for the download. If any one of those microservices fails, the entire transaction fails. In such a case, having logs just for the main rental service wouldn’t be very useful for debugging. However, if you were able to analyze each service, you wouldn’t have to scratch your head to figure out which microservice failed and what made it fail.

    In real life, applications are even more complex, and with their increasing complexity, monitoring them has become a tedious task. OpenTracing helps us easily monitor:

    • Spans of services
    • Time taken by each service
    • Latency between the services
    • Hierarchy of services
    • Errors or exceptions during execution of each service.

    Jaeger: A Distributed Tracing System by Uber

    Jaeger is an open-source distributed tracing system released by Uber Technologies. It is used for monitoring and troubleshooting microservices-based distributed systems, including:

    • Distributed transaction monitoring
    • Performance and latency optimization
    • Root cause analysis
    • Service dependency analysis
    • Distributed context propagation

    Major Components of Jaeger

    1. Jaeger Client Libraries
    2. Agent
    3. Collector
    4. Query
    5. Ingester

    Running Jaeger in a Docker Container

    1.  First, install Jaeger Client on your machine:

    $ pip install jaeger-client

    2.  Now, let’s run Jaeger backend as an all-in-one Docker image. The image launches the Jaeger UI, collector, query, and agent:

    $ docker run -d -p6831:6831/udp -p16686:16686 jaegertracing/all-in-one:latest

    TIP: To check if the Docker container is running, use: docker ps

    Once the container starts, open http://localhost:16686/  to access the Jaeger UI. The container runs the Jaeger backend with an in-memory store, which is initially empty, so there is not much we can do with the UI right now since the store has no traces.

    Creating Traces on Jaeger UI

    1.   Create a Python program to create Traces:

    Let’s generate some traces using a simple Python program. You can clone the Jaeger-Opentracing repository given below for the sample program used in this blog.

    import sys
    import time
    import logging
    import random
    from jaeger_client import Config
    from opentracing_instrumentation.request_context import get_current_span, span_in_context
    
    def init_tracer(service):
        logging.getLogger('').handlers = []
        logging.basicConfig(format='%(message)s', level=logging.DEBUG)    
        config = Config(
            config={
                'sampler': {
                    'type': 'const',
                    'param': 1,
                },
                'logging': True,
            },
            service_name=service,
        )
        return config.initialize_tracer()
    
    def booking_mgr(movie):
        with tracer.start_span('booking') as span:
            span.set_tag('Movie', movie)
            with span_in_context(span):
                cinema_details = check_cinema(movie)
                showtime_details = check_showtime(cinema_details)
                book_show(showtime_details)
    
    def check_cinema(movie):
        with tracer.start_span('CheckCinema', child_of=get_current_span()) as span:
            with span_in_context(span):
                num = random.randint(1,30)
                time.sleep(num)
                cinema_details = "Cinema Details"
                flags = ['false', 'true', 'false']
                random_flag = random.choice(flags)
                span.set_tag('error', random_flag)
                span.log_kv({'event': 'CheckCinema' , 'value': cinema_details })
                return cinema_details
    
    def check_showtime( cinema_details ):
        with tracer.start_span('CheckShowtime', child_of=get_current_span()) as span:
            with span_in_context(span):
                num = random.randint(1,30)
                time.sleep(num)
                showtime_details = "Showtime Details"
                flags = ['false', 'true', 'false']
                random_flag = random.choice(flags)
                span.set_tag('error', random_flag)
                span.log_kv({'event': 'CheckCinema' , 'value': showtime_details })
                return showtime_details
    
    def book_show(showtime_details):
        with tracer.start_span('BookShow',  child_of=get_current_span()) as span:
            with span_in_context(span):
                num = random.randint(1,30)
                time.sleep(num)
                Ticket_details = "Ticket Details"
                flags = ['false', 'true', 'false']
                random_flag = random.choice(flags)
                span.set_tag('error', random_flag)
                span.log_kv({'event': 'CheckCinema' , 'value': showtime_details })
                print(Ticket_details)
    
    assert len(sys.argv) == 2
    tracer = init_tracer('booking')
    movie = sys.argv[1]
    booking_mgr(movie)
    # yield to IOLoop to flush the spans
    time.sleep(2)
    tracer.close()

    The Python program takes a movie name as an argument and calls three functions that get the cinema details, check the showtime details, and finally book a movie ticket.

    It introduces random delays in all the functions to make things more interesting, since in reality the functions would take a certain amount of time to fetch the details. The functions also throw random errors to give us a feel for how the traces of a real-life application may look in case of failures.

    Here is a brief description of how OpenTracing has been used in the program:

    • Initializing a tracer:
    def init_tracer(service):
       logging.getLogger('').handlers = []
       logging.basicConfig(format='%(message)s', level=logging.DEBUG)   
       config = Config(
           config={
               'sampler': {
                   'type': 'const',
                   'param': 1,
               },
               'logging': True,
           },
           service_name=service,
       )
       return config.initialize_tracer()

    • Using the tracer instance:
    tracer = init_tracer('booking')

    • Starting new child spans using start_span:  
    with tracer.start_span('CheckCinema', child_of=get_current_span()) as span:

    • Using Tags:
    span.set_tag('Movie', movie)

    • Using Logs:
    span.log_kv({'event': 'CheckCinema' , 'value': cinema_details })

    2. Run the python program:

    $ python booking-mgr.py <movie-name>
    
    Initializing Jaeger Tracer with UDP reporter
    Using sampler ConstSampler(True)
    opentracing.tracer initialized to <jaeger_client.tracer.Tracer object at 0x7f72ffa25b50>[app_name=booking]
    Reporting span cfe1cc4b355aacd9:8d6da6e9161f32ac:cfe1cc4b355aacd9:1 booking.CheckCinema
    Reporting span cfe1cc4b355aacd9:88d294b85345ac7b:cfe1cc4b355aacd9:1 booking.CheckShowtime
    Ticket Details
    Reporting span cfe1cc4b355aacd9:98cbfafca3aa0fe2:cfe1cc4b355aacd9:1 booking.BookShow
    Reporting span cfe1cc4b355aacd9:cfe1cc4b355aacd9:0:1 booking.booking

    Now, check your Jaeger UI; you can see a new service, “booking”, added. Select the service and click on “Find Traces” to see its traces. Every time you run the program, a new trace is created.

    You can now compare the duration of traces through the graph shown above. You can also filter traces using the “Tags” section under “Find Traces”. For example, setting the error=true tag will filter out all the jobs that have errors, as shown:

    To view the detailed trace, you can select a specific trace instance and check details like the time taken by each service, errors during execution and logs.

    The above trace instance has four spans: the first represents the root span “booking”, the second is “CheckCinema”, the third is “CheckShowtime”, and the last is “BookShow”. In this particular trace instance, both “CheckCinema” and “CheckShowtime” have reported errors, marked by the error=true tag.

    Conclusion

    In this blog, we’ve described the importance and benefits of OpenTracing, one of the core pillars of modern applications. We also explored how the distributed tracer Jaeger collects and stores traces while revealing inefficient portions of our applications. Jaeger is fully compatible with the OpenTracing API and has clients for a number of programming languages, including Java, Go, Node.js, Python, PHP, and more.

    References

    • https://www.jaegertracing.io/docs/1.9/
    • https://opentracing.io/docs/
  • BigQuery 101: All the Basics You Need to Know

    Google BigQuery is an enterprise data warehouse built using BigTable and Google Cloud Platform. It’s serverless and completely managed. BigQuery works great with data of all sizes, from a 100-row Excel spreadsheet to several petabytes, and most importantly, it can execute complex queries on that data within a few seconds.

    One thing to note before we proceed: BigQuery is not a transactional database. It takes around 2 seconds to run a simple query like ‘SELECT * FROM bigquery-public-data.object LIMIT 10’ on a 100 KB table with 500 rows, so it shouldn’t be thought of as an OLTP (Online Transaction Processing) database. BigQuery is for Big Data!

    BigQuery supports SQL-like queries, which makes it user-friendly and beginner-friendly. It’s accessible via its web UI, command-line tool, or client libraries (written in C#, Go, Java, Node.js, PHP, Python, and Ruby). You can also take advantage of its REST APIs and get your job done by sending a JSON request.

    Now, let’s dive deeper to understand it better. Suppose you are a data scientist (or a startup that analyzes data) and you need to analyze terabytes of data. If you choose a tool like MySQL, the first step before even thinking about any query is to have an infrastructure in place that can store this magnitude of data.

    Designing this setup is itself a difficult task because you have to figure out the RAM size, whether to use DC/OS or Kubernetes, and other factors. And if you have streaming data coming in, you will need to set up and maintain a Kafka cluster. In BigQuery, all you have to do is bulk upload your CSV/JSON file, and you are done; BigQuery handles all the backend for you. If you need streaming data ingestion, you can use Fluentd. Another advantage is that you can connect Google Analytics with BigQuery seamlessly.
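
    As an illustration, here is a minimal sketch of a bulk CSV upload using the google-cloud-bigquery Python client. The project, dataset, table, and file names are hypothetical placeholders, and the sketch assumes you have already authenticated with Google Cloud credentials:

    from google.cloud import bigquery

    # Hypothetical project/dataset/table names; assumes
    # GOOGLE_APPLICATION_CREDENTIALS (or equivalent auth) is configured.
    client = bigquery.Client(project="my-project")

    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,  # skip the CSV header row
        autodetect=True,      # let BigQuery infer the schema
    )

    with open("plots.csv", "rb") as source_file:
        load_job = client.load_table_from_file(
            source_file, "my_dataset.plots", job_config=job_config
        )

    load_job.result()  # block until the load job finishes
    print(f"Loaded {load_job.output_rows} rows into my_dataset.plots")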

    BigQuery is a serverless, highly available, petabyte-scale service that allows you to execute complex SQL queries quickly. It lets you focus on analysis rather than handling infrastructure. The hardware is completely abstracted away and not visible to us, not even as virtual machines.

    Architecture of Google BigQuery

    You don’t need to know too much about the underlying architecture of BigQuery. That’s actually the whole idea of it – you don’t need to worry about architecture and operation.

    However, understanding the BigQuery architecture helps us control costs, optimize query performance, and optimize storage. BigQuery is built on the ideas described in Google’s Dremel paper.

    Quoting an Abstract from the Google Dremel Paper –

    “Dremel is a scalable, interactive ad-hoc query system for analysis of read-only nested data. By combining multi-level execution trees and columnar data layout, it is capable of running aggregation queries over trillion-row tables in seconds. The system scales to thousands of CPUs and petabytes of data and has thousands of users at Google. In this paper, we describe the architecture and implementation of Dremel and explain how it complements MapReduce-based computing. We present a novel columnar storage representation for nested records and discuss experiments on few-thousand node instances of the system.”

    Dremel has been in production at Google since 2006. Google has used it for the following tasks –

    • Analysis of crawled web documents.
    • Tracking install data for applications on Android Market.
    • Crash reporting for Google products.
    • OCR results from Google Books.
    • Spam analysis.
    • Debugging of map tiles on Google Maps.
    • Tablet migrations in managed Bigtable instances.
    • Results of tests run on Google’s distributed build system.
    • Disk I/O statistics for hundreds of thousands of disks.
    • Resource monitoring for jobs run in Google’s data centers.
    • Symbols and dependencies in Google’s codebase.

    BigQuery is much more than Dremel. Dremel is just a query execution engine, whereas BigQuery is based on interesting technologies like Borg (the predecessor of Kubernetes) and Colossus. Colossus is the successor to the Google File System (GFS), as mentioned in the Google Spanner paper.

    How Does BigQuery Store Data?

    BigQuery stores data in a columnar format called Capacitor (a successor of ColumnarIO), which achieves a very high compression ratio and scan throughput. Unlike ColumnarIO, Capacitor lets BigQuery operate directly on compressed data without decompressing it.

    Columnar storage has the following advantages:

    • Traffic minimization – When you submit a query, only the column values the query requires are scanned, and only those are transferred during query execution. E.g., a query `SELECT title FROM Collection` would access the title column values only.
    • Higher compression ratio – Columnar storage can achieve a compression ratio of 1:10, whereas ordinary row-based storage can compress at roughly 1:3.

    (Image source:  Google Dremel Paper)

    Columnar storage has the disadvantage of not working efficiently when updating existing records. That is why Dremel doesn’t support any update queries.  

    How Does the Query Get Executed?

    BigQuery depends on Borg for data processing. Borg simultaneously instantiates hundreds of Dremel jobs across required clusters made up of thousands of machines. In addition to assigning compute capacity for Dremel jobs, Borg handles fault-tolerance as well.

    Now, how do you design and execute a query that can run on thousands of nodes and fetch the results? This challenge was overcome by using the Tree Architecture, which forms a massively parallel distributed tree for pushing a query down to the leaves and aggregating the results back up at a blazingly fast speed.

    (Image source: Google Dremel Paper)

    BigQuery vs. MapReduce

    The key differences between BigQuery and MapReduce are –

    • Dremel is designed as an interactive data analysis tool for large datasets
    • MapReduce is designed as a programming framework to batch process large datasets

    Moreover, Dremel finishes most queries within seconds or tens of seconds and can even be used by non-programmers, whereas MapReduce takes much longer (sometimes even hours or days) to process a query.

    Following is a comparison of running MapReduce on row-based versus columnar storage:

    (Image source: Google Dremel Paper)

    Another important thing to note is that BigQuery is meant to analyze structured data (SQL), whereas with MapReduce you can write logic for unstructured data as well.

    Comparing BigQuery and Redshift

    In Redshift, you need to allocate different instance types and create your own clusters. The benefit of this is that it lets you tune the compute/storage to meet your needs. However, you have to be aware of (virtualized) hardware limits and scale up/out based on that. Note that you are charged by the hour for each instance you spin up.

    In BigQuery, you just upload the data and query it. It is a truly managed service. You are charged by storage, streaming inserts, and queries.

    The two data warehouses have more similarities than differences.

    A smart user will definitely take advantage of the hybrid cloud (GCE+AWS) and leverage the different services offered by both ecosystems. Check out your quintessential guide to AWS Athena here.

    Getting Started With Google BigQuery

    Following is a quick example to show how you can get started with BigQuery:

    1. There are many public datasets available on BigQuery; you are going to play with the ‘bigquery-public-data:stackoverflow’ dataset. You can click on the “Add Data” button on the left panel and select datasets.

    2. Next, find the language that has the best community, based on answer response time. You can write the following query to do that.

    WITH question_answers_join AS (
      SELECT *
        , GREATEST(1, TIMESTAMP_DIFF(answers.first, creation_date, minute)) minutes_2_answer
      FROM (
        SELECT id, creation_date, title
          , (SELECT AS STRUCT MIN(creation_date) first, COUNT(*) c
             FROM `bigquery-public-data.stackoverflow.posts_answers` 
             WHERE a.id=parent_id
          ) answers
          , SPLIT(tags, '|') tags
        FROM `bigquery-public-data.stackoverflow.posts_questions` a
        WHERE EXTRACT(year FROM creation_date) > 2014
      )
    )
    SELECT COUNT(*) questions, tag
      , ROUND(EXP(AVG(LOG(minutes_2_answer))), 2) mean_geo_minutes
      , APPROX_QUANTILES(minutes_2_answer, 100)[SAFE_OFFSET(50)] median
    FROM question_answers_join, UNNEST(tags) tag
    WHERE tag IN ('javascript', 'python', 'rust', 'java', 'scala', 'ruby', 'go', 'react', 'c', 'c++')
    AND answers.c > 0
    GROUP BY tag
    ORDER BY mean_geo_minutes

    3. Now you can execute the query and get results –

    You can see that C has the best community followed by JavaScript!

    How to do Machine Learning on BigQuery?

    Now that you have a sound understanding of BigQuery, it’s time for some real action.

    As discussed above, you can connect Google Analytics with BigQuery: go to the Google Analytics Admin panel, click the PROPERTY column, click All Products, and then click Link BigQuery. After that, enter your BigQuery ID (or project number), and BigQuery will be linked to Google Analytics. Note – right now, BigQuery integration is only available to Google Analytics 360.

    Assuming that you have already uploaded your Google Analytics data, here is how you can create a logistic regression model. Here, you are predicting whether a website visitor will make a transaction or not.

    CREATE MODEL `velotio_tutorial.sample_model`
    OPTIONS(model_type='logistic_reg') AS
    SELECT
      IF(totals.transactions IS NULL, 0, 1) AS label,
      IFNULL(device.operatingSystem, "") AS os,
      device.isMobile AS is_mobile,
      IFNULL(geoNetwork.country, "") AS country,
      IFNULL(totals.pageviews, 0) AS pageviews
    FROM
      `bigquery-public-data.google_analytics_sample.ga_sessions_*`
    WHERE
      _TABLE_SUFFIX BETWEEN '20180630' AND '20190401'

    This creates a model named ‘velotio_tutorial.sample_model’, with ‘model_type’ set to ‘logistic_reg’ because you want to train a logistic regression model. A logistic regression model splits input data into two classes and gives the probability that the data is in one of the classes. You usually use logistic regression for “spam or not spam” types of problems; the problem here is similar: will a transaction be made or not.

    The above query gets the total number of page views, the country from which the session originated, the operating system of the visitor’s device, the total number of e-commerce transactions within the session, etc.

    Now you just click Run Query to execute it.
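
    Once the model finishes training, you can query it with BigQuery ML functions such as ML.EVALUATE and ML.PREDICT. Here is a minimal sketch of how that might look from the google-cloud-bigquery Python client, assuming the velotio_tutorial.sample_model created above exists in your project:

    from google.cloud import bigquery

    client = bigquery.Client()  # assumes authenticated Google Cloud credentials

    # Evaluate the trained model's metrics (precision, recall, and so on).
    eval_sql = """
    SELECT *
    FROM ML.EVALUATE(MODEL `velotio_tutorial.sample_model`)
    """
    for row in client.query(eval_sql).result():
        print(dict(row.items()))

    # Predict transactions per country, using the same features as training.
    predict_sql = """
    SELECT country, SUM(predicted_label) AS total_predicted_transactions
    FROM ML.PREDICT(MODEL `velotio_tutorial.sample_model`, (
      SELECT
        IFNULL(device.operatingSystem, "") AS os,
        device.isMobile AS is_mobile,
        IFNULL(geoNetwork.country, "") AS country,
        IFNULL(totals.pageviews, 0) AS pageviews
      FROM `bigquery-public-data.google_analytics_sample.ga_sessions_*`))
    GROUP BY country
    ORDER BY total_predicted_transactions DESC
    LIMIT 10
    """
    for row in client.query(predict_sql).result():
        print(row.country, row.total_predicted_transactions)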

    Conclusion

    BigQuery is a query service that allows us to run SQL-like queries against multiple terabytes of data in a matter of seconds. If you have structured data, BigQuery is the best option to go for. It can help even a non-programmer to get the analytics right!

    Learn how to build an ETL Pipeline for MongoDB & Amazon Redshift using Apache Airflow.

    If you need help with using machine learning in product development for your organization, connect with experts at Velotio!

  • How to set up an iOS app with an Apple developer account and TestFlight from scratch

    In this article, we will discuss how to set up the Apple developer account, build an app (create IPA files), configure TestFlight, and deploy it to TestFlight for the very first time.

    There are tons of articles explaining how to configure and build an app, how to set up TestFlight, or how to set up an application for ad hoc distribution. However, most of them are either outdated or missing steps, which can be misleading for someone who is doing it for the very first time.

    If you haven’t done this before, don’t worry: just work through the minute details of this article, follow every step correctly, and you will be able to set up your iOS application end-to-end, ready for TestFlight or ad hoc distribution, within an hour.

    Prerequisites

    Before we start, please make sure you have:

    • A React Native Project created and opened in the XCode
    • XCode set up on your Mac
    • An Apple developer account with access to create Identifiers and Certificates, i.e., at least Developer or Admin access – https://developer.apple.com/account/
    • Access to App Store Connect with your Apple developer account – https://appstoreconnect.apple.com/
    • If you don’t have an Apple developer account yet, please get one created first.

    The Setup contains 4 major steps: 

    • Creating Certificates, Identifiers, and Profiles from your Apple Developer account
    • Configuring the iOS app using these Identifiers, Certificates, and Profiles in XCode
    • Setting up TestFlight and Internal Testers group on App Store Connect
    • Generating iOS builds, signing them, and uploading them to TestFlight on App Store Connect

    Certificates, Identifiers, and Profiles

    Before we do anything, we need to create:

    • Bundle Identifier, which is an app bundle ID and a unique app identifier used by the App Store
    • A Certificate – to sign the iOS app before submitting it to the App Store
    • Provisioning Profile – for linking bundle ID and certificates together

    Bundle Identifiers

    For the App Store to recognize your app uniquely, we need to create a unique Bundle Identifier.

    Go to https://developer.apple.com/account: you will see the Certificates, Identifiers & Profiles tab. Click on Identifiers. 

    Click the Plus icon next to Identifiers:

    Select the App IDs option from the list of options and click Continue:

    Select App from app types and click Continue

    On the next page, you will need to enter the app ID and, optionally, select the services your application needs (you can enable them in the future when you actually implement them).

    Keep those unselected for now as we don’t need them for this setup.

    Once you have filled in all the information, please click Continue and register your Bundle Identifier.

    Generating Certificate

    Certificates can be generated in 2 ways:

    • By automatically managing certificates from Xcode
    • By manually generating them

    We will generate them manually.

    To create a certificate, we need a Certificate Signing Request, which is generated from your Mac’s Keychain Access application.

    Creating Certificate Signing Request:

    Open the Keychain Access application and click on the Keychain Access menu item at the top left of the screen.

    Select Certificate Assistant -> Request a Certificate From a Certificate Authority.

    Enter the required information like email address and name, then select the Save to Disk option.

    Click Continue and save this form somewhere you can easily find it, so you can upload it to your Apple developer account.

    Now head back to the Apple developer account and click on Certificates. Again, click on the + icon next to the Certificates title, and you will be taken to the new certificate form.

    Select the iOS Distribution (App Store and ad hoc) option. Here, you can select the required services this certificate will need from a list of options (for example, Apple Push Notification service). 

    As we don’t need any services, ignore it for now and click continue.

    On the next screen, upload the certificate signing request form we generated in the last step and click Continue.

    At this step, your certificate will be generated and will be available to download.

    NOTE: The certificate can be downloaded only once, so please download it and keep it in a secure location to use it in the future.

    Download your certificate and install it by clicking on the downloaded certificate file. The certificate will be installed on your Mac and can be used for generating builds in the next steps.

    You can verify this by going back to the Keychain Access app and checking that the newly installed certificate appears in the certificates list.

    Generating a Provisioning Profile

    Now link your identifier and certificate together by creating a provisioning profile.

    Let’s go back to the Apple developer account, select the profiles option, and select the + icon next to the Profiles title.

    You will be redirected to the new Profiles form page.

    Select Distribution Profile and click continue:

    Select the App ID we created in the first step and click Continue:

    Now, select the certificate we created in the previous step:

    Enter a Provisioning Profile name and click Generate:

    Once the profile is generated, it will be available to download. Please download it and keep it in the same location where you kept the certificate, for future use.

    Configure App in XCode

    Now, we need to configure our iOS application using the bundle ID and the Apple developer account we used for generating the certificate and profiles.

    Open the <appname>.xcworkspace file in XCode and click on the app name in the left pane. It will open the app configuration page.

    Select the app from Targets, go to Signing & Capabilities, and enter the bundle identifier.

    Now, to automatically manage the provisioning profile, we need to download the provisioning profile we generated recently. 

    For this, we need to sign into XCode using your Apple ID.

    Select Preferences from the top left XCode Menu option, go to Accounts, and click on the + icon at the bottom.

    Select Apple ID from the list of account types you can add, click Continue, and enter the Apple ID.

    It will prompt you to enter the password as well.

    Once successfully logged in, XCode will fetch all the provisioning profiles associated with this account. Verify that you see your project in the Teams section of this account page.

    Now, go back to the XCode Signing Capabilities page, select Automatically Manage Signing, and then select the required team from the Team dropdown.

    At this point, your application will be able to generate Archives that you can upload to TestFlight or sign ad hoc to distribute through other mediums (Diawi, etc.).

    Setup TestFlight

    TestFlight and App Store management are handled through the App Store Connect portal.

    Open the App Store Connect portal and log in to the application.

    After you log in, please make sure you have selected the correct team from the top right corner (you can check the team name just below the user name).

    Select My Apps from the list of options. 

    If this is the first time you are setting up an application on this team, you will see the + (Add app) option at the center of the page, but if your team has already set up applications, you will see the + icon right next to Apps Header.

    Click on the + icon and select New App Option:

    Enter the complete app details, like platform (iOS, macOS, or tvOS), app name, bundle ID (the one we created), SKU, and access type, and click the Create button.

    You should now be able to see your newly created application in the Apps menu. Select the app and go to TestFlight. You will see no builds there, as we have not pushed any yet.

    Generate and upload the build to TestFlight

    At this point, we are fully ready to generate a build from XCode and push it to TestFlight. To do this, head back to XCode.

    On the top middle section, you will see your app name and a right arrow. There might be an iPhone or another simulator selected; please click on the options list and select Any iOS Device.

    Select the Product menu from the Menu list and click on the Archive option.

    Once the archive succeeds, XCode will open the Organizer window (you can also open this page from the Windows Menu list).

    Here, we sign our application archive (build) using the certificate we created and upload it to the App Store Connect TestFlight.

    In the Organizer window, you will see the recently generated build. Please select the build and click the Distribute button on the right panel of the Organizer page.

    On the next page, select App Store Connect from the “Select a method of distribution” window and click Continue.

    NOTE: We are selecting the App Store Connect option as we want to upload a build to TestFlight, but if you want to distribute it privately using other channels, please select the Ad Hoc option.

    Select Upload from the “Select a Destination” options and click Continue. This will prepare your build for submission to App Store Connect TestFlight.

    The first time, it will ask you how you want to sign the build: automatically or manually.

    Please Select Automatically and click the Next button.

    XCode may ask you to authenticate your certificate using your system password. Please authenticate it and wait until XCode uploads the build to TestFlight.

    Once the build is uploaded successfully, XCode will prompt you with the Success modal.

    Now, your app is uploaded to TestFlight and is being processed. This processing takes 5 to 15 minutes, after which TestFlight makes it available for testing.

    Add Internal Testers and other teammates to TestFlight

    Once we are done with all the setup and have uploaded the build to TestFlight, we need to add internal testers to TestFlight.

    This is a 2-step process. First, you need to add a user to App Store Connect and then add a user to TestFlight.

    Go to Users and Access.

    Add a new user; App Store Connect sends an invitation to the user.

    Once the user accepts the invitation, go to TestFlight -> Internal Testing.

    In the Internal Testing section, create a new testing group if one does not exist already, and add the user to the TestFlight testing group.

    Now, you should be able to configure the app, upload it to TestFlight, and add users to the TestFlight testing group.

    Hopefully, you enjoyed this article and it helped you set up your iOS application end-to-end quickly, without too much confusion.

    Thanks.

  • A Beginner’s Guide to Edge Computing

    In the world of data centers with wings and wheels, there is an opportunity to offload some work from centralized cloud computing by moving less compute-intensive tasks to other components of the architecture. In this blog, we will explore the upcoming frontier of the web – Edge Computing.

    What is the “Edge”?

    The ‘Edge’ refers to computing infrastructure that sits closer to the source of data. It is a distributed framework in which data is processed as close to the originating data source as possible. This infrastructure makes effective use of resources that may not be continuously connected to a network, such as laptops, smartphones, tablets, and sensors. Edge Computing covers a wide range of technologies, including wireless sensor networks, cooperative distributed peer-to-peer ad-hoc networking and processing (also classifiable as local cloud/fog computing), mobile edge computing, distributed data storage and retrieval, autonomic self-healing networks, remote cloud services, augmented reality, and more.

    Cloud Computing is expected to go through a phase of decentralization. Edge Computing embodies the idea of bringing compute, storage, and networking closer to the consumer.

    But Why?

    Legit question! Why do we even need Edge Computing? What are the advantages of having this new infrastructure?

    Imagine the case of a self-driving car that is continuously sending a live stream to central servers, and the car has to make a crucial decision. The consequences can be disastrous if the car waits for the central servers to process the data and respond. Although algorithms like YOLO_v2 have sped up object detection, the latency lies in the part of the system where the car has to send terabytes to the central server, receive the response, and then act. Hence, we need the basic processing, like deciding when to stop or decelerate, to be done in the car itself.

    The goal of Edge Computing is to minimize latency by bringing public cloud capabilities to the edge. This can be achieved in two forms: a custom software stack emulating the cloud services on existing hardware, or the public cloud seamlessly extended to multiple point-of-presence (PoP) locations.

    Following are some promising reasons to use Edge Computing:

    1. Privacy: Avoid sending all raw data to be stored and processed on cloud servers.
    2. Real-time responsiveness: Sometimes the reaction time can be a critical factor.
    3. Reliability: The system is capable of working even when disconnected from cloud servers. This removes a single point of failure.

    To understand the points mentioned above, let’s take the example of a device that responds to a hot keyword, like Jarvis from Iron Man. Imagine if your personal Jarvis sent all of your private conversations to a remote server for analysis. Instead, it is intelligent enough to respond on its own when it is called. At the same time, it is real-time and reliable.

    Intel CEO Brian Krzanich said at an event that autonomous cars will generate 40 terabytes of data for every eight hours of driving. With that flood of data, transmission time goes up substantially. For self-driving cars, real-time or quick decisions are an essential need, and this is where edge computing infrastructure comes to the rescue. These self-driving cars need to decide in a split second whether to stop or not, or else the consequences can be disastrous.

    Another example is drones or quadcopters: let’s say we are using them to identify people or deliver relief packages. The machines should be intelligent enough to make basic decisions locally, like changing their path to avoid obstacles.

    Forms of Edge Computing

    Device Edge:

    In this model, Edge Computing is taken to the customers in the existing environments. For example, AWS Greengrass and Microsoft Azure IoT Edge.

    Cloud Edge:

    This model of Edge Computing is basically an extension of the public cloud. Content Delivery Networks are classic examples of this topology, in which static content is cached and delivered through geographically spread edge locations.

    Vapor IO is an emerging player in this category, attempting to build infrastructure for the cloud edge. Vapor IO has various products, like the Vapor Chamber. These are self-monitored: they have embedded sensors through which they are continuously monitored and evaluated by the Vapor Edge Controller (VEC) software. Vapor IO has also built OpenDCRE, which we will see later in this blog.

    The fundamental difference between device edge and cloud edge lies in the deployment and pricing models. The two models suit different use cases, and sometimes it may be an advantage to deploy both.

    Edges around you

    Edge Computing examples can be increasingly found around us:

    1. Smart street lights
    2. Automated Industrial Machines
    3. Mobile devices
    4. Smart Homes
    5. Automated Vehicles (cars, drones etc)

    Data transmission is expensive. By bringing compute closer to the origin of the data, latency is reduced and end users get a better experience. Some of the evolving use cases of Edge Computing are Augmented Reality (AR), Virtual Reality (VR), and the Internet of Things. For example, the rush people got while playing an augmented-reality-based Pokemon game wouldn’t have been possible if “real-timeliness” were not present in the game; it was made possible because the smartphone itself was doing the AR, not the central servers. Even Machine Learning (ML) can benefit greatly from Edge Computing: all the heavy-duty training of ML algorithms can be done in the cloud, and the trained model can be deployed on the edge for near real-time or even real-time predictions. In today’s data-driven world, edge computing is becoming a necessary component.

    There is a lot of confusion between Edge Computing and IoT. Stated simply, Edge Computing is, in a way, an intelligent Internet of Things (IoT), and it complements traditional IoT. In the traditional model of IoT, all the devices, like sensors, mobiles, and laptops, are connected to a central server. Now imagine a case where you command your lamp to switch off: for such a simple task, the data needs to be transmitted to the cloud and analyzed there before the lamp receives the command to switch off. Edge Computing brings the computation closer to your home: either the fog layer between the lamp and the cloud servers is smart enough to process the data, or the lamp itself is.

    If we look at the image below, it shows a standard IoT implementation where everything is centralized, while the Edge Computing philosophy talks about decentralizing the architecture.

    The Fog  

    Sandwiched between the edge layer and the cloud layer is the Fog Layer, which bridges the other two layers.

    The difference between fog and edge computing is described in this article:

    • Fog Computing – pushes intelligence down to the local area network level of the network architecture, processing data in a fog node or IoT gateway.
    • Edge Computing – pushes the intelligence, processing power, and communication capabilities of an edge gateway or appliance directly into devices like programmable automation controllers (PACs).

    How do we manage Edge Computing?

    Device Relationship Management, or DRM, refers to managing and monitoring interconnected components over the internet. AWS offers IoT Core and Greengrass, Nebbiolo Technologies has developed Fog Node and Fog OS, and Vapor IO has OpenDCRE, with which one can control and monitor data centers.

    The following image (source – AWS) shows how to manage ML on Edge Computing using AWS infrastructure.

    AWS Greengrass makes it possible for users to use Lambda functions to build IoT devices and application logic. Specifically, AWS Greengrass provides cloud-based management of applications that can be deployed for local execution. Locally deployed Lambda functions are triggered by local events, messages from the cloud, or other sources.
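
    For a flavor of what such a locally deployed function looks like, here is a minimal sketch of a Greengrass (v1) Lambda handler in Python. The topic name and payload fields are hypothetical, and it assumes the Greengrass Core SDK (greengrasssdk) is available on the device:

    import json

    import greengrasssdk  # available on a Greengrass (v1) core device

    # Client for publishing MQTT messages locally or up to AWS IoT Core.
    client = greengrasssdk.client('iot-data')

    def function_handler(event, context):
        # React to a local event (e.g., a sensor reading) at the edge,
        # without a round trip to the cloud.
        reading = event.get('temperature', 0)
        state = 'ALERT' if reading > 75 else 'OK'

        # 'traffic/status' is a hypothetical topic name.
        client.publish(
            topic='traffic/status',
            payload=json.dumps({'state': state, 'temperature': reading}),
        )
        return {'state': state}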

    This GitHub repo demonstrates a traffic light example using two Greengrass devices, a light controller, and a traffic light.

    Conclusion

    We believe that next-gen computing will be influenced a lot by Edge Computing and will continue to explore new use-cases that will be made possible by the Edge.

  • Setting Up A Single Sign On (SSO) Environment For Your App

    Single Sign On (SSO) makes it simple for users to begin using an application. Support for SSO is crucial for enterprise apps, as many corporate security policies mandate that all applications use certified SSO mechanisms. While the SSO experience is straightforward, the SSO standard is anything but straightforward. It’s easy to get confused when you’re surrounded by complex jargon, including SAML, OAuth 1.0, 1.0a, 2.0, OpenID, OpenID Connect, JWT, and tokens like refresh tokens, access tokens, bearer tokens, and authorization tokens. Standards documentation is too precise to allow generalization, and vendor literature can make you believe it’s too difficult to do it yourself.

    I’ve created SSO for a lot of applications in the past. Knowing your target market, norms, and platform is crucial.

    Single Sign On

    Single Sign On is an authentication method that allows applications to securely authenticate users across numerous applications using just one set of login credentials.

    This allows applications to avoid the hassle of storing and managing user information like passwords, and it also cuts down on troubleshooting login-related issues. With SSO configured, applications check with the SSO provider (Okta, Google, Salesforce, Microsoft) to verify the user’s identity.

    Types of SSO

    • Security Assertion Markup Language (SAML)
    • OpenID Connect (OIDC)
    • OAuth (specifically OAuth 2.0 nowadays)
    • Federated Identity Management (FIM)

    Security Assertion Markup Language – SAML

    SAML (Security Assertion Markup Language) is an open standard that enables identity providers (IdPs) to send authorization credentials to service providers (SPs), meaning you can use one set of credentials to log in to many different websites. It’s considerably easier to manage a single login per user than to handle separate logins to email, CRM software, Active Directory, and other systems.

    For standardized interactions between the identity provider and service providers, SAML transactions employ Extensible Markup Language (XML). SAML is the link between a user’s identity authentication and authorization to use a service.

    In our example implementation, we will be using SAML 2.0 as the standard for the authentication flow.

    Technical details

    • A Service Provider (SP) is the entity that provides the service, typically in the form of an application. Examples: Google services such as GDrive, Meet, and Gmail.
    • An Identity Provider (IdP) is the entity that provides identities, including the ability to authenticate a user. The user profile is normally stored in the Identity Provider and includes additional information about the user, such as first name, last name, job code, phone number, and address. Depending on the application, some service providers might require a very simple profile (username, email), while others may need a richer set of user data (department, job code, address, location, and so on). Examples: Active Directory, Okta Inbuilt IdP, Salesforce IdP, Google Suite.
    • The SAML sign-in flow initiated by the Identity Provider is referred to as an IdP-initiated sign-in. In this flow, the Identity Provider sends a SAML response to the Service Provider to assert the user’s identity, rather than the flow being triggered by a redirect from the Service Provider. When a Service Provider initiates the SAML sign-in process, it is referred to as an SP-initiated sign-in. This is often triggered when end-users try to access a protected resource, such as when the browser tries to load a page from a protected network share.

    Configuration details

    • Certificate – To validate the signature on SAML responses, the SP must have the IdP’s public certificate. The certificate is stored on the SP side and used whenever a SAML response is received.
    • Assertion Consumer Service (ACS) Endpoint – Sometimes referred to simply as the SP sign-in URL. This is the endpoint supplied by the SP for posting SAML responses; the SP must share this information with the IdP.
    • IdP Sign-in URL – This is the endpoint where SAML requests are posted on the IdP side. This information must be obtained by the SP from the IdP.
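
    To see how these three configuration pieces fit together in code, here is a minimal sketch using the python3-saml library. The URLs and certificate value are hypothetical placeholders, not a drop-in configuration:

    from onelogin.saml2.auth import OneLogin_Saml2_Auth

    # Hypothetical SP/IdP values; in practice these come from your IdP's metadata.
    settings = {
        "sp": {
            "entityId": "https://app.example.com/metadata",
            "assertionConsumerService": {
                # The ACS endpoint where the IdP posts SAML responses.
                "url": "https://app.example.com/acs",
                "binding": "urn:oasis:names:tc:SAML:2.0:bindings:HTTP-POST",
            },
        },
        "idp": {
            "entityId": "https://idp.example.com/metadata",
            "singleSignOnService": {
                # The IdP sign-in URL where SAML requests are posted.
                "url": "https://idp.example.com/sso",
                "binding": "urn:oasis:names:tc:SAML:2.0:bindings:HTTP-Redirect",
            },
            # The IdP's public certificate, used to validate signatures.
            "x509cert": "MIIC...",
        },
    }

    def login(request_data):
        # request_data describes the current HTTP request (host, scheme, etc.).
        auth = OneLogin_Saml2_Auth(request_data, settings)
        # Returns the IdP URL to redirect the browser to (SP-initiated sign-in).
        return auth.login()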

    OpenID Connect – OIDC

    The OIDC protocol is built on top of the OAuth 2.0 framework. OIDC authenticates the identity of a specific user, while OAuth 2.0 allows two applications to trust each other and exchange data.

    So, while the main flow appears to be the same, the labels are different.

    How are SAML and OIDC similar?

    The basic login flow for both is the same.

    1. A user tries to log into the application directly.

    2. The program sends the user’s login request to the IdP via the browser.

    3. The user logs in to the IdP or confirms that they are already logged in.

    4. The IdP verifies that the user has permission to use the program that initiated the request.

    5. Information about the user is sent from the IdP to the user’s browser.

    6. Their data is subsequently forwarded to the application.

    7. The application verifies that they have permission to use the resources.

    8. The user has been granted access to the program.
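
    To make step 2 concrete for the OIDC case, here is a minimal sketch of how an application might construct the authorization request that sends the user's browser to the IdP. The endpoint, client ID, and redirect URI are hypothetical placeholders:

    from urllib.parse import urlencode

    # Hypothetical values; in practice these come from the IdP's OIDC
    # discovery document and your client registration.
    AUTHORIZATION_ENDPOINT = "https://idp.example.com/authorize"
    CLIENT_ID = "my-client-id"
    REDIRECT_URI = "https://app.example.com/callback"

    def build_authorization_url(state, nonce):
        # Standard OIDC authorization-code flow parameters.
        params = {
            "response_type": "code",   # ask for an authorization code
            "client_id": CLIENT_ID,
            "redirect_uri": REDIRECT_URI,
            "scope": "openid profile email",
            "state": state,            # CSRF protection
            "nonce": nonce,            # binds the ID Token to this request
        }
        return f"{AUTHORIZATION_ENDPOINT}?{urlencode(params)}"

    # The application redirects the user's browser to this URL (step 2).
    print(build_authorization_url(state="abc123", nonce="xyz789"))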

    Difference between SAML and OIDC

    1. SAML transmits user data in XML, while OpenID Connect transmits data in JSON.

    2. SAML calls the data it sends an assertion. OIDC calls the data it sends a claim.

    3. In SAML, the application or system the user is trying to get into is referred to as the Service Provider. In OIDC, it’s called the Relying Party.

    SAML vs. OIDC

    1. OpenID Connect is becoming increasingly popular. Because it interacts with RESTful API endpoints, it is easier to implement than SAML and is easily accessible through APIs. This also means that it is considerably more compatible with mobile apps.

    2. You won’t often have a choice between SAML and OIDC when configuring Single Sign On (SSO) for an application through an identity provider like OneLogin. If you do have a choice, it is important to understand not only the differences between the two, but also which one is more likely to be sustained over time. OIDC appears to be the clear winner at this time because developers find it much easier to work with as it is more versatile.

    Use Cases

    1. SAML with OIDC:

    – Log in with Salesforce: SAML Authentication where Salesforce was used as the IdP and the web application as the SP.

    Key Reason:

    All users are centrally managed in Salesforce, so SAML was the preferred choice for authentication.

    – Log in with Okta: OIDC Authentication where Okta was used as the IdP and the web application as the SP.

    Key Reason:

    Okta Active Directory (AD) integration is already used for user provisioning and de-provisioning of all internal users and employees; it enables them to integrate Okta with any on-premise AD.

    In both implementations, user provisioning and de-provisioning take place on the IdP side.

    SP-initiated (From web application)

    IdP-initiated (From Okta Active Directory)

    2. Only OIDC login flow:

    • OIDC Authentication where Google, Salesforce, Office365, and Okta are used as IdP and the web application as SP.

    Why not use OAuth for SSO

    1. OAuth 2.0 is not a protocol for authentication; its documentation explicitly states this.

    2. With authentication, you’re basically attempting to figure out who the user is, when they authenticated, and how they authenticated. These questions are usually answered with SAML assertions rather than access tokens and permission grants.

    OIDC vs. OAuth 2.0

    • OAuth 2.0 is a framework that allows a user of a service to grant a third-party application access to the service’s data without revealing the user’s credentials (ID and password).
    • OpenID Connect is a framework on top of OAuth 2.0 where a third-party application can obtain a user’s identity information which is managed by a service. OpenID Connect can be used for SSO.
    • In the OAuth flow, the Authorization Server returns only an Access Token. In the OpenID Connect flow, the Authorization Server returns both an Access Token and an ID Token. A JSON Web Token, or JWT, is a specially formatted string of characters that serves as the ID Token. The Client can extract information from the JWT, such as your ID, name, when you logged in, the expiration of the ID Token, and whether the JWT has been tampered with, as the sketch below shows.
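
    As an illustration of that last point, here is a minimal sketch of verifying and decoding an ID Token with the PyJWT library. The token string, public key, client ID, and issuer are hypothetical placeholders; in practice the key comes from the IdP's JWKS endpoint:

    import jwt  # PyJWT

    # Hypothetical inputs; the ID Token comes back from the Authorization Server.
    id_token = "eyJhbGciOiJSUzI1NiIs..."
    idp_public_key = "-----BEGIN PUBLIC KEY-----\n...\n-----END PUBLIC KEY-----"

    claims = jwt.decode(
        id_token,
        idp_public_key,
        algorithms=["RS256"],        # reject tokens signed with other algorithms
        audience="my-client-id",     # must match your client ID
        issuer="https://idp.example.com",
    )

    # Standard OIDC claims: who the user is and when the token expires.
    print(claims["sub"], claims.get("name"), claims["exp"])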

    Federated Identity Management (FIM)

    Identity Federation, also known as federated identity management, is a system that allows users from different organizations to use the same verification method to access applications and other resources.

    In short, it’s what allows you to sign in to Spotify with your Facebook account.

    • Single Sign On (SSO) is a subset of identity federation.
    • SSO generally enables users to use a single set of credentials to access multiple systems within a single organization, while FIM enables users to access systems across different organizations.

    How does FIM work?

    • To log in to their home network, users authenticate to their home security domain.
    • After authenticating to their home domain, users attempt to connect to a remote application that employs identity federation.
    • Instead of the remote application authenticating the user itself, the user is prompted to authenticate through their home authentication server.
    • The user’s home authentication server vouches for the user to the remote application, and the user is permitted to access the application.

    A user can log in once to their home domain; remote apps in other domains can then grant the user access without an additional login process.

    Applications:

    • Auth0: Auth0 uses OpenID Connect and OAuth 2.0 to authenticate users and get their permission to access protected resources. Auth0 allows developers to design and deploy applications and APIs that handle authentication and authorization concerns, such as the OIDC/OAuth 2.0 protocols, with ease.
    • AWS Cognito
    • User pools – In Amazon Cognito, a user pool is a user directory. With a user pool, your users can sign in to your web or mobile app through Amazon Cognito or federate through a third-party identity provider (IdP). All members of the user pool have a directory profile that you can access using an SDK, whether they sign in directly or through a third party.
    • Identity pools – An identity pool allows your users to get temporary AWS credentials for services like Amazon S3 and DynamoDB.
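
    For example, here is a minimal sketch of exchanging an identity-pool identity for temporary AWS credentials with boto3. The identity pool ID is a hypothetical placeholder, and an unauthenticated (guest) identity is assumed for brevity, so the pool must allow unauthenticated identities:

    import boto3

    client = boto3.client('cognito-identity', region_name='us-east-1')

    # 'us-east-1:example-pool-id' is a hypothetical identity pool ID.
    identity = client.get_id(IdentityPoolId='us-east-1:example-pool-id')

    # Exchange the identity for temporary, limited-privilege AWS credentials.
    resp = client.get_credentials_for_identity(IdentityId=identity['IdentityId'])
    credentials = resp['Credentials']

    # The temporary credentials can now sign requests to services like S3.
    s3 = boto3.client(
        's3',
        aws_access_key_id=credentials['AccessKeyId'],
        aws_secret_access_key=credentials['SecretKey'],
        aws_session_token=credentials['SessionToken'],
    )
    print('Credentials expire at', credentials['Expiration'])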

    Conclusion:

    I hope you found the summary of my SSO research beneficial. The optimum implementation approach is determined by your unique situation, technological architecture, and business requirements.