Category: Services

  • Building Google Photos Alternative Using AWS Serverless

    Being an avid Google Photos user, I really love some of its features, such as album, face search, and unlimited storage. However, when Google announced the end of unlimited storage on June 1st, 2021, I started thinking about how I could create a cheaper solution that would meet my photo backup requirement.

    “Taking an image, freezing a moment, reveals how rich reality truly is.”

    – Anonymous

    Google offers 100 GB of storage for 130 INR. This storage can be used across various Google applications. However, I don’t use all the space in one go. For me, I snap photos randomly. Sometimes, I visit places and take random snaps with my DSLR and smartphone. So, in general, I upload approximately 200 photos monthly. The size of these photos varies in the range of 4MB to 30MB. On average, I may be using 4GB of monthly storage for backup on my external hard drive to keep raw photos, even the bad ones. Photos backed up on the cloud should be visually high-quality, and it’s good to have a raw copy available at the same time, so that you may do some lightroom changes (although I never touch them 😛). So, here is my minimal requirement:

    • Should support social authentication (Google sign-in preferred).
    • Photos should be stored securely in raw format.
    • Storage should be scaled with usage.
    • Uploading and downloading photos should be easy.
    • Web view for preview would be a plus.
    • Should have almost no operations headache and solution should be as cheap as possible 😉.

    Selecting Tech Stack

    To avoid operation headaches with servers going down, scaling, or maybe application crashing and overall monitoring, I opted for a serverless solution with AWS. The AWS S3 is infinite scalable storage and you only pay for the amount of storage you used. On top of that, you can opt for the S3 storage class, which is efficient and cost-effective.

    – Infrastructure Stack

    1. AWS API Gateway (http api)
    2. AWS Lambda (for processing images and API gateway queries)
    3. Dynamodb (for storing image metadata)
    4. AWS Cognito (for authentication)
    5. AWS S3 Bucket (for storage and web application hosting)
    6. AWS Certificate Manager (to use SSL certificate for a custom domain with API gateway)

    – Software Stack

    1. NodeJS
    2. ReactJS and Material-UI (front-end framework and UI)
    3. AWS Amplify (for simplifying auth flow with cognito)
    4. Sharp (high-speed nodejs library for converting images)
    5. Express and serversless-http
    6. Infinite Scroller (for gallery view)
    7. Serverless Framework (for ease of deployment and Infrastructure as Code)

    Create S3 Buckets:

    We will create three S3 buckets. Create one for hosting a frontend application (refer to architecture diagram, more on this discussed later in the build and hosting part). The second one is for temporarily uploading images. The third one is for actual backup and storage (enable server-side encryption on this bucket). A temporary upload bucket will process uploaded images. 

    During pre-processing, we will resize the original image into two different sizes. One is for thumbnail purposes (400px width), another one is for viewing purposes, but with reduced quality (webp format). Once images are resized, upload all three (raw, thumbnail, and webview) to the third S3 bucket and create a record in dynamodb. Set up object expiry policy on the temporary bucket for 1 day. This way, uploaded objects are automatically deleted from the temporary bucket.

    Setup trigger on the temporary bucket for uploaded images:

    We will need to set up an S3 PUT event, which will trigger our Lambda function to download and process images. We will filter the suffix jpg (and jpeg) for an event trigger, meaning that any file with extension .jpg and .jpeg uploaded to our temporary bucket will automatically invoke a lambda function with the event payload. The lambda function with the help of the event payload will download the uploaded file and perform processing. Your serverless function definition would look like:

    functions:
     lambda:
       handler: index.handler
       memorySize: 512
       timeout: 60
       layers:
         - {Ref: PhotoParserLibsLambdaLayer}
       events:
         - s3:
             bucket: your-temporary-bucket-name
             event: s3:ObjectCreated:*
             rules:
               - suffix: .jpg
             existing: true
         - s3:
             bucket: your-temporary-bucket-name
             event: s3:ObjectCreated:*
             rules:
               - suffix: .jpeg
             existing: true

    Notice that in the YAML events section, we set “existing:true”. This ensures that the bucket will not be created during the serverless deployment. However, if you plan not to manually create your s3 bucket, you can let the framework create a bucket for you.

    DynamoDB as metadatadb:

    AWS dynamodb is a key-value document db that is suitable for our use case. Dynamodb will help us retrieve the list of photos available in the time series. Dynamodb uses a primary key for uniquely identifying each record. A primary key can be composed of a hash key and range key (also called a sort key). A range key is optional. We will use a federated identity ID (discussed in setup authorization) as the hash key (partition key) and name it the username for attribute definition with the type string. We will use the timestamp attribute definition name as a range key with a type number. Range key will help us query results with time-series (Unix epoch). We can also use dynamodb secondary indexes to sort results more specifically. However, to keep the application simple, we’re going to opt-out of this feature for now. Your serverless resource definition would look like:

    resources:
     Resources:
       MetaDataDB:
         Type: AWS::DynamoDB::Table
         Properties:
           TableName: your-dynamodb-table-name
           AttributeDefinitions:
             - AttributeName: username
               AttributeType: S
             - AttributeName: timestamp
               AttributeType: N
           KeySchema:
             - AttributeName: username
               KeyType: HASH
             - AttributeName: timestamp
               KeyType: RANGE
           BillingMode: PAY_PER_REQUEST

    Finally, you also need to set up the IAM role so that the process image lambda function would have access to the S3 bucket and dynamodb. Here is the serverless definition for the IAM role.

    # you can add statements to the Lambda function's IAM Role here
     iam:
       role:
         statements:
         - Effect: "Allow"
           Action:
             - "s3:ListBucket"
           Resource:
             - arn:aws:s3:::your-temporary-bucket-name
             - arn:aws:s3:::your-actual-photo-bucket-name
         - Effect: "Allow"
           Action:
             - "s3:GetObject"
             - "s3:DeleteObject"
           Resource: arn:aws:s3:::your-temporary-bucket-name/*
         - Effect: "Allow"
           Action:
             - "s3:PutObject"
           Resource: arn:aws:s3:::your-actual-photo-bucket-name/*
         - Effect: "Allow"
           Action:
             - "dynamodb:PutItem"
           Resource:
             - Fn::GetAtt: [ MetaDataDB, Arn ]

    Setup Authentication:

    Okay, to set up a Cognito user pool, head to the Cognito console and create a user pool with below config:

    1. Pool Name: photobucket-users

    2. How do you want your end-users to sign in?

    • Select: Email Address or Phone Number
    • Select: Allow Email Addresses
    • Check: (Recommended) Enable case insensitivity for username input

    3. Which standard attributes are required?

    • email

    4. Keep the defaults for “Policies”

    5. MFA and Verification:

    • I opted to manually reset the password for each user (since this is internal app)
    • Disabled user verification

    6. Keep the default for Message Customizations, tags, and devices.

    7. App Clients :

    • App client name: myappclient
    • Let the refresh token, access token, and id token be default
    • Check all “Auth flow configurations”
    • Check enable token revocation

    8. Skip Triggers

    9. Review and create the pool

    Once created, goto app integration -> domain name. Create a domain Cognito subdomain of your choice and note this. Next, I plan to use the Google sign-in feature with Cognito Federation Identity Providers. Use this guide to set up a Google social identity with Cognito.

    Setup Authorization:

    Once the user identity is verified, we need to allow them to access the s3 bucket with limited permissions. Head to the Cognito console, select federated identities, and create a new identity pool. Follow these steps to configure:

    1. Identity pool name: photobucket_auth

    2. Keep Unauthenticated and Authentication flow settings unchecked.

    3. Authentication providers:

    • User Pool I: Enter the user pool ID obtained during authentication setup
    • App Client I: Enter the app client ID generated during the authentication setup. (Cognito user pool -> App Clients -> App client ID)

    4. Setup permissions:

    • Expand view details (Role Summary)
    • For authenticated identities: edit policy document and use the below JSON policy and skip unauthenticated identities with the default configuration.
    {
       "Version": "2012-10-17",
       "Statement": [
           {
               "Effect": "Allow",
               "Action": [
                   "mobileanalytics:PutEvents",
                   "cognito-sync:*",
                   "cognito-identity:*"
               ],
               "Resource": [
                   "*"
               ]
           },
           {
               "Sid": "ListYourObjects",
               "Effect": "Allow",
               "Action": "s3:ListBucket",
               "Resource": [
                   "arn:aws:s3:::your-actual-photo-bucket-name"
               ],
               "Condition": {
                   "StringLike": {
                       "s3:prefix": [
                           "${cognito-identity.amazonaws.com:sub}/",
                           "${cognito-identity.amazonaws.com:sub}/*"
                       ]
                   }
               }
           },
           {
               "Sid": "ReadYourObjects",
               "Effect": "Allow",
               "Action": [
                   "s3:GetObject"
               ],
               "Resource": [
                   "arn:aws:s3:::your-actual-photo-bucket-name/${cognito-identity.amazonaws.com:sub}",
                   "arn:aws:s3:::your-actual-photo-bucket-name/${cognito-identity.amazonaws.com:sub}/*"
               ]
           }
       ]
    }

    ${cognito-identity.amazonaws.com:sub} is a special AWS variable. When a user is authenticated with a federated identity, each user is assigned a unique identity. What the above policy means is that any user who is authenticated should have access to objects prefixed by their own identity ID. This is how we intend users to gain authorization in a limited area within the S3 bucket.

    Copy the Identity Pool ID (from sample code section). You will need this in your backend to get the identity id of the authenticated user via JWT token.

    Amplify configuration for the frontend UI sign-in:

    This object helps you set up the minimal configuration for your application. This is all that we need to sign in via Cognito and access the S3 photo bucket.

    const awsconfig = {
       Auth : {
           identityPoolId: "idenity pool id created during authorization setup",
           region : "your aws region",
           identityPoolRegion: "same as above if cognito is in same region",
           userPoolId : "cognito user pool id created during authentication setup",
           userPoolWebClientId : "cognito app client id",
           cookieStorage : {
               domain : "https://your-app-domain-name", //this is very important
               secure: true
           },
           oauth: {
               domain : "{cognito domain name}.auth.{cognito region name}.amazoncognito.com",
               scope : ["profile","email","openid"],
               redirectSignIn: 'https://your-app-domain-name',
               redirectSignOut: 'https://your-app-domain-name',
               responseType : "token"
           }
       },
       Storage: {
           AWSS3 : {
               bucket: "your-actual-bucket-name",
               region: "region-of-your-bucket"
           }
       }
    };
    export default awsconfig;

    You can then use the below code to configure and sign in via social authentication.

    import Amplify, {Auth} from 'aws-amplify';
    import awsconfig from './aws-config';
    Amplify.configure(awsconfig);
    //once the amplify is configured you can use below call with onClick event of buttons or any other visual component to sign in.
    //Example
    <Button startIcon={<img alt="Sigin in With Google" src={logo} />} fullWidth variant="outlined" color="primary" onClick={() => Auth.federatedSignIn({provider: 'Google'})}>
       Sign in with Google
    </Button>

    Gallery View:

    When the application is loaded, we use the PhotoGallery component to load photos and view thumbnails on-page. The Photogallery component is a wrapper around the InfinityScoller component, which keeps loading images as the user scrolls. The idea here is that we query a max of 10 images in one go. Our backend returns a list of 10 images (just the map and metadata to the S3 bucket). We must load these images from the S3 bucket and then show thumbnails on-screen as a gallery view. When the user reaches the bottom of the screen or there is empty space left, the InfiniteScroller component loads 10 more images. This continues untill our backend replies with a stop marker.

    The key point here is that we need to send the JWT Token as a header to our backend service via an ajax call. The JWT Token is obtained post a sign-in from Amplify framework. An example of obtaininga JWT token:

    let authsession = await Auth.currentSession();
    let jwtToken = authsession.getIdToken().jwtToken;
    let photoList = await axios.get(url,{
       headers : {
           Authorization: jwtToken
       },
       responseType : "json"
    });

    An example of an infinite scroller component usage is given below. Note that “gallery” is JSX composed array of photo thumbnails. The “loadMore” method calls our ajax function to the server-side backend and updates the “gallery” variable and sets the “hasMore” variable to true/false so that the infinite scroller component can stop queering when there are no photos left to display on the screen.

    <InfiniteScroll
       loadMore={this.fetchPhotos}
       hasMore={this.state.hasMore}
       loader={<div style={{padding:"70px"}} key={0}><LinearProgress color="secondary" /></div>}
    >
       <div style={{ marginTop: "80px", position: "relative", textAlign: "center" }}>
           <div className="image-grid" style={{ marginTop: "30px" }}>
               {gallery}
           </div>
           {this.state.openLightBox ?
           <LightBox src={this.state.lightBoxImg} callback={this.closeLightBox} />
           : null}
       </div>
    </InfiniteScroll>

    The Lightbox component gives a zoom effect to the thumbnail. When the thumbnail is clicked, a higher resolution picture (webp version) is downloaded from the S3 bucket and shown on the screen. We use a storage object from the Amplify library. Downloaded content is a blob and must be converted into image data. To do so, we use the javascript native method, createObjectURL. Below is the sample code that downloads the object from the s3 bucket and then converts it into a viewable image for the HTML IMG tag.

    thumbClick = (index) => {
       const urlCreater = window.URL || window.webkitURL;
       try {
           this.setState({
               openLightBox: true
           });
           Storage.get(this.state.photoList[index].src,{download: true}).then(data=>{
               let image = urlCreater.createObjectURL(data.Body);
               this.setState({
                   lightBoxImg : image
               });
           });
              
       } catch (error) {
           console.log(error);
           this.setState({
               openLightBox: false,
               lightBoxImg : null
           });
       }
    };

    Uploading Photos:

    The S3 SDK lets you generate a pre-signed POST URL. Anyone who gets this URL will be able to upload objects to the S3 bucket directly without needing credentials. Of course, we can actually set up some boundaries, like a max object size, key of the uploaded object, etc. Refer to this AWS blog for more on pre-signed URLs. Here is the sample code to generate a pre-signed URL.

    let s3Params = {
       Bucket: "your-temporary-bucket-name,
       Conditions : [
           ["content-length-range",1,31457280]
       ],
       Fields : {
           key: "path/to/your/object"
       },
       Expires: 300 //in seconds
    };
    const s3 = new S3({region : process.env.AWSREGION });
    s3.createPresignedPost(s3Params)

    For a better UX, we can allow our users to upload more than one photo at a time. However, a pre-signed URL lets you upload a single object at a time. To overcome this, we generate multiple pre-signed URLs. Initially, we send a request to our backend asking to upload photos with expected keys. This request is originated once the user selects photos to upload. Our backend then generates pre-signed URLs for us. Our frontend React app then provides the illusion that all photos are being uploaded as a whole.

    When the upload is successful, the S3 PUT event is triggered, which we discussed earlier. The complete flow of the application is given in a sequence diagram. You can find the complete source code here in my GitHub repository.

    React Build Steps and Hosting:

    The ideal way to build the react app is to execute an npm run build. However, we take a slightly different approach. We are not using the S3 static website for serving frontend UI. For one reason, S3 static websites are non-SSL unless we use CloudFront. Therefore, we will make the API gateway our application’s entry point. Thus, the UI will also be served from the API gateway. However, we want to reduce calls made to the API gateway. For this reason, we will only deliver the index.html file hosted with the help API gateway/Lamda, and the rest of the static files (react supporting JS files) from S3 bucket.

    Your index.html should have all the reference paths pointed to the S3 bucket. The build mustexclusively specify that static files are located in a different location than what’s relative to the index.html file. Your S3 bucket needs to be public with the right bucket policy and CORS set so that the end-user can only retrieve files and not upload nasty objects. Those who are confused about how the S3 static website and S3 public bucket differ may refer to here. Below are the react build steps, bucket policy, and CORS.

    PUBLIC_URL=https://{your-static-bucket-name}.s3.{aws_region}.amazonaws.com/ npm run build
    //Bucket Policy
    {
       "Version": "2012-10-17",
       "Id": "http referer from your domain only",
       "Statement": [
           {
               "Sid": "Allow get requests originating from",
               "Effect": "Allow",
               "Principal": "*",
               "Action": "s3:GetObject",
               "Resource": "arn:aws:s3:::{your-static-bucket-name}/static/*",
               "Condition": {
                   "StringLike": {
                       "aws:Referer": [
                           "https://your-app-domain-name"
                       ]
                   }
               }
           }
       ]
    }
    //CORS
    [
       {
           "AllowedHeaders": [
               "*"
           ],
           "AllowedMethods": [
               "GET"
           ],
           "AllowedOrigins": [
               "https://your-app-domain-name"
           ],
           "ExposeHeaders": []
       }
    ]

    Once a build is complete, upload index.html to a lambda that serves your UI. Run the below shell commands to compress static contents and host them on our static S3 bucket.

    #assuming you are in your react app directory
    mkdir /tmp/s3uploads
    cp -ar build/static /tmp/s3uploads/
    cd /tmp/s3uploads
    #add gzip encoding to all the files
    gzip -9 `find ./ -type f`
    #remove .gz extension from compressed files
    for i in `find ./ -type f`
    do
       mv $i ${i%.*}
    done
    #sync your files to s3 static bucket and mention that these files are compressed with gzip encoding
    #so that browser will not treat them as regular files
    aws s3 --region $AWSREGION sync . s3://${S3_STATIC_BUCKET}/static/ --content-encoding gzip --delete --sse
    cd -
    rm -rf /tmp/s3uploads

    Our backend uses nodejs express framework. Since this is a serverless application, we need to wrap express with a serverless-http framework to work with lambda. Sample source code is given below, along with serverless framework resource definition. Notice that, except for the UI home endpoint ( “/” ), the rest of the API endpoints are authenticated with Cognito on the API gateway itself.

    const serverless = require("serverless-http");
    const express = require("express");
    const app = express();
    .
    .
    .
    .
    .
    .
    app.get("/",(req,res)=> {
     res.sendFile(path.join(__dirname + "/index.html"));
    });
    module.exports.uihome = serverless(app);

    provider:
     name: aws
     runtime: nodejs12.x
     lambdaHashingVersion: 20201221
     httpApi:
       authorizers:
         cognitoJWTAuth:
           identitySource: $request.header.Authorization
           issuerUrl: https://cognito-idp.{AWS_REGION}.amazonaws.com/{COGNITO_USER_POOL_ID}
           audience:
             - COGNITO_APP_CLIENT_ID
    .
    .
    .
    .
    .
    .
    .
    functions:
     react-serve-ui:
       handler: handler.uihome
       memorySize: 256
       timeout: 29
       layers:
         - {Ref: CommonLibsLambdaLayer}
       events:
         - httpApi:
             path: /prep/photoupload
             method: post
             authorizer:
               name: cognitoJWTAuth
         - httpApi:
             path: /list/photos
             method: get
             authorizer:
               name: cognitoJWTAuth
         - httpApi:
             path: /
             method: get

    Final Steps :

    Lastly, we will setup up a custom domain so that we don’t need to use the gibberish domain name generated by the API gateway and certificate for our custom domain. You don’t need to use route53 for this part. If you have an existing domain, you can create a subdomain and point it to the API gateway. First things first: head to the AWS ACM console and generate a certificate for the domain name. Once the request is generated, you need to validate your domain by creating a TXT record as per the ACM console. The ACM is a free service. Domain verification may take few minutes to several hours. Once you have the certificate ready, head back to the API gateway console. Navigate to “custom domain names” and click create.

    1. Enter your application domain name
    2. Check TLS 1.2 as TLS version
    3. Select Endpoint type as Regional
    4. Select ACM certificate from dropdown list
    5. Create domain name

    Select the newly created custom domain. Note the API gateway domain name from Domain Details -> Configuration tab. You will need this to map a CNAME/ALIAS record with your DNS provider. Click on the API mappings tab. Click configure API mappings. From the dropdown, select your API gateway, select stage as default, and click save. You are done here.

    Future Scope and Improvements :

    To improve application latency, we can use CloudFront as CDN. This way, our entry point could be S3, and we no longer need to use API gateway regional endpoint. We can also add AWS WAF as an added security in front of our API gateway to inspect incoming requests and payloads. We can also use Dynamodb secondary indexes so that we can efficiently search metadata in the table. Adding a lifecycle rule on raw photos which have not been accessed for more than a year can be transited to the S3 Glacier storage class. You can further add glacier deep storage transition to save more on storage costs.

  • ClickHouse – The Newest Data Store in Your Big Data Arsenal

    ClickHouse

    ClickHouse is an open-source column-oriented data warehouse for online analytical processing of queries (OLAP). It is fast, scalable, flexible, cost-efficient, and easy to run. It supports the best in the industry query performance while significantly reducing storage requirements through innovative use of columnar storage and compression.

    ClickHouse’s performance exceeds comparable column-oriented database management systems that are available on the market. ClickHouse is a database management system, not a single database. ClickHouse allows creating tables and databases at runtime, loading data, and running queries without reconfiguring and restarting the server.

    ClickHouse processes from hundreds of millions to over a billion rows of data across hundreds of node clusters. It utilizes all available hardware for processing queries to their fastest. The peak processing performance for a single query stands at more than two terabytes per second.

    What makes ClickHouse unique?

    • Data Storage & Compression: ClickHouse is designed to work on regular hard drives but uses SSD and additional RAM if available. Data compression in ClickHouse plays a crucial role in achieving excellent performance. It provides general-purpose compression codecs and some specialized codecs for specific kinds of data. These codecs have different CPU consumption and disk space and help ClickHouse outperform other databases.
    • High Performance: By using vector computation, engine data is processed by vectors which are parts of columns, and achieve high CPU efficiency. It supports parallel processing across multiple cores, turning large queries into parallelized naturally. ClickHouse also supports distributed query processing; data resides across shards which are used for parallel execution of the query.
    • Primary & Secondary Index: Data is sorted physically by the primary key allowing low latency extraction of specific values or ranges. The secondary index in ClickHouse enable the database to know that the query filtering conditions would skip some of the parts entirely. Therefore, these are also called data skipping indexes.
    • Support for Approximated Calculations: ClickHouse trades accuracy for performance by approximated calculations. It provides aggregate functions for an approximated estimate of several distinct values, medians, and quantiles. It retrieves proportionally fewer data from the disk to run queries based on the part of data to get approximated results.
    • Data Replication and Data Integrity Support: All the remaining duplicates retrieve their copies in the background after being written to any available replica. The system keeps identical data on several clones. Most failures are recovered automatically or semi-automatically in complex scenarios.

    But it can’t be all good, can it? there are some disadvantages to ClickHouse as well:

    • No full-fledged transactions.
    • Inability to efficiently and precisely change or remove previously input data. For example, to comply with GDPR, data could well be cleaned up or modified using batch deletes and updates.
    • ClickHouse is less efficient for point queries that retrieve individual rows by their keys due to the sparse index.

    ClickHouse against its contemporaries

    So with all these distinctive features, how does ClickHouse compare with other industry-leading data storage tools. Now, ClickHouse being general-purpose, has a variety of use cases, and it has its pros and cons, so here’s a high-level comparison against the best tools in their domain. Depending on the use case, each tool has its unique traits, and comparison around them would not be fair, but what we care about the most is performance, scalability, cost, and other key attributes that can be compared irrespective of the domain. So here we go:

    ClickHouse vs Snowflake:

    • With its decoupled storage & compute approach, Snowflake is able to segregate workloads and enhance performance. The search optimization service in Snowflake further enhances the performance for point lookups but has additional costs attached with it. ClickHouse, on the other hand, with local runtime and inherent support for multiple forms of indexing, drastically improves query performance.
    • Regarding scalability, ClickHouse being on-prem makes it slightly challenging to scale compared to Snowflake, which is cloud-based. Managing hardware manually by provisioning clusters and migrating is doable but tedious. But one possible solution to tackle is to deploy CH on the cloud, a very good option that is cheaper and, frankly, the most viable. 

    ClickHouse vs Redshift:

    • Redshift is a managed, scalable cloud data warehouse. It offers both provisioned and serverless options. Its RA3 nodes compute scalably and cache the necessary data. Still, even with that, its performance does not separate different workloads that are on the same data putting it on the lower end of the decoupled compute & storage cloud architectures. ClickHouse’s local runtime is one of the fastest. 
    • Both Redshift and ClickHouse are columnar, sort data, allowing read-only specific data. But deploying CH is cheaper, and although RS is tailored to be a ready-to-use tool, CH is better if you’re not entirely dependent on Redshift’s features like configuration, backup & monitoring.

    ClickHouse vs InfluxDB:

    • InfluxDB, written in Go, this open-source no-SQL is one of the most popular choices when it comes to dealing with time-series data and analysis. Despite being a general-purpose analytical DB, ClickHouse provides competitive write performance. 
    • ClickHouse’s data structures like AggregatingMergeTree allow real-time data to be stored in a pre-aggregated format which puts it on par in performance regarding TSDBs. It is significantly faster in heavy queries and comparable in the case of light queries.

    ClickHouse vs PostgreSQL:

    • Postgres is another DB that is very versatile and thus is widely used by the world for various use cases, just like ClickHouse. Postgres, however, is an OLTP DB, so unlike ClickHouse, analytics is not its primary aim, but it’s still used for analytics purposes to a certain extent.
    • In terms of transactional data, ClickHouse’s columnar nature puts it below Postgres, but when it comes to analytical capabilities, even after tuning Postgres to its max potential, for, e.g., by using materialized views, indexing, cache size, buffers, etc. ClickHouse is ahead.  

    ClickHouse vs Apache Druid:

    • Apache Druid is an open-source data store that is primarily used for OLAP. Both Druid & ClickHouse are very similar in terms of their approaches and use cases but differ in terms of their architecture. Druid is mainly used for real-time analytics with heavy ingestions and high uptime.
    • Unlike Druid, ClickHouse has a much simpler deployment. CH can be deployed on only one server, while Druid setup needs multiple types of nodes (master, broker, ingestion, etc.). ClickHouse, with its support for SQL-like nature, provides better flexibility. It is more performant when the deployment is small.

    To summarize the differences between ClickHouse and other data warehouses:

    ClickHouse Engines

    Depending on the type of your table (internal or external) ClickHouse provides an array of engines that help us connect to different data storages and also determine the way data is stored, accessed, and other interactions on it.

    These engines are mainly categorized into two types:

    Database Engines:

    These allow us to work with different databases & tables.
    ClickHouse uses the Atomic database engine to provide configurable table engines and dialects. The popular ones are PostgreSQL, MySQL, and so on.

    Table Engines:

    These determine 

    • how and where data is stored
    • where to read/write it from/to
    • which queries it supports
    • use of indexes
    • concurrent data access and so on.

    These engines are further classified into families based on the above parameters:

    MergeTree Engines:

    This is the most universal and functional table for high-load tasks. The engines of this family support quick data insertion with subsequent background data processing. These engines also support data replication, partitioning, secondary data-skipping indexes and some other features. Following are some of the popular engines in this family:

    • MergeTree
    • SummingMergeTree
    • AggregatingMergeTree

    MergeTree engines with indexing and partitioning support allow data to be processed at a tremendous speed. These can also be leveraged to form materialized views that store aggregated data further improving the performance.

    Log Engines:

    These are lightweight engines with minimum functionality. These work the best when the requirement is to quickly write into many small tables and read them later as a whole. This family consists of:

    • Log
    • StripeLog
    • TinyLog

    These engines append data to the disk in a sequential fashion and support concurrent reading. They do not support indexing, updating, or deleting and hence are only useful when the data is small, sequential, and immutable.

    Integration Engines:

    These are used for communicating with other data storage and processing systems. This support:

    • JDBC
    • MongoDB
    • HDFS
    • S3
    • Kafka and so on.

    Using these engines we can import and export data from external sources. With engines like Kafka we can ingest data directly from a topic to a table in ClickHouse and with the S3 engine, we work directly with S3 objects.

    Special Engines:

    ClickHouse offers some special engines that are specific to the use case. For example:

    • MaterializedView
    • Distributed
    • Merge
    • File and so on.

    These special engines have their own quirks for eg. with File we can export data to a file, update data in the table by updating the file, etc.

    Summary

    We learned that ClickHouse is a very powerful and versatile tool. One that has stellar performance is feature-packed, very cost-efficient, and open-source. We saw a high-level comparison of ClickHouse with some of the best choices in an array of use cases. Although it ultimately comes down to how specific and intense your use case is, ClickHouse and its generic nature measure up pretty well on multiple occasions.

    ClickHouse’s applicability in web analytics, network management, log analysis, time series analysis, asset valuation in financial markets, and security threat identification makes it tremendously versatile. With consistently solving business problems in a low latency response for petabytes of data, ClickHouse is indeed one of the faster data warehouses out there.

    Further Readings

  • Getting Started With Kubernetes Operators (Helm Based) – Part 1

    Introduction

    The concept of operators was introduced by CoreOs in the last quarter of  2016 and post the introduction of operator framework last year, operators are rapidly becoming the standard way of managing applications on Kubernetes especially the ones which are stateful in nature. In this blog post, we will learn what an operator is. Why they are needed and what problems do they solve. We will also create a helm based operator as an example.

    This is the first part of our Kubernetes Operator Series. In the second part, getting started with Kubernetes operators (Ansible based), and the third part, getting started with Kubernetes operators (Golang based), you can learn how to build Ansible and Golang based operators.

    What is an Operator?

    Whenever we deploy our application on Kubernetes we leverage multiple Kubernetes objects like deployment, service, role, ingress, config map, etc. As our application gets complex and our requirements become non-generic, managing our application only with the help of native Kubernetes objects becomes difficult and we often need to introduce manual intervention or some other form of automation to make up for it.

    Operators solve this problem by making our application first class Kubernetes objects that is we no longer deploy our application as a set of native Kubernetes objects but a custom object/resource of its kind, having a more domain-specific schema and then we bake the “operational intelligence” or the “domain-specific knowledge” into the controller responsible for maintaining the desired state of this object. For example, etcd operator has made etcd-cluster a first class object and for deploying the cluster we create an object of Etcd Cluster kind. With operators, we are able to extend Kubernetes functionalities for custom use cases and manage our applications in a Kubernetes specific way allowing us to leverage Kubernetes APIs and Kubectl tooling.

    Operators combine crds and custom controllers and intend to eliminate the requirement for manual intervention (human operator) while performing tasks like an upgrade, handling failure recovery, scaling in case of complex (often stateful) applications and make them more resilient and self-sufficient.

    How to Build Operators ?

    For building and managing operators we mostly leverage the Operator Framework which is an open source tool kit allowing us to build operators in a highly automated, scalable and effective way.  Operator framework comprises of three subcomponents:

    1. Operator SDK: Operator SDK is the most important component of the operator framework. It allows us to bootstrap our operator project in minutes. It exposes higher level APIs and abstraction and saves developers the time to dig deeper into kubernetes APIs and focus more on building the operational logic. It performs common tasks like getting the controller to watch the custom resource (cr) for changes etc as part of the project setup process.
    2. Operator Lifecycle Manager:  Operators also run on the same kubernetes clusters in which they manage applications and more often than not we create multiple operators for multiple applications. Operator lifecycle manager (OLM) provides us a declarative way to install, upgrade and manage all the operators and their dependencies in our cluster.
    3. Operator Metering:  Operator metering is currently an alpha project. It records historical cluster usage and can generate usage reports showing usage breakdown by pod or namespace over arbitrary time periods.

    Types of Operators

    Currently there are three different types of operator we can build:

    1. Helm based operators: Helm based operators allow us to use our existing Helm charts and build operators using them. Helm based operators are quite easy to build and are preferred to deploy a stateless application using operator pattern.
    2. Ansible based Operator: Ansible based operator allows us to use our existing ansible playbooks and roles and build operators using them. There are also easy to build and generally preferred for stateless applications.
    3. Go based operators: Go based operators are built to solve the most complex use cases and are generally preferred for stateful applications. In case of an golang based operator, we build the controller logic ourselves providing it with all our custom requirements. This type of operators is also relatively complex to build.

    Building a Helm based operator

    1. Let’s first install the operator sdk

    go get -d github.com/operator-framework/operator-sdk
    cd $GOPATH/src/github.com/operator-framework/operator-sdk
    git checkout master
    make dep
    make install

    Now we will have the operator-sdk binary in the $GOPATH/bin folder.      

    2.  Setup the project

    For building a helm based operator we can use an existing Helm chart. We will be using the book-store Helm chart which deploys a simple python app and mongodb instances. This app allows us to perform crud operations via. rest endpoints.

    Now we will use the operator-sdk to create our Helm based bookstore-operator project.

    operator-sdk new bookstore-operator --api-version=velotio.com/v1alpha1 --kind=BookStore --type=helm --helm-chart=book-store
      --helm-chart-repo=https://akash-gautam.github.io/helmcharts/

    In the above command, the bookstore-operator is the name of our operator/project. –kind is used to specify the kind of objects this operator will watch and –api-verison is used for versioning of this object. The operator sdk takes only this much information and creates the custom resource definition (crd) and also the custom resource (cr) of its type for us (remember we talked about high-level abstraction operator sdk provides). The above command bootstraps a project with below folder structure.

    bookstore-operator/
    |
    |- build/ # Contains the Dockerfile to build the operator image
    |- deploy/ # Contains the crd,cr and manifest files for deploying operator
    |- helm-charts/ # Contains the helm chart we used while creating the project
    |- watches.yaml # Specifies the resource the operator watches (maintains the state of)

    We had discussed the operator-sdk automates setting up the operator projects and that is exactly what we can observe here. Under the build folder, we have the Dockerfile to build our operator image. Under deploy folder we have a crd folder containing both the crd and the cr. This folder also has operator.yaml file using which we will run the operator in our cluster, along with this we have manifest files for role, rolebinding and service account file to be used while deploying the operator.  We have our book-store helm chart under helm-charts. In the watches.yaml file.

    ---
    - version: v1alpha1
      group: velotio.com
      kind: BookStore
      chart: /opt/helm/helm-charts/book-store

    We can see that the bookstore-operator watches events related to BookStore kind objects and executes the helm chart specified.

    If we take a look at the cr file under deploy/crds (velotio_v1alpha1_bookstore_cr.yaml) folder then we can see that it looks just like the values.yaml file of our book-store helm chart.

    apiVersion: velotio.com/v1alpha1
    kind: BookStore
    metadata:
      name: example-bookstore
    spec:
      # Default values copied from <project_dir>/helm-charts/book-store/values.yaml
      
      # Default values for book-store.
      # This is a YAML-formatted file.
      # Declare variables to be passed into your templates.
      
      replicaCount: 1
      
      image:
        app:
          repository: akash125/pyapp
          tag: latest
          pullPolicy: IfNotPresent
        mongodb:
          repository: mongo
          tag: latest
          pullPolicy: IfNotPresent
          
      service:
        app:
          type: LoadBalancer
          port: 80
          targetPort: 3000
        mongodb:
          type: ClusterIP
          port: 27017
          targetPort: 27017
      
      
      resources: {}
        # We usually recommend not to specify default resources and to leave this as a conscious
        # choice for the user. This also increases chances charts run on environments with little
        # resources, such as Minikube. If you do want to specify resources, uncomment the following
        # lines, adjust them as necessary, and remove the curly braces after 'resources:'.
        # limits:
        #  cpu: 100m
        #  memory: 128Mi
        # requests:
        #  cpu: 100m
        #  memory: 128Mi
      
      nodeSelector: {}
      
      tolerations: []
      
      affinity: {}

    In the case of Helm charts, we use the values.yaml file to pass the parameter to our Helm releases, Helm based operator converts all these configurable parameters into the spec of our custom resource. This allows us to express the values.yaml with a custom resource (CR) which, as a native Kubernetes object, enables the benefits of RBAC applied to it and an audit trail. Now when we want to update out deployed we can simply modify the CR and apply it, and the operator will ensure that the changes we made are reflected in our app.

    For each object of  `BookStore` kind  the bookstore-operator will perform the following actions:

    1. Create the bookstore app deployment if it doesn’t exists.
    2. Create the bookstore app service if it doesn’t exists.
    3. Create the mongodb deployment if it doesn’t exists.
    4. Create the mongodb service if it doesn’t exists.
    5. Ensure deployments and services match their desired configurations like the replica count, image tag, service port etc.  

    3. Build the Bookstore-operator Image

    The Dockerfile for building the operator image is already in our build folder we need to run the below command from the root folder of our operator project to build the image.

    operator-sdk build akash125/bookstore-operator:v0.0.1

    4. Run the Bookstore-operator

    As we have our operator image ready we can now go ahead and run it. The deployment file (operator.yaml under deploy folder) for the operator was created as a part of our project setup we just need to set the image for this deployment to the one we built in the previous step.

    After updating the image in the operator.yaml we are ready to deploy the operator.

    kubectl create -f deploy/service_account.yaml
    kubectl create -f deploy/role.yaml
    kubectl create -f deploy/role_binding.yaml
    kubectl create -f deploy/operator.yaml

    Note: The role created might have more permissions then actually required for the operator so it is always a good idea to review it and trim down the permissions in production setups.

    Verify that the operator pod is in running state.

    5. Deploy the Bookstore App

    Now we have the bookstore-operator running in our cluster we just need to create the custom resource for deploying our bookstore app.

    First, we can create bookstore cr we need to register its crd.

    kubectl apply -f deploy/crds/velotio_v1alpha1_bookstore_crd.yaml

    Now we can create the bookstore object.

    kubectl apply -f deploy/crds/velotio_v1alpha1_bookstore_cr.yaml

    Now we can see that our operator has deployed out book-store app.

    Now let’s grab the external IP of the app and make some requests to store details of books.

    Let’s hit the external IP on the browser and see if it lists the books we just stored:

    The bookstore operator build is available here.

    Conclusion

    Since its early days Kubernetes was believed to be a great tool for managing stateless application but the managing stateful applications on Kubernetes was always considered difficult. Operators are a big leap towards managing stateful applications and other complex distributed, multi (poly) cloud workloads with the same ease that we manage the stateless applications. In this blog post, we learned the basics of Kubernetes operators and build a simple helm based operator. In the next installment of this blog series, we will build an Ansible based Kubernetes operator and then in the last blog we will build a full-fledged Golang based operator for managing stateful workloads.

    Related Reads:

  • How to Make Your Terminal More Productive with Z-Shell (ZSH)

    When working with servers or command-line-based applications, we spend most of our time on the command line. A good-looking and productive terminal is better in many aspects than a GUI (Graphical User Interface) environment since the command line takes less time for most use cases. Today, we’ll look at some of the features that make a terminal cool and productive.

    You can use the following steps on Ubuntu 20.04. if you are using a different operating system, your commands will likely differ. If you’re using Windows, you can choose between Cygwin, WSL, and Git Bash.

    Prerequisites

    Let’s upgrade the system and install some basic tools needed.

    sudo apt update && sudo apt upgrade
    sudo apt install build-essential curl wget git

    Z-Shell (ZSH)

    Zsh is an extended Bourne shell with many improvements, including some features of Bash and other shells.

    Let’s install Z-Shell:

    sudo apt install zsh

    Make it our default shell for our terminal:

    chsh -s $(which zsh)

    Now restart the system and open the terminal again to be welcomed by ZSH. Unlike other shells like Bash, ZSH requires some initial configuration, so it asks for some configuration options the first time we start it and saves them in a file called .zshrc in the home directory (/home/user) where the user is the current system user.

    For now, we’ll skip the manual work and get a head start with the default configuration. Press 2, and ZSH will populate the .zshrc file with some default options. We can change these later.  

    The initial configuration setup can be run again as shown in the below image

    Oh-My-ZSH

    Oh-My-ZSH is a community-driven, open-source framework to manage your ZSH configuration. It comes with many plugins and helpers. It can be installed with one single command as below.

    Installation

    sh -c "$(wget https://raw.github.com/ohmyzsh/ohmyzsh/master/tools/install.sh -O -)"

    It’d take a backup of our existing .zshrc in a file zshrc.pre-oh-my-zsh, so whenever you uninstall it, the backup would be restored automatically.

    Font

    A good terminal needs some good fonts, we’d use Terminess nerd font to make our terminal look awesome, which can be downloaded here. Once downloaded, extract and move them to ~/.local/share/fonts to make them available for the current user or to /usr/share/fonts to be available for all the users.

    tar -xvf Terminess.zip
    mv *.ttf ~/.local/share/fonts 

    Once the font is installed, it will look like:

    Among all the things Oh-My-ZSH provides, 2 things are community favorites, plugins, and themes.

    Theme

    My go-to ZSH theme is powerlevel10k because it’s flexible, provides everything out of the box, and is easy to install with one command as shown below:

    git clone --depth=1 https://github.com/romkatv/powerlevel10k.git ${ZSH_CUSTOM:-$HOME/.oh-my-zsh/custom}/themes/powerlevel10k

    To set this theme in .zshrc:

    Close the terminal and start it again. Powerlevel10k will welcome you with the initial setup, go through the setup with the options you want. You can run this setup again by executing the below command:

    p10k configure

    Tools and plugins we can’t live without

    Plugins can be added to the plugins array in the .zshrc file. For all the plugins you want to use from the below list, add those to the plugins array in the .zshrc file like so:

    ZSH-Syntax-Highlighting

    This enables the highlighting of commands as you type and helps you catch syntax errors before you execute them:

    As you can see, “ls” is in green but “lss” is in red.

    Execute below command to install it:

    git clone https://github.com/zsh-users/zsh-syntax-highlighting.git ${ZSH_CUSTOM:-~/.oh-my-zsh/custom}/plugins/zsh-syntax-highlighting

    ZSH Autosuggestions

    This suggests commands as you type based on your history:

    The below command is how you can install it by cloning the git repo:

    git clone https://github.com/zsh-users/zsh-autosuggestions ${ZSH_CUSTOM:-~/.oh-my-zsh/custom}/plugins/zsh-autosuggestions

    ZSH Completions

    For some extra ZSH completion scripts, execute below command

    git clone https://github.com/zsh-users/zsh-completions ${ZSH_CUSTOM:=~/.oh-my-zsh/custom}/plugins/zsh-completions 

    autojump

    It’s a faster way of navigating the file system; it works by maintaining a database of directories you visit the most. More details can be found here.

    sudo apt install autojump 

    You can also use the plugin Z as an alternative if you’re not able to install autojump or for any other reason.

    Internal Plugins

    Some plugins come installed with oh-my-zsh, and they can be included directly in .zshrc file without any installation.

    copyfile

    It copies the content of a file to the clipboard.

    copyfile test.txt

    copypath

    It copies the absolute path of the current directory to the clipboard.

    copybuffer

    This plugin copies the command that is currently typed in the command prompt to the clipboard. It works with the keyboard shortcut CTRL + o.

    sudo

    Sometimes, we forget to prefix a command with sudo, but that can be done in just a second with this plugin. When you hit the ESC key twice, it will prefix the command you’ve typed in the terminal with sudo.

    web-search

    This adds some aliases for searching with Google, Wikipedia, etc. For example, if you want to web-search with Google, you can execute the below command:

    google oh my zsh

    Doing so will open this search in Google:

    More details can be found here.

    Remember, you’d have to add each of these plugins in the .zshrc file as well. So, in the end, this is how the plugins array in .zshrc file should look like:

    plugins=(
            zsh-autosuggestions
            zsh-syntax-highlighting
            zsh-completions
            autojump
            copyfile
            copydir
            copybuffer
            history
            dirhistory
            sudo
            web-search
            git
    ) 

    You can add more plugins, like docker, heroku, kubectl, npm, jsontools, etc., if you’re a developer. There are plugins for system admins as well or for anything else you need. You can explore them here.

    Enhancd

    Enhancd is the next-gen method to navigate file system with cli. It works with a fuzzy finder, we’ll install it fzf for this purpose.

    sudo apt install fzf

    Enhancd can be installed with the zplug plugin manager for ZSH, so first we’ll install zplug with the below command:

    $ curl -sL --proto-redir -all,https https://raw.githubusercontent.com/zplug/installer/master/installer.zsh | zsh

    Append the following to .zshrc:

    source ~/.zplug/init.zsh
    zplug load

    Now close your terminal, open it again, and use zplug to install enhanced

    zplug "b4b4r07/enhancd", use:init.sh

    Aliases

    As a developer, I need to execute git commands many times a day, typing each command every time is too cumbersome, so we can use aliases for them. Aliases need to be added .zshrc, and here’s how we can add them.

    alias gs='git status'
    alias ga='git add .'
    alias gf='git fetch'
    alias gr='git rebase'
    alias gp='git push'
    alias gd='git diff'
    alias gc='git commit'
    alias gh='git checkout'
    alias gst='git stash'
    alias gl='git log --oneline --graph'

    You can add these anywhere in the .zshrc file.

    Colorls

    Another tool that makes you say wow is Colorls. This tool colorizes the output of the ls command. This is how it looks once you install it:

    It works with ruby, below is how you can install both ruby and colors:

    sudo apt install ruby ruby-dev ruby-colorize
    sudo gem install colorls

    Now, restart your terminal and execute the command colors in your terminal to see the magic!

    Bonus – We can add some aliases as well if we want the same output of Colorls when we execute the command ls. Note that we’re adding another alias for ls to make it available as well.

    alias cl='ls'
    alias ls='colorls'
    alias la='colorls -a'
    alias ll='colorls -l'
    alias lla='colorls -la'

    These are the tools and plugins I can’t live without now, Let me know if I’ve missed anything.

    Automation

    Do you wanna repeat this process again, if let’s say, you’ve bought a new laptop and want the same setup?

    You can automate all of this if your answer is no, and that’s why I’ve created Project Automator. This project does a lot more than just setting up a terminal: it works with Arch Linux as of now but you can take the parts you need and make it work with almost any *nix system you like.

    Explaining how it works is beyond the scope of this article, so I’ll have to leave you guys here to explore it on your own.

    Conclusion

    We need to perform many tasks on our systems, and using a GUI(Graphical User Interface) tool for a task can consume a lot of your time, especially if you repeat the same task on a daily basis like converting a media stream, setting up tools on a system, etc.

    Using a command-line tool can save you a lot of time and you can automate repetitive tasks with scripting. It can be a great tool for your arsenal.

  • Deploy Serverless, Event-driven Python Applications Using Zappa

    Introduction

    Zappa is a  very powerful open source python project which lets you build, deploy and update your WSGI app hosted on AWS Lambda + API Gateway easily.This blog is a detailed step-by-step focusing on challenges faced while deploying Django application on AWS Lambda using Zappa as a deployment tool.

    Building Your Application

    If you do not have a Django application already you can build one by cloning this GitHub repository.

    $ git clone https://github.com/velotiotech/django-zappa-sample.git    

    Cloning into 'django-zappa-sample'...
    remote: Counting objects: 18, done.
    remote: Compressing objects: 100% (13/13), done.
    remote: Total 18 (delta 1), reused 15 (delta 1), pack-reused 0
    Unpacking objects: 100% (18/18), done.
    Checking connectivity... done.

    Once you have cloned the repository you will need a virtual environment which provides an isolated Python environment for your application. I prefer virtualenvwrapper to create one.

    Command :

    $ mkvirtualenv django_zappa_sample 

    Installing setuptools, pip, wheel...done.
    virtualenvwrapper.user_scripts creating /home/velotio/Envs/django_zappa_sample/bin/predeactivate
    virtualenvwrapper.user_scripts creating /home/velotio/Envs/django_zappa_sample/bin/postdeactivate
    virtualenvwrapper.user_scripts creating /home/velotio/Envs/django_zappa_sample/bin/preactivate
    virtualenvwrapper.user_scripts creating /home/velotio/Envs/django_zappa_sample/bin/postactivate
    virtualenvwrapper.user_scripts creating /home/velotio/Envs/django_zappa_sample/bin/get_env_details

    Install dependencies from requirements.txt.

    $ pip install -r requirements.txt

    Collecting Django==1.11.11 (from -r requirements.txt (line 1))
      Downloading https://files.pythonhosted.org/packages/d5/bf/2cd5eb314aa2b89855c01259c94dc48dbd9be6c269370c1f7ae4979e6e2f/Django-1.11.11-py2.py3-none-any.whl (6.9MB)
        100% |████████████████████████████████| 7.0MB 772kB/s 
    Collecting zappa==0.45.1 (from -r requirements.txt (line 2))
    Collecting pytz (from Django==1.11.11->-r requirements.txt (line 1))
      Downloading https://files.pythonhosted.org/packages/dc/83/15f7833b70d3e067ca91467ca245bae0f6fe56ddc7451aa0dc5606b120f2/pytz-2018.4-py2.py3-none-any.whl (510kB)
        100% |████████████████████████████████| 512kB 857kB/s 
    Collecting future==0.16.0 (from zappa==0.45.1->-r requirements.txt (line 2))
    Collecting toml>=0.9.3 (from zappa==0.45.1->-r requirements.txt (line 2))
    Collecting docutils>=0.12 (from zappa==0.45.1->-r requirements.txt (line 2))
      Using cached https://files.pythonhosted.org/packages/50/09/c53398e0005b11f7ffb27b7aa720c617aba53be4fb4f4f3f06b9b5c60f28/docutils-0.14-py2-none-any.whl
    Collecting PyYAML==3.12 (from zappa==0.45.1->-r requirements.txt (line 2))
    Collecting futures==3.1.1 (from zappa==0.45.1->-r requirements.txt (line 2))
      Using cached https://files.pythonhosted.org/packages/a6/1c/72a18c8c7502ee1b38a604a5c5243aa8c2a64f4bba4e6631b1b8972235dd/futures-3.1.1-py2-none-any.whl
    Requirement already satisfied: wheel>=0.30.0 in /home/velotio/Envs/django_zappa_sample/lib/python2.7/site-packages (from zappa==0.45.1->-r requirements.txt (line 2)) (0.31.1)
    Collecting base58==0.2.4 (from zappa==0.45.1->-r requirements.txt (line 2))
    Collecting durationpy==0.5 (from zappa==0.45.1->-r requirements.txt (line 2))
    Collecting kappa==0.6.0 (from zappa==0.45.1->-r requirements.txt (line 2))
      Using cached https://files.pythonhosted.org/packages/ed/cf/a8aa5964557c8a4828da23d210f8827f9ff190318838b382a4fb6f118f5d/kappa-0.6.0-py2-none-any.whl
    Collecting Werkzeug==0.12 (from zappa==0.45.1->-r requirements.txt (line 2))
      Using cached https://files.pythonhosted.org/packages/ae/c3/f59f6ade89c811143272161aae8a7898735e7439b9e182d03d141de4804f/Werkzeug-0.12-py2.py3-none-any.whl
    Collecting boto3>=1.4.7 (from zappa==0.45.1->-r requirements.txt (line 2))
      Downloading https://files.pythonhosted.org/packages/cd/a3/4d1caf76d8f5aac8ab1ffb4924ecf0a43df1572f6f9a13465a482f94e61c/boto3-1.7.24-py2.py3-none-any.whl (128kB)
        100% |████████████████████████████████| 133kB 1.1MB/s 
    Collecting six>=1.11.0 (from zappa==0.45.1->-r requirements.txt (line 2))
      Using cached https://files.pythonhosted.org/packages/67/4b/141a581104b1f6397bfa78ac9d43d8ad29a7ca43ea90a2d863fe3056e86a/six-1.11.0-py2.py3-none-any.whl
    Collecting tqdm==4.19.1 (from zappa==0.45.1->-r requirements.txt (line 2))
      Using cached https://files.pythonhosted.org/packages/c0/d3/7f930cbfcafae3836be39dd3ed9b77e5bb177bdcf587a80b6cd1c7b85e74/tqdm-4.19.1-py2.py3-none-any.whl
    Collecting argcomplete==1.9.2 (from zappa==0.45.1->-r requirements.txt (line 2))
      Using cached https://files.pythonhosted.org/packages/0f/ee/625763d848016115695942dba31a9937679a25622b6f529a2607d51bfbaa/argcomplete-1.9.2-py2.py3-none-any.whl
    Collecting hjson==3.0.1 (from zappa==0.45.1->-r requirements.txt (line 2))
    Collecting troposphere>=1.9.0 (from zappa==0.45.1->-r requirements.txt (line 2))
    Collecting python-dateutil==2.6.1 (from zappa==0.45.1->-r requirements.txt (line 2))
      Using cached https://files.pythonhosted.org/packages/4b/0d/7ed381ab4fe80b8ebf34411d14f253e1cf3e56e2820ffa1d8844b23859a2/python_dateutil-2.6.1-py2.py3-none-any.whl
    Collecting botocore>=1.7.19 (from zappa==0.45.1->-r requirements.txt (line 2))
      Downloading https://files.pythonhosted.org/packages/65/98/12aa979ca3215d69111026405a9812d7bb0c9ae49e2800b00d3bd794705b/botocore-1.10.24-py2.py3-none-any.whl (4.2MB)
        100% |████████████████████████████████| 4.2MB 768kB/s 
    Collecting requests>=2.10.0 (from zappa==0.45.1->-r requirements.txt (line 2))
      Using cached https://files.pythonhosted.org/packages/49/df/50aa1999ab9bde74656c2919d9c0c085fd2b3775fd3eca826012bef76d8c/requests-2.18.4-py2.py3-none-any.whl
    Collecting jmespath==0.9.3 (from zappa==0.45.1->-r requirements.txt (line 2))
      Using cached https://files.pythonhosted.org/packages/b7/31/05c8d001f7f87f0f07289a5fc0fc3832e9a57f2dbd4d3b0fee70e0d51365/jmespath-0.9.3-py2.py3-none-any.whl
    Collecting wsgi-request-logger==0.4.6 (from zappa==0.45.1->-r requirements.txt (line 2))
    Collecting lambda-packages==0.19.0 (from zappa==0.45.1->-r requirements.txt (line 2))
    Collecting python-slugify==1.2.4 (from zappa==0.45.1->-r requirements.txt (line 2))
      Using cached https://files.pythonhosted.org/packages/9f/77/ab7134b731d0e831cf82861c1ab0bb318e80c41155fa9da18958f9d96057/python_slugify-1.2.4-py2.py3-none-any.whl
    Collecting placebo>=0.8.1 (from kappa==0.6.0->zappa==0.45.1->-r requirements.txt (line 2))
    Collecting click>=5.1 (from kappa==0.6.0->zappa==0.45.1->-r requirements.txt (line 2))
      Using cached https://files.pythonhosted.org/packages/34/c1/8806f99713ddb993c5366c362b2f908f18269f8d792aff1abfd700775a77/click-6.7-py2.py3-none-any.whl
    Collecting s3transfer<0.2.0,>=0.1.10 (from boto3>=1.4.7->zappa==0.45.1->-r requirements.txt (line 2))
      Using cached https://files.pythonhosted.org/packages/d7/14/2a0004d487464d120c9fb85313a75cd3d71a7506955be458eebfe19a6b1d/s3transfer-0.1.13-py2.py3-none-any.whl
    Collecting cfn-flip>=0.2.5 (from troposphere>=1.9.0->zappa==0.45.1->-r requirements.txt (line 2))
    Collecting certifi>=2017.4.17 (from requests>=2.10.0->zappa==0.45.1->-r requirements.txt (line 2))
      Using cached https://files.pythonhosted.org/packages/7c/e6/92ad559b7192d846975fc916b65f667c7b8c3a32bea7372340bfe9a15fa5/certifi-2018.4.16-py2.py3-none-any.whl
    Collecting chardet<3.1.0,>=3.0.2 (from requests>=2.10.0->zappa==0.45.1->-r requirements.txt (line 2))
      Using cached https://files.pythonhosted.org/packages/bc/a9/01ffebfb562e4274b6487b4bb1ddec7ca55ec7510b22e4c51f14098443b8/chardet-3.0.4-py2.py3-none-any.whl
    Collecting idna<2.7,>=2.5 (from requests>=2.10.0->zappa==0.45.1->-r requirements.txt (line 2))
      Using cached https://files.pythonhosted.org/packages/27/cc/6dd9a3869f15c2edfab863b992838277279ce92663d334df9ecf5106f5c6/idna-2.6-py2.py3-none-any.whl
    Collecting urllib3<1.23,>=1.21.1 (from requests>=2.10.0->zappa==0.45.1->-r requirements.txt (line 2))
      Using cached https://files.pythonhosted.org/packages/63/cb/6965947c13a94236f6d4b8223e21beb4d576dc72e8130bd7880f600839b8/urllib3-1.22-py2.py3-none-any.whl
    Collecting Unidecode>=0.04.16 (from python-slugify==1.2.4->zappa==0.45.1->-r requirements.txt (line 2))
      Using cached https://files.pythonhosted.org/packages/59/ef/67085e30e8bbcdd76e2f0a4ad8151c13a2c5bce77c85f8cad6e1f16fb141/Unidecode-1.0.22-py2.py3-none-any.whl
    Installing collected packages: pytz, Django, future, toml, docutils, PyYAML, futures, base58, durationpy, jmespath, six, python-dateutil, botocore, s3transfer, boto3, placebo, click, kappa, Werkzeug, tqdm, argcomplete, hjson, cfn-flip, troposphere, certifi, chardet, idna, urllib3, requests, wsgi-request-logger, lambda-packages, Unidecode, python-slugify, zappa
    Successfully installed Django-1.11.11 PyYAML-3.12 Unidecode-1.0.22 Werkzeug-0.12 argcomplete-1.9.2 base58-0.2.4 boto3-1.7.24 botocore-1.10.24 certifi-2018.4.16 cfn-flip-1.0.3 chardet-3.0.4 click-6.7 docutils-0.14 durationpy-0.5 future-0.16.0 futures-3.1.1 hjson-3.0.1 idna-2.6 jmespath-0.9.3 kappa-0.6.0 lambda-packages-0.19.0 placebo-0.8.1 python-dateutil-2.6.1 python-slugify-1.2.4 pytz-2018.4 requests-2.18.4 s3transfer-0.1.13 six-1.11.0 toml-0.9.4 tqdm-4.19.1 troposphere-2.2.1 urllib3-1.22 wsgi-request-logger-0.4.6 zappa-0.45.1
    @velotiotech

    Now if you run the server directly it will log a warning as the database is not set up yet.

    $ python manage.py runserver  

    Performing system checks...
    
    System check identified no issues (0 silenced).
    
    You have 13 unapplied migration(s). Your project may not work properly until you apply the migrations for app(s): admin, auth, contenttypes, sessions.
    Run 'python manage.py migrate' to apply them.
    
    May 20, 2018 - 14:47:32
    Django version 1.11.11, using settings 'django_zappa_sample.settings'
    Starting development server at http://127.0.0.1:8000/
    Quit the server with CONTROL-C.

    Also trying to access admin page (http://localhost:8000/admin/) will throw an “OperationalError” exception with below log at server end.

    Internal Server Error: /admin/
    Traceback (most recent call last):
      File "/home/velotio/Envs/django_zappa_sample/local/lib/python2.7/site-packages/django/core/handlers/exception.py", line 41, in inner
        response = get_response(request)
      File "/home/velotio/Envs/django_zappa_sample/local/lib/python2.7/site-packages/django/core/handlers/base.py", line 187, in _get_response
        response = self.process_exception_by_middleware(e, request)
      File "/home/velotio/Envs/django_zappa_sample/local/lib/python2.7/site-packages/django/core/handlers/base.py", line 185, in _get_response
        response = wrapped_callback(request, *callback_args, **callback_kwargs)
      File "/home/velotio/Envs/django_zappa_sample/local/lib/python2.7/site-packages/django/contrib/admin/sites.py", line 242, in wrapper
        return self.admin_view(view, cacheable)(*args, **kwargs)
      File "/home/velotio/Envs/django_zappa_sample/local/lib/python2.7/site-packages/django/utils/decorators.py", line 149, in _wrapped_view
        response = view_func(request, *args, **kwargs)
      File "/home/velotio/Envs/django_zappa_sample/local/lib/python2.7/site-packages/django/views/decorators/cache.py", line 57, in _wrapped_view_func
        response = view_func(request, *args, **kwargs)
      File "/home/velotio/Envs/django_zappa_sample/local/lib/python2.7/site-packages/django/contrib/admin/sites.py", line 213, in inner
        if not self.has_permission(request):
      File "/home/velotio/Envs/django_zappa_sample/local/lib/python2.7/site-packages/django/contrib/admin/sites.py", line 187, in has_permission
        return request.user.is_active and request.user.is_staff
      File "/home/velotio/Envs/django_zappa_sample/local/lib/python2.7/site-packages/django/utils/functional.py", line 238, in inner
        self._setup()
      File "/home/velotio/Envs/django_zappa_sample/local/lib/python2.7/site-packages/django/utils/functional.py", line 386, in _setup
        self._wrapped = self._setupfunc()
      File "/home/velotio/Envs/django_zappa_sample/local/lib/python2.7/site-packages/django/contrib/auth/middleware.py", line 24, in <lambda>
        request.user = SimpleLazyObject(lambda: get_user(request))
      File "/home/velotio/Envs/django_zappa_sample/local/lib/python2.7/site-packages/django/contrib/auth/middleware.py", line 12, in get_user
        request._cached_user = auth.get_user(request)
      File "/home/velotio/Envs/django_zappa_sample/local/lib/python2.7/site-packages/django/contrib/auth/__init__.py", line 211, in get_user
        user_id = _get_user_session_key(request)
      File "/home/velotio/Envs/django_zappa_sample/local/lib/python2.7/site-packages/django/contrib/auth/__init__.py", line 61, in _get_user_session_key
        return get_user_model()._meta.pk.to_python(request.session[SESSION_KEY])
      File "/home/velotio/Envs/django_zappa_sample/local/lib/python2.7/site-packages/django/contrib/sessions/backends/base.py", line 57, in __getitem__
        return self._session[key]
      File "/home/velotio/Envs/django_zappa_sample/local/lib/python2.7/site-packages/django/contrib/sessions/backends/base.py", line 207, in _get_session
        self._session_cache = self.load()
      File "/home/velotio/Envs/django_zappa_sample/local/lib/python2.7/site-packages/django/contrib/sessions/backends/db.py", line 35, in load
        expire_date__gt=timezone.now()
      File "/home/velotio/Envs/django_zappa_sample/local/lib/python2.7/site-packages/django/db/models/manager.py", line 85, in manager_method
        return getattr(self.get_queryset(), name)(*args, **kwargs)
      File "/home/velotio/Envs/django_zappa_sample/local/lib/python2.7/site-packages/django/db/models/query.py", line 374, in get
        num = len(clone)
      File "/home/velotio/Envs/django_zappa_sample/local/lib/python2.7/site-packages/django/db/models/query.py", line 232, in __len__
        self._fetch_all()
      File "/home/velotio/Envs/django_zappa_sample/local/lib/python2.7/site-packages/django/db/models/query.py", line 1118, in _fetch_all
        self._result_cache = list(self._iterable_class(self))
      File "/home/velotio/Envs/django_zappa_sample/local/lib/python2.7/site-packages/django/db/models/query.py", line 53, in __iter__
        results = compiler.execute_sql(chunked_fetch=self.chunked_fetch)
      File "/home/velotio/Envs/django_zappa_sample/local/lib/python2.7/site-packages/django/db/models/sql/compiler.py", line 899, in execute_sql
        raise original_exception
    OperationalError: no such table: django_session
    [20/May/2018 14:59:23] "GET /admin/ HTTP/1.1" 500 153553
    Not Found: /favicon.ico

    In order to fix this you need to run the migration into your database so that essential tables like auth_user, sessions, etc are created before any request is made to the server.

    $ python manage.py migrate 

    Operations to perform:
      Apply all migrations: admin, auth, contenttypes, sessions
    Running migrations:
      Applying contenttypes.0001_initial... OK
      Applying auth.0001_initial... OK
      Applying admin.0001_initial... OK
      Applying admin.0002_logentry_remove_auto_add... OK
      Applying contenttypes.0002_remove_content_type_name... OK
      Applying auth.0002_alter_permission_name_max_length... OK
      Applying auth.0003_alter_user_email_max_length... OK
      Applying auth.0004_alter_user_username_opts... OK
      Applying auth.0005_alter_user_last_login_null... OK
      Applying auth.0006_require_contenttypes_0002... OK
      Applying auth.0007_alter_validators_add_error_messages... OK
      Applying auth.0008_alter_user_username_max_length... OK
      Applying sessions.0001_initial... OK

    NOTE: Use DATABASES from project settings file to configure your database that you would want your Django application to use once hosted on AWS Lambda. By default, its configured to create a local SQLite database file as backend.

    You can run the server again and it should now load the admin panel of your website.

    Do verify if you have the zappa python package into your virtual environment before moving forward.

    Configuring Zappa Settings

    Deploying with Zappa is simple as it only needs a configuration file to run and rest will be managed by Zappa. To create this configuration file run from your project root directory –

    $ zappa init 

    ███████╗ █████╗ ██████╗ ██████╗  █████╗
    ╚══███╔╝██╔══██╗██╔══██╗██╔══██╗██╔══██╗
      ███╔╝ ███████║██████╔╝██████╔╝███████║
     ███╔╝  ██╔══██║██╔═══╝ ██╔═══╝ ██╔══██║
    ███████╗██║  ██║██║     ██║     ██║  ██║
    ╚══════╝╚═╝  ╚═╝╚═╝     ╚═╝     ╚═╝  ╚═╝
    
    Welcome to Zappa!
    
    Zappa is a system for running server-less Python web applications on AWS Lambda and AWS API Gateway.
    This `init` command will help you create and configure your new Zappa deployment.
    Let's get started!
    
    Your Zappa configuration can support multiple production stages, like 'dev', 'staging', and 'production'.
    What do you want to call this environment (default 'dev'): 
    
    AWS Lambda and API Gateway are only available in certain regions. Let's check to make sure you have a profile set up in one that will work.
    We found the following profiles: default, and hdx. Which would you like us to use? (default 'default'): 
    
    Your Zappa deployments will need to be uploaded to a private S3 bucket.
    If you don't have a bucket yet, we'll create one for you too.
    What do you want call your bucket? (default 'zappa-108wqhyn4'): django-zappa-sample-bucket
    
    It looks like this is a Django application!
    What is the module path to your projects's Django settings?
    We discovered: django_zappa_sample.settings
    Where are your project's settings? (default 'django_zappa_sample.settings'): 
    
    You can optionally deploy to all available regions in order to provide fast global service.
    If you are using Zappa for the first time, you probably don't want to do this!
    Would you like to deploy this application globally? (default 'n') [y/n/(p)rimary]: n
    
    Okay, here's your zappa_settings.json:
    
    {
        "dev": {
            "aws_region": "us-east-1", 
            "django_settings": "django_zappa_sample.settings", 
            "profile_name": "default", 
            "project_name": "django-zappa-sa", 
            "runtime": "python2.7", 
            "s3_bucket": "django-zappa-sample-bucket"
        }
    }
    
    Does this look okay? (default 'y') [y/n]: y
    
    Done! Now you can deploy your Zappa application by executing:
    
    	$ zappa deploy dev
    
    After that, you can update your application code with:
    
    	$ zappa update dev
    
    To learn more, check out our project page on GitHub here: https://github.com/Miserlou/Zappa
    and stop by our Slack channel here: https://slack.zappa.io
    
    Enjoy!,
     ~ Team Zappa!

    You can verify zappa_settings.json generated at your project root directory.

    TIP: The virtual environment name should not be the same as the Zappa project name, as this may cause errors.

    Additionally, you could specify other settings in  zappa_settings.json file as per requirement using Advanced Settings.

    Now, you’re ready to deploy!

    IAM Permissions

    In order to deploy the Django Application to Lambda/Gateway, setup an IAM role (eg. ZappaLambdaExecutionRole) with the following permissions:

    {
    "Version": "2012-10-17",
    "Statement": [
    {
    "Effect": "Allow",
    "Action": [
    "iam:AttachRolePolicy",
    "iam:CreateRole",
    "iam:GetRole",
    "iam:PutRolePolicy"
    ],
    "Resource": [
    "*"
    ]
    },
    {
    "Effect": "Allow",
    "Action": [
    "iam:PassRole"
    ],
    "Resource": [
    "arn:aws:iam:::role/*-ZappaLambdaExecutionRole"
    ]
    },
    {
    "Effect": "Allow",
    "Action": [
    "apigateway:DELETE",
    "apigateway:GET",
    "apigateway:PATCH",
    "apigateway:POST",
    "apigateway:PUT",
    "events:DeleteRule",
    "events:DescribeRule",
    "events:ListRules",
    "events:ListTargetsByRule",
    "events:ListRuleNamesByTarget",
    "events:PutRule",
    "events:PutTargets",
    "events:RemoveTargets",
    "lambda:AddPermission",
    "lambda:CreateFunction",
    "lambda:DeleteFunction",
    "lambda:GetFunction",
    "lambda:GetPolicy",
    "lambda:ListVersionsByFunction",
    "lambda:RemovePermission",
    "lambda:UpdateFunctionCode",
    "lambda:UpdateFunctionConfiguration",
    "cloudformation:CreateStack",
    "cloudformation:DeleteStack",
    "cloudformation:DescribeStackResource",
    "cloudformation:DescribeStacks",
    "cloudformation:ListStackResources",
    "cloudformation:UpdateStack",
    "logs:DescribeLogStreams",
    "logs:FilterLogEvents",
    "route53:ListHostedZones",
    "route53:ChangeResourceRecordSets",
    "route53:GetHostedZone",
    "s3:CreateBucket",
    ],
    "Resource": [
    "*"
    ]
    },
    {
    "Effect": "Allow",
    "Action": [
    "s3:ListBucket"
    ],
    "Resource": [
    "arn:aws:s3:::"
    ]
    },
    {
    "Effect": "Allow",
    "Action": [
    "s3:DeleteObject",
    "s3:GetObject",
    "s3:PutObject",
    "s3:CreateMultipartUpload",
    "s3:AbortMultipartUpload",
    "s3:ListMultipartUploadParts",
    "s3:ListBucketMultipartUploads"
    ],
    "Resource": [
    "arn:aws:s3:::/*"
    ]
    }
    ]
    }

    Deploying Django Application

    Before deploying the application, ensure that the IAM role is set in the config JSON as follows:

    {
    "dev": {
    ...
    "manage_roles": false, // Disable Zappa client managing roles.
    "role_name": "MyLambdaRole", // Name of your Zappa execution role. Optional, default: --ZappaExecutionRole.
    "role_arn": "arn:aws:iam::12345:role/app-ZappaLambdaExecutionRole", // ARN of your Zappa execution role. Optional.
    ...
    },
    ...
    }

    Once your settings are configured, you can package and deploy your application to a stage called “dev” with a single command:

    $ zappa deploy dev

    Calling deploy for stage dev..
    Downloading and installing dependencies..
    Packaging project as zip.
    Uploading django-zappa-sa-dev-1526831069.zip (10.9MiB)..
    100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 11.4M/11.4M [01:02<00:00, 75.3KB/s]
    Scheduling..
    Scheduled django-zappa-sa-dev-zappa-keep-warm-handler.keep_warm_callback with expression rate(4 minutes)!
    Uploading django-zappa-sa-dev-template-1526831157.json (1.6KiB)..
    100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.60K/1.60K [00:02<00:00, 792B/s]
    Waiting for stack django-zappa-sa-dev to create (this can take a bit)..
    100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:11<00:00,  2.92s/res]
    Deploying API Gateway..
    Deployment complete!: https://akg59b222b.execute-api.us-east-1.amazonaws.com/dev

    You should see that your Zappa deployment completed successfully with URL to API gateway created for your application.

    Troubleshooting

    1. If you are seeing the following error while deployment, it’s probably because you do not have sufficient privileges to run deployment on AWS Lambda. Ensure your IAM role has all the permissions as described above or set “manage_roles” to true so that Zappa can create and manage the IAM role for you.

    Calling deploy for stage dev..
    Creating django-zappa-sa-dev-ZappaLambdaExecutionRole IAM Role..
    Error: Failed to manage IAM roles!
    You may lack the necessary AWS permissions to automatically manage a Zappa execution role.
    To fix this, see here: https://github.com/Miserlou/Zappa#using-custom-aws-iam-roles-and-policies

    2. The below error will be caused as you have not listed “events.amazonaws.com” as Trusted Entity for your IAM Role. You can add the same or set “keep_warm” parameter to false in your Zappa settings file. Your Zappa deployment was partially deployed as it got terminated abnormally.

    Downloading and installing dependencies..
    100%|████████████████████████████████████████████| 44/44 [00:05<00:00, 7.92pkg/s]
    Packaging project as zip..
    Uploading django-zappa-sample-dev-1482817370.zip (8.8MiB)..
    100%|█████████████████████████████████████████| 9.22M/9.22M [00:17<00:00, 527KB/s]
    Scheduling...
    Oh no! An error occurred! :(
    
    ==============
    
    Traceback (most recent call last):
    Traceback (most recent call last):
      File "/home/velotio/Envs/django_zappa_sample/local/lib/python2.7/site-packages/zappa/cli.py", line 2610, in handle
        sys.exit(cli.handle())
      File "/home/velotio/Envs/django_zappa_sample/local/lib/python2.7/site-packages/zappa/cli.py", line 505, in handle
        self.dispatch_command(self.command, stage)
      File "/home/velotio/Envs/django_zappa_sample/local/lib/python2.7/site-packages/zappa/cli.py", line 539, in dispatch_command
        self.deploy(self.vargs['zip'])
      File "/home/velotio/Envs/django_zappa_sample/local/lib/python2.7/site-packages/zappa/cli.py", line 800, in deploy
        self.zappa.add_binary_support(api_id=api_id, cors=self.cors)
      File "/home/velotio/Envs/django_zappa_sample/local/lib/python2.7/site-packages/zappa/core.py", line 1490, in add_binary_support
        restApiId=api_id
      File "/home/velotio/Envs/django_zappa_sample/local/lib/python2.7/site-packages/botocore/client.py", line 314, in _api_call
        return self._make_api_call(operation_name, kwargs)
      File "/home/velotio/Envs/django_zappa_sample/local/lib/python2.7/site-packages/botocore/client.py", line 612, in _make_api_call
        raise error_class(parsed_response, operation_name)
    ClientError: An error occurred (ValidationError) when calling the PutRole operation: Provided role 'arn:aws:iam:484375727565:role/lambda_basic_execution' cannot be assumed by principal
    'events.amazonaws.com'.
    
    ==============
    
    Need help? Found a bug? Let us know! :D
    File bug reports on GitHub here: https://github.com/Miserlou/Zappa
    And join our Slack channel here: https://slack.zappa.io
    Love!,
    ~ Team Zappa!

    3. Adding the parameter and running zappa update will cause above error. As you can see it says “Stack django-zappa-sa-dev does not exists” as the previous deployment was unsuccessful. To fix this, delete the Lambda function from console and rerun the deployment.

    Downloading and installing dependencies..
    100%|████████████████████████████████████████████| 44/44 [00:05<00:00, 7.92pkg/s]
    Packaging project as zip..
    Uploading django-zappa-sample-dev-1482817370.zip (8.8MiB)..
    100%|█████████████████████████████████████████| 9.22M/9.22M [00:17<00:00, 527KB/s]
    Updating Lambda function code..
    Updating Lambda function configuration..
    Uploading djangoo-zapppa-sample-dev-template-1482817403.json (1.5KiB)..
    100%|████████████████████████████████████████| 1.56K/1.56K [00:00<00:00, 6.56KB/s]
    CloudFormation stack missing, re-deploy to enable updates
    ERROR:Could not get API ID.
    Traceback (most recent call last):
      File "/home/velotio/Envs/django_zappa_sample/local/lib/python2.7/site-packages/zappa/cli.py", line 2610, in handle
        sys.exit(cli.handle())
      File "/home/velotio/Envs/django_zappa_sample/local/lib/python2.7/site-packages/zappa/cli.py", line 505, in handle
        self.dispatch_command(self.command, stage)
      File "/home/velotio/Envs/django_zappa_sample/local/lib/python2.7/site-packages/zappa/cli.py", line 539, in dispatch_command
        self.deploy(self.vargs['zip'])
      File "/home/velotio/Envs/django_zappa_sample/local/lib/python2.7/site-packages/zappa/cli.py", line 800, in deploy
        self.zappa.add_binary_support(api_id=api_id, cors=self.cors)
      File "/home/velotio/Envs/django_zappa_sample/local/lib/python2.7/site-packages/zappa/core.py", line 1490, in add_binary_support
        restApiId=api_id
      File "/home/velotio/Envs/django_zappa_sample/local/lib/python2.7/site-packages/botocore/client.py", line 314, in _api_call
        return self._make_api_call(operation_name, kwargs)
      File "/home/velotio/Envs/django_zappa_sample/local/lib/python2.7/site-packages/botocore/client.py", line 612, in _make_api_call
        raise error_class(parsed_response, operation_name)
    ClientError: An error occurred (ValidationError) when calling the DescribeStackResource operation: Stack 'django-zappa-sa-dev' does not exist
    Deploying API Gateway..
    Oh no! An error occurred! :(
    
    ==============
    
    Traceback (most recent call last):
    File "/home/velotio/Envs/django_zappa_sample/local/lib/python2.7/site-packages/zappa/cli.py", line 1847, in handle
    sys.exit(cli.handle())
    File "/home/velotio/Envs/django_zappa_sample/local/lib/python2.7/site-packages/zappa/cli.py", line 345, in handle
    self.dispatch_command(self.command, environment)
    File "/home/velotio/Envs/django_zappa_sample/local/lib/python2.7/site-packages/zappa/cli.py", line 379, in dispatch_command
    self.update()
    File "/home/velotio/Envs/django_zappa_sample/local/lib/python2.7/site-packages/zappa/cli.py", line 605, in update
    endpoint_url = self.deploy_api_gateway(api_id)
    File "/home/velotio/Envs/django_zappa_sample/local/lib/python2.7/site-packages/zappa/cli.py", line 1816, in deploy_api_gateway
    cloudwatch_metrics_enabled=self.zappa_settings[self.api_stage].get('cloudwatch_metrics_enabled', False),
    File "/home/velotio/Envs/django_zappa_sample/local/lib/python2.7/site-packages/zappa/zappa.py", line 1014, in deploy_api_gateway
    variables=variables or {}
    File "/home/velotio/Envs/django_zappa_sample/local/lib/python2.7/site-packages/botocore/client.py", line 251, in _api_call
    return self._make_api_call(operation_name, kwargs)
    File "/home/velotio/Envs/django_zappa_sample/local/lib/python2.7/site-packages/botocore/client.py", line 513, in _make_api_call
    api_params, operation_model, context=request_context)
    File "/home/velotio/Envs/django_zappa_sample/local/lib/python2.7/site-packages/botocore/client.py", line 566, in _convert_to_request_dict
    api_params, operation_model)
    File "/home/velotio/Envs/django_zappa_sample/local/lib/python2.7/site-packages/botocore/validate.py", line 270, in serialize_to_request
    raise ParamValidationError(report=report.generate_report())
    ParamValidationError: Parameter validation failed:
    Invalid type for parameter restApiId, value: None, type: <type 'NoneType'>, valid types: <type 'basestring'>
    
    ==============
    
    Need help? Found a bug? Let us know! :D
    File bug reports on GitHub here: https://github.com/Miserlou/Zappa
    And join our Slack channel here: https://slack.zappa.io
    Love!,
    ~ Team Zappa!

    4.  If you run into any distribution error, please try down-grading your pip version to 9.0.1.

    $ pip install pip==9.0.1   

    Calling deploy for stage dev..
    Downloading and installing dependencies..
    Oh no! An error occurred! :(
    
    ==============
    
    Traceback (most recent call last):
      File "/home/velotio/Envs/django_zappa_sample/local/lib/python2.7/site-packages/zappa/cli.py", line 2610, in handle
        sys.exit(cli.handle())
      File "/home/velotio/Envs/django_zappa_sample/local/lib/python2.7/site-packages/zappa/cli.py", line 505, in handle
        self.dispatch_command(self.command, stage)
      File "/home/velotio/Envs/django_zappa_sample/local/lib/python2.7/site-packages/zappa/cli.py", line 539, in dispatch_command
        self.deploy(self.vargs['zip'])
      File "/home/velotio/Envs/django_zappa_sample/local/lib/python2.7/site-packages/zappa/cli.py", line 709, in deploy
        self.create_package()
      File "/home/velotio/Envs/django_zappa_sample/local/lib/python2.7/site-packages/zappa/cli.py", line 2171, in create_package
        disable_progress=self.disable_progress
      File "/home/velotio/Envs/django_zappa_sample/local/lib/python2.7/site-packages/zappa/core.py", line 595, in create_lambda_zip
        installed_packages = self.get_installed_packages(site_packages, site_packages_64)
      File "/home/velotio/Envs/django_zappa_sample/local/lib/python2.7/site-packages/zappa/core.py", line 751, in get_installed_packages
        pip.get_installed_distributions()
    AttributeError: 'module' object has no attribute 'get_installed_distributions'
    
    ==============
    
    Need help? Found a bug? Let us know! :D
    File bug reports on GitHub here: https://github.com/Miserlou/Zappa
    And join our Slack channel here: https://slack.zappa.io
    Love!,
     ~ Team Zappa!

    or,

    If you run into NotFoundException(Invalid REST API Identifier issue) please try undeploying the Zappa stage and retry again.

    Calling deploy for stage dev..
    Downloading and installing dependencies..
    Packaging project as zip.
    Uploading django-zappa-sa-dev-1526830532.zip (10.9MiB)..
    100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 11.4M/11.4M [00:42<00:00, 331KB/s]
    Scheduling..
    Scheduled django-zappa-sa-dev-zappa-keep-warm-handler.keep_warm_callback with expression rate(4 minutes)!
    Uploading django-zappa-sa-dev-template-1526830690.json (1.6KiB)..
    100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.60K/1.60K [00:01<00:00, 801B/s]
    Oh no! An error occurred! :(
    
    ==============
    
    Traceback (most recent call last):
      File "/home/velotio/Envs/django_zappa_sample/local/lib/python2.7/site-packages/zappa/cli.py", line 2610, in handle
        sys.exit(cli.handle())
      File "/home/velotio/Envs/django_zappa_sample/local/lib/python2.7/site-packages/zappa/cli.py", line 505, in handle
        self.dispatch_command(self.command, stage)
      File "/home/velotio/Envs/django_zappa_sample/local/lib/python2.7/site-packages/zappa/cli.py", line 539, in dispatch_command
        self.deploy(self.vargs['zip'])
      File "/home/velotio/Envs/django_zappa_sample/local/lib/python2.7/site-packages/zappa/cli.py", line 800, in deploy
        self.zappa.add_binary_support(api_id=api_id, cors=self.cors)
      File "/home/velotio/Envs/django_zappa_sample/local/lib/python2.7/site-packages/zappa/core.py", line 1490, in add_binary_support
        restApiId=api_id
      File "/home/velotio/Envs/django_zappa_sample/local/lib/python2.7/site-packages/botocore/client.py", line 314, in _api_call
        return self._make_api_call(operation_name, kwargs)
      File "/home/velotio/Envs/django_zappa_sample/local/lib/python2.7/site-packages/botocore/client.py", line 612, in _make_api_call
        raise error_class(parsed_response, operation_name)
    NotFoundException: An error occurred (NotFoundException) when calling the GetRestApi operation: Invalid REST API identifier specified 484375727565:akg59b222b
    
    ==============
    
    Need help? Found a bug? Let us know! :D
    File bug reports on GitHub here: https://github.com/Miserlou/Zappa
    And join our Slack channel here: https://slack.zappa.io
    Love!,
     ~ Team Zappa!

    TIP: To understand how your application works on serverless environment please visit this link.

    Post Deployment Setup

    Migrate database

    At this point, you should have an empty database for your Django application to fill up with a schema.

    $ zappa manage.py migrate dev

    Once you run above command the database migrations will be applied on the database as specified in your Django settings.

    Creating Superuser of Django Application

    You also might need to create a new superuser on the database. You could use the following command on your project directory.

    $ zappa invoke --raw dev "from django.contrib.auth.models import User; User.objects.create_superuser('username', 'username@yourdomain.com', 'password')"

    Alternatively,

    $ python manage createsuperuser

    Note that your application must be connected to the same database as this is run as standard Django administration command (not a Zappa command).

    Managing static files

    Your Django application will be having a dependency on static files, Django admin panel uses a combination of JS, CSS and image files.

    NOTE: Zappa is for running your application code, not for serving static web assets. If you plan on serving custom static assets in your web application (CSS/JavaScript/images/etc.), you’ll likely want to use a combination of AWS S3 and AWS CloudFront.

    You will need to add following packages to your virtual environment required for management of files to and from S3 django-storages and boto.

    $ pip install django-storages boto
    Add Django-Storage to your INSTALLED_APPS in settings.py
    INSTALLED_APPS = (
    ...,
    storages',
    )
    
    Configure Django-storage in settings.py as
    
    AWS_STORAGE_BUCKET_NAME = 'django-zappa-sample-bucket'
    AWS_S3_CUSTOM_DOMAIN = '%s.s3.amazonaws.com' % AWS_STORAGE_BUCKET_NAME
    STATIC_URL = "https://%s/" % AWS_S3_CUSTOM_DOMAIN
    STATICFILES_STORAGE = 'storages.backends.s3boto.S3BotoStorage'

    Once you have setup the Django application to serve your static files from AWS S3, run following command to upload the static file from your project to S3.

    $ python manage.py collectstatic --noinput

    or

    $ zappa update dev
    $ zappa manage dev "collectstatic --noinput"

    Check that at least 61 static files are moved to S3 bucket. Admin panel is built over  61 static files.

    NOTE: STATICFILES_DIR must be configured properly to collect your files from the appropriate location.

    Tip: You need to render static files in your templates by loading static path and using the same.  Example, {% static %}

    Setting Up API Gateway

    To connect to your Django application you also need to ensure you have API gateway setup for your AWS Lambda Function.  You need to have GET methods set up for all the URL resources used in your Django application. Alternatively, you can setup a proxy method to allow all subresources to be processed through one API method.

    Go to AWS Lambda function console and add API Gateway from ‘Add triggers’.

    1. Configure API, Deployment Stage, and Security for API Gateway. Click Save once it is done.

    2. Go to API Gateway console and,

    a. Recreate ANY method for / resource.

    i. Check `Use Lambda Proxy integration`

    ii. Set `Lambda Region` and `Lambda Function` and `Save` it.

    a. Recreate ANY method for /{proxy+} resource.

    i. Select `Lambda Function Proxy`

    ii. Set`Lambda Region` and `Lambda Function` and `Save` it.

    3. Click on Action and select Deploy API. Set Deployment Stage and click Deploy

    4. Ensure that GET and POST method for / and Proxy are set as Override for this method

    Setting Up Custom SSL Endpoint

    Optionally, you could also set up your own custom defined SSL endpoint with Zappa and install your certificate with your domain by running certify with Zappa. 

    $ zappa certify dev
    
    ...
    "certificate_arn": "arn:aws:acm:us-east-1:xxxxxxxxxxxx:certificate/xxxxxxxxxxxx-xxxxxx-xxxx-xxxx-xxxxxxxxxxxxxx",
    "domain": "django-zappa-sample.com"

    Now you are ready to launch your Django Application hosted on AWS Lambda.

    Additional Notes:

    •  Once deployed, you must run “zappa update <stage-name>” for updating your already hosted AWS Lambda function.</stage-name>
    • You can check server logs for investigation by running “zappa tail” command.
    • To un-deploy your application, simply run: `zappa undeploy <stage-name>`</stage-name>

    You’ve seen how to deploy Django application on AWS Lambda using Zappa. If you are creating your Django application for first time you might also want to read Edgar Roman’s Django Zappa Guide.

    Start building your Django application and let us know in the comments if you need any help during your application deployment over AWS Lambda.

  • Implementing Federated GraphQL Microservices using Apollo Federation

    Introduction

    GraphQL has revolutionized how a client queries a server. With the thin layer of GraphQL middleware, the client has the ability to query the data more comprehensively than what’s provided by the usual REST APIs.

    One of the key principles of GraphQL involves having a single data graph of the implementing services that will allow the client to have a unified interface to access more data and services through a single query. Having said that, it can be challenging to follow this principle for an enterprise-level application on a single, monolith GraphQL server.

    The Need for Federated Services

    James Baxley III, the Engineering Manager at Apollo, in his talk here, puts forward the rationale behind choosing an independently managed federated set of services very well.

    To summarize his point, let’s consider a very complex enterprise product. This product would essentially have multiple teams responsible for maintaining different modules of the product. Now, if we’re considering implementing a GraphQL layer at the backend, it would only make sense to follow the one graph principle of GraphQL: this says that to maximize the value of GraphQL, we should have a single unified data graph that’s operating at the data layer of this product. With that, it will be easier for a client to query a single graph and get all the data without having to query different graphs for different data portions.

    However, it would be challenging to have all of the huge enterprise data graphs’ layer logic residing on a single codebase. In addition, we want teams to be able to independently implement, maintain, and ship different schemas of the data graph on their own release cycles.

    Though there is only one graph, the implementation of that graph should be federated across multiple teams.

    Now, let’s consider a massive enterprise e-commerce platform as an example. The different schemas of the e-commerce platform look something like:

    Fig:- E-commerce platform set of schemas

    Considering the above example, it would be a chaotic task to maintain the graph implementation logic of all these schemas on a single code base. Another overhead that this would bring is having to scale a huge monolith that’s implementing all these services. 

    Thus, one solution is a federation of services for a single distributed data graph. Each service can be implemented independently by individual teams while maintaining their own release cycles and having their own iterations of their services. Also, a federated set of services would still follow the Onegraph principle of GraphQL, which will allow the client to query a single endpoint for fetching any part of the data graph.

    To further demonstrate the example above, let’s say the client asks for the top-five products, their reviews, and the vendor selling them. In a usual monolith GraphQL server, this query would involve writing a resolver that’s a mesh of the data sources of these individual schemas. It would be a task for teams to collaborate and come up with their individual implementations. Let’s consider a federated approach with separate services implementing products, reviews, and vendors. Each service is responsible for resolving only the part of the data graph that includes the schema and data source. This makes it extremely streamlined to allow different teams managing different schemas to collaborate easily.

    Another advantage would be handling the scaling of individual services rather than maintaining a compute-heavy monolith for a huge data graph. For example, the products service is used the most on the platform, and the vendors service is scarcely used. In case of a monolith approach, the scaling would’ve had to take place on the overall server. This is eliminated with federated services where we can independently maintain and scale individual services like the products service.

    Federated Implementation of GraphQL Services

    A monolith GraphQL server that implements a lot of services for different schemas can be challenging to scale. Instead of implementing the complete data graph on a single codebase, the responsibilities of different parts of the data graph can be split across multiple composable services. Each one will contain the implementation of only the part of the data graph it is responsible for. Apollo Federation allows this division of services and follows a declarative programming model to allow splitting of concerns.

    Architecture Overview

    This article will not cover the basics of GraphQL, such as writing resolvers and schemas. If you’re not acquainted with the basics of GraphQL and setting up a basic GraphQL server using Apollo, I would highly recommend reading about it here. Then, you can come back here to understand the implementation of federated services using Apollo Federation.

    Apollo Federation has two principal parts to it:

    • A collection of services that distinctly define separate GraphQL schemas
    • A gateway that builds the federated data graph and acts as a forefront to distinctly implement queries for different services
    Fig:- Apollo Federation Architecture

    Separation of Concerns

    The usual way of going about implementing federated services would be by splitting an existing monolith based on the existing schemas defined. Although this way seems like a clear approach, it will quickly cause problems when multiple Schemas are involved.

    To illustrate, this is a typical way to split services from a monolith based on the existing defined Schemas:

     

    In the example above, although the tweets field belongs to the User schema, it wouldn’t make sense to populate this field in the User service. The tweets field of a User should be declared and resolved in the Tweet service itself. Similarly, it wouldn’t be right to resolve the creator field inside the Tweet service.

    The reason behind this approach is the separation of concerns. The User service might not even have access to the Tweet datastore to be able to resolve the tweets field of a user. On the other hand, the Tweet service might not have access to the User datastore to resolve the creator field of the Tweet schema.

    Considering the above schemas, each service is responsible for resolving the respective field of each Schema it is responsible for.

    Implementation

    To illustrate an Apollo Federation, we’ll be considering a Nodejs server built with Typescript. The packages used are provided by the Apollo libraries.

    npm i --save apollo-server @apollo/federation @apollo/gateway

    Some additional libraries to help run the services in parallel:

    npm i --save nodemon ts-node concurrently

    Let’s go ahead and write the structure for the gateway service first. Let’s create a file gateway.ts:

    touch gateway.ts

    And add the following code snippet:

    import { ApolloServer } from 'apollo-server';
    import { ApolloGateway } from '@apollo/gateway';
    
    const gateway = new ApolloGateway({
      serviceList: [],
    });
    
    const server = new ApolloServer({ gateway, subscriptions: false });
    
    server.listen().then(({ url }) => {
      console.log(`Server ready at url: ${url}`);
    });

    Note the serviceList is an empty array for now since we’ve yet to implement the individual services. In addition, we pass the subscriptions: false option to the apollo server config because currently, Apollo Federation does not support subscriptions.

    Next, let’s add the User service in a separate file user.ts using:

    touch user.ts

    The code will go in the user service as follows:

    import { buildFederatedSchema } from '@apollo/federation';
    import { ApolloServer, gql } from 'apollo-server';
    import User from './datasources/models/User';
    import mongoStore from './mongoStore';
    
    const typeDefs = gql`
      type User @key(fields: "id") {
        id: ID!
        username: String!
      }
      extend type Query {
        users: [User]
        user(id: ID!): User
      }
      extend type Mutation {
        createUser(userPayload: UserPayload): User
      }
      input UserPayload {
        username: String!
      }
    `;
    
    const resolvers = {
      Query: {
        users: async () => {
          const allUsers = await User.find({});
          return allUsers;
        },
        user: async (_, { id }) => {
          const currentUser = await User.findOne({ _id: id });
          return currentUser;
        },
      },
      User: {
        __resolveReference: async (ref) => {
          const currentUser = await User.findOne({ _id: ref.id });
          return currentUser;
        },
      },
      Mutation: {
        createUser: async (_, { userPayload: { username } }) => {
          const user = new User({ username });
          const createdUser = await user.save();
          return createdUser;
        },
      },
    };
    
    mongoStore();
    
    const server = new ApolloServer({
      schema: buildFederatedSchema([{ typeDefs, resolvers }]),
    });
    
    server.listen({ port: 4001 }).then(({ url }) => {
      console.log(`User service ready at url: ${url}`);
    });

    Let’s break down the code that went into the User service.

    Consider the User schema definition:

    type User @key(fields: "id") {
       id: ID!
       username: String!
    }

    The @key directive helps other services understand the User schema is, in fact, an entity that can be extended within other individual services. The fields will help other services uniquely identify individual instances of the User schema based on the id.

    The Query and the Mutation types need to be extended by all implementing services according to the Apollo Federation documentation since they are always defined on a gateway level.

    As a side note, the User model imported from datasources/model/User

    import User from ‘./datasources/models/User’; is essentially a Mongoose ORM Model for MongoDB that will help in all the CRUD operations of a User entity in a MongoDB database. In addition, the mongoStore() function is responsible for establishing a connection to the MongoDB database server.

    The User model implementation internally in Mongoose ORM looks something like this:

    export const UserSchema = new Schema({
      username: {
        type: String,
      },
    });
    
    export default mongoose.model(
      'User',
      UserSchema
    );

    In the Query type, the users and the user(id: ID!) queries fetch a list or the details of individual users.

    In the resolvers, we define a __resolveReference function responsible for returning an instance of the User entity to all other implementing services, which just have a reference id of a User entity and need to return an instance of the User entity. The ref parameter is an object { id: ‘userEntityId’ } that contains the id of an instance of the User entity that may be passed down from other implementing services that need to resolve the reference of a User entity based on the reference id. Internally, we fire a mongoose .findOne query to return an instance of the User from the users database based on the reference id. To illustrate the resolver, 

    User: {
        __resolveReference: async (ref) => {
          const currentUser = User.findOne({ _id: ref.id });
          return currentUser;
        },
      },

    At the end of the file, we make sure the service is running on a unique port number 4001, which we pass as an option while running the apollo server. That concludes the User service.

    Next, let’s add the tweet service by creating a file tweet.ts using:

    touch tweet.ts

    The following code goes as a part of the tweet service:

    import { buildFederatedSchema } from '@apollo/federation';
    import { ApolloServer, gql } from 'apollo-server';
    import Tweet from './datasources/models/Tweet';
    import TweetAPI from './datasources/tweet';
    import mongoStore from './mongoStore';
    
    const typeDefs = gql`
      type Tweet {
        text: String
        id: ID!
        creator: User
      }
      extend type User @key(fields: "id") {
        id: ID! @external
        tweets: [Tweet]
      }
      extend type Query {
        tweet(id: ID!): Tweet
        tweets: [Tweet]
      }
      extend type Mutation {
        createTweet(tweetPayload: TweetPayload): Tweet
      }
      input TweetPayload {
        userId: String
        text: String
      }
    `;
    
    const resolvers = {
      Query: {
        tweet: async (_, { id }) => {
          const currentTweet = await Tweet.findOne({ _id: id });
          return currentTweet;
        },
        tweets: async () => {
          const tweetsList = await Tweet.find({});
          return tweetsList;
        },
      },
      Tweet: {
        creator: (tweet) => ({ __typename: 'User', id: tweet.userId }),
      },
      User: {
        tweets: async (user) => {
          const tweetsByUser = await Tweet.find({ userId: user.id });
          return tweetsByUser;
        },
      },
      Mutation: {
        createTweet: async (_, { tweetPayload: { text, userId } }) => {
          const newTweet = new Tweet({ text, userId });
          const createdTweet = await newTweet.save();
          return createdTweet;
        },
      },
    };
    
    mongoStore();
    
    const server = new ApolloServer({
      schema: buildFederatedSchema([{ typeDefs, resolvers }]),
    });
    
    server.listen({ port: 4002 }).then(({ url }) => {
      console.log(`Tweet service ready at url: ${url}`);
    });

    Let’s break down the Tweet service as well

    type Tweet {
       text: String
       id: ID!
       creator: User
    }

    The Tweet schema has the text field, which is the content of the tweet, a unique id of the tweet,  and a creator field, which is of the User entity type and resolves into the details of the user that created the tweet:

    extend type User @key(fields: "id") {
       id: ID! @external
       tweets: [Tweet]
    }

    We extend the User entity schema in this service, which has the id field with an @external directive. This helps the Tweet service understand that based on the given id field of the User entity schema, the instance of the User entity needs to be derived from another service (user service in this case).

    As we discussed previously, the tweets field of the extended User schema for the user entity should be resolved in the Tweet service since all the resolvers and access to the data sources with respect to the Tweets entity resides in this service.

    The Query and Mutation types of the Tweet service are pretty straightforward; we have a tweets and a tweet(id: ID!) queries to resolve a list or resolve an individual instance of the Tweet entity.

    Let’s further break down the resolvers:

    Tweet: {
       creator: (tweet) => ({ __typename: 'User', id: tweet.userId }),
    },

    To resolve the creator field of the Tweet entity, the Tweet service needs to tell the gateway that this field will be resolved by the User service. Hence, we pass the id of the User and a __typename for the gateway to be able to call the right service to resolve the User entity instance. In the User service earlier, we wrote a  __resolveReference resolver, which will resolve the reference of a User based on an id.

    User: {
       tweets: async (user) => {
           const tweetsByUser = await Tweet.find({ userId: user.id });
           return tweetsByUser;
       },
    },

    Now, we need to resolve the tweets field of the User entity extended in the Tweet service. We need to write a resolver where we get the parent user entity reference in the first argument of the resolver using which we can fire a Mongoose ORM query to return all the tweets created by the user given its id.

    At the end of the file, similar to the User service, we make sure the Tweet service runs on a different port by adding the port: 4002 option to the Apollo server config. That concludes both our implementing services.

    Now that we have our services ready, let’s update our gateway.ts file to reflect the added services:

    import { ApolloServer } from 'apollo-server';
    import { ApolloGateway } from '@apollo/gateway';
    
    const gateway = new ApolloGateway({
      serviceList: [
          { name: 'users', url: 'http://localhost:4001' },
          { name: 'tweets', url: 'http://localhost:4002' },
        ],
    });
    
    const server = new ApolloServer({ gateway, subscriptions: false });
    
    server.listen().then(({ url }) => {
      console.log(`Server ready at url: ${url}`);
    });

    We’ve added two services to the serviceList with a unique name to identify each service followed by the URL they are running on.

    Next, let’s make some small changes to the package.json file to make sure the services and the gateway run in parallel:

    "scripts": {
        "start": "concurrently -k npm:server:*",
        "server:gateway": "nodemon --watch 'src/**/*.ts' --exec 'ts-node' src/gateway.ts",
        "server:user": "nodemon --watch 'src/**/*.ts' --exec 'ts-node' src/user.ts",
        "server:tweet": "nodemon --watch 'src/**/*.ts' --exec 'ts-node' src/tweet.ts"
      },

    The concurrently library helps run 3 separate scripts in parallel. The server:* scripts spin up a dev server using nodemon to watch and reload the server for changes and ts-node to execute Typescript node.

    Let’s spin up our server:

    npm start

    On visiting the http://localhost:4000, you should see the GraphQL query playground running an Apollo server:

    Querying and Mutation from the Client

    Initially, let’s fire some mutations to create two users and some tweets by those users.

    Mutations

    Here we have created a user with the username “@elonmusk” that returns the id of the user. Fire the following mutations in the GraphQL playground:

     

    We will create another user named “@billgates” and take a note of the ID.

    Here is a simple mutation to create a tweet by the user “@elonmusk”. Now that we have two created users, let’s fire some mutations to create tweets by those users:

    Here is another mutation that creates a tweet by the user“@billgates”.

    After adding a couple of those, we are good to fire our queries, which will allow the gateway to compose the data by resolving fields through different services.

    Queries

    Initially, let’s list all the tweets along with their creator, which is of type User. The query will look something like:

    {
     tweets {
       text
       creator {
         username
       }
     }
    }

    When the gateway encounters a query asking for tweet data, it forwards that query to the Tweet service since the Tweet service that extends the Query type has a tweet query defined in it. 

    On encountering the creator field of the tweet schema, which is of the type User, the creator resolver within the Tweet service is invoked. This is essentially just passing a __typename and an id, which tells the gateway to resolve this reference from another service.

    In the User service, we have a __resolveReference function, which returns the complete instance of a user given it’s id passed from the Tweet service. It also helps all other implementing services that need the reference of a User entity resolved.

    On firing the query, the response should look something like:

    {
      "data": {
        "tweets": [
          {
            "text": "I own Tesla",
            "creator": {
              "username": "@elonmusk"
            }
          },
          {
            "text": "I own SpaceX",
            "creator": {
              "username": "@elonmusk"
            }
          },
          {
            "text": "I own PayPal",
            "creator": {
              "username": "@elonmusk"
            }
          },
          {
            "text": "I own Microsoft",
            "creator": {
              "username": "@billgates"
            }
          },
          {
            "text": "I own XBOX",
            "creator": {
              "username": "@billgates"
            }
          }
        ]
      }
    }

    Now, let’s try it the other way round. Let’s list all users and add the field tweets that will be an array of all the tweets created by that user. The query should look something like:

    {
     users {
       username
       tweets {
         text
       }
     }
    }

    When the gateway encounters the query of type users, it passes down that query to the user service. The User service is responsible for resolving the username field of the query.

    On encountering the tweets field of the users query, the gateway checks if any other implementing service has extended the User entity and has a resolver written within the service to resolve any additional fields of the type User.

    The Tweet service has extended the type User and has a resolver for the User type to resolve the tweets field, which will fetch all the tweets created by the user given the id of the user.

    On firing the query, the response should be something like:

    {
      "data": {
        "users": [
          {
            "username": "@elonmusk",
            "tweets": [
              {
                "text": "I own Tesla"
              },
              {
                "text": "I own SpaceX"
              },
              {
                "text": "I own PayPal"
              }
            ]
          },
          {
            "username": "@billgates",
            "tweets": [
              {
                "text": "I own Microsoft"
              },
              {
                "text": "I own XBOX"
              }
            ]
          }
        ]
      }
    }

    Conclusion

    To scale an enterprise data graph on a monolith GraphQL service brings along a lot of challenges. Having the ability to distribute our data graph into implementing services that can be individually maintained or scaled using Apollo Federation helps to quell any concerns.

    There are further advantages of federated services. Considering our example above, we could have two different kinds of datastores for the User and the Tweet service. While the User data could reside on a NoSQL database like MongoDB, the Tweet data could be on a SQL database like Postgres or SQL. This would be very easy to implement since each service is only responsible for resolving references only for the type they own.

    Final Thoughts

    One of the key advantages of having different services that can be maintained individually is the ability to deploy each service separately. In addition, this also enables deployment of different services independently to different platforms such as Firebase, Lambdas, etc.

    A single monolith GraphQL server deployed on an instance or a single serverless platform can have some challenges with respect to scaling an instance or handling high concurrency as mentioned above.

    By splitting out the services, we could have a separate serverless function for each implementing service that can be maintained or scaled individually and also a separate function on which the gateway can be deployed.

    One popular usage of GraphQL Federation can be seen in this Netflix Technology blog, where they’ve explained how they solved a bottleneck with the GraphQL APIs in Netflix Studio . What they did was create a federated GraphQL microservices architecture, along with a Schema store using Apollo Federation. This solution helped them create a unified schema but with distributed ownership and implementation.

  • A Step Towards Machine Learning Algorithms: Univariate Linear Regression

    These days the concept of Machine Learning is evolving rapidly. The understanding of it is so vast and open that everyone is having their independent thoughts about it. Here I am putting mine. This blog is my experience with the learning algorithms. In this blog, we will get to know the basic difference between Artificial Intelligence, Machine Learning, and Deep Learning. We will also get to know the foundation Machine Learning Algorithm i.e Univariate Linear Regression.

    Intermediate knowledge of Python and its library (Numpy, Pandas, MatPlotLib) is good to start. For Mathematics, a little knowledge of Algebra, Calculus and Graph Theory will help to understand the trick of the algorithm.

    A way to Artificial intelligence, Machine Learning, and Deep Learning

    These are the three buzzwords of today’s Internet world where we are seeing the future of the programming language. Specifically, we can say that this is the place where science domain meets with programming. Here we use scientific concepts and mathematics with a programming language to simulate the decision-making process. Artificial Intelligence is a program or the ability of a machine to make decisions more as humans do. Machine Learning is another program that supports Artificial Intelligence.  It helps the machine to observe the pattern and learn from it to make a decision. Here programming is helping in observing the patterns not in making decisions. Machine learning requires more and more information from various sources to observe all of the variables for any given pattern to make more accurate decisions. Here deep learning is supporting machine learning by creating a network (neural network) to fetch all required information and provide it to machine learning algorithms.

    What is Machine Learning

    Definition: Machine Learning provides machines with the ability to learn autonomously based on experiences, observations and analyzing patterns within a given data set without explicitly programming.

    This is a two-part process. In the first part, it observes and analyses the patterns of given data and makes a shrewd guess of a mathematical function that will be very close to the pattern. There are various methods for this. Few of them are Linear, Non-Linear, logistic, etc. Here we calculate the error function using the guessed mathematical function and the given data. In the second part we will minimize the error function. This minimized function is used for the prediction of the pattern.

    Here are the general steps to understand the process of Machine Learning:

    1. Plot the given dataset on x-y axis
    2. By looking into the graph, we will guess more close mathematical function
    3. Derive the Error function with the given dataset and guessed mathematical function
    4. Try to minimize an error function by using some algorithms
    5. Minimized error function will give us a more accurate mathematical function for the given patterns.

    Getting Started with the First Algorithms: Linear Regression with Univariable

    Linear Regression is a very basic algorithm or we can say the first and foundation algorithm to understand the concept of ML. We will try to understand this with an example of given data of prices of plots for a given area. This example will help us understand it better.

    movieID	title	userID	rating	timestamp
    0	1	Toy story	170	3.0	1162208198000
    1	1	Toy story	175	4.0	1133674606000
    2	1	Toy story	190	4.5	1057778398000
    3	1	Toy story	267	2.5	1084284499000
    4	1	Toy story	325	4.0	1134939391000
    5	1	Toy story	493	3.5	1217711355000
    6	1	Toy story	533	5.0	1050012402000
    7	1	Toy story	545	4.0	1162333326000
    8	1	Toy story	580	5.0	1162374884000
    9	1	Toy story	622	4.0	1215485147000
    10	1	Toy story	788	4.0	1188553740000

    With this data, we can easily determine the price of plots of the given area. But what if we want the price of the plot with area 5.0 * 10 sq mtr. There is no direct price of this in our given dataset. So how we can get the price of the plots with the area not given in the dataset. This we can do using Linear Regression.

    So at first, we will plot this data into a graph.

    The below graphs describe the area of plots (10 sq mtr) in x-axis and its prices in y-axis (Lakhs INR).

    Definition of Linear Regression

    The objective of a linear regression model is to find a relationship between one or more features (independent variables) and a continuous target variable(dependent variable). When there is only feature it is called Univariate Linear Regression and if there are multiple features, it is called Multiple Linear Regression.

    Hypothesis function:

    Here we will try to find the relation between price and area of plots. As this is an example of univariate, we can see that the price is only dependent on the area of the plot.

    By observing this pattern we can have our hypothesis function as below:

    f(x) = w * x + b

    where w is weightage and b is biased.

    For the different value set of (w,b) there can be multiple line possible but for one set of value, it will be close to this pattern.

    When we generalize this function for multivariable then there will be a set of values of w then these constants are also termed as model params.

    Note: There is a range of mathematical functions that relate to this pattern and selection of the function is totally up to us. But point to be taken care is that neither it should be under or overmatched and function must be continuous so that we can easily differentiate it or it should have global minima or maxima.

    Error for a point

    As our hypothesis function is continuous, for every Xi (area points) there will be one Yi  Predicted Price and Y will be the actual price.

    So the error at any point,

    Ei = Yi – Y = F(Xi) – Y

    These errors are also called as residuals. These residuals can be positive (if actual points lie below the predicted line) or negative (if actual points lie above the predicted line). Our motive is to minimize this residual for each of the points.

    Note: While observing the patterns it is possible that few points are very far from the pattern. For these far points, residuals will be much more so if these points are less in numbers than we can avoid these points considering that these are errors in the dataset. Such points are termed as outliers.

    Energy Functions

    As there are m training points, we can calculate the Average Energy function below

    E (w,b) =  1/m ( iΣm  (Ei) )

    and

    our motive is to minimize the energy functions

    min (E (w,b)) at point ( w,b )

    Little Calculus: For any continuous function, the points where the first derivative is zero are the points of either minima or maxima. If the second derivative is negative, it is the point of maxima and if it is positive, it is the point of minima.

    Here we will do the trick – we will convert our energy function into an upper parabola by squaring the error function. It will ensure that our energy function will have only one global minima (the point of our concern). It will simplify our calculation that where the first derivative of the energy function will be zero is the point that we need and the value of  (w,b) at that point will be our required point.

    So our final Energy function is

    E (w,b) =  1/2m ( iΣm  (Ei)2 )

    dividing by 2 doesn’t affect our result and at the time of derivation it will cancel out for e.g

    the first derivative of x2  is 2x.

    Gradient Descent Method

    Gradient descent is a generic optimization algorithm. It iteratively hit and trials the parameters of the model in order to minimize the energy function.

    In the above picture, we can see on the right side:

    1. w0 and w1 is the random initialization and by following gradient descent it is moving towards global minima.
    2. No of turns of the black line is the number of iterations so it must not be more or less.
    3. The distance between the turns is alpha i.e the learning parameter.

    By solving this left side equation we will be able to get model params at the global minima of energy functions.

    Points to consider at the time of Gradient Descent calculations:

    1. Random initialization: We start this algorithm at any random point that is set of random (w, b) value. By moving along this algorithm decide at which direction new trials have to be taken. As we know that it will be the upper parabola so by moving into the right direction (towards the global minima) we will get lesser value compared to the previous point.
    2. No of iterations: No of iteration must not be more or less. If it is lesser, we will not reach global minima and if it is more, then it will be extra calculations around the global minima.
    3. Alpha as learning parameters: when alpha is too small then gradient descent will be slow as it takes unnecessary steps to reach the global minima. If alpha is too big then it might overshoot the global minima. In this case it will neither converge nor diverge.

    Implementation of Gradient Descent in Python

    """ Method to read the csv file using Pandas and later use this data for linear regression. """
    """ Better run with Python 3+. """
    
    # Library to read csv file effectively
    import pandas
    import matplotlib.pyplot as plt
    import numpy as np
    
    # Method to read the csv file
    def load_data(file_name):
    	column_names = ['area', 'price']
    	# To read columns
    	io = pandas.read_csv(file_name,names=column_names, header=None)
    	x_val = (io.values[1:, 0])
    	y_val = (io.values[1:, 1])
    	size_array = len(y_val)
    	for i in range(size_array):
    		x_val[i] = float(x_val[i])
    		y_val[i] = float(y_val[i])
    		return x_val, y_val
    
    # Call the method for a specific file
    x_raw, y_raw = load_data('area-price.csv')
    x_raw = x_raw.astype(np.float)
    y_raw = y_raw.astype(np.float)
    y = y_raw
    
    # Modeling
    w, b = 0.1, 0.1
    num_epoch = 100
    converge_rate = np.zeros([num_epoch , 1], dtype=float)
    learning_rate = 1e-3
    for e in range(num_epoch):
    	# Calculate the gradient of the loss function with respect to arguments (model parameters) manually.
    	y_predicted = w * x_raw + b
    	grad_w, grad_b = (y_predicted - y).dot(x_raw), (y_predicted - y).sum()
    	# Update parameters.
    	w, b = w - learning_rate * grad_w, b - learning_rate * grad_b
    	converge_rate[e] = np.mean(np.square(y_predicted-y))
    
    print(w, b)
    print(f"predicted function f(x) = x * {w} + {b}" )
    calculatedprice = (10 * w) + b
    print(f"price of plot with area 10 sqmtr = 10 * {w} + {b} = {calculatedprice}")

    This is the basic implementation of Gradient Descent algorithms using numpy and Pandas. It is basically reading the area-price.csv file. Here we are normalizing the x-axis for better readability of data points over the graph. We have taken (w,b) as (0.1, 0.1) as random initialization. We have taken 100 as count of iterations and learning rate as .001.

    In every iteration, we are calculating w and b value and seeing it for converging rate.

    We can repeat this calculation for (w,b) for different values of random initialization, no of iterations and learning rate (alpha).

    Note: There is another python Library TensorFlow which is more preferable for such calculations. There are inbuilt functions of Gradient Descent in TensorFlow. But for better understanding, we have used library numpy and pandas here.

    RMSE (Root Mean Square Error)

    RMSE: This is the method to verify that our calculation of (w,b) is accurate at what extent. Below is the basic formula of calculation of RMSE where f is the predicted value and the observed value.

    Note: There is no absolute good or bad threshold value for RMSE, however, we can assume this based on our observed value. For an observed value ranges from 0 to 1000, the RMSE value of 0.7 is small, but if the range goes from 0 to 1, it is not that small.

    Conclusion

    As part of this article, we have seen a little introduction to Machine Learning and the need for it. Then with the help of a very basic example, we learned about one of the various optimization algorithms i.e. Linear Regression (for univariate only). This can be generalized for multivariate also. We then use the Gradient Descent Method for the calculation of the predicted data model in Linear Regression. We also learned the basic flow details of Gradient Descent. There is one example in python for displaying Linear Regression via Gradient Descent.

  • Publish APIs For Your Customers: Deploy Serverless Developer Portal For Amazon API Gateway

    Amazon API Gateway is a fully managed service that allows you to create, secure, publish, test and monitor your APIs. We often come across scenarios where customers of these APIs expect a platform to learn and discover APIs that are available to them (often with examples).

    The Serverless Developer Portal is one such application that is used for developer engagement by making your APIs available to your customers. Further, your customers can use the developer portal to subscribe to an API, browse API documentation, test published APIs, monitor their API usage, and submit their feedback.

    This blog is a detailed step-by-step guide for deploying the Serverless Developer Portal for APIs that are managed via Amazon API Gateway.

    Advantages

    The users of the Amazon API Gateway can be vaguely categorized as –

    API Publishers – They can use the Serverless Developer Portal to expose and secure their APIs for customers which can be integrated with AWS Marketplace for monetary benefits. Furthermore, they can customize the developer portal, including content, styling, logos, custom domains, etc. 

    API Consumers – They could be Frontend/Backend developers, third party customers, or simply students. They can explore available APIs, invoke the APIs, and go through the documentation to get an insight into how each API works with different requests. 

    Developer Portal Architecture

    We would need to establish a basic understanding of how the developer portal works. The Serverless Developer Portal is a serverless application built on microservice architecture using Amazon API Gateway, Amazon Cognito, AWS Lambda, Simple Storage Service and Amazon CloudFront. 

    The developer portal comprises multiple microservices and components as described in the following figure.

    Source: AWS

    There are a few key pieces in the above architecture –

    1. Identity Management: Amazon Cognito is basically the secure user directory of the developer portal responsible for user management. It allows you to configure triggers for registration, authentication, and confirmation, thereby giving you more control over the authentication process. 
    2. Business Logic: AWS Cloudfront is configured to serve your static content hosted in a private S3 bucket. The static content is built using the React JS framework which interacts with backend APIs dictating the business logic for various events. 
    3. Catalog Management: Developer portal uses catalog for rendering the APIs with Swagger specifications on the APIs page. The catalog file (catalog.json in S3 Artifact bucket) is updated whenever an API is published or removed. This is achieved by creating an S3 trigger on AWS Lambda responsible for studying the content of the catalog directory and generating a catalog for the developer portal.  
    4. API Key Creation: API Key is created for consumers at the time of registration. Whenever you subscribe to an API, associated Usage Plans are updated to your API key, thereby giving you access to those APIs as defined by the usage plan. Cognito User – API key mapping is stored in the DynamoDB table along with other registration related details.
    5. Static Asset Uploader: AWS Lambda (Static-Asset-Uploader) is responsible for updating/deploying static assets for the developer portal. Static assets include – content, logos, icons, CSS, JavaScripts, and other media files.

    Let’s move forward to building and deploying a simple Serverless Developer Portal.

    Building Your API

    Start with deploying an API which can be accessed using API Gateway from 

    https://<api-id>.execute-api.region.amazonaws.com/stage

    If you do not have any such API available, create a simple application by jumping to the section, “API Performance Across the Globe,” on this blog.

    Setup custom domain name

    For professional projects, I recommend that you create a custom domain name as they provide simpler and more intuitive URLs you can provide to your API users.

    Make sure your API Gateway domain name is updated in the Route53 record set created after you set up your custom domain name. 

    See more on Setting up custom domain names for REST APIs – Amazon API Gateway

    Enable CORS for an API Resource

    There are two ways you can enable CORS on a resource:

    1. Enable CORS Using the Console
    2. Enable CORS on a resource using the import API from Amazon API Gateway

    Let’s discuss the easiest way to do it using a console.

    1. Open API Gateway console.
    2. Select the API Gateway for your API from the list.
    3. Choose a resource to enable CORS for all the methods under that resource.
      Alternatively, you could choose a method under the resource to enable CORS for just this method.
    4. Select Enable CORS from the Actions drop-down menu.
    5. In the Enable CORS form, do the following:
      – Leave Access-Control-Allow-Headers and Access-Control-Allow-Origin header to default values.
      – Click on Enable CORS and replace existing CORS headers.
    6. Review the changes in Confirm method changes popup, choose Yes, overwrite existing values to apply your CORS settings.

    Once enabled, you can see a mock integration on the OPTIONS method for the selected resource. You must enable CORS for ${proxy} resources too. 

    To verify the CORS is enabled on API resource, try curl on OPTIONS method

    curl -v -X OPTIONS -H "Access-Control-Request-Method: POST" -H "Origin: http://example.com" https://api-id.execute-api.region.amazonaws.com/stage
    

    You should see the response OK in the header:

    < HTTP/1.1 200 OK
    < Content-Type: application/json
    < Content-Length: 0
    < Connection: keep-alive
    < Date: Mon, 13 Apr 2020 16:27:44 GMT
    < x-amzn-RequestId: a50b97b5-2437-436c-b99c-22e00bbe9430
    < Access-Control-Allow-Origin: *
    < Access-Control-Allow-Headers: Content-Type,X-Amz-Date,Authorization,X-Api-Key,X-Amz-Security-Token
    < x-amz-apigw-id: K7voBHDZIAMFu9g=
    < Access-Control-Allow-Methods: DELETE,GET,HEAD,OPTIONS,PATCH,POST,PUT
    < X-Cache: Miss from cloudfront
    < Via: 1.1 1c8c957c4a5bf1213bd57bd7d0ec6570.cloudfront.net (CloudFront)
    < X-Amz-Cf-Pop: BOM50-C1
    < X-Amz-Cf-Id: OmxFzV2-TH2BWPVyOohNrhNlJ-s1ZhYVKyoJaIrA_zyE9i0mRTYxOQ==

    Deploy Developer Portal

    There are two ways to deploy the developer portal for your API. 

    Using SAR

    An easy way will be to deploy api-gateway-dev-portal directly from AWS Serverless Application Repository. 

    Note -If you intend to upgrade your Developer portal to a major version then you need to refer to the Upgrading Instructions which is currently under development.

    Using AWS SAM

    1. Ensure that you have the latest AWS CLI and AWS SAM CLI installed and configured.
    2. Download or clone the API Gateway Serverless Developer Portal repository.
    3. Update the Cloudformation template file – cloudformation/template.yaml.

    Parameters you must configure and verify includes: 

    • ArtifactsS3BucketName
    • DevPortalSiteS3BucketName
    • DevPortalCustomersTableName
    • DevPortalPreLoginAccountsTableName
    • DevPortalAdminEmail
    • DevPortalFeedbackTableName
    • CognitoIdentityPoolName
    • CognitoDomainNameOrPrefix
    • CustomDomainName
    • CustomDomainNameAcmCertArn
    • UseRoute53Nameservers
    • AccountRegistrationMode

    You can view your template file in AWS Cloudformation Designer to get a better idea of all the components/services involved and how they are connected.

    See Developer portal settings for more information about parameters.

    1. Replace the static files in your project with the ones you would like to use.
      dev-portal/public/custom-content
      lambdas/static-asset-uploader/build
      api-logo contains the logos you would like to show on the API page (in png format). Portal checks for an api-id_stage.png file when rendering the API page. If not found, it chooses the default logo – default.png.
      content-fragments includes various markdown files comprising the content of the different pages in the portal. 
      Other static assets including favicon.ico, home-image.png and nav-logo.png that appear on your portal. 
    2. Let’s create a ZIP file of your code and dependencies, and upload it to Amazon S3. Running below command creates an AWS SAM template packaged.yaml, replacing references to local artifacts with the Amazon S3 location where the command uploaded the artifacts:
    sam package --template-file ./cloudformation/template.yaml --output-template-file ./cloudformation/packaged.yaml --s3-bucket {your-lambda-artifacts-bucket-name}

    1. Run the following command from the project root to deploy your portal, replace:
      – {your-template-bucket-name}
      with the name of your Amazon S3 bucket.
      – {custom-prefix}
      with a prefix that is globally unique.
      – {cognito-domain-or-prefix}
      with a unique string.
    sam deploy --template-file ./cloudformation/packaged.yaml --s3-bucket {your-template-bucket-name} --stack-name "{custom-prefix}-dev-portal" --capabilities CAPABILITY_NAMED_IAM

    Note: Ensure that you have required privileges to make deployments, as, during the deployment process, it attempts to create various resources such as AWS Lambda, Cognito User Pool, IAM roles, API Gateway, Cloudfront Distribution, etc. 

    After your developer portal has been fully deployed, you can get its URL by following.

    1. Open the AWS CloudFormation console.
    2. Select your stack you created above.
    3. Open the Outputs section. The URL for the developer portal is specified in the WebSiteURL property.

    Create Usage Plan

    Create a usage plan, to list your API under a subscribable APIs category allowing consumers to access the API using their API keys in the developer portal. Ensure that the API gateway stage is configured for the usage plan.

    Publishing an API

    Only Administrators have permission to publish an API. To create an Administrator account for your developer portal –

    1. Go to the WebSiteURL obtained after the successful deployment. 

    2. On the top right of the home page click on Register.

    Source: Github

    3. Fill the registration form and hit Sign up.

    4. Enter the confirmation code received on your email address provided in the previous step.

    5. Promote the user as Administrator by adding it to AdminGroup. 

    • Open Amazon Cognito User Pool console.
    • Select the User Pool created for your developer portal.
    • From the General Settings > Users and Groups page, select the User you want to promote as Administrator.
    • Click on Add to group and then select the Admin group from the dropdown and confirm.

    6. You will be required to log in again to log in as an Administrator. Click on the Admin Panel and choose the API you wish to publish from the APIs list.

    Setting up an account

    The signup process depends on the registration mode selected for the developer portal. 

    For request registration mode, you need to wait for the Administrator to approve your registration request.

    For invite registration mode, you can only register on the portal when invited by the portal administrator. 

    Subscribing an API

    1. Sign in to the developer portal.
    2. Navigate to the Dashboard page and Copy your API Key.
    3. Go to APIs Page to see a list of published APIs.
    4. Select an API you wish to subscribe to and hit the Subscribe button.

    Tips

    1. When a user subscribes to API, all the APIs published under that usage plan are accessible no matter whether they are published or not.
    2. Whenever you subscribe to an API, the catalog is exported from API Gateway resource documentation. You can customize the workflow or override the catalog swagger definition JSON in S3 bucket as defined in ArtifactsS3BucketName under /catalog/<apiid>_<stage>.json</stage></apiid>
    3. For backend APIs, CORS requests are allowed only from custom domain names selected for your developer portal.
    4. Ensure to set the CORS response header from the published API in order to invoke them from the developer portal.

    Summary

    You’ve seen how to deploy a Serverless Developer Portal and publish an API. If you are creating a serverless application for the first time, you might want to read more on Serverless Computing and AWS Gateway before you get started. 

    Start building your own developer portal. To know more on distributing your API Gateway APIs to your customers follow this AWS guide.

  • Using Packer and Terraform to Setup Jenkins Master-Slave Architecture

    Automation is everywhere and it is better to adopt it as soon as possible. Today, in this blog post, we are going to discuss creating the infrastructure. For this, we will be using AWS for hosting our deployment pipeline. Packer will be used to create AMI’s and Terraform will be used for creating the master/slaves. We will be discussing different ways of connecting the slaves and will also run a sample application with the pipeline.

    Please remember the intent of the blog is to accumulate all the different components together, this means some of the code which should be available in development code repo is also included here. Now that we have highlighted the required tools, 10000 ft view and intent of the blog. Let’s begin.

    Using Packer to Create AMI’s for Jenkins Master and Linux Slave

    Hashicorp has bestowed with some of the most amazing tools for simplifying our life. Packer is one of them. Packer can be used to create custom AMI from already available AMI’s. We just need to create a JSON file and pass installation script as part of creation and it will take care of developing the AMI for us. Install packer depending upon your requirement from Packer downloads page. For simplicity purpose, we will be using Linux machine for creating Jenkins Master and Linux Slave. JSON file for both of them will be same but can be separated if needed.

    Note: user-data passed from terraform will be different which will eventually differentiate their usage.

    We are using Amazon Linux 2 – JSON file for the same.

    {
      "builders": [
      {
        "ami_description": "{{user `ami-description`}}",
        "ami_name": "{{user `ami-name`}}",
        "ami_regions": [
          "us-east-1"
        ],
        "ami_users": [
          "XXXXXXXXXX"
        ],
        "ena_support": "true",
        "instance_type": "t2.medium",
        "region": "us-east-1",
        "source_ami_filter": {
          "filters": {
            "name": "amzn2-ami-hvm-2.0*x86_64*",
            "root-device-type": "ebs",
            "virtualization-type": "hvm"
          },
          "most_recent": true,
          "owners": [
            "amazon"
          ]
        },
        "sriov_support": "true",
        "ssh_username": "ec2-user",
        "tags": {
          "Name": "{{user `ami-name`}}"
        },
        "type": "amazon-ebs"
      }
    ],
    "post-processors": [
      {
        "inline": [
          "echo AMI Name {{user `ami-name`}}",
          "date",
          "exit 0"
        ],
        "type": "shell-local"
      }
    ],
    "provisioners": [
      {
        "script": "install_amazon.bash",
        "type": "shell"
      }
    ],
      "variables": {
        "ami-description": "Amazon Linux for Jenkins Master and Slave ({{isotime \"2006-01-02-15-04-05\"}})",
        "ami-name": "amazon-linux-for-jenkins-{{isotime \"2006-01-02-15-04-05\"}}",
        "aws_access_key": "",
        "aws_secret_key": ""
      }
    }

    As you can see the file is pretty simple. The only thing of interest here is the install_amazon.bash script. In this blog post, we will deploy a Node-based application which is running inside a docker container. Content of the bash file is as follows:

    #!/bin/bash
    
    set -x
    
    # For Node
    curl -sL https://rpm.nodesource.com/setup_10.x | sudo -E bash -
    
    # For xmlstarlet
    sudo yum install -y https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm
    
    sudo yum update -y
    
    sleep 10
    
    # Setting up Docker
    sudo yum install -y docker
    sudo usermod -a -G docker ec2-user
    
    # Just to be safe removing previously available java if present
    sudo yum remove -y java
    
    sudo yum install -y python2-pip jq unzip vim tree biosdevname nc mariadb bind-utils at screen tmux xmlstarlet git java-1.8.0-openjdk nc gcc-c++ make nodejs
    
    sudo -H pip install awscli bcrypt
    sudo -H pip install --upgrade awscli
    sudo -H pip install --upgrade aws-ec2-assign-elastic-ip
    
    sudo npm install -g @angular/cli
    
    sudo systemctl enable docker
    sudo systemctl enable atd
    
    sudo yum clean all
    sudo rm -rf /var/cache/yum/
    exit 0
    @velotiotech

    Now there are a lot of things mentioned let’s check them out. As mentioned earlier we will be discussing different ways of connecting to a slave and for one of them, we need xmlstarlet. Rest of the things are packages that we might need in one way or the other.

    Update ami_users with actual user value. This can be found on AWS console Under Support and inside of it Support Center.

    Validate what we have written is right or not by running packer validate amazon.json.

    Once confirmed, build the packer image by running packer build amazon.json.

    After completion check your AWS console and you will find a new AMI created in “My AMI’s”.

    It’s now time to start using terraform for creating the machines. 

    Prerequisite:

    1. Please make sure you create a provider.tf file.

    provider "aws" {
      region                  = "us-east-1"
      shared_credentials_file = "~/.aws/credentials"
      profile                 = "dev"
    }

    The ‘credentials file’ will contain aws_access_key_id and aws_secret_access_key.

    2.  Keep SSH keys handy for server/slave machines. Here is a nice article highlighting how to create it or else create them before hand on aws console and reference it in the code.

    3. VPC:

    # lookup for the "default" VPC
    data "aws_vpc" "default_vpc" {
      default = true
    }
    
    # subnet list in the "default" VPC
    # The "default" VPC has all "public subnets"
    data "aws_subnet_ids" "default_public" {
      vpc_id = "${data.aws_vpc.default_vpc.id}"
    }

    Creating Terraform Script for Spinning up Jenkins Master

    Creating Terraform Script for Spinning up Jenkins Master. Get terraform from terraform download page.

    We will need to set up the Security Group before setting up the instance.

    # Security Group:
    resource "aws_security_group" "jenkins_server" {
      name        = "jenkins_server"
      description = "Jenkins Server: created by Terraform for [dev]"
    
      # legacy name of VPC ID
      vpc_id = "${data.aws_vpc.default_vpc.id}"
    
      tags {
        Name = "jenkins_server"
        env  = "dev"
      }
    }
    
    ###############################################################################
    # ALL INBOUND
    ###############################################################################
    
    # ssh
    resource "aws_security_group_rule" "jenkins_server_from_source_ingress_ssh" {
      type              = "ingress"
      from_port         = 22
      to_port           = 22
      protocol          = "tcp"
      security_group_id = "${aws_security_group.jenkins_server.id}"
      cidr_blocks       = ["<Your Public IP>/32", "172.0.0.0/8"]
      description       = "ssh to jenkins_server"
    }
    
    # web
    resource "aws_security_group_rule" "jenkins_server_from_source_ingress_webui" {
      type              = "ingress"
      from_port         = 8080
      to_port           = 8080
      protocol          = "tcp"
      security_group_id = "${aws_security_group.jenkins_server.id}"
      cidr_blocks       = ["0.0.0.0/0"]
      description       = "jenkins server web"
    }
    
    # JNLP
    resource "aws_security_group_rule" "jenkins_server_from_source_ingress_jnlp" {
      type              = "ingress"
      from_port         = 33453
      to_port           = 33453
      protocol          = "tcp"
      security_group_id = "${aws_security_group.jenkins_server.id}"
      cidr_blocks       = ["172.31.0.0/16"]
      description       = "jenkins server JNLP Connection"
    }
    
    ###############################################################################
    # ALL OUTBOUND
    ###############################################################################
    
    resource "aws_security_group_rule" "jenkins_server_to_other_machines_ssh" {
      type              = "egress"
      from_port         = 22
      to_port           = 22
      protocol          = "tcp"
      security_group_id = "${aws_security_group.jenkins_server.id}"
      cidr_blocks       = ["0.0.0.0/0"]
      description       = "allow jenkins servers to ssh to other machines"
    }
    
    resource "aws_security_group_rule" "jenkins_server_outbound_all_80" {
      type              = "egress"
      from_port         = 80
      to_port           = 80
      protocol          = "tcp"
      security_group_id = "${aws_security_group.jenkins_server.id}"
      cidr_blocks       = ["0.0.0.0/0"]
      description       = "allow jenkins servers for outbound yum"
    }
    
    resource "aws_security_group_rule" "jenkins_server_outbound_all_443" {
      type              = "egress"
      from_port         = 443
      to_port           = 443
      protocol          = "tcp"
      security_group_id = "${aws_security_group.jenkins_server.id}"
      cidr_blocks       = ["0.0.0.0/0"]
      description       = "allow jenkins servers for outbound yum"
    }

    Now that we have a custom AMI and security groups for ourselves let’s use them to create a terraform instance.

    # AMI lookup for this Jenkins Server
    data "aws_ami" "jenkins_server" {
      most_recent      = true
      owners           = ["self"]
    
      filter {
        name   = "name"
        values = ["amazon-linux-for-jenkins*"]
      }
    }
    
    resource "aws_key_pair" "jenkins_server" {
      key_name   = "jenkins_server"
      public_key = "${file("jenkins_server.pub")}"
    }
    
    # lookup the security group of the Jenkins Server
    data "aws_security_group" "jenkins_server" {
      filter {
        name   = "group-name"
        values = ["jenkins_server"]
      }
    }
    
    # userdata for the Jenkins server ...
    data "template_file" "jenkins_server" {
      template = "${file("scripts/jenkins_server.sh")}"
    
      vars {
        env = "dev"
        jenkins_admin_password = "mysupersecretpassword"
      }
    }
    
    # the Jenkins server itself
    resource "aws_instance" "jenkins_server" {
      ami                    		= "${data.aws_ami.jenkins_server.image_id}"
      instance_type          		= "t3.medium"
      key_name               		= "${aws_key_pair.jenkins_server.key_name}"
      subnet_id              		= "${data.aws_subnet_ids.default_public.ids[0]}"
      vpc_security_group_ids 		= ["${data.aws_security_group.jenkins_server.id}"]
      iam_instance_profile   		= "dev_jenkins_server"
      user_data              		= "${data.template_file.jenkins_server.rendered}"
    
      tags {
        "Name" = "jenkins_server"
      }
    
      root_block_device {
        delete_on_termination = true
      }
    }
    
    output "jenkins_server_ami_name" {
        value = "${data.aws_ami.jenkins_server.name}"
    }
    
    output "jenkins_server_ami_id" {
        value = "${data.aws_ami.jenkins_server.id}"
    }
    
    output "jenkins_server_public_ip" {
      value = "${aws_instance.jenkins_server.public_ip}"
    }
    
    output "jenkins_server_private_ip" {
      value = "${aws_instance.jenkins_server.private_ip}"
    }

    As mentioned before, we will be discussing multiple ways in which we can connect the slaves to Jenkins master. But it is already known that every time a new Jenkins comes up, it generates a unique password. Now there are two ways to deal with this, one is to wait for Jenkins to spin up and retrieve that password or just directly edit the admin password while creating Jenkins master. Here we will be discussing how to change the password when configuring Jenkins. (If you need the script to retrieve Jenkins password as soon as it gets created than comment and I will share that with you as well).

    Below is the user data to install Jenkins master, configure its password and install required packages.

    #!/bin/bash
    
    set -x
    
    function wait_for_jenkins()
    {
      while (( 1 )); do
          echo "waiting for Jenkins to launch on port [8080] ..."
          
          nc -zv 127.0.0.1 8080
          if (( $? == 0 )); then
              break
          fi
    
          sleep 10
      done
    
      echo "Jenkins launched"
    }
    
    function updating_jenkins_master_password ()
    {
      cat > /tmp/jenkinsHash.py <<EOF
    import bcrypt
    import sys
    if not sys.argv[1]:
      sys.exit(10)
    plaintext_pwd=sys.argv[1]
    encrypted_pwd=bcrypt.hashpw(sys.argv[1], bcrypt.gensalt(rounds=10, prefix=b"2a"))
    isCorrect=bcrypt.checkpw(plaintext_pwd, encrypted_pwd)
    if not isCorrect:
      sys.exit(20);
    print "{}".format(encrypted_pwd)
    EOF
    
      chmod +x /tmp/jenkinsHash.py
      
      # Wait till /var/lib/jenkins/users/admin* folder gets created
      sleep 10
    
      cd /var/lib/jenkins/users/admin*
      pwd
      while (( 1 )); do
          echo "Waiting for Jenkins to generate admin user's config file ..."
    
          if [[ -f "./config.xml" ]]; then
              break
          fi
    
          sleep 10
      done
    
      echo "Admin config file created"
    
      admin_password=$(python /tmp/jenkinsHash.py ${jenkins_admin_password} 2>&1)
      
      # Please do not remove alter quote as it keeps the hash syntax intact or else while substitution, $<character> will be replaced by null
      xmlstarlet -q ed --inplace -u "/user/properties/hudson.security.HudsonPrivateSecurityRealm_-Details/passwordHash" -v '#jbcrypt:'"$admin_password" config.xml
    
      # Restart
      systemctl restart jenkins
      sleep 10
    }
    
    function install_packages ()
    {
    
      wget -O /etc/yum.repos.d/jenkins.repo http://pkg.jenkins-ci.org/redhat-stable/jenkins.repo
      rpm --import https://jenkins-ci.org/redhat/jenkins-ci.org.key
      yum install -y jenkins
    
      # firewall
      #firewall-cmd --permanent --new-service=jenkins
      #firewall-cmd --permanent --service=jenkins --set-short="Jenkins Service Ports"
      #firewall-cmd --permanent --service=jenkins --set-description="Jenkins Service firewalld port exceptions"
      #firewall-cmd --permanent --service=jenkins --add-port=8080/tcp
      #firewall-cmd --permanent --add-service=jenkins
      #firewall-cmd --zone=public --add-service=http --permanent
      #firewall-cmd --reload
      systemctl enable jenkins
      systemctl restart jenkins
      sleep 10
    }
    
    function configure_jenkins_server ()
    {
      # Jenkins cli
      echo "installing the Jenkins cli ..."
      cp /var/cache/jenkins/war/WEB-INF/jenkins-cli.jar /var/lib/jenkins/jenkins-cli.jar
    
      # Getting initial password
      # PASSWORD=$(cat /var/lib/jenkins/secrets/initialAdminPassword)
      PASSWORD="${jenkins_admin_password}"
      sleep 10
    
      jenkins_dir="/var/lib/jenkins"
      plugins_dir="$jenkins_dir/plugins"
    
      cd $jenkins_dir
    
      # Open JNLP port
      xmlstarlet -q ed --inplace -u "/hudson/slaveAgentPort" -v 33453 config.xml
    
      cd $plugins_dir || { echo "unable to chdir to [$plugins_dir]"; exit 1; }
    
      # List of plugins that are needed to be installed 
      plugin_list="git-client git github-api github-oauth github MSBuild ssh-slaves workflow-aggregator ws-cleanup"
    
      # remove existing plugins, if any ...
      rm -rfv $plugin_list
    
      for plugin in $plugin_list; do
          echo "installing plugin [$plugin] ..."
          java -jar $jenkins_dir/jenkins-cli.jar -s http://127.0.0.1:8080/ -auth admin:$PASSWORD install-plugin $plugin
      done
    
      # Restart jenkins after installing plugins
      java -jar $jenkins_dir/jenkins-cli.jar -s http://127.0.0.1:8080 -auth admin:$PASSWORD safe-restart
    }
    
    ### script starts here ###
    
    install_packages
    
    wait_for_jenkins
    
    updating_jenkins_master_password
    
    wait_for_jenkins
    
    configure_jenkins_server
    
    echo "Done"
    exit 0
    

    There is a lot of stuff that has been covered here. But the most tricky bit is changing Jenkins password. Here we are using a python script which uses brcypt to hash the plain text in Jenkins encryption format and xmlstarlet for replacing that password in the actual location. Also, we are using xmstarlet to edit the JNLP port for windows slave. Do remember initial username for Jenkins is admin.

    Command to run: Initialize terraform – terraform init , Check and apply – terraform plan -> terraform apply

    After successfully running apply command go to AWS console and check for a new instance coming up. Hit the <public ip=””>:8080 and enter credentials as you had passed and you will have the Jenkins master for yourself ready to be used. </public>

    Note: I will be providing the terraform script and permission list of IAM roles for the user at the end of the blog.

    Creating Terraform Script for Spinning up Linux Slave and connect it to master

    We won’t be creating a new image here rather use the same one that we used for Jenkins master.

    VPC will be same and updated Security groups for slave are below:

    resource "aws_security_group" "dev_jenkins_worker_linux" {
      name        = "dev_jenkins_worker_linux"
      description = "Jenkins Server: created by Terraform for [dev]"
    
    # legacy name of VPC ID
      vpc_id = "${data.aws_vpc.default_vpc.id}"
    
      tags {
        Name = "dev_jenkins_worker_linux"
        env  = "dev"
      }
    }
    
    ###############################################################################
    # ALL INBOUND
    ###############################################################################
    
    # ssh
    resource "aws_security_group_rule" "jenkins_worker_linux_from_source_ingress_ssh" {
      type              = "ingress"
      from_port         = 22
      to_port           = 22
      protocol          = "tcp"
      security_group_id = "${aws_security_group.dev_jenkins_worker_linux.id}"
      cidr_blocks       = ["<Your Public IP>/32"]
      description       = "ssh to jenkins_worker_linux"
    }
    
    # ssh
    resource "aws_security_group_rule" "jenkins_worker_linux_from_source_ingress_webui" {
      type              = "ingress"
      from_port         = 8080
      to_port           = 8080
      protocol          = "tcp"
      security_group_id = "${aws_security_group.dev_jenkins_worker_linux.id}"
      cidr_blocks       = ["0.0.0.0/0"]
      description       = "ssh to jenkins_worker_linux"
    }
    
    
    ###############################################################################
    # ALL OUTBOUND
    ###############################################################################
    
    resource "aws_security_group_rule" "jenkins_worker_linux_to_all_80" {
      type              = "egress"
      from_port         = 80
      to_port           = 80
      protocol          = "tcp"
      security_group_id = "${aws_security_group.dev_jenkins_worker_linux.id}"
      cidr_blocks       = ["0.0.0.0/0"]
      description       = "allow jenkins worker to all 80"
    }
    
    resource "aws_security_group_rule" "jenkins_worker_linux_to_all_443" {
      type              = "egress"
      from_port         = 443
      to_port           = 443
      protocol          = "tcp"
      security_group_id = "${aws_security_group.dev_jenkins_worker_linux.id}"
      cidr_blocks       = ["0.0.0.0/0"]
      description       = "allow jenkins worker to all 443"
    }
    
    resource "aws_security_group_rule" "jenkins_worker_linux_to_other_machines_ssh" {
      type              = "egress"
      from_port         = 22
      to_port           = 22
      protocol          = "tcp"
      security_group_id = "${aws_security_group.dev_jenkins_worker_linux.id}"
      cidr_blocks       = ["0.0.0.0/0"]
      description       = "allow jenkins worker linux to jenkins server"
    }
    
    resource "aws_security_group_rule" "jenkins_worker_linux_to_jenkins_server_8080" {
      type                     = "egress"
      from_port                = 8080
      to_port                  = 8080
      protocol                 = "tcp"
      security_group_id        = "${aws_security_group.dev_jenkins_worker_linux.id}"
      source_security_group_id = "${aws_security_group.jenkins_server.id}"
      description              = "allow jenkins workers linux to jenkins server"
    }

    Now that we have the required security groups in place it is time to bring into light terraform script for linux slave.

    data "aws_ami" "jenkins_worker_linux" {
      most_recent      = true
      owners           = ["self"]
    
      filter {
        name   = "name"
        values = ["amazon-linux-for-jenkins*"]
      }
    }
    
    resource "aws_key_pair" "jenkins_worker_linux" {
      key_name   = "jenkins_worker_linux"
      public_key = "${file("jenkins_worker.pub")}"
    }
    
    data "local_file" "jenkins_worker_pem" {
      filename = "${path.module}/jenkins_worker.pem"
    }
    
    data "template_file" "userdata_jenkins_worker_linux" {
      template = "${file("scripts/jenkins_worker_linux.sh")}"
    
      vars {
        env         = "dev"
        region      = "us-east-1"
        datacenter  = "dev-us-east-1"
        node_name   = "us-east-1-jenkins_worker_linux"
        domain      = ""
        device_name = "eth0"
        server_ip   = "${aws_instance.jenkins_server.private_ip}"
        worker_pem  = "${data.local_file.jenkins_worker_pem.content}"
        jenkins_username = "admin"
        jenkins_password = "mysupersecretpassword"
      }
    }
    
    # lookup the security group of the Jenkins Server
    data "aws_security_group" "jenkins_worker_linux" {
      filter {
        name   = "group-name"
        values = ["dev_jenkins_worker_linux"]
      }
    }
    
    resource "aws_launch_configuration" "jenkins_worker_linux" {
      name_prefix                 = "dev-jenkins-worker-linux"
      image_id                    = "${data.aws_ami.jenkins_worker_linux.image_id}"
      instance_type               = "t3.medium"
      iam_instance_profile        = "dev_jenkins_worker_linux"
      key_name                    = "${aws_key_pair.jenkins_worker_linux.key_name}"
      security_groups             = ["${data.aws_security_group.jenkins_worker_linux.id}"]
      user_data                   = "${data.template_file.userdata_jenkins_worker_linux.rendered}"
      associate_public_ip_address = false
    
      root_block_device {
        delete_on_termination = true
        volume_size = 100
      }
    
      lifecycle {
        create_before_destroy = true
      }
    }
    
    resource "aws_autoscaling_group" "jenkins_worker_linux" {
      name                      = "dev-jenkins-worker-linux"
      min_size                  = "1"
      max_size                  = "2"
      desired_capacity          = "2"
      health_check_grace_period = 60
      health_check_type         = "EC2"
      vpc_zone_identifier       = ["${data.aws_subnet_ids.default_public.ids}"]
      launch_configuration      = "${aws_launch_configuration.jenkins_worker_linux.name}"
      termination_policies      = ["OldestLaunchConfiguration"]
      wait_for_capacity_timeout = "10m"
      default_cooldown          = 60
    
      tags = [
        {
          key                 = "Name"
          value               = "dev_jenkins_worker_linux"
          propagate_at_launch = true
        },
        {
          key                 = "class"
          value               = "dev_jenkins_worker_linux"
          propagate_at_launch = true
        },
      ]
    }

    And now the final piece of code, which is user-data of slave machine.

    #!/bin/bash
    
    set -x
    
    function wait_for_jenkins ()
    {
        echo "Waiting jenkins to launch on 8080..."
    
        while (( 1 )); do
            echo "Waiting for Jenkins"
    
            nc -zv ${server_ip} 8080
            if (( $? == 0 )); then
                break
            fi
    
            sleep 10
        done
    
        echo "Jenkins launched"
    }
    
    function slave_setup()
    {
        # Wait till jar file gets available
        ret=1
        while (( $ret != 0 )); do
            wget -O /opt/jenkins-cli.jar http://${server_ip}:8080/jnlpJars/jenkins-cli.jar
            ret=$?
    
            echo "jenkins cli ret [$ret]"
        done
    
        ret=1
        while (( $ret != 0 )); do
            wget -O /opt/slave.jar http://${server_ip}:8080/jnlpJars/slave.jar
            ret=$?
    
            echo "jenkins slave ret [$ret]"
        done
        
        mkdir -p /opt/jenkins-slave
        chown -R ec2-user:ec2-user /opt/jenkins-slave
    
        # Register_slave
        JENKINS_URL="http://${server_ip}:8080"
    
        USERNAME="${jenkins_username}"
        
        # PASSWORD=$(cat /tmp/secret)
        PASSWORD="${jenkins_password}"
    
        SLAVE_IP=$(ip -o -4 addr list ${device_name} | head -n1 | awk '{print $4}' | cut -d/ -f1)
        NODE_NAME=$(echo "jenkins-slave-linux-$SLAVE_IP" | tr '.' '-')
        NODE_SLAVE_HOME="/opt/jenkins-slave"
        EXECUTORS=2
        SSH_PORT=22
    
        CRED_ID="$NODE_NAME"
        LABELS="build linux docker"
        USERID="ec2-user"
    
        cd /opt
        
        # Creating CMD utility for jenkins-cli commands
        jenkins_cmd="java -jar /opt/jenkins-cli.jar -s $JENKINS_URL -auth $USERNAME:$PASSWORD"
    
        # Waiting for Jenkins to load all plugins
        while (( 1 )); do
    
          count=$($jenkins_cmd list-plugins 2>/dev/null | wc -l)
          ret=$?
    
          echo "count [$count] ret [$ret]"
    
          if (( $count > 0 )); then
              break
          fi
    
          sleep 30
        done
    
        # Delete Credentials if present for respective slave machines
        $jenkins_cmd delete-credentials system::system::jenkins _ $CRED_ID
    
        # Generating cred.xml for creating credentials on Jenkins server
        cat > /tmp/cred.xml <<EOF
    <com.cloudbees.jenkins.plugins.sshcredentials.impl.BasicSSHUserPrivateKey plugin="ssh-credentials@1.16">
      <scope>GLOBAL</scope>
      <id>$CRED_ID</id>
      <description>Generated via Terraform for $SLAVE_IP</description>
      <username>$USERID</username>
      <privateKeySource class="com.cloudbees.jenkins.plugins.sshcredentials.impl.BasicSSHUserPrivateKey\$DirectEntryPrivateKeySource">
        <privateKey>${worker_pem}</privateKey>
      </privateKeySource>
    </com.cloudbees.jenkins.plugins.sshcredentials.impl.BasicSSHUserPrivateKey>
    EOF
    
        # Creating credential using cred.xml
        cat /tmp/cred.xml | $jenkins_cmd create-credentials-by-xml system::system::jenkins _
    
        # For Deleting Node, used when testing
        $jenkins_cmd delete-node $NODE_NAME
        
        # Generating node.xml for creating node on Jenkins server
        cat > /tmp/node.xml <<EOF
    <slave>
      <name>$NODE_NAME</name>
      <description>Linux Slave</description>
      <remoteFS>$NODE_SLAVE_HOME</remoteFS>
      <numExecutors>$EXECUTORS</numExecutors>
      <mode>NORMAL</mode>
      <retentionStrategy class="hudson.slaves.RetentionStrategy\$Always"/>
      <launcher class="hudson.plugins.sshslaves.SSHLauncher" plugin="ssh-slaves@1.5">
        <host>$SLAVE_IP</host>
        <port>$SSH_PORT</port>
        <credentialsId>$CRED_ID</credentialsId>
      </launcher>
      <label>$LABELS</label>
      <nodeProperties/>
      <userId>$USERID</userId>
    </slave>
    EOF
    
      sleep 10
      
      # Creating node using node.xml
      cat /tmp/node.xml | $jenkins_cmd create-node $NODE_NAME
    }
    
    ### script begins here ###
    
    wait_for_jenkins
    
    slave_setup
    
    echo "Done"
    exit 0

    This will not only create a node on Jenkins master but also attach it.

    Command to run: Initialize terraform – terraform init, Check and apply – terraform plan -> terraform apply

    One drawback of this is, if by any chance slave gets disconnected or goes down, it will remain on Jenkins master as offline, also it will not manually attach itself to Jenkins master.

    Some solutions for them are:

    1. Create a cron job on the slave which will run user-data after a certain interval.

    2. Use swarm plugin.

    3. As we are on AWS, we can even use Amazon EC2 Plugin.

    Maybe in a future blog, we will cover using both of these plugins as well.

    Using Packer to create AMI’s for Windows Slave

    Windows AMI will also be created using packer. All the pointers for Windows will remain as it were for Linux.

    {
      "variables": {
        "ami-description": "Windows Server for Jenkins Slave ({{isotime \"2006-01-02-15-04-05\"}})",
        "ami-name": "windows-slave-for-jenkins-{{isotime \"2006-01-02-15-04-05\"}}",
        "aws_access_key": "",
        "aws_secret_key": ""
      },
    
      "builders": [
        {
          "ami_description": "{{user `ami-description`}}",
          "ami_name": "{{user `ami-name`}}",
          "ami_regions": [
            "us-east-1"
          ],
          "ami_users": [
            "XXXXXXXXXX"
          ],
          "ena_support": "true",
          "instance_type": "t3.medium",
          "region": "us-east-1",
          "source_ami_filter": {
            "filters": {
              "name": "Windows_Server-2016-English-Full-Containers-*",
              "root-device-type": "ebs",
              "virtualization-type": "hvm"
            },
            "most_recent": true,
            "owners": [
              "amazon"
            ]
          },
          "sriov_support": "true",
          "user_data_file": "scripts/SetUpWinRM.ps1",
          "communicator": "winrm",
          "winrm_username": "Administrator",
          "winrm_insecure": true,
          "winrm_use_ssl": true,
          "tags": {
            "Name": "{{user `ami-name`}}"
          },
          "type": "amazon-ebs"
        }
      ],
      "post-processors": [
      {
        "inline": [
          "echo AMI Name {{user `ami-name`}}",
          "date",
          "exit 0"
        ],
        "type": "shell-local"
      }
      ],
      "provisioners": [
        {
          "type": "powershell",
          "valid_exit_codes": [ 0, 3010 ],
          "scripts": [
            "scripts/disable-uac.ps1",
            "scripts/enable-rdp.ps1",
            "install_windows.ps1"
          ]
        },
        {
          "type": "windows-restart",
          "restart_check_command": "powershell -command \"& {Write-Output 'restarted.'}\""
        },
        {
          "type": "powershell",
          "inline": [
            "C:\\ProgramData\\Amazon\\EC2-Windows\\Launch\\Scripts\\InitializeInstance.ps1 -Schedule",
            "C:\\ProgramData\\Amazon\\EC2-Windows\\Launch\\Scripts\\SysprepInstance.ps1 -NoShutdown"
          ]
        }
      ]
    }

    Now when it comes to windows one should know that it does not behave the same way Linux does. For us to be able to communicate with this image an essential component required is WinRM. We set it up at the very beginning as part of user_data_file. Also, windows require user input for a lot of things and while automating it is not possible to provide it as it will break the flow of execution so we disable UAC and enable RDP so that we can connect to that machine from our local desktop for debugging if needed. And at last, we will execute install_windows.ps1 file which will set up our slave. Please note at the last we are calling two PowerShell scripts to generate random password every time a new machine is created. It is mandatory to have them or you will never be able to login into your machines.

    There are multiple user-data in the above code, let’s understand them in their order of appearance.

    SetUpWinRM.ps1:

    <powershell>
    
    write-output "Running User Data Script"
    write-host "(host) Running User Data Script"
    
    Set-ExecutionPolicy Unrestricted -Scope LocalMachine -Force -ErrorAction Ignore
    
    # Don't set this before Set-ExecutionPolicy as it throws an error
    $ErrorActionPreference = "stop"
    
    # Remove HTTP listener
    Remove-Item -Path WSMan:\Localhost\listener\listener* -Recurse
    
    $Cert = New-SelfSignedCertificate -CertstoreLocation Cert:\LocalMachine\My -DnsName "packer"
    New-Item -Path WSMan:\LocalHost\Listener -Transport HTTPS -Address * -CertificateThumbPrint $Cert.Thumbprint -Force
    
    # WinRM
    write-output "Setting up WinRM"
    write-host "(host) setting up WinRM"
    
    cmd.exe /c winrm quickconfig -q
    cmd.exe /c winrm set "winrm/config" '@{MaxTimeoutms="1800000"}'
    cmd.exe /c winrm set "winrm/config/winrs" '@{MaxMemoryPerShellMB="1024"}'
    cmd.exe /c winrm set "winrm/config/service" '@{AllowUnencrypted="true"}'
    cmd.exe /c winrm set "winrm/config/client" '@{AllowUnencrypted="true"}'
    cmd.exe /c winrm set "winrm/config/service/auth" '@{Basic="true"}'
    cmd.exe /c winrm set "winrm/config/client/auth" '@{Basic="true"}'
    cmd.exe /c winrm set "winrm/config/service/auth" '@{CredSSP="true"}'
    cmd.exe /c winrm set "winrm/config/listener?Address=*+Transport=HTTPS" "@{Port=`"5986`";Hostname=`"packer`";CertificateThumbprint=`"$($Cert.Thumbprint)`"}"
    cmd.exe /c netsh advfirewall firewall set rule group="remote administration" new enable=yes
    cmd.exe /c netsh firewall add portopening TCP 5986 "Port 5986"
    cmd.exe /c net stop winrm
    cmd.exe /c sc config winrm start= auto
    cmd.exe /c net start winrm
    
    </powershell>

    The content is pretty straightforward as it is just setting up WInRM. The only thing that matters here is the <powershell> and </powershell>. They are mandatory as packer will not be able to understand what is the type of script. Next, we come across disable-uac.ps1 & enable-rdp.ps1, and we have discussed their purpose before. The last user-data is the actual user-data that we need to install all the required packages in the AMI.

    Chocolatey: a blessing in disguise – Installing required applications in windows by scripting is a real headache as you have to write a lot of stuff just to install a single application but luckily for us we have chocolatey. It works as a package manager for windows and helps us install applications as we are installing packages on Linux. install_windows.ps1 has installation step for chocolatey and how it can be used to install other applications on windows.

    See, such a small script and you can get all the components to run your Windows application in no time (Kidding… This script actually takes around 20 minutes to run :P)

    Remaining user-data can be found here.

    Now that we have the image for ourselves let’s start with terraform script to make this machine a slave of your Jenkins master.

    Creating Terraform Script for Spinning up Windows Slave and Connect it to Master

    This time also we will first create the security groups and then create the slave machine from the same AMI that we developed above.

    resource "aws_security_group" "dev_jenkins_worker_windows" {
      name        = "dev_jenkins_worker_windows"
      description = "Jenkins Server: created by Terraform for [dev]"
    
      # legacy name of VPC ID
      vpc_id = "${data.aws_vpc.default_vpc.id}"
    
      tags {
        Name = "dev_jenkins_worker_windows"
        env  = "dev"
      }
    }
    
    ###############################################################################
    # ALL INBOUND
    ###############################################################################
    
    # ssh
    resource "aws_security_group_rule" "jenkins_worker_windows_from_source_ingress_webui" {
      type              = "ingress"
      from_port         = 8080
      to_port           = 8080
      protocol          = "tcp"
      security_group_id = "${aws_security_group.dev_jenkins_worker_windows.id}"
      cidr_blocks       = ["0.0.0.0/0"]
      description       = "ssh to jenkins_worker_windows"
    }
    
    # rdp
    resource "aws_security_group_rule" "jenkins_worker_windows_from_rdp" {
      type              = "ingress"
      from_port         = 3389
      to_port           = 3389
      protocol          = "tcp"
      security_group_id = "${aws_security_group.dev_jenkins_worker_windows.id}"
      cidr_blocks       = ["<Your Public IP>/32"]
      description       = "rdp to jenkins_worker_windows"
    }
    
    ###############################################################################
    # ALL OUTBOUND
    ###############################################################################
    
    resource "aws_security_group_rule" "jenkins_worker_windows_to_all_80" {
      type              = "egress"
      from_port         = 80
      to_port           = 80
      protocol          = "tcp"
      security_group_id = "${aws_security_group.dev_jenkins_worker_windows.id}"
      cidr_blocks       = ["0.0.0.0/0"]
      description       = "allow jenkins worker to all 80"
    }
    
    resource "aws_security_group_rule" "jenkins_worker_windows_to_all_443" {
      type              = "egress"
      from_port         = 443
      to_port           = 443
      protocol          = "tcp"
      security_group_id = "${aws_security_group.dev_jenkins_worker_windows.id}"
      cidr_blocks       = ["0.0.0.0/0"]
      description       = "allow jenkins worker to all 443"
    }
    
    resource "aws_security_group_rule" "jenkins_worker_windows_to_jenkins_server_33453" {
      type              = "egress"
      from_port         = 33453
      to_port           = 33453
      protocol          = "tcp"
      security_group_id = "${aws_security_group.dev_jenkins_worker_windows.id}"
      cidr_blocks       = ["172.31.0.0/16"]
      description       = "allow jenkins worker windows to jenkins server"
    }
    
    resource "aws_security_group_rule" "jenkins_worker_windows_to_jenkins_server_8080" {
      type                     = "egress"
      from_port                = 8080
      to_port                  = 8080
      protocol                 = "tcp"
      security_group_id        = "${aws_security_group.dev_jenkins_worker_windows.id}"
      source_security_group_id = "${aws_security_group.jenkins_server.id}"
      description              = "allow jenkins workers windows to jenkins server"
    }
    
    resource "aws_security_group_rule" "jenkins_worker_windows_to_all_22" {
      type              = "egress"
      from_port         = 22
      to_port           = 22
      protocol          = "tcp"
      security_group_id = "${aws_security_group.dev_jenkins_worker_windows.id}"
      cidr_blocks       = ["0.0.0.0/0"]
      description       = "allow jenkins worker windows to connect outbound from 22"
    }

    Once security groups are in place we move towards creating the terraform file for windows machine itself. Windows can’t connect to Jenkins master using SSH the method we used while connecting the Linux slave instead we have to use JNLP. A quick recap, when creating Jenkins master we used xmlstarlet to modify the JNLP port and also added rules in sg group to allow connection for JNLP. Also, we have opened the port for RDP so that if any issue occurs you can get in the machine and debug it.

    Terraform file:

    # Setting Up Windows Slave 
    data "aws_ami" "jenkins_worker_windows" {
      most_recent      = true
      owners           = ["self"]
    
      filter {
        name   = "name"
        values = ["windows-slave-for-jenkins*"]
      }
    }
    
    resource "aws_key_pair" "jenkins_worker_windows" {
      key_name   = "jenkins_worker_windows"
      public_key = "${file("jenkins_worker.pub")}"
    }
    
    data "template_file" "userdata_jenkins_worker_windows" {
      template = "${file("scripts/jenkins_worker_windows.ps1")}"
    
      vars {
        env         = "dev"
        region      = "us-east-1"
        datacenter  = "dev-us-east-1"
        node_name   = "us-east-1-jenkins_worker_windows"
        domain      = ""
        device_name = "eth0"
        server_ip   = "${aws_instance.jenkins_server.private_ip}"
        worker_pem  = "${data.local_file.jenkins_worker_pem.content}"
        jenkins_username = "admin"
        jenkins_password = "mysupersecretpassword"
      }
    }
    
    # lookup the security group of the Jenkins Server
    data "aws_security_group" "jenkins_worker_windows" {
      filter {
        name   = "group-name"
        values = ["dev_jenkins_worker_windows"]
      }
    }
    
    resource "aws_launch_configuration" "jenkins_worker_windows" {
      name_prefix                 = "dev-jenkins-worker-"
      image_id                    = "${data.aws_ami.jenkins_worker_windows.image_id}"
      instance_type               = "t3.medium"
      iam_instance_profile        = "dev_jenkins_worker_windows"
      key_name                    = "${aws_key_pair.jenkins_worker_windows.key_name}"
      security_groups             = ["${data.aws_security_group.jenkins_worker_windows.id}"]
      user_data                   = "${data.template_file.userdata_jenkins_worker_windows.rendered}"
      associate_public_ip_address = false
    
      root_block_device {
        delete_on_termination = true
        volume_size = 100
      }
    
      lifecycle {
        create_before_destroy = true
      }
    }
    
    resource "aws_autoscaling_group" "jenkins_worker_windows" {
      name                      = "dev-jenkins-worker-windows"
      min_size                  = "1"
      max_size                  = "2"
      desired_capacity          = "2"
      health_check_grace_period = 60
      health_check_type         = "EC2"
      vpc_zone_identifier       = ["${data.aws_subnet_ids.default_public.ids}"]
      launch_configuration      = "${aws_launch_configuration.jenkins_worker_windows.name}"
      termination_policies      = ["OldestLaunchConfiguration"]
      wait_for_capacity_timeout = "10m"
      default_cooldown          = 60
    
      #lifecycle {
      #  create_before_destroy = true
      #}
    
    
      ## on replacement, gives new service time to spin up before moving on to destroy
      #provisioner "local-exec" {
      #  command = "sleep 60"
      #}
    
      tags = [
        {
          key                 = "Name"
          value               = "dev_jenkins_worker_windows"
          propagate_at_launch = true
        },
        {
          key                 = "class"
          value               = "dev_jenkins_worker_windows"
          propagate_at_launch = true
        },
      ]
    }

    Finally, we reach the user-data for the terraform plan. It will download the required jar file, create a node on Jenkins and register itself as a slave.

    <powershell>
    
    function Wait-For-Jenkins {
    
      Write-Host "Waiting jenkins to launch on 8080..."
    
      Do {
      Write-Host "Waiting for Jenkins"
    
       Nc -zv ${server_ip} 8080
       If( $? -eq $true ) {
         Break
       }
       Sleep 10
    
      } While (1)
    
      Do {
       Write-Host "Waiting for JNLP"
          
       Nc -zv ${server_ip} 33453
       If( $? -eq $true ) {
        Break
       }
       Sleep 10
    
      } While (1)      
    
      Write-Host "Jenkins launched"
    }
    
    function Slave-Setup()
    {
      # Register_slave
      $JENKINS_URL="http://${server_ip}:8080"
    
      $USERNAME="${jenkins_username}"
      
      $PASSWORD="${jenkins_password}"
    
      $AUTH = -join ("$USERNAME", ":", "$PASSWORD")
      echo $AUTH
    
      # Below IP collection logic works for Windows Server 2016 edition and needs testing for windows server 2008 edition
      $SLAVE_IP=(ipconfig | findstr /r "[0-9][0-9]*\.[0-9][0-9]*\.[0-9][0-9]*\.[0-9][0-9]*" | findstr "IPv4 Address").substring(39) | findstr /B "172.31"
      
      $NODE_NAME="jenkins-slave-windows-$SLAVE_IP"
      
      $NODE_SLAVE_HOME="C:\Jenkins\"
      $EXECUTORS=2
      $JNLP_PORT=33453
    
      $CRED_ID="$NODE_NAME"
      $LABELS="build windows"
      
      # Creating CMD utility for jenkins-cli commands
      # This is not working in windows therefore specify full path
      $jenkins_cmd = "java -jar C:\Jenkins\jenkins-cli.jar -s $JENKINS_URL -auth admin:$PASSWORD"
    
      Sleep 20
    
      Write-Host "Downloading jenkins-cli.jar file"
      (New-Object System.Net.WebClient).DownloadFile("$JENKINS_URL/jnlpJars/jenkins-cli.jar", "C:\Jenkins\jenkins-cli.jar")
    
      Write-Host "Downloading slave.jar file"
      (New-Object System.Net.WebClient).DownloadFile("$JENKINS_URL/jnlpJars/slave.jar", "C:\Jenkins\slave.jar")
    
      Sleep 10
    
      # Waiting for Jenkins to load all plugins
      Do {
      
        $count=(java -jar C:\Jenkins\jenkins-cli.jar -s $JENKINS_URL -auth $AUTH list-plugins | Measure-Object -line).Lines
        $ret=$?
    
        Write-Host "count [$count] ret [$ret]"
    
        If ( $count -gt 0 ) {
            Break
        }
    
        sleep 30
      } While ( 1 )
    
      # For Deleting Node, used when testing
      Write-Host "Deleting Node $NODE_NAME if present"
      java -jar C:\Jenkins\jenkins-cli.jar -s $JENKINS_URL -auth $AUTH delete-node $NODE_NAME
      
      # Generating node.xml for creating node on Jenkins server
      $NodeXml = @"
    <slave>
    <name>$NODE_NAME</name>
    <description>Windows Slave</description>
    <remoteFS>$NODE_SLAVE_HOME</remoteFS>
    <numExecutors>$EXECUTORS</numExecutors>
    <mode>NORMAL</mode>
    <retentionStrategy class="hudson.slaves.RetentionStrategy`$Always`"/>
    <launcher class="hudson.slaves.JNLPLauncher">
      <workDirSettings>
        <disabled>false</disabled>
        <internalDir>remoting</internalDir>
        <failIfWorkDirIsMissing>false</failIfWorkDirIsMissing>
      </workDirSettings>
    </launcher>
    <label>$LABELS</label>
    <nodeProperties/>
    </slave>
    "@
      $NodeXml | Out-File -FilePath C:\Jenkins\node.xml 
    
      type C:\Jenkins\node.xml
    
      # Creating node using node.xml
      Write-Host "Creating $NODE_NAME"
      Get-Content -Path C:\Jenkins\node.xml | java -jar C:\Jenkins\jenkins-cli.jar -s $JENKINS_URL -auth $AUTH create-node $NODE_NAME
    
      Write-Host "Registering Node $NODE_NAME via JNLP"
      Start-Process java -ArgumentList "-jar C:\Jenkins\slave.jar -jnlpCredentials $AUTH -jnlpUrl $JENKINS_URL/computer/$NODE_NAME/slave-agent.jnlp"
    }
    
    ### script begins here ###
    
    Wait-For-Jenkins
    
    Slave-Setup
    
    echo "Done"
    </powershell>
    <persist>true</persist>

    Command to run: Initialize terraform – terraform init, Check and apply – terraform plan -> terraform apply

    Same drawbacks are applicable here and the same solutions will work here as well.

    Congratulations! You have a Jenkins master with Windows and Linux slave attached to it.

    IAM roles for reference

    Jenkins Master

    Linux Slave

    Windows Slave

    Bonus:

    If you want to associate IAM permissions to the user but cannot assign FULL ACCESS here is a curated list below for reference:

    Packer Policy

    Terraform Policy

    Conclusion:

    This blog tries to highlight one of the ways in which we can use packer and Terraform to create AMI’s which will serve as Jenkins master and slave. We not only covered their creation but also focused on how to associate security groups and checked some of the basic IAM roles that can be applied. Although we have covered almost all the possible scenarios but still depending on use case, the required changes would be very less and this can serve as a boiler plate code when beginning to plan your infrastructure on cloud.

  • Web Scraping: Introduction, Best Practices & Caveats

    Web scraping is a process to crawl various websites and extract the required data using spiders. This data is processed in a data pipeline and stored in a structured format. Today, web scraping is widely used and has many use cases:

    • Using web scraping, Marketing & Sales companies can fetch lead-related information.
    • Web scraping is useful for Real Estate businesses to get the data of new projects, resale properties, etc.
    • Price comparison portals, like Trivago, extensively use web scraping to get the information of product and price from various e-commerce sites.

    The process of web scraping usually involves spiders, which fetch the HTML documents from relevant websites, extract the needed content based on the business logic, and finally store it in a specific format. This blog is a primer to build highly scalable scrappers. We will cover the following items:

    1. Ways to scrape: We’ll see basic ways to scrape data using techniques and frameworks in Python with some code snippets.
    2. Scraping at scale: Scraping a single page is straightforward, but there are challenges in scraping millions of websites, including managing the spider code, collecting data, and maintaining a data warehouse. We’ll explore such challenges and their solutions to make scraping easy and accurate.
    3. Scraping Guidelines: Scraping data from websites without the owner’s permission can be deemed as malicious. Certain guidelines need to be followed to ensure our scrappers are not blacklisted. We’ll look at some of the best practices one should follow for crawling.

    So let’s start scraping. 

    Different Techniques for Scraping

    Here, we will discuss how to scrape a page and the different libraries available in Python.

    Note: Python is the most popular language for scraping.  

    1. Requests – HTTP Library in Python: To scrape the website or a page, first find out the content of the HTML page in an HTTP response object. The requests library from Python is pretty handy and easy to use. It uses urllib inside. I like ‘requests’ as it’s easy and the code becomes readable too.

    #Example showing how to use the requests library
    import requests
    r = requests.get("https://velotio.com") #Fetch HTML Page

    2. BeautifulSoup: Once you get the webpage, the next step is to extract the data. BeautifulSoup is a powerful Python library that helps you extract the data from the page. It’s easy to use and has a wide range of APIs that’ll help you extract the data. We use the requests library to fetch an HTML page and then use the BeautifulSoup to parse that page. In this example, we can easily fetch the page title and all links on the page. Check out the documentation for all the possible ways in which we can use BeautifulSoup.

    from bs4 import BeautifulSoup
    import requests
    r = requests.get("https://velotio.com") #Fetch HTML Page
    soup = BeautifulSoup(r.text, "html.parser") #Parse HTML Page
    print "Webpage Title:" + soup.title.string
    print "Fetch All Links:" soup.find_all('a')

    3. Python Scrapy Framework:

    Scrapy is a Python-based web scraping framework that allows you to create different kinds of spiders to fetch the source code of the target website. Scrapy starts crawling the web pages present on a certain website, and then you can write the extraction logic to get the required data. Scrapy is built on the top of Twisted, a Python-based asynchronous library that performs the requests in an async fashion to boost up the spider performance. Scrapy is faster than BeautifulSoup. Moreover, it is a framework to write scrapers as opposed to BeautifulSoup, which is just a library to parse HTML pages.

    Here is a simple example of how to use Scrapy. Install Scrapy via pip. Scrapy gives a shell after parsing a website:

    $ pip install scrapy #Install Scrapy"
    $ scrapy shell https://velotio.com
    In [1]: response.xpath("//a").extract() #Fetch all a hrefs

    Now, let’s write a custom spider to parse a website.

    $cat > myspider.py <import scrapy
    
    class BlogSpider(scrapy.Spider):
    name = 'blogspider'
    start_urls = ['https://blog.scrapinghub.com']
    
    def parse(self, response):
    for title in response.css('h2.entry-title'):
    yield {'title': title.css('a ::text').extract_first()}
    EOF
    scrapy runspider myspider.py

    That’s it. Your first custom spider is created. Now. let’s understand the code.

    • name: Name of the spider. In this case, it’s “blogspider”.
    • start_urls: A list of URLs where the spider will begin to crawl from.
    • parse(self, response): This function is called whenever the crawler successfully crawls a URL. The response object used earlier in the Scrapy shell is the same response object that is passed to the parse(..).

    When you run this, Scrapy will look for start URL and will give you all the divs of the h2.entry-title class and extract the associated text from it. Alternatively, you can write your extraction logic in a parse method or create a separate class for extraction and call its object from the parse method.

    You’ve seen how to extract simple items from a website using Scrapy, but this is just the surface. Scrapy provides a lot of powerful features for making scraping easy and efficient. Here is a tutorial for Scrapy and the additional documentation for LinkExtractor by which you can instruct Scrapy to extract links from a web page.

    4. Python lxml.html library:  This is another library from Python just like BeautifulSoup. Scrapy internally uses lxml. It comes with a list of APIs you can use for data extraction. Why will you use this when Scrapy itself can extract the data? Let’s say you want to iterate over the ‘div’ tag and perform some operation on each tag present under “div”, then you can use this library which will give you a list of ‘div’ tags. Now you can simply iterate over them using the iter() function and traverse each child tag inside the parent div tag. Such traversing operations are difficult in scraping. Here is the documentation for this library.

    Challenges while Scraping at Scale

    Let’s look at the challenges and solutions while scraping at large scale, i.e., scraping 100-200 websites regularly:

    1. Data warehousing: Data extraction at a large scale generates vast volumes of information. Fault-tolerant, scalability, security, and high availability are the must-have features for a data warehouse. If your data warehouse is not stable or accessible then operations, like search and filter over data would be an overhead. To achieve this, instead of maintaining own database or infrastructure, you can use Amazon Web Services (AWS). You can use RDS (Relational Database Service) for a structured database and DynamoDB for the non-relational database. AWS takes care of the backup of data. It automatically takes a snapshot of the database. It gives you database error logs as well. This blog explains how to set up infrastructure in the cloud for scraping.  

    2. Pattern Changes: Scraping heavily relies on user interface and its structure, i.e., CSS and Xpath. Now, if the target website gets some adjustments then our scraper may crash completely or it can give random data that we don’t want. This is a common scenario and that’s why it’s more difficult to maintain scrapers than writing it. To handle this case, we can write the test cases for the extraction logic and run them daily, either manually or from CI tools, like Jenkins to track if the target website has changed or not.

    3. Anti-scraping Technologies: Web scraping is a common thing these days, and every website host would want to prevent their data from being scraped. Anti-scraping technologies would help them in this. For example, if you are hitting a particular website from the same IP address on a regular interval then the target website can block your IP. Adding a captcha on a website also helps. There are methods by which we can bypass these anti-scraping methods. For e.g., we can use proxy servers to hide our original IP. There are several proxy services that keep on rotating the IP before each request. Also, it is easy to add support for proxy servers in the code, and in Python, the Scrapy framework does support it.

    4. JavaScript-based dynamic content:  Websites that heavily rely on JavaScript and Ajax to render dynamic content, makes data extraction difficult. Now, Scrapy and related frameworks/libraries will only work or extract what it finds in the HTML document. Ajax calls or JavaScript are executed at runtime so it can’t scrape that. This can be handled by rendering the web page in a headless browser such as Headless Chrome, which essentially allows running Chrome in a server environment. You can also use PhantomJS, which provides a headless Webkit-based environment.

    5. Honeypot traps: Some websites have honeypot traps on the webpages for the detection of web crawlers. They are hard to detect as most of the links are blended with background color or the display property of CSS is set to none. To achieve this requires large coding efforts on both the server and the crawler side, hence this method is not frequently used.

    6. Quality of data: Currently, AI and ML projects are in high demand and these projects need data at large scale. Data integrity is also important as one fault can cause serious problems in AI/ML algorithms. So, in scraping, it is very important to not just scrape the data, but verify its integrity as well. Now doing this in real-time is not possible always, so I would prefer to write test cases of the extraction logic to make sure whatever your spiders are extracting is correct and they are not scraping any bad data

    7. More Data, More Time:  This one is obvious. The larger a website is, the more data it contains, the longer it takes to scrape that site. This may be fine if your purpose for scanning the site isn’t time-sensitive, but that isn’t often the case. Stock prices don’t stay the same over hours. Sales listings, currency exchange rates, media trends, and market prices are just a few examples of time-sensitive data. What to do in this case then? Well, one solution could be to design your spiders carefully. If you’re using Scrapy like framework then apply proper LinkExtractor rules so that spider will not waste time on scraping unrelated URLs.

    You may use multithreading scraping packages available in Python, such as Frontera and Scrapy Redis. Frontera lets you send out only one request per domain at a time, but can hit multiple domains at once, making it great for parallel scraping. Scrapy Redis lets you send out multiple requests to one domain. The right combination of these can result in a very powerful web spider that can handle both the bulk and variation for large websites.

    8. Captchas: Captchas is a good way of keeping crawlers away from a website and it is used by many website hosts. So, in order to scrape the data from such websites, we need a mechanism to solve the captchas. There are packages, software that can solve the captcha and can act as a middleware between the target website and your spider. Also, you may use libraries like Pillow and Tesseract in Python to solve the simple image-based captchas.

    9. Maintaining Deployment: Normally, we don’t want to limit ourselves to scrape just a few websites. We need the maximum amount of data that are present on the Internet and that may introduce scraping of millions of websites. Now, you can imagine the size of the code and the deployment. We can’t run spiders at this scale from a single machine. What I prefer here is to dockerize the scrapers and take advantage of the latest technologies, like AWS ECS, Kubernetes to run our scraper containers. This helps us keeping our scrapers in high availability state and it’s easy to maintain. Also, we can schedule the scrapers to run at regular intervals.

    Scraping Guidelines/ Best Practices

    1. Respect the robots.txt file:  Robots.txt is a text file that webmasters create to instruct search engine robots on how to crawl and index pages on the website. This file generally contains instructions for crawlers. Now, before even planning the extraction logic, you should first check this file. Usually, you can find this at the website admin section. This file has all the rules set on how crawlers should interact with the website. For e.g., if a website has a link to download critical information then they probably don’t want to expose that to crawlers. Another important factor is the frequency interval for crawling, which means that crawlers can only hit the website at specified intervals. If someone has asked not to crawl their website then we better not do it. Because if they catch your crawlers, it can lead to some serious legal issues.

    2. Do not hit the servers too frequently:  As I mentioned above, some websites will have the frequency interval specified for crawlers. We better use it wisely because not every website is tested against the high load. If you are hitting at a constant interval then it creates huge traffic on the server-side, and it may crash or fail to serve other requests. This creates a high impact on user experience as they are more important than the bots. So, we should make the requests according to the specified interval in robots.txt or use a standard delay of 10 seconds. This also helps you not to get blocked by the target website.

    3. User Agent Rotation and Spoofing: Every request consists of a User-Agent string in the header. This string helps to identify the browser you are using, its version, and the platform. If we use the same User-Agent in every request then it’s easy for the target website to check that request is coming from a crawler. So, to make sure we do not face this, try to rotate the User and the Agent between the requests. You can get examples of genuine User-Agent strings on the Internet very easily, try them out. If you’re using Scrapy, you can set USER_AGENT property in settings.py.

    4. Disguise your requests by rotating IPs and Proxy Services: We’ve discussed this in the challenges above. It’s always better to use rotating IPs and proxy service so that your spider won’t get blocked.

    5. Do not follow the same crawling pattern: Now, as you know many websites use anti-scraping technologies, so it’s easy for them to detect your spider if it’s crawling in the same pattern. Normally, we, as a human, would not follow a pattern on a particular website. So, to have your spiders run smoothly, we can introduce actions like mouse movements, clicking a random link, etc, which gives the impression of your spider as a human.

    6. Scrape during off-peak hours: Off-peak hours are suitable for bots/crawlers as the traffic on the website is considerably less. These hours can be identified by the geolocation from where the site’s traffic originates. This also helps to improve the crawling rate and avoid the extra load from spider requests. Thus, it is advisable to schedule the crawlers to run in the off-peak hours.

    7. Use the scraped data responsibly: We should always take the responsibility of the scraped data. It is not acceptable if someone is scraping the data and then republish it somewhere else. This can be considered as breaking the copyright laws and may lead to legal issues. So, it is advisable to check the target website’s Terms of Service page before scraping.

    8. Use Canonical URLs: When we scrape, we tend to scrape duplicate URLs, and hence the duplicate data, which is the last thing we want to do. It may happen in a single website where we get multiple URLs having the same data. In this situation, duplicate URLs will have a canonical URL, which points to the parent or the original URL. By this, we make sure, we don’t scrape duplicate contents. In frameworks like Scrapy, duplicate URLs are handled by default.

    9. Be transparent: Don’t misrepresent your purpose or use deceptive methods to gain access. If you have a login and a password that identifies you to gain access to a source, use it.  Don’t hide who you are. If possible, share your credentials.

    Conclusion

    We’ve seen the basics of scraping, frameworks, how to crawl, and the best practices of scraping. To conclude:

    • Follow target URLs rules while scraping. Don’t make them block your spider.
    • Maintenance of data and spiders at scale is difficult. Use Docker/ Kubernetes and public cloud providers, like AWS to easily scale your web-scraping backend.
    • Always respect the rules of the websites you plan to crawl. If APIs are available, always use them first.