Category: Blogs

  • Setting Up A Single Sign On (SSO) Environment For Your App

    Single Sign On (SSO) makes it simple for users to begin using an application. Support for SSO is crucial for enterprise apps, as many corporate security policies mandate that all applications use certified SSO mechanisms. While the SSO experience is straightforward, the SSO standard is anything but straightforward. It’s easy to get confused when you’re surrounded by complex jargon, including SAML, OAuth 1.0, 1.0a, 2.0, OpenID, OpenID Connect, JWT, and tokens like refresh tokens, access tokens, bearer tokens, and authorization tokens. Standards documentation is too precise to allow generalization, and vendor literature can make you believe it’s too difficult to do it yourself.

    I’ve built SSO for a lot of applications in the past. Knowing your target market, the relevant standards, and your platform is crucial.

    Single Sign On

    Single Sign On is an authentication method that allows apps to securely authenticate users into numerous applications by using just one set of login credentials.

    This allows applications to avoid the hassle of storing and managing user information like passwords and also cuts down on troubleshooting login-related issues. With SSO configured, applications check with the SSO provider (Okta, Google, Salesforce, Microsoft) to verify the user’s identity.

    Types of SSO

    • Security Assertion Markup Language (SAML)
    • OpenID Connect (OIDC)
    • OAuth (specifically OAuth 2.0 nowadays)
    • Federated Identity Management (FIM)

    Security Assertion Markup Language – SAML

    SAML (Security Assertion Markup Language) is an open standard that enables identity providers (IdP) to pass authentication and authorization data to service providers (SP), meaning you can use one set of credentials to log in to many different websites. It’s considerably easier to manage a single login per user than to handle separate logins to email, CRM software, Active Directory, and other systems.

    For standardized interactions between the identity provider and service providers, SAML transactions employ Extensible Markup Language (XML). SAML is the link between a user’s identity authentication and authorization to use a service.

    In our example implementation, we will be using SAML 2.0 as the standard for the authentication flow.

    Technical details

    • A Service Provider (SP) is the entity that provides the service, in the form of an application. Examples: GDrive, Meet, Gmail, Salesforce.
    • An Identity Provider (IdP) is the entity that provides identities, including the ability to authenticate a user. The user profile is normally stored in the Identity Provider and also includes additional information about the user such as first name, last name, job code, phone number, and address. Depending on the application, some service providers might require a very simple profile (username, email), while others may need a richer set of user data (department, job code, address, location, and so on). Examples: Active Directory, Okta, Salesforce IdP, Google.
    • The SAML sign-in flow initiated by the Identity Provider is referred to as an Identity Provider Initiated (IdP-initiated) sign-in. In this flow, the Identity Provider creates a SAML response that is routed to the Service Provider to assert the user’s identity, rather than the SAML flow being triggered by a redirection from the Service Provider. When a Service Provider initiates the SAML sign-in process, it is referred to as an SP-initiated sign-in. This is often triggered when end-users try to access a protected resource, such as when the browser tries to load a page from a protected network share.
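To make the SP-initiated case concrete, here is a minimal, standard-library-only sketch of how an SP packs an AuthnRequest into the IdP sign-in URL using SAML’s HTTP-Redirect binding (raw-deflate, then base64, then URL-encode). The entity IDs and URLs are invented for illustration:

```python
import base64
import urllib.parse
import zlib

def build_redirect_url(idp_sso_url, authn_request_xml):
    """Encode a SAML AuthnRequest per the HTTP-Redirect binding:
    raw-deflate the XML, base64 it, then URL-encode it into the query."""
    # zlib.compress produces a zlib stream; strip the 2-byte header
    # and 4-byte checksum to get raw DEFLATE, as the binding requires
    deflated = zlib.compress(authn_request_xml.encode(), 9)[2:-4]
    saml_request = base64.b64encode(deflated).decode()
    query = urllib.parse.urlencode({"SAMLRequest": saml_request})
    return f"{idp_sso_url}?{query}"

# Hypothetical request pointing our SP's ACS endpoint at a made-up IdP
xml = ('<samlp:AuthnRequest xmlns:samlp="urn:oasis:names:tc:SAML:2.0:protocol" '
       'ID="_abc123" Version="2.0" '
       'AssertionConsumerServiceURL="https://sp.example.com/acs"/>')
url = build_redirect_url("https://idp.example.com/sso", xml)
print(url.split("?")[0])
```

The browser is simply redirected to this URL; the IdP inflates the `SAMLRequest` parameter, authenticates the user, and posts its response back to the ACS endpoint.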

    Configuration details

    • Certificate – To validate the signature, the SP must receive the IdP’s public certificate. On the SP side, the certificate is kept and used anytime a SAML response is received.
    • Assertion Consumer Service (ACS) Endpoint – Sometimes referred to as the SP sign-in URL. This is the endpoint supplied by the SP for posting SAML responses. The SP must send this information to the IdP.
    • IdP Sign-in URL – This is the endpoint where SAML requests are posted on the IdP side. This information must be obtained by the SP from the IdP.
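Putting the three items above together, the SP side typically ends up with a settings blob like the following sketch. It is shaped like the configuration used by libraries such as python3-saml, and every URL and the certificate here are placeholders:

```python
# Sketch of SP-side SAML settings; all values are placeholders.
saml_settings = {
    "sp": {
        "entityId": "https://sp.example.com/metadata",
        "assertionConsumerService": {
            # ACS endpoint: where the IdP posts SAML responses
            "url": "https://sp.example.com/acs",
            "binding": "urn:oasis:names:tc:SAML:2.0:bindings:HTTP-POST",
        },
    },
    "idp": {
        "entityId": "https://idp.example.com/metadata",
        "singleSignOnService": {
            # IdP sign-in URL: where the SP sends SAML requests
            "url": "https://idp.example.com/sso",
            "binding": "urn:oasis:names:tc:SAML:2.0:bindings:HTTP-Redirect",
        },
        # The IdP's public certificate, kept on the SP side and used
        # to validate the signature of every SAML response received
        "x509cert": "MII...placeholder...",
    },
}
print(sorted(saml_settings))
```

Note the symmetry: the ACS URL travels from SP to IdP, while the sign-in URL and certificate travel from IdP to SP.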

    OpenID Connect – OIDC

    OIDC protocol is based on the OAuth 2.0 framework. OIDC authenticates the identity of a specific user, while OAuth 2.0 allows two applications to trust each other and exchange data.

    So, while the main flow appears to be the same, the labels are different.

    How are SAML and OIDC similar?

    The basic login flow for both is the same.

    1. A user tries to log into the application directly.

    2. The program sends the user’s login request to the IdP via the browser.

    3. The user logs in to the IdP or confirms that they are already logged in.

    4. The IdP verifies that the user has permission to use the program that initiated the request.

    5. Information about the user is sent from the IdP to the user’s browser.

    6. Their data is subsequently forwarded to the application.

    7. The application verifies that they have permission to use the resources.

    8. The user has been granted access to the program.

    Difference between SAML and OIDC

    1. SAML transmits user data in XML, while OpenID Connect transmits data in JSON.

    2. SAML calls the data it sends an assertion. OIDC calls the data it sends claims.

    3. In SAML, the application or system the user is trying to get into is referred to as the Service Provider. In OIDC, it’s called the Relying Party.

    SAML vs. OIDC

    1. OpenID Connect is becoming increasingly popular. Because it interacts with RESTful API endpoints, it is easier to build than SAML and is easily available through APIs. This also implies that it is considerably more compatible with mobile apps.

    2. You won’t often have a choice between SAML and OIDC when configuring Single Sign On (SSO) for an application through an identity provider like OneLogin. If you do have a choice, it is important to understand not only the differences between the two, but also which one is more likely to be sustained over time. OIDC appears to be the clear winner at this time because developers find it much easier to work with as it is more versatile.

    Use Cases

    1. SAML with OIDC:

    – Log in with Salesforce: SAML Authentication where Salesforce was used as IdP and the web application as an SP.

    Key Reason:

    All users are centrally managed in Salesforce, so SAML was the preferred choice for authentication.

    – Log in with Okta: OIDC Authentication where Okta was used as the IdP and the web application as an SP.

    Key Reason:

    Okta Active Directory (AD) is already used for user provisioning and de-provisioning of all internal users and employees. Okta’s AD integration enables them to connect Okta with any on-premises AD.

    In both implementations, user provisioning and de-provisioning take place on the IdP side.

    SP-initiated (From web application)

    IdP-initiated (From Okta Active Directory)

    2. Only OIDC login flow:

    • OIDC Authentication where Google, Salesforce, Office365, and Okta are used as IdP and the web application as SP.

    Why not use OAuth for SSO

    1. OAuth 2.0 is not a protocol for authentication. It explicitly states this in its documentation.

    2. With authentication, you’re basically attempting to figure out who the user is, when they authenticated, and how they authenticated. These questions are usually answered with SAML assertions rather than access tokens and permission grants.

    OIDC vs. OAuth 2.0

    • OAuth 2.0 is a framework that allows a user of a service to grant third-party application access to the service’s data without revealing the user’s credentials (ID and password).
    • OpenID Connect is a framework on top of OAuth 2.0 where a third-party application can obtain a user’s identity information which is managed by a service. OpenID Connect can be used for SSO.
    • In the OAuth flow, the Authorization Server gives back an Access Token only. In the OpenID Connect flow, the Authorization Server returns an Access Token and an ID Token. The ID Token is a JSON Web Token (JWT), a specially formatted string of characters. The Client can extract information from the JWT, such as your ID, name, when you logged in, when the ID Token expires, and whether the JWT has been tampered with.
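The structure of an ID Token is easy to see with the standard library alone. The sketch below builds a toy JWT and then reads it back the way a Client would; real IdPs usually sign with RS256 (public-key) rather than HS256, and the secret and claim values here are made up:

```python
import base64
import hashlib
import hmac
import json

def b64url(data: bytes) -> str:
    """Base64url-encode without padding, as JWTs do."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

# Build a toy ID Token: header.payload.signature
header = b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
claims = {"sub": "user-123", "name": "Jane Doe",
          "iat": 1700000000, "exp": 1700003600}
payload = b64url(json.dumps(claims).encode())
secret = b"shared-secret"  # placeholder
sig = b64url(hmac.new(secret, f"{header}.{payload}".encode(),
                      hashlib.sha256).digest())
id_token = f"{header}.{payload}.{sig}"

# The Client can read the claims directly out of the token...
decoded = json.loads(base64.urlsafe_b64decode(payload + "=" * (-len(payload) % 4)))
# ...and detect tampering by recomputing the signature
expected = b64url(hmac.new(secret, f"{header}.{payload}".encode(),
                           hashlib.sha256).digest())
print(decoded["name"], sig == expected)
```

In production you would let a vetted JWT library do the verification, but the point stands: identity claims and their integrity check travel inside the token itself.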

    Federated Identity Management (FIM)

    Identity Federation, also known as federated identity management, is a system that allows users from different companies to utilize the same verification method for access to apps and other resources.

    In short, it’s what allows you to sign in to Spotify with your Facebook account.

    • Single Sign On (SSO) is a subset of identity federation.
    • SSO generally enables users to use a single set of credentials to access multiple systems within a single organization, while FIM enables users to access systems across different organizations.

    How does FIM work?

    • Users first authenticate to their home security domain.
    • After authenticating to their home domain, users attempt to connect to a remote application that employs identity federation.
    • Instead of the remote application authenticating the user itself, the user is prompted to authenticate against their home authentication server.
    • The user’s home authentication server vouches for the user to the remote application, and the user is permitted to access the app.

    A user logs in to their home domain once; remote apps in other domains can then grant access to the user without an additional login process.

    Applications:

    • Auth0: Auth0 uses OpenID Connect and OAuth 2.0 to authenticate users and get their permission to access protected resources. It allows developers to design and deploy applications and APIs that handle authentication and authorization concerns through the OIDC/OAuth 2.0 protocols with ease.
    • AWS Cognito
    • User pools – In Amazon Cognito, a user pool is a user directory. Using a user pool, your users can sign in to your online or mobile app through Amazon Cognito, or federate through a third-party identity provider (IdP). All members of the user pool have a directory profile that you may access using an SDK, whether they sign in directly or through a third party.
    • Identity pools – An identity pool allows your users to get temporary AWS credentials for services like Amazon S3 and DynamoDB.

    Conclusion:

    I hope you found the summary of my SSO research beneficial. The optimum implementation approach is determined by your unique situation, technological architecture, and business requirements.

  • Chatbots With Google DialogFlow: Build a Fun Reddit Chatbot in 30 Minutes

    Google DialogFlow

    If you’ve been keeping up with the current advancements in the world of chat and voice bots, you’ve probably come across Google’s newest acquisition – DialogFlow (formerly api.ai) – a platform that provides use-case-specific, engaging voice and text-based conversations, powered by AI. While understanding the intricacies of human conversations, where we say one thing but mean another, is still an art lost on machines, a domain-specific bot is the closest thing we can build.

    What is DialogFlow anyway?

    Natural language understanding (NLU) has always been the painful part while building a chatbot. How do you make sure your bot is actually understanding what the user says, and parsing their requests correctly? Well, here’s where DialogFlow comes in and fills the gap. It actually replaces the NLU parsing bit so that you can focus on other areas like your business logic!

    DialogFlow is simply a tool that allows you to make bots (or assistants or agents) that understand human conversation, string together a meaningful API call with appropriate parameters after parsing the conversation and respond with an adequate reply. You can then deploy this bot to any platform of your choosing – Facebook Messenger, Slack, Google Assistant, Twitter, Skype, etc. Or on your own app or website as well!

    The building blocks of DialogFlow

    Agent: DialogFlow allows you to make NLU modules, called agents (basically the face of your bot). The agent connects to your backend, which provides the business logic.

    Intent: An agent is made up of intents. Intents are simply actions that a user can perform on your agent. It maps what a user says to what action should be taken. They’re entry points into a conversation.

    In short, a user may request the same thing in many ways, re-structuring their sentences. But in the end, they should all resolve to a single intent.

    Examples of intents can be:
    “What’s the weather like in Mumbai today?” or “What is the recipe for an omelet?”

    You can create as many intents as your business logic desires, and even co-relate them, using contexts. An intent decides what API to call, with what parameters, and how to respond back, to a user’s request.

    Entity: An agent wouldn’t know what values to extract from a given user’s input. This is where entities come into play. Any information in a sentence, critical to your business logic, will be an entity. This includes stuff like dates, distance, currency, etc. There are system entities, provided by DialogFlow for simple things like numbers and dates. And then there are developer defined entities. For example, “category”, for a bot about Pokemon! We’ll dive into how to make a custom developer entity further in the post.

    Context: Final concept before we can get started with coding is “Context”. This is what makes the bot truly conversational. A context-aware bot can remember things, and hold a conversation like humans do. Consider the following conversation:

    “Hey, are you coming for piano practice tonight?”
    “Sorry, I’ve got dinner plans.”
    “Okay, what about tomorrow night then?”
    “That works!”

    Did you notice what just happened? The first question is straightforward to parse: The time is “tonight”, and the event, “piano practice”.

    However, the second question,  “Okay, what about tomorrow night then?” doesn’t specify anything about the actual event. It’s implied that we’re talking about “piano practice”. This sort of understanding comes naturally to us humans, but bots have to be explicitly programmed so that they understand the context across these sentences.

    Making a Reddit Chatbot using DialogFlow

    Now that we’re well equipped with the basics, let’s get started! We’re going to make a Reddit bot that tells a joke or an interesting fact from the day’s top threads on specific subreddits. We’ll also sprinkle in some context awareness so that the bot doesn’t feel “rigid”.

    NOTE: You would need a billing-enabled account on Google Cloud Platform(GCP) if you want to follow along with this tutorial. It’s free and just needs your credit card details to set up. 

    Creating an Agent 

    1. Log in to the DialogFlow dashboard using your Google account. Here’s the link for the lazy.
    2. Click on “Create Agent”
    3. Enter the details as below, and hit “Create”. You can select any other Google project if it has billing enabled on it as well.

    Setting up a “Welcome” Intent

    As soon as you create the agent, you see this intents page:

    The “Default Fallback” Intent exists in case the user says something unexpected and is outside the scope of your intents. We won’t worry too much about that right now. Go ahead and click on the “Default Welcome Intent”. We can notice a lot of options that we can tweak.
    Let’s start with a triggering phrase. Notice the “User Says” section? We want our bot to activate as soon as we say something along the lines of:

    Let’s fill that in. After that, scroll down to the “Responses” tab. You can see some generic welcome messages are provided. Get rid of them, and put in something more personalized to our bot, as follows:

    Now, this does a couple of things. Firstly, it lets the user know that they’re using our bot. It also guides the user to the next point in the conversation. Here, it is an “or” question.

    Hit “Save” and let’s move on.

    Creating a Custom Entity

    Before we start playing around with Intents, I want to set up a Custom Entity real quick. If you remember, Entities are what we extract from the user’s input to process further. I’m going to call our Entity “content”, since the user will be requesting a piece of content – either a joke or a fact. Let’s go ahead and create that. Click on the “Entities” tab on the left sidebar and click “Create Entity”.

    Fill in the following details:

    As you can see, we have 2 possible values for our content: “joke” and “fact”. We also have entered synonyms for each of them, so that if the user says something like “I want to hear something funny”, we know they want a “joke”. Click “Save” and let’s proceed to the next section!

    Attaching our new Entity to the Intent

    Create a new Intent called “say-content”. Add a phrase “Let’s hear a joke” in the “User Says” section, like so:

    Right off the bat, we notice a couple of interesting things. Dialogflow parsed this input and associated the entity content to it, with the correct value (here, “joke”). Let’s add a few more inputs:

    PS: Make sure all the highlighted words are in the same color and have associated the same entity. Dialogflow’s NLU isn’t perfect and sometimes assigns different Entities. If that’s the case, just remove it, double-click the word and assign the correct Entity yourself!

    Let’s add a placeholder text response to see it work. To do that, scroll to the bottom section “Response”, and fill it like so:

    The “$content” is a variable having a value extracted from user’s response that we saw above.

    Let’s see this in action. On the right side of every page on Dialogflow’s platform, you see a “Try It Now” box. Use that to test your work at any point in time. I’m going to go ahead and type in “Tell a fact” in the box. Notice that the “Tell a fact” phrase wasn’t present in the samples that we gave earlier. Dialogflow keeps training using its NLU modules and can extract data from similarly structured sentences:

    A Webhook to process requests

    To keep things simple, I’m gonna write a JS app that fulfills the request by querying Reddit’s website and returning the appropriate content. Luckily for us, Reddit doesn’t need authentication for reads in JSON format. Here’s the code:

    'use strict';
    const http = require('https');

    exports.appWebhook = (req, res) => {
      let content = req.body.result.parameters['content'];
      getContent(content).then((output) => {
        res.setHeader('Content-Type', 'application/json');
        res.send(JSON.stringify({ 'speech': output, 'displayText': output }));
      }).catch((error) => {
        // If there is an error, let the user know
        res.setHeader('Content-Type', 'application/json');
        res.send(JSON.stringify({ 'speech': error, 'displayText': error }));
      });
    };

    // Map the parsed "content" value to a subreddit
    function getSubreddit (content) {
      if (content == "funny" || content == "joke" || content == "laugh") {
        return {sub: "jokes", displayText: "joke"};
      } else {
        return {sub: "todayILearned", displayText: "fact"};
      }
    }

    function getContent (content) {
      let subReddit = getSubreddit(content);
      return new Promise((resolve, reject) => {
        console.log('API Request: to Reddit');
        http.get(`https://www.reddit.com/r/${subReddit["sub"]}/top.json?sort=top&t=day`, (resp) => {
          let data = '';
          resp.on('data', (chunk) => {
            data += chunk;
          });
          resp.on('end', () => {
            let response = JSON.parse(data);
            // Pick a random thread from the day's top posts
            let thread = response["data"]["children"][Math.floor((Math.random() * 24) + 1)]["data"];
            let output = `Here's a ${subReddit["displayText"]}: ${thread["title"]}`;
            if (subReddit['sub'] == "jokes") {
              output += " " + thread["selftext"];
            }
            output += "\nWhat do you want to hear next, a joke or a fact?";
            console.log(output);
            resolve(output);
          });
        }).on("error", (err) => {
          console.log("Error: " + err.message);
          reject(err);
        });
      });
    }
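For reference, the webhook above reads DialogFlow’s (v1-era) request shape and answers with a speech/displayText pair. Sketched as Python dicts with made-up values, the exchange looks like this:

```python
# Trimmed, hypothetical DialogFlow v1 webhook request body
request_body = {
    "result": {
        "resolvedQuery": "tell me a fact",
        "action": "say-content",
        "parameters": {"content": "fact"},
    },
}

# The same extraction the JS webhook performs:
# req.body.result.parameters['content']
content = request_body["result"]["parameters"]["content"]

# ...and the response shape it sends back to DialogFlow
response_body = {
    "speech": f"Here's a {content}: ...",
    "displayText": f"Here's a {content}: ...",
}
print(content)
```

If the parameter path or the `speech`/`displayText` keys don’t match what your DialogFlow version sends and expects, the bot will silently reply with nothing, so this shape is worth checking first when debugging.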

    Now, before going ahead, follow the steps 1-5 mentioned here religiously.

    NOTE: For step 1, select the same Google Project that you created/used, when creating the agent.

    Now, to deploy our function using gcloud:

    $ gcloud beta functions deploy appWebhook --stage-bucket BUCKET_NAME --trigger-http

    To find the BUCKET_NAME, go to your Google project’s console and click on Cloud Storage under the Resources section.

    After you run the command, make note of the httpsTrigger URL mentioned. On the Dialoglow platform, find the “Fulfilment” tab on the sidebar. We need to enable webhooks and paste in the URL, like this:

    Hit “Done” at the bottom of the page, and now the final step. Visit the “say-content” Intent page and perform a couple of steps.

    1. Make the “content” parameter mandatory. This will make the bot ask explicitly for the parameter to the user if it’s not clear:

    2. Notice a new section has been added to the bottom of the screen called “Fulfilment”. Enable the “Use webhook” checkbox:

    Click “Save” and that’s it! Time to test this Intent out!

    Reddit’s crappy humor aside, this looks neat. Our replies always drive the conversation to places (Intents) that we want it to.

    Adding Context to our Bot

    Even though this works perfectly fine, there’s one more thing I’d like to add quickly. We want the user to be able to say, “More” or “Give me another one” and the bot to be able to understand what this means. This is done by emitting and absorbing contexts between intents.

    First, to emit the context, scroll up on the “say-content” Intent’s page and find the “Contexts” section. We want to output the “content” context, say for a count of 5. The count makes sure the bot remembers what the “content” is in the current conversation for up to 5 back-and-forths.

    Now, we want to create a new Intent that can absorb this context and make sense of phrases like “More please”:

    Finally, since we want it to work the same way, we’ll make the Action and Fulfilment sections look the same way as the “say-content” Intent does:

    And that’s it! Your bot is ready.

    Integrations

    Dialogflow provides integrations with probably every messaging service in the Silicon Valley, and more. But we’ll use the Web Demo. Go to “Integrations” tab from the sidebar and enable “Web Demo” settings. Your bot should work like this:

    And that’s it! Your bot is ready to face a real person! Now, you can easily keep adding more subreddits, like news, sports, bodypainting, dankmemes or whatever your hobbies in life are! Or make it understand a few more parameters. For example, “A joke about Donald Trump”.

    Consider that your homework. You can also add a “Bye” intent, and make the bot stop. Our bot currently isn’t so great with goodbyes, sort of like real people.

    Debugging and Tips

    If you’re facing issues with no replies from the Reddit script, go to your Google Project and check the Error Reporting tab to make sure everything’s fine under the hood. If outbound requests are throwing an error, you probably don’t have billing enabled.

    Also, one caveat I found is that the entities can take up any value from the synonyms that you’ve provided. This means you HAVE to hardcode them in your business app as well. Which sucks right now, but maybe DialogFlow will provide a cleaner solution in the near future!
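One way to soften that caveat is to normalize synonyms to a canonical value at the top of your business app. This is only a sketch — the synonym lists here are a hand-maintained mirror of the DialogFlow entity and have to be kept in sync by you:

```python
# Hand-maintained mirror of the "content" entity's synonyms
SYNONYMS = {
    "joke": {"joke", "funny", "laugh", "humor"},
    "fact": {"fact", "interesting", "knowledge"},
}

def canonical_content(value: str, default: str = "fact") -> str:
    """Map whatever synonym DialogFlow passed through
    to a single canonical entity value."""
    v = value.strip().lower()
    for canon, words in SYNONYMS.items():
        if v in words:
            return canon
    return default  # fall back rather than crash on unknown input

print(canonical_content("funny"))
```

With this in place, the rest of your webhook only ever branches on `"joke"` or `"fact"`, no matter which synonym the user actually said.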

  • A Beginner’s Guide to Python Tornado

    The web is a big place now; we need to support thousands of clients at a time, and this is where Tornado comes in. Tornado is a Python web framework and asynchronous networking library, originally developed at FriendFeed.

    Tornado uses non-blocking network I/O. Due to this, it can handle thousands of active server connections. It is a saviour for applications that use long polling and maintain a large number of active connections.

    Tornado is not like most Python frameworks. It’s not based on WSGI, though it supports some WSGI features through the `tornado.wsgi` module. It uses an event-loop design that makes Tornado’s request execution faster.

    What is a Synchronous Program?

    A function blocks, performs its computation, and returns once done. A function may block for many reasons: network I/O, disk I/O, mutexes, etc.

    Application performance depends on how efficiently the application uses CPU cycles, which is why blocking statements/calls must be taken seriously. Consider password hashing functions like bcrypt, which by design use hundreds of milliseconds of CPU time, far more than a typical network or disk access. Since the CPU is busy rather than idle during such a call, there is nothing to gain by making it asynchronous.
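When a handler genuinely needs such a CPU-heavy call (bcrypt being the classic case), the usual escape hatch is to push it onto an executor so the event loop keeps serving other connections. Tornado’s IOLoop exposes a `run_in_executor` with the same idea; the sketch below uses the stdlib asyncio loop so it runs anywhere, and the hashing function is a made-up stand-in for bcrypt:

```python
import asyncio
import hashlib

def slow_hash(password: bytes) -> str:
    # CPU-bound stand-in for bcrypt: many rounds of SHA-256
    digest = password
    for _ in range(200_000):
        digest = hashlib.sha256(digest).digest()
    return digest.hex()

async def handle_login(password: bytes) -> str:
    loop = asyncio.get_running_loop()
    # Push the CPU-bound call onto a worker thread so the
    # event loop stays free to serve other connections
    return await loop.run_in_executor(None, slow_hash, password)

print(len(asyncio.run(handle_login(b"hunter2"))))  # 64 hex chars
```

The handler itself stays asynchronous; only the expensive computation is moved off the loop.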

    A function can be blocking in one context and non-blocking in another. In the context of Tornado, we generally consider blocking due to network and disk I/O, although all kinds of blocking need to be minimized.

    What is an Asynchronous Program?

    1) Single-threaded architecture:

        This means it can’t perform computation-centric tasks in parallel.

    2) I/O concurrency:

        It can hand over I/O tasks to the operating system and continue with the next task, achieving concurrency.

    3) epoll/kqueue:

        Underlying OS-level constructs that allow an application to receive events on a file descriptor for I/O-specific tasks.

    4) Event loop:

        It uses epoll or kqueue to check whether any event has happened, and executes the callbacks waiting for those network events.
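To make points 3 and 4 concrete, Python’s stdlib `selectors` module wraps epoll/kqueue behind one interface. A minimal loop registers a file descriptor, asks the OS which ones are ready, and runs the callback stored with each; the socketpair here is just a stand-in for a real client connection:

```python
import selectors
import socket

sel = selectors.DefaultSelector()  # epoll on Linux, kqueue on BSD/macOS

# A socketpair stands in for a real client connection
server_side, client_side = socket.socketpair()
server_side.setblocking(False)

def on_readable(sock):
    return sock.recv(1024)

# Point 3: register interest in "readable" events on a file
# descriptor, storing the callback alongside it
sel.register(server_side, selectors.EVENT_READ, on_readable)
client_side.send(b"ping")

# Point 4: one turn of the event loop - block until an fd is
# ready, then run the callback that was waiting on it
for key, _ in sel.select(timeout=1):
    result = key.data(key.fileobj)
print(result)

sel.unregister(server_side)
server_side.close()
client_side.close()
```

Tornado’s IOLoop is essentially this pattern, generalized: many registered descriptors, one loop, callbacks (or coroutines) resumed as events arrive.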

    Asynchronous vs Synchronous Web Framework:

    In the synchronous model, each request or task is handed to a thread or routine, and when it finishes, the result is handed over to the caller. Here, managing things is easy, but creating new threads is a lot of overhead.

    On the other hand, an asynchronous framework like Node.js uses a single-threaded model, so there is much less overhead, but it comes with added complexity.

    Imagine thousands of requests coming through to a server that uses an event loop and callbacks. Until a request gets processed, the server has to efficiently store and manage the state of that request in order to map the callback result back to the actual client.

    Node.js vs Tornado

    Most of these comparison points are tied to actual programming language and not the framework: 

    • Node.js has one big advantage: all of its libraries are async. In Python, there are lots of available packages, but very few of them are asynchronous
    • As Node.js is JavaScript runtime, and we can use JS for both front and back-end, developers can keep only one codebase and share the same utility library
    • Google’s V8 engine makes Node.js faster than Tornado. But a lot of Python libraries are written in C and can be faster alternatives.

    A Simple ‘Hello World’ Example

    import tornado.ioloop
    import tornado.web
    
    class MainHandler(tornado.web.RequestHandler):
        def get(self):
            self.write("Hello, world")
    
    def make_app():
        return tornado.web.Application([
            (r"/", MainHandler),
        ])
    
    if __name__ == "__main__":
        app = make_app()
        app.listen(8888)
        tornado.ioloop.IOLoop.current().start()

    Note: This example does not use any asynchronous feature.

    Using the AsyncHTTPClient module, we can make REST calls asynchronously.

    from tornado.httpclient import AsyncHTTPClient
    from tornado import gen
    
    @gen.coroutine
    def async_fetch_gen(url):
        http_client = AsyncHTTPClient()
        response = yield http_client.fetch(url)
        raise gen.Return(response.body)

    As you can see `yield http_client.fetch(url)` will run as a coroutine.

    Complex Example of Tornado Async

    Please have a look at Asynchronous Request handler.

    WebSockets Using Tornado:

    Tornado has built-in package for WebSockets that can be easily used with coroutines to achieve concurrency, here is one example:

    import logging
    import tornado.escape
    import tornado.ioloop
    import tornado.options
    import tornado.web
    import tornado.websocket
    from tornado.options import define, options
    from tornado.httpserver import HTTPServer
    
    define("port", default=8888, help="run on the given port", type=int)
    
    
    
    def isPrime(num):
        """
        Simple worker (a stand-in for a heavier I/O or network call)
        """
        if num > 1:
            for i in range(2, num // 2 + 1):
                if (num % i) == 0:
                    return ("is not a prime number")
            return ("is a prime number")
        else:
            return ("is not a prime number")
    
    class Application(tornado.web.Application):
        def __init__(self):
            handlers = [(r"/chatsocket", TornadoWebSocket)]
            super(Application, self).__init__(handlers)
    
    class TornadoWebSocket(tornado.websocket.WebSocketHandler):
        clients = set()
    
        # enable cross domain origin
        def check_origin(self, origin):
            return True
    
        def open(self):
            TornadoWebSocket.clients.add(self)
    
        # when client closes connection
        def on_close(self):
            TornadoWebSocket.clients.remove(self)
    
        @classmethod
        def send_updates(cls, producer, result):
    
            for client in cls.clients:
    
                # check if result is mapped to correct sender
                if client == producer:
                    try:
                        client.write_message(result)
                    except:
                        logging.error("Error sending message", exc_info=True)
    
        def on_message(self, message):
            try:
                num = int(message)
            except ValueError:
                TornadoWebSocket.send_updates(self, "Invalid input")
                return
            TornadoWebSocket.send_updates(self, isPrime(num))
    
    def start_websockets():
        tornado.options.parse_command_line()
        app = Application()
        server = HTTPServer(app)
        server.listen(options.port)
        tornado.ioloop.IOLoop.current().start()
    
    
    
    if __name__ == "__main__":
        start_websockets()

    One can use a WebSocket client application to connect to the server; the message can be any integer. After processing, the client receives a result indicating whether the integer is prime.
    Here is one more example of the actual async features of Tornado. Many will find it similar to Go’s goroutines and channels.

    In this example, we can start worker(s) and they will listen on a `tornado.queues` queue. This queue is asynchronous and very similar to the one in the asyncio package.

    # Example 1
    from tornado import gen, queues
    from tornado.ioloop import IOLoop
    
    @gen.coroutine
    def consumer(queue, num_expected):
        for _ in range(num_expected):
            # heavy I/O or network task
            print('got: %s' % (yield queue.get()))
    
    
    @gen.coroutine
    def producer(queue, num_items):
        for i in range(num_items):
            print('putting %s' % i)
            yield queue.put(i)
    
    @gen.coroutine
    def main():
        """
        Starts producer and consumer and wait till they finish
        """
        yield [producer(q, producer_num_items), consumer(q, producer_num_items)]
    
    queue_size = 1
    producer_num_items = 5
    q = queues.Queue(queue_size)
    
    results = IOLoop.current().run_sync(main)
    
    
    # Output:
    # putting 0
    # putting 1
    # got: 0
    # got: 1
    # putting 2
    # putting 3
    # putting 4
    # got: 2
    # got: 3
    # got: 4
    
    
    # Example 2
    # Condition
    # A condition allows one or more coroutines to wait until notified.
    from tornado import gen
    from tornado.ioloop import IOLoop
    from tornado.locks import Condition
    
    my_condition = Condition()
    
    @gen.coroutine
    def waiter():
        print("I'll wait right here")
        yield my_condition.wait()
        print("Received notification now doing my things")
    
    @gen.coroutine
    def notifier():
        yield gen.sleep(60)
        print("About to notify")
        my_condition.notify()
        print("Done notifying")
    
    @gen.coroutine
    def runner():
        # Wait for waiter() and notifier() in parallel
        yield [waiter(), notifier()]
    
    results = IOLoop.current().run_sync(runner)
    
    
    # output:
    
    # I'll wait right here
    # About to notify
    # Done notifying
    # Received notification now doing my things

    Conclusion

    1) Asynchronous frameworks are of little use when most of the computation is CPU-bound rather than I/O-bound.

    2) Thanks to the single-thread-per-core model and event loop, Tornado can manage thousands of active client connections.

    3) Many say Django is too big, Flask is too small, and Tornado is just right. :)

  • Using Packer and Terraform to Setup Jenkins Master-Slave Architecture

    Automation is everywhere, and it is best to adopt it as soon as possible. In this blog post, we are going to discuss creating the infrastructure for a deployment pipeline hosted on AWS. Packer will be used to create the AMIs, and Terraform will be used to create the master and slaves. We will discuss different ways of connecting the slaves and also run a sample application through the pipeline.

    Please remember that the intent of this blog is to bring all the different components together; this means some code that would normally live in a development repo is also included here. Now that we have highlighted the required tools, the 10,000 ft view, and the intent of the blog, let's begin.

    Using Packer to Create AMIs for the Jenkins Master and Linux Slave

    HashiCorp has given us some of the most useful tools for simplifying our lives, and Packer is one of them. Packer can be used to create a custom AMI from an already available AMI. We just need to create a JSON file and pass an installation script as part of the build, and Packer will take care of producing the AMI for us. Install Packer for your platform from the Packer downloads page. For simplicity, we will use Linux machines for both the Jenkins master and the Linux slave. The JSON file for both of them will be the same, but it can be split if needed.

    Note: the user-data passed from Terraform will be different, which is what ultimately differentiates their usage.

    We are using Amazon Linux 2 – the JSON file for it is below.

    {
      "builders": [
      {
        "ami_description": "{{user `ami-description`}}",
        "ami_name": "{{user `ami-name`}}",
        "ami_regions": [
          "us-east-1"
        ],
        "ami_users": [
          "XXXXXXXXXX"
        ],
        "ena_support": "true",
        "instance_type": "t2.medium",
        "region": "us-east-1",
        "source_ami_filter": {
          "filters": {
            "name": "amzn2-ami-hvm-2.0*x86_64*",
            "root-device-type": "ebs",
            "virtualization-type": "hvm"
          },
          "most_recent": true,
          "owners": [
            "amazon"
          ]
        },
        "sriov_support": "true",
        "ssh_username": "ec2-user",
        "tags": {
          "Name": "{{user `ami-name`}}"
        },
        "type": "amazon-ebs"
      }
      ],
      "post-processors": [
        {
          "inline": [
            "echo AMI Name {{user `ami-name`}}",
            "date",
            "exit 0"
          ],
          "type": "shell-local"
        }
      ],
      "provisioners": [
        {
          "script": "install_amazon.bash",
          "type": "shell"
        }
      ],
      "variables": {
        "ami-description": "Amazon Linux for Jenkins Master and Slave ({{isotime \"2006-01-02-15-04-05\"}})",
        "ami-name": "amazon-linux-for-jenkins-{{isotime \"2006-01-02-15-04-05\"}}",
        "aws_access_key": "",
        "aws_secret_key": ""
      }
    }

    As you can see, the file is pretty simple. The only thing of interest here is the install_amazon.bash script. In this blog post, we will deploy a Node-based application running inside a Docker container. The content of the bash file is as follows:

    #!/bin/bash
    
    set -x
    
    # For Node
    curl -sL https://rpm.nodesource.com/setup_10.x | sudo -E bash -
    
    # For xmlstarlet
    sudo yum install -y https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm
    
    sudo yum update -y
    
    sleep 10
    
    # Setting up Docker
    sudo yum install -y docker
    sudo usermod -a -G docker ec2-user
    
    # Just to be safe removing previously available java if present
    sudo yum remove -y java
    
    sudo yum install -y python2-pip jq unzip vim tree biosdevname nc mariadb bind-utils at screen tmux xmlstarlet git java-1.8.0-openjdk nc gcc-c++ make nodejs
    
    sudo -H pip install awscli bcrypt
    sudo -H pip install --upgrade awscli
    sudo -H pip install --upgrade aws-ec2-assign-elastic-ip
    
    sudo npm install -g @angular/cli
    
    sudo systemctl enable docker
    sudo systemctl enable atd
    
    sudo yum clean all
    sudo rm -rf /var/cache/yum/
    exit 0

    That is a lot of packages, so let's go through them. As mentioned earlier, we will be discussing different ways of connecting to a slave, and for one of them we need xmlstarlet. The rest are packages that we might need in one way or another.

    Update ami_users with your actual AWS account ID. This can be found on the AWS console under Support, and inside it, Support Center.

    Validate what we have written by running packer validate amazon.json.

    Once confirmed, build the packer image by running packer build amazon.json.

    After completion, check your AWS console and you will find a new AMI under “My AMIs”.

    It's now time to start using Terraform to create the machines.

    Prerequisites:

    1. Please make sure you create a provider.tf file.

    provider "aws" {
      region                  = "us-east-1"
      shared_credentials_file = "~/.aws/credentials"
      profile                 = "dev"
    }

    The credentials file will contain aws_access_key_id and aws_secret_access_key.
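    For reference, a `~/.aws/credentials` file with the `dev` profile used above might look like this (the key values are placeholders):

```ini
[dev]
aws_access_key_id     = AKIAXXXXXXXXXXXXXXXX
aws_secret_access_key = XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
```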

    2. Keep SSH keys handy for the master/slave machines. There are good articles highlighting how to create them, or else create them beforehand on the AWS console and reference them in the code.

    3. VPC:

    # lookup for the "default" VPC
    data "aws_vpc" "default_vpc" {
      default = true
    }
    
    # subnet list in the "default" VPC
    # The "default" VPC has all "public subnets"
    data "aws_subnet_ids" "default_public" {
      vpc_id = "${data.aws_vpc.default_vpc.id}"
    }

    Creating Terraform Script for Spinning up Jenkins Master

    Get Terraform from the Terraform downloads page.

    We will need to set up the Security Group before setting up the instance.

    # Security Group:
    resource "aws_security_group" "jenkins_server" {
      name        = "jenkins_server"
      description = "Jenkins Server: created by Terraform for [dev]"
    
      # legacy name of VPC ID
      vpc_id = "${data.aws_vpc.default_vpc.id}"
    
      tags {
        Name = "jenkins_server"
        env  = "dev"
      }
    }
    
    ###############################################################################
    # ALL INBOUND
    ###############################################################################
    
    # ssh
    resource "aws_security_group_rule" "jenkins_server_from_source_ingress_ssh" {
      type              = "ingress"
      from_port         = 22
      to_port           = 22
      protocol          = "tcp"
      security_group_id = "${aws_security_group.jenkins_server.id}"
      cidr_blocks       = ["<Your Public IP>/32", "172.0.0.0/8"]
      description       = "ssh to jenkins_server"
    }
    
    # web
    resource "aws_security_group_rule" "jenkins_server_from_source_ingress_webui" {
      type              = "ingress"
      from_port         = 8080
      to_port           = 8080
      protocol          = "tcp"
      security_group_id = "${aws_security_group.jenkins_server.id}"
      cidr_blocks       = ["0.0.0.0/0"]
      description       = "jenkins server web"
    }
    
    # JNLP
    resource "aws_security_group_rule" "jenkins_server_from_source_ingress_jnlp" {
      type              = "ingress"
      from_port         = 33453
      to_port           = 33453
      protocol          = "tcp"
      security_group_id = "${aws_security_group.jenkins_server.id}"
      cidr_blocks       = ["172.31.0.0/16"]
      description       = "jenkins server JNLP Connection"
    }
    
    ###############################################################################
    # ALL OUTBOUND
    ###############################################################################
    
    resource "aws_security_group_rule" "jenkins_server_to_other_machines_ssh" {
      type              = "egress"
      from_port         = 22
      to_port           = 22
      protocol          = "tcp"
      security_group_id = "${aws_security_group.jenkins_server.id}"
      cidr_blocks       = ["0.0.0.0/0"]
      description       = "allow jenkins servers to ssh to other machines"
    }
    
    resource "aws_security_group_rule" "jenkins_server_outbound_all_80" {
      type              = "egress"
      from_port         = 80
      to_port           = 80
      protocol          = "tcp"
      security_group_id = "${aws_security_group.jenkins_server.id}"
      cidr_blocks       = ["0.0.0.0/0"]
      description       = "allow jenkins servers for outbound yum"
    }
    
    resource "aws_security_group_rule" "jenkins_server_outbound_all_443" {
      type              = "egress"
      from_port         = 443
      to_port           = 443
      protocol          = "tcp"
      security_group_id = "${aws_security_group.jenkins_server.id}"
      cidr_blocks       = ["0.0.0.0/0"]
      description       = "allow jenkins servers for outbound yum"
    }

    Now that we have a custom AMI and security groups for ourselves let’s use them to create a terraform instance.

    # AMI lookup for this Jenkins Server
    data "aws_ami" "jenkins_server" {
      most_recent      = true
      owners           = ["self"]
    
      filter {
        name   = "name"
        values = ["amazon-linux-for-jenkins*"]
      }
    }
    
    resource "aws_key_pair" "jenkins_server" {
      key_name   = "jenkins_server"
      public_key = "${file("jenkins_server.pub")}"
    }
    
    # lookup the security group of the Jenkins Server
    data "aws_security_group" "jenkins_server" {
      filter {
        name   = "group-name"
        values = ["jenkins_server"]
      }
    }
    
    # userdata for the Jenkins server ...
    data "template_file" "jenkins_server" {
      template = "${file("scripts/jenkins_server.sh")}"
    
      vars {
        env = "dev"
        jenkins_admin_password = "mysupersecretpassword"
      }
    }
    
    # the Jenkins server itself
    resource "aws_instance" "jenkins_server" {
      ami                    		= "${data.aws_ami.jenkins_server.image_id}"
      instance_type          		= "t3.medium"
      key_name               		= "${aws_key_pair.jenkins_server.key_name}"
      subnet_id              		= "${data.aws_subnet_ids.default_public.ids[0]}"
      vpc_security_group_ids 		= ["${data.aws_security_group.jenkins_server.id}"]
      iam_instance_profile   		= "dev_jenkins_server"
      user_data              		= "${data.template_file.jenkins_server.rendered}"
    
      tags {
        "Name" = "jenkins_server"
      }
    
      root_block_device {
        delete_on_termination = true
      }
    }
    
    output "jenkins_server_ami_name" {
        value = "${data.aws_ami.jenkins_server.name}"
    }
    
    output "jenkins_server_ami_id" {
        value = "${data.aws_ami.jenkins_server.id}"
    }
    
    output "jenkins_server_public_ip" {
      value = "${aws_instance.jenkins_server.public_ip}"
    }
    
    output "jenkins_server_private_ip" {
      value = "${aws_instance.jenkins_server.private_ip}"
    }

    As mentioned before, we will be discussing multiple ways in which we can connect the slaves to the Jenkins master. It is already known that every time a new Jenkins instance comes up, it generates a unique initial password. There are two ways to deal with this: wait for Jenkins to spin up and retrieve that password, or directly set the admin password while creating the Jenkins master. Here we will discuss how to change the password when configuring Jenkins. (If you need the script to retrieve the Jenkins password as soon as it gets created, then comment and I will share that with you as well.)
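    For completeness, the "retrieve the generated password" approach boils down to polling for the file Jenkins writes on first boot. This is only a sketch run on the master, not the exact script mentioned above; the path is the Jenkins default:

```python
import os
import time

def wait_for_initial_password(path="/var/lib/jenkins/secrets/initialAdminPassword",
                              timeout=300, poll=5):
    """Poll until Jenkins writes its generated admin password, then return it."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        if os.path.exists(path):
            with open(path) as fh:
                return fh.read().strip()
        time.sleep(poll)
    raise TimeoutError("no password file at %s after %ss" % (path, timeout))
```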

    Below is the user data that installs the Jenkins master, configures its password, and installs the required packages.

    #!/bin/bash
    
    set -x
    
    function wait_for_jenkins()
    {
      while (( 1 )); do
          echo "waiting for Jenkins to launch on port [8080] ..."
          
          nc -zv 127.0.0.1 8080
          if (( $? == 0 )); then
              break
          fi
    
          sleep 10
      done
    
      echo "Jenkins launched"
    }
    
    function updating_jenkins_master_password ()
    {
      cat > /tmp/jenkinsHash.py <<EOF
    import bcrypt
    import sys
    if not sys.argv[1]:
      sys.exit(10)
    plaintext_pwd=sys.argv[1]
    encrypted_pwd=bcrypt.hashpw(sys.argv[1], bcrypt.gensalt(rounds=10, prefix=b"2a"))
    isCorrect=bcrypt.checkpw(plaintext_pwd, encrypted_pwd)
    if not isCorrect:
      sys.exit(20);
    print "{}".format(encrypted_pwd)
    EOF
    
      chmod +x /tmp/jenkinsHash.py
      
      # Wait till /var/lib/jenkins/users/admin* folder gets created
      sleep 10
    
      cd /var/lib/jenkins/users/admin*
      pwd
      while (( 1 )); do
          echo "Waiting for Jenkins to generate admin user's config file ..."
    
          if [[ -f "./config.xml" ]]; then
              break
          fi
    
          sleep 10
      done
    
      echo "Admin config file created"
    
      admin_password=$(python /tmp/jenkinsHash.py ${jenkins_admin_password} 2>&1)
      
      # Do not remove the single quotes: they keep the hash syntax intact; otherwise during substitution $<character> would be replaced by null
      xmlstarlet -q ed --inplace -u "/user/properties/hudson.security.HudsonPrivateSecurityRealm_-Details/passwordHash" -v '#jbcrypt:'"$admin_password" config.xml
    
      # Restart
      systemctl restart jenkins
      sleep 10
    }
    
    function install_packages ()
    {
    
      wget -O /etc/yum.repos.d/jenkins.repo http://pkg.jenkins-ci.org/redhat-stable/jenkins.repo
      rpm --import https://jenkins-ci.org/redhat/jenkins-ci.org.key
      yum install -y jenkins
    
      # firewall
      #firewall-cmd --permanent --new-service=jenkins
      #firewall-cmd --permanent --service=jenkins --set-short="Jenkins Service Ports"
      #firewall-cmd --permanent --service=jenkins --set-description="Jenkins Service firewalld port exceptions"
      #firewall-cmd --permanent --service=jenkins --add-port=8080/tcp
      #firewall-cmd --permanent --add-service=jenkins
      #firewall-cmd --zone=public --add-service=http --permanent
      #firewall-cmd --reload
      systemctl enable jenkins
      systemctl restart jenkins
      sleep 10
    }
    
    function configure_jenkins_server ()
    {
      # Jenkins cli
      echo "installing the Jenkins cli ..."
      cp /var/cache/jenkins/war/WEB-INF/jenkins-cli.jar /var/lib/jenkins/jenkins-cli.jar
    
      # Getting initial password
      # PASSWORD=$(cat /var/lib/jenkins/secrets/initialAdminPassword)
      PASSWORD="${jenkins_admin_password}"
      sleep 10
    
      jenkins_dir="/var/lib/jenkins"
      plugins_dir="$jenkins_dir/plugins"
    
      cd $jenkins_dir
    
      # Open JNLP port
      xmlstarlet -q ed --inplace -u "/hudson/slaveAgentPort" -v 33453 config.xml
    
      cd $plugins_dir || { echo "unable to chdir to [$plugins_dir]"; exit 1; }
    
      # List of plugins that are needed to be installed 
      plugin_list="git-client git github-api github-oauth github MSBuild ssh-slaves workflow-aggregator ws-cleanup"
    
      # remove existing plugins, if any ...
      rm -rfv $plugin_list
    
      for plugin in $plugin_list; do
          echo "installing plugin [$plugin] ..."
          java -jar $jenkins_dir/jenkins-cli.jar -s http://127.0.0.1:8080/ -auth admin:$PASSWORD install-plugin $plugin
      done
    
      # Restart jenkins after installing plugins
      java -jar $jenkins_dir/jenkins-cli.jar -s http://127.0.0.1:8080 -auth admin:$PASSWORD safe-restart
    }
    
    ### script starts here ###
    
    install_packages
    
    wait_for_jenkins
    
    updating_jenkins_master_password
    
    wait_for_jenkins
    
    configure_jenkins_server
    
    echo "Done"
    exit 0
    

    There is a lot of stuff covered here, but the trickiest bit is changing the Jenkins password. We use a Python script that relies on bcrypt to hash the plain-text password in Jenkins' encryption format, and xmlstarlet to replace that password in its actual location. We also use xmlstarlet to edit the JNLP port for the Windows slave. Do remember that the initial username for Jenkins is admin.
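    To illustrate what the xmlstarlet edit does, here is the same passwordHash replacement done with Python's standard library. This is a sketch against a trimmed-down config.xml, not the full Jenkins file:

```python
import xml.etree.ElementTree as ET

def set_password_hash(config_xml, bcrypt_hash):
    """Replace the admin passwordHash the way the xmlstarlet command does."""
    root = ET.fromstring(config_xml)
    node = root.find(
        "properties/hudson.security.HudsonPrivateSecurityRealm_-Details/passwordHash")
    node.text = "#jbcrypt:" + bcrypt_hash  # Jenkins expects this prefix
    return ET.tostring(root, encoding="unicode")

# Trimmed-down sample of /var/lib/jenkins/users/admin*/config.xml
sample = """<user>
  <properties>
    <hudson.security.HudsonPrivateSecurityRealm_-Details>
      <passwordHash>old</passwordHash>
    </hudson.security.HudsonPrivateSecurityRealm_-Details>
  </properties>
</user>"""

updated = set_password_hash(sample, "$2a$10$abcdefghijklmnopqrstuv")
```

    The single-quoting trick in the shell script exists for the same reason the prefix is concatenated here: `$` characters in the hash must reach the file unexpanded.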

    Commands to run: initialize Terraform with terraform init, then check and apply with terraform plan followed by terraform apply.

    After the apply command succeeds, go to the AWS console and check for the new instance coming up. Hit <public-ip>:8080, enter the credentials you passed, and you will have your Jenkins master ready to be used.

    Note: I will be providing the terraform script and permission list of IAM roles for the user at the end of the blog.

    Creating Terraform Script for Spinning up the Linux Slave and Connecting It to the Master

    We won't be creating a new image here; rather, we will use the same one we used for the Jenkins master.

    The VPC will be the same, and the updated security groups for the slave are below:

    resource "aws_security_group" "dev_jenkins_worker_linux" {
      name        = "dev_jenkins_worker_linux"
      description = "Jenkins Server: created by Terraform for [dev]"
    
      # legacy name of VPC ID
      vpc_id = "${data.aws_vpc.default_vpc.id}"
    
      tags {
        Name = "dev_jenkins_worker_linux"
        env  = "dev"
      }
    }
    
    ###############################################################################
    # ALL INBOUND
    ###############################################################################
    
    # ssh
    resource "aws_security_group_rule" "jenkins_worker_linux_from_source_ingress_ssh" {
      type              = "ingress"
      from_port         = 22
      to_port           = 22
      protocol          = "tcp"
      security_group_id = "${aws_security_group.dev_jenkins_worker_linux.id}"
      cidr_blocks       = ["<Your Public IP>/32"]
      description       = "ssh to jenkins_worker_linux"
    }
    
    # ssh
    resource "aws_security_group_rule" "jenkins_worker_linux_from_source_ingress_webui" {
      type              = "ingress"
      from_port         = 8080
      to_port           = 8080
      protocol          = "tcp"
      security_group_id = "${aws_security_group.dev_jenkins_worker_linux.id}"
      cidr_blocks       = ["0.0.0.0/0"]
      description       = "ssh to jenkins_worker_linux"
    }
    
    
    ###############################################################################
    # ALL OUTBOUND
    ###############################################################################
    
    resource "aws_security_group_rule" "jenkins_worker_linux_to_all_80" {
      type              = "egress"
      from_port         = 80
      to_port           = 80
      protocol          = "tcp"
      security_group_id = "${aws_security_group.dev_jenkins_worker_linux.id}"
      cidr_blocks       = ["0.0.0.0/0"]
      description       = "allow jenkins worker to all 80"
    }
    
    resource "aws_security_group_rule" "jenkins_worker_linux_to_all_443" {
      type              = "egress"
      from_port         = 443
      to_port           = 443
      protocol          = "tcp"
      security_group_id = "${aws_security_group.dev_jenkins_worker_linux.id}"
      cidr_blocks       = ["0.0.0.0/0"]
      description       = "allow jenkins worker to all 443"
    }
    
    resource "aws_security_group_rule" "jenkins_worker_linux_to_other_machines_ssh" {
      type              = "egress"
      from_port         = 22
      to_port           = 22
      protocol          = "tcp"
      security_group_id = "${aws_security_group.dev_jenkins_worker_linux.id}"
      cidr_blocks       = ["0.0.0.0/0"]
      description       = "allow jenkins worker linux to jenkins server"
    }
    
    resource "aws_security_group_rule" "jenkins_worker_linux_to_jenkins_server_8080" {
      type                     = "egress"
      from_port                = 8080
      to_port                  = 8080
      protocol                 = "tcp"
      security_group_id        = "${aws_security_group.dev_jenkins_worker_linux.id}"
      source_security_group_id = "${aws_security_group.jenkins_server.id}"
      description              = "allow jenkins workers linux to jenkins server"
    }

    Now that we have the required security groups in place, it is time to look at the Terraform script for the Linux slave.

    data "aws_ami" "jenkins_worker_linux" {
      most_recent      = true
      owners           = ["self"]
    
      filter {
        name   = "name"
        values = ["amazon-linux-for-jenkins*"]
      }
    }
    
    resource "aws_key_pair" "jenkins_worker_linux" {
      key_name   = "jenkins_worker_linux"
      public_key = "${file("jenkins_worker.pub")}"
    }
    
    data "local_file" "jenkins_worker_pem" {
      filename = "${path.module}/jenkins_worker.pem"
    }
    
    data "template_file" "userdata_jenkins_worker_linux" {
      template = "${file("scripts/jenkins_worker_linux.sh")}"
    
      vars {
        env         = "dev"
        region      = "us-east-1"
        datacenter  = "dev-us-east-1"
        node_name   = "us-east-1-jenkins_worker_linux"
        domain      = ""
        device_name = "eth0"
        server_ip   = "${aws_instance.jenkins_server.private_ip}"
        worker_pem  = "${data.local_file.jenkins_worker_pem.content}"
        jenkins_username = "admin"
        jenkins_password = "mysupersecretpassword"
      }
    }
    
    # lookup the security group of the Jenkins Server
    data "aws_security_group" "jenkins_worker_linux" {
      filter {
        name   = "group-name"
        values = ["dev_jenkins_worker_linux"]
      }
    }
    
    resource "aws_launch_configuration" "jenkins_worker_linux" {
      name_prefix                 = "dev-jenkins-worker-linux"
      image_id                    = "${data.aws_ami.jenkins_worker_linux.image_id}"
      instance_type               = "t3.medium"
      iam_instance_profile        = "dev_jenkins_worker_linux"
      key_name                    = "${aws_key_pair.jenkins_worker_linux.key_name}"
      security_groups             = ["${data.aws_security_group.jenkins_worker_linux.id}"]
      user_data                   = "${data.template_file.userdata_jenkins_worker_linux.rendered}"
      associate_public_ip_address = false
    
      root_block_device {
        delete_on_termination = true
        volume_size = 100
      }
    
      lifecycle {
        create_before_destroy = true
      }
    }
    
    resource "aws_autoscaling_group" "jenkins_worker_linux" {
      name                      = "dev-jenkins-worker-linux"
      min_size                  = "1"
      max_size                  = "2"
      desired_capacity          = "2"
      health_check_grace_period = 60
      health_check_type         = "EC2"
      vpc_zone_identifier       = ["${data.aws_subnet_ids.default_public.ids}"]
      launch_configuration      = "${aws_launch_configuration.jenkins_worker_linux.name}"
      termination_policies      = ["OldestLaunchConfiguration"]
      wait_for_capacity_timeout = "10m"
      default_cooldown          = 60
    
      tags = [
        {
          key                 = "Name"
          value               = "dev_jenkins_worker_linux"
          propagate_at_launch = true
        },
        {
          key                 = "class"
          value               = "dev_jenkins_worker_linux"
          propagate_at_launch = true
        },
      ]
    }

    And now the final piece of code: the user-data of the slave machine.

    #!/bin/bash
    
    set -x
    
    function wait_for_jenkins ()
    {
        echo "Waiting jenkins to launch on 8080..."
    
        while (( 1 )); do
            echo "Waiting for Jenkins"
    
            nc -zv ${server_ip} 8080
            if (( $? == 0 )); then
                break
            fi
    
            sleep 10
        done
    
        echo "Jenkins launched"
    }
    
    function slave_setup()
    {
        # Wait till jar file gets available
        ret=1
        while (( $ret != 0 )); do
            wget -O /opt/jenkins-cli.jar http://${server_ip}:8080/jnlpJars/jenkins-cli.jar
            ret=$?
    
            echo "jenkins cli ret [$ret]"
        done
    
        ret=1
        while (( $ret != 0 )); do
            wget -O /opt/slave.jar http://${server_ip}:8080/jnlpJars/slave.jar
            ret=$?
    
            echo "jenkins slave ret [$ret]"
        done
        
        mkdir -p /opt/jenkins-slave
        chown -R ec2-user:ec2-user /opt/jenkins-slave
    
        # Register_slave
        JENKINS_URL="http://${server_ip}:8080"
    
        USERNAME="${jenkins_username}"
        
        # PASSWORD=$(cat /tmp/secret)
        PASSWORD="${jenkins_password}"
    
        SLAVE_IP=$(ip -o -4 addr list ${device_name} | head -n1 | awk '{print $4}' | cut -d/ -f1)
        NODE_NAME=$(echo "jenkins-slave-linux-$SLAVE_IP" | tr '.' '-')
        NODE_SLAVE_HOME="/opt/jenkins-slave"
        EXECUTORS=2
        SSH_PORT=22
    
        CRED_ID="$NODE_NAME"
        LABELS="build linux docker"
        USERID="ec2-user"
    
        cd /opt
        
        # Creating CMD utility for jenkins-cli commands
        jenkins_cmd="java -jar /opt/jenkins-cli.jar -s $JENKINS_URL -auth $USERNAME:$PASSWORD"
    
        # Waiting for Jenkins to load all plugins
        while (( 1 )); do
    
          count=$($jenkins_cmd list-plugins 2>/dev/null | wc -l)
          ret=$?
    
          echo "count [$count] ret [$ret]"
    
          if (( $count > 0 )); then
              break
          fi
    
          sleep 30
        done
    
        # Delete Credentials if present for respective slave machines
        $jenkins_cmd delete-credentials system::system::jenkins _ $CRED_ID
    
        # Generating cred.xml for creating credentials on Jenkins server
        cat > /tmp/cred.xml <<EOF
    <com.cloudbees.jenkins.plugins.sshcredentials.impl.BasicSSHUserPrivateKey plugin="ssh-credentials@1.16">
      <scope>GLOBAL</scope>
      <id>$CRED_ID</id>
      <description>Generated via Terraform for $SLAVE_IP</description>
      <username>$USERID</username>
      <privateKeySource class="com.cloudbees.jenkins.plugins.sshcredentials.impl.BasicSSHUserPrivateKey\$DirectEntryPrivateKeySource">
        <privateKey>${worker_pem}</privateKey>
      </privateKeySource>
    </com.cloudbees.jenkins.plugins.sshcredentials.impl.BasicSSHUserPrivateKey>
    EOF
    
        # Creating credential using cred.xml
        cat /tmp/cred.xml | $jenkins_cmd create-credentials-by-xml system::system::jenkins _
    
        # For Deleting Node, used when testing
        $jenkins_cmd delete-node $NODE_NAME
        
        # Generating node.xml for creating node on Jenkins server
        cat > /tmp/node.xml <<EOF
    <slave>
      <name>$NODE_NAME</name>
      <description>Linux Slave</description>
      <remoteFS>$NODE_SLAVE_HOME</remoteFS>
      <numExecutors>$EXECUTORS</numExecutors>
      <mode>NORMAL</mode>
      <retentionStrategy class="hudson.slaves.RetentionStrategy\$Always"/>
      <launcher class="hudson.plugins.sshslaves.SSHLauncher" plugin="ssh-slaves@1.5">
        <host>$SLAVE_IP</host>
        <port>$SSH_PORT</port>
        <credentialsId>$CRED_ID</credentialsId>
      </launcher>
      <label>$LABELS</label>
      <nodeProperties/>
      <userId>$USERID</userId>
    </slave>
    EOF
    
      sleep 10
      
      # Creating node using node.xml
      cat /tmp/node.xml | $jenkins_cmd create-node $NODE_NAME
    }
    
    ### script begins here ###
    
    wait_for_jenkins
    
    slave_setup
    
    echo "Done"
    exit 0

    This will not only create a node on Jenkins master but also attach it.

    Commands to run: initialize Terraform with terraform init, then check and apply with terraform plan followed by terraform apply.

    One drawback of this approach is that if the slave gets disconnected or goes down, it will remain on the Jenkins master as offline, and it will not automatically re-attach itself to the master.

    Some solutions for them are:

    1. Create a cron job on the slave which re-runs the user-data after a certain interval.

    2. Use swarm plugin.

    3. As we are on AWS, we can even use Amazon EC2 Plugin.
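    Solution 1 can be sketched as a small watchdog, run from cron on the slave, that asks the master's JSON API whether this node is offline and re-runs the registration step if so. The master URL, node name, and reattach script path are placeholders, and the `/computer/<name>/api/json` endpoint with its `offline` field is the standard Jenkins remote API:

```python
import json
import subprocess
import urllib.request

def node_is_offline(payload):
    """Parse a Jenkins /computer/<name>/api/json response body."""
    return bool(json.loads(payload).get("offline", True))

def check_and_reattach(master="http://jenkins-master:8080",
                       node="jenkins-slave-linux-172-31-0-10"):
    url = "%s/computer/%s/api/json" % (master, node)
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            offline = node_is_offline(resp.read())
    except OSError:
        offline = True  # master unreachable: treat as offline, retry next run
    if offline:
        # re-run the slave_setup portion of the user-data (hypothetical path)
        subprocess.call(["/bin/bash", "/opt/reattach_slave.sh"])
```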

    Maybe in a future blog, we will cover using both of these plugins as well.

    Using Packer to Create AMIs for the Windows Slave

    The Windows AMI will also be created using Packer. All the pointers for Windows remain the same as they were for Linux.

    {
      "variables": {
        "ami-description": "Windows Server for Jenkins Slave ({{isotime \"2006-01-02-15-04-05\"}})",
        "ami-name": "windows-slave-for-jenkins-{{isotime \"2006-01-02-15-04-05\"}}",
        "aws_access_key": "",
        "aws_secret_key": ""
      },
    
      "builders": [
        {
          "ami_description": "{{user `ami-description`}}",
          "ami_name": "{{user `ami-name`}}",
          "ami_regions": [
            "us-east-1"
          ],
          "ami_users": [
            "XXXXXXXXXX"
          ],
          "ena_support": "true",
          "instance_type": "t3.medium",
          "region": "us-east-1",
          "source_ami_filter": {
            "filters": {
              "name": "Windows_Server-2016-English-Full-Containers-*",
              "root-device-type": "ebs",
              "virtualization-type": "hvm"
            },
            "most_recent": true,
            "owners": [
              "amazon"
            ]
          },
          "sriov_support": "true",
          "user_data_file": "scripts/SetUpWinRM.ps1",
          "communicator": "winrm",
          "winrm_username": "Administrator",
          "winrm_insecure": true,
          "winrm_use_ssl": true,
          "tags": {
            "Name": "{{user `ami-name`}}"
          },
          "type": "amazon-ebs"
        }
      ],
      "post-processors": [
      {
        "inline": [
          "echo AMI Name {{user `ami-name`}}",
          "date",
          "exit 0"
        ],
        "type": "shell-local"
      }
      ],
      "provisioners": [
        {
          "type": "powershell",
          "valid_exit_codes": [ 0, 3010 ],
          "scripts": [
            "scripts/disable-uac.ps1",
            "scripts/enable-rdp.ps1",
            "install_windows.ps1"
          ]
        },
        {
          "type": "windows-restart",
          "restart_check_command": "powershell -command \"& {Write-Output 'restarted.'}\""
        },
        {
          "type": "powershell",
          "inline": [
            "C:\\ProgramData\\Amazon\\EC2-Windows\\Launch\\Scripts\\InitializeInstance.ps1 -Schedule",
            "C:\\ProgramData\\Amazon\\EC2-Windows\\Launch\\Scripts\\SysprepInstance.ps1 -NoShutdown"
          ]
        }
      ]
    }

    Now, when it comes to Windows, one should know that it does not behave the same way Linux does. For Packer to communicate with this image, WinRM is an essential component, and we set it up at the very beginning via user_data_file. Windows also asks for user input in many places, which would break an automated flow, so we disable UAC; we also enable RDP so that we can connect to the machine from a local desktop for debugging if needed. We then execute install_windows.ps1, which sets up our slave. Finally, note the last two PowerShell scripts in the provisioners: they ensure a fresh random Administrator password is generated every time a new machine is created. They are mandatory, or you will never be able to log in to your machines.

    There are multiple scripts referenced in the code above; let’s go through them in their order of appearance.

    SetUpWinRM.ps1:

    <powershell>
    
    write-output "Running User Data Script"
    write-host "(host) Running User Data Script"
    
    Set-ExecutionPolicy Unrestricted -Scope LocalMachine -Force -ErrorAction Ignore
    
    # Don't set this before Set-ExecutionPolicy as it throws an error
    $ErrorActionPreference = "stop"
    
    # Remove HTTP listener
    Remove-Item -Path WSMan:\Localhost\listener\listener* -Recurse
    
    $Cert = New-SelfSignedCertificate -CertstoreLocation Cert:\LocalMachine\My -DnsName "packer"
    New-Item -Path WSMan:\LocalHost\Listener -Transport HTTPS -Address * -CertificateThumbPrint $Cert.Thumbprint -Force
    
    # WinRM
    write-output "Setting up WinRM"
    write-host "(host) setting up WinRM"
    
    cmd.exe /c winrm quickconfig -q
    cmd.exe /c winrm set "winrm/config" '@{MaxTimeoutms="1800000"}'
    cmd.exe /c winrm set "winrm/config/winrs" '@{MaxMemoryPerShellMB="1024"}'
    cmd.exe /c winrm set "winrm/config/service" '@{AllowUnencrypted="true"}'
    cmd.exe /c winrm set "winrm/config/client" '@{AllowUnencrypted="true"}'
    cmd.exe /c winrm set "winrm/config/service/auth" '@{Basic="true"}'
    cmd.exe /c winrm set "winrm/config/client/auth" '@{Basic="true"}'
    cmd.exe /c winrm set "winrm/config/service/auth" '@{CredSSP="true"}'
    cmd.exe /c winrm set "winrm/config/listener?Address=*+Transport=HTTPS" "@{Port=`"5986`";Hostname=`"packer`";CertificateThumbprint=`"$($Cert.Thumbprint)`"}"
    cmd.exe /c netsh advfirewall firewall set rule group="remote administration" new enable=yes
    cmd.exe /c netsh firewall add portopening TCP 5986 "Port 5986"
    cmd.exe /c net stop winrm
    cmd.exe /c sc config winrm start= auto
    cmd.exe /c net start winrm
    
    </powershell>

    The content is pretty straightforward, as it just sets up WinRM. The one thing that matters here is the <powershell> and </powershell> tags; they are mandatory, as otherwise the type of the script cannot be recognized. Next, we come across disable-uac.ps1 and enable-rdp.ps1, whose purpose we discussed before. The last user data script is install_windows.ps1, which installs all the required packages in the AMI.

    Chocolatey: a blessing in disguise – Installing applications on Windows through scripts is a real headache, as you have to write a lot of code just to install a single application. Luckily for us, we have Chocolatey. It works as a package manager for Windows and lets us install applications the way we install packages on Linux. install_windows.ps1 contains the installation steps for Chocolatey and shows how it can be used to install other applications on Windows.

    See, such a small script and you can get all the components to run your Windows application in no time (Kidding… This script actually takes around 20 minutes to run :P)

    Remaining user-data can be found here.

    Now that we have the image, let’s start on the Terraform script that will make this machine a slave of your Jenkins master.

    Creating a Terraform Script to Spin up the Windows Slave and Connect It to the Master

    As before, we will first create the security groups and then create the slave machine from the AMI we built above.

    resource "aws_security_group" "dev_jenkins_worker_windows" {
      name        = "dev_jenkins_worker_windows"
      description = "Jenkins Server: created by Terraform for [dev]"
    
      # legacy name of VPC ID
      vpc_id = "${data.aws_vpc.default_vpc.id}"
    
      tags {
        Name = "dev_jenkins_worker_windows"
        env  = "dev"
      }
    }
    
    ###############################################################################
    # ALL INBOUND
    ###############################################################################
    
    # web UI
    resource "aws_security_group_rule" "jenkins_worker_windows_from_source_ingress_webui" {
      type              = "ingress"
      from_port         = 8080
      to_port           = 8080
      protocol          = "tcp"
      security_group_id = "${aws_security_group.dev_jenkins_worker_windows.id}"
      cidr_blocks       = ["0.0.0.0/0"]
      description       = "web UI to jenkins_worker_windows"
    }
    
    # rdp
    resource "aws_security_group_rule" "jenkins_worker_windows_from_rdp" {
      type              = "ingress"
      from_port         = 3389
      to_port           = 3389
      protocol          = "tcp"
      security_group_id = "${aws_security_group.dev_jenkins_worker_windows.id}"
      cidr_blocks       = ["<Your Public IP>/32"]
      description       = "rdp to jenkins_worker_windows"
    }
    
    ###############################################################################
    # ALL OUTBOUND
    ###############################################################################
    
    resource "aws_security_group_rule" "jenkins_worker_windows_to_all_80" {
      type              = "egress"
      from_port         = 80
      to_port           = 80
      protocol          = "tcp"
      security_group_id = "${aws_security_group.dev_jenkins_worker_windows.id}"
      cidr_blocks       = ["0.0.0.0/0"]
      description       = "allow jenkins worker to all 80"
    }
    
    resource "aws_security_group_rule" "jenkins_worker_windows_to_all_443" {
      type              = "egress"
      from_port         = 443
      to_port           = 443
      protocol          = "tcp"
      security_group_id = "${aws_security_group.dev_jenkins_worker_windows.id}"
      cidr_blocks       = ["0.0.0.0/0"]
      description       = "allow jenkins worker to all 443"
    }
    
    resource "aws_security_group_rule" "jenkins_worker_windows_to_jenkins_server_33453" {
      type              = "egress"
      from_port         = 33453
      to_port           = 33453
      protocol          = "tcp"
      security_group_id = "${aws_security_group.dev_jenkins_worker_windows.id}"
      cidr_blocks       = ["172.31.0.0/16"]
      description       = "allow jenkins worker windows to jenkins server"
    }
    
    resource "aws_security_group_rule" "jenkins_worker_windows_to_jenkins_server_8080" {
      type                     = "egress"
      from_port                = 8080
      to_port                  = 8080
      protocol                 = "tcp"
      security_group_id        = "${aws_security_group.dev_jenkins_worker_windows.id}"
      source_security_group_id = "${aws_security_group.jenkins_server.id}"
      description              = "allow jenkins workers windows to jenkins server"
    }
    
    resource "aws_security_group_rule" "jenkins_worker_windows_to_all_22" {
      type              = "egress"
      from_port         = 22
      to_port           = 22
      protocol          = "tcp"
      security_group_id = "${aws_security_group.dev_jenkins_worker_windows.id}"
      cidr_blocks       = ["0.0.0.0/0"]
      description       = "allow jenkins worker windows to connect outbound from 22"
    }

    Once the security groups are in place, we move on to the Terraform file for the Windows machine itself. Windows can’t connect to the Jenkins master over SSH, the method we used for the Linux slave; instead, we have to use JNLP. A quick recap: when creating the Jenkins master, we used xmlstarlet to fix the JNLP port and added security group rules to allow JNLP connections. We have also opened the RDP port so that if any issue occurs, you can get into the machine and debug it.

    Terraform file:

    # Setting Up Windows Slave 
    data "aws_ami" "jenkins_worker_windows" {
      most_recent      = true
      owners           = ["self"]
    
      filter {
        name   = "name"
        values = ["windows-slave-for-jenkins*"]
      }
    }
    
    resource "aws_key_pair" "jenkins_worker_windows" {
      key_name   = "jenkins_worker_windows"
      public_key = "${file("jenkins_worker.pub")}"
    }
    
    data "template_file" "userdata_jenkins_worker_windows" {
      template = "${file("scripts/jenkins_worker_windows.ps1")}"
    
      vars {
        env         = "dev"
        region      = "us-east-1"
        datacenter  = "dev-us-east-1"
        node_name   = "us-east-1-jenkins_worker_windows"
        domain      = ""
        device_name = "eth0"
        server_ip   = "${aws_instance.jenkins_server.private_ip}"
        worker_pem  = "${data.local_file.jenkins_worker_pem.content}"
        jenkins_username = "admin"
        jenkins_password = "mysupersecretpassword"
      }
    }
    
    # look up the security group of the Windows Jenkins worker
    data "aws_security_group" "jenkins_worker_windows" {
      filter {
        name   = "group-name"
        values = ["dev_jenkins_worker_windows"]
      }
    }
    
    resource "aws_launch_configuration" "jenkins_worker_windows" {
      name_prefix                 = "dev-jenkins-worker-"
      image_id                    = "${data.aws_ami.jenkins_worker_windows.image_id}"
      instance_type               = "t3.medium"
      iam_instance_profile        = "dev_jenkins_worker_windows"
      key_name                    = "${aws_key_pair.jenkins_worker_windows.key_name}"
      security_groups             = ["${data.aws_security_group.jenkins_worker_windows.id}"]
      user_data                   = "${data.template_file.userdata_jenkins_worker_windows.rendered}"
      associate_public_ip_address = false
    
      root_block_device {
        delete_on_termination = true
        volume_size = 100
      }
    
      lifecycle {
        create_before_destroy = true
      }
    }
    
    resource "aws_autoscaling_group" "jenkins_worker_windows" {
      name                      = "dev-jenkins-worker-windows"
      min_size                  = "1"
      max_size                  = "2"
      desired_capacity          = "2"
      health_check_grace_period = 60
      health_check_type         = "EC2"
      vpc_zone_identifier       = ["${data.aws_subnet_ids.default_public.ids}"]
      launch_configuration      = "${aws_launch_configuration.jenkins_worker_windows.name}"
      termination_policies      = ["OldestLaunchConfiguration"]
      wait_for_capacity_timeout = "10m"
      default_cooldown          = 60
    
      #lifecycle {
      #  create_before_destroy = true
      #}
    
    
      ## on replacement, gives new service time to spin up before moving on to destroy
      #provisioner "local-exec" {
      #  command = "sleep 60"
      #}
    
      tags = [
        {
          key                 = "Name"
          value               = "dev_jenkins_worker_windows"
          propagate_at_launch = true
        },
        {
          key                 = "class"
          value               = "dev_jenkins_worker_windows"
          propagate_at_launch = true
        },
      ]
    }

    Finally, we reach the user data for the Terraform plan. It downloads the required jar files, creates a node on the Jenkins master, and registers the machine as a slave.

    <powershell>
    
    function Wait-For-Jenkins {
    
      Write-Host "Waiting jenkins to launch on 8080..."
    
      Do {
      Write-Host "Waiting for Jenkins"
    
       Nc -zv ${server_ip} 8080
       If( $? -eq $true ) {
         Break
       }
       Sleep 10
    
      } While (1)
    
      Do {
       Write-Host "Waiting for JNLP"
          
       Nc -zv ${server_ip} 33453
       If( $? -eq $true ) {
        Break
       }
       Sleep 10
    
      } While (1)      
    
      Write-Host "Jenkins launched"
    }
    
    function Slave-Setup()
    {
      # Register_slave
      $JENKINS_URL="http://${server_ip}:8080"
    
      $USERNAME="${jenkins_username}"
      
      $PASSWORD="${jenkins_password}"
    
      $AUTH = -join ("$USERNAME", ":", "$PASSWORD")
      echo $AUTH
    
      # Below IP collection logic works for Windows Server 2016 edition and needs testing for windows server 2008 edition
      $SLAVE_IP=(ipconfig | findstr /r "[0-9][0-9]*\.[0-9][0-9]*\.[0-9][0-9]*\.[0-9][0-9]*" | findstr "IPv4 Address").substring(39) | findstr /B "172.31"
      
      $NODE_NAME="jenkins-slave-windows-$SLAVE_IP"
      
      $NODE_SLAVE_HOME="C:\Jenkins\"
      $EXECUTORS=2
      $JNLP_PORT=33453
    
      $CRED_ID="$NODE_NAME"
      $LABELS="build windows"
      
      # Creating CMD utility for jenkins-cli commands
      # This is not working in windows therefore specify full path
      $jenkins_cmd = "java -jar C:\Jenkins\jenkins-cli.jar -s $JENKINS_URL -auth admin:$PASSWORD"
    
      Sleep 20
    
      Write-Host "Downloading jenkins-cli.jar file"
      (New-Object System.Net.WebClient).DownloadFile("$JENKINS_URL/jnlpJars/jenkins-cli.jar", "C:\Jenkins\jenkins-cli.jar")
    
      Write-Host "Downloading slave.jar file"
      (New-Object System.Net.WebClient).DownloadFile("$JENKINS_URL/jnlpJars/slave.jar", "C:\Jenkins\slave.jar")
    
      Sleep 10
    
      # Waiting for Jenkins to load all plugins
      Do {
      
        $count=(java -jar C:\Jenkins\jenkins-cli.jar -s $JENKINS_URL -auth $AUTH list-plugins | Measure-Object -line).Lines
        $ret=$?
    
        Write-Host "count [$count] ret [$ret]"
    
        If ( $count -gt 0 ) {
            Break
        }
    
        sleep 30
      } While ( 1 )
    
      # For Deleting Node, used when testing
      Write-Host "Deleting Node $NODE_NAME if present"
      java -jar C:\Jenkins\jenkins-cli.jar -s $JENKINS_URL -auth $AUTH delete-node $NODE_NAME
      
      # Generating node.xml for creating node on Jenkins server
      $NodeXml = @"
    <slave>
    <name>$NODE_NAME</name>
    <description>Windows Slave</description>
    <remoteFS>$NODE_SLAVE_HOME</remoteFS>
    <numExecutors>$EXECUTORS</numExecutors>
    <mode>NORMAL</mode>
    <retentionStrategy class="hudson.slaves.RetentionStrategy`$Always`"/>
    <launcher class="hudson.slaves.JNLPLauncher">
      <workDirSettings>
        <disabled>false</disabled>
        <internalDir>remoting</internalDir>
        <failIfWorkDirIsMissing>false</failIfWorkDirIsMissing>
      </workDirSettings>
    </launcher>
    <label>$LABELS</label>
    <nodeProperties/>
    </slave>
    "@
      $NodeXml | Out-File -FilePath C:\Jenkins\node.xml 
    
      type C:\Jenkins\node.xml
    
      # Creating node using node.xml
      Write-Host "Creating $NODE_NAME"
      Get-Content -Path C:\Jenkins\node.xml | java -jar C:\Jenkins\jenkins-cli.jar -s $JENKINS_URL -auth $AUTH create-node $NODE_NAME
    
      Write-Host "Registering Node $NODE_NAME via JNLP"
      Start-Process java -ArgumentList "-jar C:\Jenkins\slave.jar -jnlpCredentials $AUTH -jnlpUrl $JENKINS_URL/computer/$NODE_NAME/slave-agent.jnlp"
    }
    
    ### script begins here ###
    
    Wait-For-Jenkins
    
    Slave-Setup
    
    echo "Done"
    </powershell>
    <persist>true</persist>

    Commands to run:

    terraform init    # initialize Terraform
    terraform plan    # review the planned changes
    terraform apply   # apply them

    The same drawbacks apply here, and the same solutions will work as well.

    Congratulations! You have a Jenkins master with Windows and Linux slaves attached to it.

    IAM roles for reference

    Jenkins Master

    Linux Slave

    Windows Slave

    Bonus:

    If you want to associate IAM permissions to the user but cannot assign FULL ACCESS here is a curated list below for reference:

    Packer Policy

    Terraform Policy

    Conclusion:

    This blog highlights one way to use Packer and Terraform to create AMIs that serve as Jenkins master and slaves. We not only covered their creation but also showed how to associate security groups and reviewed some of the basic IAM roles that can be applied. While we have covered most of the common scenarios, any changes required for your particular use case should be small, and this can serve as boilerplate code when you begin planning your infrastructure in the cloud.

  • Web Scraping: Introduction, Best Practices & Caveats

    Web scraping is a process to crawl various websites and extract the required data using spiders. This data is processed in a data pipeline and stored in a structured format. Today, web scraping is widely used and has many use cases:

    • Using web scraping, Marketing & Sales companies can fetch lead-related information.
    • Web scraping is useful for Real Estate businesses to get the data of new projects, resale properties, etc.
    • Price comparison portals, like Trivago, extensively use web scraping to get the information of product and price from various e-commerce sites.

    The process of web scraping usually involves spiders, which fetch the HTML documents from relevant websites, extract the needed content based on the business logic, and finally store it in a specific format. This blog is a primer on building highly scalable scrapers. We will cover the following items:

    1. Ways to scrape: We’ll see basic ways to scrape data using techniques and frameworks in Python with some code snippets.
    2. Scraping at scale: Scraping a single page is straightforward, but there are challenges in scraping millions of websites, including managing the spider code, collecting data, and maintaining a data warehouse. We’ll explore such challenges and their solutions to make scraping easy and accurate.
    3. Scraping Guidelines: Scraping data from websites without the owner’s permission can be deemed malicious. Certain guidelines need to be followed to ensure our scrapers are not blacklisted. We’ll look at some of the best practices one should follow for crawling.

    So let’s start scraping. 

    Different Techniques for Scraping

    Here, we will discuss how to scrape a page and the different libraries available in Python.

    Note: Python is the most popular language for scraping.  

    1. Requests – HTTP Library in Python: To scrape a website or a page, first fetch the content of the HTML page into an HTTP response object. The requests library from Python is pretty handy and easy to use; it uses urllib3 under the hood. I like ‘requests’ because it’s easy and the code stays readable too.

    #Example showing how to use the requests library
    import requests
    r = requests.get("https://velotio.com") #Fetch HTML Page

    2. BeautifulSoup: Once you get the webpage, the next step is to extract the data. BeautifulSoup is a powerful Python library that helps you extract the data from the page. It’s easy to use and has a wide range of APIs that’ll help you extract the data. We use the requests library to fetch an HTML page and then use the BeautifulSoup to parse that page. In this example, we can easily fetch the page title and all links on the page. Check out the documentation for all the possible ways in which we can use BeautifulSoup.

    from bs4 import BeautifulSoup
    import requests
    r = requests.get("https://velotio.com") #Fetch HTML Page
    soup = BeautifulSoup(r.text, "html.parser") #Parse HTML Page
    print("Webpage Title: " + soup.title.string)
    print("Fetch All Links:", soup.find_all('a'))

    3. Python Scrapy Framework:

    Scrapy is a Python-based web scraping framework that allows you to create different kinds of spiders to fetch the source code of the target website. Scrapy starts crawling the web pages present on a certain website, and then you can write the extraction logic to get the required data. Scrapy is built on top of Twisted, a Python-based asynchronous networking library that performs requests asynchronously to boost spider performance. Scrapy is faster than BeautifulSoup. Moreover, it is a framework for writing scrapers, as opposed to BeautifulSoup, which is just a library for parsing HTML pages.

    Here is a simple example of how to use Scrapy. Install Scrapy via pip. Scrapy gives a shell after parsing a website:

    $ pip install scrapy # Install Scrapy
    $ scrapy shell https://velotio.com
    In [1]: response.xpath("//a").extract() #Fetch all a hrefs

    Now, let’s write a custom spider to parse a website.

    $ cat > myspider.py << EOF
    import scrapy

    class BlogSpider(scrapy.Spider):
        name = 'blogspider'
        start_urls = ['https://blog.scrapinghub.com']

        def parse(self, response):
            for title in response.css('h2.entry-title'):
                yield {'title': title.css('a ::text').extract_first()}
    EOF
    $ scrapy runspider myspider.py

    That’s it. Your first custom spider is created. Now, let’s understand the code.

    • name: Name of the spider. In this case, it’s “blogspider”.
    • start_urls: A list of URLs where the spider will begin to crawl from.
    • parse(self, response): This function is called whenever the crawler successfully crawls a URL. The response object used earlier in the Scrapy shell is the same response object that is passed to the parse(..).

    When you run this, Scrapy starts from the start URLs, selects all the h2 elements with the entry-title class, and extracts the associated anchor text from them. Alternatively, you can write your extraction logic in the parse method, or create a separate class for extraction and call its object from the parse method.

    You’ve seen how to extract simple items from a website using Scrapy, but this is just the surface. Scrapy provides a lot of powerful features for making scraping easy and efficient. Here is a tutorial for Scrapy and the additional documentation for LinkExtractor by which you can instruct Scrapy to extract links from a web page.

    4. Python lxml.html library: This is another Python library, much like BeautifulSoup; in fact, Scrapy uses lxml internally. It comes with a list of APIs you can use for data extraction. Why would you use it when Scrapy itself can extract the data? Say you want to iterate over every ‘div’ tag and perform some operation on each tag present under it: this library gives you the list of ‘div’ tags, and you can iterate over them using the iter() function and traverse each child tag inside a parent div tag. Such traversal operations are difficult with plain scraping code. Here is the documentation for this library.
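    The traversal described above can be sketched as follows (a minimal example; the HTML fragment and the "parent" id are invented for illustration):

```python
# Minimal lxml.html sketch: find a parent <div> and walk its child tags.
# The HTML fragment and the "parent" id are invented for this example.
import lxml.html

html = """
<div id="parent">
  <p>first</p>
  <span>second</span>
</div>
"""

tree = lxml.html.fromstring(html)
parent = tree.xpath('//div[@id="parent"]')[0]

# iterchildren() yields only direct child elements; iter() would also
# yield the parent itself and any deeper descendants.
tags = [child.tag for child in parent.iterchildren()]
print(tags)
```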

    Challenges while Scraping at Scale

    Let’s look at the challenges and solutions while scraping at large scale, i.e., scraping 100-200 websites regularly:

    1. Data warehousing: Data extraction at a large scale generates vast volumes of information. Fault tolerance, scalability, security, and high availability are must-have features for a data warehouse. If your data warehouse is not stable or accessible, operations like search and filter over the data become an overhead. To achieve this, instead of maintaining your own database or infrastructure, you can use Amazon Web Services (AWS): RDS (Relational Database Service) for a structured database and DynamoDB for a non-relational database. AWS takes care of data backups, automatically takes snapshots of the database, and gives you database error logs as well. This blog explains how to set up infrastructure in the cloud for scraping.

    2. Pattern Changes: Scraping relies heavily on the user interface and its structure, i.e., CSS and XPath. If the target website changes its layout, our scraper may crash completely, or it may return random data that we don’t want. This is a common scenario, and it is why maintaining scrapers is harder than writing them. To handle this case, we can write test cases for the extraction logic and run them daily, either manually or from CI tools like Jenkins, to track whether the target website has changed.
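    One lightweight way to catch such pattern changes is to run the extraction logic regularly against a saved snapshot of a known page and assert the result. A hedged sketch (the TitleExtractor class and the snapshot HTML are invented for illustration, using only the standard library):

```python
# Sketch of a regression test for extraction logic. TitleExtractor and
# the snapshot HTML below are illustrative, not from a real target site.
from html.parser import HTMLParser

class TitleExtractor(HTMLParser):
    """Collects the text of every <h2 class="entry-title"> element."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._inside = False

    def handle_starttag(self, tag, attrs):
        if tag == "h2" and ("class", "entry-title") in attrs:
            self._inside = True

    def handle_endtag(self, tag):
        if tag == "h2":
            self._inside = False

    def handle_data(self, data):
        if self._inside:
            self.titles.append(data.strip())

# A saved snapshot of the page; if the site layout changes, running the
# extractor against a fresh download will stop matching this expectation.
snapshot = '<h2 class="entry-title">Scraping at Scale</h2>'
parser = TitleExtractor()
parser.feed(snapshot)
assert parser.titles == ["Scraping at Scale"]
```

A CI job (e.g., in Jenkins) would fetch the live page instead of the snapshot and fail loudly when the selectors stop matching.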

    3. Anti-scraping Technologies: Web scraping is a common thing these days, and every website host would want to prevent their data from being scraped; anti-scraping technologies help them do that. For example, if you hit a particular website from the same IP address at a regular interval, the target website can block your IP. Adding a captcha to a website also helps. There are methods by which we can bypass these anti-scraping measures: for example, we can use proxy servers to hide our original IP, and several proxy services rotate the IP before each request. It is also easy to add support for proxy servers in the code; in Python, the Scrapy framework supports it.
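    In Scrapy, a proxy can be assigned per request through request.meta; a minimal rotating-proxy downloader middleware might look like this (a sketch: the proxy endpoints are placeholders, and the middleware still has to be registered in settings.py):

```python
# Sketch of a rotating-proxy downloader middleware for Scrapy.
# The proxy endpoints below are placeholders for a real proxy service.
import random

PROXIES = [
    "http://proxy1.example.com:8080",
    "http://proxy2.example.com:8080",
    "http://proxy3.example.com:8080",
]

class RotatingProxyMiddleware:
    """Assigns a random proxy to every outgoing request."""

    def process_request(self, request, spider):
        # Scrapy's HTTP downloader honors the "proxy" key in request.meta.
        request.meta["proxy"] = random.choice(PROXIES)
```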

    4. JavaScript-based dynamic content: Websites that rely heavily on JavaScript and Ajax to render dynamic content make data extraction difficult. Scrapy and similar frameworks/libraries can only extract what they find in the HTML document; Ajax calls and JavaScript are executed at runtime, so their output can’t be scraped directly. This can be handled by rendering the web page in a headless browser such as Headless Chrome, which essentially allows running Chrome in a server environment. You can also use PhantomJS, which provides a headless WebKit-based environment.

    5. Honeypot traps: Some websites place honeypot traps on their pages to detect web crawlers. These are hard to spot, as the trap links are typically blended into the background color or hidden with the CSS display property set to none. Implementing this requires a large coding effort on both the server and the crawler side, hence this method is not frequently used.

    6. Quality of data: AI and ML projects are currently in high demand, and these projects need data at a large scale. Data integrity is also important, as one fault can cause serious problems in AI/ML algorithms. So, in scraping, it is very important not just to scrape the data but to verify its integrity as well. Doing this in real time is not always possible, so I prefer to write test cases for the extraction logic to make sure that whatever your spiders are extracting is correct and that they are not scraping any bad data.

    7. More Data, More Time: This one is obvious. The larger a website is, the more data it contains, and the longer it takes to scrape that site. This may be fine if your purpose for scraping the site isn’t time-sensitive, but that often isn’t the case. Stock prices don’t stay the same over hours. Sales listings, currency exchange rates, media trends, and market prices are just a few examples of time-sensitive data. What to do in this case, then? One solution is to design your spiders carefully: if you’re using a framework like Scrapy, apply proper LinkExtractor rules so that the spider doesn’t waste time scraping unrelated URLs.

    You may use multithreading scraping packages available in Python, such as Frontera and Scrapy Redis. Frontera lets you send out only one request per domain at a time, but can hit multiple domains at once, making it great for parallel scraping. Scrapy Redis lets you send out multiple requests to one domain. The right combination of these can result in a very powerful web spider that can handle both the bulk and variation for large websites.

    8. Captchas: Captchas are a good way of keeping crawlers away from a website, and many website hosts use them. So, in order to scrape data from such websites, we need a mechanism to solve the captchas. There are packages and services that can solve captchas and act as middleware between the target website and your spider. You can also use libraries like Pillow and Tesseract in Python to solve simple image-based captchas.

    9. Maintaining Deployment: Normally, we don’t want to limit ourselves to scraping just a few websites. We want the maximum amount of data present on the Internet, and that may mean scraping millions of websites; you can imagine the size of the code and the deployment. We can’t run spiders at this scale from a single machine. What I prefer here is to dockerize the scrapers and take advantage of technologies like AWS ECS or Kubernetes to run the scraper containers. This keeps our scrapers highly available and easy to maintain, and we can schedule them to run at regular intervals.

    Scraping Guidelines/ Best Practices

    1. Respect the robots.txt file: robots.txt is a text file that webmasters create to instruct search engine robots on how to crawl and index pages on the website; it generally contains instructions for crawlers. Before even planning the extraction logic, you should check this file. You can usually find it at the root of the website (e.g., example.com/robots.txt). It holds the rules for how crawlers should interact with the website. For example, if a website links to a download of critical information, the owners probably don’t want to expose that to crawlers. Another important factor is the frequency interval for crawling, meaning crawlers should only hit the website at the specified interval. If someone has asked us not to crawl their website, we had better not do it, because getting caught can lead to some serious legal issues.

    2. Do not hit the servers too frequently: As mentioned above, some websites specify a frequency interval for crawlers, and we had better use it wisely, because not every website is tested against high load. Hitting a server at a constant, aggressive rate creates huge traffic on the server side, and it may crash or fail to serve other requests. This has a high impact on the experience of real users, who matter more than the bots. So, make requests at the interval specified in robots.txt, or use a conservative delay such as 10 seconds. This also helps you avoid getting blocked by the target website.
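    With Scrapy, these politeness rules live in settings.py; a hedged example fragment (the values are illustrative and should be tuned per target site):

```python
# Illustrative Scrapy settings.py fragment for polite crawling.
ROBOTSTXT_OBEY = True        # honor the target site's robots.txt rules
DOWNLOAD_DELAY = 10          # seconds to wait between requests to a site
AUTOTHROTTLE_ENABLED = True  # adapt the delay to the server's response times
CONCURRENT_REQUESTS_PER_DOMAIN = 1  # never hammer one domain in parallel
```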

    3. User Agent Rotation and Spoofing: Every request carries a User-Agent string in its headers. This string identifies the browser you are using, its version, and the platform. If we use the same User-Agent in every request, it’s easy for the target website to detect that the requests are coming from a crawler. So, to avoid this, rotate the User-Agent between requests. You can easily find examples of genuine User-Agent strings on the Internet; try them out. If you’re using Scrapy, you can set the USER_AGENT property in settings.py.
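    Beyond the single USER_AGENT setting, rotation can be done in a small downloader middleware; a sketch (the User-Agent strings are abbreviated samples, and the middleware still has to be enabled in settings.py):

```python
# Sketch of a User-Agent-rotating downloader middleware for Scrapy.
# The strings below are shortened samples of real browser User-Agents.
import random

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0",
]

class RotatingUserAgentMiddleware:
    """Picks a different User-Agent header for every outgoing request."""

    def process_request(self, request, spider):
        request.headers["User-Agent"] = random.choice(USER_AGENTS)
```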

    4. Disguise your requests by rotating IPs and Proxy Services: We’ve discussed this in the challenges above. It’s always better to use rotating IPs and proxy service so that your spider won’t get blocked.

    5. Do not follow the same crawling pattern: As you know, many websites use anti-scraping technologies, so it’s easy for them to detect your spider if it always crawls in the same pattern. As humans, we would not normally follow a fixed pattern on a particular website. So, to have your spiders run smoothly, we can introduce human-like actions such as mouse movements and clicking a random link, which give the impression that your spider is a human.

    6. Scrape during off-peak hours: Off-peak hours are best for bots/crawlers, as the traffic on the website is considerably lower. These hours can be identified from the geolocation where the site's traffic originates. This also improves the crawling rate and avoids adding spider load at busy times. Thus, it is advisable to schedule crawlers to run during off-peak hours.

    7. Use the scraped data responsibly: We should always take responsibility for the scraped data. It is not acceptable to scrape data and republish it somewhere else; this can be considered a copyright violation and may lead to legal issues. So, it is advisable to check the target website's Terms of Service page before scraping.

    8. Use Canonical URLs: When we scrape, we tend to collect duplicate URLs, and hence duplicate data, which is the last thing we want. A single website may expose multiple URLs that serve the same content. In this situation, the duplicate URLs will declare a canonical URL, which points to the parent or original URL. By honoring it, we make sure we don't scrape duplicate content. In frameworks like Scrapy, duplicate URLs are handled by default.
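    Extracting the canonical link can be done with the standard library alone. A sketch, assuming the page declares a canonical `<link>` tag; the URL and the CanonicalFinder class are hypothetical:

```python
from html.parser import HTMLParser

class CanonicalFinder(HTMLParser):
    """Pull the canonical URL out of a page, if one is declared."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "link" and attrs.get("rel") == "canonical":
            self.canonical = attrs.get("href")

# Hypothetical duplicate listing page pointing at its canonical parent:
page = '<head><link rel="canonical" href="https://example.com/product/42"></head>'
finder = CanonicalFinder()
finder.feed(page)
print(finder.canonical)  # → https://example.com/product/42
```

    Deduplicating on this value instead of the request URL keeps one copy of each page in your dataset.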

    9. Be transparent: Don’t misrepresent your purpose or use deceptive methods to gain access. If you have a login and a password that identifies you to gain access to a source, use it.  Don’t hide who you are. If possible, share your credentials.

    Conclusion

    We’ve seen the basics of scraping, frameworks, how to crawl, and the best practices of scraping. To conclude:

    • Follow the target website's rules while scraping. Don't make them block your spider.
    • Maintaining data and spiders at scale is difficult. Use Docker/Kubernetes and public cloud providers like AWS to easily scale your web-scraping backend.
    • Always respect the rules of the websites you plan to crawl. If APIs are available, always use them first.
  • Acquiring Temporary AWS Credentials with Browser Navigated Authentication

    In one of my previous blog posts (Hacking your way around AWS IAM Roles), we demonstrated how users can access AWS resources without having to store AWS credentials on disk. This was achieved by setting up an OpenVPN server and a client-side route that gets pushed automatically when the user connects to the VPN. To date, I find this a compliance-friendly solution that doesn't force users to do any manual configuration on their system. It also makes sense to have access to AWS resources only as long as they are connected to the VPN. One downside of this method is maintaining the OpenVPN server, keeping it secure, and running it in a highly available (HA) state. If the OpenVPN server is compromised, our credentials are at risk. Secondly, all users connected to the VPN get the same level of access.

    In this blog post, we present a CLI utility written in Rust that writes temporary AWS credentials to a user profile (the ~/.aws/credentials file) using web-browser-navigated Google authentication. This utility is inspired by gimme-aws-creds (written in Python for an Okta-authenticated AWS farm) and the Heroku CLI (written in Node.js, using the oclif framework). We will refer to our utility as auth-awscreds throughout this post.

    “If you have an apple and I have an apple and we exchange these apples then you and I will still each have one apple. But if you have an idea and I have an idea and we exchange these ideas, then each of us will have two ideas.”

    – George Bernard Shaw

    What does this CLI utility (auth-awscreds) do?

    When the user fires the command (auth-awscreds) on the terminal, our program reads the utility configuration from the file .auth-awscreds located in the user's home directory. If this file is not present, the utility prompts for the configuration on first run. The utility configuration file is in INI format. The program then opens the default web browser and navigates to the URL read from the configuration file. At this point, the utility waits for the browser to complete navigation and authorization. The web UI then navigates to Google authentication. If authentication is successful, a callback is shared with the CLI utility along with temporary AWS credentials, which are then written to the ~/.aws/credentials file.

    Block Diagram

    Tech Stack Used

    As stated earlier, we wrote this utility in Rust. One of the reasons for choosing Rust is that we wanted a statically linked binary (ELF) file that executes independently of an interpreter and ships as-is once compiled. Programs written in Python or Node.js, by contrast, need a language interpreter and supporting libraries installed. Go would also have sufficed for our purpose, but I prefer Rust over Go.

    Software Stack:

    • Rust (for CLI utility)
    • Actix Web – HTTP Server
    • Node.js, Express, ReactJS, serverless-http, aws-sdk, AWS Amplify, axios
    • Terraform and serverless framework

    Infrastructure Stack:

    • AWS Cognito (User Pool and Federated Identities)
    • AWS API Gateway (HTTP API)
    • AWS Lambda
    • AWS S3 Bucket (React App)
    • AWS CloudFront (For Serving React App)
    • AWS ACM (SSL Certificate)

    Recipe

    Architecture Diagram

    CLI Utility: auth-awscreds

    Our goal is, when the auth-awscreds command is fired, to first check whether the ~/.aws/credentials file exists in the user's home directory. If not, we create the ~/.aws directory. This is the default AWS credentials directory, where the AWS SDK usually looks for credentials (unless explicitly overridden by the env var AWS_SHARED_CREDENTIALS_FILE). The next step is to check whether a ~/.auth-awscreds file exists. If it doesn't, we prompt the user for two inputs: 

    1. AWS credentials profile name (used by SDK, default is preferred) 

    2. Application domain URL (Our backend app domain is used for authentication)

    let app_profile_file = format!("{}/.auth-awscreds",&user_home_dir);
     
       let config_exist : bool = Path::new(&app_profile_file).exists();
     
       let mut profile_name = String::new();
       let mut app_domain = String::new();
     
       if !config_exist {
           //ask the series of questions
           print!("Which profile to write AWS Credentials [default] : ");
           io::stdout().flush().unwrap();
           io::stdin()
               .read_line(&mut profile_name)
               .expect("Failed to read line");
     
           print!("App Domain : ");
           io::stdout().flush().unwrap();
          
           io::stdin()
               .read_line(&mut app_domain)
               .expect("Failed to read line");
          
           profile_name=String::from(profile_name.trim());
           app_domain=String::from(app_domain.trim());
          
           config_profile(&profile_name,&app_domain);
          
       }
       else {
           (profile_name,app_domain) = read_profile();
       }

    These two properties are written to ~/.auth-awscreds under the default section. Following this, our utility generates an asymmetric 1024-bit RSA key pair. Both keys are then base64-encoded.

    pub fn genkeypairs() -> (String,String) {
       let rsa = Rsa::generate(1024).unwrap();
     
       let private_key: Vec<u8> = rsa.private_key_to_pem_passphrase(Cipher::aes_128_cbc(),"Sagar Barai".as_bytes()).unwrap();
       let public_key: Vec<u8> = rsa.public_key_to_pem().unwrap();
     
       (base64::encode(private_key) , base64::encode(public_key))
    }

    We then launch a browser window and navigate to the specified app domain URL. At this stage, our utility starts a temporary web server with the help of the Actix Web framework, listening on localhost port 63442.

    println!("Opening web ui for authentication...!");
       open::that(&app_domain).unwrap();
     
       HttpServer::new(move || {
           //let stopper = tx.clone();
           let cors = Cors::permissive();
           App::new()
           .wrap(cors)
           //.app_data(stopper)
           .app_data(crypto_data.clone())
           .service(get_public_key)
           .service(set_aws_creds)
       })
       .bind(("127.0.0.1",63442))?
       .run()
       .await

    The localhost web server has two endpoints.

    1. GET Endpoint (/publickey): This endpoint is called by our React app after authentication and returns the public key created during the initialization process. Since the web server hosted by the Rust application is insecure (non-SSL), the actual AWS credentials, when received, should be posted as a string encrypted with this public key.

    #[get("/publickey")]
    pub async fn get_public_key(data: web::Data<AppData>) -> impl Responder {
       let public_key = &data.public_key;
      
       web::Json(HTTPResponseData{
           status: 200,
           msg: String::from("Ok"),
           success: true,
           data: String::from(public_key)
       })
    }

    2. POST Endpoint (/setcreds): This endpoint is called once the React app has successfully retrieved credentials from API Gateway. The credentials are decrypted with the private key and then written to the ~/.aws/credentials file under the profile name defined in the utility configuration. 

    let encrypted_data = payload["data"].as_array().unwrap();
       let username = payload["username"].as_str().unwrap();
     
       let mut decypted_payload = vec![];
     
       for str in encrypted_data.iter() {
           //println!("{}",str.to_string());
           let s = str.as_str().unwrap();
           let decrypted = decrypt_data(&private_key, &s.to_string());
           decypted_payload.extend_from_slice(&decrypted);
       }
     
       let credentials : serde_json::Value = serde_json::from_str(&String::from_utf8(decypted_payload).unwrap()).unwrap();
     
       let aws_creds = AWSCreds{
           profile_name: String::from(profile_name),
           aws_access_key_id: String::from(credentials["AccessKeyId"].as_str().unwrap()),
           aws_secret_access_key: String::from(credentials["SecretAccessKey"].as_str().unwrap()),
           aws_session_token: String::from(credentials["SessionToken"].as_str().unwrap())
       };
     
       println!("Authenticated as {}",username);
       println!("Updating AWS Credentials File...!");
     
       configcreds(&aws_creds);

    One of the interesting parts of this code is the decryption process, which iterates through an array of strings joined via decypted_payload.extend_from_slice(&decrypted);. RSA-1024 encrypts 128-byte blocks, and we used OAEP padding, which consumes 42 of those bytes, leaving the rest for data. Thus, at most 86 bytes can be encrypted per block. So, when the credentials are received, they arrive as an array of base64-encoded, 128-byte-long chunks. One has to decode each base64 string to a data buffer and then decrypt the data piece by piece.
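    The chunking arithmetic can be sketched in a few lines of Python. Here base64 merely stands in for the real RSA-OAEP encryption/decryption of each 86-byte piece, and the credential values are placeholders:

```python
import base64
import json

CHUNK = 86  # RSA-1024 with OAEP: 128-byte blocks minus 42 bytes of padding

# Placeholder credentials (illustrative only)
creds = json.dumps({
    "AccessKeyId": "EXAMPLEKEYID",
    "SecretAccessKey": "examplesecret",
    "SessionToken": "T" * 200,  # session tokens are long, forcing several chunks
}).encode()

# Sender side (the Lambda): split into <=86-byte pieces, encrypt each one.
# base64 stands in here for the real per-chunk RSA-OAEP publicEncrypt.
payload = [base64.b64encode(creds[i:i + CHUNK]).decode()
           for i in range(0, len(creds), CHUNK)]

# Receiver side (the CLI): decrypt each piece and concatenate the plaintexts,
# as the Rust code does with decypted_payload.extend_from_slice(&decrypted).
restored = b"".join(base64.b64decode(p) for p in payload)
assert json.loads(restored) == json.loads(creds)
```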

    To generate a statically linked binary file, run: cargo build --release

    AWS Cognito and Google Authentication

    This guide does not cover how to set up Cognito and integration with Google Authentication. You can refer to our old post for a detailed guide on setting up authentication and authorization. (Refer to the sections Setup Authentication and Setup Authorization).

    React App:

    The React app is launched via our Rust CLI utility. This application is served right from the S3 bucket via CloudFront. When our React app is loaded, it checks if the current session is authenticated. If not, then with the help of the AWS Amplify framework, our app is redirected to Cognito-hosted UI authentication, which in turn auto redirects to Google Login page.

    render(){
       return (
         <div className="centerdiv">
           {
             this.state.appInitialised ?
               this.state.user === null ? Auth.federatedSignIn({provider: 'Google'}) :
               <Aux>
                 {this.state.pageContent}
               </Aux>
             :
             <Loader/>
           }
         </div>
       )
     }

    Once the session is authenticated, we set the React state variables and then retrieve the public key from the Actix web server (Rust CLI app: auth-awscreds) by calling the /publickey GET method. Following this, an Ajax POST request (/auth-creds) is made via the axios library to API Gateway. The payload contains the public key and a JWT token for authentication. The expected response from API Gateway is the encrypted temporary AWS credentials, which are then proxied to our CLI application.

    To ease this deployment, we have written Terraform code (available in the repository) that takes care of creating the S3 bucket, CloudFront distribution, and ACM certificate, building the React app, and deploying it to the S3 bucket. Navigate to the vars.tf file and change the respective default variables. The Terraform script will fail at first launch, since ACM needs DNS record validation; create a CNAME record for DNS validation and re-run the Terraform script to continue the deployment. The React app expects a few environment variables. Below is a sample .env file; update the respective values for your environment.

    REACT_APP_IDENTITY_POOL_ID=
    REACT_APP_COGNITO_REGION=
    REACT_APP_COGNITO_USER_POOL_ID=
    REACT_APP_COGNTIO_DOMAIN_NAME=
    REACT_APP_DOMAIN_NAME=
    REACT_APP_CLIENT_ID=
    REACT_APP_CLI_APP_URL=
    REACT_APP_API_APP_URL=

    Finally, deploy the React app using the sample commands below.

    $ terraform plan -out plan     #creates plan for revision
    $ terraform apply plan         #apply plan and deploy

    API Gateway HTTP API and Lambda Function

    When a request is first intercepted by API Gateway, it validates the JWT token on its own; API Gateway natively supports Cognito integration. Thus, any payload with an invalid Authorization header is rejected at API Gateway itself. This eases our authentication process and validates the identity. If the request is valid, it is then received by our Lambda function. Our Lambda function is written in Node.js, with the serverless-http framework wrapped around an Express app. The Express app has only one endpoint.

    /auth-creds (POST): once the request is received, it retrieves the identity ID from Cognito and logs it to stdout for audit purposes.

    let identityParams = {
               IdentityPoolId: process.env.IDENTITY_POOL_ID,
               Logins: {}
           };
      
           identityParams.Logins[`${process.env.COGNITOIDP}`] = req.headers.authorization;
      
           const ci = new CognitoIdentity({region : process.env.AWSREGION});
      
           let idpResponse = await ci.getId(identityParams).promise();
      
           console.log("Auth Creds Request Received from ",JSON.stringify(idpResponse));

    The app then extracts the base64-encoded public key. Following this, an STS (Security Token Service) API call is made and temporary credentials are derived. These credentials are then encrypted with the public key in chunks of 86 bytes.

    const pemPublicKey = Buffer.from(public_key,'base64').toString();
     
           const authdata=await sts.assumeRole({
               ExternalId: process.env.STS_EXTERNAL_ID,
               RoleArn: process.env.IAM_ROLE_ARN,
               RoleSessionName: "DemoAWSAuthSession"
           }).promise();
     
           const creds = JSON.stringify(authdata.Credentials);
           const splitData = creds.match(/.{1,86}/g);
          
           const encryptedData = splitData.map(d=>{
               return publicEncrypt(pemPublicKey,Buffer.from(d)).toString('base64');
           });

    Here, assumeRole assumes the IAM role, which has the appropriate policy documents attached. For the sake of this demo, we attached the AdministratorAccess policy. However, one should consider hardening the policy document and avoid attaching the Administrator policy directly to the role.

    resources:
     Resources:
       AuthCredsAssumeRole:
         Type: AWS::IAM::Role
         Properties:
           AssumeRolePolicyDocument:
             Version: "2012-10-17"
             Statement:
               -
                 Effect: Allow
                 Principal:
                   AWS: !GetAtt IamRoleLambdaExecution.Arn
                 Action: sts:AssumeRole
                 Condition:
                   StringEquals:
                     sts:ExternalId: ${env:STS_EXTERNAL_ID}
           RoleName: auth-awscreds-api
           ManagedPolicyArns:
             - arn:aws:iam::aws:policy/AdministratorAccess

    Finally, the response is sent to the React app. 

    We have used the Serverless framework to deploy the API. The Serverless framework creates the API Gateway, Lambda function, Lambda layer, and IAM role, and takes care of deploying the code to the Lambda function.

    To deploy this application, follow the below steps.

    1. cd layer/nodejs && npm install && cd ../.. && npm install

    2. npm install -g serverless (on mac you can skip this step and use the npx serverless command instead) 

    3. Create a .env file, add the environment variables below to it, and set the respective values.

    AWSREGION=ap-south-1
    COGNITO_USER_POOL_ID=
    IDENTITY_POOL_ID=
    COGNITOIDP=
    APP_CLIENT_ID=
    STS_EXTERNAL_ID=
    IAM_ROLE_ARN=
    DEPLOYMENT_BUCKET=
    APP_DOMAIN=

    4. serverless deploy or npx serverless deploy

    The entire codebase for the CLI app, React app, and backend API is available in the GitHub repository.

    Testing:

    Assuming that you have the compiled binary (auth-awscreds) available on your local machine and, for the sake of testing, have installed `aws-cli`, you can then run /path/to/your/auth-awscreds. 

    App Testing

    If you selected your AWS profile name as “demo-awscreds,” you can then export the AWS_PROFILE environment variable. If you prefer a “default” profile, you don’t need to export the environment variable as AWS SDK selects a “default” profile on its own.

    [demo-awscreds]
    aws_access_key_id=ASIAUAOF2CHC77SJUPZU
    aws_secret_access_key=r21J4vwPDnDYWiwdyJe3ET+yhyzFEj7Wi1XxdIaq
    aws_session_token=FwoGZXIvYXdzEIj//////////wEaDHVLdvxSNEqaQZPPQyK2AeuaSlfAGtgaV1q2aKBCvK9c8GCJqcRLlNrixCAFga9n+9Vsh/5AWV2fmea6HwWGqGYU9uUr3mqTSFfh+6/9VQH3RTTwfWEnQONuZ6+E7KT9vYxPockyIZku2hjAUtx9dSyBvOHpIn2muMFmizZH/8EvcZFuzxFrbcy0LyLFHt2HI/gy9k6bLCMbcG9w7Ej2l8vfF3dQ6y1peVOQ5Q8dDMahhS+CMm1q/T1TdNeoon7mgqKGruO4KJrKiZoGMi1JZvXeEIVGiGAW0ro0/Vlp8DY1MaL7Af8BlWI1ZuJJwDJXbEi2Y7rHme5JjbA=

    To validate, you can then run “aws s3 ls.” You should see S3 buckets listed from your AWS account. Note that these credentials are only valid for 60 minutes. This means you will have to re-run the command and acquire a new pair of AWS credentials. Of course, you can configure your IAM role to extend expiry for an “assume role.” 

    auth-awscreds in Action:

    Summary

    Currently, “auth-awscreds” is in its early development stage. This post demonstrates how temporary AWS credentials can be acquired without having to worry about key rotation. One of the features we are currently working on is RBAC, with the help of AWS Cognito. Since the tool doesn't yet support command-line arguments, the utility configuration can't be changed from the CLI; you can manually edit or delete the configuration file, which triggers the configuration prompt on the next run. We also want to add support for multiple profiles so that multiple AWS accounts can be used.

  • A Primer on HTTP Load Balancing in Kubernetes using Ingress on Google Cloud Platform

    Containerized applications and Kubernetes adoption in cloud environments are on the rise. One of the challenges of deploying applications in Kubernetes is exposing these containerized applications to the outside world. This blog explores the different options via which applications can be accessed externally, with a focus on Ingress – a new feature in Kubernetes that provides an external load balancer. It also provides a simple hands-on tutorial on Google Cloud Platform (GCP).

    Ingress is a new feature (currently in beta) from Kubernetes that aspires to be an application load balancer, intending to simplify the ability to expose your applications and services to the outside world. It can be configured to give services externally reachable URLs, load balance traffic, terminate SSL, offer name-based virtual hosting, etc. Before we dive into Ingress, let's look at some of the alternatives currently available for exposing your applications and their complexities/limitations, and then try to understand Ingress and how it addresses these problems.

    Current ways of exposing applications externally:

    There are several ways to expose your applications externally. Let's look at each of them:

    EXPOSE Pod:

    You can expose your application directly from your pod by using a port on the node running your pod, mapping that port to a port exposed by your container, and using the combination HOST-IP:HOST-PORT to access your application externally. This is similar to what you would do when running Docker containers directly without Kubernetes. With Kubernetes, you can use the hostPort setting in the pod configuration, which does the same thing. Another approach is to set hostNetwork: true in the pod configuration to use the host's network interface from your pod.

    Limitations:

    • In both scenarios, you should take extra care to avoid port conflicts on the host, and possibly some issues with packet routing and name resolution.
    • This limits you to one replica of the pod per cluster node, as the host port you use is unique and can bind to only one service.

    EXPOSE Service:

    Kubernetes services primarily work to interconnect different pods which constitute an application. You can scale the pods of your application very easily using services. Services are not primarily intended for external access, but there are some accepted ways to expose services to the external world.

    Basically, services provide a routing, balancing and discovery mechanism for the pod’s endpoints. Services target pods using selectors, and can map container ports to service ports. A service exposes one or more ports, although usually, you will find that only one is defined.

    A service can be exposed using the following ServiceType choices:

    • ClusterIP: Exposes the service on a cluster-internal IP. Choosing this value makes the service only reachable from within the cluster. This is the default ServiceType.
    • NodePort: Exposes the service on each Node’s IP at a static port (the NodePort). A ClusterIP service, to which the NodePort service will route, is automatically created. You’ll be able to contact the NodePort service from outside the cluster by requesting <NodeIP>:<NodePort>. Here NodePort remains fixed, and NodeIP can be any node IP of your Kubernetes cluster.
    • LoadBalancer: Exposes the service externally using a cloud provider’s load balancer (eg. AWS ELB). NodePort and ClusterIP services, to which the external load balancer will route, are automatically created.
    • ExternalName: Maps the service to the contents of the externalName field (e.g. foo.bar.example.com) by returning a CNAME record with its value. No proxying of any kind is set up. This requires version 1.7 or higher of kube-dns.
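    As an illustrative sketch of the NodePort case, a Service manifest for a hypothetical my-app deployment might look like this (names and port numbers are examples, not from this tutorial):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-app
spec:
  type: NodePort            # expose on every node's IP
  selector:
    app: my-app             # pods targeted by this service
  ports:
    - port: 80              # service (ClusterIP) port
      targetPort: 8080      # container port
      nodePort: 30080       # optional; must fall in 30000-32767
```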

    Limitations:

    • If we choose NodePort to expose our services, Kubernetes will generate ports corresponding to the ports of your pods in the range 30000-32767. You will need to add an external proxy layer that uses DNAT to expose friendlier ports. The external proxy layer will also have to take care of load balancing so that you leverage the power of your pod replicas. It would also not be easy to add TLS or simple host-header routing rules to the external service.
    • ClusterIP and ExternalName, while similarly easy to use, have the limitation that we cannot add any routing or load-balancing rules.
    • Choosing LoadBalancer is probably the easiest of all methods to get your service exposed to the internet. The problem is that there is no standard way of telling a Kubernetes service about the elements that a balancer requires, again TLS and host headers are left out. Another limitation is reliance on an external load balancer (AWS’s ELB, GCP’s Cloud Load Balancer etc.)

    Endpoints

    Endpoints are usually created automatically by services, unless you are using headless services and adding the endpoints manually. An endpoint is a host:port tuple registered with Kubernetes, and in the service context it is used to route traffic. The service keeps the endpoints up to date as pods that match the selector are created, deleted, and modified. Individually, endpoints are not useful for exposing services, since they are to some extent ephemeral objects.

    Summary

    If you can rely on your cloud provider to correctly implement the LoadBalancer for their API, to keep up-to-date with Kubernetes releases, and you are happy with their management interfaces for DNS and certificates, then setting up your services as type LoadBalancer is quite acceptable.

    On the other hand, if you want to manage load balancing systems manually and set up port mappings yourself, NodePort is a low-complexity solution. If you are directly using Endpoints to expose external traffic, perhaps you already know what you are doing (but consider that you might have made a mistake, there could be another option).

    Given that none of these elements has been originally designed to expose services to the internet, their functionality may seem limited for this purpose.

    Understanding Ingress

    Traditionally, you would create a LoadBalancer service for each public application you want to expose. Ingress gives you a way to route requests to services based on the request host or path, centralizing a number of services into a single entrypoint.

    Ingress is split up into two main pieces. The first is an Ingress resource, which defines how you want requests routed to the backing services and second is the Ingress Controller which does the routing and also keeps track of the changes on a service level.

    Ingress Resources

    The Ingress resource is a set of rules that map to Kubernetes services. Ingress resources are defined purely within Kubernetes as an object that other entities can watch and respond to.

    Ingress Supports defining following rules in beta stage:

    • host header:  Forward traffic based on domain names.
    • paths: Looks for a match at the beginning of the path.
    • TLS: If the ingress adds TLS, HTTPS and a certificate configured through a secret will be used.

    When no host header rules are included at an Ingress, requests without a match will use that Ingress and be mapped to the backend service. You will usually do this to send a 404 page to requests for sites/paths which are not sent to the other services. Ingress tries to match requests to rules, and forwards them to backends, which are composed of a service and a port.

    Ingress Controllers

    The Ingress controller is the entity that grants (or removes) access based on changes in the services, pods, and Ingress resources. The Ingress controller gets the state-change data by directly calling the Kubernetes API.

    Ingress controllers are applications that watch Ingresses in the cluster and configure a balancer to apply those rules. You can configure any of the third party balancers like HAProxy, NGINX, Vulcand or Traefik to create your version of the Ingress controller.  Ingress controller should track the changes in ingress resources, services and pods and accordingly update configuration of the balancer.

    Ingress controllers will usually track and communicate with endpoints behind services instead of using services directly. This way some network plumbing is avoided, and we can also manage the balancing strategy from the balancer. Some of the open source implementations of Ingress Controllers can be found here.

    Now, let’s do an exercise of setting up a HTTP Load Balancer using Ingress on Google Cloud Platform (GCP), which has already integrated the ingress feature in it’s Container Engine (GKE) service.

    Ingress-based HTTP Load Balancer in Google Cloud Platform

    The tutorial assumes that you have your GCP account set up and a default project created. We will first create a Container cluster, followed by deployment of an nginx service and an echoserver service. Then we will set up an Ingress resource for both services, which will configure the HTTP load balancer provided by GCP.

    Basic Setup

    Get your project ID by going to the “Project info” section in your GCP dashboard. Start the Cloud Shell terminal, set your project id and the compute/zone in which you want to create your cluster.

    $ gcloud config set project glassy-chalice-129514
    $ gcloud config set compute/zone us-east1-d
    # Create a 3 node cluster with name “loadbalancedcluster”
    $ gcloud container clusters create loadbalancedcluster

    Fetch the cluster credentials for the kubectl tool:

    $ gcloud container clusters get-credentials loadbalancedcluster --zone us-east1-d --project glassy-chalice-129514

    Step 1: Deploy an nginx server and echoserver service

    $ kubectl run nginx --image=nginx --port=80
    $ kubectl run echoserver --image=gcr.io/google_containers/echoserver:1.4 --port=8080
    $ kubectl get deployments
    NAME         DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
    echoserver   1         1         1            1           15s
    nginx        1         1         1            1           26m

    Step 2: Expose your nginx and echoserver deployment as a service internally

    Create a Service resource to make the nginx and echoserver deployment reachable within your container cluster:

    $ kubectl expose deployment nginx --target-port=80  --type=NodePort
    $ kubectl expose deployment echoserver --target-port=8080 --type=NodePort

    When you create a Service of type NodePort with this command, Container Engine makes your Service available on a randomly-selected high port number (e.g. 30746) on all the nodes in your cluster. Verify the Service was created and a node port was allocated:

    $ kubectl get service nginx
    NAME      CLUSTER-IP     EXTERNAL-IP   PORT(S)        AGE
    nginx     10.47.245.54   <nodes>       80:30746/TCP   20s
    $ kubectl get service echoserver
    NAME         CLUSTER-IP    EXTERNAL-IP   PORT(S)          AGE
    echoserver   10.47.251.9   <nodes>       8080:32301/TCP   33s

    In the output above, the node port for the nginx Service is 30746 and for echoserver service is 32301. Also, note that there is no external IP allocated for this Services. Since the Container Engine nodes are not externally accessible by default, creating this Service does not make your application accessible from the Internet. To make your HTTP(S) web server application publicly accessible, you need to create an Ingress resource.

    Step 3: Create an Ingress resource

    On Container Engine, Ingress is implemented using Cloud Load Balancing. When you create an Ingress in your cluster, Container Engine creates an HTTP(S) load balancer and configures it to route traffic to your application. Container Engine has internally defined an Ingress Controller, which takes the Ingress resource as input for setting up proxy rules and talk to Kubernetes API to get the service related information.

    The following config file defines an Ingress resource that directs traffic to your nginx and echoserver server:

    apiVersion: extensions/v1beta1
    kind: Ingress
    metadata:
      name: fanout-ingress
    spec:
      rules:
      - http:
          paths:
          - path: /
            backend:
              serviceName: nginx
              servicePort: 80
          - path: /echo
            backend:
              serviceName: echoserver
              servicePort: 8080

    To deploy this Ingress resource, save the manifest above as basic-ingress.yaml and run in the Cloud Shell:

    $ kubectl apply -f basic-ingress.yaml

    Step 4: Access your application

    Find out the external IP address of the load balancer serving your application by running:

    $ kubectl get ingress fanout-ingress
    NAME             HOSTS     ADDRESS          PORTS     AGE
    fanout-ingress   *         130.211.36.168   80        36s

     

    Use http://<external-ip-address> and http://<external-ip-address>/echo to access nginx and the echo server.

    Summary

    Ingresses are simple, easy to deploy, and fun to play with. However, the resource is currently in beta and lacks some features, which may restrict its use in production. Stay tuned for updates on the Kubernetes Ingress page and the project's GitHub repo.

  • SEO for Web Apps: How to Boost Your Search Rankings

    A web developer's responsibilities go beyond designing and developing a web application: they include adding the right set of features that help the site attract more traffic. One way of getting traffic is ensuring your web page is listed in the top search results of Google. Search engines consider certain factors while ranking a web page (covered in this guide below), and accommodating these factors in your web app is called search engine optimization.

    A web app that is search engine optimized loads faster, has a good user experience, and is shown in the top search results of Google. If you want your web app to have these features, then this essential guide to SEO will provide you with a checklist to follow when working on SEO improvements.

    Key Facts:

    • 75% of visitors only visit the first three links listed and results from the second page get only 0.78% of clicks.
    • 95% of visitors visit only the links from the first page of Google.
    • Search engines give 300% more traffic than social media.
    • 8% of searches from browsers are in the form of a question.
    • 40% of visitors will leave a website if it takes more than 3 seconds to load. And more shocking is that 80% of those visitors will not visit the same site again.

    How Search Works:

     

     

    1. Crawling: Crawling is done by automated scripts that are often referred to as web crawlers, web spiders, or Googlebot, and sometimes shortened to crawlers. These scripts revisit pages from past crawls and look for the sitemap file, which is found at the root directory of the web application. We will cover more on the sitemap later. For now, just understand that the sitemap file has all the links to your website, ordered hierarchically. Crawlers add those links to the crawl queue so that they can be crawled later. Crawlers pay special attention to newly added sites and frequently updated/visited sites, and they use several algorithms to decide how often an existing site should be recrawled.
    2. Indexing: Let us first understand what indexing means. Indexing is collecting, parsing, and storing data to enable a super-fast response to queries. Google uses the same steps to perform web indexing: it visits each page from the crawl queue, analyzes what the page is about along with its content, images, and video, then parses the analyzed result and stores it in its database, called the Google Index.
    3. Serving: When a user makes a search query on Google, Google tries to determine the highest quality result and considers other criteria before serving it, such as the user's location, submitted data, language, and device (desktop/mobile). That is why responsiveness is also considered for SEO. Unresponsive sites might rank higher on desktop but lower on mobile because, while analyzing the page content, these bots see the pages as the user sees them and assign the ranking accordingly.

    Factors that affect SEO ranking:

    1. Sitemap: The sitemap file comes in two types, HTML and XML, and both files are placed at the root of the web app. The HTML sitemap guides users around the website pages; it lists the pages hierarchically to help users understand the flow of the website. The XML sitemap helps search engine bots crawl the pages of the site and understand the website structure. It carries several fields that help the bots crawl intelligently:

    loc: The URL of the webpage.

    lastmod: When the content of the URL got updated.

    changefreq: How often the content of the page gets changed.

    priority: It ranges from 0 to 1, where 0 represents the lowest priority and 1 the highest. 1 is generally given to the home or landing page. Setting 1 for every URL will cause search engines to ignore this field.

    Click here to see what a sitemap.xml looks like.

    The below example shows how the URL will be written along with the fields.
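    A minimal sitemap.xml entry using the fields above would look like this (the URL and dates are placeholders):

    ```xml
    <?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <url>
        <loc>https://www.example.com/</loc>
        <lastmod>2021-06-01</lastmod>
        <changefreq>weekly</changefreq>
        <priority>1.0</priority>
      </url>
    </urlset>
    ```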

     

    2. Meta tags: Meta tags matter because they indirectly affect the SEO ranking. They contain important information about the web page, and this information is shown as the snippet in Google search results. Users see this snippet and decide whether to click the link, and search engines consider click-through rates when serving results. Meta tags are not visible to the user on the web page, but they are part of the HTML code.

    A few important meta tags for SEO are:

    • Meta title: This is the primary content shown by the search results, and it plays a huge role in deciding the click rates because it gives users a quick glance at what this page is about. It should ideally be 50-60 characters long, and the title should be unique for each page.
    • Meta description: It summarizes or gives an overview of the page content in short. The description should be precise and of high quality. It should include some targeted keywords the user will likely search and be under 160 characters.
    • Meta robots: It tells search engines whether to index and crawl web pages. The four values it can contain are index, noindex, follow, or nofollow. If these values are not used correctly, then it will negatively impact the SEO.
      index/noindex: Tells whether to index the web page.
      follow/nofollow: Tells whether to crawl links on the web page.
    • Meta viewport: It signals to search engines that the web page is responsive to different screen sizes, and it instructs the browser on how to render the page. The presence of this tag helps search engines understand that the website is mobile-friendly, which matters because Google ranks results differently in mobile search. If the desktop version is served on mobile, the user will most likely close the page, sending a negative signal to Google that the page has undesirable content and lowering its ranking. This tag should be present on all web pages.

      Let us look at what a Velotio page would look like with and without the meta viewport tag.


    • Meta charset: It sets the character encoding of the webpage; in simple terms, it tells the browser how the text should be displayed. Wrong character encoding makes content hard to read for search engines and leads to a bad user experience. Use UTF-8 character encoding wherever possible.
    • Meta keywords: Search engines don’t consider this tag anymore. Bing considers this tag as spam. If this tag is added to any of the web pages, it may work against SEO. It is advisable not to have this tag on your pages.
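    Putting the recommended tags above together, a page head might look like this (titles and descriptions are placeholders):

    ```html
    <head>
      <meta charset="UTF-8">
      <meta name="viewport" content="width=device-width, initial-scale=1">
      <title>Example Page Title (ideally 50-60 characters)</title>
      <meta name="description" content="A concise, keyword-rich summary of the page under 160 characters.">
      <meta name="robots" content="index, follow">
    </head>
    ```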

    3. Usage of Headers / Hierarchical content: Header tags are the heading tags that are important for user readability and search engines. Headers organize the content of the web page so that it won't look like a plain wall of text. Bots check how well the content is organized and assign the ranking accordingly. Headers make the content user-friendly, scannable, and accessible. Header tags range from h1 to h6, with h1 carrying the highest importance and h6 the lowest. Googlebot pays most attention to h1 because it is typically the title of the page and briefly conveys what the page content is about.

    If Velotio’s different pages of content were written on one big page (not good advice, just for example), then hierarchy can be done like the below snapshot.

    4. Usage of Breadcrumb: Breadcrumbs are navigational elements that let users track which page they are currently on. Search engines find them helpful for understanding the structure of the website. They lower the bounce rate by encouraging users to explore other pages of the website. Breadcrumbs are usually found at the top of the page in a slightly smaller font. Using breadcrumbs is always recommended if your site has deeply nested pages.

    If we refer to the MDN pages, then a hierarchical breadcrumb can be found at the top of the page.
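    A breadcrumb is typically marked up as an ordered list inside a nav element (a generic sketch, not Velotio's or MDN's actual markup):

    ```html
    <nav aria-label="Breadcrumb">
      <ol>
        <li><a href="/">Home</a></li>
        <li><a href="/blog/">Blog</a></li>
        <li aria-current="page">SEO for Web Apps</li>
      </ol>
    </nav>
    ```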

    5. User Experience (UX): UX has become an integral component of SEO. A good UX always makes your users stay longer, which lowers the bounce rate and makes them visit your site again. Google recognizes this stay time and click rates and considers the site as more attractive to users, ranking it higher in the search results. Consider the following points to have a good user experience.

    1. Divide content into sections, not just a plain wall of text
    2. Use hierarchical font sizes
    3. Use images/videos that summarize the content
    4. Good theme and color contrast
    5. Responsiveness (desktop/tablet/mobile)

    6. Robots.txt: The robots.txt file prevents crawlers from accessing certain pages of the site. It contains commands that tell the bots not to crawl the disallowed pages. As a result, crawlers will not crawl those pages and will not index them. The best example of a page that should not be crawled is a payment gateway page. Robots.txt is kept at the root of the web app and should be public. Refer to Velotio's robots.txt file to learn more. User-Agent:* means the given command applies to all the bots that support robots.txt.
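    A minimal robots.txt following these conventions might look like this (the paths are placeholders):

    ```txt
    User-Agent: *
    Disallow: /payment/
    Disallow: /admin/

    Sitemap: https://www.example.com/sitemap.xml
    ```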

    7. Page speed: Page speed is the time it takes for the page to be fully displayed and interactive. Google also considers page speed an important factor for SEO. As we have seen in the facts section, users tend to close a site if it takes longer than 3 seconds to load. To Googlebot, this is unfavorable to the user experience, and it will lower the ranking. We will go through some tools later in this section to measure the loading speed of a page, but if your site loads slowly, then look into the recommendations below.

    • Image compression: On a consumer-oriented website, images contribute around 50-90% of the page weight, so they must load quickly. Use compressed images, which lowers the file size without compromising quality. Cloudinary is a platform that does this job decently.
      If your image is 700×700 but displayed in a 300×300 container, then rather than scaling it down with CSS, load a 300×300 image in the first place: the browser doesn't need to download such a large image, and shrinking it with CSS takes extra time. All of this can be avoided by loading an image of the required size.
      By deferring/lazy loading images, images are downloaded only when they are needed as the user scrolls the webpage. This way, the images are not all loaded at once, and the browser has bandwidth to perform other tasks.
      Using sprite images is also an effective way to reduce HTTP requests: small icons are combined into one sprite image, and only the section we want to show is displayed. This saves load time by avoiding loading multiple images.
    • Code optimization: Every developer should consider reusability while developing code, which will help in reducing the code size. Nowadays, most websites are developed using bundlers. Use bundle analyzers to analyze which piece of code is leading to a size increase. Bundlers are already doing the minification process while generating the build artifacts.
    • Removing render-blocking resources: Browsers build the DOM tree by parsing HTML. During this process, if the parser finds a script, DOM construction is paused and script execution starts. This increases the page load time. To avoid blocking DOM creation, use async and defer on your scripts and load scripts at the end of the body. Keep in mind, though, that some scripts need to load in the header, like the Google Analytics script. Don't apply this step blindly, as it may cause unusual behavior in your site.
    • Implementing a Content Distribution Network (CDN): A CDN shortens resource load times by finding the server nearest to the user's location and delivering the content from there.
    • Good hosting platform: Optimizing images and code alone cannot always improve page speed. Budget-friendly servers serve millions of other websites, which can prevent your site from loading quickly. So, it is always recommended to use a premium hosting service or a dedicated server.
    • Implement caching: If resources are cached in the browser, they are not fetched from the server; the browser picks them from the cache instead. It is important to set an expiration time when configuring caching, and caching should be applied only to resources that are not updated frequently.
    • Reducing redirects: Each redirect adds an extra HTTP request-response cycle, so it is advisable not to use too many redirects.
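    Two of the recommendations above can be expressed directly in markup: lazy image loading and non-blocking scripts (file names are placeholders):

    ```html
    <!-- Image downloaded only when it scrolls near the viewport -->
    <img src="product.jpg" width="300" height="300" loading="lazy" alt="Product photo">

    <!-- async: download in parallel, run as soon as ready (independent scripts) -->
    <script async src="analytics.js"></script>

    <!-- defer: download in parallel, run after the DOM is parsed (order preserved) -->
    <script defer src="app.js"></script>
    ```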

    Some tools help us score our website and point out areas that can be improved. These tools consider SEO, user experience, and accessibility while calculating the score, and they report results in some technical terms. Let us understand them briefly:

    1. Time to first byte: It is the time between requesting the page and receiving the first byte of the response. The white screen we see for a moment on page landing is largely TTFB at work.

    2. First contentful paint: It represents when the user sees something on the web page.

    3. First meaningful paint: It tells when the user understands the content, like text/images on the web page.

    4. First CPU idle: It represents the moment when the site has loaded enough information for it to be able to handle the user’s first input.

    5. Largest contentful paint: It represents when the largest content element above the page's fold (visible without scrolling) has rendered.

    6. Time to interactive: It represents the moment when the web page is fully interactive.

    7. Total blocking time: It is the total amount of time the webpage was blocked.

    8. Cumulative layout shift: It measures how much visible elements unexpectedly shift position while the page is being rendered.

    Below are some popular tools we can use for performance analysis:

    1. PageSpeed Insights: This assessment tool provides a score and opportunities to improve.

    2. WebPageTest: This monitoring tool lets you analyze each resource's loading time.

    3. GTmetrix: This is also an assessment tool, like Lighthouse, that gives some additional information, and we can set the test location as well.

    Conclusion:

    We have seen what SEO is, how it works, and how we can improve it by going through sitemap, meta tags, heading tags, robots.txt, breadcrumb, user experience, and finally the page load speed. For a business-to-consumer application, SEO is highly important. It lets you drive more traffic to your website. Hopefully, this basic guide will help you improve SEO for your existing and future websites.

    Related Articles

    1. Eliminate Render-blocking Resources using React and Webpack

    2. Building High-performance Apps: A Checklist To Get It Right

    3. Building a Progressive Web Application in React [With Live Code Examples]

  • Elasticsearch 101: Fundamentals & Core Components

    Elasticsearch is currently the most popular way to implement free text search and analytics in applications. It is highly scalable and can easily manage petabytes of data. It supports a variety of use cases, such as letting users easily search through any portal, collecting and analyzing log data, and building business intelligence dashboards to quickly analyze and visualize data.

    This blog acts as an introduction to Elasticsearch and covers the basic concepts of clusters, nodes, indices, documents, and shards.

    What is Elasticsearch?

    Elasticsearch (ES) combines an open-source, distributed, highly scalable data store with Lucene, a search engine that supports extremely fast full-text search. It is a beautifully crafted piece of software that hides the internal complexities and provides full-text search capabilities through simple REST APIs. Elasticsearch is written in Java with Apache Lucene at its core. It should be clear that Elasticsearch is not like a traditional RDBMS. It is not suitable for your transactional database needs, and hence, in my opinion, it should not be your primary data store. It is common practice to use a relational database as the primary data store and inject only the required data into Elasticsearch.

    Elasticsearch is meant for fast text search. Several characteristics make it different from an RDBMS. Unlike an RDBMS, Elasticsearch stores data in the form of JSON documents, which are denormalized, and it doesn't support transactions, referential integrity, joins, or subqueries.

    Elasticsearch works with structured, semi-structured, and unstructured data as well. In the next section, let’s walk through the various components in Elasticsearch.

    Elasticsearch Components

    Cluster

    One or more servers collectively providing indexing and search capabilities form an Elasticsearch cluster. The cluster size can vary from a single node to thousands of nodes, depending on the use cases.

    Node

    A node is a single physical or virtual machine that holds all or part of your data and provides computing power for indexing and searching it. Every node is identified by a unique name. If a node name is not specified, a random UUID is assigned as the node identifier at startup. Every node configuration has the property `cluster.name`; all nodes sharing the same `cluster.name` automatically form a cluster at startup.
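    In elasticsearch.yml, the relevant settings look like this (the names are examples):

    ```yaml
    # elasticsearch.yml
    cluster.name: my-cluster   # nodes sharing this name form one cluster
    node.name: node-1          # omit to get a random UUID-based identifier
    ```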

    A node has to accomplish several duties such as:

    • storing the data
    • performing operations on data (indexing, searching, aggregation, etc.)
    • maintaining the health of the cluster

    Each node in a cluster can do all these operations. Elasticsearch provides the capability to split responsibilities across different nodes. This makes it easy to scale, optimize, and maintain the cluster. Based on the responsibilities, the following are the different types of nodes that are supported:

    Data Node

    A data node is a node that has storage and computation capability. Data nodes store part of the data in the form of shards (explained in a later section). Data nodes also participate in CRUD, search, and aggregate operations. These operations are resource-intensive, and hence, it is good practice to have dedicated data nodes without the additional load of cluster administration. By default, every node of the cluster is a data node.

    Master Node

    Master nodes are reserved to perform administrative tasks. Master nodes track the availability/failure of the data nodes. The master nodes are responsible for creating and deleting the indices (explained in the later section).

    This makes the master node a critical part of the Elasticsearch cluster. It has to be stable and healthy. A single master node for a cluster is certainly a single point of failure. Elasticsearch provides the capability to have multiple master-eligible nodes. All the master eligible nodes participate in an election to elect a master node. It is recommended to have a minimum of three nodes in the cluster to avoid a split-brain situation. By default, all the nodes are both data nodes as well as master nodes. However, some nodes can be master-eligible nodes only through explicit configuration.

    Coordinating-Only Node

    Any node that is not a master node or a data node is a coordinating node. Coordinating nodes act as smart load balancers. They are exposed to end-user requests and appropriately route requests between data nodes and master nodes.

    To take an example, a user's search request is sent to different data nodes. Each data node searches locally and sends the result back to the coordinating node, which aggregates the results and returns them to the user.

    There are a few concepts that are core to Elasticsearch. Understanding these basic concepts will tremendously ease the learning process.

    Index

    An index is a container that stores data, similar to a database in a relational system. An index contains a collection of documents that have similar characteristics or are logically related. To take the example of an e-commerce website, there will be one index for products, one for customers, and so on. Indices are identified by a lowercase name, which is required to perform add, update, and delete operations on the documents.

    Type

    A type is a logical grouping of documents within an index. In the previous example of the product index, we can further group documents into types like electronics, fashion, furniture, etc. Types are defined on the basis of documents having similar properties. It isn't always easy to decide when to use a type over an index. Indices have more overhead, so sometimes it is better to use different types in the same index for better performance. There are a couple of restrictions on types as well. For example, two fields having the same name in different types of documents must be of the same datatype (string, date, etc.).

    Document

    A document is the unit of information indexed by Elasticsearch, represented in JSON format. We can add as many documents as we want to an index. The following snippet shows how to create a document of type mobile in the index store. We will cover more about the individual fields of the document in the Mapping Types section.

    HTTP POST <hostname:port>/store/mobile/
    {
        "name": "Motorola G5",
        "model": "XT3300",
        "release_date": "2016-01-01",
        "features": "16 GB ROM | Expandable Upto 128 GB | 5.2 inch Full HD Display | 12MP Rear Camera | 5MP Front Camera | 3000 mAh Battery | Snapdragon 625 Processor",
        "ram_gb": "3",
        "screen_size_inches": "5.2"
    }

    Mapping Types

    To create different types in an index, we need mapping types (or simply mappings) to be specified during index creation. Mappings can be defined as a list of directives given to Elasticsearch about how the data is supposed to be stored and retrieved. It is important to provide mapping information at the time of index creation based on how we want to retrieve our data later. In the context of relational databases, think of mappings as a table schema.

    Mapping provides information on how to treat each JSON field. For example, the field can be of type date, geolocation, or person name. Mappings also allow specifying which fields will participate in the full-text search, and specify the analyzers used to transform and decorate data before storing into an index. If no mapping is provided, Elasticsearch tries to identify the schema itself, known as Dynamic Mapping. 

    Each mapping type has Meta Fields and Properties. The snippet below shows the mapping of the type mobile.

    {
        "mappings": {
            "mobile": {
                "properties": {
                    "name": { "type": "keyword" },
                    "model": { "type": "keyword" },
                    "release_date": { "type": "date" },
                    "features": { "type": "text" },
                    "ram_gb": { "type": "short" },
                    "screen_size_inches": { "type": "float" }
                }
            }
        }
    }

    Meta Fields

    As the name indicates, meta fields store additional information about the document. Meta fields are mostly meant for internal usage, and it is unlikely that the end user will have to deal with them. Meta field names start with an underscore. There are around ten meta fields in total. We will talk about some of them here:

    _index

    It stores the name of the index the document belongs to. This is used internally to store/search the document within an index.

    _type

    It stores the type of the document. To get better performance, it is often included in search queries.

    _id

    This is the unique id of the document. It is used to access a specific document directly over the HTTP GET API.
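    For example, fetching a document by its _id over the GET API returns the meta fields along with _source (a sketch of the response shape; the id is illustrative):

    ```txt
    HTTP GET <hostname:port>/store/mobile/1

    {
        "_index": "store",
        "_type": "mobile",
        "_id": "1",
        "_source": {
            "name": "Motorola G5",
            ...
        }
    }
    ```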

    _source

    This holds the original JSON document before any analyzers/transformations are applied. It is important to note that Elasticsearch can only query fields that are indexed (i.e., that a mapping is provided for). The _source field is not indexed and hence can't be queried on, but it can be included in the final search result.

    Fields Or Properties

    The list of fields specifies which JSON fields in the document are included in a particular type. In the e-commerce website example, mobile can be a type. It will have fields like operating_system, camera_specification, ram_size, etc.

    Fields also carry the data type information with them. This directs Elasticsearch to treat the specific fields in a particular way of storing/searching data. Data types are similar to what we see in any other programming language. We will talk about a few of them here.

    Simple Data Types

    Text

    This data type is used to store full text, like a product description. These fields participate in full-text search. Fields of this type are analyzed while being stored, which enables searching them by the individual words in them. Such fields are not used in sorting and aggregation queries.

    Keywords

    This type is also used to store text data, but unlike Text, it is not analyzed before being stored. It is suitable for information like a user's mobile number, city, age, etc. These fields are used in filter, aggregation, and sorting queries, for example, to list all users from a particular city and filter them by age.
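    The "users from a particular city, filtered by age" example could be written as follows (the index name, field names, and values are assumptions for illustration):

    ```txt
    HTTP POST <hostname:port>/users/_search
    {
        "query": {
            "bool": {
                "filter": [
                    { "term":  { "city": "Pune" } },
                    { "range": { "age": { "gte": 25, "lte": 35 } } }
                ]
            }
        }
    }
    ```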

    Numeric

    Elasticsearch supports a wide range of numeric types: long, integer, short, byte, double, float.

    There are a few more data types to support dates, booleans (true/false, on/off, 1/0), and IPs (to store IP addresses).

    Special Data Types

    Geo Point

    This data type is used to store geographical location. It accepts latitude and longitude pair. For example, this data type can be used to arrange the user’s photo library by their geographical location or graphically display the locations trending on social media news.

    Geo Shape

    It allows storing arbitrary geometric shapes like rectangle, polygon, etc.

    Completion Suggester

    This data type is used to provide auto-completion feature over a specific field. As the user types certain text, the completion suggester can guide the user to reach particular results.

    Complex Data Type

    Object

    If you know JSON well, this concept won’t be new for you. Elasticsearch also allows storing nested JSON object structure as a document.

    Nested

    The Object data type is of limited use due to its underlying data representation in the Lucene index, which does not support inner JSON objects. ES flattens the original JSON to make it compatible with the Lucene index. As a result, fields of multiple inner objects get merged into one, leading to wrong search results. Most of the time, you should prefer the Nested data type over Object.

    Shards

    Shards are what enable Elasticsearch to scale horizontally. An index can store millions of documents and occupy terabytes of data, which can cause problems with performance, scalability, and maintenance. Let's see how shards help achieve scalability.

    Indices are divided into multiple units called shards (refer to the diagram below). A shard is a full-featured subset of an index. Shards of the same index can reside on the same or different nodes of the cluster. The number of shards decides the degree of parallelism for search and indexing operations and allows the cluster to grow horizontally. The number of shards per index can be specified at the time of index creation; by default, 5 shards are created. However, once the index is created, the number of shards cannot be changed. To change the number of shards, you must reindex the data.
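    For example, the shard (and replica) count is set in the index settings at creation time (the numbers are illustrative):

    ```txt
    HTTP PUT <hostname:port>/store
    {
        "settings": {
            "number_of_shards": 3,
            "number_of_replicas": 1
        }
    }
    ```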

    Replication

    Hardware can fail at any time. To ensure fault tolerance and high availability, ES provides a feature to replicate the data: shards can be replicated. A shard that is copied is called a primary shard. The copy of the primary shard is called a replica shard, or simply a replica. Like the number of shards, the number of replicas can also be specified at the time of index creation. Replication serves two purposes:

    • High Availability – A replica is never created on the same node where its primary shard is present. This ensures that data remains available through the replica shard even if the whole node fails.
    • Performance – Replicas can also contribute to search capabilities. Search queries are executed in parallel across the replicas.

    To summarize, to achieve high availability and performance, the index is split into multiple shards. In a production environment, multiple replicas are created for every index. In a replicated index, only primary shards can serve write requests. However, all the shards (the primary shard as well as the replica shards) can serve read/query requests. The replication factor is defined at the time of index creation and can be changed later if required. Choosing the number of shards is an important exercise, as once defined, it can't be changed. In critical scenarios, changing the number of shards requires creating a new index with the required shards and reindexing the old data.

    Summary

    In this blog, we have covered the basic but important aspects of Elasticsearch. In the following posts, I will talk about how indexing & searching works in detail. Stay tuned!

  • Improving Elasticsearch Indexing in the Rails Model using Searchkick

    Searching has become a prominent feature of any web application, and a relevant search feature requires a robust search engine. The search engine should be capable of performing full-text search, auto-completion, suggestions, spelling corrections, fuzzy search, and analytics.

    Elasticsearch, a distributed, fast, and scalable search and analytic engine, takes care of all these basic search requirements.

    The focus of this post is using a few approaches with Elasticsearch in our Rails application to reduce the latency of web requests. Let's review one of the best ways to improve Elasticsearch indexing in Rails models: moving it to background jobs.

    In a Rails application, Elasticsearch can be integrated with any of the following popular gems:

    We can continue with any of the gems mentioned above, but for this post, we will move forward with the Searchkick gem, which is much more Rails-friendly.

    By default, the Searchkick gem uses model callbacks to sync data to the respective Elasticsearch index. Because this happens inside the callbacks, any request that creates or updates a resource takes additional time to process.

    The image below shows logs from a Rails application, captured for an update request on a user record. We added a print statement just before Elasticsearch syncs in the Rails model, making it easy to spot in the logs where indexing starts. The logs show that the last two queries were executed to index the data in the Elasticsearch index.

    Since the Elasticsearch sync happens while the user record is being updated, the user update request takes additional time to complete the sync.

    Below is the request flow diagram:

    From the request flow diagram, we can see that the end user must wait for steps 3 and 4 to complete. Step 3 fetches the child objects’ details from the database.

    To tackle this problem, we can move the Elasticsearch indexing to background jobs. Usually, a production Rails app has separate app servers, database servers, background job processing servers, and (in this scenario) Elasticsearch servers.

    This is how the request flow looks when we move Elasticsearch indexing:

    Let’s get to coding!

    For demo purposes, we will have a Rails app with models: `User` and `Blogpost`. The stack used here:

    • Rails 5.2
    • Elasticsearch 6.6.7
    • MySQL 5.6
    • Searchkick (gem for writing Elasticsearch queries in Ruby)
    • Sidekiq (gem for background processing)

    This approach does not require any specific version of Rails, Elasticsearch, or MySQL. Moreover, it is database agnostic. You can go through the code in this GitHub repo for reference.
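    Assuming the stack above, the Gemfile additions would look something like this (a sketch; pin versions as appropriate for your app):

```ruby
# Gemfile (excerpt)
gem "searchkick"   # Elasticsearch queries in Ruby
gem "sidekiq"      # background job processing
```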

    Let’s take a look at the user model with Elasticsearch index.

    # == Schema Information
    #
    # Table name: users
    #
    #  id            :bigint           not null, primary key
    #  name          :string(255)
    #  email         :string(255)
    #  mobile_number :string(255)
    #  created_at    :datetime         not null
    #  updated_at    :datetime         not null
    #
    class User < ApplicationRecord
      searchkick

      has_many :blogposts

      def search_data
        {
          name: name,
          email: email,
          total_blogposts: blogposts.count,
          last_published_blogpost_date: last_published_blogpost_date
        }
      end
      ...
    end

    Anytime a user object is inserted, updated, or deleted, Searchkick reindexes the data in the Elasticsearch user index synchronously.

    Searchkick already provides four ways to sync the Elasticsearch index:

    • Inline (default)
    • Asynchronous
    • Queuing
    • Manual

    For more detailed information, refer to the Searchkick documentation. In this post, we are looking into the manual approach to reindexing model data.
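    For reference, here is how each mode is selected on a model via Searchkick’s `callbacks` option (a sketch based on the Searchkick docs; only one of these lines would be active at a time):

```ruby
class User < ApplicationRecord
  searchkick                       # inline (default): sync during the web request
  # searchkick callbacks: :async   # sync via a background job managed by Searchkick
  # searchkick callbacks: :queue   # push ids to a queue and reindex in bulk
  # searchkick callbacks: false    # manual: you trigger reindexing yourself
end
```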

    To manually reindex, the user model will look like:

    class User < ApplicationRecord
      searchkick callbacks: false

      def search_data
        ...
      end
    end

    Now, we need to define a callback that syncs the data to the Elasticsearch index. Typically, this callback would have to be written in every model that has an Elasticsearch index. Instead, we can write a common concern and include it in the required models.

    Here is what our concern will look like:

    module ElasticsearchIndexer
      extend ActiveSupport::Concern

      included do
        after_commit :reindex_model

        def reindex_model
          ElasticsearchWorker.perform_async(self.id, self.class.name)
        end
      end
    end

    In the above ActiveSupport concern, we call a Sidekiq worker named ElasticsearchWorker. After adding this concern, don’t forget to include it in the user model, like so:

    include ElasticsearchIndexer

    Now, let’s see the Elasticsearch Sidekiq worker:

    class ElasticsearchWorker
      include Sidekiq::Worker

      def perform(id, klass)
        klass.constantize.find(id).reindex
      rescue StandardError => e
        # Handle the exception (e.g., log it; the record may have been deleted)
      end
    end

    That’s it, we’ve done it. Cool, huh? Now, whenever a web request creates, updates, or deletes a user, a background job is enqueued instead. The job can be seen in the Sidekiq web UI at localhost:3000/sidekiq.

    Now, there is a small problem with the ElasticsearchIndexer concern: a job is enqueued even when nothing has actually changed. To reproduce this, go to your user edit page, click save without modifying any field, and look at localhost:3000/sidekiq: a job will still be queued.

    We can handle this case by tracking the dirty attributes:

    module ElasticsearchIndexer
      extend ActiveSupport::Concern

      included do
        after_commit :reindex_model

        def reindex_model
          return if self.previous_changes.blank?
          ElasticsearchWorker.perform_async(self.id, self.class.name)
        end
      end
    end

    Furthermore, there are a few more areas of improvement. Suppose you update a field of the user model that is not part of the Elasticsearch index: the Sidekiq job will still be created and will reindex the associated model object. We can refine the concern to enqueue the indexing job only if fields that are part of the Elasticsearch index were updated.

    module ElasticsearchIndexer
      extend ActiveSupport::Concern

      included do
        after_commit :reindex_model

        def reindex_model
          updated_fields = self.previous_changes.keys

          # To get the ES index fields, you can also maintain a constant
          # at the model level or derive them from the search_data method.
          es_index_fields = self.search_data.stringify_keys.keys
          return if (updated_fields & es_index_fields).blank?
          ElasticsearchWorker.perform_async(self.id, self.class.name)
        end
      end
    end
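    To see the guard in isolation, here is a plain-Ruby sketch of the intersection check (the method name `reindex_needed?` is hypothetical, just for illustration):

```ruby
# A job should be enqueued only when at least one changed column
# is part of the Elasticsearch index.
def reindex_needed?(updated_fields, es_index_fields)
  !(updated_fields & es_index_fields).empty?
end

reindex_needed?(["mobile_number"], ["name", "email"])        # => false
reindex_needed?(["email", "updated_at"], ["name", "email"])  # => true
```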

    Conclusion

    Moving the Elasticsearch indexing to background jobs is a great way to boost the performance of a web app by reducing the response time of web requests. That said, implementing this approach for every model would not be ideal. I would recommend it only when the Elasticsearch index data is not needed in real time.

    Since the execution of background jobs depends on how many jobs are queued, changes might take a while to show up in the Elasticsearch index if many jobs are pending. To mitigate this, the Elasticsearch indexing jobs can be pushed to a high-priority queue. Also, this approach works best when the app server and the background job processing server are separate machines.
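    One way to prioritize these jobs, assuming Sidekiq, is to route the worker to its own queue and weight that queue highly (the queue name `elasticsearch` here is an assumption, not from the original code):

```ruby
class ElasticsearchWorker
  include Sidekiq::Worker
  # Route indexing jobs to a dedicated queue; give this queue a high
  # weight in config/sidekiq.yml (under the :queues: key) so it is
  # drained ahead of lower-priority queues.
  sidekiq_options queue: :elasticsearch
end
```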