Category: Type

  • Web Scraping: Introduction, Best Practices & Caveats

    Web scraping is the process of crawling websites and extracting the required data with spiders. This data is processed in a data pipeline and stored in a structured format. Today, web scraping is widely used and has many use cases:

    • Using web scraping, Marketing & Sales companies can fetch lead-related information.
    • Web scraping is useful for Real Estate businesses to get the data of new projects, resale properties, etc.
    • Price comparison portals, like Trivago, extensively use web scraping to get the information of product and price from various e-commerce sites.

    The process of web scraping usually involves spiders, which fetch the HTML documents from relevant websites, extract the needed content based on the business logic, and finally store it in a specific format. This blog is a primer on building highly scalable scrapers. We will cover the following items:

    1. Ways to scrape: We’ll see basic ways to scrape data using techniques and frameworks in Python with some code snippets.
    2. Scraping at scale: Scraping a single page is straightforward, but there are challenges in scraping millions of websites, including managing the spider code, collecting data, and maintaining a data warehouse. We’ll explore such challenges and their solutions to make scraping easy and accurate.
    3. Scraping Guidelines: Scraping data from websites without the owner’s permission can be deemed malicious. Certain guidelines need to be followed to ensure our scrapers are not blacklisted. We’ll look at some of the best practices one should follow while crawling.

    So let’s start scraping. 

    Different Techniques for Scraping

    Here, we will discuss how to scrape a page and the different libraries available in Python.

    Note: Python is the most popular language for scraping.  

    1. Requests – HTTP Library in Python: To scrape a website or a page, first fetch the content of the HTML page into an HTTP response object. The requests library in Python is pretty handy and easy to use; it uses urllib3 under the hood. I like ‘requests’ because it’s easy and keeps the code readable.

    #Example showing how to use the requests library
    import requests
    r = requests.get("https://velotio.com") #Fetch HTML Page

    2. BeautifulSoup: Once you get the webpage, the next step is to extract the data. BeautifulSoup is a powerful Python library that helps you extract the data from the page. It’s easy to use and has a wide range of APIs that’ll help you extract the data. We use the requests library to fetch an HTML page and then use the BeautifulSoup to parse that page. In this example, we can easily fetch the page title and all links on the page. Check out the documentation for all the possible ways in which we can use BeautifulSoup.

    from bs4 import BeautifulSoup
    import requests
    r = requests.get("https://velotio.com") #Fetch HTML Page
    soup = BeautifulSoup(r.text, "html.parser") #Parse HTML Page
    print("Webpage Title: " + soup.title.string)
    print("Fetch All Links:", soup.find_all('a'))

    3. Python Scrapy Framework:

    Scrapy is a Python-based web scraping framework that allows you to create different kinds of spiders to fetch the source code of the target website. Scrapy starts crawling the web pages present on a certain website, and then you can write the extraction logic to get the required data. Scrapy is built on the top of Twisted, a Python-based asynchronous library that performs the requests in an async fashion to boost up the spider performance. Scrapy is faster than BeautifulSoup. Moreover, it is a framework to write scrapers as opposed to BeautifulSoup, which is just a library to parse HTML pages.

    Here is a simple example of how to use Scrapy. Install Scrapy via pip. Scrapy gives a shell after parsing a website:

    $ pip install scrapy #Install Scrapy
    $ scrapy shell https://velotio.com
    In [1]: response.xpath("//a").extract() #Fetch all a hrefs

    Now, let’s write a custom spider to parse a website.

    $ cat > myspider.py <<EOF
    import scrapy

    class BlogSpider(scrapy.Spider):
        name = 'blogspider'
        start_urls = ['https://blog.scrapinghub.com']

        def parse(self, response):
            for title in response.css('h2.entry-title'):
                yield {'title': title.css('a ::text').extract_first()}
    EOF
    $ scrapy runspider myspider.py

    That’s it. Your first custom spider is created. Now, let’s understand the code.

    • name: Name of the spider. In this case, it’s “blogspider”.
    • start_urls: A list of URLs where the spider will begin to crawl from.
    • parse(self, response): This function is called whenever the crawler successfully crawls a URL. The response object used earlier in the Scrapy shell is the same response object that is passed to the parse(..).

    When you run this, Scrapy starts from the start_urls, selects all h2 elements with the entry-title class, and extracts the associated text from them. Alternatively, you can write your extraction logic in the parse method, or create a separate class for extraction and call its object from the parse method.

    You’ve seen how to extract simple items from a website using Scrapy, but this is just the surface. Scrapy provides a lot of powerful features for making scraping easy and efficient. Here is a tutorial for Scrapy and the additional documentation for LinkExtractor by which you can instruct Scrapy to extract links from a web page.

    4. Python lxml.html library:  This is another Python library, much like BeautifulSoup; in fact, Scrapy uses lxml internally. It comes with a set of APIs you can use for data extraction. Why would you use it when Scrapy itself can extract the data? Say you want to iterate over every ‘div’ tag and perform some operation on each tag under it; this library gives you a list of ‘div’ tags, which you can iterate over and then traverse each child tag inside the parent div. Such traversal operations are hard to express with selectors alone. Here is the documentation for this library.
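As a brief sketch of the traversal described above (the HTML fragment and names are made up for illustration), the direct children of a div can be iterated like this:

```python
# Illustrative sketch of traversing a div's children with lxml.html.
# The HTML fragment below is invented for this example.
from lxml import html

fragment = """
<div id="products">
  <p>Widget A</p>
  <p>Widget B</p>
  <span>Last updated daily</span>
</div>
"""

div = html.fromstring(fragment)  # the root of this fragment is the <div> itself
children = [(child.tag, child.text_content().strip()) for child in div]
print(children)
```

For a full subtree traversal, div.iter() walks the element and all of its descendants instead of only the direct children.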

    Challenges while Scraping at Scale

    Let’s look at the challenges and solutions while scraping at large scale, i.e., scraping 100-200 websites regularly:

    1. Data warehousing: Data extraction at a large scale generates vast volumes of information. Fault tolerance, scalability, security, and high availability are must-have features for a data warehouse. If your data warehouse is not stable or accessible, then operations like searching and filtering the data become an overhead. To achieve this, instead of maintaining your own database or infrastructure, you can use Amazon Web Services (AWS): RDS (Relational Database Service) for a structured database and DynamoDB for non-relational data. AWS takes care of backing up the data, automatically takes snapshots of the database, and gives you database error logs as well. This blog explains how to set up infrastructure in the cloud for scraping.

    2. Pattern Changes: Scraping relies heavily on the user interface and its structure, i.e., CSS and XPath. If the target website changes, our scraper may crash completely or return random data that we don’t want. This is a common scenario, and that’s why maintaining scrapers is harder than writing them. To handle this case, we can write test cases for the extraction logic and run them daily, either manually or from CI tools like Jenkins, to track whether the target website has changed.
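A minimal sketch of such a test, assuming the extraction logic is a plain function run against a saved snapshot of the page (the HTML and selector here are examples, not the real target site):

```python
from bs4 import BeautifulSoup

# A saved snapshot of the target page's markup (illustrative only)
SNAPSHOT = """
<html><body>
  <h2 class="entry-title"><a href="/post-1">First post</a></h2>
  <h2 class="entry-title"><a href="/post-2">Second post</a></h2>
</body></html>
"""

def extract_titles(page_html):
    """The extraction logic under test."""
    soup = BeautifulSoup(page_html, "html.parser")
    return [h2.a.get_text() for h2 in soup.find_all("h2", class_="entry-title")]

# Run daily (manually or from a Jenkins job); a failure signals a pattern change
assert extract_titles(SNAPSHOT) == ["First post", "Second post"]
```

In practice, the daily job would fetch a fresh copy of the page and assert on fields that should always be present, so a markup change surfaces as a test failure rather than silent bad data.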

    3. Anti-scraping Technologies: Web scraping is common these days, and every website host would want to prevent their data from being scraped; anti-scraping technologies help them do that. For example, if you hit a particular website from the same IP address at a regular interval, the target website can block your IP. Adding a captcha to a website also helps. There are methods to bypass these anti-scraping measures; e.g., we can use proxy servers to hide our original IP. Several proxy services keep rotating the IP before each request. It is also easy to add proxy support in code, and in Python, the Scrapy framework supports it.
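As an illustration of the rotation idea (the proxy URLs below are placeholders; in Scrapy, a downloader middleware would set request.meta['proxy'] in the same way):

```python
import itertools

# Placeholder proxy endpoints; a real pool would come from a proxy service
PROXY_POOL = itertools.cycle([
    "http://proxy-1.example.com:8080",
    "http://proxy-2.example.com:8080",
    "http://proxy-3.example.com:8080",
])

def attach_proxy(meta):
    """Assign the next proxy in the pool to an outgoing request's meta dict."""
    meta["proxy"] = next(PROXY_POOL)
    return meta

first = attach_proxy({})
second = attach_proxy({})
print(first["proxy"], second["proxy"])  # consecutive requests use different proxies
```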

    4. JavaScript-based dynamic content:  Websites that rely heavily on JavaScript and Ajax to render dynamic content make data extraction difficult. Scrapy and related frameworks/libraries only extract what they find in the HTML document; Ajax calls and JavaScript execute at runtime, so the scraper can’t see that content. This can be handled by rendering the web page in a headless browser such as Headless Chrome, which essentially allows running Chrome in a server environment. You can also use PhantomJS, which provides a headless WebKit-based environment.

    5. Honeypot traps: Some websites place honeypot traps on their webpages to detect web crawlers. These are hard to spot, as the links are usually blended with the background color or have their CSS display property set to none. Setting up honeypots takes significant effort on the server side, and evading them takes effort on the crawler side, which is why this method is not used frequently.

    6. Quality of data: Currently, AI and ML projects are in high demand, and these projects need data at a large scale. Data integrity is also important, as one fault can cause serious problems in AI/ML algorithms. So, in scraping, it is very important not only to scrape the data but to verify its integrity as well. Doing this in real time is not always possible, so I prefer to write test cases for the extraction logic to make sure whatever your spiders extract is correct and that they are not scraping any bad data.

    7. More Data, More Time:  This one is obvious. The larger a website is, the more data it contains, and the longer it takes to scrape. This may be fine if your purpose for scanning the site isn’t time-sensitive, but that often isn’t the case. Stock prices don’t stay the same over hours. Sales listings, currency exchange rates, media trends, and market prices are just a few examples of time-sensitive data. What to do in this case, then? Well, one solution is to design your spiders carefully. If you’re using a framework like Scrapy, apply proper LinkExtractor rules so that the spider doesn’t waste time scraping unrelated URLs.

    You may use multithreading scraping packages available in Python, such as Frontera and Scrapy Redis. Frontera lets you send out only one request per domain at a time, but can hit multiple domains at once, making it great for parallel scraping. Scrapy Redis lets you send out multiple requests to one domain. The right combination of these can result in a very powerful web spider that can handle both the bulk and variation for large websites.

    8. Captchas: Captchas are a good way of keeping crawlers away from a website, and many website hosts use them. To scrape data from such websites, we need a mechanism to solve the captchas. There are packages and software that can solve captchas and act as middleware between the target website and your spider. You can also use libraries like Pillow and Tesseract in Python to solve simple image-based captchas.

    9. Maintaining Deployment: Normally, we don’t want to limit ourselves to scraping just a few websites. We want the maximum amount of data present on the Internet, and that may mean scraping millions of websites. You can imagine the size of the code and the deployment. We can’t run spiders at this scale from a single machine. What I prefer here is to dockerize the scrapers and take advantage of technologies like AWS ECS or Kubernetes to run the scraper containers. This keeps our scrapers highly available and easy to maintain, and we can schedule them to run at regular intervals.

    Scraping Guidelines/ Best Practices

    1. Respect the robots.txt file:  Robots.txt is a text file that webmasters create to instruct robots how to crawl and index pages on the website, and it generally contains instructions for crawlers. You can usually find it at the root of the website (e.g., https://example.com/robots.txt). Before even planning the extraction logic, you should check this file; it holds the rules for how crawlers should interact with the website. For example, if a website has a link to download critical information, the owners probably don’t want to expose that to crawlers. Another important setting is the crawl-delay interval, which means crawlers should only hit the website at the specified interval. If someone has asked not to crawl their website, then we had better not do it, because if they catch your crawlers, it can lead to serious legal issues.
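Python’s standard library can parse these rules for you. A small sketch, with made-up rules:

```python
from urllib import robotparser

# Hypothetical robots.txt content for illustration
RULES = """
User-agent: *
Crawl-delay: 10
Disallow: /private/
"""

rp = robotparser.RobotFileParser()
rp.parse(RULES.splitlines())

print(rp.can_fetch("my-crawler", "https://example.com/private/page"))  # False
print(rp.crawl_delay("my-crawler"))  # 10
```

In production, you would call rp.set_url("https://example.com/robots.txt") and rp.read() to fetch the live file, then consult can_fetch and crawl_delay before every request.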

    2. Do not hit the servers too frequently:  As mentioned above, some websites specify a crawl-delay interval for crawlers. We had better use it wisely, because not every website is tested against high load. If you keep hitting the server without a sensible delay, you create heavy traffic on the server side, and it may crash or fail to serve other requests. This badly impacts the user experience, and users are more important than bots. So, make requests according to the interval specified in robots.txt, or use a standard delay of 10 seconds. This also helps you avoid getting blocked by the target website.

    3. User Agent Rotation and Spoofing: Every request carries a User-Agent string in its headers. This string identifies the browser you are using, its version, and the platform. If we send the same User-Agent with every request, it’s easy for the target website to tell that the requests are coming from a crawler. To avoid this, rotate the User-Agent between requests. You can easily find examples of genuine User-Agent strings on the Internet; try them out. If you’re using Scrapy, you can set the USER_AGENT property in settings.py.
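A minimal sketch of the rotation (the User-Agent strings below are examples of genuine browser UAs, not an authoritative list):

```python
import random

# Example User-Agent strings; in practice, collect a larger, current list
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
]

def pick_headers():
    """Build request headers with a randomly chosen User-Agent."""
    return {"User-Agent": random.choice(USER_AGENTS)}

headers = pick_headers()
```

These headers can then be passed to each request, e.g. requests.get(url, headers=pick_headers()).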

    4. Disguise your requests by rotating IPs and Proxy Services: We’ve discussed this in the challenges above. It’s always better to use rotating IPs and a proxy service so that your spider won’t get blocked.

    5. Do not follow the same crawling pattern: As you know, many websites use anti-scraping technologies, so it’s easy for them to detect your spider if it crawls in the same pattern every time. Humans don’t normally follow a fixed pattern on a particular website. So, to keep your spiders running smoothly, we can introduce actions like mouse movements or clicking a random link, which give the impression that your spider is human.

    6. Scrape during off-peak hours: Off-peak hours are suitable for bots/crawlers as the traffic on the website is considerably less. These hours can be identified by the geolocation from where the site’s traffic originates. This also helps to improve the crawling rate and avoid the extra load from spider requests. Thus, it is advisable to schedule the crawlers to run in the off-peak hours.

    7. Use the scraped data responsibly: We should always take responsibility for the scraped data. Scraping data and then republishing it somewhere else is not acceptable; it can be considered a breach of copyright laws and may lead to legal issues. So, it is advisable to check the target website’s Terms of Service page before scraping.

    8. Use Canonical URLs: When we scrape, we tend to scrape duplicate URLs, and hence duplicate data, which is the last thing we want. It can happen within a single website that multiple URLs serve the same data. In this situation, the duplicate URLs carry a canonical URL, which points to the parent or original URL. By using it, we make sure we don’t scrape duplicate content. In frameworks like Scrapy, duplicate URLs are handled by default.

    9. Be transparent: Don’t misrepresent your purpose or use deceptive methods to gain access. If you have a login and a password that identifies you to gain access to a source, use it.  Don’t hide who you are. If possible, share your credentials.

    Conclusion

    We’ve seen the basics of scraping, frameworks, how to crawl, and the best practices of scraping. To conclude:

    • Follow the target websites’ rules while scraping, and don’t give them a reason to block your spider.
    • Maintenance of data and spiders at scale is difficult. Use Docker/ Kubernetes and public cloud providers, like AWS to easily scale your web-scraping backend.
    • Always respect the rules of the websites you plan to crawl. If APIs are available, always use them first.
  • Building a Collaborative Editor Using Quill and Yjs

    “Hope this email finds you well” is how 2020-2021 has been in a nutshell. Since we’ve all been working remotely since last year, actively collaborating with teammates became one notch harder, from activities like brainstorming a topic on a whiteboard to building documentation.

    Tools powered by collaborative systems have become a necessity, and to explore the space following the “build fast, fail fast” principle, I started building a collaborative editor from existing, available open-source tools, which can eventually be extended for needs across different projects.

    Conflicts, as they say, are inevitable when multiple users are constantly modifying the same document, especially the same block of content. Ultimately, the end-user experience is defined by how such conflicts are resolved.

    There are various conflict resolution mechanisms, but two of the most commonly discussed ones are Operational Transformation (OT) and Conflict-Free Replicated Data Type (CRDT). So, let’s briefly talk about those first.

    Operational Transformation

    The order of operations matters in OT, as each user has their own local copy of the document and mutations are atomic, such as insert V at index 4 or delete X at index 2. If the order of these operations changes, the end result will be different. That’s why all operations are synchronized through a central server, which can alter indices and operations before forwarding them to the clients. For example, in the image below, User2 makes a delete(0) operation, but since the OT server sees that User1 has made an insert operation, User2’s operation needs to be changed to delete(1) before it is applied to User1’s copy.

    OT with a central server is typically easier to implement. Plain-text OT in its basic form has only three defined operations: insert, delete, and apply.

    Source: Conclave

    “Fully distributed OT and adding rich text operations are very hard, and that’s why there’s a million papers.”
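The index adjustment from the delete(0)/delete(1) example above can be sketched in a few lines (a toy transform for one case, not a full OT implementation):

```javascript
// Toy transform: shift a delete index when a concurrent insert
// happened at or before that index (plain-text OT sketch).
function transformDelete(deleteIndex, insertIndex) {
  return insertIndex <= deleteIndex ? deleteIndex + 1 : deleteIndex;
}

// User2's delete(0) arrives after User1's insert at index 0,
// so the server rewrites it as delete(1) before applying it.
console.log(transformDelete(0, 0)); // 1
console.log(transformDelete(2, 5)); // 2 (the insert was after the delete point)
```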

    CRDT

    Instead of performing operations directly on characters like in OT, CRDT uses a complex data structure to which it can then add/update/remove properties to signify transformation, enabling scope for commutativity and idempotency. CRDTs guarantee eventual consistency.

    There are different algorithms, but in general, CRDTs have two requirements: globally unique characters and globally ordered characters. Basically, this involves a global reference for each object instead of positional indices, where the ordering is based on the neighboring objects. Fractional indices can be used to assign an index to each object.

    Source: Conclave

    As every object has its own unique reference, the delete operation becomes idempotent, and fractional indices are one way to assign unique references during insertion and updates.
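A toy sketch of fractional indices (illustrative only; real CRDTs such as Yjs use richer identifiers):

```javascript
// Each character carries a fractional position; inserting between two
// characters takes the midpoint, so existing positions never change.
const chars = [
  { pos: 1.0, ch: "H" },
  { pos: 2.0, ch: "i" },
];

function insertBetween(doc, leftPos, rightPos, ch) {
  const pos = (leftPos + rightPos) / 2; // e.g., 1.5 lands between 1.0 and 2.0
  doc.push({ pos, ch });
  doc.sort((a, b) => a.pos - b.pos);
  return pos;
}

insertBetween(chars, 1.0, 2.0, "e");
console.log(chars.map((c) => c.ch).join("")); // "Hei"
```

Because existing positions are untouched by an insert, concurrent inserts from different replicas can be merged by position without re-indexing the whole document.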

    There are two types of CRDTs: state-based, where the whole state (or a delta of it) is shared between instances and merged continuously, and operation-based, where only individual operations are sent between replicas. If you want to dive deep into CRDTs, here’s a nice resource.

    For our purposes, we chose CRDT since it can also support peer-to-peer networks. If you want to jump directly to the code, you can visit the repo here.

    Tools used for this project:

    As our goal was for a quick implementation, we targeted off-the-shelf tools for editor and backend to manage collaborative operations.

    • Quill.js is an API-driven WYSIWYG rich text editor built for compatibility and extensibility. We chose Quill as our editor because of how easy it is to plug into an application and the availability of extensions.
    • Yjs is a framework that provides shared editing capabilities by exposing its different shared data types (Array, Map, Text, etc) that are synced automatically. It’s also network agnostic, so the changes are synced when a client is online. We used it because it’s a CRDT implementation, and surprisingly had readily available bindings for quill.js.

    Prerequisites:

    To keep it simple, we’ll set up a client and server both in the same code base. Initialize a project with npm init and install the below dependencies:

    npm i quill quill-cursors webpack webpack-cli webpack-dev-server y-quill y-websocket yjs

    • Quill: Quill is the WYSIWYG rich text editor we will use as our editor.
    • quill-cursors is an extension that helps us to display cursors of other connected clients to the same editor room.
    • Webpack, webpack-cli, and webpack-dev-server are developer utilities, webpack being the bundler that creates a deployable bundle for your application.
    • The Y-quill module provides bindings between Yjs and QuillJS with use of the SharedType y.Text. For more information, you can check out the module’s source on Github.
    • Y-websocket provides a WebsocketProvider to communicate with Yjs server in a client-server manner to exchange awareness information and data.
    • Yjs, this is the CRDT framework which orchestrates conflict resolution between multiple clients. 

    Code to use

    const path = require('path');
    
    module.exports = {
      mode: 'development',
      devtool: 'source-map',
      entry: {
        index: './index.js'
      },
      output: {
        globalObject: 'self',
        path: path.resolve(__dirname, './dist/'),
        filename: '[name].bundle.js',
        publicPath: '/quill/dist'
      },
      devServer: {
        contentBase: path.join(__dirname),
        compress: true,
        publicPath: '/dist/'
      }
    }

    This is a basic webpack config where we specify the entry point of our frontend project, i.e., the index.js file. Webpack uses that file to build the internal dependency graph of the project. The output property defines where and how the generated bundles should be saved, and the devServer config defines the necessary parameters for the local dev server, which runs when you execute “npm start”.

    We’ll first create an index.html file to define the basic skeleton:

    <!DOCTYPE html>
    <html>
      <head>
        <title>Yjs Quill Example</title>
        <script src="./dist/index.bundle.js" async defer></script>
        <link rel=stylesheet href="//cdn.quilljs.com/1.3.6/quill.snow.css" async defer>
      </head>
      <body>
        <button type="button" id="connect-btn">Disconnect</button>
        <div id="editor" style="height: 500px;"></div>
      </body>
    </html>

    The index.html has a pretty basic structure. In <head>, we’ve provided the path of the bundled js file that will be created by webpack, and the css theme for the quill editor. And for the <body> part, we’ve just created a button to connect/disconnect from the backend and a placeholder div where the quill editor will be plugged.

    • Here, we’ve just made the imports, registered quill-cursors extension, and added an event listener for window load:
    import Quill from "quill";
    import * as Y from 'yjs';
    import { QuillBinding } from 'y-quill';
    import { WebsocketProvider } from 'y-websocket';
    import QuillCursors from "quill-cursors";
    
    // Register QuillCursors module to add the ability to show multiple cursors on the editor.
    Quill.register('modules/cursors', QuillCursors);
    
    window.addEventListener('load', () => {
      // We'll add more blocks as we continue
    });

    • Let’s initialize the Yjs document, socket provider, and load the document:
    window.addEventListener('load', () => {
      const ydoc = new Y.Doc();
      const provider = new WebsocketProvider('ws://localhost:3312', 'velotio-demo', ydoc);
      const type = ydoc.getText('Velotio-Blog');
    });

    • We’ll now initialize and plug the Quill editor with its bindings:
    window.addEventListener('load', () => {
      // ### ABOVE CODE HERE ###
    
      const editorContainer = document.getElementById('editor');
      const toolbarOptions = [
        ['bold', 'italic', 'underline', 'strike'],  // toggled buttons
        ['blockquote', 'code-block'],
        [{ 'header': 1 }, { 'header': 2 }],               // custom button values
        [{ 'list': 'ordered' }, { 'list': 'bullet' }],
        [{ 'script': 'sub' }, { 'script': 'super' }],      // superscript/subscript
        [{ 'indent': '-1' }, { 'indent': '+1' }],          // outdent/indent
        [{ 'direction': 'rtl' }],                         // text direction
        // array for drop-downs, empty array = defaults
        [{ 'size': [] }],
        [{ 'header': [1, 2, 3, 4, 5, 6, false] }],
        [{ 'color': [] }, { 'background': [] }],          // dropdown with defaults from theme
        [{ 'font': [] }],
        [{ 'align': [] }],
        ['image', 'video'],
        ['clean']                                         // remove formatting button
      ];
    
      const editor = new Quill(editorContainer, {
        modules: {
          cursors: true,
          toolbar: toolbarOptions,
          history: {
            userOnly: true  // only user changes will be undone or redone.
          }
        },
        placeholder: "collab-edit-test",
        theme: "snow"
      });
    
      const binding = new QuillBinding(type, editor, provider.awareness);
    });

    • Finally, let’s implement the Connect/Disconnect button and complete the callback:
    window.addEventListener('load', () => {
      // ### ABOVE CODE HERE ###
    
      const connectBtn = document.getElementById('connect-btn');
      connectBtn.addEventListener('click', () => {
        if (provider.shouldConnect) {
          provider.disconnect();
          connectBtn.textContent = 'Connect';
        } else {
          provider.connect();
          connectBtn.textContent = 'Disconnect';
        }
      });
    
      window.example = { provider, ydoc, type, binding, Y }
    });

    Steps to run:

    • Server:

    For simplicity, we’ll directly use the y-websocket-server out of the box.
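Assuming the server bundled with the y-websocket package, it can be started so that the port matches the WebsocketProvider URL used on the client (ws://localhost:3312):

```shell
HOST=localhost PORT=3312 npx y-websocket
```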

    NOTE: You can either let it run and open a new terminal for the next commands, or let it run in the background using `&` at the end of the command.

    • Client:

    Start the client with npm start. On successful compilation, it should open in your default browser, or you can just go to http://localhost:8080.

    Show me the repo

    You can find the repository here.

    Conclusion:

    Conflict resolution approaches are not new, but with the trend toward remote work culture, it is important to have good collaborative systems in place to enhance productivity.

    Although this example was just on rich text editing capabilities, we can extend existing resources to build more features and structures like tabular data, graphs, charts, etc. Yjs shared types can be used to define your own data format based on how your custom editor represents data internally.

  • Acquiring Temporary AWS Credentials with Browser Navigated Authentication

    In one of my previous blog posts (Hacking your way around AWS IAM Roles), we demonstrated how users can access AWS resources without having to store AWS credentials on disk. This was achieved by setting up an OpenVPN server and a client-side route that gets pushed automatically when the user connects to the VPN. To this date, I find this a compliance-friendly solution that doesn’t force users to do any manual configuration on their systems. It also makes sense to have access to AWS resources only as long as they are connected to the VPN. One downside of this method is maintaining the OpenVPN server, keeping it secure, and running it in a highly available (HA) state. If the OpenVPN server is compromised, our credentials are at stake. Secondly, all users connected to the VPN get the same level of access.

    In this blog post, we present a CLI utility written in Rust that writes temporary AWS credentials to a user profile (the ~/.aws/credentials file) using web-browser-navigated Google authentication. This utility is inspired by gimme-aws-creds (written in Python for an Okta-authenticated AWS farm) and the Heroku CLI (written in Node.js, using the oclif framework). We will refer to our utility as auth-awscreds throughout this post.

    “If you have an apple and I have an apple and we exchange these apples then you and I will still each have one apple. But if you have an idea and I have an idea and we exchange these ideas, then each of us will have two ideas.”

    – George Bernard Shaw

    What does this CLI utility (auth-awscreds) do?

    When the user fires the auth-awscreds command in the terminal, our program reads the utility configuration from the .auth-awscreds file located in the user’s home directory. If this file is not present, the utility prompts for the configuration on first run. The configuration file is in INI format. The program then opens the default web browser and navigates to the URL read from the configuration file. At this point, the utility waits for the browser to navigate and authorize. The web UI then navigates to Google authentication. If authentication is successful, a callback is shared with the CLI utility along with temporary AWS credentials, which are then written to the ~/.aws/credentials file.
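Based on the description above, the ~/.auth-awscreds file might look like the following (the key names are assumptions for illustration; only the profile name and app domain are stored):

```ini
; Hypothetical example of ~/.auth-awscreds (INI format)
[default]
profile = default
app_domain = https://auth.example.com
```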

    Block Diagram

    Tech Stack Used

    As stated earlier, we wrote this utility in Rust. One of the reasons for choosing Rust is that we wanted a statically linked binary (ELF) file that executes independently of an interpreter and ships as-is once compiled. Programs written in Python or Node.js, by contrast, need a language interpreter and supporting libraries installed. Go would also have sufficed for our purpose, but I prefer Rust over Go.

    Software Stack:

    • Rust (for CLI utility)
    • Actix Web – HTTP Server
    • Node.js, Express, ReactJS, serverless-http, aws-sdk, AWS Amplify, axios
    • Terraform and serverless framework

    Infrastructure Stack:

    • AWS Cognito (User Pool and Federated Identities)
    • AWS API Gateway (HTTP API)
    • AWS Lambda
    • AWS S3 Bucket (React App)
    • AWS CloudFront (For Serving React App)
    • AWS ACM (SSL Certificate)

    Recipe

    Architecture Diagram

    CLI Utility: auth-awscreds

    Our goal is that when the auth-awscreds command is fired, we first check whether the ~/.aws/credentials file exists in the user’s home directory. If not, we create the ~/.aws directory. This is the default AWS credentials directory, where the AWS SDK usually looks for credentials (unless explicitly overridden by the env var AWS_SHARED_CREDENTIALS_FILE). The next step is to check whether a ~/.auth-awscreds file exists. If it doesn’t, we prompt the user for two inputs:

    1. AWS credentials profile name (used by SDK, default is preferred) 

    2. Application domain URL (Our backend app domain is used for authentication)

    let app_profile_file = format!("{}/.auth-awscreds", &user_home_dir);

    let config_exist: bool = Path::new(&app_profile_file).exists();

    let mut profile_name = String::new();
    let mut app_domain = String::new();

    if !config_exist {
        // ask the series of questions
        print!("Which profile to write AWS Credentials [default] : ");
        io::stdout().flush().unwrap();
        io::stdin()
            .read_line(&mut profile_name)
            .expect("Failed to read line");

        print!("App Domain : ");
        io::stdout().flush().unwrap();

        io::stdin()
            .read_line(&mut app_domain)
            .expect("Failed to read line");

        profile_name = String::from(profile_name.trim());
        app_domain = String::from(app_domain.trim());

        config_profile(&profile_name, &app_domain);
    } else {
        (profile_name, app_domain) = read_profile();
    }

    These two properties are written to ~/.auth-awscreds under the default section. Following this, our utility generates a 1024-bit RSA asymmetric key pair. Both keys are then base64-encoded.

    pub fn genkeypairs() -> (String,String) {
       let rsa = Rsa::generate(1024).unwrap();
     
       let private_key: Vec<u8> = rsa.private_key_to_pem_passphrase(Cipher::aes_128_cbc(),"Sagar Barai".as_bytes()).unwrap();
       let public_key: Vec<u8> = rsa.public_key_to_pem().unwrap();
     
       (base64::encode(private_key) , base64::encode(public_key))
    }

    We then launch a browser window and navigate to the specified app domain URL. At this stage, our utility starts a temporary web server with the help of the Actix Web framework and listens on localhost port 63442.

    println!("Opening web ui for authentication...!");
    open::that(&app_domain).unwrap();

    HttpServer::new(move || {
        //let stopper = tx.clone();
        let cors = Cors::permissive();
        App::new()
            .wrap(cors)
            //.app_data(stopper)
            .app_data(crypto_data.clone())
            .service(get_public_key)
            .service(set_aws_creds)
    })
    .bind(("127.0.0.1", 63442))?
    .run()
    .await

    The localhost web server has two endpoints.

    1. GET Endpoint (/publickey): This endpoint is called by our React app after authentication and returns the public key created during the initialization process. Since the web server hosted by the Rust application is not secured by SSL, the actual AWS credentials must be posted as a string encrypted with this public key.

    #[get("/publickey")]
    pub async fn get_public_key(data: web::Data<AppData>) -> impl Responder {
       let public_key = &data.public_key;
      
       web::Json(HTTPResponseData{
           status: 200,
           msg: String::from("Ok"),
           success: true,
           data: String::from(public_key)
       })
    }

    2. POST Endpoint (/setcreds): This endpoint is called when the React app has successfully retrieved credentials from API Gateway. The credentials are decrypted with the private key and then written to the ~/.aws/credentials file under the profile name defined in the utility configuration. 

    let encrypted_data = payload["data"].as_array().unwrap();
    let username = payload["username"].as_str().unwrap();

    let mut decypted_payload = vec![];

    for str in encrypted_data.iter() {
        //println!("{}",str.to_string());
        let s = str.as_str().unwrap();
        let decrypted = decrypt_data(&private_key, &s.to_string());
        decypted_payload.extend_from_slice(&decrypted);
    }

    let credentials: serde_json::Value =
        serde_json::from_str(&String::from_utf8(decypted_payload).unwrap()).unwrap();

    let aws_creds = AWSCreds {
        profile_name: String::from(profile_name),
        aws_access_key_id: String::from(credentials["AccessKeyId"].as_str().unwrap()),
        aws_secret_access_key: String::from(credentials["SecretAccessKey"].as_str().unwrap()),
        aws_session_token: String::from(credentials["SessionToken"].as_str().unwrap()),
    };

    println!("Authenticated as {}", username);
    println!("Updating AWS Credentials File...!");

    configcreds(&aws_creds);

    One of the interesting parts of this code is the decryption process, which iterates through an array of strings and joins the decrypted pieces via decypted_payload.extend_from_slice(&decrypted);. RSA-1024 produces 128-byte ciphertext blocks, and we used OAEP padding, which consumes 42 bytes, leaving at most 86 bytes of plaintext per block. So, when the credentials are received, they are an array of base64-encoded, 128-byte-long blocks. One has to decode each base64 string to a data buffer and then decrypt the data piece by piece.
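    To make the chunking arithmetic concrete, here is a small, self-contained Rust sketch (not taken from the utility's source) showing how a credentials payload maps onto 128-byte RSA-OAEP blocks:

    ```rust
    // RSA-1024 produces 128-byte ciphertext blocks; OAEP padding
    // consumes 42 bytes, leaving at most 86 bytes of plaintext per block.
    const RSA_BLOCK: usize = 128;
    const OAEP_OVERHEAD: usize = 42;
    const MAX_CHUNK: usize = RSA_BLOCK - OAEP_OVERHEAD; // 86 bytes

    // Number of encrypted blocks needed for a payload of `len` bytes.
    fn chunk_count(len: usize) -> usize {
        (len + MAX_CHUNK - 1) / MAX_CHUNK
    }

    fn main() {
        // A typical temporary-credentials JSON is a few hundred bytes.
        let payload = vec![b'x'; 300];
        let chunks: Vec<&[u8]> = payload.chunks(MAX_CHUNK).collect();
        assert_eq!(chunks.len(), chunk_count(payload.len()));
        println!(
            "{} bytes -> {} blocks of {} bytes each on the wire",
            payload.len(),
            chunks.len(),
            RSA_BLOCK
        );
    }
    ```

    Each 128-byte ciphertext block is base64-encoded individually, which is why the receiver must decode and decrypt the array piece by piece rather than as one string.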

    To generate the statically linked release binary, run: cargo build --release

    AWS Cognito and Google Authentication

    This guide does not cover how to set up Cognito and integration with Google Authentication. You can refer to our old post for a detailed guide on setting up authentication and authorization. (Refer to the sections Setup Authentication and Setup Authorization).

    React App:

    The React app is launched via our Rust CLI utility and is served right from the S3 bucket via CloudFront. When our React app loads, it checks if the current session is authenticated. If not, then with the help of the AWS Amplify framework, the app is redirected to the Cognito hosted UI for authentication, which in turn redirects to the Google login page.

    render(){
       return (
         <div className="centerdiv">
           {
             this.state.appInitialised ?
               this.state.user === null ? Auth.federatedSignIn({provider: 'Google'}) :
               <Aux>
                 {this.state.pageContent}
               </Aux>
             :
             <Loader/>
           }
         </div>
       )
     }

    Once the session is authenticated, we set the React state variables and then retrieve the public key from the Actix Web server (Rust CLI app: auth-awscreds) by calling the /publickey GET method. Following this, an Ajax POST request (/auth-creds) is made via the axios library to API Gateway. The payload contains the public key and a JWT token for authentication. The expected response from API Gateway is the encrypted AWS temporary credentials, which are then proxied to our CLI application.

    To ease this deployment, we have written Terraform code (available in the repository) that takes care of creating the S3 bucket, CloudFront distribution, ACM certificate, and React build, and deploying it to the S3 bucket. Navigate to the vars.tf file and change the respective default variables. The Terraform script will fail on first launch since the ACM certificate needs DNS record validation. You can create a CNAME record for DNS validation and re-run the Terraform script to continue the deployment. The React app expects a few environment variables. Below is a sample .env file; update the respective values for your environment.

    REACT_APP_IDENTITY_POOL_ID=
    REACT_APP_COGNITO_REGION=
    REACT_APP_COGNITO_USER_POOL_ID=
    REACT_APP_COGNTIO_DOMAIN_NAME=
    REACT_APP_DOMAIN_NAME=
    REACT_APP_CLIENT_ID=
    REACT_APP_CLI_APP_URL=
    REACT_APP_API_APP_URL=

    Finally, deploy the React app using the sample commands below.

    $ terraform plan -out plan     #creates plan for revision
    $ terraform apply plan         #apply plan and deploy

    API Gateway HTTP API and Lambda Function

    When a request is first intercepted by API Gateway, it validates the JWT token on its own, since API Gateway natively supports Cognito integration. Thus, any payload with an invalid authorization header is rejected at API Gateway itself, which eases our authentication process and validates the identity. If the request is valid, it is then received by our Lambda function. Our Lambda function is written in Node.js and wrapped by the serverless-http framework around an Express app. The Express app has only one endpoint.

    /auth-creds (POST): Once the request is received, it retrieves the identity ID from Cognito and logs it to stdout for audit purposes.

    let identityParams = {
        IdentityPoolId: process.env.IDENTITY_POOL_ID,
        Logins: {}
    };

    identityParams.Logins[`${process.env.COGNITOIDP}`] = req.headers.authorization;

    const ci = new CognitoIdentity({ region: process.env.AWSREGION });

    let idpResponse = await ci.getId(identityParams).promise();

    console.log("Auth Creds Request Received from ", JSON.stringify(idpResponse));

    The app then extracts the base64-encoded public key. Following this, an STS (Security Token Service) API call is made and temporary credentials are derived. These credentials are then encrypted with the public key in chunks of 86 bytes.

    const pemPublicKey = Buffer.from(public_key, 'base64').toString();

    const authdata = await sts.assumeRole({
        ExternalId: process.env.STS_EXTERNAL_ID,
        RoleArn: process.env.IAM_ROLE_ARN,
        RoleSessionName: "DemoAWSAuthSession"
    }).promise();

    const creds = JSON.stringify(authdata.Credentials);
    const splitData = creds.match(/.{1,86}/g);

    const encryptedData = splitData.map(d => {
        return publicEncrypt(pemPublicKey, Buffer.from(d)).toString('base64');
    });

    Here, assumeRole assumes the IAM role, which has the appropriate policy documents attached. For the sake of this demo, we attached the AdministratorAccess policy. However, one should consider hardening the policy document and avoid attaching the Administrator policy directly to the role.

    resources:
     Resources:
       AuthCredsAssumeRole:
         Type: AWS::IAM::Role
         Properties:
           AssumeRolePolicyDocument:
             Version: "2012-10-17"
             Statement:
               -
                 Effect: Allow
                 Principal:
                   AWS: !GetAtt IamRoleLambdaExecution.Arn
                 Action: sts:AssumeRole
                 Condition:
                   StringEquals:
                     sts:ExternalId: ${env:STS_EXTERNAL_ID}
           RoleName: auth-awscreds-api
           ManagedPolicyArns:
             - arn:aws:iam::aws:policy/AdministratorAccess

    Finally, the response is sent to the React app. 

    We have used the Serverless framework to deploy the API. The Serverless framework creates the API Gateway, Lambda function, Lambda layer, and IAM role, and takes care of code deployment to the Lambda function.

    To deploy this application, follow the below steps.

    1. cd layer/nodejs && npm install && cd ../.. && npm install

    2. npm install -g serverless (on Mac, you can skip this step and use the npx serverless command instead) 

    3. Create a .env file, add the below environment variables to it, and set the respective values.

    AWSREGION=ap-south-1
    COGNITO_USER_POOL_ID=
    IDENTITY_POOL_ID=
    COGNITOIDP=
    APP_CLIENT_ID=
    STS_EXTERNAL_ID=
    IAM_ROLE_ARN=
    DEPLOYMENT_BUCKET=
    APP_DOMAIN=

    4. serverless deploy or npx serverless deploy

    The entire codebase for the CLI app, React app, and backend API is available in the GitHub repository.

    Testing:

    Assuming that you have the compiled binary (auth-awscreds) available on your local machine and, for the sake of testing, have installed `aws-cli`, you can then run /path/to/your/auth-awscreds. 

    App Testing

    If you selected your AWS profile name as “demo-awscreds,” you can then export the AWS_PROFILE environment variable. If you prefer a “default” profile, you don’t need to export the environment variable as AWS SDK selects a “default” profile on its own.

    [demo-awscreds]
    aws_access_key_id=ASIAUAOF2CHC77SJUPZU
    aws_secret_access_key=r21J4vwPDnDYWiwdyJe3ET+yhyzFEj7Wi1XxdIaq
    aws_session_token=FwoGZXIvYXdzEIj//////////wEaDHVLdvxSNEqaQZPPQyK2AeuaSlfAGtgaV1q2aKBCvK9c8GCJqcRLlNrixCAFga9n+9Vsh/5AWV2fmea6HwWGqGYU9uUr3mqTSFfh+6/9VQH3RTTwfWEnQONuZ6+E7KT9vYxPockyIZku2hjAUtx9dSyBvOHpIn2muMFmizZH/8EvcZFuzxFrbcy0LyLFHt2HI/gy9k6bLCMbcG9w7Ej2l8vfF3dQ6y1peVOQ5Q8dDMahhS+CMm1q/T1TdNeoon7mgqKGruO4KJrKiZoGMi1JZvXeEIVGiGAW0ro0/Vlp8DY1MaL7Af8BlWI1ZuJJwDJXbEi2Y7rHme5JjbA=

    To validate, you can then run “aws s3 ls.” You should see S3 buckets listed from your AWS account. Note that these credentials are only valid for 60 minutes. This means you will have to re-run the command and acquire a new pair of AWS credentials. Of course, you can configure your IAM role to extend expiry for an “assume role.” 

    auth-awscreds in Action:

    Summary

    Currently, “auth-awscreds” is in its early development stage. This post demonstrates how AWS credentials can be acquired temporarily without having to worry about key rotation. One of the features we are currently working on is RBAC, with the help of AWS Cognito. Since the tool currently doesn't support any command-line arguments, the utility configuration can't be reconfigured from the CLI; you can manually edit or delete the utility configuration file, which triggers the configuration prompt during the next run. We also want to add support for multiple profiles so that multiple AWS accounts can be used.

  • A Primer on HTTP Load Balancing in Kubernetes using Ingress on Google Cloud Platform

    Adoption of containerized applications and Kubernetes in cloud environments is on the rise. One of the challenges of deploying applications in Kubernetes is exposing these containerized applications to the outside world. This blog explores the different options via which applications can be externally accessed, with a focus on Ingress, a new feature in Kubernetes that provides an external load balancer. This blog also provides a simple hands-on tutorial on Google Cloud Platform (GCP).  

    Ingress is the new feature (currently in beta) from Kubernetes which aspires to be an Application Load Balancer intending to simplify the ability to expose your applications and services to the outside world. It can be configured to give services externally-reachable URLs, load balance traffic, terminate SSL, offer name based virtual hosting etc. Before we dive into Ingress, let’s look at some of the alternatives currently available that help expose your applications, their complexities/limitations and then try to understand Ingress and how it addresses these problems.

    Current ways of exposing applications externally:

    There are several ways to expose your applications externally. Let's look at each of them:

    EXPOSE Pod:

    You can expose your application directly from your pod by using a port on the node that is running your pod, mapping that port to a port exposed by your container, and using the combination HOST-IP:HOST-PORT to access your application externally. This is similar to what you would do when running Docker containers directly, without Kubernetes. In Kubernetes, you can use the hostPort setting in the pod specification, which does the same thing. Another approach is to set hostNetwork: true in the pod specification to use the host's network interface from your pod.
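    As a sketch of the hostPort approach (names, image, and port values here are illustrative), a pod spec might look like:

    ```yaml
    apiVersion: v1
    kind: Pod
    metadata:
      name: web
    spec:
      # hostNetwork: true  # alternative: share the node's network namespace
      containers:
      - name: nginx
        image: nginx
        ports:
        - containerPort: 80
          hostPort: 8080   # reachable at <NodeIP>:8080
    ```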

    Limitations:

    • In both scenarios, you should take extra care to avoid port conflicts on the host, and possibly some issues with packet routing and name resolution.
    • This limits you to one replica of the pod per cluster node, as the host port you use is unique and can be bound by only one pod.

    EXPOSE Service:

    Kubernetes services primarily work to interconnect different pods which constitute an application. You can scale the pods of your application very easily using services. Services are not primarily intended for external access, but there are some accepted ways to expose services to the external world.

    Basically, services provide a routing, balancing and discovery mechanism for the pod’s endpoints. Services target pods using selectors, and can map container ports to service ports. A service exposes one or more ports, although usually, you will find that only one is defined.

    A service can be exposed using 3 ServiceType choices:

    • ClusterIP: Exposes the service on a cluster-internal IP. Choosing this value makes the service only reachable from within the cluster. This is the default ServiceType.
    • NodePort: Exposes the service on each Node’s IP at a static port (the NodePort). A ClusterIP service, to which the NodePort service will route, is automatically created. You’ll be able to contact the NodePort service, from outside the cluster, by requesting <NodeIP>:<NodePort>. Here, NodePort remains fixed, and NodeIP can be any node IP of your Kubernetes cluster.
    • LoadBalancer: Exposes the service externally using a cloud provider’s load balancer (eg. AWS ELB). NodePort and ClusterIP services, to which the external load balancer will route, are automatically created.
    • ExternalName: Maps the service to the contents of the externalName field (e.g. foo.bar.example.com), by returning a CNAME record with its value. No proxying of any kind is set up. This requires version 1.7 or higher of kube-dns.
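    As an illustrative sketch (the service name, selector, and port values are assumptions), a NodePort service manifest looks like:

    ```yaml
    apiVersion: v1
    kind: Service
    metadata:
      name: my-app
    spec:
      type: NodePort
      selector:
        app: my-app
      ports:
      - port: 80          # service port inside the cluster
        targetPort: 8080  # container port on the pods
        nodePort: 30080   # static port opened on every node (30000-32767)
    ```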

    Limitations:

    • If we choose NodePort to expose our services, kubernetes will generate ports corresponding to the ports of your pods in the range of 30000-32767. You will need to add an external proxy layer that uses DNAT to expose more friendly ports. The external proxy layer will also have to take care of load balancing so that you leverage the power of your pod replicas. Also it would not be easy to add TLS or simple host header routing rules to the external service.
    • ClusterIP and ExternalName, similarly, are easy to use but have the limitation that we cannot add any routing or load-balancing rules.
    • Choosing LoadBalancer is probably the easiest of all methods to get your service exposed to the internet. The problem is that there is no standard way of telling a Kubernetes service about the elements that a balancer requires; again, TLS and host headers are left out. Another limitation is the reliance on an external load balancer (AWS’s ELB, GCP’s Cloud Load Balancer, etc.).

    Endpoints

    Endpoints are usually created automatically by services, unless you are using headless services and adding the endpoints manually. An endpoint is a host:port tuple registered with Kubernetes, and in the service context it is used to route traffic. The service tracks the endpoints as pods that match the selector are created, deleted, and modified. Individually, endpoints are not useful for exposing services, since they are to some extent ephemeral objects.

    Summary

    If you can rely on your cloud provider to correctly implement the LoadBalancer for their API, to keep up-to-date with Kubernetes releases, and you are happy with their management interfaces for DNS and certificates, then setting up your services as type LoadBalancer is quite acceptable.

    On the other hand, if you want to manage load balancing systems manually and set up port mappings yourself, NodePort is a low-complexity solution. If you are directly using Endpoints to expose external traffic, perhaps you already know what you are doing (but consider that you might have made a mistake, there could be another option).

    Given that none of these elements has been originally designed to expose services to the internet, their functionality may seem limited for this purpose.

    Understanding Ingress

    Traditionally, you would create a LoadBalancer service for each public application you want to expose. Ingress gives you a way to route requests to services based on the request host or path, centralizing a number of services into a single entrypoint.

    Ingress is split up into two main pieces. The first is an Ingress resource, which defines how you want requests routed to the backing services and second is the Ingress Controller which does the routing and also keeps track of the changes on a service level.

    Ingress Resources

    The Ingress resource is a set of rules that map to Kubernetes services. Ingress resources are defined purely within Kubernetes as an object that other entities can watch and respond to.

    Ingress supports defining the following rules in its beta stage:

    • host header:  Forward traffic based on domain names.
    • paths: Looks for a match at the beginning of the path.
    • TLS: If the Ingress defines TLS, HTTPS will be served with a certificate configured through a secret.

    When no host header rules are included at an Ingress, requests without a match will use that Ingress and be mapped to the backend service. You will usually do this to send a 404 page to requests for sites/paths which are not sent to the other services. Ingress tries to match requests to rules, and forwards them to backends, which are composed of a service and a port.
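    A hedged sketch of such a catch-all Ingress, assuming a hypothetical default-404 service, might look like:

    ```yaml
    apiVersion: extensions/v1beta1
    kind: Ingress
    metadata:
      name: default-ingress
    spec:
      backend:            # catch-all: requests matching no rule land here
        serviceName: default-404
        servicePort: 80
    ```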

    Ingress Controllers

    The Ingress controller is the entity that grants (or removes) access based on changes in the services, pods, and Ingress resources. The Ingress controller gets the state-change data by directly calling the Kubernetes API.

    Ingress controllers are applications that watch Ingresses in the cluster and configure a balancer to apply those rules. You can configure any of the third-party balancers like HAProxy, NGINX, Vulcand, or Traefik to create your version of the Ingress controller. The Ingress controller should track changes in Ingress resources, services, and pods, and update the configuration of the balancer accordingly.

    Ingress controllers will usually track and communicate with endpoints behind services instead of using services directly. This way some network plumbing is avoided, and we can also manage the balancing strategy from the balancer. Some of the open source implementations of Ingress Controllers can be found here.

    Now, let’s do an exercise of setting up an HTTP load balancer using Ingress on Google Cloud Platform (GCP), which has already integrated the Ingress feature into its Container Engine (GKE) service.

    Ingress-based HTTP Load Balancer in Google Cloud Platform

    The tutorial assumes that you have your GCP account set up and a default project created. We will first create a container cluster, followed by deployment of an nginx service and an echoserver service. Then we will set up an Ingress resource for both services, which will configure the HTTP load balancer provided by GCP.

    Basic Setup

    Get your project ID by going to the “Project info” section in your GCP dashboard. Start the Cloud Shell terminal, set your project id and the compute/zone in which you want to create your cluster.

    $ gcloud config set project glassy-chalice-129514
    $ gcloud config set compute/zone us-east1-d
    # Create a 3 node cluster with name “loadbalancedcluster”
    $ gcloud container clusters create loadbalancedcluster

    Fetch the cluster credentials for the kubectl tool:

    $ gcloud container clusters get-credentials loadbalancedcluster --zone us-east1-d --project glassy-chalice-129514

    Step 1: Deploy an nginx server and echoserver service

    $ kubectl run nginx --image=nginx --port=80
    $ kubectl run echoserver --image=gcr.io/google_containers/echoserver:1.4 --port=8080
    $ kubectl get deployments
    NAME         DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
    echoserver   1         1         1            1           15s
    nginx        1         1         1            1           26m

    Step 2: Expose your nginx and echoserver deployment as a service internally

    Create a Service resource to make the nginx and echoserver deployment reachable within your container cluster:

    $ kubectl expose deployment nginx --target-port=80  --type=NodePort
    $ kubectl expose deployment echoserver --target-port=8080 --type=NodePort

    When you create a Service of type NodePort with this command, Container Engine makes your Service available on a randomly-selected high port number (e.g. 30746) on all the nodes in your cluster. Verify the Service was created and a node port was allocated:

    $ kubectl get service nginx
    NAME      CLUSTER-IP     EXTERNAL-IP   PORT(S)        AGE
    nginx     10.47.245.54   <nodes>       80:30746/TCP   20s
    $ kubectl get service echoserver
    NAME         CLUSTER-IP    EXTERNAL-IP   PORT(S)          AGE
    echoserver   10.47.251.9   <nodes>       8080:32301/TCP   33s

    In the output above, the node port for the nginx Service is 30746 and for the echoserver Service is 32301. Also, note that there is no external IP allocated for these Services. Since the Container Engine nodes are not externally accessible by default, creating these Services does not make your application accessible from the Internet. To make your HTTP(S) web server application publicly accessible, you need to create an Ingress resource.

    Step 3: Create an Ingress resource

    On Container Engine, Ingress is implemented using Cloud Load Balancing. When you create an Ingress in your cluster, Container Engine creates an HTTP(S) load balancer and configures it to route traffic to your application. Container Engine has an internally defined Ingress controller, which takes the Ingress resource as input for setting up proxy rules and talks to the Kubernetes API to get the service-related information.

    The following config file defines an Ingress resource that directs traffic to your nginx and echoserver server:

    apiVersion: extensions/v1beta1
    kind: Ingress
    metadata:
      name: fanout-ingress
    spec:
      rules:
      - http:
          paths:
          - path: /
            backend:
              serviceName: nginx
              servicePort: 80
          - path: /echo
            backend:
              serviceName: echoserver
              servicePort: 8080

    To deploy this Ingress resource, save the config above as basic-ingress.yaml and run the following in the Cloud Shell:

    $ kubectl apply -f basic-ingress.yaml

    Step 4: Access your application

    Find out the external IP address of the load balancer serving your application by running:

    $ kubectl get ingress fanout-ingress
    NAME             HOSTS     ADDRESS          PORTS     AGE
    fanout-ingress   *         130.211.36.168   80        36s    

     

    Use http://<external-ip-address> and http://<external-ip-address>/echo to access nginx and the echoserver.

    Summary

    Ingresses are simple, very easy to deploy, and really fun to play with. However, Ingress is currently in its beta phase and misses some features, which may restrict it from production use. Stay tuned for updates on the Kubernetes Ingress page and their GitHub repo.

  • SEO for Web Apps: How to Boost Your Search Rankings

    The responsibilities of a web developer are not just designing and developing a web application, but also adding the right set of features that allow the site to get higher traffic. One way of getting traffic is by ensuring your web page is listed in the top search results of Google. Search engines consider certain factors while ranking a web page (covered in this guide below), and accommodating these factors in your web app is called search engine optimization. 

    A web app that is search engine optimized loads faster, has a good user experience, and is shown in the top search results of Google. If you want your web app to have these features, then this essential guide to SEO will provide you with a checklist to follow when working on SEO improvements.

    Key Facts:

    • 75% of visitors only visit the first three links listed and results from the second page get only 0.78% of clicks.
    • 95% of visitors visit only the links from the first page of Google.
    • Search engines give 300% more traffic than social media.
    • 8% of searches from browsers are in the form of a question.
    • 40% of visitors will leave a website if it takes more than 3 seconds to load. More shocking, 80% of those visitors will not visit the same site again.

    How Search Works:

    1. Crawling: Automated scripts, often referred to as web crawlers, web spiders, or Googlebot, and sometimes shortened to crawlers, review past crawls and look for the sitemap file, which is found at the root directory of the web application. We will cover more on the sitemap later; for now, just understand that the sitemap file has all the links to your website, ordered hierarchically. Crawlers add those links to the crawl queue so that they can be crawled later. Crawlers pay special attention to newly added sites and frequently updated/visited sites, and they use several algorithms to determine how often an existing site should be recrawled.
    2. Indexing: Let us first understand what indexing means. Indexing is collecting, parsing, and storing data to enable a super-fast response to queries. Google uses the same steps to perform web indexing: it visits each page from the crawl queue, analyzes what the page is about, including its content, images, and video, then parses the result and stores it in their database, called the Google Index.
    3. Serving: When a user makes a search query on Google, Google tries to determine the highest-quality result and considers other criteria before serving it, like the user's location, submitted data, language, and device (desktop/mobile). That is why responsiveness is also considered for SEO. Unresponsive sites might have a higher ranking for desktop but a lower ranking for mobile because, while analyzing the page content, these bots see the pages as the user sees them and assign the ranking accordingly.

    Factors that affect SEO ranking:

    1. Sitemap: The sitemap file comes in two types, HTML and XML, and both files are placed at the root of the web app. The HTML sitemap guides users around the website pages; it lists the pages hierarchically to help users understand the flow of the website. The XML sitemap helps the search engine bots crawl the pages of the site and understand the website structure. It carries different types of data, which help the bots perform crawling cleverly.

    loc: The URL of the webpage.

    lastmod: When the content of the URL got updated.

    changefreq: How often the content of the page gets changed.

    priority: Ranges from 0 to 1, where 0 represents the lowest priority and 1 the highest. 1 is generally given to the home or landing page. Setting 1 for every URL will cause search engines to ignore this field.

    Click here to see how a sitemap.xml looks like.

    The below example shows how the URL will be written along with the fields.
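    As a sketch following the standard sitemap protocol (the URL and field values are illustrative), a single entry combining the fields above looks like:

    ```xml
    <?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <url>
        <loc>https://www.example.com/</loc>
        <lastmod>2021-06-01</lastmod>
        <changefreq>weekly</changefreq>
        <priority>1.0</priority>
      </url>
    </urlset>
    ```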

     

    2. Meta tags: Meta tags are very important because they indirectly affect the SEO ranking. They contain key information about the web page, and this information is shown as the snippet in Google search results. Users see this snippet and decide whether to click the link, and search engines consider click rates when serving results. Meta tags are not visible to the user on the rendered page, but they are part of the HTML code.

    A few important meta tags for SEO are:

    • Meta title: This is the primary content shown by the search results, and it plays a huge role in deciding the click rates because it gives users a quick glance at what this page is about. It should ideally be 50-60 characters long, and the title should be unique for each page.
    • Meta description: It summarizes or gives an overview of the page content in short. The description should be precise and of high quality. It should include some targeted keywords the user will likely search and be under 160 characters.
    • Meta robots: It tells search engines whether to index and crawl web pages. The four values it can contain are index, noindex, follow, or nofollow. If these values are not used correctly, then it will negatively impact the SEO.
      index/noindex: Tells whether to index the web page.
      follow/nofollow: Tells whether to crawl links on the web page.
    • Meta viewport: It signals to search engines that the web page is responsive to different screen sizes and instructs the browser on how to render the page. Its presence helps search engines understand that the website is mobile-friendly, which matters because Google ranks results differently in mobile search. If the desktop version opens on mobile, the user will most likely close the page, sending Google a negative signal about the page and lowering its ranking. This tag should be present on all web pages.

      Let us look at what a Velotio page would look like with and without the meta viewport tag.


    • Meta charset: It sets the character encoding of the webpage; in simple terms, it tells the browser how the text should be displayed. The wrong character encoding makes content hard for search engines to read and leads to a bad user experience. Use UTF-8 character encoding wherever possible.
    • Meta keywords: Search engines don’t consider this tag anymore. Bing considers this tag as spam. If this tag is added to any of the web pages, it may work against SEO. It is advisable not to have this tag on your pages.
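    Taken together, a page's head section with these tags might look like this (all values are illustrative; the meta title is expressed through the title tag):

    ```html
    <meta charset="UTF-8">
    <title>Velotio | Cloud-Native Product Development</title>
    <meta name="description" content="Velotio builds cloud-native products and data platforms.">
    <meta name="robots" content="index, follow">
    <meta name="viewport" content="width=device-width, initial-scale=1">
    ```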

    3. Usage of Headers / Hierarchical content: Header tags are heading tags that matter for both user readability and search engines. Headers organize the content of the web page so that it doesn't look like a plain wall of text. Bots check how well the content is organized and assign the ranking accordingly. Headers make the content user-friendly, scannable, and accessible. Header tags range from h1 to h6, with h1 carrying the most weight and h6 the least. Googlebot pays the most attention to h1 because it is typically the title of the page and briefly conveys what the page content is about.

    If Velotio's different pages of content were all written on one big page (not good advice, just an example), the heading hierarchy could look like the snapshot below.
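    As a sketch, such a heading hierarchy might look like this in HTML (section names are illustrative):

    ```html
    <h1>Velotio Technologies</h1>
      <h2>Services</h2>
        <h3>Product Engineering</h3>
        <h3>Data Engineering</h3>
      <h2>Careers</h2>
    ```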

    4. Usage of Breadcrumb: Breadcrumbs are the navigational elements that allow users to track which page they are currently on. Search engines find this helpful to understand the structure of the website. It lowers the bounce rate by engaging users to explore other pages of the website. Breadcrumbs can be found at the top of the page with slightly smaller fonts. Usage of breadcrumb is always recommended if your site has deeply nested pages.

    If we refer to the MDN pages, then a hierarchical breadcrumb can be found at the top of the page.

    5. User Experience (UX): UX has become an integral component of SEO. A good UX makes users stay longer, which lowers the bounce rate and makes them visit your site again. Google tracks this dwell time and click-through rate, treats the site as more attractive to users, and ranks it higher in the search results. Consider the following points for a good user experience.

    1. Divide content into sections, not just a plain wall of text
    2. Use hierarchical font sizes
    3. Use images/videos that summarize the content
    4. Good theme and color contrast
    5. Responsiveness (desktop/tablet/mobile)

    6. Robots.txt: The robots.txt file tells crawlers which pages of the site they should not access. It contains directives that tell bots not to crawl or index the disallowed pages. A payment gateway page is a good example of a page that should not be crawled. Robots.txt is kept at the root of the web app and should be public. Refer to Velotio's robots.txt file to learn more. `User-agent: *` means the directives that follow apply to all bots that honor robots.txt.
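    As a sketch, a robots.txt that blocks payment and admin pages for all bots might look like this (the paths are illustrative):

    ```
    User-agent: *
    Disallow: /payment/
    Disallow: /admin/

    Sitemap: https://example.com/sitemap.xml
    ```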

    7. Page speed: Page speed is the time it takes for a page to be fully displayed and interactive. Google considers page speed an important factor for SEO. As we have seen from the facts section, users tend to close a site if it takes longer than 3 seconds to load. To Googlebot, this signals a poor user experience, and it will lower the ranking. We will go through some tools later in this section to measure the loading speed of a page, but if your site loads slowly, look into the recommendations below.

    • Image compression: In a consumer-oriented website, the images contribute to around 50-90% of the page. The images must load quickly. Use compressed images, which lowers the file size without compromising the quality. Cloudinary is a platform that does this job decently.
      If your image is 700×700 but is shown in a 300×300 container, then rather than scaling it down with CSS, load the image at 300×300; the browser doesn't need to download such a big image, and reducing it through CSS takes extra time. All of that can be avoided by loading an image of the required size.
      By utilizing deferring/lazy image loading, images are downloaded when they are needed as the user scrolls on the webpage. Doing this allows the images to not be loaded at once, and browsers will have the bandwidth to perform other tasks.
      Using sprite images is also an effective way to reduce the HTTP requests by combining small icons into one sprite image and displaying the section we want to show. This will save load time by avoiding loading multiple images.
    • Code optimization: Every developer should consider reusability while writing code, which helps reduce the code size. Nowadays, most websites are built with bundlers. Use bundle analyzers to find which pieces of code inflate the bundle size. Bundlers already perform minification while generating the build artifacts.
    • Removing render-blocking resources: Browsers build the DOM tree by parsing HTML. During this process, if the parser finds a script, DOM construction pauses and script execution starts. This increases page load time. To avoid blocking DOM creation, use async or defer on your scripts and load scripts at the end of the body. Keep in mind, though, that some scripts need to load in the head, like the Google Analytics script. Don't apply this step blindly, as it may cause unusual behavior on your site.
    • Implementing a Content Distribution Network (CDN): It loads resources faster by serving content from the server nearest to the user's location.
    • Good hosting platform: Optimizing images and code alone cannot always improve page speed. Budget hosts pack many other websites onto the same servers, which can keep your site from loading quickly. So, it is recommended to use a premium hosting service or a dedicated server.
    • Implement caching: If resources are cached in the browser, they are not fetched from the server; the browser picks them from the cache instead. It is important to set an expiration time when configuring caching, and caching should only be applied to resources that are not updated frequently.
    • Reducing redirects: Each redirect adds an extra HTTP request-response cycle. It is advisable not to use too many redirects.
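    Two of the recommendations above can be sketched in HTML (the file paths are illustrative): deferring offscreen images, and loading scripts without blocking DOM construction:

    ```html
    <!-- Lazy-load offscreen images: downloaded only as the user scrolls near them -->
    <img src="/images/product-300.jpg" width="300" height="300" alt="Product photo" loading="lazy">

    <!-- defer: download in parallel, execute only after HTML parsing finishes -->
    <script defer src="/js/app.js"></script>

    <!-- async: download in parallel, execute as soon as it arrives (order not guaranteed) -->
    <script async src="/js/analytics.js"></script>
    ```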

    Some tools help us find the score of our website and tell us which areas can be improved. These tools take SEO, user experience, and accessibility into account while calculating the score, and they report results in some technical terms. Let us go over those terms briefly:

    1. Time to first byte: It marks when the web page starts loading, i.e., when the first byte of the response arrives. The white screen we see for some time on page landing is largely TTFB at work.

    2. First contentful paint: It represents when the user sees something on the web page.

    3. First meaningful paint: It tells when the user understands the content, like text/images on the web page.

    4. First CPU idle: It represents the moment when the site has loaded enough information for it to be able to handle the user’s first input.

    5. Largest contentful paint: It represents when the largest content element in the viewport (above the fold, without scrolling) becomes visible.

    6. Time to interactive: It represents the moment when the web page is fully interactive.

    7. Total blocking time: The total time during which the webpage's main thread was blocked long enough to prevent it from responding to input.

    8. Cumulative layout shift: A score measuring how much visible elements shift around while the page is being rendered; lower is better.

    Below are some popular tools we can use for performance analysis:

    1. Page speed insights: This assessment tool provides the score and opportunities to improve.

    2. Web page test: This monitoring tool lets you analyze each resource’s loading time.

    3. GTmetrix: An assessment tool, like Lighthouse, that gives some additional information and lets you set the test location as well.

    Conclusion:

    We have seen what SEO is, how it works, and how we can improve it by going through sitemap, meta tags, heading tags, robots.txt, breadcrumb, user experience, and finally the page load speed. For a business-to-consumer application, SEO is highly important. It lets you drive more traffic to your website. Hopefully, this basic guide will help you improve SEO for your existing and future websites.

    Related Articles

    1. Eliminate Render-blocking Resources using React and Webpack

    2. Building High-performance Apps: A Checklist To Get It Right

    3. Building a Progressive Web Application in React [With Live Code Examples]

  • Elasticsearch 101: Fundamentals & Core Components

    Elasticsearch is currently the most popular way to implement free-text search and analytics in applications. It is highly scalable and can easily manage petabytes of data. It supports a variety of use cases: letting users search through any portal, collecting and analyzing log data, and building business intelligence dashboards to quickly analyze and visualize data.

    This blog acts as an introduction to Elasticsearch and covers the basic concepts of clusters, nodes, index, document and shards.

    What is Elasticsearch?

    Elasticsearch (ES) is an open-source, distributed, highly scalable data store built around Lucene, a search engine library that supports extremely fast full-text search. It is a beautifully crafted piece of software that hides the internal complexities and exposes full-text search capabilities through simple REST APIs. Elasticsearch is written in Java with Apache Lucene at its core. It should be clear that Elasticsearch is not like a traditional RDBMS: it is not suitable for your transactional database needs, and hence, in my opinion, it should not be your primary data store. A common practice is to use a relational database as the primary data store and inject only the required data into Elasticsearch.

    Elasticsearch is meant for fast text search. There are several functionalities, which make it different from RDBMS. Unlike RDBMS, Elasticsearch stores data in the form of a JSON document, which is denormalized and doesn’t support transactions, referential integrity, joins, and subqueries.

    Elasticsearch works with structured, semi-structured, and unstructured data as well. In the next section, let’s walk through the various components in Elasticsearch.

    Elasticsearch Components

    Cluster

    One or more servers collectively providing indexing and search capabilities form an Elasticsearch cluster. The cluster size can vary from a single node to thousands of nodes, depending on the use cases.

    Node

    Node is a single physical or virtual machine that holds full or part of your data and provides computing power for indexing and searching your data. Every node is identified with a unique name. If the node identifier is not specified, a random UUID is assigned as a node identifier at the startup. Every node configuration has the property `cluster.name`. The cluster will be formed automatically with all the nodes having the same `cluster.name` at startup.
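    As a sketch, the relevant settings in each node's `elasticsearch.yml` might look like this (the values are illustrative):

    ```yaml
    cluster.name: my-search-cluster   # nodes with the same value form one cluster
    node.name: node-1                 # omit to get a random UUID-based name at startup
    ```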

    A node has to accomplish several duties such as:

    • storing the data
    • performing operations on data (indexing, searching, aggregation, etc.)
    • maintaining the health of the cluster

    Each node in a cluster can do all these operations. Elasticsearch provides the capability to split responsibilities across different nodes. This makes it easy to scale, optimize, and maintain the cluster. Based on the responsibilities, the following are the different types of nodes that are supported:

    Data Node

    Data node is a node that has storage and computation capability. A data node stores part of the data in the form of shards (explained in a later section). Data nodes also participate in CRUD, search, and aggregation operations. These operations are resource-intensive, and hence, it is a good practice to have dedicated data nodes free from the additional load of cluster administration. By default, every node of the cluster is a data node.

    Master Node

    Master nodes are reserved to perform administrative tasks. Master nodes track the availability/failure of the data nodes. The master nodes are responsible for creating and deleting the indices (explained in the later section).

    This makes the master node a critical part of the Elasticsearch cluster; it has to be stable and healthy. A single master node is certainly a single point of failure, so Elasticsearch provides the capability to have multiple master-eligible nodes, all of which participate in an election to elect the master node. It is recommended to have a minimum of three master-eligible nodes in the cluster to avoid a split-brain situation. By default, every node is both a data node and master-eligible; however, through explicit configuration, some nodes can be made master-eligible only.

    Coordinating-Only Node

    Any node that is not a master node or a data node is a coordinating node. Coordinating nodes act as smart load balancers: they are exposed to end-user requests and route requests to the appropriate data and master nodes.

    To take an example, a user's search request is sent to different data nodes. Each data node searches locally and sends its result back to the coordinating node, which aggregates the results and returns them to the user.

    There are a few concepts that are core to Elasticsearch. Understanding these basic concepts will tremendously ease the learning process.

    Index

    An index is a container that stores data, similar to a database in a relational system. It contains a collection of documents that have similar characteristics or are logically related. Taking the example of an e-commerce website: there will be one index for products, one for customers, and so on. An index is identified by a lowercase name, which is required to perform add, update, and delete operations on its documents.

    Type

    Type is a logical grouping of documents within an index. In the previous example of the product index, we can further group documents into types like electronics, fashion, furniture, etc. Types are defined on the basis of documents having similar properties. It isn't always easy to decide when to use a type over an index. Indices have more overhead, so sometimes it is better to use different types in the same index for better performance. There are a couple of restrictions on using types as well; for example, two fields with the same name in different types of documents must have the same datatype (string, date, etc.).

    Document

    A document is the basic unit of information that Elasticsearch indexes, represented in JSON format. We can add as many documents as we want to an index. The following snippet shows how to create a document of type mobile in the index store. We will cover the individual fields of the document in the Mapping Types section.

    HTTP POST <hostname:port>/store/mobile/
    {
      "name": "Motorola G5",
      "model": "XT3300",
      "release_date": "2016-01-01",
      "features": "16 GB ROM | Expandable Upto 128 GB | 5.2 inch Full HD Display | 12MP Rear Camera | 5MP Front Camera | 3000 mAh Battery | Snapdragon 625 Processor",
      "ram_gb": "3",
      "screen_size_inches": "5.2"
    }

    Mapping Types

    To create different types in an index, we need mapping types (or simply mappings) to be specified during index creation. Mappings are a list of directives given to Elasticsearch about how the data should be stored and retrieved. It is important to provide mapping information at index creation time based on how we want to retrieve the data later. In relational database terms, think of mappings as a table schema.

    Mapping provides information on how to treat each JSON field. For example, the field can be of type date, geolocation, or person name. Mappings also allow specifying which fields will participate in the full-text search, and specify the analyzers used to transform and decorate data before storing into an index. If no mapping is provided, Elasticsearch tries to identify the schema itself, known as Dynamic Mapping. 

    Each mapping type has Meta Fields and Properties. The snippet below shows the mapping of the type mobile.

    {
      "mappings": {
        "mobile": {
          "properties": {
            "name": { "type": "keyword" },
            "model": { "type": "keyword" },
            "release_date": { "type": "date" },
            "features": { "type": "text" },
            "ram_gb": { "type": "short" },
            "screen_size_inches": { "type": "float" }
          }
        }
      }
    }

    Meta Fields

    As the name indicates, meta fields store additional information about the document. Meta fields are mostly for internal usage, and it is unlikely that an end-user will have to deal with them. Meta field names start with an underscore. There are around ten meta fields in total. We will talk about some of them here:

    _index

    It stores the name of the index the document belongs to. It is used internally to store/search the document within an index.

    _type

    It stores the type of the document. To get better performance, it is often included in search queries.

    _id

    This is the unique id of the document. It is used to access a specific document directly over the HTTP GET API.

    _source

    This holds the original JSON document before any analyzers/transformations are applied. It is important to note that Elasticsearch can only query fields that are indexed (i.e., have a mapping). The _source field is not indexed and hence can't be queried, but it can be included in the final search result.

    Fields Or Properties

    The list of fields specifies which JSON fields in the document are included in a particular type. In the e-commerce website example, mobile can be a type. It will have fields like operating_system, camera_specification, ram_size, etc.

    Fields also carry the data type information with them. This directs Elasticsearch to treat the specific fields in a particular way of storing/searching data. Data types are similar to what we see in any other programming language. We will talk about a few of them here.

    Simple Data Types

    Text

    This data type is used to store full text, like a product description. These fields participate in full-text search and are analyzed while being stored, which enables searching them by the individual words they contain. Such fields are not used in sorting and aggregation queries.

    Keywords

    This type is also used to store text data, but unlike Text, it is stored without being analyzed. It is suitable for information like a user's mobile number, city, or age. These fields are used in filter, aggregation, and sorting queries, e.g., list all users from a particular city and filter them by age.

    Numeric

    Elasticsearch supports a wide range of numeric types: long, integer, short, byte, double, float.

    There are a few more data types to support date, boolean (true/false, on/off, 1/0), IP (to store IP addresses).

    Special Data Types

    Geo Point

    This data type is used to store geographical location. It accepts latitude and longitude pair. For example, this data type can be used to arrange the user’s photo library by their geographical location or graphically display the locations trending on social media news.

    Geo Shape

    It allows storing arbitrary geometric shapes like rectangle, polygon, etc.

    Completion Suggester

    This data type is used to provide auto-completion feature over a specific field. As the user types certain text, the completion suggester can guide the user to reach particular results.

    Complex Data Type

    Object

    If you know JSON well, this concept won’t be new for you. Elasticsearch also allows storing nested JSON object structure as a document.

    Nested

    The Object data type is of limited use due to its underlying representation in the Lucene index, which does not support inner JSON objects. ES flattens the original JSON to make it storable in the Lucene index; as a result, fields of multiple inner objects get merged into one, leading to wrong search results. Most of the time, you should prefer the Nested data type over Object.

    Shards

    Shards are what make Elasticsearch horizontally scalable. An index can store millions of documents and occupy terabytes of data, which can cause problems with performance, scalability, and maintenance. Let's see how shards help achieve scalability.

    Indices are divided into multiple units called shards (refer to the diagram below). A shard is a full-featured subset of an index. Shards of the same index can reside on the same or different nodes of the cluster. The number of shards decides the degree of parallelism for search and indexing operations and allows the cluster to grow horizontally. The number of shards per index can be specified at the time of index creation; by default, 5 shards are created. However, once the index is created, the number of shards cannot be changed. To change it, the data has to be reindexed.
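    As a sketch, the shard count (and replica count, covered next) can be supplied at index creation, following the same HTTP style as the earlier snippets (index name and values are illustrative):

    ```
    HTTP PUT <hostname:port>/store
    {
      "settings": {
        "number_of_shards": 5,
        "number_of_replicas": 1
      }
    }
    ```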

    Replication

    Hardware can fail at any time. To ensure fault tolerance and high availability, ES provides a feature to replicate data: shards can be replicated. A shard that is being copied is called a primary shard; the copy of the primary shard is called a replica shard, or simply a replica. Like the number of shards, the number of replicas can also be specified at the time of index creation. Replication serves two purposes:

    • High Availability – A replica is never created on the same node as its primary shard. This ensures that data remains available through the replica shard even if an entire node fails.
    • Performance – Replicas also contribute to search capability: search queries can be executed in parallel across the replicas.

    To summarize, to achieve high availability and performance, the index is split into multiple shards, and in a production environment, multiple replicas are created for every index. In a replicated index, only primary shards can serve write requests, but all shards (primary as well as replicas) can serve read/query requests. The replication factor is defined at index creation time and can be changed later if required. Choosing the number of shards, however, is an important exercise, since once defined it can't be changed; in critical scenarios, changing the number of shards requires creating a new index with the required shards and reindexing the old data.

    Summary

    In this blog, we have covered the basic but important aspects of Elasticsearch. In the following posts, I will talk about how indexing & searching works in detail. Stay tuned!

  • Improving Elasticsearch Indexing in the Rails Model using Searchkick

    Searching has become a prominent feature of any web application, and a relevant search experience requires a robust search engine. The search engine should be capable of full-text search, autocompletion, suggestions, spelling correction, fuzzy search, and analytics.

    Elasticsearch, a distributed, fast, and scalable search and analytic engine, takes care of all these basic search requirements.

    The focus of this post is using a few approaches with Elasticsearch in our Rails application to reduce time latency for web requests. Let’s review one of the best ways to improve the Elasticsearch indexing in Rails models by moving them to background jobs.

    In a Rails application, Elasticsearch can be integrated with any of the following popular gems:

    We can continue with any of the gems mentioned above, but for this post, we will move forward with the Searchkick gem, which is much more Rails-friendly.

    By default, Searchkick uses model callbacks to sync data to the corresponding Elasticsearch index. Because the sync happens inside the callbacks, any web request that creates or updates a resource takes additional time to process.

    The below image shows logs from a Rails application, captured for an update request of a user record. We have added a print statement before Elasticsearch tries to sync in the Rails model so that it helps identify from the logs where the indexing has started. These logs show that the last two queries were executed for indexing the data in the Elasticsearch index.

    Since the Elasticsearch sync happens while updating a user record, we can conclude that the user update request takes additional time to accommodate the Elasticsearch sync.

    Below is the request flow diagram:

    From the request flow diagram, we can see that the end-user must wait for steps 3 and 4 to complete. Step 3 fetches the children object details from the database.

    To tackle the problem, we can move the Elasticsearch indexing to the background jobs. Usually, for Rails apps in production, there are separate app servers, database servers, background job processing servers, and Elasticsearch servers (in this scenario).

    This is how the request flow looks when we move Elasticsearch indexing:

    Let’s get to coding!

    For demo purposes, we will have a Rails app with models: `User` and `Blogpost`. The stack used here:

    • Rails 5.2
    • Elasticsearch 6.6.7
    • MySQL 5.6
    • Searchkick (gem for writing Elasticsearch queries in Ruby)
    • Sidekiq (gem for background processing)

    This approach does not require any specific version of Rails, Elasticsearch, or MySQL. Moreover, it is database agnostic. You can go through the code in this GitHub repo for reference.

    Let’s take a look at the user model with Elasticsearch index.

    # == Schema Information
    #
    # Table name: users
    #
    #  id            :bigint           not null, primary key
    #  name          :string(255)
    #  email         :string(255)
    #  mobile_number :string(255)
    #  created_at    :datetime         not null
    #  updated_at    :datetime         not null
    #
    class User < ApplicationRecord
     searchkick
    
     has_many :blogposts
     def search_data
       {
         name: name,
         email: email,
         total_blogposts: blogposts.count,
         last_published_blogpost_date: last_published_blogpost_date
       }
     end
     ...
    end

    Anytime a user object is inserted, updated, or deleted, Searchkick reindexes the data in the Elasticsearch user index synchronously.

    Searchkick already provides four ways to sync Elasticsearch index:

    • Inline (default)
    • Asynchronous
    • Queuing
    • Manual

    For more detailed information, refer to this page. In this post, we are looking at the manual approach to reindexing the model data.

    To manually reindex, the user model will look like:

    class User < ApplicationRecord
     searchkick callbacks: false
    
     def search_data
       ...
     end
    end

    Now, we will need to define a callback that can sync the data to the Elasticsearch index. Typically, this callback must be written in all the models that have the Elasticsearch index. Instead, we can write a common concern and include it to required models.

    Here is what our concern will look like:

    module ElasticsearchIndexer
     extend ActiveSupport::Concern
    
     included do
       after_commit :reindex_model
       def reindex_model
         ElasticsearchWorker.perform_async(self.id, self.class.name)
       end
     end
    end

    In the above active support concern, we have called the Sidekiq worker named ElasticsearchWorker. After adding this concern, don’t forget to include the Elasticsearch indexer concern in the user model, like so:

    include ElasticsearchIndexer

    Now, let’s see the Elasticsearch Sidekiq worker:

    class ElasticsearchWorker
     include Sidekiq::Worker
     def perform(id, klass)
       begin
         klass.constantize.find(id.to_s).reindex
       rescue => e
         # Handle exception
       end
     end
    end

    That's it, we've done it. Cool, huh? Now, whenever a web request creates, updates, or deletes a user, a background job will be created. The background job can be seen in the Sidekiq web UI at localhost:3000/sidekiq

    Now, there is a little problem with the Elasticsearch indexer concern. To reproduce it, go to your user edit page and, without changing anything, click save, then look at localhost:3000/sidekiq: a job will be queued even though nothing changed.

    We can handle this case by tracking the dirty attributes. 

    module ElasticsearchIndexer
     extend ActiveSupport::Concern
     included do
       after_commit :reindex_model
       def reindex_model
         return if self.previous_changes.keys.blank?
          ElasticsearchWorker.perform_async(self.id, self.class.name)
       end
     end
    end

    Furthermore, there are a few more areas of improvement. Suppose you update a field of the user model that is not part of the Elasticsearch index; the Elasticsearch worker Sidekiq job will still be created and reindex the associated model object. We can instead create the indexing job only if fields that are part of the Elasticsearch index were updated.

    module ElasticsearchIndexer
     extend ActiveSupport::Concern
     included do
       after_commit :reindex_model
       def reindex_model
         updated_fields = self.previous_changes.keys
        
         # For getting ES Index fields you can also maintain constant
         # on model level or get from the search_data method.
         es_index_fields = self.search_data.stringify_keys.keys
         return if (updated_fields & es_index_fields).blank?
          ElasticsearchWorker.perform_async(self.id, self.class.name)
       end
     end
    end
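    Outside of Rails, the field-intersection guard above can be sketched as plain Ruby (the method name and field names here are illustrative, not part of Searchkick):

    ```ruby
    # Reindex only when the changed fields overlap the fields that are
    # actually stored in the Elasticsearch index.
    def needs_reindex?(previous_changes, es_index_fields)
      updated_fields = previous_changes.keys.map(&:to_s)
      !(updated_fields & es_index_fields.map(&:to_s)).empty?
    end

    # A change to a non-indexed column should not queue a job:
    needs_reindex?({ "mobile_number" => ["111", "222"] }, ["name", "email"])  # => false
    # A change to an indexed field should:
    needs_reindex?({ "name" => ["Old", "New"] }, ["name", "email"])           # => true
    ```

    The array intersection (`&`) mirrors what the concern does with `previous_changes.keys` and the keys of `search_data`.
    
    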

    Conclusion

    Moving the Elasticsearch indexing to background jobs is a great way to boost the performance of the web app by reducing the response time of any web request. Implementing this approach for every model would not be ideal. I would recommend this approach only if the Elasticsearch index data are not needed in real-time.

    Since the execution of background jobs depends on how many jobs are queued, it might take time for changes to be reflected in the Elasticsearch index if many jobs are pending. To mitigate this, the Elasticsearch indexing jobs can be placed in a high-priority queue. This approach also works best when the app server is separate from the background job processing server, so make sure the two are different.

  • Managing Secrets Using AWS Systems Manager Parameter Store and IAM Roles

    Amazon Web Services (AWS) offers an extremely wide variety of services covering almost all infrastructure requirements. One of them is AWS Systems Manager, a collection of services to manage AWS instances, hybrid environments, resources, and virtual machines through a common UI. Its services are divided into categories such as Resource Groups, Insights, Actions, and Shared Resources. Among the Shared Resources is Parameter Store, our topic of discussion today. Many Systems Manager services require the SSM agent to be installed on the system, but Parameter Store can also be used standalone.

    What is Parameter Store?

    Parameter Store is a service that helps you arrange your data in a systematic, hierarchical format for easy reference. The data can be of any kind: passwords, keys, URLs, or plain strings. Values are stored in a key-value format, either encrypted or as plain text. Parameter Store comes integrated with AWS KMS, which provides a default key and gives you the option to change it; in this blog we will use the default one.

    Why Parameter Store?

    Let’s compare it with its competitors: HashiCorp Vault and AWS Secrets Manager.

    Vault stores secrets in a database or file system, but it requires you to manage the root token and unseal keys yourself, and it is not easy to use.

    Next is the AWS-owned Secrets Manager. This service is not free, and secret rotation requires writing Lambda functions, which can become an overhead. Also, the hierarchy is treated as a plain string, which can’t be iterated.

    Some Key Features of Parameter Store include:

    • As KMS is integrated, encryption takes place automatically without extra parameters.
    • It arranges your data hierarchically, and it’s pretty simple: just use “/” to form the hierarchy, and with a recursive search we can fetch the required parameters.
    • It helps us remove those big config files that previously held our secrets and posed a severe security risk, helping us modularize our applications.
    • Simple Data like Name can be stored as String.
    • Secured Data as SecureString.
    • Even Array data can be stored using StringList.
    • Access configuration is manageable with IAM.
    • Linked with other services like AWS ECS, Lambda, and CloudFormation
    • AWS backed
    • Easy to use
    • Free of cost

    Note: Parameter Store is a region-specific service and thus might not be available in all regions.

    How to Use it?

    Initial Setup:

    Parameter Store can be used both via GUI and terminal.

    AWS console:

    1. Login into your account and select your preferred region.
    2. In Services select Systems Manager and after that select Parameter Store.
    3. If some parameters have already been created, their keys will be displayed.
    4. If not, you will be asked to “Create Parameter.”

    On CLI:

    1. Download the AWS CLI; it comes with built-in support for Systems Manager (SSM).
    2. Make sure your credentials file is configured.

    Use: Both on Console and CLI

    1. Create

    a. Enter the name of the key you wish to store. If it is hierarchical, separate the levels with “/” (without quotes), and enter the value in the value field.

    Eg: This
      |- is
        | - Key : Value

    Then enter “/This/is/Key” as the name and “Value” as the value.

    b. Select the storage type: use String for plain text, StringList for array-format values (mention the complete array in the value field), and SecureString if you want the value encrypted.

    c. CLI:

    $aws ssm put-parameter --name "/This/is/Key" --value "Value" --type String
    {
        "Version": 1
    }

    d. If you want to make it secure:

    $aws ssm put-parameter --name "/This" --value "SecureValue" --type SecureString
    {
    "Version": 1
    }

    2. Read

    a. Once Stored, parameters get listed on the console.

    b. To check any of them, just click on the key. If not secured, the value will be directly visible and if it is secured, then the value would be hidden and you will have to explicitly press “Show”.

    AWS Parameter Overview

    c. CLI:

    $aws ssm get-parameter --name /This/is/Key
    {
        "Parameter": {
            "Name": "/This/is/Key",
            "LastModifiedDate": 1535362148.994,
            "Value": "Value",
            "Version": 1,
            "Type": "String",
            "ARN": "arn:aws:ssm:us-east-1:275829625285:parameter/This/is/Key"
        }
    }

    d. For a SecureString:

    $aws ssm get-parameter --name /This --with-decryption
    {
        "Parameter": {
            "Name": "/This",
            "LastModifiedDate": 1535362296.062,
            "Value": "SecureValue",
            "Version": 1,
            "Type": "SecureString",
            "ARN": "arn:aws:ssm:us-east-1:275829625285:parameter/This"
        }
    }

    e. If you look at the command above, you will notice that despite providing “/This” we did not receive the complete tree. To get the whole tree, modify the command as follows:

    $aws ssm get-parameters-by-path --path /This --recursive
    {
        "Parameters": [
            {
                "Name": "/This/is/Key",
                "LastModifiedDate": 1535362148.994,
                "Value": "Value",
                "Version": 1,
                "Type": "String",
                "ARN": "arn:aws:ssm:us-east-1:275829625285:parameter/This/is/Key"
            }
        ]
    }
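    To see what the hierarchical layout buys us in practice, here is a small illustrative helper (plain JavaScript, not part of the AWS CLI or any SDK; the function name is hypothetical) that folds the flat Parameters array returned by get-parameters-by-path --recursive into a nested object keyed by each “/” path segment:

    ```javascript
    // Illustrative only: fold the flat Parameters array returned by
    // `get-parameters-by-path --recursive` into a nested object, one
    // level per "/" path segment.
    function nestParameters(parameters) {
      const tree = {};
      for (const { Name, Value } of parameters) {
        const segments = Name.split('/').filter(Boolean);
        let node = tree;
        // Walk/create the intermediate levels; the last segment holds the value.
        segments.slice(0, -1).forEach((seg) => {
          node = node[seg] = node[seg] || {};
        });
        node[segments[segments.length - 1]] = Value;
      }
      return tree;
    }
    ```

    Feeding it the output above would yield { This: { is: { Key: 'Value' } } }, which is exactly the tree sketched in the Create step.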

    3. Rotate/Modify:

    a. Once a value is saved, it automatically gets versioned as 1. If you click on the parameter and edit it, the version is incremented and the new value is stored as version 2. In this way, we achieve rotation of credentials as well.

    b. The type of a parameter cannot be changed; you will have to create a new one.

    c. CLI:
    The command itself is clear, just observe the version:

    $aws ssm put-parameter --name "/This/is/Key" --value "NewValue" --type String --overwrite
    {
    "Version": 2
    }

    4. Delete:

    a. Select the parameter (or all the required parameters) and click Delete.

    b. CLI:

    $aws ssm delete-parameter --name "/This/is/Key"

    As you can see, the commands are pretty simple, and as you may have observed, the ARN information also gets populated. Below, we will discuss the IAM roles we can configure to help us with access control.

    IAM (AWS Identity and Access Management)

    Remember that we are storing some very critical data in Parameter Store, so access to that data should be well controlled. If, by mistake, a new developer on the team is given full access over the parameters, chances are they might end up modifying or deleting production parameters. This is something we really don’t want.

    Generally, it is good practice to have roles and policies predefined so that only the people responsible have access to the required data. Control over the parameters can be exercised at a granular level, but for this blog we will take a simple use case. With that said, we can take reference from the policies mentioned below.

    Using the Resource field, we can specify the path prefix for the parameters a particular policy can access. For example, if only a system admin should be able to fetch production credentials, we place “parameter/production” in the policy, where production represents the top level of the hierarchy. Anything stored under production then becomes accessible; to fine-tune it further, we can extend the path, e.g., parameter/production/<till>/<the>/<last>/<level>.

    Below are some of the policies that can be applied to a group or user on a server level. Depending on the requirement, explicit deny can also be applied to Developers for Production.

    For Production Servers:

    SSMProdReadOnly:

    {
        "Effect": "Allow",
        "Action": [
            "ssm:GetParameterHistory",
            "ssm:ListTagsForResource",
            "ssm:GetParametersByPath",
            "ssm:GetParameters",
            "ssm:GetParameter"
        ],
        "Resource": "arn:aws:ssm:<Region>:<Account-ID>:parameter/production"
    }

    SSMProdWriteOnly:

    {
        "Effect": "Allow",
        "Action": [
            "ssm:GetParameterHistory",
            "ssm:ListTagsForResource",
            "ssm:GetParametersByPath",
            "ssm:GetParameters",
            "ssm:GetParameter",
            "ssm:PutParameter",
            "ssm:DeleteParameter",
            "ssm:AddTagsToResource",
            "ssm:DeleteParameters"
        ],
        "Resource": "arn:aws:ssm:<Region>:<Account-ID>:parameter/production"
    }

    For Dev Servers:

    SSMDevelopmentReadWrite:

    {
        "Effect": "Allow",
        "Action": [
            "ssm:PutParameter",
            "ssm:DeleteParameter",
            "ssm:RemoveTagsFromResource",
            "ssm:AddTagsToResource",
            "ssm:GetParameterHistory",
            "ssm:ListTagsForResource",
            "ssm:GetParametersByPath",
            "ssm:GetParameters",
            "ssm:GetParameter"
        ],
        "Resource": "arn:aws:ssm:<Region>:<Account-ID>:parameter/development"
    }

    Conclusion

    This was all about the AWS systems manager parameter store and the IAM roles. Now that you know what the parameter store is, why should you use it, and how to use it, I hope this helps you in kick-starting your credential management using AWS Parameter Store. Start using it already and share your experiences or suggestions in the comments section below.

  • An Introduction to React Fiber – The Algorithm Behind React

    In this article, we will learn about React Fiber—the core algorithm behind React. React Fiber is the new reconciliation algorithm in React 16. You’ve most likely heard of the virtualDOM from React 15. Its reconciler is the old algorithm (also known as the Stack Reconciler) because it uses a stack internally. The same reconciler is shared by different renderers, like DOM, Native, and Android View, so calling it virtualDOM may lead to confusion.

    So without any delay, let’s see what React Fiber is.

    Introduction

    React Fiber is a completely backward-compatible rewrite of the old reconciler. This new reconciliation algorithm from React is called Fiber Reconciler. The name comes from fiber, which it uses to represent the node of the DOM tree. We will go through fiber in detail in later sections.

    The main goals of the Fiber reconciler are incremental rendering, better or smoother rendering of UI animations and gestures, and responsiveness of the user interactions. The reconciler also allows you to divide the work into multiple chunks and divide the rendering work over multiple frames. It also adds the ability to define the priority for each unit of work and pause, reuse, and abort the work. 

    Some other features of React include returning multiple elements from a render function, better error handling (we can use the componentDidCatch method to get clearer error messages), and portals.

    While computing new rendering updates, React yields back to the main thread multiple times. As a result, high-priority work can jump ahead of low-priority work. React has priorities defined internally for each type of update.

    Before going into technical details, I would recommend you learn the following terms, which will help understand React Fiber.

    Prerequisites

    Reconciliation

    As explained in the official React documentation, reconciliation is the algorithm for diffing two DOM trees. When the UI renders for the first time, React creates a tree of nodes. Every individual node represents the React element. It creates a virtual tree (which is known as virtualDOM) that’s a copy of the rendered DOM tree. After any update from the UI, it recursively compares every tree node from two trees. The cumulative changes are then passed to the renderer.

    Scheduling

    As explained in the React documentation, suppose we have some low-priority work (like a large computing function or rendering recently fetched elements) and some high-priority work (such as an animation). There should be an option to prioritize the high-priority work over the low-priority work. In the old stack reconciler implementation, the recursive traversal and the render calls for the whole updated tree happen in a single, uninterruptible flow, which can lead to dropped frames.

    Scheduling can be time-based or priority-based. The updates should be scheduled according to the deadline. The high-priority work should be scheduled over low-priority work.

    requestIdleCallback 

    requestAnimationFrame schedules the high-priority function to be called before the next animation frame. Similarly, requestIdleCallback schedules the low-priority or non-essential function to be called in the free time at the end of the frame. 

     requestIdleCallback(lowPriorityWork);

    This shows the usage of requestIdleCallback. lowPriorityWork is a callback function that will be called in the free time at the end of the frame.

    function lowPriorityWork(deadline) {
      // workList and performUnitOfWork are placeholders for an app's own
      // queue of pending work and its processing function.
      while (deadline.timeRemaining() > 0 && workList.length > 0) {
        performUnitOfWork();
      }

      if (workList.length > 0) {
        requestIdleCallback(lowPriorityWork);
      }
    }

    When this callback is invoked, it receives a deadline object as its argument. As you can see in the snippet above, the timeRemaining function returns the idle time remaining in the current frame. If this time is greater than zero, we do the needed work. And if the work is not completed, we schedule it again on the last line for the next frame.
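    To make the idea concrete, here is a small simulation (illustrative only; the function name is hypothetical and it uses a fixed per-frame work budget instead of a real deadline object) of how pending work gets split across frames under this pattern:

    ```javascript
    // Simulated deadline-based work loop: each "frame" grants a budget,
    // work pauses when the budget is spent, and the remainder is carried
    // over to the next frame -- mirroring the requestIdleCallback pattern.
    function runWithDeadlines(workList, budgetPerFrame) {
      const frames = [];
      while (workList.length > 0) {
        let remaining = budgetPerFrame;
        const done = [];
        // Perform units of work only while this frame's budget allows it.
        while (remaining > 0 && workList.length > 0) {
          const unit = workList.shift();
          remaining -= unit.cost;
          done.push(unit.name);
        }
        frames.push(done);
      }
      return frames;
    }
    ```

    With three units of cost 1 and a budget of 2 per frame, the first frame processes two units and the third unit is deferred to the next frame, which is exactly the pause-and-resume behavior Fiber relies on.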

    So, now we are ready to look at how the fiber object itself looks and see how React Fiber works.

    Structure of fiber

    A fiber (lowercase ‘f’) is a simple JavaScript object. It represents a React element or a node of the DOM tree; it’s a unit of work. In comparison, Fiber (capital ‘F’) refers to the React Fiber reconciler.

    This example shows a simple React component that renders in root div.

    function App() {
        return (
          <div className="wrapper">
            <div className="list">
              <div className="list_item">List item A</div>
              <div className="list_item">List item B</div>
            </div>
            <div className="section">
              <button>Add</button>
              <span>No. of items: 2</span>
            </div>
          </div>
        );
      }
     
      ReactDOM.render(<App />, document.getElementById('root'));

    It’s a simple component that shows a list of items for the data we have in the component state. (I have replaced the .map iteration over data with two list items just to keep this example simple.) There is also a button and a span, which shows the number of list items.

    As mentioned earlier, fiber represents the React element. While rendering for the first time, React goes through each of the React elements and creates a tree of fibers. (We will see how it creates this tree in later sections.) 

    It creates a fiber for each individual React element in the example above: a fiber W for the div with the class wrapper, then a fiber L for the div with the class list, and so on. Let’s name the fibers for the two list items LA and LB.

    In a later section, we will see how it iterates and what the final structure of the tree looks like. Though we call it a tree, React Fiber creates a linked list of nodes, where each node is a fiber with parent, child, and sibling relationships. React uses a return key to point to the parent node, to which a child fiber returns after completing its work. So, in the above example, LA’s return is L, and its sibling is LB.

    So, how does this fiber object actually look?

    Below is the definition of type, as defined in the React codebase. I have removed some extra props and kept some comments to understand the meaning of the properties. You can find the detailed structure in the React codebase.

    export type Fiber = {
        // Tag identifying the type of fiber.
        tag: TypeOfWork,
     
        // Unique identifier of this child.
        key: null | string,
     
        // The value of element.type which is used to preserve the identity during
        // reconciliation of this child.
        elementType: any,
     
        // The resolved function/class/ associated with this fiber.
        type: any,
     
        // The local state associated with this fiber.
        stateNode: any,
     
        // Remaining fields belong to Fiber
     
        // The Fiber to return to after finishing processing this one.
        // This is effectively the parent.
        // It is conceptually the same as the return address of a stack frame.
        return: Fiber | null,
     
        // Singly Linked List Tree Structure.
        child: Fiber | null,
        sibling: Fiber | null,
        index: number,
     
        // The ref last used to attach this node.
        ref: null | (((handle: mixed) => void) & {_stringRef: ?string, ...}) | RefObject,
     
        // Input is the data coming into process this fiber. Arguments. Props.
        pendingProps: any, // This type will be more specific once we overload the tag.
        memoizedProps: any, // The props used to create the output.
     
        // A queue of state updates and callbacks.
        updateQueue: mixed,
     
        // The state used to create the output
        memoizedState: any,
     
        mode: TypeOfMode,
     
        // Effect
        effectTag: SideEffectTag,
        subtreeTag: SubtreeTag,
        deletions: Array<Fiber> | null,
     
        // Singly linked list fast path to the next fiber with side-effects.
        nextEffect: Fiber | null,
     
        // The first and last fiber with side-effect within this subtree. This allows
        // us to reuse a slice of the linked list when we reuse the work done within
        // this fiber.
        firstEffect: Fiber | null,
        lastEffect: Fiber | null,
     
        // This is a pooled version of a Fiber. Every fiber that gets updated will
        // eventually have a pair. There are cases when we can clean up pairs to save
        // memory if we need to.
        alternate: Fiber | null,
      };

    How does React Fiber work?

    Next, we will see how the React Fiber creates the linked list tree and what it does when there is an update.

    Before that, let’s explain what a current tree and workInProgress tree is and how the tree traversal happens. 

    The tree, which is currently flushed to render the UI, is called current. It’s one that was used to render the current UI. Whenever there is an update, Fiber builds a workInProgress tree, which is created from the updated data from the React elements. React performs work on this workInProgress tree and uses this updated tree for the next render. Once this workInProgress tree is rendered on the UI, it becomes the current tree.

    Fig:- Current and workInProgress trees

    Fiber tree traversal happens like this:

    • Start: Fiber starts traversal from the topmost React element and creates a fiber node for it. 
    • Child: Then, it goes to the child element and creates a fiber node for this element. This continues until the leaf element is reached. 
    • Sibling: Now, it checks for the sibling element if there is any. If there is any sibling, it traverses the sibling subtree until the leaf element of the sibling. 
    • Return: If there is no sibling, then it returns to the parent. 

    Every fiber has a child (or a null value if there is no child), sibling, and parent property (as you have seen the structure of fiber in the earlier section). These are the pointers in the Fiber to work as a linked list.

    Fig:- React Fiber tree traversal

    Let’s take the same example, but let’s name the fibers that correspond to the specific React elements.

    function App() {    // App
        return (
          <div className="wrapper">    // W
            <div className="list">    // L
              <div className="list_item">List item A</div>    // LA
              <div className="list_item">List item B</div>    // LB
            </div>
            <div className="section">   // S
              <button>Add</button>   // SB
              <span>No. of items: 2</span>   // SS
            </div>
          </div>
        );
      }
     
      ReactDOM.render(<App />, document.getElementById('root'));  // HostRoot

    First, we will quickly cover the mounting stage where the tree is created, and after that, we will see the detailed logic behind what happens after any update.

    Initial render

    The App component is rendered in root div, which has the id of root.

    Before traversing further, React Fiber creates a root fiber. Every Fiber tree has one root node. Here in our case, it’s HostRoot. There can be multiple roots if we import multiple React Apps in the DOM.

    Before rendering for the first time, there won’t be any tree. React Fiber traverses through the output from each component’s render function and creates a fiber node in the tree for each React element. It uses createFiberFromTypeAndProps to convert React elements to fiber. The React element can be a class component or a host component like div or span. For the class component, it creates an instance, and for the host component, it gets the data/props from the React Element.

    So, as shown in the example, it creates a fiber for App. Going further, it creates one more fiber, W, then goes to the child div and creates a fiber L, and so on, creating fibers LA and LB for its children. The fiber LA will have L as its return (which can also be thought of as its parent) fiber, and LB as its sibling.

    So, this is how the final fiber tree will look.

    Fig:- React Fiber Relationship

    This is how the nodes of a tree are connected using the child, sibling, and return pointers.
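    The child/sibling/return walk can be sketched in a few lines of plain JavaScript. This is illustrative only, not React’s actual code; the fiber and traverse helpers are hypothetical, but the linked-list pointers match the structure described above:

    ```javascript
    // Build a minimal "fiber": each node links to its first child, next
    // sibling, and parent (the `return` pointer).
    function fiber(name, children = []) {
      const node = { name, child: null, sibling: null, return: null };
      let prev = null;
      for (const c of children) {
        c.return = node;
        if (prev) prev.sibling = c; else node.child = c;
        prev = c;
      }
      return node;
    }

    // Iterative traversal: go child-first, then sibling; when neither
    // exists, follow `return` back up -- no recursion or call stack needed.
    function traverse(root) {
      const visited = [];
      let node = root;
      while (node) {
        visited.push(node.name); // "begin" work on this fiber
        if (node.child) { node = node.child; continue; }
        while (node && !node.sibling) node = node.return; // "complete", go up
        node = node ? node.sibling : null;
      }
      return visited;
    }

    // The tree from the example: W wraps L (LA, LB) and S (SB, SS).
    const tree = fiber('HostRoot', [
      fiber('App', [
        fiber('W', [
          fiber('L', [fiber('LA'), fiber('LB')]),
          fiber('S', [fiber('SB'), fiber('SS')]),
        ]),
      ]),
    ]);
    ```

    Running traverse(tree) visits HostRoot, App, W, L, LA, LB, S, SB, SS, which is the start/child/sibling/return order listed earlier. Because the walk holds only a pointer to the current node, it can be stopped after any fiber and resumed later.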

    Update Phase

    Now, let’s cover the second case, which is update—say due to setState. 

    So, at this time, Fiber already has the current tree. For every update, it builds a workInProgress tree. It starts with the root fiber and traverses the tree until the leaf node. Unlike the initial render phase, it doesn’t create a new fiber for every React element. It just uses the preexisting fiber for that React element and merges the new data/props from the updated element in the update phase. 

    Earlier, in React 15, the stack reconciler was synchronous: an update would traverse the whole tree recursively and make a copy of it. If, in the middle of this, another update arrived with a higher priority, there was no way to abort or pause the first update and perform the second one.

    React Fiber divides the update into units of works. It can assign the priority to each unit of work, and has the ability to pause, reuse, or abort the unit of work if not needed. React Fiber divides the work into multiple units of work, which is fiber. It schedules the work in multiple frames and uses the deadline from the requestIdleCallback. Every update has its priority defined like animation, or user input has a higher priority than rendering the list of items from the fetched data. Fiber uses requestAnimationFrame for higher priority updates and requestIdleCallback for lower priority updates. So, while scheduling a work, Fiber checks the priority of the current update and the deadline (free time after the end of the frame).

    Fiber can schedule multiple units of work after a single frame if the priority is higher than the pending work—or if there is no deadline or the deadline has yet to be reached. And the next set of units of work is carried over the further frames. This is what makes it possible for Fiber to pause, reuse, and abort the unit of work.

    So, let’s see what actually happens in the scheduled work. There are two phases to complete the work: render and commit.

    Render Phase

    The actual tree traversal and the use of deadline happens in this phase. This is the internal logic of Fiber, so the changes made on the Fiber tree in this phase won’t be visible to the user. So Fiber can pause, abort, or divide work on multiple frames. 

    We can call this phase the reconciliation phase. Fiber traverses from the root of the fiber tree and processes each fiber. The workLoop function is called for every unit of work to perform the work. We can divide this processing of the work into two steps: begin and complete.

    Begin Step

    If you find the workLoop function from the React codebase, it calls the performUnitOfWork, which takes the nextUnitOfWork as a parameter. It is nothing but the unit of work, which will be performed. The performUnitOfWork function internally calls the beginWork function. This is where the actual work happens on the fiber, and performUnitOfWork is just where the iteration happens. 

    Inside the beginWork function, if the fiber doesn’t have any pending work, it just bails out(skips) the fiber without entering the begin phase. This is how, while traversing the large tree, Fiber skips already processed fibers and directly jumps to the fiber, which has pending work. If you see the large beginWork function code block, we will find a switch block that calls the respective fiber update function, depending on the fiber tag. Like updateHostComponent for host components. These functions update the fiber. 

    The beginWork function returns the child fiber if there is one, or null if there is no child. The performUnitOfWork function keeps iterating, calling the child fibers until a leaf node is reached. For a leaf node, beginWork returns null since there is no child, and performUnitOfWork calls a completeUnitOfWork function. Let’s see the complete step now.

    Complete Step

    The completeUnitOfWork function completes the current unit of work by calling a completeWork function. It returns a sibling fiber, if there is one, to perform the next unit of work; otherwise, it completes the return (parent) fiber. This continues until the return is null, i.e., until it reaches the root node. Like beginWork, completeWork is the function where the actual work happens, while completeUnitOfWork handles the iteration.

    The render phase produces an effect list (of side-effects). These effects are operations like inserting, updating, or deleting a node of a host component, or calling the lifecycle methods for the node of a class component. The fibers are marked with the respective effect tags.

    After the render phase, Fiber will be ready to commit the updates. 

    Commit Phase

    This is the phase where the finished work is used to render the UI. As the result of this phase is visible to the user, it can’t be divided into partial renders; this phase is synchronous.

    At the beginning of this phase, Fiber has the current tree (already rendered on the UI), the finishedWork (or workInProgress) tree built during the render phase, and the effect list.

    The effect list is a linked list of fibers that have side-effects. It is a subset of the nodes of the workInProgress tree from the render phase: those that carry side-effects (updates). The effect-list nodes are linked using a nextEffect pointer.
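    As a rough sketch (illustrative only, not React’s actual code; the function name and effectTag values are hypothetical), linking just the fibers that carry an effect tag into such a list looks like this:

    ```javascript
    // Illustrative sketch: link only the fibers that carry an effectTag
    // into a singly linked effect list via nextEffect pointers, so the
    // commit phase can walk just the changed nodes.
    function buildEffectList(fibers) {
      let first = null;
      let last = null;
      for (const fiber of fibers) {
        fiber.nextEffect = null;
        if (!fiber.effectTag) continue; // no pending side-effect: skip
        if (last) last.nextEffect = fiber; else first = fiber;
        last = fiber;
      }
      return first;
    }
    ```

    The commit phase then follows nextEffect from the first node, touching only fibers with pending updates instead of re-walking the whole tree.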

    The function called during this phase is completeRoot.

    Here, the workInProgress tree becomes the current tree as it is used to render the UI. The actual DOM updates like insert, update, delete, and calls to lifecycle methods—or updates related to refs—happen for the nodes present in the effect list.

    That’s how the Fiber reconciler works.

    Conclusion

    This is how the React Fiber reconciler makes it possible to divide the work into multiple units of work. It sets the priority of each piece of work and makes it possible to pause, reuse, and abort units of work. In the fiber tree, each individual node keeps track of the information needed to make the above possible. Every fiber is a node of a linked list, connected through the child, sibling, and return references.

    Here is a well-documented list of resources you can use to learn more about React Fiber.

    Related Articles

    1. Using Formik To Build Dynamic Forms In React – Faster & Better

    2. Cleaner, Efficient Code with Hooks and Functional Programming

  • Set Up Simple S3 Deployment Workflow with Github Actions and CircleCI

    In this article, we’ll implement a continuous delivery (referred to as CD going forward) workflow using the Serverless framework for our demo React SPA application using Serverless Finch.

    Deploying single-page applications to AWS S3 is a common use case. Manual deployment and bucket configuration can be tedious and unreliable. By using Serverless and CD platforms, we can simplify this commonly faced CD challenge.

    In almost every project we have worked on, we have built a general-purpose continuous integration (referred to as CI through the rest of this article) setup as part of our basic setups. The CI requirements might range from simple test workflows to cluster deployments.

    In this article, we’ll be focusing on a simple deployment workflow using Github Actions and CircleCI. Github Actions brought CI/CD to a wider community by simplifying the setup for CI pipelines. 

    Prerequisites

    This article assumes you have a basic understanding of CI/CD and AWS services such as IAM and S3. The sample application uses a basic Create React App for the deployment demo, but knowing React.js is not required. You can implement the same flow for any other SPA or bare-bones application.

    Why Github Actions?

    There have always been great tools and CI platforms, such as AWS CodePipeline, Jenkins, Travis CI, CircleCI, etc. What makes Github Actions so compelling is that it’s built inside Github. Many organizations use Github for source control, and they often have to spend time configuring repositories with CI tools. On top of that, starting with Github Actions is free.

    As Github Actions is built inside the Github ecosystem, it’s a piece of cake to get CI pipelines up and running. Github Actions also allow you to build your own actions. However, there are some limitations because the CI platform is quite new compared to others.

    Why CircleCI?

    CircleCI has been in the market for almost a decade providing CI/CD solutions. One of many reasons to choose CircleCI is its pricing: it offers free credits each month without any upfront payment or payment details. It also offers Orbs, a wide-ranging repository of plugins; you can even build your own orbs, which is easy. Its workflow-building tools are simple and reliable, and you can check out its other features as well.

    Let’s Get Started

    To introduce the application: we’ll create a simple React application with a master-detail flow. We’ll use React’s official CRA tool to create our project, which generates the boilerplate for us.

    Installing Dependencies

    Let’s install create-react-app as a global package. We’ll call our demo project “serverless-s3”. Now we can create our React app with the following:

    yarn global add create-react-app
    create-react-app serverless-s3

    Now that we’ve created the frontend application, we can start building something cool with it. If we run the application with yarn start, we should be able to see the default CRA welcome page:

    Source: React

    To implement our master-detail flow of Github repositories, we’ll need to add some navigation to our app; we’ll use react-router for that. Also, to keep things short, we’ll be using Github’s official SDK package.

    yarn add react-router-dom @octokit/core

    Our demo application will consist of two routes: 

    1. A list of all public repos of an organization
    2. The details of the repository after clicking a repo item from the list 

    We’ll be using the Octokit client to fetch the data from Github’s open endpoints. This won’t need any authentication with Github.

    Adding Application Components

    Alright, now that we have our dependencies installed, we can add the routes to our App.js, which is the entry point for our React app.

    import { BrowserRouter as Router, Switch, Route } from 'react-router-dom';
     
    import RepoList from './RepoList';
    import RepoDetails from './RepoDetails';
     
    import './App.css';
     
    function App() {
      return (
       <Router>
         <div className="App">
           <Switch>
             <Route path="/repo/:owner/:repo" component={RepoDetails} />
             <Route path="/" component={RepoList} />
           </Switch>
         </div>
       </Router>
     );
    }
     
    export default App;

Let’s initialize our Octokit client in a client.js file; it will help us make calls to Github’s open endpoints to get data.

    import { Octokit } from '@octokit/core';
     
    export const octokit = new Octokit({});

    You can even make calls to authorized resources with the Octokit client. Octokit client supports both GraphQL and REST API. You can learn more about the client through the official documentation.

    Let’s add the RepoList.js component to the application, which will fetch the list of repositories of a given organization and display hyperlinks to the details page.

    import React, { useEffect, useState } from 'react';
    import { Link } from 'react-router-dom';
    import { octokit } from './client';
     
    function RepoList() {
     const [repos, setRepos] = useState([]);
     useEffect(() => {
       octokit
    .request('GET /orgs/{org}/repos', {
           org: 'octokit',
         })
         .then((data) => setRepos(data.data));
     }, []);
     
     return (
       <div className="repo-list-container">
         <h1>Repositories</h1>
         <ul>
           {repos.map((repo) => (
             <li key={repo.id} className="repo-list-item">
               <Link to={`/repo/${repo.owner.login}/${repo.name}`}>{repo.full_name}</Link>
             </li>
           ))}
         </ul>
       </div>
     );
    }
     
    export default RepoList;

Now that we have our list of repositories ready, we can allow users to see some general details for each one. Let’s create our details component, RepoDetails:

    import { useEffect, useState } from 'react';
    import { useParams } from 'react-router-dom';
    import { octokit } from './client';
    function RepoDetails() {
      const [repo, setRepo] = useState();
      const { repo: repoName, owner } = useParams();
      useEffect(() => {
        octokit
          .request('GET /repos/{owner}/{repo}', {
            owner,
            repo: repoName,
          })
          .then((data) => setRepo(data.data));
      }, [repoName, owner]);
      if (!repo) {
        return <b>loading...</b>;
      }
      return (
        <div className="repo-container">
          <h1>{repo.full_name}</h1>
          <p>Description: {repo.description}</p>
          <ul>
            <li><b>Forks:</b> {repo.forks}</li>
            <li><b>Subscribers:</b> {repo.subscribers_count}</li>
            <li><b>Watchers:</b> {repo.watchers}</li>
        <li><b>License:</b> {repo.license ? repo.license.name : 'None'}</li>
          </ul>
        </div>
      );
    }
    export default RepoDetails;

    Setting up Serverless

    With this done, we have our repositories master-detail flow ready. Assuming we have an AWS account setup, we can start adding the Serverless config to our project. Let’s start with the CD setup. As we said before, we’ll be using the Serverless framework to achieve our deployment workflow. Let’s add it.

    We’ll also install the Serverless plugin called serverless-finch, which allows us to configure and deploy to S3 buckets.

    yarn global add serverless
    yarn add serverless-finch --save-dev

Now that we have the Serverless CLI installed, we can initialize the Serverless service in our project by running the following command, which creates a hello-world service:

    serverless create -t hello-world

    This will create a configuration yaml file and a handler lambda function. We don’t need the handler, so we can delete handler.js. Our serverless.yml should look like this:

    service: serverless-s3
    frameworkVersion: '2'
     
    # The `provider` block defines where your service will be deployed
    provider:
     name: aws
     runtime: nodejs12.x
     
functions:
 helloWorld:
   handler: handler.helloWorld
   events:
     - http:
         path: helloWorld
         method: get
         cors: true

The serverless.yml file contains the configuration for a lambda function called helloWorld. We don’t need it, so we can remove the functions block completely. After doing that, let’s register the serverless-finch plugin:

    service: serverless-s3
    frameworkVersion: '2'
     
    provider:
     name: aws
     runtime: nodejs12.x
     
    plugins:
     - serverless-finch

Alright, now that our plugin is ready to be used, we can add the details of our S3 bucket so the plugin can deploy to it. Let’s add this block, which tells Serverless to deploy the contents of the build directory to the serverles-s3-galileo bucket. Make sure you use a different bucket name, as S3 bucket names are globally unique.

    custom:
     client:
       bucketName: serverles-s3-galileo
       distributionFolder: build
       indexDocument: index.html
       errorDocument: index.html

That’s it! We’re ready to deploy our app to our bucket. Haven’t created a bucket yet? No problem: serverless-finch will create it automatically. The last thing we need to add is a bucket policy so our app can be accessed publicly. Let’s create our bucket policy.

    Note: The indexDocument is the entry point for our web application, which is index.html in this case. We also need to add the same to errorDocument so our React routing works well in S3 hosting.

    {
       "Version": "2012-10-17",
       "Statement": [
           {
               "Effect": "Allow",
               "Principal": {
                   "AWS": "*"
               },
               "Action": "s3:GetObject",
               "Resource": "arn:aws:s3:::serverles-s3-galileo/*"
           }
       ]
    }

As the default access to S3 assets is private, we need to set up a bucket policy for our deployment bucket. The policy gives the public read-only access to our app, so the deployed assets can be browsed in a browser. You can learn more about bucket policies in the AWS documentation. Let’s update our Serverless config to use our policy. This is how our serverless.yml should look:

    service: serverless-s3
    frameworkVersion: '2'
     
    provider:
     name: aws
     runtime: nodejs12.x
     
    plugins:
     - serverless-finch
     
    custom:
     client:
       bucketName: serverles-s3-galileo
       distributionFolder: build
       indexDocument: index.html
       errorDocument: index.html
       bucketPolicyFile: config/bucket-policy.json

    Creating Github Actions Workflow

Assuming you’ve created your repo and pushed the code to it, we can start setting up our first workflow using Github Actions. As we’re using AWS for our Serverless deployments to S3, we need to provide the credentials of an IAM user with access to the bucket. The env block allows us to inject custom environment variables into the CI build; in this case, we need the AWS access key ID and secret access key to deploy the build files to the S3 bucket.

    Github allows us to store secret values that can be used in the CI environment of Github Actions. You can easily set up these secrets for your repositories. This is how they should look when configured:

Now, we can move ahead and add a Github Actions workflow. Let’s create a workflow file at .github/workflows/deploy.yml (Github only picks up workflow files inside the .github/workflows directory) and add the following to it.

    name: Serverless S3 Deploy
    on:
     push:
       branches: [ master ]
     pull_request:
       branches: [ master ]

    Alright, so the Github Actions config above tells Github to trigger this workflow whenever someone pushes to the master branch or creates a PR against it.
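One caveat: with this trigger config, the pull_request event would also run the deploy steps we add later, so every PR against master would deploy. If you’d rather deploy only on pushes to master, a narrower trigger could look like this (a sketch, assuming no separate test job is needed for PRs):

```yaml
name: Serverless S3 Deploy
on:
  push:
    branches: [ master ]
```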

    As of now, our action config is incomplete and does nothing. Let’s add our first and only job to the workflow:

    name: Serverless S3
     
    on:
     push:
       branches: [ master ]
     pull_request:
       branches: [ master ]
     
    jobs:
     build:
       runs-on: ubuntu-latest
       strategy:
         matrix:
           node-version: [10.x]
       steps:
       - uses: actions/checkout@v2

    Let’s try to digest the config above:

runs-on: ubuntu-latest

The runs-on statement specifies which executor will run the job. In this case, it’s the latest release of the Ubuntu Linux image.

strategy:
  matrix:
    node-version: [10.x]

The strategy block defines the environments we want to run our job on. This is usually useful when we want to run tests against multiple versions or machines at once. In our case, we don’t need that, so we’ll use a single Node environment with version 10.x.

steps:
- uses: actions/checkout@v2

In the configuration’s steps block, we define the tasks to be performed sequentially within a job. actions/checkout@v2 checks out the current branch for us. This step is required so the following steps can operate on our source code.

    This bare minimum setup is required for running a job in our Github workflows. After this, we will need to set up the environment and deploy our application. So, let’s add the rest of the steps to it.

    name: Serverless S3
     
    on:
     push:
       branches: [ master ]
     pull_request:
       branches: [ master ]
     
    jobs:
     build:
       runs-on: ubuntu-latest
       strategy:
         matrix:
           node-version: [10.x]
       steps:
       - uses: actions/checkout@v2
       - name: Use Node.js ${{ matrix.node-version }}
         uses: actions/setup-node@v1
         with:
           node-version: ${{ matrix.node-version }}
       - run: yarn install
       - run: yarn build
       - name: serverless deploy s3
         uses: serverless/github-action@master
         with:
           args: client deploy --no-confirm
         env:
           AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
           AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}

These steps deploy our frontend assets to the S3 bucket. Reading through them, we’re doing the following things in sequence:

1. Check out the current branch’s code

2. Set up our Node.js environment

3. Install our dependencies with yarn install

4. Build our production build with yarn build

5. Deploy our build to S3 with serverless client deploy --no-confirm

• The uses block defines which custom action we’re using
• The args block allows us to pass arguments to the action
• The --no-confirm flag is needed so Serverless Finch does not ask for confirmation while deploying to S3 buckets
• env allows us to pass custom environment variables to an action
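As an optional optimization (not part of the original workflow), yarn install can be sped up by caching dependencies with the official actions/cache action. A sketch of a step you could place before yarn install:

```yaml
    - uses: actions/cache@v2
      with:
        path: '**/node_modules'
        key: ${{ runner.os }}-yarn-${{ hashFiles('**/yarn.lock') }}
```

On a cache hit, node_modules is restored and the install step finishes almost instantly; the cache key changes whenever yarn.lock changes.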

Alright, now we have the CD workflow set up to deploy our app. Make a commit and push to the master branch; this should trigger our workflow. You can see your workflow running in the Actions section of your repository like this:

    You can check the output of the serverless deploy step and browse the S3 website URL. It should now show our application running.

    Creating CircleCI Workflow

    To start building a repository, we need to authorize it with our Github account. You can do that by signing up for CircleCI and following the steps here.

Just as we added the IAM credentials as secrets for our Github Actions workflow, we can set up environment variables for our workflows in CircleCI. This is how they should look once configured in the project settings:

Just like with Github Actions, we can create workflows in CircleCI. CircleCI also allows us to use third-party plugins, called Orbs, in our deployment workflows.

    We’ll need the official CircleCI distributions of the aws-cli, serverless-framework, and node.js orbs for our deploy workflow. Let’s create our first job for our workflow:

    version: 2.1
     
    orbs:
     aws-cli: circleci/aws-cli@1.0.0
     serverless: circleci/serverless-framework@1.0.1
     node: circleci/node@4.1.0
     
    jobs:
     deploy:
       executor: serverless/default

The executor here is a prebuilt image that ships with the Serverless framework installed, giving our job an environment to run in.

Just like we defined steps for our jobs in Github Actions, we can add them for CircleCI. Here, we’re using commands made available by the node orb to install dependencies, build the project, and set up Serverless with AWS. And just like we set up secrets for Github Actions, we need to define our AWS credentials as CircleCI environment variables.

    version: 2.1
     
    orbs:
     aws-cli: circleci/aws-cli@1.0.0
     serverless: circleci/serverless-framework@1.0.1
     node: circleci/node@4.1.0
     
    jobs:
     deploy:
       executor: serverless/default
       steps:
         - checkout
         - node/install-yarn
         - run:
             name: install
             command: yarn install
         - run:
             name: build
             command: yarn build
         - aws-cli/setup
         - serverless/setup:
             app-name: serverless-s3
             org-name: velotio
         - run:
             name: deploy
             command: serverless client deploy --no-confirm
    workflows:
     deploy:
       jobs:
         - deploy:
             filters:
               branches:
                 only:
                   - master

Just like we walked through the steps of the Github Actions deploy job, here is what the CircleCI job does:

1. Check out the code
2. Install the yarn package manager with node/install-yarn
3. Install dependencies with yarn install
4. Build the project with yarn build
5. Set up the AWS and Serverless CLIs
6. Deploy to S3 with serverless client deploy --no-confirm

The workflows block in the config above tells CircleCI to run the deploy job, and the filters block ensures the job runs only when the master branch gets updated.
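As an alternative to project-level environment variables, CircleCI contexts let you hold the AWS credentials in one place and share them across projects. A sketch (the context name aws-serverless is our own assumption, not from the original setup):

```yaml
workflows:
  deploy:
    jobs:
      - deploy:
          # Hypothetical context containing AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY
          context: aws-serverless
          filters:
            branches:
              only:
                - master
```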

    Once we’re done with the above setup, we can make a test commit and check whether our workflow is running.

    Conclusion

We can easily integrate build and deployment workflows with the simple configuration offered by Github Actions. If Github isn’t our primary version control platform, we can opt for CircleCI for our workflows.

    Related Articles

    1. Automating Serverless Framework Deployment using Watchdog
    2. To Go Serverless Or Not Is The Question

    You can find the referenced code at this repo.