Category: Software Engineering & Architecture

  • Implementing gRPC In Python: A Step-by-step Guide

In the last few years, we have seen a great shift in technology, with projects moving from the old "monolithic architecture" towards a "microservice architecture". This approach has done wonders for us.

As the saying goes, "smaller things are much easier to handle," and microservices can indeed be handled conveniently. But these microservices need to interact with each other. I initially handled this with HTTP API calls, which seemed great and worked for me.

    But is this the perfect way to do things?

The answer is a resounding "no," because we compromise both speed and efficiency here.

Then the gRPC framework came into the picture, and it has been a game-changer.

    What is gRPC?

Quoting the official documentation:

"gRPC or Google Remote Procedure Call is a modern open-source high-performance RPC framework that can run in any environment. It can efficiently connect services in and across data centers with pluggable support for load balancing, tracing, health checking and authentication."

(Image credit: gRPC)

RPC, or remote procedure call, is a message that a client sends to a remote system to get a task (or subroutine) executed there.

    Google’s RPC is designed to facilitate smooth and efficient communication between the services. It can be utilized in different ways, such as:

    • Efficiently connecting polyglot services in microservices style architecture
• Connecting mobile devices and browser clients to backend services
    • Generating efficient client libraries
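To make the idea of a remote procedure call concrete before diving into gRPC, here is a toy round trip using Python's built-in xmlrpc modules (plain RPC over HTTP, not gRPC; the `greet` function and the local address are made up for illustration):

```python
import threading
from xmlrpc.server import SimpleXMLRPCServer
from xmlrpc.client import ServerProxy

# The "remote" procedure: it runs inside the server, but the client
# invokes it as if it were a local function call.
def greet(name):
    return f"Hello, {name}"

# Bind to port 0 so the OS picks a free port, and serve in a background thread.
server = SimpleXMLRPCServer(("127.0.0.1", 0), logRequests=False)
server.register_function(greet)
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()

# The client calls greet() through a proxy; the argument and the return
# value travel over the wire behind the scenes.
client = ServerProxy(f"http://127.0.0.1:{port}")
result = client.greet("world")
server.shutdown()
```

gRPC applies the same idea, but swaps XML over HTTP/1.1 for protocol buffers over HTTP/2.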

    Why gRPC? 

HTTP/2-based transport – gRPC uses the HTTP/2 protocol instead of HTTP/1.1. HTTP/2 provides multiple benefits over its predecessor; one major benefit is that multiple bidirectional streams can be created and sent in parallel over a single TCP connection, making it swift.

    Auth, tracing, load balancing and health checking – gRPC provides all these features, making it a secure and reliable option to choose.

Language-independent communication – Two services may be written in different languages, say Python and Golang. gRPC ensures smooth communication between them.

Use of Protocol Buffers – gRPC uses protocol buffers to define the structure of the data sent between the gRPC client and the gRPC server (acting as its Interface Definition Language, or IDL). It also uses them as its message interchange format.

Let's dig a little deeper into what Protocol Buffers are.

    Protocol Buffers

Protocol buffers, like XML, are an efficient and automated mechanism for serializing structured data. They provide a way to define the structure of the data to be transmitted. Google says that protocol buffers are better than XML, as they are:

    • simpler
    • three to ten times smaller
    • 20 to 100 times faster
    • less ambiguous
• able to generate data access classes that make it easier to work with the data programmatically

Protocol buffers are defined in .proto files, and defining them is easy.
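To get a feel for where the size savings come from: protobuf stores numbers as compact binary varints rather than as text between tags. A minimal sketch of the base-128 varint encoding used on the wire (illustrative only, not a full protobuf implementation):

```python
def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer in protobuf's base-128 varint format:
    7 payload bits per byte, high bit set on every byte except the last."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)

# The number 300 fits in two bytes on the wire...
wire = encode_varint(300)
# ...while the textual "<count>300</count>" takes 18 bytes.
xml_size = len("<count>300</count>")
```

This is one reason protobuf payloads come out several times smaller than their XML equivalents.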

    Types of gRPC implementation

1. Unary RPCs: This is the simplest kind of RPC; it works like a normal function call. The client sends a single request declared in the .proto file to the server and gets back a single response from the server.

    rpc HelloServer(RequestMessage) returns (ResponseMessage);

    2. Server streaming RPCs:- The client sends a message declared in the .proto file to the server and gets back a stream of message sequence to read. The client reads from that stream of messages until there are no messages.

    rpc HelloServer(RequestMessage) returns (stream ResponseMessage);

    3. Client streaming RPCs:- The client writes a message sequence using a write stream and sends the same to the server. After all the messages are sent to the server, the client waits for the server to read all the messages and return a response.

    rpc HelloServer(stream RequestMessage) returns (ResponseMessage);

    4. Bidirectional streaming RPCs:- Both gRPC client and the gRPC server use a read-write stream to send a message sequence. Both operate independently, so gRPC clients and gRPC servers can write and read in any order they like, i.e. the server can read a message then write a message alternatively, wait to receive all messages then write its responses, or perform reads and writes in any other combination.

    rpc HelloServer(stream RequestMessage) returns (stream ResponseMessage);

Note: gRPC guarantees the ordering of messages within an individual RPC call. In the case of bidirectional streaming, the order of messages is preserved within each stream.
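In the Python API, these four shapes map directly onto ordinary values and iterators: a streaming request arrives as an iterator, and a streaming response is produced by yielding. A plain-Python sketch of the shapes (no gRPC machinery involved; the handler names are made up):

```python
# Server streaming: the handler is a generator that yields one
# response per message it wants to send.
def server_streaming(request):
    for i in range(3):
        yield f"reply {i} to {request}"

# Client streaming: the handler consumes an iterator of requests
# and returns a single response.
def client_streaming(request_iterator):
    count = sum(1 for _ in request_iterator)
    return f"received {count} messages"

# Bidirectional streaming: consume and yield, interleaved as you like.
def bidirectional(request_iterator):
    for message in request_iterator:
        yield message  # echo, like the server later in this post

replies = list(server_streaming("ping"))
summary = client_streaming(iter(["a", "b"]))
echoed = list(bidirectional(iter(["x", "y"])))
```

Keeping this mapping in mind makes the servicer implementations below easier to read.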

    Implementing gRPC in Python

Currently, gRPC provides support for many languages like Golang, C++, Java, etc. I will be focusing on its implementation in Python.

    mkdir grpc_example
    cd grpc_example
    virtualenv -p python3 env
    source env/bin/activate
    pip install grpcio grpcio-tools

    This will install all the required dependencies to implement gRPC.

    Unary gRPC 

For implementing gRPC services, we need to define three files:

• Proto file – The proto file comprises the declaration of the service and is used to generate the stubs (<package_name>_pb2.py and <package_name>_pb2_grpc.py). These are used by the gRPC client and the gRPC server.
    • gRPC client – The client makes a gRPC call to the server to get the response as per the proto file.
• gRPC Server – The server is responsible for serving the client's requests.
    syntax = "proto3";
    
    package unary;
    
    service Unary{
      // A simple RPC.
      //
      // Obtains the MessageResponse at a given position.
     rpc GetServerResponse(Message) returns (MessageResponse) {}
    
    }
    
    message Message{
     string message = 1;
    }
    
    message MessageResponse{
     string message = 1;
     bool received = 2;
    }

In the above code, we have declared a service named Unary. It consists of a collection of RPCs. For now, I have implemented a single RPC, GetServerResponse(). It takes an input of type Message and returns a MessageResponse. Below the service declaration, I have declared Message and MessageResponse.

Once we are done with the creation of the .proto file, we need to generate the stubs. For that, we will execute the below command:

    python -m grpc_tools.protoc --proto_path=. ./unary.proto --python_out=. --grpc_python_out=.

    Two files are generated named unary_pb2.py and unary_pb2_grpc.py. Using these two stub files, we will implement the gRPC server and the client.

    Implementing the Server

    import grpc
    from concurrent import futures
    import time
    import unary.unary_pb2_grpc as pb2_grpc
    import unary.unary_pb2 as pb2
    
    
    class UnaryService(pb2_grpc.UnaryServicer):
    
        def __init__(self, *args, **kwargs):
            pass
    
        def GetServerResponse(self, request, context):
    
            # get the string from the incoming request
            message = request.message
            result = f'Hello I am up and running received "{message}" message from you'
            result = {'message': result, 'received': True}
    
            return pb2.MessageResponse(**result)
    
    
    def serve():
        server = grpc.server(futures.ThreadPoolExecutor(max_workers=10))
        pb2_grpc.add_UnaryServicer_to_server(UnaryService(), server)
        server.add_insecure_port('[::]:50051')
        server.start()
        server.wait_for_termination()
    
    
    if __name__ == '__main__':
        serve()

    In the gRPC server file, there is a GetServerResponse() method which takes `Message` from the client and returns a `MessageResponse` as defined in the proto file.

The serve() function is called from the main block and keeps the server listening at all times. We will run unary_server.py to start the server:

    python3 unary_server.py

    Implementing the Client

    import grpc
    import unary.unary_pb2_grpc as pb2_grpc
    import unary.unary_pb2 as pb2
    
    
    class UnaryClient(object):
        """
        Client for gRPC functionality
        """
    
        def __init__(self):
            self.host = 'localhost'
            self.server_port = 50051
    
            # instantiate a channel
            self.channel = grpc.insecure_channel(
                '{}:{}'.format(self.host, self.server_port))
    
            # bind the client and the server
            self.stub = pb2_grpc.UnaryStub(self.channel)
    
        def get_url(self, message):
            """
            Client function to call the rpc for GetServerResponse
            """
            message = pb2.Message(message=message)
            print(f'{message}')
            return self.stub.GetServerResponse(message)
    
    
    if __name__ == '__main__':
        client = UnaryClient()
        result = client.get_url(message="Hello Server you there?")
        print(f'{result}')

In the __init__ function, we have initialized the stub using `self.stub = pb2_grpc.UnaryStub(self.channel)`. We also have a get_url function, which calls the server using the above-initialized stub.

    This completes the implementation of Unary gRPC service.

    Let’s check the output:-

    Run -> python3 unary_client.py 

    Output:-

message: "Hello Server you there?"

message: "Hello I am up and running received \"Hello Server you there?\" message from you"
received: true

    Bidirectional Implementation

    syntax = "proto3";
    
    package bidirectional;
    
    service Bidirectional {
      // A Bidirectional streaming RPC.
      //
      // Accepts a stream of Message sent while a route is being traversed,
       rpc GetServerResponse(stream Message) returns (stream Message) {}
    }
    
    message Message {
      string message = 1;
    }

    In the above code, we have declared a service named Bidirectional. It consists of a collection of services. For now, I have implemented a single service GetServerResponse(). This service takes an input of type Message and returns a Message. Below the service declaration, I have declared Message.

Once we are done with the creation of the .proto file, we need to generate the stubs. To generate them, we need to execute the below command:

python -m grpc_tools.protoc --proto_path=. ./bidirectional.proto --python_out=. --grpc_python_out=.

    Two files are generated named bidirectional_pb2.py and bidirectional_pb2_grpc.py. Using these two stub files, we will implement the gRPC server and client.

    Implementing the Server

    from concurrent import futures
    
    import grpc
    import bidirectional.bidirectional_pb2_grpc as bidirectional_pb2_grpc
    
    
    class BidirectionalService(bidirectional_pb2_grpc.BidirectionalServicer):
    
        def GetServerResponse(self, request_iterator, context):
            for message in request_iterator:
                yield message
    
    
    def serve():
        server = grpc.server(futures.ThreadPoolExecutor(max_workers=10))
        bidirectional_pb2_grpc.add_BidirectionalServicer_to_server(BidirectionalService(), server)
        server.add_insecure_port('[::]:50051')
        server.start()
        server.wait_for_termination()
    
    
    if __name__ == '__main__':
        serve()

In the gRPC server file, there is a GetServerResponse() method which takes a stream of `Message` from the client and returns a stream of `Message`, each independent of the other. The serve() function is called from the main block and keeps the server listening at all times.

    We will run the bidirectional_server to start the server:

    python3 bidirectional_server.py

    Implementing the Client

    from __future__ import print_function
    
    import grpc
    import bidirectional.bidirectional_pb2_grpc as bidirectional_pb2_grpc
    import bidirectional.bidirectional_pb2 as bidirectional_pb2
    
    
    def make_message(message):
        return bidirectional_pb2.Message(
            message=message
        )
    
    
    def generate_messages():
        messages = [
            make_message("First message"),
            make_message("Second message"),
            make_message("Third message"),
            make_message("Fourth message"),
            make_message("Fifth message"),
        ]
        for msg in messages:
            print("Hello Server Sending you the %s" % msg.message)
            yield msg
    
    
    def send_message(stub):
        responses = stub.GetServerResponse(generate_messages())
        for response in responses:
            print("Hello from the server received your %s" % response.message)
    
    
    def run():
        with grpc.insecure_channel('localhost:50051') as channel:
            stub = bidirectional_pb2_grpc.BidirectionalStub(channel)
            send_message(stub)
    
    
    if __name__ == '__main__':
        run()

In the run() function, we have initialized the stub using `stub = bidirectional_pb2_grpc.BidirectionalStub(channel)`.

We also have a send_message function to which the stub is passed; it streams multiple messages to the server and receives the results back from the server simultaneously.

    This completes the implementation of Bidirectional gRPC service.

    Let’s check the output:-

    Run -> python3 bidirectional_client.py 

    Output:-

    Hello Server Sending you the First message

    Hello Server Sending you the Second message

    Hello Server Sending you the Third message

    Hello Server Sending you the Fourth message

    Hello Server Sending you the Fifth message

    Hello from the server received your First message

    Hello from the server received your Second message

    Hello from the server received your Third message

    Hello from the server received your Fourth message

    Hello from the server received your Fifth message

    For code reference, please visit here.

Conclusion

gRPC is an emerging RPC framework that makes communication between microservices smooth and efficient. I believe gRPC is currently confined mostly to inter-microservice communication, but it has many other uses that we will see in the coming years. To know more about modern data communication solutions, check out this blog.

  • How To Use Inline Functions In React Applications Efficiently

    This blog post explores the performance cost of inline functions in a React application. Before we begin, let’s try to understand what inline function means in the context of a React application.

    What is an inline function?

    Simply put, an inline function is a function that is defined and passed down inside the render method of a React component.

    Let’s understand this with a basic example of what an inline function might look like in a React application:

    export default class CounterApp extends React.Component {
      constructor(props) {
        super(props);
        this.state = { count: 0 };
      }
      render() {
        return (
          <div className="App">
            <button
              onClick={() => {
              this.setState({ count: this.state.count + 1 });
              }}
            >COUNT ({this.state.count})</button>
          </div>
        );
      }
    }

    The onClick prop, in the example above, is being passed as an inline function that calls this.setState. The function is defined within the render method, often inline with JSX. In the context of React applications, this is a very popular and widely used pattern.

    Let’s begin by listing some common patterns and techniques where inline functions are used in a React application:

    • Render prop: A component prop that expects a function as a value. This function must return a JSX element, hence the name. Render prop is a good candidate for inline functions.
    render() {
      return (
        <ListView
          items={items}
          render={({ item }) => (<div>{item.label}</div>)}
        />
      );
    }

    • DOM event handlers: DOM event handlers often make a call to setState or invoke some effect in the React application such as sending data to an API server.
    <button
      onClick={() => {
        this.setState({ count: this.state.count + 1 });
      }}>
      COUNT ({this.state.count})
    </button>

• Custom function or event handlers passed to child: Oftentimes, a child component requires a custom event handler to be passed down as props. Inline functions are commonly used in this scenario.
<Button onTap={() => {
  this.nextPage();
}}>Next</Button>

    Alternatives to inline function

    • Bind in constructor: One of the most common patterns is to define the function within the class component and then bind context to the function in constructor. We only need to bind the current context if we want to use this keyword inside the handler function.
    export default class CounterApp extends React.Component {
      constructor(props) {
        super(props);
        this.state = { count: 0 };
        this.increaseCount = this.increaseCount.bind(this);
      }
    
      increaseCount() {
        this.setState({ count: this.state.count + 1 });
      }
    
      render() {
        return (
          <div className="App">
            <button onClick={this.increaseCount}>COUNT ({this.state.count})</button>
          </div>
        );
     }
    }

    • Bind in render: Another common pattern is to bind the context inline when the function is passed down. Eventually, this gets repetitive and hence the first approach is more popular.
    render() {
      return (
        <div className="App">
          <button onClick={this.increaseCount.bind(this)}>COUNT ({this.state.count})</button>
        </div>
      );
    }

    • Define as public field:
    increaseCount = () => {
      this.setState({ count: this.state.count + 1 });
    };
    
    render() {
      return (
        <div className="App">
          <button onClick={this.increaseCount}>
            COUNT ({this.state.count})
          </button>
        </div>
      );
    }

There are several other approaches that the React dev community has come up with, like using a helper method to bind all functions automatically in the constructor.

After understanding inline functions through examples and taking a look at a few alternatives, let's see why inline functions are so popular and widely used.

    Why use inline function

Inline function definitions sit right where they are invoked or passed down. This makes inline functions easier to write, especially when the body of the function is just a few instructions, such as a call to setState. This works well within loops too.

    For example, when rendering a list and assigning a DOM event handler to each list item, passing down an inline function feels much more intuitive. For the same reason, inline functions also make code more organized and readable.

Inline arrow functions preserve context, which means developers can use this without having to worry about the current execution context or explicitly binding a context to the function.

<Button onTap={() => {
  this.prevPage();
}}>Previous</Button>

Inline functions make values from the parent scope available within the function definition. This results in more intuitive code, and developers need to pass down fewer parameters. Let's understand this with an example.

    render() {
      const { count } = this.state;
      return (
        <div className="App">
          <button
            onClick={() => {
              this.setState({ count: count + 1 });
          }}>
            COUNT ({count})
          </button>
        </div>
      );
    }

Here, the value of count is readily available to the onClick event handler. This behavior is called closing over: the inline function forms a closure.

    For these reasons, React developers make use of inline functions heavily. That said, inline function has also been a hot topic of debate because of performance concerns. Let’s take a look at a few of these arguments.

    Arguments against inline functions

    • A new function is defined every time the render method is called. It results in frequent garbage collection, and hence performance loss.
• There is an ESLint rule, jsx-no-bind, that advises against using inline functions. The idea behind this rule is that when an inline function is passed down to a child component, React uses reference checks to decide whether to re-render that component. Since a new inline function is created on every render, its reference never matches the previous one, which can cause the child component to re-render again and again.

<ListItem onClick={() => console.log('click')} />

Suppose the ListItem component implements the shouldComponentUpdate method, where it checks the onClick prop reference. Since a new inline function is created every time the parent re-renders, ListItem receives a function that points to a different location in memory each time. The comparison check in shouldComponentUpdate fails and tells React to re-render ListItem, even though the inline function's behavior hasn't changed. This results in unnecessary DOM updates and eventually reduces the performance of the application.

Performance concerns revolving around Function.prototype.bind: when not using arrow functions, an inline function that uses the this keyword must be bound to a context before being passed down. Calling .bind before passing down an inline function once raised performance concerns, but modern JavaScript engines have largely addressed them. For older browsers, Function.prototype.bind can be supplemented with a polyfill.

    Now that we’ve summarized a few arguments in favor of inline functions and a few arguments against it, let’s investigate and see how inline functions really perform.

    render() {
      return (
        <div>
          {this.state.timeThen > this.state.timeNow ? (
           <>
             <button onClick={() => { /* some action */ }} />
             <button onClick={() => { /* another action */ }} />
           </>
          ) : (
            <button onClick={() => { /* yet another action */ }} />
          )}
        </div>
      );
    }

Premature optimization can often lead to bad code. For instance, let's try to get rid of all the inline function definitions in the component above and move them to the constructor because of performance concerns.

    We’d then have to define 3 custom event handlers in the class definition and bind context to all three functions in the constructor.

    export default class CounterApp extends React.Component {
      constructor(props) {
        super(props);
        this.state = {
          timeThen: ...,
          timeNow: Date.now()
        };
        this.someAction = this.someAction.bind(this);
        this.anotherAction = this.anotherAction.bind(this);
        this.yetAnotherAction = this.yetAnotherAction.bind(this);
      }
    
      someAction() { /* some action */ }
      anotherAction() { /* another action */ }
      yetAnotherAction() { /* yet another action */ }
    
      render() {
        return (<div>
          {this.state.timeThen > this.state.timeNow ? (
            <>
              <button onClick={this.someAction} />
              <button onClick={this.anotherAction} />
            </>
          ) : (
            <button onClick={this.yetAnotherAction} />
          )}</div>);
      }
    }

    This would increase the initialization time of the component significantly as opposed to inline function declarations where only one or two functions are defined and used at a time based on the result of condition timeThen > timeNow.

    Concerns around render props: A render prop is a method that returns a React element and is used to share state among React components.

    Render props are meant to be invoked on each render since they share state between parent components and enclosed React elements. Inline functions are a good candidate for use in render prop and won’t cause any performance concern.

    render() {
      return (
        <ListView
          items={items}
          render={({ item }) => (<div>{item.label}</div>)}
        />
      )
    }

Here, the render prop of the ListView component returns a label enclosed in a div. Since the enclosed component can never know what the value of the item variable is, it can never be a PureComponent or have a meaningful implementation of shouldComponentUpdate(). This eliminates the concerns around the use of inline functions in render props; in fact, it promotes them in most cases.

In my experience, inline render props can sometimes be harder to maintain, especially when the render prop returns a larger, more complicated component in terms of code size. In such cases, breaking down the component further or having a separate method that gets passed down as the render prop has worked well for me.

Concerns around PureComponents and shouldComponentUpdate(): Pure components and various implementations of shouldComponentUpdate both do a shallow comparison of props and state, which uses reference equality for objects and functions. These act as performance enhancers by letting React know when or when not to trigger a render based on changes to state and props. Since inline functions are created on every render, when such a function is passed down to a pure component or a component that implements the shouldComponentUpdate method, it can lead to an unnecessary render. This is because of the changed reference of the inline function.

To overcome this, consider skipping checks on all function props in shouldComponentUpdate(). This assumes that inline functions passed to the component differ only in reference, not in behavior. If the behavior of the passed-down function does differ, this will result in a missed render and eventually lead to bugs in the component's state and effects.

Conclusion

A common rule of thumb is to measure the performance of the app and only optimize if needed. The performance impact of inline functions, often categorized under micro-optimizations, is always a tradeoff between code readability, performance gain, code organization, etc., that must be thought out carefully on a case-by-case basis, and premature optimization should be avoided.

In this blog post, we observed that inline functions don't necessarily bring a lot of performance cost. They are widely used because of the ease of writing, reading and organizing them, especially when the function definitions are short and simple.

  • Setting up S3 & CloudFront to Deliver Static Assets Across the Web

    If you have a web application, you probably have static content. Static content might include files like images, videos, and music. One of the simpler approaches to serve your content on the internet is Amazon AWS’s “S3 Bucket.” S3 is very easy to set up and use.

    Problems with only using S3 to serve your resources

But there are a few limitations to serving content directly from S3. Using S3 alone, you will need to:

• Either keep the bucket public, which is not at all recommended,
• Or create pre-signed URLs to access the private resources. If your application has tons of resources to load, pre-signing each and every resource before serving it on the UI adds a lot of latency.

    For these reasons, we will also use AWS’s CloudFront.

    Why use CloudFront with S3?

Amazon CloudFront, AWS's content delivery network (CDN), is designed to work seamlessly with S3 to serve your S3 content faster. Using CloudFront to serve S3 content also gives you a lot more flexibility and control.

    It has below advantages:

• Using CloudFront provides authentication, so there's no need to generate pre-signed URLs for each resource.
    • Improved Latency, which results in a better end-user experience.
    • CloudFront provides caching, which can reduce the running costs as content is not always served from S3 when cached.
• Another case for using CloudFront over S3 is that you can attach an SSL certificate for a custom domain in CloudFront.

    Setting up S3 & CloudFront

    Creating an S3 bucket

    1. Navigate to S3 from the AWS console and click on Create Bucket. Enter a unique bucket name and select the AWS Region.

2. Make sure the Block Public Access settings for this bucket are set to "Block All Public Access," as recommended; we don't need public access to the bucket.

    3. Review other options and create a bucket. Once a bucket is created, you can see it on the S3 dashboard. Open the bucket to view its details, and next, let’s add some assets.

    4. Click on upload and add/drag all the files or folders you want to upload. 

    5. Review the settings and upload. You can see the status on successful upload. Go to bucket details, and, after opening up the uploaded asset, you can see the details of the uploaded asset.

    If you try to copy the object URL and open it in the browser, you will get the access denied error as we have blocked direct public access. 

We will be using CloudFront to serve the S3 assets in the next step. CloudFront will restrict access to your S3 bucket to CloudFront endpoints, making your content and application more secure and performant.

Creating a CloudFront distribution

    1. Navigate to CloudFront from AWS console and click on Create Distribution. For the Origin domain, select the bucket from which we want to serve the static assets.

2. Next, we need to use a CloudFront origin access identity (OAI) to access the S3 bucket. This will enable us to access private S3 content via CloudFront. To enable this, under S3 bucket access, select "Yes, use OAI." Select an existing origin access identity or create a new one.
You can also choose to update the S3 bucket policy to allow read access to the OAI if it has not already been configured.

    3. Review all the settings and create distribution. You can see the domain name once it is successfully created.

4. The basic setup is done. If you try to access the asset we uploaded via the CloudFront domain in your browser, it should be served. You can access assets at {cloudfront domain name}/{s3 asset},
e.g. https://d1g71lhh75winl.cloudfront.net/sample.jpeg

Even though we successfully served the assets via CloudFront, one thing to note is that all the assets are publicly accessible and not secured. In the next section, we will see how to secure your CloudFront assets.

    Restricting public access

Previously, while configuring CloudFront, we left Restrict Viewer Access set to No, which enabled us to access the assets publicly.

Let's see how to configure CloudFront to enable signed URLs for assets that should have restricted access. We will be using trusted key groups, which are the AWS-recommended way of restricting access.

    Creating key group

    To create a key pair for a trusted key group, perform the following steps:

    1. Creating the public–private key pair.

The below commands will generate an RSA key pair and store the public key and private key in the public_key.pem and private_key.pem files respectively.

    openssl genrsa -out private_key.pem 2048
    openssl rsa -pubout -in private_key.pem -out public_key.pem

    Note: The above steps use OpenSSL as an example to create a key pair. There are other ways to create an RSA key pair as well.

    2. Uploading the Public Key to CloudFront.

To upload, open the CloudFront console in the AWS console and navigate to Public Key. Choose Create Public Key. Add a name, then copy and paste the contents of the public_key.pem file under Key. Once done, click Create Public Key.

    3. Adding the public key to a Key Group.

    To do this, navigate to Key Groups. Add name and select the public key we created. Once done, click Create Key Group.

    Adding key group signer to distribution

    1. Navigate to CloudFront and choose the distribution whose files you want to protect with signed URLs or signed cookies.
    2. Navigate to the Behaviors tab. Select the cache behavior, and then choose Edit.
    3. For Restrict Viewer Access (Use Signed URLs or Signed Cookies), choose Yes and choose Trusted Key Groups.
    4. For Trusted Key Groups, select the key group, and then choose Add.
    5. Once done, review and Save Changes.

    Cheers, you have successfully restricted public access to the assets. If you try to open any asset URL in the browser now, you will get an error response instead of the asset.

    You can create either signed URLs or signed cookies using the private key to access the assets.

    Setting cookies and accessing CloudFront private urls

    You need to create and set cookies on the domain to access your content. Once cookies are set, they will be sent along with every request by the browser.

    The cookies to be set are:

    • CloudFront-Policy: Your policy statement in JSON format, with white space removed, then base64 encoded.
    • CloudFront-Signature: A version of the JSON policy statement that is hashed, signed with the private key, and base64-encoded.
    • CloudFront-Key-Pair-Id: The ID for a CloudFront public key, e.g., K4EGX7PEAN4EN. The public key ID tells CloudFront which public key to use to validate the signed URL.

    Please note that the cookie names are case-sensitive. Make sure the cookies are marked HttpOnly and Secure.

    Set-Cookie: 
    CloudFront-Policy=base64 encoded version of the policy statement; 
    Domain=optional domain name; 
    Path=/optional directory path; 
    Secure; 
    HttpOnly
    
    
    Set-Cookie: 
    CloudFront-Signature=hashed and signed version of the policy statement; 
    Domain=optional domain name; 
    Path=/optional directory path; 
    Secure; 
    HttpOnly
    
    Set-Cookie: 
    CloudFront-Key-Pair-Id=public key ID for the CloudFront public key whose corresponding private key you're using to generate the signature; 
    Domain=optional domain name; 
    Path=/optional directory path; 
    Secure; 
    HttpOnly

    Cookies can be created in any language you are working in, with the help of the AWS SDK. For this blog, we will create cookies in Python using the botocore module.

    import datetime
    import functools
    
    import rsa
    from botocore.signers import CloudFrontSigner
    
    # In the format "{protocol}://{domain}/{resource}", e.g. "https://d1g71lhh75winl.cloudfront.net/*"
    CLOUDFRONT_RESOURCE = "https://d1g71lhh75winl.cloudfront.net/*"
    # The ID of the CloudFront public key
    CLOUDFRONT_PUBLIC_KEY_ID = "<your public key ID>"
    # Contents of the private_key.pem file associated with the public key
    CLOUDFRONT_PRIVATE_KEY = open("private_key.pem", "rb").read()
    # Datetime at which the cookies expire
    EXPIRES_AT = datetime.datetime.now() + datetime.timedelta(hours=1)
    
    # load the private key
    key = rsa.PrivateKey.load_pkcs1(CLOUDFRONT_PRIVATE_KEY)
    # create a signer function that signs a message with the private key
    rsa_signer = functools.partial(rsa.sign, priv_key=key, hash_method="SHA-1")
    # create a CloudFrontSigner botocore object
    signer = CloudFrontSigner(CLOUDFRONT_PUBLIC_KEY_ID, rsa_signer)
    
    # build the CloudFront policy
    policy = signer.build_policy(CLOUDFRONT_RESOURCE, EXPIRES_AT).encode("utf8")
    CLOUDFRONT_POLICY = signer._url_b64encode(policy).decode("utf8")
    
    # create the CloudFront signature
    signature = rsa_signer(policy)
    CLOUDFRONT_SIGNATURE = signer._url_b64encode(signature).decode("utf8")
    
    # set these cookies on the response
    COOKIES = {
        "CloudFront-Policy": CLOUDFRONT_POLICY,
        "CloudFront-Signature": CLOUDFRONT_SIGNATURE,
        "CloudFront-Key-Pair-Id": CLOUDFRONT_PUBLIC_KEY_ID,
    }

    For more details, you can follow AWS official docs.
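    How you attach these cookies depends on your web framework. As an illustrative sketch (the route and placeholder values are hypothetical), here is how the COOKIES dict built above could be set on a Flask response with the Secure and HttpOnly flags:

```python
from flask import Flask, make_response

app = Flask(__name__)

# Placeholder values; in practice, use the COOKIES dict built above
COOKIES = {
    "CloudFront-Policy": "base64-encoded-policy",
    "CloudFront-Signature": "base64-encoded-signature",
    "CloudFront-Key-Pair-Id": "K4EGX7PEAN4EN",
}

@app.route("/grant-access")
def grant_access():
    resp = make_response("Access granted")
    for name, value in COOKIES.items():
        # Cookie names are case-sensitive; mark each cookie Secure and HttpOnly
        resp.set_cookie(name, value, secure=True, httponly=True, path="/")
    return resp
```

    Hitting /grant-access from the browser would then set the three cookies, which the browser sends along with subsequent requests to CloudFront.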

    Once you set cookies using the above guide, you should be able to access the asset.

    This is how you can effectively use CloudFront along with S3 to securely serve your content.

  • Web Scraping: Introduction, Best Practices & Caveats

    Web scraping is a process to crawl various websites and extract the required data using spiders. This data is processed in a data pipeline and stored in a structured format. Today, web scraping is widely used and has many use cases:

    • Using web scraping, Marketing & Sales companies can fetch lead-related information.
    • Web scraping is useful for Real Estate businesses to get the data of new projects, resale properties, etc.
    • Price comparison portals, like Trivago, extensively use web scraping to get the information of product and price from various e-commerce sites.

    The process of web scraping usually involves spiders, which fetch the HTML documents from relevant websites, extract the needed content based on the business logic, and finally store it in a specific format. This blog is a primer on building highly scalable scrapers. We will cover the following items:

    1. Ways to scrape: We’ll see basic ways to scrape data using techniques and frameworks in Python with some code snippets.
    2. Scraping at scale: Scraping a single page is straightforward, but there are challenges in scraping millions of websites, including managing the spider code, collecting data, and maintaining a data warehouse. We’ll explore such challenges and their solutions to make scraping easy and accurate.
    3. Scraping Guidelines: Scraping data from websites without the owner’s permission can be deemed malicious. Certain guidelines need to be followed to ensure our scrapers are not blacklisted. We’ll look at some of the best practices one should follow for crawling.

    So let’s start scraping. 

    Different Techniques for Scraping

    Here, we will discuss how to scrape a page and the different libraries available in Python.

    Note: Python is the most popular language for scraping.  

    1. Requests – HTTP Library in Python: To scrape a website or a page, first fetch the content of the HTML page into an HTTP response object. The requests library from Python is pretty handy and easy to use; it uses urllib3 under the hood. I like ‘requests’ because it’s easy and the code stays readable too.

    #Example showing how to use the requests library
    import requests
    r = requests.get("https://velotio.com") #Fetch HTML Page

    2. BeautifulSoup: Once you get the webpage, the next step is to extract the data. BeautifulSoup is a powerful Python library that helps you extract the data from the page. It’s easy to use and has a wide range of APIs that’ll help you extract the data. We use the requests library to fetch an HTML page and then use the BeautifulSoup to parse that page. In this example, we can easily fetch the page title and all links on the page. Check out the documentation for all the possible ways in which we can use BeautifulSoup.

    from bs4 import BeautifulSoup
    import requests
    
    r = requests.get("https://velotio.com")  # Fetch HTML page
    soup = BeautifulSoup(r.text, "html.parser")  # Parse HTML page
    print("Webpage Title: " + soup.title.string)
    print("All links:", soup.find_all('a'))

    3. Python Scrapy Framework:

    Scrapy is a Python-based web scraping framework that allows you to create different kinds of spiders to fetch the source code of a target website. Scrapy starts crawling the web pages present on a certain website, and then you write the extraction logic to get the required data. Scrapy is built on top of Twisted, a Python-based asynchronous networking library that performs requests asynchronously to boost spider performance. Scrapy is faster than BeautifulSoup. Moreover, it is a framework for writing scrapers, as opposed to BeautifulSoup, which is just a library for parsing HTML pages.

    Here is a simple example of how to use Scrapy. Install Scrapy via pip. Scrapy gives a shell after parsing a website:

    $ pip install scrapy #Install Scrapy
    $ scrapy shell https://velotio.com
    In [1]: response.xpath("//a").extract() #Fetch all a hrefs

    Now, let’s write a custom spider to parse a website.

    $ cat > myspider.py <<EOF
    import scrapy
    
    class BlogSpider(scrapy.Spider):
        name = 'blogspider'
        start_urls = ['https://blog.scrapinghub.com']
    
        def parse(self, response):
            for title in response.css('h2.entry-title'):
                yield {'title': title.css('a ::text').extract_first()}
    EOF
    $ scrapy runspider myspider.py

    That’s it. Your first custom spider is created. Now, let’s understand the code.

    • name: Name of the spider. In this case, it’s “blogspider”.
    • start_urls: A list of URLs where the spider will begin to crawl from.
    • parse(self, response): This function is called whenever the crawler successfully crawls a URL. The response object used earlier in the Scrapy shell is the same response object that is passed to the parse(..).

    When you run this, Scrapy will fetch the start URL, select all the h2 elements with the entry-title class, and extract the associated text from them. You can write your extraction logic in the parse method, or create a separate class for extraction and call its object from the parse method.

    You’ve seen how to extract simple items from a website using Scrapy, but this is just the surface. Scrapy provides a lot of powerful features for making scraping easy and efficient. Here is a tutorial for Scrapy and the additional documentation for LinkExtractor by which you can instruct Scrapy to extract links from a web page.

    4. Python lxml.html library: This is another Python library, much like BeautifulSoup; in fact, Scrapy uses lxml internally. It comes with a list of APIs you can use for data extraction. Why would you use it when Scrapy can extract the data itself? Say you want to iterate over every ‘div’ tag and perform some operation on each tag present under “div”: this library gives you a list of ‘div’ tags, which you can iterate over with the iter() function to traverse each child tag inside the parent div tag. Such traversing operations are difficult in scraping. Here is the documentation for this library.
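    A small sketch of that traversal, with an inline page standing in for a fetched document:

```python
from lxml import html

# Inline HTML stands in for a page fetched with requests
page = """
<html><body>
  <div class="listing">
    <a href="/a">Item A</a>
    <a href="/b">Item B</a>
  </div>
</body></html>
"""

doc = html.fromstring(page)
links = []
# Grab each div, then traverse its child tags with iter()
for div in doc.xpath("//div"):
    for child in div.iter("a"):
        links.append((child.get("href"), child.text))

print(links)  # [('/a', 'Item A'), ('/b', 'Item B')]
```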

    Challenges while Scraping at Scale

    Let’s look at the challenges and solutions while scraping at large scale, i.e., scraping 100-200 websites regularly:

    1. Data warehousing: Data extraction at a large scale generates vast volumes of information. Fault tolerance, scalability, security, and high availability are must-have features for a data warehouse. If your data warehouse is not stable or accessible, then operations like search and filter over the data become an overhead. To achieve this, instead of maintaining your own database or infrastructure, you can use Amazon Web Services (AWS). You can use RDS (Relational Database Service) for a structured database and DynamoDB for a non-relational database. AWS takes care of data backups: it automatically takes snapshots of the database and gives you database error logs as well. This blog explains how to set up infrastructure in the cloud for scraping.

    2. Pattern Changes: Scraping heavily relies on the user interface and its structure, i.e., CSS and XPath selectors. If the target website changes, our scraper may crash completely or return random data that we don’t want. This is a common scenario, and it’s why maintaining scrapers is harder than writing them. To handle this case, we can write test cases for the extraction logic and run them daily, either manually or from CI tools like Jenkins, to detect whether the target website has changed.
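    As a sketch, such a daily check can be as small as one fixture plus one assertion (the selector and the fixture markup here are hypothetical):

```python
from bs4 import BeautifulSoup

def extract_title(html_text):
    # Extraction logic under test; the selector is site-specific
    soup = BeautifulSoup(html_text, "html.parser")
    tag = soup.select_one("h1.article-title")
    return tag.get_text(strip=True) if tag else None

def test_extract_title():
    fixture = '<html><body><h1 class="article-title">Hello</h1></body></html>'
    # Starts failing as soon as the target site changes its markup
    assert extract_title(fixture) == "Hello"

test_extract_title()
```

    Running this from CI against a freshly fetched page turns silent markup changes into loud test failures.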

    3. Anti-scraping Technologies: Web scraping is a common thing these days, and every website host would want to prevent their data from being scraped; anti-scraping technologies help them do that. For example, if you hit a particular website from the same IP address at a regular interval, the target website can block your IP. Adding a captcha to a website also helps. There are methods to bypass these anti-scraping measures, e.g., using proxy servers to hide our original IP. Several proxy services rotate the IP before each request. It is also easy to add support for proxy servers in code, and in Python, the Scrapy framework supports it.
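    With the requests library, for example, routing traffic through a proxy is a matter of one extra argument (the proxy endpoint and credentials below are placeholders):

```python
import requests

# Placeholder rotating-proxy endpoint; substitute your provider's address
proxies = {
    "http": "http://user:pass@proxy.example.com:8080",
    "https": "http://user:pass@proxy.example.com:8080",
}

# Uncomment once a real proxy is configured:
# r = requests.get("https://velotio.com", proxies=proxies)
```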

    4. JavaScript-based dynamic content: Websites that heavily rely on JavaScript and Ajax to render dynamic content make data extraction difficult. Scrapy and related frameworks/libraries will only work with what they find in the HTML document; Ajax calls and JavaScript execute at runtime, so they can’t be scraped directly. This can be handled by rendering the web page in a headless browser such as Headless Chrome, which essentially allows running Chrome in a server environment. You can also use PhantomJS, which provides a headless WebKit-based environment.

    5. Honeypot traps: Some websites place honeypot traps on their webpages to detect web crawlers. These are hard to detect, as most of the links are blended with the background color or have their CSS display property set to none. Implementing them requires large coding efforts on the server side, and detecting them requires effort on the crawler side, hence this method is not frequently used.

    6. Quality of data: Currently, AI and ML projects are in high demand, and these projects need data at large scale. Data integrity is also important, as one fault can cause serious problems in AI/ML algorithms. So, in scraping, it is very important not just to scrape the data but to verify its integrity as well. Doing this in real time is not always possible, so I prefer to write test cases for the extraction logic to make sure whatever your spiders are extracting is correct and they are not scraping any bad data.

    7. More Data, More Time: This one is obvious. The larger a website is, the more data it contains, and the longer it takes to scrape that site. This may be fine if your purpose for scanning the site isn’t time-sensitive, but that isn’t often the case. Stock prices don’t stay the same over hours. Sales listings, currency exchange rates, media trends, and market prices are just a few examples of time-sensitive data. What to do in this case, then? One solution is to design your spiders carefully. If you’re using a Scrapy-like framework, apply proper LinkExtractor rules so the spider will not waste time scraping unrelated URLs.

    You may use multithreaded scraping packages available in Python, such as Frontera and Scrapy-Redis. Frontera lets you send out only one request per domain at a time but can hit multiple domains at once, making it great for parallel scraping. Scrapy-Redis lets you send out multiple requests to one domain. The right combination of these can result in a very powerful web spider that can handle both the bulk and variation of large websites.

    8. Captchas: Captchas are a good way of keeping crawlers away from a website, and many website hosts use them. To scrape data from such websites, we need a mechanism to solve the captchas. There are packages and services that can solve captchas and act as middleware between the target website and your spider. You may also use libraries like Pillow and Tesseract in Python to solve simple image-based captchas.

    9. Maintaining Deployment: Normally, we don’t want to limit ourselves to scraping just a few websites. We want the maximum amount of data present on the Internet, which may mean scraping millions of websites. You can imagine the size of the code and the deployment; we can’t run spiders at this scale from a single machine. What I prefer here is to dockerize the scrapers and take advantage of technologies like AWS ECS and Kubernetes to run the scraper containers. This keeps our scrapers highly available and easy to maintain, and we can schedule them to run at regular intervals.

    Scraping Guidelines/ Best Practices

    1. Respect the robots.txt file: robots.txt is a text file that webmasters create to instruct search engine robots on how to crawl and index pages on the website. Before even planning the extraction logic, you should check this file; you can usually find it at the root of the website, e.g., https://example.com/robots.txt. It holds the rules for how crawlers should interact with the website. For example, if a website links to downloads of critical information, it probably doesn’t want to expose those to crawlers. Another important setting is the crawl-delay, i.e., the interval at which crawlers may hit the website. If someone has asked not to crawl their website, we had better not do it, because if they catch your crawlers, it can lead to serious legal issues.
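    Python’s standard library can parse robots.txt for you. A small offline sketch (the rules and the “mybot” agent are made up; in practice, point set_url at the real file and call read()):

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.modified()  # mark the rules as freshly read, since we bypass read() below
# parse() accepts the file's lines directly, handy for an offline example
rp.parse("""
User-agent: *
Disallow: /admin/
Crawl-delay: 10
""".splitlines())

print(rp.can_fetch("mybot", "https://example.com/admin/secret"))  # False
print(rp.can_fetch("mybot", "https://example.com/blog/post"))     # True
print(rp.crawl_delay("mybot"))                                    # 10
```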

    2. Do not hit the servers too frequently: As mentioned above, some websites specify a crawl-delay for crawlers. Use it wisely, because not every website is tested against high load. Hitting a server at a constant, aggressive rate creates huge traffic on the server side, and it may crash or fail to serve other requests. This has a high impact on user experience, and users are more important than bots. So, make requests according to the interval specified in robots.txt, or use a standard delay of 10 seconds. This also helps you avoid getting blocked by the target website.
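    If you’re using Scrapy, this throttling lives in settings.py; these are standard Scrapy settings, with the 10-second figure taken from the rule of thumb above:

```python
# settings.py
ROBOTSTXT_OBEY = True        # honor the target site's robots.txt
DOWNLOAD_DELAY = 10          # seconds to wait between requests to the same site
AUTOTHROTTLE_ENABLED = True  # adapt the delay to the server's response times
```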

    3. User-Agent Rotation and Spoofing: Every request carries a User-Agent string in the header. This string identifies the browser you are using, its version, and the platform. If we use the same User-Agent in every request, it’s easy for the target website to detect that the requests come from a crawler. To avoid this, rotate the User-Agent string between requests. You can easily find examples of genuine User-Agent strings on the Internet; try them out. If you’re using Scrapy, you can set the USER_AGENT property in settings.py.
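    A sketch of per-request rotation with the requests library (the strings below are truncated examples; substitute full, genuine User-Agent values):

```python
import random

# Truncated example User-Agent strings; substitute full, genuine ones
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) ...",
    "Mozilla/5.0 (X11; Linux x86_64) ...",
]

# Pick a different User-Agent for every request
headers = {"User-Agent": random.choice(USER_AGENTS)}
# r = requests.get("https://velotio.com", headers=headers)
```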

    4. Disguise your requests by rotating IPs and Proxy Services: We’ve discussed this in the challenges above. It’s always better to use rotating IPs and proxy service so that your spider won’t get blocked.

    5. Do not follow the same crawling pattern: As you know, many websites use anti-scraping technologies, so it’s easy for them to detect your spider if it always crawls in the same pattern. Humans don’t normally follow a fixed pattern on a website. To have your spiders run smoothly, we can introduce actions like mouse movements, clicking a random link, etc., which make your spider appear more human.

    6. Scrape during off-peak hours: Off-peak hours are suitable for bots/crawlers as the traffic on the website is considerably less. These hours can be identified by the geolocation from where the site’s traffic originates. This also helps to improve the crawling rate and avoid the extra load from spider requests. Thus, it is advisable to schedule the crawlers to run in the off-peak hours.

    7. Use the scraped data responsibly: We should always take responsibility for the scraped data. Scraping data and then republishing it elsewhere is not acceptable; this can be considered a breach of copyright laws and may lead to legal issues. So, it is advisable to check the target website’s Terms of Service page before scraping.

    8. Use Canonical URLs: When we scrape, we tend to hit duplicate URLs, and hence scrape duplicate data, which is the last thing we want to do. It may happen within a single website that multiple URLs carry the same data. In this situation, the duplicate URLs will have a canonical URL, which points to the parent or original URL. By honoring it, we make sure we don’t scrape duplicate content. In frameworks like Scrapy, duplicate URLs are handled by default.

    9. Be transparent: Don’t misrepresent your purpose or use deceptive methods to gain access. If you have a login and a password that identifies you to gain access to a source, use it.  Don’t hide who you are. If possible, share your credentials.

    Conclusion

    We’ve seen the basics of scraping, frameworks, how to crawl, and the best practices of scraping. To conclude:

    • Follow target URLs rules while scraping. Don’t make them block your spider.
    • Maintenance of data and spiders at scale is difficult. Use Docker/ Kubernetes and public cloud providers, like AWS to easily scale your web-scraping backend.
    • Always respect the rules of the websites you plan to crawl. If APIs are available, always use them first.
  • Building a Progressive Web Application in React [With Live Code Examples]

    What is PWA:

    A Progressive Web Application or PWA is a web application that is built to look and behave like native apps, operates offline-first, is optimized for a variety of viewports ranging from mobile, tablets to FHD desktop monitors and more. PWAs are built using front-end technologies such as HTML, CSS and JavaScript and bring native-like user experience to the web platform. PWAs can also be installed on devices just like native apps.

    For an application to be classified as a PWA, it must tick all of these boxes:

    • PWAs must implement service workers. Service workers act as a proxy between the web browsers and API servers. This allows web apps to manage and cache network requests and assets
    • PWAs must be served over a secure network, i.e. the application must be served over HTTPS
    • PWAs must have a web manifest definition, which is a JSON file that provides basic information about the PWA, such as name, different icons, look and feel of the app, splash screen, version of the app, description, author, etc

    Why build a PWA?

    Businesses and engineering teams should consider building a progressive web app instead of a traditional web app. Here are some of the most prominent arguments in favor of PWAs:

    • PWAs are responsive. The mobile-first design approach enables PWAs to support a variety of viewports and orientation
    • PWAs can work on slow Internet or no Internet environment. App developers can choose how a PWA will behave when there’s no Internet connectivity, whereas traditional web apps or websites simply stop working without an active Internet connection
    • PWAs are secure because they are always served over HTTPS
    • PWAs can be installed on the home screen, making the application more accessible
    • PWAs bring in rich features, such as push notification, application updates and more

    PWA and React

    There are various ways to build a progressive web application. One can just use Vanilla JS, HTML and CSS or pick up a framework or library. Some of the popular choices in 2020 are Ionic, Vue, Angular, Polymer, and of course React, which happens to be my favorite front-end library.

    Building PWAs with React

    To get started, let’s create a PWA which lists all the users in a system.

    npm init react-app users
    cd users
    yarn add react-router-dom
    yarn run start

    Next, we will replace the default App.js file with our own implementation.

    import React from "react";
    import { BrowserRouter, Route } from "react-router-dom";
    import "./App.css";
    const Users = () => {
     // state
     const [users, setUsers] = React.useState([]);
     // effects
     React.useEffect(() => {
       fetch("https://jsonplaceholder.typicode.com/users")
         .then((res) => res.json())
         .then((users) => {
           setUsers(users);
         })
         .catch((err) => {});
     }, []);
     // render
     return (
       <div>
         <h2>Users</h2>
         <ul>
           {users.map((user) => (
             <li key={user.id}>
               {user.name} ({user.email})
             </li>
           ))}
         </ul>
       </div>
     );
    };
    const App = () => (
     <BrowserRouter>
       <Route path="/" exact component={Users} />
     </BrowserRouter>
    );
    export default App;

    This displays a list of users fetched from the server.

    Let’s also remove the logo.svg file inside the src directory and truncate the App.css file that is populated as a part of the boilerplate code.

    To make this app a PWA, we need to follow these steps:

    1. Register service worker

    • In the file /src/index.js, replace serviceWorker.unregister() with serviceWorker.register().
    import React from 'react';
    import ReactDOM from 'react-dom';
    import './index.css';
    import App from './App';
    import * as serviceWorker from './serviceWorker';
    ReactDOM.render(
     <React.StrictMode>
       <App />
     </React.StrictMode>,
     document.getElementById('root')
    );
    serviceWorker.register();

    • The default behavior here is to not set up a service worker, i.e. the CRA boilerplate allows the users to opt-in for the offline-first experience.

    2. Update the manifest file

    • The CRA boilerplate provides a manifest file out of the box. This file is located at /public/manifest.json and needs to be modified to include the name of the PWA, description, splash screen configuration and much more. You can read more about available configuration options in the manifest file here.

    Our modified manifest file looks like this:

    {
     "short_name": "User Mgmt.",
     "name": "User Management",
     "icons": [
       {
         "src": "favicon.ico",
         "sizes": "64x64 32x32 24x24 16x16",
         "type": "image/x-icon"
       },
       {
         "src": "logo192.png",
         "type": "image/png",
         "sizes": "192x192"
       },
       {
         "src": "logo512.png",
         "type": "image/png",
         "sizes": "512x512"
       }
     ],
     "start_url": ".",
     "display": "standalone",
     "theme_color": "#aaffaa",
     "background_color": "#ffffff"
    }

    PWA Splash Screen

    Here the display mode selected is “standalone” which tells the web browsers to give this PWA the same look and feel as that of a standalone app. Other display options include, “browser,” which is the default mode and launches the PWA like a traditional web app and “fullscreen,” which opens the PWA in fullscreen mode – hiding all other elements such as navigation, the address bar and the status bar.

    The manifest can be inspected using Chrome dev tools > Application tab > Manifest.

    3. Test the PWA:

    • To test a progressive web app, build it completely first. This is because PWA features, such as caching, aren’t enabled while running the app in dev mode, to ensure hassle-free development
    • Create a production build with: npm run build
    • Change into the build directory: cd build
    • Host the app locally: http-server or python3 -m http.server 8080
    • Test the application by navigating to http://localhost:8080

    4. Audit the PWA: If you are testing the app for the first time on a desktop or laptop browser, the PWA may look like just another website. To test and audit various aspects of the PWA, let’s use Lighthouse, a tool built by Google specifically for this purpose.

    PWA on mobile

    At this point, we already have a simple PWA which can be published on the Internet and made available to billions of devices. Now let’s try to enhance the app by improving its offline viewing experience.

    1. Offline indication: Since service workers can operate without the Internet as well, let’s add an offline indicator banner to let users know the current state of the application. We will use navigator.onLine along with the “online” and “offline” window events to detect the connection status.

    // state
    const [offline, setOffline] = React.useState(!navigator.onLine);
    // effects
    React.useEffect(() => {
      const offlineListener = () => setOffline(true);
      const onlineListener = () => setOffline(false);
      window.addEventListener("offline", offlineListener);
      window.addEventListener("online", onlineListener);
      return () => {
        window.removeEventListener("offline", offlineListener);
        window.removeEventListener("online", onlineListener);
      };
    }, []);
    
    {/* add to JSX */}
    {offline ? (
      <div className="banner-offline">The app is currently offline</div>
    ) : null}

    The easiest way to test this is to just turn off the Wi-Fi on your dev machine. Chrome dev tools also provide an option to test this without actually going offline. Head over to Dev tools > Network and then select “Offline” from the dropdown in the top section. This should bring up the banner when the app is offline.

    2. Let’s cache a network request using service worker

    CRA comes with its own service-worker.js file, which caches all static assets, such as JavaScript and CSS files, that are a part of the application bundle. To put custom logic into the service worker, let’s create a new file called ‘service-worker-custom.js’ and combine the two.

    • Install react-app-rewired and update package.json:
    1. yarn add react-app-rewired
    2. Update the package.json as follows:
    "scripts": {
       "start": "react-app-rewired start",
       "build": "react-app-rewired build",
       "test": "react-app-rewired test",
       "eject": "react-app-rewired eject"
    },

    • Create a config-overrides.js file in the project root to override how CRA generates service workers and inject our custom service worker, i.e., combine the two service worker files.
    const WorkboxWebpackPlugin = require("workbox-webpack-plugin");
    module.exports = function override(config, env) {
      config.plugins = config.plugins.map((plugin) => {
        if (plugin.constructor.name === "GenerateSW") {
          return new WorkboxWebpackPlugin.InjectManifest({
           swSrc: "./src/service-worker-custom.js",
           swDest: "service-worker.js"
          });
        }
        return plugin;
      });
      return config;
    };

    • Create the service-worker-custom.js file and cache the network request there:
    workbox.skipWaiting();
    workbox.clientsClaim();
    workbox.routing.registerRoute(
      new RegExp("/users"),
      workbox.strategies.NetworkFirst()
    );
    workbox.precaching.precacheAndRoute(self.__precacheManifest || [])

    Your app should now work correctly in the offline mode.

    Distributing and publishing a PWA

    PWAs can be published just like any other website, with only one additional requirement: they must be served over HTTPS. When a user visits a PWA from a mobile or tablet, a pop-up is displayed asking the user if they’d like to install the app on their home screen.

    Conclusion

    Building PWAs with React enables engineering teams to develop, deploy and publish progressive web apps for billions of devices using technologies they’re already familiar with. Existing React apps can also be converted to a PWA. PWAs are fun to build, easy to ship and distribute, and add a lot of value to customers by providing a native-like experience and better engagement via features such as add to home screen, push notifications and more, without any installation process.

  • How to Implement Server Sent Events Using Python Flask and React

    A typical request-response cycle works like this: the client sends a request to the server, and the server responds to that request. But there are a few use cases where we might need to send data from the server without a request, or where the client expects data that can arrive at an arbitrary time. There are a few mechanisms available to solve this problem.

    Server Sent Events

    Broadly, we can classify these as client pull and server push mechanisms. WebSockets is a bidirectional mechanism where data is transmitted over a full-duplex TCP connection. Client pull can be done using various mechanisms:

    1. Manual refresh – the client is refreshed manually.
    2. Long polling – the client sends a request to the server and waits until a response is received; as soon as it gets the response, it sends a new request.
    3. Short polling – the client continuously sends requests to the server at short, fixed intervals.

    Server-sent events are a type of server push mechanism, where the client subscribes to a stream of updates generated by the server and, whenever a new event occurs, a notification is sent to the client.

    Why server-sent events are better than polling:

    • With polling, scaling and orchestration of the backend must be managed in real time as the number of users grows.
    • When mobile devices rapidly switch between WiFi and cellular networks or lose connections and the IP address changes, long polling needs to re-establish connections.
    • With long polling, we need to manage a message queue and catch up on missed messages.
    • Long polling needs load balancing or fail-over support across multiple servers.

    SSE vs Websockets

SSEs cannot provide bidirectional client-server communication, as opposed to WebSockets. Use cases that require such communication, like real-time multiplayer games and messaging or chat apps, call for WebSockets. When there’s no need to send data from the client, SSEs might be the better option; examples of such use cases are status updates, news feeds and other automated data-push mechanisms. Backend implementation also tends to be simpler with SSE than with WebSockets. Note, however, that browsers limit the number of open SSE connections per domain.

    Also, learn about WS vs SSE here.

    Implementation

The server-side code can be implemented in any high-level language. Here is sample code for Python Flask SSE. Flask-SSE requires a broker such as Redis to store the messages. We are also using Flask-APScheduler to schedule background processes with Flask.

    Here we need to install and import ‘flask_sse’ and ‘apscheduler.’

    import datetime

    from flask import Flask, render_template
    from flask_sse import sse
    from apscheduler.schedulers.background import BackgroundScheduler

Now we need to initialize the Flask app, provide the config for Redis, and register a route (a URL) where the client will listen for events.

    app = Flask(__name__)
    app.config["REDIS_URL"] = "redis://localhost"
    app.register_blueprint(sse, url_prefix='/stream')

To publish data to a stream, we call the publish method of sse and provide an event type.

    sse.publish({"message": datetime.datetime.now()}, type='publish')

On the client, we need to add an event listener that listens to our stream and reads the messages.

    var source = new EventSource("{{ url_for('sse.stream') }}");
        source.addEventListener('publish', function(event) {
            var data = JSON.parse(event.data);
            console.log("The server says " + data.message);
        }, false);
        source.addEventListener('error', function(event) {
            console.log("Error"+ event)
            alert("Failed to connect to event stream. Is Redis running?");
        }, false);

    Check out a sample Flask-React-Redis based application demo for server side events.

    Here are some screenshots of the client:

    Fig: First Event

    Fig: Second Event

    Server logs:

    api_1    | ('Event Scheduled at ', datetime.datetime(2019, 5, 1, 7, 31, 0, 24564))
    api_1    | ('Event Scheduled at ', datetime.datetime(2019, 5, 1, 7, 31, 14, 30164))
    api_1    | ('Event Scheduled at ', datetime.datetime(2019, 5, 1, 7, 31, 28, 37840))
    api_1    | ('Event Scheduled at ', datetime.datetime(2019, 5, 1, 7, 31, 42, 58162))
    api_1    | INFO:apscheduler.executors.default:Job "server_side_event (trigger: interval[0:00:14], next run at: 2019-05-01 07:37:31 UTC)" executed successfully
    api_1    | INFO:apscheduler.executors.default:Running job "server_side_event (trigger: interval[0:00:16], next run at: 2019-05-01 07:37:38 UTC)" (scheduled at 2019-05-01 07:37:22.362874+00:00)
    api_1    | INFO:apscheduler.executors.default:Job "server_side_event (trigger: interval[0:00:16], next run at: 2019-05-01 07:37:38 UTC)" executed successfully
    api_1    | INFO:apscheduler.executors.default:Running job "server_side_event (trigger: interval[0:00:14], next run at: 2019-05-01 07:37:31 UTC)" (scheduled at 2019-05-01 07:37:31.993944+00:00)
    api_1    | INFO:apscheduler.executors.default:Job "server_side_event (trigger: interval[0:00:14], next run at: 2019-05-01 07:37:45 UTC)" executed successfully

    Use Cases of Server Sent Events

Let’s see a use case with an example. Consider a real-time graph on our web app. One option is polling, where the client continuously polls the server for new data. The other option is server-sent events, which are asynchronous: the server sends data only when updates happen.

    Other applications could be

    • Real time stock price analysis system
    • Real time social media feeds
    • Resource monitoring for health, uptime

    Conclusion

In this blog, we covered how to implement server-sent events using Python Flask and React, and how to use background schedulers along with them. This can be used to deliver data from the server to the client using server push.

  • Blockchain 101: The Simplest Guide You Will Ever Read

    Blockchain allows digital information to be distributed over multiple nodes in the network. It powers the backbone of bitcoin and cryptocurrency.

    The concept of a distributed ledger found its use case beyond crypto and is now used in other infrastructure.

    What is Blockchain?

Blockchain is the distributed ledger that powers bitcoin. Satoshi Nakamoto invented bitcoin, and blockchain was its key component. Blockchain is highly secure and works on a decentralized consensus algorithm, where no single party has complete control.

Let’s divide the word blockchain into two parts: block and chain. A block is a set of transactions that happen over the network. The chain links blocks to each other: each block contains the hash of the previous one. Even a small change in a previous block changes its hash and breaks the whole chain, making it very difficult to tamper with the data.
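The linking can be sketched in a few lines of Python. This is a toy illustration using the standard library's SHA-256; real Bitcoin uses double SHA-256 and much richer block headers:

```python
import hashlib

def block_hash(prev_hash, transactions):
    # every block commits to the hash of the block before it
    return hashlib.sha256((prev_hash + transactions).encode()).hexdigest()

genesis = block_hash("0" * 64, "Alice pays Bob 1 BTC")
second = block_hash(genesis, "Bob pays Sam 1 BTC")

# Tampering with the first block changes its hash, so the link stored
# in the second block no longer matches and the chain is broken.
tampered = block_hash("0" * 64, "Alice pays Bob 100 BTC")
print(tampered == genesis)  # False: the tampered block no longer fits the chain
```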

    Image source

    Blockchain prerequisites:

    These are some prerequisites that will help you understand the concepts better.

Public-key Cryptography – Used to establish the authenticity of a user. It involves a pair of public and private keys. The user creates a signature with the private key, and the network uses the user’s public key to validate that the content is untouched.

    Digital Signatures:

    Digital signatures employ asymmetric key cryptography.

    • Authentication: Digital signature makes the receiver believe that the data was created and sent by the claimed user.
    • Non-Repudiation: The sender cannot deny sending a message later on.
    • Integrity: This ensures that the message was not altered during the transfer.

    Cryptographic hash functions:

    • One-way function: a mathematical function that maps an input to an output in such a way that there is no practical way to recover the message from its hash value.
    • Collision resistance: it is computationally infeasible to find two messages with the same hash (message digest). This ensures that no two transactions can collide.
    • Fixed hash length: irrespective of the input size, the function returns a hash of the same length.
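All three properties are easy to observe with SHA-256, the hash Bitcoin uses, straight from Python's standard library:

```python
import hashlib

def digest(message):
    return hashlib.sha256(message.encode()).hexdigest()

# Fixed hash length: 256 bits (64 hex characters) regardless of input size
print(len(digest("hi")), len(digest("a much longer message " * 1000)))  # 64 64

# One-way with no practical collisions: a tiny change in the input
# yields a completely different, unpredictable digest
print(digest("transaction 1"))
print(digest("transaction 2"))
```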

    Why Blockchain?

    There are a few problem statements that we can quickly solve using a distributed consensus system rather than a conventional centralized system.

    Let me share some blockchain applications:

Consider an auction where people bid on artifacts, and the winner pays for and takes away those artifacts. But if we try to implement the same auction over the internet, there are trust issues. What if someone wins the bid at $10,000 and then doesn’t respond when it’s time to pay?

We can handle such events easily using blockchain. During bidding, a token amount is deducted from the bidder’s account and held in the smart contract (business logic code deployed on Ethereum). Bid transactions are signed with the bidder’s private key, so no one can later claim that those transactions never happened.

Another simple but amazing solution we can develop using Ethereum is an online game like tic-tac-toe, where both players deposit `X` amount in the smart contract. Each move a player makes is recorded on the blockchain (and digitally signed), and the smart contract logic verifies every move. In the end, the smart contract decides the winner, who can then claim the reward.

    – No one controls your game

    – There is no way one can cheat 

    – No frauds, the winner always gets a reward.

    Bitcoin is the biggest and most well-known implementation of blockchain technology.

    The list of applications based on distributed consensus systems goes on.

Note: An Ethereum smart contract is code deployed on the Ethereum blockchain. It is written as a transaction on a block, so no one can alter the logic. This is also known as “Code is Law”.

    Check out some of the smart contract examples.

    Bitcoin is the base and ideal implementation for all other cryptocurrencies. Let’s dig deep into blockchain technology and cryptocurrency.

    Let’s reinvent Bitcoin:

    • Bitcoin is distributed ledger technology where the ledger is a set of transactions
    • No single entity controls the system
    • High level of trust 

    We have to design our bitcoin to meet above requirements.

    1) Consider that bitcoin is just a string that we will send from one node to the other. Here, the string is: “I, Alice, am giving Bob one bitcoin.” It shows Alice is sending Bob one bitcoin.

     

    2) Sam uses the fake identity of Alice and sends bitcoin on her behalf.

     

3) We can solve this fake-identity problem using digital signatures: Sam cannot forge Alice’s signature without her private key.

But there is still one problem: double spending. This occurs when Alice sends the same transaction multiple times. It’s difficult to tell whether Alice wants to send multiple bitcoins or is just retrying a transaction because of high network latency or some other issue.

4) A simple solution is to add a unique transaction ID to each transaction, so a repeated message can be recognized and ignored.
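A sketch of that fix: tag each transaction with a unique ID (a UUID here, purely for illustration) and have each node skip any ID it has already applied, so a network retry is never applied twice. The `Ledger` class and field names are hypothetical:

```python
import uuid

class Ledger:
    """Toy ledger that ignores transactions it has already applied."""
    def __init__(self):
        self.applied_ids = set()                # transaction IDs seen so far
        self.balances = {"Alice": 1, "Bob": 0}

    def apply(self, tx):
        if tx["id"] in self.applied_ids:
            return False                        # a retry, not a new spend
        self.applied_ids.add(tx["id"])
        self.balances[tx["sender"]] -= tx["amount"]
        self.balances[tx["receiver"]] += tx["amount"]
        return True

ledger = Ledger()
tx = {"id": str(uuid.uuid4()), "sender": "Alice", "receiver": "Bob", "amount": 1}
print(ledger.apply(tx))   # True: applied once
print(ledger.apply(tx))   # False: same ID, treated as a network retry
print(ledger.balances)    # {'Alice': 0, 'Bob': 1}
```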

    5) It’s time to add more complexity to our system. Let’s check how we can validate the transaction between Alice and Bob.

    In cryptocurrency, every node knows everything (nodes are the systems where blockchain clients are installed, like Geth for Ethereum).

     

Every node maintains a local ledger containing the whole blockchain’s data. Here, Alice, Sam, and Bob all know how many bitcoins everyone has. This helps validate all transactions happening over the network.

When Bob receives an event from Alice containing a bitcoin transaction, he checks his local copy of the blockchain and verifies that Alice owns the bitcoin she wants to send. If Bob finds the transaction valid, he broadcasts it to the whole network and waits for others to confirm. Other peers also check their local copies and acknowledge the transaction. If the majority of peers confirm the transaction as valid, it gets added to the blockchain. Everyone then updates their copy of the ledger, and Alice has one less bitcoin.
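Bob's local check boils down to consulting his own copy of the ledger before acknowledging the transaction. A toy version of that validation (function and field names are illustrative):

```python
def is_valid(tx, balances):
    # a node checks its local ledger copy: the sender must own what they spend
    return balances.get(tx["sender"], 0) >= tx["amount"]

local_ledger = {"Alice": 1, "Sam": 2, "Bob": 0}
print(is_valid({"sender": "Alice", "amount": 1}, local_ledger))  # True
print(is_valid({"sender": "Alice", "amount": 2}, local_ledger))  # False: Alice owns only 1
```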

Note: In an actual cryptocurrency, validation occurs at the block level rather than per transaction. Bob validates a set of transactions, creates one block from them, and broadcasts that block over the network for validation.

6) Still, there is one problem with this approach: we are using Bob as a validator. But what if he is a fraud? He might say a transaction is valid even when it isn’t, and if he has thousands of automated bots to support him, the whole blockchain will follow the bots and accept the invalid transaction (majority wins).

    In this example, Alice has one bitcoin. Still, she creates two transactions: one to Bob and another to Sam. Alice waits for the network to accept the transaction to Bob. Now Alice has 0 bitcoins. If Alice validates her own transaction to Sam and says it’s valid (Alice has no bitcoin left to spend), and she has a large number of bots to support her, then eventually the whole network will accept that transaction, and Alice will double spend the bitcoin.

    7) We can solve this problem with the POW (Proof of Work) consensus algorithm.

This is a puzzle that a miner has to solve while validating the transactions present in a block.

Here you can see that a block is around 1 MB in size. The miner appends a random number (the nonce) to the block and computes its hash, repeating until the hash value starts with the required string of zeros, as shown in the image.

    The blockchain decides this number, and then the next block’s miner has to find a nonce so that the hash has that many zeros at the beginning. To solve this puzzle, a miner may have to try quadrillions of combinations. Because this is such an expensive process, miners are rewarded after the block is validated.
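A toy version of the puzzle: append a nonce to the block data and re-hash until the digest starts with the required number of zeros. (Real Bitcoin compares the hash against a numeric target, and the real difficulty is astronomically higher than the toy value used here.)

```python
import hashlib

def mine(block_data, difficulty):
    """Search for a nonce whose hash has `difficulty` leading zeros."""
    prefix = "0" * difficulty
    nonce = 0
    while True:
        digest = hashlib.sha256(("%s%d" % (block_data, nonce)).encode()).hexdigest()
        if digest.startswith(prefix):
            return nonce, digest
        nonce += 1

nonce, digest = mine("Alice pays Bob 1 BTC", difficulty=4)
print(nonce, digest)  # the digest begins with "0000"
```

Finding the nonce takes many attempts, but verifying it takes a single hash, which is why the rest of the network can check a miner's work cheaply.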

    But how can we solve the above problem using mining?

Suppose the blockchain network has 10,000 active mining nodes with the same computational power. The probability that any single one of them mines the next block is only 0.01%. Anyone who wants to push fraudulent transactions would need enough mining power to validate the block and convince other nodes to accept it, which means owning more than 50% of the network’s computational power. That is very difficult.

    Now we have a prototype cryptocurrency model ready with us.

Note: Each blockchain node follows the majority. Even if a transaction is invalid, if 51% or more of the nodes say it’s valid, the whole network is convinced and goes rogue. This means any group that owns 51% of the computational power (hash power) controls the whole blockchain network. This is known as a 51% attack.
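The majority rule in the note above reduces to a simple vote count, which makes the 51% attack easy to see in code (a deliberately simplified model, ignoring hash power and treating each node as one vote):

```python
def network_accepts(votes):
    # each node votes whether the transaction is valid; the majority wins
    return sum(votes) > len(votes) / 2

honest = [False] * 49     # honest nodes reject the invalid transaction
attacker = [True] * 51    # an attacker controlling 51% of the nodes
print(network_accepts(honest + attacker))  # True: the invalid transaction is accepted
```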

    Blockchain-based services:

    • Golem: distributed computing platform.
    • iExec: distributed computing platform.
    • Sia: distributed storage where the SLA is managed over the blockchain.
    • Etherisc: insurance platform developed on the Ethereum ecosystem.
    • Maecenas: blockchain-based auction of fine art.
    • More than 2,000 cryptocurrency platforms.

    Disadvantages of Blockchain:

    There are a few limitations to blockchain-based solutions.

    • Crime: due to its encryption and pseudonymous nature, blockchain can facilitate crime.
    • Data size: full nodes store all transaction data and require over 100 GB of disk space.
    • Throughput: blockchain systems are very slow.

Bitcoin can process only about 5 transactions per second, while conventional payment networks can handle more than 24,000 transactions per second.

  • Building A Scalable API Testing Framework With Jest And SuperTest

    Focus on API testing

    Before starting off, below listed are the reasons why API testing should be encouraged:

    • Identifies bugs before they reach the UI
    • Testing effectively at a lower level beats high-level broad-stack testing
    • Reduces future effort to fix defects
    • Saves time

    Well, QA practices are becoming more automation-centric with evolving requirements, but identifying the appropriate approach is the primary and the most essential step. This implies choosing a framework or a tool to develop a test setup which should be:

    • Scalable 
    • Modular
    • Maintainable
    • Able to provide maximum test coverage
    • Extensible
    • Able to generate test reports
    • Easy to integrate with source control tool and CI pipeline

To attain this goal, why not develop your own asset rather than relying on ready-made tools like Postman or JMeter? Let’s have a look at why you might choose writing your own code over depending on the API testing tools available in the market:

    1. Customizable
    2. Saves you from the trap of limitations of a ready-made tool
    3. Freedom to add configurations and libraries as required and not really depend on the specific supported plugins of the tool
    4. No limit on the usage and no question of cost
5. Let’s take Postman for example. If we go with Newman (the CLI for Postman), the effort is likely to grow with changing requirements: adding a new test requires editing in Postman, saving it to the collection, exporting it again and running the entire collection.json through Newman. Isn’t it tedious to repeat the same process every time?

    We can overcome such annoyance and meet our purpose using a self-built Jest framework using SuperTest. Come on, let’s dive in!

    Source: school.geekwall

    Why Jest?

    Jest is pretty impressive. 

    • High performance
    • Easy and minimal setup
    • Provides in-built assertion library and mocking support
    • Several in-built testing features without any additional configuration
    • Snapshot testing
    • Brilliant test coverage
    • Allows interactive watch mode (jest --watch or jest --watchAll)

    Hold on. Before moving forward, let’s quickly visit Jest configurations, Jest CLI commands, Jest Globals and Javascript async/await for better understanding of the coming content.

    Ready, set, go!

Create a node project jest-supertest locally and run npm init. In the workspace, we will install Jest, jest-stare (to generate custom test reports) and jest-serial-runner (to disable parallel execution, since our tests might be interdependent), and save them as dev dependencies.

    npm install jest jest-stare jest-serial-runner --save-dev

Add the following to the scripts block in our package.json:

    
    "scripts": {
        "test": "NODE_TLS_REJECT_UNAUTHORIZED=0 jest --reporters default jest-stare --coverage --detectOpenHandles --runInBand --testTimeout=60000",
        "test:watch": "jest --verbose --watchAll"
      }

    npm run test command will invoke the test parameter with the following:

    • NODE_TLS_REJECT_UNAUTHORIZED=0: ignores SSL certificate errors
    • jest: runs the framework with the configurations defined under the jest block
    • --reporters default jest-stare: uses the default reporter together with jest-stare
    • --coverage: collects test coverage
    • --detectOpenHandles: helps debug handles that keep Jest from exiting
    • --runInBand: runs Jest tests serially
    • --forceExit: shuts down cleanly
    • --testTimeout=60000: custom timeout (the default is 5000 milliseconds)

    Jest configurations:

    [Note: This is customizable as per requirements]

    "jest": {
        "verbose": true,
        "testSequencer": "/home/abc/jest-supertest/testSequencer.js",
        "coverageDirectory": "/home/abc/jest-supertest/coverage/my_reports/",
        "coverageReporters": ["html","text"],
        "coverageThreshold": {
          "global": {
            "branches": 100,
            "functions": 100,
            "lines": 100,
            "statements": 100
          }
        }
      }

testSequencer: invokes testSequencer.js in the workspace to customize the order in which our test files run

    touch testSequencer.js

The code below in testSequencer.js will run our test files in alphabetical order.

    const Sequencer = require('@jest/test-sequencer').default;
    
    class CustomSequencer extends Sequencer {
      sort(tests) {
        // Test structure information
        // https://github.com/facebook/jest/blob/6b8b1404a1d9254e7d5d90a8934087a9c9899dab/packages/jest-runner/src/types.ts#L17-L21
        const copyTests = Array.from(tests);
        return copyTests.sort((testA, testB) => (testA.path > testB.path ? 1 : -1));
      }
    }
    
    module.exports = CustomSequencer;

    • verbose: to display individual test results
    • coverageDirectory: creates a custom directory for coverage reports
    • coverageReporters: format of reports generated
    • coverageThreshold: minimum and maximum threshold enforcements for coverage results

    Testing endpoints with SuperTest

SuperTest is a superagent-driven node library for extensively testing RESTful web services. It hits the HTTP server to send requests (GET, POST, PATCH, PUT, DELETE) and fetch responses.

    Install SuperTest and save it as a dependency.

    npm install supertest --save-dev

    "devDependencies": {
        "jest": "^25.5.4",
        "jest-serial-runner": "^1.1.0",
        "jest-stare": "^2.0.1",
        "supertest": "^4.0.2"
      }

    All the required dependencies are installed and our package.json looks like:

    {
      "name": "supertestjest",
      "version": "1.0.0",
      "description": "",
      "main": "index.js",
      "jest": {
        "verbose": true,
        "testSequencer": "/home/abc/jest-supertest/testSequencer.js",
        "coverageDirectory": "/home/abc/jest-supertest/coverage/my_reports/",
        "coverageReporters": ["html","text"],
        "coverageThreshold": {
          "global": {
            "branches": 100,
            "functions": 100,
            "lines": 100,
            "statements": 100
          }
        }
      },
      "scripts": {
        "test": "NODE_TLS_REJECT_UNAUTHORIZED=0 jest --reporters default jest-stare --coverage --detectOpenHandles --runInBand --testTimeout=60000",
        "test:watch": "jest --verbose --watchAll"
      },
      "author": "",
      "license": "ISC",
      "devDependencies": {
        "jest": "^25.5.4",
        "jest-serial-runner": "^1.1.0",
        "jest-stare": "^2.0.1",
        "supertest": "^4.0.2"
      }
    }

    Now we are ready to create our Jest tests with some defined conventions:

    • a describe block assembles multiple tests (its)
    • a test block (usually aliased as ‘it’) holds a single test
    • expect() performs assertions

Jest recognizes test files in the __test__/ folder

    • with .test.js extension
    • with .spec.js extension

    Here is a reference app for API tests.

Let’s write commonTests.js, which every test file will require. It hits the app through SuperTest, logs in (if required) and saves the authorization token. The aliases are exported from here for use in all the tests.

    [Note: commonTests.js, be created or not, will vary as per the test requirements]

    touch commonTests.js

    var supertest = require('supertest'); //require supertest
    const request = supertest('https://reqres.in/'); //supertest hits the HTTP server (your app)
    
    /*
    This piece of code is for getting the authorization token after login to your app.
    let token;
    test("Login to the application", function(){
        return request.post(``).then((response)=>{
            token = response.body.token  //to save the login token for further requests
        })
    }); 
    */
    
    module.exports = 
    {
        request
            //, token     -- export if token is generated
    }

    Moving forward to writing our tests on POST, GET, PUT and DELETE requests for the basic understanding of the setup. For that, we are creating two test files to also see and understand if the sequencer works.

    mkdir __test__/
    touch __test__/postAndGet.test.js __test__/putAndDelete.test.js

    As mentioned above and sticking to Jest protocols, we have our tests written.

    postAndGet.test.js test file:

    • requires commonTests.js into ‘request’ alias
    • POST requests to api/users endpoint, calls supertest.post() 
    • GET requests to api/users endpoint, calls supertest.get()
    • uses file system to write globals and read those across all the tests
    • validates response returned on hitting the HTTP endpoints
    const request = require('../commonTests');
    const fs = require('fs');
    let userID;
    
    //Create a new user
    describe("POST request", () => {
      
      try{
        let userDetails;
        beforeEach(function () {  
            console.log("Input user details!")
            userDetails = {
              "name": "morpheus",
              "job": "leader"
          }; //new user details to be created
          });
        
        afterEach(function () {
          console.log("User is created with ID : ", userID)
        });
    
    	  it("Create user data", async done => {
    
            return request.request.post(`api/users`) //post() of supertest
                //.set('Authorization', `Token ${request.token}`) //Authorization token
                    .send(userDetails) //Request header
                    .expect(201) //response to be 201
                    .then((res) => {
                        expect(res.body).toBeDefined(); //test if response body is defined
                        //expect(res.body.status).toBe("success")
                        userID = res.body.id;
                        let jsonContent = JSON.stringify({userId: res.body.id}); // create a json
                        fs.writeFile("data.json", jsonContent, 'utf8', function (err) //write user id into global json file to be used 
                        {
                        if (err) {
                            return console.log(err);
                        }
                        console.log("POST response body : ", res.body)
                        done();
                        });
                      })
                    })
                  }
                  catch(err){
                    console.log("Exception : ", err)
                  }
            });
    
    //GET all users      
    describe("GET all user details", () => {
      
      try{
          beforeEach(function () {
            console.log("GET all users details ")
        });
              
          afterEach(function () {
            console.log("All users' details are retrieved")
        });
    
          test("GET user output", async done =>{
            await request.request.get(`api/users`) //get() of supertest
                                    //.set('Authorization', `Token ${request.token}`) 
                                    .expect(200).then((response) =>{
                                    console.log("GET RESPONSE : ", response.body);
                                    done();
                        })
          })
        }
      catch(err){
        console.log("Exception : ", err)
        }
    });

    putAndDelete.test.js file:

    • requires commonTests.js into the ‘request’ alias
    • reads data.json (created by the file system in our previous test to hold global variables) into the ‘data’ alias
    • PUT requests to the api/users/${data.userId} endpoint, calls supertest.put()
    • DELETE requests to the api/users/${data.userId} endpoint, calls supertest.delete()
    • validates the responses returned by the endpoints
    • removes data.json (similar to unsetting global variables) after all the tests are done
    const request = require('../commonTests');
    const fs = require('fs'); //file system
    const data = require('../data.json'); //data.json containing the global variables
    
    //Update user data
    describe("PUT user details", () => {
    
        try{
            let newDetails;
            beforeEach(function () {
                console.log("Input updated user's details");
                newDetails = {
                    "name": "morpheus",
                    "job": "zion resident"
                }; // details to be updated
      
            });
            afterEach(function () {
                console.log("user details are updated");
            });
      
            test("Update user now", async done =>{
    
                console.log("User to be updated : ", data.userId)
    
                const response = await request.request.put(`api/users/${data.userId}`).send(newDetails) //call put() of supertest
                                    //.set('Authorization', `Token ${request.token}`) 
                                            .expect(200)
                expect(response.body.updatedAt).toBeDefined();
                console.log("UPDATED RESPONSE : ", response.body);
                done();
        })
      }
        catch(err){
            console.log("ERROR : ", err)
        }
    });
    
    //DELETE the user
    describe("DELETE user details", () =>{
        try{
            beforeAll(function (){
                console.log("To delete user : ", data.userId)
            });
    
            test("Delete request", async done =>{
                const response = await request.request.delete(`api/users/${data.userId}`) //invoke delete() of supertest
                                            .expect(204) 
                console.log("DELETE RESPONSE : ", response.body);
                done(); 
            });
    
            afterAll(function (){
                console.log("user is deleted!!")
                fs.unlinkSync('data.json'); //remove data.json after all tests are run
            });
        }
    
        catch(err){
            console.log("EXCEPTION : ", err);
        }
    });

And we are done setting up a decent framework, and it’s all just one command away!

    npm test

    Once complete, the test results will be immediately visible on the terminal.

An HTML report of the test results is also generated as index.html under jest-stare/.

    And test coverage details are created under coverage/my_reports/ in the workspace.

    Similarly, other HTTP methods can also be tested, like OPTIONS – supertest.options() which allows dealing with CORS, PATCH – supertest.patch(), HEAD – supertest.head() and many more.

    Wasn’t it a convenient and successful journey?

    Conclusion

So, let’s wrap up with a note that API testing needs attention. As QAs, let’s abide by the concept of the testing pyramid, which is really a tester’s mindset: combat issues at the lower levels to avoid chaos at the upper levels, i.e. the UI.

    Testing Pyramid

    I hope you had a good read. Kindly spread the word. Happy coding!

  • What is Gatsby.Js and What Problems Does it Solve?

    According to their site, “Gatsby is a free and open source framework based on React that helps developers build blazing fast websites and apps”. Gatsby allows the developers to make a site using React and work with any data source (CMSs, Markdown, etc) of their choice. And then at the build time it pulls the data from these sources and spits out a bunch of static files that are optimized by Gatsby for performance. Gatsby loads only the critical HTML, CSS and JavaScript so that the site loads as fast as possible. Once loaded, Gatsby prefetches resources for other pages so clicking around the site feels incredibly fast.

    What Does Gatsby Try to Achieve?

    • Construct new, higher-level web building blocks: Gatsby is trying to build abstractions like gatsby-image, gatsby-link which will make web development easier by providing building blocks instead of making a new component for each project.

    • Create a cohesive “content mesh system”: The Content Management System (CMS) was developed to make content sites possible. Traditionally, a CMS solution was a monolithic application to store content, build sites, and deliver them to users. But with time, the industry moved to specialized tools for key areas like search, analytics, and payments, which have improved rapidly, while the quality of monolithic enterprise CMS applications like Adobe Experience Manager and Sitecore has stayed roughly the same.

      To tie this modular CMS architecture together, Gatsby aims to build a “content mesh” – the infrastructure layer for a decoupled website. The content mesh stitches together content systems in a modern development environment while optimizing website delivery for performance. The content mesh empowers developers while preserving content creators’ workflows. It gives you access to best-of-breed services without the pain of manual integration.

    Image Source: Gatsby

    • Make building websites fun by making them simple: Each stakeholder in a website project should be able to see their creation quickly. Using these building blocks along with the content mesh, website building feels fun no matter how big it gets. As Alan Kay said, “you get simplicity by finding slightly more sophisticated building blocks”.

    An example of this can be seen in the gatsby-image component. First, let’s consider how a single image gets onto a website:

    1. A page is designed
    2. Specific images are chosen
    3. The images are resized (with ideally multiple thumbnails to fit different devices)
    4. And finally, the image(s) are included in the HTML/CSS/JS (or React component) for the page.

    gatsby-image is integrated into Gatsby’s data layer and uses its image processing capabilities along with graphql to query for differently sized and shaped images.

    gatsby-image also hides the complexity of lazy-loading images within placeholders, as well as generating the right-sized image thumbnails.

    So instead of a long pipeline of tasks to set up optimized images for your site, the steps now are:
    1. Install gatsby-image
    2. Decide what size of image you need
    3. Add your query and the gatsby-image component to your page
    4. And…that’s it!

    Now images are fun!
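    As a hedged sketch of those steps (using gatsby-image’s pre-v3 fluid API and assuming the sharp plugins are configured; the file name hero.jpg is hypothetical), a page component might look like:

```jsx
import React from "react"
import { graphql } from "gatsby"
import Img from "gatsby-image"

// The query asks Gatsby's data layer for a fluid image capped
// at 600px wide; the sharp plugins generate the thumbnails.
export const query = graphql`
  query {
    file(relativePath: { eq: "hero.jpg" }) {
      childImageSharp {
        fluid(maxWidth: 600) {
          ...GatsbyImageSharpFluid
        }
      }
    }
  }
`

// Img renders a placeholder and lazy-loads the real file as it scrolls into view.
export default ({ data }) => (
  <Img fluid={data.file.childImageSharp.fluid} alt="Hero" />
)
```

    The query handles thumbnail generation, and the Img component handles placeholders and lazy loading, which is exactly the pipeline we no longer have to build by hand.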

    • Build a better web – qualities like speed, security, maintainability, SEO, etc. should be baked into the framework being used. If they have to be implemented on a per-site basis, they become a luxury. Gatsby bakes these qualities in by default so that the right thing is the easy thing. The most high-impact way to make the web better is to make it high-quality by default.

    It is More Than Just a Static Site Generator

    Gatsby is not just for creating static sites. Gatsby is fully capable of generating a PWA with everything we expect a modern web app to do, including auth, dynamic interactions, fetching data, etc.

    Gatsby does this by generating the static content using React DOM server-side APIs. Once this basic HTML is generated by Gatsby, React picks up where it left off. That basically means Gatsby statically renders as much as possible upfront; then client-side React takes over, and we can do whatever a traditional React web app can do.

    Best of Both Worlds

    By generating static HTML up front and then handing control to client-side React, Gatsby gives us the best of both worlds.

    Statically rendered pages maximize SEO and provide a better TTI and better general web performance. Static sites are also easy to distribute globally and easier to deploy.

    Conclusion

    If the code runs successfully in development mode (gatsby develop), it doesn’t mean that there will be no issues with the build version. An easy solution is to build the code regularly and solve the issues as they appear. This is easy enough when a build is generated after every change and the build time is a couple of minutes. But if changes are frequent and the build only gets created a few times a week or month, it becomes harder, as multiple issues will have to be solved at build time.

    If you have a very big site with a lot of styled components and libraries, the build time increases substantially. If the build takes half an hour, it is no longer feasible to run it after every change, which makes finding build issues regularly complicated.

  • SEO for Web Apps: How to Boost Your Search Rankings

    The responsibilities of a web developer are not just designing and developing a web application but also adding the right set of features that allow the site to get higher traffic. One way of getting traffic is by ensuring your web page is listed in the top search results of Google. Search engines consider certain factors while ranking a web page (covered in this guide below), and accommodating these factors in your web app is called search engine optimization.

    A web app that is search engine optimized loads faster, has a good user experience, and is shown in the top search results of Google. If you want your web app to have these features, then this essential guide to SEO will provide you with a checklist to follow when working on SEO improvements.

    Key Facts:

    • 75% of visitors only visit the first three links listed and results from the second page get only 0.78% of clicks.
    • 95% of visitors visit only the links from the first page of Google.
    • Search engines give 300% more traffic than social media.
    • 8% of searches from browsers are in the form of a question.
    • 40% of visitors will leave a website if it takes more than 3 seconds to load. And more shocking is that 80% of those visitors will not visit the same site again.

    How Search Works:

    1. Crawling: These automated scripts are often referred to as web crawlers, web spiders, Googlebot, or simply crawlers. They start from past crawls and look for the sitemap file, which is found at the root directory of the web application. We will cover more on the sitemap later; for now, just understand that the sitemap file has all the links to your website, ordered hierarchically. Crawlers add those links to the crawl queue so that they can be crawled later. Crawlers pay special attention to newly added sites and frequently updated/visited sites, and they use several algorithms to decide how often an existing site should be recrawled.
    2. Indexing: Let us first understand what indexing means. Indexing is collecting, parsing, and storing data to enable a super-fast response to queries. Now, Google uses the same steps to perform web indexing. Google visits each page from the crawl queue and analyzes what the page is about and analyzes the content, images, and video, then parses the analyzed result and stores it into their database called Google Index.
    3. Serving: When a user makes a search query on Google, Google tries to determine the highest-quality result and considers other criteria before serving it, like the user’s location, submitted data, language, and device (desktop/mobile). That is why responsiveness is also considered for SEO. Non-responsive sites might have a higher ranking for desktop but will have a lower ranking for mobile because, while analyzing the page content, these bots see the pages as the user sees them and assign the ranking accordingly.

    Factors that affect SEO ranking:

    1. Sitemap: The sitemap file comes in two types, HTML and XML, and both files are placed at the root of the web app. The HTML sitemap guides users around the website pages; it lists the pages hierarchically to help users understand the flow of the website. The XML sitemap helps search engine bots crawl the pages of the site and understand the website structure. It contains different types of data, which help the bots crawl cleverly:

    loc: The URL of the webpage.

    lastmod: When the content of the URL was last updated.

    changefreq: How often the content of the page gets changed.

    priority: It has the range from 0 to 1—0 represents the lowest priority, and 1 represents the highest. 1 is generally given to the home or landing page. Setting 1 to every URL will cause search engines to ignore this field.

    A live sitemap.xml (most sites serve one at /sitemap.xml) shows how this looks in practice.

    The below example shows how the URL will be written along with the fields.
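    As a sketch following the standard sitemap protocol (the URL and values are placeholders), a single url entry combines the four fields like this:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.example.com/</loc>
    <lastmod>2021-06-01</lastmod>
    <changefreq>monthly</changefreq>
    <priority>1.0</priority>
  </url>
</urlset>
```

    Each additional page gets its own url element inside the same urlset.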

     

    2. Meta tags: Meta tags are very important because they indirectly affect SEO ranking. They contain important information about the web page, and this information is shown as the snippet in Google search results. Users see this snippet and decide whether to click the link, and search engines consider click-through rates when serving results. Meta tags are not visible to the user on the web page, but they are part of the HTML code.

    A few important meta tags for SEO are:

    • Meta title: This is the primary content shown by the search results, and it plays a huge role in deciding the click rates because it gives users a quick glance at what this page is about. It should ideally be 50-60 characters long, and the title should be unique for each page.
    • Meta description: It summarizes or gives an overview of the page content in short. The description should be precise and of high quality. It should include some targeted keywords the user will likely search and be under 160 characters.
    • Meta robots: It tells search engines whether to index and crawl web pages. The four values it can contain are index, noindex, follow, or nofollow. If these values are not used correctly, then it will negatively impact the SEO.
      index/noindex: Tells whether to index the web page.
      follow/nofollow: Tells whether to crawl links on the web page.
    • Meta viewport: It signals to search engines that the web page is responsive to different screen sizes, and it instructs the browser on how to render the page. The presence of this tag helps search engines understand that the website is mobile-friendly, which matters because Google ranks results differently in mobile search. If the desktop version is opened on mobile, the user will most likely close the page, sending a negative signal to Google that this page has some undesirable content, resulting in a lower ranking. This tag should be present on all the web pages.

      Let us look at what a Velotio page would look like with and without the meta viewport tag.


    • Meta charset: It sets the character encoding of the webpage; in simple terms, it tells the browser how the text should be interpreted and displayed on the page. Wrong character encoding will make content hard to read for search engines and will lead to a bad user experience. Use UTF-8 character encoding wherever possible.
    • Meta keywords: Search engines don’t consider this tag anymore. Bing considers this tag as spam. If this tag is added to any of the web pages, it may work against SEO. It is advisable not to have this tag on your pages.
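    Putting the useful tags above together (the content strings here are placeholders), a page’s head might look like:

```html
<head>
  <meta charset="UTF-8">
  <meta name="viewport" content="width=device-width, initial-scale=1">
  <!-- Meta title: shown as the headline of the search snippet. -->
  <title>Example Page Title, Ideally 50-60 Characters</title>
  <!-- Meta description: the snippet text, kept under 160 characters. -->
  <meta name="description" content="A precise, keyword-aware summary of the page content.">
  <!-- Meta robots: allow indexing and link crawling. -->
  <meta name="robots" content="index, follow">
</head>
```

    Note that the “meta title” is the regular title element, not a meta tag, but search engines treat it as part of the same snippet.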

    3. Usage of Headers / Hierarchical content: Header tags are the heading tags that are important for user readability and search engines. Headers organize the content of the web page so that it won’t look like a plain wall of text. Bots check for how well the content is organized and assign the ranking accordingly. Headers make the content user-friendly, scannable, and accessible. Header tags are from h1 to h6, with h1 being high importance and h6 being low importance. Googlebot considers h1 mainly because it is typically the title of the page and provides brief information about what this page content has.

    If Velotio’s different pages of content were written on one big page (not good advice, just for example), the hierarchy could be structured like the snapshot below.
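    As a hedged sketch of such a hierarchy (the section names here are made up for illustration):

```html
<h1>Velotio Engineering Blog</h1>   <!-- one h1: the page's main topic -->
<h2>Backend Engineering</h2>
<h3>Implementing gRPC in Python</h3>
<h2>Frontend Engineering</h2>
<h3>SEO for Web Apps</h3>
<h3>What is Gatsby.js?</h3>
```

    The single h1 states the page’s subject, and each h2/h3 level narrows the topic, which is exactly the structure crawlers use to understand the content.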

    4. Usage of Breadcrumb: Breadcrumbs are the navigational elements that allow users to track which page they are currently on. Search engines find this helpful to understand the structure of the website. It lowers the bounce rate by engaging users to explore other pages of the website. Breadcrumbs can be found at the top of the page with slightly smaller fonts. Usage of breadcrumb is always recommended if your site has deeply nested pages.

    If we refer to the MDN pages, then a hierarchical breadcrumb can be found at the top of the page.

    5. User Experience (UX): UX has become an integral component of SEO. A good UX always makes your users stay longer, which lowers the bounce rate and makes them visit your site again. Google recognizes this stay time and click rates and considers the site as more attractive to users, ranking it higher in the search results. Consider the following points to have a good user experience.

    1. Divide content into sections, not just a plain wall of text
    2. Use hierarchical font sizes
    3. Use images/videos that summarize the content
    4. Good theme and color contrast
    5. Responsiveness (desktop/tablet/mobile)

    6. Robots.txt: The robots.txt file controls which pages of the site crawlers may access. It contains directives that tell the bots not to crawl the disallowed pages, and pages that are never crawled will generally not be indexed. The best example of a page that should not be crawled is a payment gateway page. Robots.txt is kept at the root of the web app and should be public. Refer to Velotio’s robots.txt file to know more about it. User-agent: * means the given directives apply to all bots that honor robots.txt.
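    A minimal robots.txt along these lines (the disallowed paths are placeholders) looks like:

```
User-agent: *
Disallow: /payment/
Disallow: /admin/

Sitemap: https://www.example.com/sitemap.xml
```

    The Sitemap line is optional but commonly included so crawlers can find the XML sitemap without guessing its location.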

    7. Page speed: Page speed is the time it takes for the page to be fully displayed and interactive. Google also considers page speed an important factor for SEO. As we saw in the facts section, users tend to close a site if it takes longer than 3 seconds to load. To Googlebot, this is unfavorable to the user experience, and it will lower the ranking. We will go through some tools later in this section to measure the loading speed of a page, but if your site loads slowly, then look into the recommendations below.

    • Image compression: On a consumer-oriented website, images contribute around 50-90% of the page weight, so they must load quickly. Use compressed images, which lowers the file size without compromising quality. Cloudinary is a platform that does this job decently.
      If your image is 700×700 but is shown in a 300×300 container, then rather than shrinking it with CSS, load the image at 300×300 in the first place; the browser doesn’t need to download such a big image, and scaling it down via CSS wastes time that can be avoided by loading an image of the required size.
      By utilizing deferring/lazy image loading, images are downloaded when they are needed as the user scrolls on the webpage. Doing this allows the images to not be loaded at once, and browsers will have the bandwidth to perform other tasks.
      Using sprite images is also an effective way to reduce the HTTP requests by combining small icons into one sprite image and displaying the section we want to show. This will save load time by avoiding loading multiple images.
    • Code optimization: Every developer should consider reusability while developing code, which will help in reducing the code size. Nowadays, most websites are developed using bundlers. Use bundle analyzers to analyze which piece of code is leading to a size increase. Bundlers are already doing the minification process while generating the build artifacts.
    • Removing render-blocking resources: Browsers build the DOM tree by parsing HTML. During this process, if the parser finds a script, DOM construction is paused and script execution starts. This increases the page load time; to avoid blocking DOM creation, use async or defer on your scripts and load scripts at the end of the body. Keep in mind, though, that some scripts need to be loaded in the head, like the Google Analytics script. Don’t apply this suggestion blindly, as it may cause unusual behavior on your site.
    • Implementing a Content Distribution Network (CDN): It helps in loading the resources in a shorter time by figuring out the nearest server located from the user location and delivering the content from the nearest server.
    • Good hosting platform: Optimizing images and code alone cannot always improve page speed. Budget shared hosting serves many other websites from the same servers, which can prevent your site from loading quickly. So, it is always recommended to use a premium hosting service or a dedicated server.
    • Implement caching: If resources are cached on a browser, then they are not fetched from the server; rather the browser picks them from the cache. It is important to have an expiration time while setting cache. And caching should also be done only on the resources that are not updated frequently.
    • Reducing redirects: Each redirect adds an extra HTTP request-response cycle to the page load. It is advisable not to use too many redirects.
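    Two of the tips above can be expressed directly in markup; for example (the file names are placeholders):

```html
<!-- Right-sized, lazily loaded image: downloaded only when the user scrolls near it. -->
<img src="thumb-300x300.jpg" width="300" height="300" loading="lazy" alt="Product photo">

<!-- Non-blocking scripts: defer runs after parsing (in document order);
     async runs as soon as the script arrives. -->
<script defer src="app.js"></script>
<script async src="analytics.js"></script>
```

    The explicit width and height also help the browser reserve space, avoiding layout shifts while the image loads.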

    Some tools help us find the score of our website and provide information on what areas can be improved. These tools consider SEO, user experience, and accessibility point of view while calculating the score. These tools give results in some technical terms. Let us understand them in short:

    1. Time to first byte: The time from the request until the browser receives the first byte of the response. The white screen we see for some time on page landing is largely TTFB at work.

    2. First contentful paint: It represents when the user sees something on the web page.

    3. First meaningful paint: It tells when the primary, meaningful content of the page, like the main text or images, becomes visible to the user.

    4. First CPU idle: It represents the moment when the site has loaded enough information for it to be able to handle the user’s first input.

    5. Largest contentful paint: It represents when the largest content element above the page’s fold (visible without scrolling) has rendered.

    6. Time to interactive: It represents the moment when the web page is fully interactive.

    7. Total blocking time: The total time during which the page’s main thread was blocked long enough to prevent it from responding to user input.

    8. Cumulative layout shift: A score of how much visible elements unexpectedly shift around while the page is being rendered; it measures an amount of movement, not a time.
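    Several of these metrics can be observed in the browser with the standard PerformanceObserver API; a minimal, browser-only sketch (entry-type support varies by browser):

```javascript
// Log Largest Contentful Paint candidates as they occur.
new PerformanceObserver((list) => {
  for (const entry of list.getEntries()) {
    console.log("LCP candidate at", entry.startTime, "ms");
  }
}).observe({ type: "largest-contentful-paint", buffered: true });

// Accumulate layout-shift scores, ignoring shifts caused by recent user input.
let cls = 0;
new PerformanceObserver((list) => {
  for (const entry of list.getEntries()) {
    if (!entry.hadRecentInput) cls += entry.value;
  }
  console.log("CLS so far:", cls);
}).observe({ type: "layout-shift", buffered: true });
```

    The tools below report these same numbers without any instrumentation, but observing them directly is useful for debugging a specific page in development.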

    Below are some popular tools we can use for performance analysis:

    1. PageSpeed Insights: This assessment tool provides a score and opportunities to improve.

    2. WebPageTest: This monitoring tool lets you analyze each resource’s loading time.

    3. GTmetrix: This is also an assessment tool, like Lighthouse, that gives some more information, and we can set the test location as well.

    Conclusion:

    We have seen what SEO is, how it works, and how we can improve it by going through sitemap, meta tags, heading tags, robots.txt, breadcrumb, user experience, and finally the page load speed. For a business-to-consumer application, SEO is highly important. It lets you drive more traffic to your website. Hopefully, this basic guide will help you improve SEO for your existing and future websites.

    Related Articles

    1. Eliminate Render-blocking Resources using React and Webpack

    2. Building High-performance Apps: A Checklist To Get It Right

    3. Building a Progressive Web Application in React [With Live Code Examples]