Author: admin

  • How to Write Jenkinsfile for Angular and .Net Based Applications

    If you landed here directly and want to know how to set up a Jenkins master-slave architecture, please visit this post on Setting up the Jenkins Master-Slave Architecture.

    The source code that we are using here is also a continuation of the code that was written in this GitHub Packer-Terraform-Jenkins repository.

    Creating Jenkinsfile

    We will create Jenkinsfiles to execute jobs from our Jenkins master.

    Here I will create two Jenkinsfiles. Ideally, your Jenkinsfile should live in the source code repo, but it can also be passed directly in the job.

    There are two ways of writing a Jenkinsfile – Scripted and Declarative. You can find numerous comparisons of the two online. We will write one of each to do a build so that we can get a hang of both.

    Jenkinsfile for Angular App (Scripted)

    As mentioned before, we will be highlighting both formats of writing a Jenkinsfile. For the Angular app we will write a scripted one, though it could just as easily be written in the declarative format.

    We will be running this inside a docker container. Thus, the tests are also going to get executed in a headless manner.

    Here is the Jenkinsfile for reference.

    Here we are trying to leverage Docker volume to keep updating our source code on bare metal and use docker container for the environments.
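    As a sketch of the pipeline dissected below — the Node and Nginx image tags, the container name, and the chrome.json source are assumptions for illustration, not the exact original file — a scripted Jenkinsfile could look like:

    ```groovy
    node {
        stage('Clean workspace') {
            cleanWs()
        }
        stage('Pull images') {
            sh 'docker pull node:12'
            sh 'docker pull nginx:latest'
        }
        stage('Checkout SCM') {
            checkout scm
        }
        stage('Install and lint') {
            // Workspace is volume-mounted into the container by docker.image().inside
            docker.image('node:12').inside {
                sh 'npm install'
                sh 'npm run lint'
            }
        }
        stage('Get test dependency') {
            // chrome.json is a seccomp profile needed to run Chrome inside the container
            sh 'curl -sSL -o chrome.json <chrome.json-url>'
        }
        stage('Test') {
            docker.image('node:12').inside("--security-opt seccomp=${env.WORKSPACE}/chrome.json") {
                sh 'npm run test -- --watch=false --browsers=ChromeHeadless'
            }
        }
        stage('Build') {
            docker.image('node:12').inside {
                sh 'npm run build -- --prod'
            }
        }
        stage('Deploy') {
            // Host the dist folder with an Nginx container; because dist is volume-mounted,
            // an existing container picks up new changes without being recreated
            sh '''
                if [ -z "$(docker ps -q -f name=angular-app)" ]; then
                  docker run -d --name angular-app -p 80:80 \
                    -v "$WORKSPACE/dist:/usr/share/nginx/html:ro" nginx:latest
                fi
            '''
        }
    }
    ```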

    Dissecting Node App’s Jenkinsfile

    1. We are using cleanWs() to clear the workspace.
    2. Next is the Main build in which we define our complete build process.
    3. We are pulling the required images.
    4. Highlighting the steps that we will be executing.
    5. Checkout SCM: Checking out our code from Git
    6. We are now starting the node container inside of which we will be running npm install and npm run lint.
    7. Get test dependency: Here we are downloading chrome.json which will be used in the next step when starting the container.
    8. Here we test our app. Specific changes for running the test are mentioned below.
    9. Build: Finally we build the app.
    10. Deploy: Once CI is completed we need to start with CD. CD could be a blog post of its own, but we wanted to highlight what a basic deployment would do.
    11. Here we are using Nginx container to host our application.
    12. If the container does not exist it will create a container and use the “dist” folder for deployment.
    13. If Nginx container exists, then it will ask for user input to recreate a container or not.
    14. If you select not to recreate it, don’t worry: since we are using Nginx, it will do a hot reload with the new changes.

    The Angular application used here was created using the standard generate command of the CLI itself. Although the build and install give no trouble on bare metal, some tweaks are required for running the tests in a container.

    In karma.conf.js, update browsers with ChromeHeadless.
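    A sketch of that change — only the browsers entry differs; the rest of the CLI-generated config stays as-is:

    ```javascript
    // karma.conf.js (excerpt) -- only the browsers entry changes
    module.exports = function (config) {
      config.set({
        // ...the rest of the Angular CLI generated options stay unchanged...
        browsers: ['ChromeHeadless'],
      });
    };
    ```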

    Next, in protractor.conf.js, update browserName with chrome and add:

    'chromeOptions': {
        'args': ['--headless', '--disable-gpu', '--window-size=800x600']
    },

    That’s it! We now have our CI pipeline set up for an Angular-based application.

    Jenkinsfile for .Net App (Declarative)

    For a .Net application, we have to set up MSBuild and MSDeploy. In the blog post mentioned above we have already set up MSBuild, and we will shortly discuss how to set up MSDeploy.

    To do the Windows build and deployment we have two options: either set up MSBuild in Jenkins Global Tool Configuration, or use the full path of MSBuild on the slave machine.

    Passing the path is fairly simple and here we will discuss how to use global tool configuration in a Jenkinsfile.

    First, get the path of MSBuild from your server. Note that the path differs between releases: recent versions keep the binaries in a Current directory, while older ones use a <version> directory.

    As we are using MSBuild 2017, our MSBuild path is:

    C:\Program Files (x86)\Microsoft Visual Studio\2017\BuildTools\MSBuild\15.0\Bin

    Place this in Jenkins Global Tool Configuration (/configureTools/) —> MSBuild.

    Now you have your configuration ready to be used in Jenkinsfile.

    Jenkinsfile to build and test the app is given below.
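    A minimal declarative sketch matching the steps dissected below might look like this; the agent label, project paths, and MSDeploy arguments are assumptions, not the exact original file:

    ```groovy
    pipeline {
        agent { label 'windows' } // assumed label of the Windows slave
        stages {
            stage('Clean') {
                steps { cleanWs() }
            }
            stage('Checkout') {
                steps { checkout scm }
            }
            stage('Nuget Restore') {
                steps {
                    bat 'nuget restore PrimeService\\PrimeService.csproj -PackagesDirectory packages'
                    bat 'nuget restore PrimeService.Tests\\PrimeService.Tests.csproj -PackagesDirectory packages'
                }
            }
            stage('Build') {
                steps {
                    // 'MSBuild' must match the tool name entered in Global Tool Configuration
                    bat "\"${tool 'MSBuild'}\\MSBuild.exe\" PrimeService.sln /p:Configuration=Release"
                }
            }
            stage('UnitTest') {
                steps {
                    bat 'dotnet test PrimeService.Tests'
                }
            }
            stage('Deploy') {
                steps {
                    // Push the build output to the IIS server with MSDeploy (placeholder paths)
                    bat '"C:\\Program Files\\IIS\\Microsoft Web Deploy V3\\msdeploy.exe" -verb:sync -source:contentPath="%WORKSPACE%\\publish" -dest:contentPath="Default Web Site"'
                }
            }
        }
    }
    ```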

    As seen above, the structure of the Declarative syntax is almost the same as that of the Scripted one. You should opt for whichever syntax you find easier to read.

    Dissecting Dotnet App’s Jenkinsfile

    1. In this case too we are cleaning the workspace as the first step.
    2. Checkout: This is also the same as before.
    3. Nuget Restore: We are downloading the required dependency packages for both PrimeService and PrimeService.Tests.
    4. Build: Building the Dotnet app using MSBuild tool which we had configured earlier before writing the Jenkinsfile.
    5. UnitTest: Here we have used dotnet test. We could have used MSTest as well, but we wanted to highlight how easy the dotnet utility makes this; we could even use dotnet build for the build step.
    6. Deploy: Deploying on the IIS server. The creation of the IIS server is covered below.

    From the above examples you get a hang of what a Jenkinsfile looks like and how it can be used for creating jobs. The files above highlight basic job creation, but they can be extended to everything that old-style job creation could do.

    Creating IIS Server

    Unlike our Angular application, where we just had to pull another image and we were good to go, here we will have to use Packer to create our IIS server. We will be automating the creation process and using the server to host applications.

    Here is a Powershell script for IIS for reference.

    # To list all Windows Features: dism /online /Get-Features
    # Get-WindowsOptionalFeature -Online 
    # LIST All IIS FEATURES: 
    # Get-WindowsOptionalFeature -Online | where FeatureName -like 'IIS-*'
    
    # NetFx dependencies
    dism /online /Enable-Feature /FeatureName:NetFx4 /All
    
    # ASP dependencies
    dism /online /enable-feature /all /featurename:IIS-ASPNET45
    
    Enable-WindowsOptionalFeature -Online -FeatureName IIS-WebServerRole
    Enable-WindowsOptionalFeature -Online -FeatureName IIS-WebServer 
    Enable-WindowsOptionalFeature -Online -FeatureName IIS-CommonHttpFeatures
    Enable-WindowsOptionalFeature -Online -FeatureName IIS-Security 
    Enable-WindowsOptionalFeature -Online -FeatureName IIS-RequestFiltering 
    Enable-WindowsOptionalFeature -Online -FeatureName IIS-StaticContent
    Enable-WindowsOptionalFeature -Online -FeatureName IIS-DefaultDocument
    Enable-WindowsOptionalFeature -Online -FeatureName IIS-DirectoryBrowsing
    Enable-WindowsOptionalFeature -Online -FeatureName IIS-HttpErrors 
    Enable-WindowsOptionalFeature -Online -FeatureName IIS-ApplicationDevelopment
    Enable-WindowsOptionalFeature -Online -FeatureName IIS-WebSockets 
    Enable-WindowsOptionalFeature -Online -FeatureName IIS-ApplicationInit
    Enable-WindowsOptionalFeature -Online -FeatureName IIS-NetFxExtensibility45
    Enable-WindowsOptionalFeature -Online -FeatureName IIS-ISAPIExtensions
    Enable-WindowsOptionalFeature -Online -FeatureName IIS-ISAPIFilter
    Enable-WindowsOptionalFeature -Online -FeatureName IIS-ASP
    Enable-WindowsOptionalFeature -Online -FeatureName IIS-ASPNET45
    Enable-WindowsOptionalFeature -Online -FeatureName IIS-ServerSideIncludes
    Enable-WindowsOptionalFeature -Online -FeatureName IIS-HealthAndDiagnostics
    Enable-WindowsOptionalFeature -Online -FeatureName IIS-HttpLogging 
    Enable-WindowsOptionalFeature -Online -FeatureName IIS-Performance
    Enable-WindowsOptionalFeature -Online -FeatureName IIS-HttpCompressionStatic
    Enable-WindowsOptionalFeature -Online -FeatureName IIS-WebServerManagementTools
    Enable-WindowsOptionalFeature -Online -FeatureName IIS-ManagementConsole 
    Enable-WindowsOptionalFeature -Online -FeatureName IIS-ManagementService
    
    # Install Chocolatey
    Set-ExecutionPolicy Bypass -Scope Process -Force; iex ((New-Object System.Net.WebClient).DownloadString('https://chocolatey.org/install.ps1'))
    
    # Install WebDeploy (It will deploy 3.6)
    choco install webdeploy -y

    We won’t be deploying any application on it, as we have only created a sample PrimeNumber app. But in the real world you might be deploying web-based applications, and you will need IIS. We have covered the basic idea of how to install IIS along with any dependencies that might be required.

    Conclusion

    In this post, we have covered deploying Windows and Linux based applications using Jenkinsfile in both scripted and declarative format.

    Thanks for Reading! Till next time…!!

  • Implementing Federated GraphQL Microservices using Apollo Federation

    Introduction

    GraphQL has revolutionized how a client queries a server. With the thin layer of GraphQL middleware, the client has the ability to query the data more comprehensively than what’s provided by the usual REST APIs.

    One of the key principles of GraphQL involves having a single data graph of the implementing services that will allow the client to have a unified interface to access more data and services through a single query. Having said that, it can be challenging to follow this principle for an enterprise-level application on a single, monolith GraphQL server.

    The Need for Federated Services

    James Baxley III, the Engineering Manager at Apollo, in his talk here, puts forward the rationale behind choosing an independently managed federated set of services very well.

    To summarize his point, let’s consider a very complex enterprise product. This product would essentially have multiple teams responsible for maintaining different modules of the product. Now, if we’re considering implementing a GraphQL layer at the backend, it would only make sense to follow the one graph principle of GraphQL: this says that to maximize the value of GraphQL, we should have a single unified data graph that’s operating at the data layer of this product. With that, it will be easier for a client to query a single graph and get all the data without having to query different graphs for different data portions.

    However, it would be challenging to have all of the huge enterprise data graphs’ layer logic residing on a single codebase. In addition, we want teams to be able to independently implement, maintain, and ship different schemas of the data graph on their own release cycles.

    Though there is only one graph, the implementation of that graph should be federated across multiple teams.

    Now, let’s consider a massive enterprise e-commerce platform as an example. The different schemas of the e-commerce platform look something like:

    Fig:- E-commerce platform set of schemas

    Considering the above example, it would be a chaotic task to maintain the graph implementation logic of all these schemas on a single code base. Another overhead that this would bring is having to scale a huge monolith that’s implementing all these services. 

    Thus, one solution is a federation of services for a single distributed data graph. Each service can be implemented independently by individual teams while maintaining their own release cycles and having their own iterations of their services. Also, a federated set of services would still follow the Onegraph principle of GraphQL, which will allow the client to query a single endpoint for fetching any part of the data graph.

    To further demonstrate the example above, let’s say the client asks for the top-five products, their reviews, and the vendor selling them. In a usual monolith GraphQL server, this query would involve writing a resolver that’s a mesh of the data sources of these individual schemas. It would be a task for teams to collaborate and come up with their individual implementations. Let’s consider a federated approach with separate services implementing products, reviews, and vendors. Each service is responsible for resolving only the part of the data graph that includes the schema and data source. This makes it extremely streamlined to allow different teams managing different schemas to collaborate easily.
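    For illustration, such a query against the unified graph might look like the following; the field names here (topProducts, reviews, vendor) are hypothetical:

    ```graphql
    {
      topProducts(first: 5) {
        name
        reviews {
          body
          rating
        }
        vendor {
          name
        }
      }
    }
    ```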

    Another advantage would be handling the scaling of individual services rather than maintaining a compute-heavy monolith for a huge data graph. For example, the products service is used the most on the platform, and the vendors service is scarcely used. In case of a monolith approach, the scaling would’ve had to take place on the overall server. This is eliminated with federated services where we can independently maintain and scale individual services like the products service.

    Federated Implementation of GraphQL Services

    A monolith GraphQL server that implements a lot of services for different schemas can be challenging to scale. Instead of implementing the complete data graph on a single codebase, the responsibilities of different parts of the data graph can be split across multiple composable services. Each one will contain the implementation of only the part of the data graph it is responsible for. Apollo Federation allows this division of services and follows a declarative programming model to allow splitting of concerns.

    Architecture Overview

    This article will not cover the basics of GraphQL, such as writing resolvers and schemas. If you’re not acquainted with the basics of GraphQL and setting up a basic GraphQL server using Apollo, I would highly recommend reading about it here. Then, you can come back here to understand the implementation of federated services using Apollo Federation.

    Apollo Federation has two principal parts to it:

    • A collection of services that distinctly define separate GraphQL schemas
    • A gateway that builds the federated data graph and acts as a forefront to distinctly implement queries for different services
    Fig:- Apollo Federation Architecture

    Separation of Concerns

    The usual way of going about implementing federated services would be to split an existing monolith based on the schemas already defined. Although this seems like a clear approach, it quickly causes problems when multiple schemas are involved.

    To illustrate, this is a typical way to split services from a monolith based on the existing defined Schemas:
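    A naive split of a Twitter-like monolith would look something like this, with each service declaring the full type it thinks it owns (the field names here mirror the User/Tweet example discussed next):

    ```graphql
    # User service
    type User {
      id: ID!
      username: String!
      tweets: [Tweet]   # declared here, but the User service has no Tweet datastore
    }

    # Tweet service
    type Tweet {
      id: ID!
      text: String
      creator: User     # declared here, but the Tweet service has no User datastore
    }
    ```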

     

    In the example above, although the tweets field belongs to the User schema, it wouldn’t make sense to populate this field in the User service. The tweets field of a User should be declared and resolved in the Tweet service itself. Similarly, it wouldn’t be right to resolve the creator field inside the Tweet service.

    The reason behind this approach is the separation of concerns. The User service might not even have access to the Tweet datastore to be able to resolve the tweets field of a user. On the other hand, the Tweet service might not have access to the User datastore to resolve the creator field of the Tweet schema.

    Considering the above schemas, each service is responsible for resolving the respective field of each Schema it is responsible for.

    Implementation

    To illustrate an Apollo Federation, we’ll be considering a Node.js server built with TypeScript. The packages used are provided by the Apollo libraries.

    npm i --save apollo-server @apollo/federation @apollo/gateway

    Some additional libraries to help run the services in parallel:

    npm i --save nodemon ts-node concurrently

    Let’s go ahead and write the structure for the gateway service first. Let’s create a file gateway.ts:

    touch gateway.ts

    And add the following code snippet:

    import { ApolloServer } from 'apollo-server';
    import { ApolloGateway } from '@apollo/gateway';
    
    const gateway = new ApolloGateway({
      serviceList: [],
    });
    
    const server = new ApolloServer({ gateway, subscriptions: false });
    
    server.listen().then(({ url }) => {
      console.log(`Server ready at url: ${url}`);
    });

    Note the serviceList is an empty array for now since we’ve yet to implement the individual services. In addition, we pass the subscriptions: false option to the apollo server config because currently, Apollo Federation does not support subscriptions.

    Next, let’s add the User service in a separate file user.ts using:

    touch user.ts

    The code will go in the user service as follows:

    import { buildFederatedSchema } from '@apollo/federation';
    import { ApolloServer, gql } from 'apollo-server';
    import User from './datasources/models/User';
    import mongoStore from './mongoStore';
    
    const typeDefs = gql`
      type User @key(fields: "id") {
        id: ID!
        username: String!
      }
      extend type Query {
        users: [User]
        user(id: ID!): User
      }
      extend type Mutation {
        createUser(userPayload: UserPayload): User
      }
      input UserPayload {
        username: String!
      }
    `;
    
    const resolvers = {
      Query: {
        users: async () => {
          const allUsers = await User.find({});
          return allUsers;
        },
        user: async (_, { id }) => {
          const currentUser = await User.findOne({ _id: id });
          return currentUser;
        },
      },
      User: {
        __resolveReference: async (ref) => {
          const currentUser = await User.findOne({ _id: ref.id });
          return currentUser;
        },
      },
      Mutation: {
        createUser: async (_, { userPayload: { username } }) => {
          const user = new User({ username });
          const createdUser = await user.save();
          return createdUser;
        },
      },
    };
    
    mongoStore();
    
    const server = new ApolloServer({
      schema: buildFederatedSchema([{ typeDefs, resolvers }]),
    });
    
    server.listen({ port: 4001 }).then(({ url }) => {
      console.log(`User service ready at url: ${url}`);
    });

    Let’s break down the code that went into the User service.

    Consider the User schema definition:

    type User @key(fields: "id") {
       id: ID!
       username: String!
    }

    The @key directive helps other services understand the User schema is, in fact, an entity that can be extended within other individual services. The fields will help other services uniquely identify individual instances of the User schema based on the id.

    The Query and the Mutation types need to be extended by all implementing services according to the Apollo Federation documentation since they are always defined on a gateway level.

    As a side note, the User model imported from './datasources/models/User' is essentially a Mongoose ORM model for MongoDB that handles all the CRUD operations of a User entity in a MongoDB database. In addition, the mongoStore() function is responsible for establishing a connection to the MongoDB database server.

    The User model implementation internally in Mongoose ORM looks something like this:

    import mongoose, { Schema } from 'mongoose';

    export const UserSchema = new Schema({
      username: {
        type: String,
      },
    });
    
    export default mongoose.model(
      'User',
      UserSchema
    );

    In the Query type, the users and the user(id: ID!) queries fetch a list or the details of individual users.

    In the resolvers, we define a __resolveReference function responsible for returning an instance of the User entity to all other implementing services that hold only a reference id of a User and need the full instance. The ref parameter is an object ({ id: 'userEntityId' }) containing the id of a User instance, passed down from other implementing services that need the reference resolved. Internally, we fire a Mongoose findOne query to return an instance of the User from the users database based on that reference id. To illustrate the resolver:

    User: {
        __resolveReference: async (ref) => {
          const currentUser = await User.findOne({ _id: ref.id });
          return currentUser;
        },
      },

    At the end of the file, we make sure the service is running on a unique port number 4001, which we pass as an option while running the apollo server. That concludes the User service.

    Next, let’s add the tweet service by creating a file tweet.ts using:

    touch tweet.ts

    The following code goes as a part of the tweet service:

    import { buildFederatedSchema } from '@apollo/federation';
    import { ApolloServer, gql } from 'apollo-server';
    import Tweet from './datasources/models/Tweet';
    import TweetAPI from './datasources/tweet';
    import mongoStore from './mongoStore';
    
    const typeDefs = gql`
      type Tweet {
        text: String
        id: ID!
        creator: User
      }
      extend type User @key(fields: "id") {
        id: ID! @external
        tweets: [Tweet]
      }
      extend type Query {
        tweet(id: ID!): Tweet
        tweets: [Tweet]
      }
      extend type Mutation {
        createTweet(tweetPayload: TweetPayload): Tweet
      }
      input TweetPayload {
        userId: String
        text: String
      }
    `;
    
    const resolvers = {
      Query: {
        tweet: async (_, { id }) => {
          const currentTweet = await Tweet.findOne({ _id: id });
          return currentTweet;
        },
        tweets: async () => {
          const tweetsList = await Tweet.find({});
          return tweetsList;
        },
      },
      Tweet: {
        creator: (tweet) => ({ __typename: 'User', id: tweet.userId }),
      },
      User: {
        tweets: async (user) => {
          const tweetsByUser = await Tweet.find({ userId: user.id });
          return tweetsByUser;
        },
      },
      Mutation: {
        createTweet: async (_, { tweetPayload: { text, userId } }) => {
          const newTweet = new Tweet({ text, userId });
          const createdTweet = await newTweet.save();
          return createdTweet;
        },
      },
    };
    
    mongoStore();
    
    const server = new ApolloServer({
      schema: buildFederatedSchema([{ typeDefs, resolvers }]),
    });
    
    server.listen({ port: 4002 }).then(({ url }) => {
      console.log(`Tweet service ready at url: ${url}`);
    });

    Let’s break down the Tweet service as well.

    type Tweet {
       text: String
       id: ID!
       creator: User
    }

    The Tweet schema has a text field, which is the content of the tweet, a unique id of the tweet, and a creator field, which is of the User entity type and resolves into the details of the user that created the tweet:

    extend type User @key(fields: "id") {
       id: ID! @external
       tweets: [Tweet]
    }

    We extend the User entity schema in this service, which has the id field with an @external directive. This helps the Tweet service understand that based on the given id field of the User entity schema, the instance of the User entity needs to be derived from another service (user service in this case).

    As we discussed previously, the tweets field of the extended User schema for the user entity should be resolved in the Tweet service since all the resolvers and access to the data sources with respect to the Tweets entity resides in this service.

    The Query and Mutation types of the Tweet service are pretty straightforward; we have the tweets and tweet(id: ID!) queries to resolve a list of Tweet entities or an individual instance.

    Let’s further break down the resolvers:

    Tweet: {
       creator: (tweet) => ({ __typename: 'User', id: tweet.userId }),
    },

    To resolve the creator field of the Tweet entity, the Tweet service needs to tell the gateway that this field will be resolved by the User service. Hence, we pass the id of the User and a __typename for the gateway to be able to call the right service to resolve the User entity instance. In the User service earlier, we wrote a  __resolveReference resolver, which will resolve the reference of a User based on an id.

    User: {
       tweets: async (user) => {
           const tweetsByUser = await Tweet.find({ userId: user.id });
           return tweetsByUser;
       },
    },

    Now, we need to resolve the tweets field of the User entity extended in the Tweet service. We write a resolver that receives the parent user entity reference as its first argument, using which we can fire a Mongoose ORM query to return all the tweets created by that user, given its id.

    At the end of the file, similar to the User service, we make sure the Tweet service runs on a different port by adding the port: 4002 option to the Apollo server config. That concludes both our implementing services.

    Now that we have our services ready, let’s update our gateway.ts file to reflect the added services:

    import { ApolloServer } from 'apollo-server';
    import { ApolloGateway } from '@apollo/gateway';
    
    const gateway = new ApolloGateway({
      serviceList: [
          { name: 'users', url: 'http://localhost:4001' },
          { name: 'tweets', url: 'http://localhost:4002' },
        ],
    });
    
    const server = new ApolloServer({ gateway, subscriptions: false });
    
    server.listen().then(({ url }) => {
      console.log(`Server ready at url: ${url}`);
    });

    We’ve added two services to the serviceList with a unique name to identify each service followed by the URL they are running on.

    Next, let’s make some small changes to the package.json file to make sure the services and the gateway run in parallel:

    "scripts": {
        "start": "concurrently -k npm:server:*",
        "server:gateway": "nodemon --watch 'src/**/*.ts' --exec 'ts-node' src/gateway.ts",
        "server:user": "nodemon --watch 'src/**/*.ts' --exec 'ts-node' src/user.ts",
        "server:tweet": "nodemon --watch 'src/**/*.ts' --exec 'ts-node' src/tweet.ts"
      },

    The concurrently library helps run the three scripts in parallel. The server:* scripts spin up a dev server, using nodemon to watch and reload the server on changes and ts-node to execute the TypeScript entry points.

    Let’s spin up our server:

    npm start

    On visiting http://localhost:4000, you should see the GraphQL query playground served by the Apollo gateway:

    Querying and Mutation from the Client

    Initially, let’s fire some mutations to create two users and some tweets by those users.

    Mutations

    First, let’s create a user with the username “@elonmusk”; the mutation returns the id of the created user. Fire the following mutation in the GraphQL playground:
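    Based on the createUser mutation defined in the User service earlier, such a mutation would look like:

    ```graphql
    mutation {
      createUser(userPayload: { username: "@elonmusk" }) {
        id
        username
      }
    }
    ```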

     

    We will create another user named “@billgates” and take a note of the ID.

    Now that we have two users created, let’s fire some mutations to create tweets by those users. Here is a simple mutation to create a tweet by the user “@elonmusk”.

    Here is another mutation that creates a tweet by the user “@billgates”.
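    Following the createTweet mutation defined in the Tweet service, such a mutation could look roughly like this — the userId value would be an id returned by createUser:

    ```graphql
    mutation {
      createTweet(tweetPayload: { userId: "<id-from-createUser>", text: "I own Tesla" }) {
        id
        text
      }
    }
    ```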

    After adding a couple of those, we are good to fire our queries, which will allow the gateway to compose the data by resolving fields through different services.

    Queries

    Initially, let’s list all the tweets along with their creator, which is of type User. The query will look something like:

    {
     tweets {
       text
       creator {
         username
       }
     }
    }

    When the gateway encounters a query asking for tweet data, it forwards that query to the Tweet service since the Tweet service that extends the Query type has a tweet query defined in it. 

    On encountering the creator field of the tweet schema, which is of the type User, the creator resolver within the Tweet service is invoked. This is essentially just passing a __typename and an id, which tells the gateway to resolve this reference from another service.

    In the User service, we have a __resolveReference function, which returns the complete instance of a user given its id passed from the Tweet service. It also helps all other implementing services that need a reference to a User entity resolved.

    On firing the query, the response should look something like:

    {
      "data": {
        "tweets": [
          {
            "text": "I own Tesla",
            "creator": {
              "username": "@elonmusk"
            }
          },
          {
            "text": "I own SpaceX",
            "creator": {
              "username": "@elonmusk"
            }
          },
          {
            "text": "I own PayPal",
            "creator": {
              "username": "@elonmusk"
            }
          },
          {
            "text": "I own Microsoft",
            "creator": {
              "username": "@billgates"
            }
          },
          {
            "text": "I own XBOX",
            "creator": {
              "username": "@billgates"
            }
          }
        ]
      }
    }

    Now, let’s try it the other way round. Let’s list all users and add the field tweets that will be an array of all the tweets created by that user. The query should look something like:

    {
     users {
       username
       tweets {
         text
       }
     }
    }

    When the gateway encounters the query of type users, it passes down that query to the user service. The User service is responsible for resolving the username field of the query.

    On encountering the tweets field of the users query, the gateway checks if any other implementing service has extended the User entity and has a resolver written within the service to resolve any additional fields of the type User.

    The Tweet service has extended the type User and has a resolver for the User type to resolve the tweets field, which will fetch all the tweets created by the user given the id of the user.

    On firing the query, the response should be something like:

    {
      "data": {
        "users": [
          {
            "username": "@elonmusk",
            "tweets": [
              {
                "text": "I own Tesla"
              },
              {
                "text": "I own SpaceX"
              },
              {
                "text": "I own PayPal"
              }
            ]
          },
          {
            "username": "@billgates",
            "tweets": [
              {
                "text": "I own Microsoft"
              },
              {
                "text": "I own XBOX"
              }
            ]
          }
        ]
      }
    }

    Conclusion

    Scaling an enterprise data graph on a monolith GraphQL service brings along a lot of challenges. Being able to distribute our data graph into implementing services that can be individually maintained and scaled using Apollo Federation helps quell those concerns.

    There are further advantages to federated services. Considering our example above, we could have two different kinds of datastores for the User and the Tweet services. While the User data could reside in a NoSQL database like MongoDB, the Tweet data could live in a SQL database like Postgres. This is easy to support since each service is responsible for resolving references only for the types it owns.

    Final Thoughts

    One of the key advantages of having different services that can be maintained individually is the ability to deploy each service separately. In addition, this also enables deployment of different services independently to different platforms such as Firebase, Lambdas, etc.

    A single monolith GraphQL server deployed on an instance or a single serverless platform can have some challenges with respect to scaling an instance or handling high concurrency as mentioned above.

    By splitting out the services, we could have a separate serverless function for each implementing service that can be maintained or scaled individually and also a separate function on which the gateway can be deployed.

    One popular usage of GraphQL Federation can be seen in this Netflix Technology blog, where they explain how they solved a bottleneck with the GraphQL APIs in Netflix Studio. They created a federated GraphQL microservices architecture, along with a schema store, using Apollo Federation. This solution gave them a unified schema with distributed ownership and implementation.

  • Implementing GraphQL with Flutter: Everything you need to know

    Thinking about using GraphQL but unsure where to start? 

    This is a concise tutorial based on our experience using GraphQL. You will learn how to use GraphQL in a Flutter app, including how to create a query, a mutation, and a subscription using the graphql_flutter plugin. Once you’ve mastered the fundamentals, you can move on to designing your own workflow.

    Key topics and takeaways:

    * GraphQL

    * What is graphql_flutter?

    * Setting up graphql_flutter and GraphQLProvider

    * Queries

    * Mutations

    * Subscriptions

    GraphQL

    Looking to call multiple endpoints to populate data for a single screen? Wish you had more control over the data returned by an endpoint? Tired of calls that return every field when you only need a few?

    Follow along to learn how to do this with GraphQL. GraphQL’s goal was to change the way data is supplied from the backend, and it allows you to specify the data structure you want.

    Let’s imagine that we have the table model in our database that looks like this:

    Movie {
      title
      genre
      rating
      year
    }

    These fields represent the properties of the Movie Model:

    • title is the name of the movie
    • genre describes what kind of movie it is
    • rating represents viewers’ interest
    • year states when it was released

    We can get movies like this using REST:

    GET localhost:8080/movies

    [
     {
       "title": "The Godfather",
       "genre":  "Drama",
       "rating": 9.2,
       "year": 1972
     }
    ]

    As you can see, whether or not we need them, REST returns all of the properties of each movie. In our frontend, we may just need the title and genre properties, yet all of them were returned.

    We can avoid redundancy by using GraphQL. We can specify the properties we wish to be returned using GraphQL, for example:

    query movies {
      Movie {
        title
        genre
      }
    }

    We’re informing the server that we only require the movie table’s title and genre properties. It provides us with exactly what we require:

    {
     "data": [
       {
         "title": "The Godfather",
         "genre": "Drama"
       }
     ]
    }

    GraphQL is a backend technology, whereas Flutter is a frontend SDK for building mobile apps. The data a mobile app displays typically comes from a backend.

    It’s simple to create a Flutter app that retrieves data from a GraphQL backend. Simply make an HTTP request from the Flutter app, then use the returned data to set up and display the UI.

    What is graphql_flutter?

    The new graphql_flutter plugin includes APIs and widgets that make it simple to retrieve and use data from a GraphQL backend.

    graphql_flutter, as the name suggests, is a GraphQL client for Flutter. It exports widgets and providers for retrieving data from GraphQL backends, such as:

    • HttpLink — This is used to specify the backend’s endpoint or URL.
    • GraphQLClient — This class is used to retrieve a query or mutation from a GraphQL endpoint as well as to connect to a GraphQL server.
    • GraphQLCache — We use this class to cache our queries and mutations. It has an options store where we pass the type of store to it during its caching operation.
    • GraphQLProvider — This widget encapsulates the graphql flutter widgets, allowing them to perform queries and mutations. This widget is given to the GraphQL client to use. All widgets in this provider’s tree have access to this client.
    • Query — This widget is used to perform a backend GraphQL query.
    • Mutation — This widget is used to modify a GraphQL backend.
    • Subscription — This widget allows you to create a subscription.

    Setting up graphql_flutter and GraphQLProvider

    Create a Flutter project:

    flutter create flutter_graphql
    cd flutter_graphql

    Next, install the graphql_flutter package:

    flutter pub add graphql_flutter

    The code above will set up the graphql_flutter package. This will include the graphql_flutter package in the dependencies section of your pubspec.yaml file:

    dependencies:
      graphql_flutter: ^5.0.0

    To use the widgets, we must import the package as follows:

    import 'package:graphql_flutter/graphql_flutter.dart';

    Before we can start making GraphQL queries and mutations, we must first wrap our root widget in GraphQLProvider. A GraphQLClient instance must be provided to the GraphQLProvider’s client property.

    GraphQLProvider(
      client: GraphQLClient(...),
    )

    The GraphQLClient includes the GraphQL server URL as well as a caching mechanism.

    final httpLink = HttpLink(uri: "http://10.0.2.2:8000/");

    ValueNotifier<GraphQLClient> client = ValueNotifier(
      GraphQLClient(
        cache: InMemoryCache(),
        link: httpLink,
      ),
    );

    HttpLink is used to generate the URL for the GraphQL server. The GraphQLClient receives the instance of the HttpLink in the form of a link property, which contains the URL of the GraphQL endpoint.

    The cache passed to GraphQLClient specifies the cache mechanism to be used. To persist or store caches, the InMemoryCache instance makes use of an in-memory database.

    A GraphQLClient instance is passed to a ValueNotifier. A ValueNotifier holds a single value and notifies its listeners when that value changes. graphql_flutter uses it to notify its widgets when the data from a GraphQL endpoint changes, which keeps the UI responsive.

    We’ll now encase our MaterialApp widget in GraphQLProvider:

    void main() {
      runApp(MyApp());
    }

    class MyApp extends StatelessWidget {
      // This widget is the root of your application.
      @override
      Widget build(BuildContext context) {
        return GraphQLProvider(
          client: client,
          child: MaterialApp(
            title: 'GraphQL Demo',
            theme: ThemeData(primarySwatch: Colors.blue),
            home: MyHomePage(title: 'GraphQL Demo'),
          ),
        );
      }
    }

    Queries

    We’ll use the Query widget to create a query with the graphql_flutter package.

    class MyHomePage extends StatelessWidget {
      @override
      Widget build(BuildContext context) {
        return Query(
          options: QueryOptions(
            document: gql(readCounters),
            variables: {
              'counterId': 23,
            },
            pollInterval: Duration(seconds: 10),
          ),
          builder: (QueryResult result,
              {VoidCallback refetch, FetchMore fetchMore}) {
            if (result.hasException) {
              return Text(result.exception.toString());
            }

            if (result.isLoading) {
              return Text('Loading');
            }

            // it can be either Map or List
            List counters = result.data['counter'];

            return ListView.builder(
                itemCount: counters.length,
                itemBuilder: (context, index) {
                  return Text(counters[index]['name']);
                });
          },
        );
      }
    }

    The Query widget encloses the ListView widget, which will display the list of counters to be retrieved from our GraphQL server. As a result, the Query widget must wrap the widget where the data fetched by the Query widget is to be displayed.

    The Query widget cannot be the tree’s topmost widget. It can be placed wherever you want as long as the widget that will use its data is underneath or wrapped by it.

    In addition, two properties have been passed to the Query widget: options and builder.

    options

    options: QueryOptions(
      document: gql(readCounters),
      variables: {
        'counterId': 23,
      },
      pollInterval: Duration(seconds: 10),
    ),

    The options property is where the query configuration is passed to the Query widget. It is a QueryOptions instance, whose properties we use to configure the Query widget.

    The query string or the query to be conducted by the Query widget is set or sent in via the document property. We passed in the readCounters string here:

    final String readCounters = """
      query readCounters(\$counterId: Int!) {
        counter(id: \$counterId) {
          name
          id
        }
      }
    """;

    The variables attribute is used to send query variables to the Query widget. There is a ‘counterId’: 23 there. In the readCounters query string, this will be passed in place of $counterId.

    The pollInterval specifies how often the Query widget polls or refreshes the query data. The timer is set to 10 seconds, so the Query widget will perform HTTP requests to refresh the query data every 10 seconds.

    builder

    The builder property is a function. The Query widget calls it after sending an HTTP request to the GraphQL server endpoint, passing in the query result, a function to re-fetch the data, and a function for pagination (fetchMore), which is used to fetch additional pages of data.

    The builder function returns widgets that are listed below the Query widget. The result argument is a QueryResult instance. The QueryResult class has properties that can be used to determine the query’s current state and the data returned by the Query widget.

    • If the query encounters an error, QueryResult.hasException is set.
    • If the query is still in progress, QueryResult.isLoading is set. We can use this property to show our users a UI progress bar to let them know that something is on its way.
    • The data returned by the GraphQL endpoint is stored in QueryResult.data.

    Mutations

    Let’s look at how to make mutation queries with the Mutation widget in graphql_flutter.

    The Mutation widget is used as follows:

    Mutation(
      options: MutationOptions(
        document: gql(addCounter),
        update: (GraphQLDataProxy cache, QueryResult result) {
          return cache;
        },
        onCompleted: (dynamic resultData) {
          print(resultData);
        },
      ),
      builder: (
        RunMutation runMutation,
        QueryResult result,
      ) {
        return FlatButton(
          onPressed: () => runMutation({
            'counterId': 21,
          }),
          child: Text('Add Counter'),
        );
      },
    );

    The Mutation widget, like the Query widget, accepts some properties.

    • options is a MutationOptions class instance. This is the location of the mutation string and other configurations.
    • The mutation string is set via the document property. An addCounter mutation has been passed to the document here; the Mutation widget will execute it.
    • update is called when we want to update the cache. It receives the previous cache (cache) and the result of the mutation, and whatever it returns becomes the cache’s new value. Here, we refresh the cache based on the result.
    • onCompleted is called once the mutation has finished on the GraphQL endpoint; it receives the mutation’s result data.
    • builder returns the widget tree rendered under the Mutation widget. It is invoked with a RunMutation instance, runMutation, and a QueryResult instance, result.
    • runMutation executes the Mutation widget’s mutation; the mutation runs whenever it is called. The mutation variables are passed as parameters to runMutation, which here is invoked with the counterId variable set to 21.

    When the Mutation’s mutation is finished, the builder is called, and the Mutation rebuilds its tree. runMutation and the mutation result are passed to the builder function.

    Subscriptions

    Subscriptions in GraphQL are similar to an event system that listens on a WebSocket and calls a function whenever an event is emitted into the stream.

    The client connects to the GraphQL server via a WebSocket. The event is passed to the WebSocket whenever the server emits an event from its end. So this is happening in real-time.

    The graphql_flutter plugin in Flutter uses WebSockets and Dart streams to open and receive real-time updates from the server.

    Let’s look at how we can use our Flutter app’s Subscription widget to create a real-time connection. We’ll start by creating our subscription string:

    final counterSubscription = '''
      subscription counterAdded {
        counterAdded {
          name
          id
        }
      }
    ''';

    When we add a new counter to our GraphQL server, this subscription will notify us in real-time.

    Subscription(
      options: SubscriptionOptions(
        document: gql(counterSubscription),
      ),
      builder: (result) {
        if (result.hasException) {
          return Text("Error occurred: " + result.exception.toString());
        }

        if (result.isLoading) {
          return Center(
            child: const CircularProgressIndicator(),
          );
        }

        return ResultAccumulator.appendUniqueEntries(
          latest: result.data,
          builder: (context, {results}) => ...
        );
      },
    ),

    The Subscription widget has several properties, as we can see:

    • options holds the Subscription widget’s configuration.
    • document holds the subscription string.
    • builder returns the Subscription widget’s widget tree.

    The subscription result is used to call the builder function. The end result has the following properties:

    • If the Subscription widget encounters an error while polling the GraphQL server for updates, result.hasException is set.
    • If polling from the server is active, result.isLoading is set.

    The provided helper widget ResultAccumulator is used to collect subscription results, according to graphql_flutter’s pub.dev page.

    Conclusion

    This blog intends to help you understand what makes GraphQL so powerful, how to use it in Flutter, and how to take advantage of the reactive nature of graphql_flutter. You can now take the first steps in building your applications with GraphQL!

  • Setting Up A Single Sign On (SSO) Environment For Your App

    Single Sign On (SSO) makes it simple for users to begin using an application. Support for SSO is crucial for enterprise apps, as many corporate security policies mandate that all applications use certified SSO mechanisms. While the SSO experience is straightforward, the SSO standard is anything but straightforward. It’s easy to get confused when you’re surrounded by complex jargon, including SAML, OAuth 1.0, 1.0a, 2.0, OpenID, OpenID Connect, JWT, and tokens like refresh tokens, access tokens, bearer tokens, and authorization tokens. Standards documentation is too precise to allow generalization, and vendor literature can make you believe it’s too difficult to do it yourself.

    I’ve built SSO for many applications in the past. Knowing your target market, the relevant standards, and your platform is crucial.

    Single Sign On

    Single Sign On is an authentication method that allows apps to securely authenticate users into numerous applications by using just one set of login credentials.

    This allows applications to avoid the hassle of storing and managing user information like passwords and also cuts down on troubleshooting login-related issues. With SSO configured, applications check with the SSO provider (Okta, Google, Salesforce, Microsoft) if the user’s identity can be verified.

    Types of SSO

    • Security Assertion Markup Language (SAML)
    • OpenID Connect (OIDC)
    • OAuth (specifically OAuth 2.0 nowadays)
    • Federated Identity Management (FIM)

    Security Assertion Markup Language – SAML

    SAML (Security Assertion Markup Language) is an open standard that enables identity providers (IdP) to send authorization credentials to service providers (SP). Meaning you can use one set of credentials to log in to many different websites. It’s considerably easier to manage a single login per user than to handle several logins to email, CRM software, Active Directory, and other systems.

    For standardized interactions between the identity provider and service providers, SAML transactions employ Extensible Markup Language (XML). SAML is the link between a user’s identity authentication and authorization to use a service.

    In our example implementation, we will be using SAML 2.0 as the standard for the authentication flow.
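    For orientation, here is a trimmed, illustrative SAML 2.0 assertion. All values are placeholders, and a real assertion also carries Conditions and an XML Signature:

    ```xml
    <saml:Assertion xmlns:saml="urn:oasis:names:tc:SAML:2.0:assertion"
                    ID="_abc123" Version="2.0" IssueInstant="2021-01-01T00:00:00Z">
      <saml:Issuer>https://idp.example.com</saml:Issuer>
      <saml:Subject>
        <saml:NameID Format="urn:oasis:names:tc:SAML:1.1:nameid-format:emailAddress">user@example.com</saml:NameID>
      </saml:Subject>
      <saml:AttributeStatement>
        <saml:Attribute Name="firstName">
          <saml:AttributeValue>Jane</saml:AttributeValue>
        </saml:Attribute>
      </saml:AttributeStatement>
    </saml:Assertion>
    ```

    The IdP signs this XML document, and the SP validates the signature against the IdP's public certificate before trusting the asserted identity.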

    Technical details

    • A Service Provider (SP) is the entity that provides the service, which is in the form of an application. Examples: Active Directory, Okta Inbuilt IdP, Salesforce IdP, Google Suite.
    • An Identity Provider (IdP) is the entity that provides identities, including the ability to authenticate a user. The user profile is typically stored in the Identity Provider, along with additional information about the user such as first name, last name, job code, phone number, and address. Depending on the application, some service providers might require a very simple profile (username, email), while others may need a richer set of user data (department, job code, address, location, and so on). Examples: Google – GDrive, Meet, Gmail.
    • The SAML sign-in flow initiated by the Identity Provider is referred to as an Identity Provider Initiated (IdP-initiated) sign-in. In this flow, the Identity Provider begins a SAML response that is routed to the Service Provider to assert the user’s identity, rather than the SAML flow being triggered by redirection from the Service Provider. When a Service Provider initiates the SAML sign-in process, it is referred to as an SP-initiated sign-in. When end-users try to access a protected resource, such as when the browser tries to load a page from a protected network share, this is often triggered.

    Configuration details

    • Certificate – To validate the signature, the SP must receive the IdP’s public certificate. On the SP side, the certificate is kept and used anytime a SAML response is received.
    • Assertion Consumer Service (ACS) Endpoint – The SP sign-in URL is sometimes referred to simply as the URL. This is the endpoint supplied by the SP for posting SAML responses. This information must be sent by the SP to the IdP.
    • IdP Sign-in URL – This is the endpoint where SAML requests are posted on the IdP side. This information must be obtained by the SP from the IdP.

    OpenID Connect – OIDC

    OIDC protocol is based on the OAuth 2.0 framework. OIDC authenticates the identity of a specific user, while OAuth 2.0 allows two applications to trust each other and exchange data.

    So, while the main flow appears to be the same, the labels are different.

    How are SAML and OIDC similar?

    The basic login flow for both is the same.

    1. A user tries to log into the application directly.

    2. The program sends the user’s login request to the IdP via the browser.

    3. The user logs in to the IdP or confirms that they are already logged in.

    4. The IdP verifies that the user has permission to use the program that initiated the request.

    5. Information about the user is sent from the IdP to the user’s browser.

    6. Their data is subsequently forwarded to the application.

    7. The application verifies that they have permission to use the resources.

    8. The user has been granted access to the program.

    Difference between SAML and OIDC

    1. SAML transmits user data in XML, while OpenID Connect transmits data in JSON.

    2. SAML calls the data it sends an assertion. OIDC calls the data it sends claims.

    3. In SAML, the application or system the user is trying to get into is referred to as the Service Provider. In OIDC, it’s called the Relying Party.

    SAML vs. OIDC

    1. OpenID Connect is becoming increasingly popular. Because it interacts with RESTful API endpoints, it is easier to build than SAML and is easily available through APIs. This also implies that it is considerably more compatible with mobile apps.

    2. You won’t often have a choice between SAML and OIDC when configuring Single Sign On (SSO) for an application through an identity provider like OneLogin. If you do have a choice, it is important to understand not only the differences between the two, but also which one is more likely to be sustained over time. OIDC appears to be the clear winner at this time because developers find it much easier to work with as it is more versatile.

    Use Cases

    1. SAML with OIDC:

    – Log in with Salesforce: SAML Authentication where Salesforce was used as IdP and the web application as an SP.

    Key Reason:

    All users are centrally managed in Salesforce, so SAML was the preferred choice for authentication.

    – Log in with Okta: OIDC Authentication where Okta is used as the IdP and the web application as the SP.

    Key Reason:

    Okta is already used with Active Directory (AD) for provisioning and de-provisioning all internal users and employees; Okta’s AD integration lets them connect Okta to any on-premises AD.

    In both implementations, user provisioning and de-provisioning take place on the IdP side.

    SP-initiated (From web application)

    IdP-initiated (From Okta Active Directory)

    2. Only OIDC login flow:

    • OIDC Authentication where Google, Salesforce, Office365, and Okta are used as IdP and the web application as SP.

    Why not use OAuth for SSO

    1. OAuth 2.0 is not a protocol for authentication. It explicitly states this in its documentation.

    2. With authentication, you’re basically trying to figure out who the user is, when they authenticated, and how they authenticated. These questions are usually answered with SAML assertions rather than access tokens and permission grants.

    OIDC vs. OAuth 2.0

    • OAuth 2.0 is a framework that allows a user of a service to grant third-party application access to the service’s data without revealing the user’s credentials (ID and password).
    • OpenID Connect is a framework on top of OAuth 2.0 where a third-party application can obtain a user’s identity information which is managed by a service. OpenID Connect can be used for SSO.
    • In OAuth flow, Authorization Server gives back Access Token only. In the OpenID flow, the Authorization server returns Access Code and ID Token. A JSON Web Token, or JWT, is a specially formatted string of characters that serves as an ID Token. The Client can extract information from the JWT, such as your ID, name, when you logged in, the expiration of the ID Token, and if the JWT has been tampered with.
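    The last point can be illustrated with a short Node.js sketch. The claims below are made-up placeholders, and a real client must verify the token's signature before trusting it; the point is only that a JWT's payload segment is base64url-encoded JSON the client can read directly:

    ```javascript
    // Build an unsigned demo token locally so we have something to decode.
    const header = { alg: "none", typ: "JWT" };
    const payload = { sub: "1234", name: "Jane Doe", iat: 1700000000, exp: 1700003600 };

    const b64 = (obj) => Buffer.from(JSON.stringify(obj)).toString("base64url");
    const idToken = `${b64(header)}.${b64(payload)}.`; // header.payload.signature

    // Extracting claims: decode the middle (payload) segment of the token.
    const claims = JSON.parse(
      Buffer.from(idToken.split(".")[1], "base64url").toString("utf8")
    );

    console.log(claims.name); // prints "Jane Doe"
    ```

    In a real OIDC flow, the ID Token arrives from the Authorization Server already signed, and libraries handle both signature verification and claim extraction.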

    Federated Identity Management (FIM)

    Identity Federation, also known as federated identity management, is a system that allows users from different companies to utilize the same verification method for access to apps and other resources.

    In short, it’s what allows you to sign in to Spotify with your Facebook account.

    • Single Sign On (SSO) is a subset of the identity federation.
    • SSO generally enables users to use a single set of credentials to access multiple systems within a single organization, while FIM enables users to access systems across different organizations.

    How does FIM work?

    • Users first authenticate to the security domain of their home network.
    • After authenticating to their home domain, they attempt to connect to a remote application that employs identity federation.
    • Instead of the remote application authenticating the user itself, the user is redirected to authenticate against their home authentication server.
    • The home authentication server then asserts the user’s identity to the remote application, and the user is permitted to access the app.

    A user logs in to their home domain once; remote apps in other domains can then grant access to the user without an additional login.

    Applications:

    • Auth0: Auth0 uses OpenID Connect and OAuth 2.0 to authenticate users and get their permission to access protected resources. It lets developers design and deploy applications and APIs that handle OIDC/OAuth 2.0 authentication and authorization with ease.
    • AWS Cognito
    • User pools – In Amazon Cognito, a user pool is a user directory. Your users can sign in to your web or mobile app through Amazon Cognito, or federate through a third-party identity provider (IdP). All members of the user pool have a directory profile that you may access using an SDK, whether they sign in directly or through a third party.
    • Identity pools – An identity pool allows your users to get temporary AWS credentials for services like Amazon S3 and DynamoDB.

    Conclusion:

    I hope you found the summary of my SSO research beneficial. The optimum implementation approach is determined by your unique situation, technological architecture, and business requirements.

  • Improving Elasticsearch Indexing in the Rails Model using Searchkick

    Searching has become a prominent feature of any web application, and a relevant search feature requires a robust search engine. The search engine should be capable of performing a full-text search, auto completion, providing suggestions, spelling corrections, fuzzy search, and analytics. 

    Elasticsearch, a distributed, fast, and scalable search and analytic engine, takes care of all these basic search requirements.

    The focus of this post is using a few approaches with Elasticsearch in our Rails application to reduce time latency for web requests. Let’s review one of the best ways to improve the Elasticsearch indexing in Rails models by moving them to background jobs.

    In a Rails application, Elasticsearch is commonly integrated through one of several popular gems, such as Searchkick, elasticsearch-rails, or Chewy.

    We could continue with any of these gems, but for this post, we will be moving forward with the Searchkick gem, which is the most Rails-friendly option.

    By default, Searchkick uses model callbacks to sync data into the corresponding Elasticsearch index. Because this happens inside the callbacks, any request that creates or updates a resource takes additional time to process.

    The below image shows logs from a Rails application, captured for an update request of a user record. We have added a print statement before Elasticsearch tries to sync in the Rails model so that it helps identify from the logs where the indexing has started. These logs show that the last two queries were executed for indexing the data in the Elasticsearch index.

    Since the Elasticsearch sync happens while updating a user record, we can conclude that the user update request takes additional time to cover the Elasticsearch sync.

    Below is the request flow diagram:

    From the request flow diagram, we can see that the end-user must wait for steps 3 and 4 to complete. Step 3 fetches the child objects’ details from the database.

    To tackle the problem, we can move the Elasticsearch indexing to the background jobs. Usually, for Rails apps in production, there are separate app servers, database servers, background job processing servers, and Elasticsearch servers (in this scenario).

    This is how the request flow looks when we move Elasticsearch indexing:

    Let’s get to coding!

    For demo purposes, we will have a Rails app with models: `User` and `Blogpost`. The stack used here:

    • Rails 5.2
    • Elasticsearch 6.6.7
    • MySQL 5.6
    • Searchkick (gem for writing Elasticsearch queries in Ruby)
    • Sidekiq (gem for background processing)

    This approach does not require any specific version of Rails, Elasticsearch, or MySQL. Moreover, it is database agnostic. You can go through the code in this GitHub repo for reference.

    Let’s take a look at the user model with Elasticsearch index.

    # == Schema Information
    #
    # Table name: users
    #
    #  id            :bigint           not null, primary key
    #  name          :string(255)
    #  email         :string(255)
    #  mobile_number :string(255)
    #  created_at    :datetime         not null
    #  updated_at    :datetime         not null
    #
    class User < ApplicationRecord
     searchkick
    
     has_many :blogposts
     def search_data
       {
         name: name,
         email: email,
         total_blogposts: blogposts.count,
         last_published_blogpost_date: last_published_blogpost_date
       }
     end
     ...
    end

    Anytime a user object is inserted, updated, or deleted, Searchkick reindexes the data in the Elasticsearch user index synchronously.

    Searchkick already provides four ways to sync Elasticsearch index:

    • Inline (default)
    • Asynchronous
    • Queuing
    • Manual

    For more detailed information, refer to this page. In this post, we are looking into the manual approach to reindexing the model data.

    To manually reindex, the user model will look like:

    class User < ApplicationRecord
     searchkick callbacks: false
    
     def search_data
       ...
     end
    end

    Now, we will need to define a callback that can sync the data to the Elasticsearch index. Typically, this callback must be written in all the models that have the Elasticsearch index. Instead, we can write a common concern and include it to required models.

    Here is what our concern will look like:

    module ElasticsearchIndexer
     extend ActiveSupport::Concern
    
     included do
       after_commit :reindex_model
       def reindex_model
         ElasticsearchWorker.perform_async(self.id, self.class.name)
       end
     end
    end

    In the above ActiveSupport concern, we call the Sidekiq worker named ElasticsearchWorker. After adding this concern, don’t forget to include it in the user model, like so:

    include ElasticsearchIndexer

    Now, let’s see the Elasticsearch Sidekiq worker:

    class ElasticsearchWorker
     include Sidekiq::Worker
     def perform(id, klass)
       begin
         klass.constantize.find(id.to_s).reindex
       rescue => e
         # Handle exception
       end
     end
    end

    That’s it, we’ve done it. Cool, huh? Now, whenever a web request creates, updates, or deletes a user record, a background job will be enqueued. The job can be seen in the Sidekiq web UI at localhost:3000/sidekiq.

    Now, there is a little problem with the ElasticsearchIndexer concern: it enqueues a job even when nothing has changed. To reproduce this, go to your user edit page, click save without changing anything, and look at localhost:3000/sidekiq. A job will still be queued.

    We can handle this case by tracking the dirty attributes. 

    module ElasticsearchIndexer
     extend ActiveSupport::Concern

     included do
       after_commit :reindex_model

       def reindex_model
         # Skip enqueuing a job when no attributes actually changed.
         return if self.previous_changes.keys.blank?
         ElasticsearchWorker.perform_async(self.id, self.class.name)
       end
     end
    end

    Furthermore, there are a few more areas of improvement. Suppose you update a field of the user model that is not part of the Elasticsearch index: the ElasticsearchWorker Sidekiq job will still be created and will reindex the associated model object. We can instead enqueue the indexing job only when fields that are part of the Elasticsearch index were updated.

    module ElasticsearchIndexer
     extend ActiveSupport::Concern

     included do
       after_commit :reindex_model

       def reindex_model
         updated_fields = self.previous_changes.keys

         # For the ES index fields, you can also maintain a constant
         # on the model or derive them from the search_data method.
         es_index_fields = self.search_data.stringify_keys.keys
         return if (updated_fields & es_index_fields).blank?
         ElasticsearchWorker.perform_async(self.id, self.class.name)
       end
     end
    end

    Conclusion

    Moving the Elasticsearch indexing to background jobs is a great way to boost the performance of the web app by reducing the response time of any web request. Implementing this approach for every model would not be ideal. I would recommend this approach only if the Elasticsearch index data are not needed in real-time.

    Since the execution of background jobs depends on the number of jobs queued, it might take time for changes to be reflected in the Elasticsearch index if there are lots of jobs waiting. To solve this problem to some extent, the Elasticsearch indexing jobs can be added to a high-priority queue. Also, this approach works best when the background job processing server is separate from the app server.

  • How To Use Inline Functions In React Applications Efficiently

    This blog post explores the performance cost of inline functions in a React application. Before we begin, let’s try to understand what inline function means in the context of a React application.

    What is an inline function?

    Simply put, an inline function is a function that is defined and passed down inside the render method of a React component.

    Let’s understand this with a basic example of what an inline function might look like in a React application:

    export default class CounterApp extends React.Component {
      constructor(props) {
        super(props);
        this.state = { count: 0 };
      }
      render() {
        return (
          <div className="App">
            <button
              onClick={() => {
              this.setState({ count: this.state.count + 1 });
              }}
            >COUNT ({this.state.count})</button>
          </div>
        );
      }
    }

    The onClick prop, in the example above, is being passed as an inline function that calls this.setState. The function is defined within the render method, often inline with JSX. In the context of React applications, this is a very popular and widely used pattern.

    Let’s begin by listing some common patterns and techniques where inline functions are used in a React application:

    • Render prop: A component prop that expects a function as a value. This function must return a JSX element, hence the name. Render prop is a good candidate for inline functions.
    render() {
      return (
        <ListView
          items={items}
          render={({ item }) => (<div>{item.label}</div>)}
        />
      );
    }

    • DOM event handlers: DOM event handlers often make a call to setState or invoke some effect in the React application such as sending data to an API server.
    <button
      onClick={() => {
        this.setState({ count: this.state.count + 1 });
      }}>
      COUNT ({this.state.count})
    </button>

    • Custom function or event handlers passed to child: Oftentimes, a child component requires a custom event handler to be passed down as props. An inline function is usually used in this scenario.
    <Button onTap={() => {
      this.nextPage();
    }}>Next</Button>

    Alternatives to inline function

    • Bind in constructor: One of the most common patterns is to define the function within the class component and then bind context to the function in constructor. We only need to bind the current context if we want to use this keyword inside the handler function.
    export default class CounterApp extends React.Component {
      constructor(props) {
        super(props);
        this.state = { count: 0 };
        this.increaseCount = this.increaseCount.bind(this);
      }
    
      increaseCount() {
        this.setState({ count: this.state.count + 1 });
      }
    
      render() {
        return (
          <div className="App">
            <button onClick={this.increaseCount}>COUNT ({this.state.count})</button>
          </div>
        );
     }
    }

    • Bind in render: Another common pattern is to bind the context inline when the function is passed down. Eventually, this gets repetitive and hence the first approach is more popular.
    render() {
      return (
        <div className="App">
          <button onClick={this.increaseCount.bind(this)}>COUNT ({this.state.count})</button>
        </div>
      );
    }

    • Define as a public class field: with class properties syntax, the handler is defined as an arrow function, so it is bound automatically and created only once per instance.
    increaseCount = () => {
      this.setState({ count: this.state.count + 1 });
    };
    
    render() {
      return (
        <div className="App">
          <button onClick={this.increaseCount}>
            COUNT ({this.state.count})
          </button>
        </div>
      );
    }

    There are several other approaches that React dev community has come up with, like using a helper method to bind all functions automatically in the constructor.
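    Such a helper could look like the following minimal sketch. Note that `bindAll` is a hypothetical name for illustration (real auto-bind libraries differ in details); the example uses a plain class rather than a React component so it stands alone:

    ```javascript
    // Hypothetical helper: binds every own method of the prototype to the
    // instance, replacing a list of this.x = this.x.bind(this) statements.
    function bindAll(instance) {
      const proto = Object.getPrototypeOf(instance);
      Object.getOwnPropertyNames(proto)
        .filter((name) => name !== 'constructor' && typeof proto[name] === 'function')
        .forEach((name) => {
          instance[name] = proto[name].bind(instance);
        });
    }

    class Counter {
      constructor() {
        this.count = 0;
        bindAll(this); // one call instead of one .bind per handler
      }
      increaseCount() {
        this.count += 1;
      }
    }

    const counter = new Counter();
    const handler = counter.increaseCount; // detached, as when passed as a prop
    handler(); // still updates the right instance
    console.log(counter.count); // 1
    ```

    The tradeoff is that the binding happens implicitly, which can surprise readers who expect the explicit constructor pattern.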

    After understanding inline functions with its examples and also taking a look at a few alternatives, let’s see why inline functions are so popular and widely used.

    Why use inline function

    Inline function definitions are right where they are invoked or passed down. This means inline functions are easier to write, especially when the body of the function consists of only a few instructions, such as calling setState. This works well within loops as well.

    For example, when rendering a list and assigning a DOM event handler to each list item, passing down an inline function feels much more intuitive. For the same reason, inline functions also make code more organized and readable.

    Inline arrow functions preserve context, which means developers can use this without having to worry about the current execution context or explicitly binding a context to the function.

    <Button onTap={() => {
      this.prevPage();
    }}>Previous</Button>

    Inline functions make values from the parent scope available within the function definition. This results in more intuitive code, and developers need to pass down fewer parameters. Let’s understand this with an example.

    render() {
      const { count } = this.state;
      return (
        <div className="App">
          <button
            onClick={() => {
              this.setState({ count: count + 1 });
          }}>
            COUNT ({count})
          </button>
        </div>
      );
    }

    Here, the value of count is readily available to onClick event handlers. This behavior is called closing over.
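    The same mechanism can be shown outside React with a plain JavaScript sketch, where an inner function keeps access to a variable from its enclosing scope after that scope has returned:

    ```javascript
    // The returned arrow function "closes over" count from the enclosing
    // scope, just as an inline handler closes over render-scope values.
    function makeHandler(count) {
      return () => count + 1; // count stays reachable after makeHandler returns
    }

    const handler = makeHandler(41);
    console.log(handler()); // 42
    ```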

    For these reasons, React developers make use of inline functions heavily. That said, inline function has also been a hot topic of debate because of performance concerns. Let’s take a look at a few of these arguments.

    Arguments against inline functions

    • A new function is defined every time the render method is called. It results in frequent garbage collection, and hence performance loss.
    • There is an ESLint rule, jsx-no-bind, that advises against using inline functions. The idea behind this rule is that when an inline function is passed down to a child component, React uses reference checks to decide whether to re-render it. Since a new inline function is created on every render, its reference never matches the previous one, and the child component can end up re-rendering again and again.

    <ListItem onClick={() => console.log('click')} />

    Suppose the ListItem component implements the shouldComponentUpdate method, where it checks the onClick prop by reference. Since a new inline function is created every time the parent re-renders, ListItem receives a function that points to a different location in memory each time. The reference comparison in shouldComponentUpdate fails and tells React to re-render ListItem even though the inline function’s behavior doesn’t change. This results in unnecessary DOM updates and eventually reduces the performance of the application.
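    The root cause is easy to demonstrate in plain JavaScript: two textually identical arrow functions are still two distinct objects, so a reference check always sees a "new" prop:

    ```javascript
    // Simulates what happens on each render: the props object is rebuilt
    // and the inline handler is a brand-new function object every time.
    const makeRenderProps = () => ({
      onClick: () => console.log('click'),
    });

    const firstRender = makeRenderProps();
    const secondRender = makeRenderProps();

    console.log(firstRender.onClick === secondRender.onClick); // false
    console.log(typeof firstRender.onClick === typeof secondRender.onClick); // true
    ```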

    Performance concerns revolving around the Function.prototype.bind method: when not using arrow functions, an inline function that uses the this keyword must be bound to a context before being passed down. Calling .bind before passing down an inline function used to raise performance concerns, but the underlying engine-level slowness has since been fixed. For older browsers, Function.prototype.bind can be supplemented with a performance polyfill.

    Now that we’ve summarized a few arguments in favor of inline functions and a few arguments against it, let’s investigate and see how inline functions really perform.

    render() {
      return (
        <div>
          {this.state.timeThen > this.state.timeNow ? (
           <>
             <button onClick={() => { /* some action */ }} />
             <button onClick={() => { /* another action */ }} />
           </>
          ) : (
            <button onClick={() => { /* yet another action */ }} />
          )}
        </div>
      );
    }

    Premature optimization can often lead to bad code. For instance, let’s try to get rid of all the inline function definitions in the component above and move them to the constructor because of performance concerns.

    We’d then have to define 3 custom event handlers in the class definition and bind context to all three functions in the constructor.

    export default class CounterApp extends React.Component {
      constructor(props) {
        super(props);
        this.state = {
          timeThen: ...,
          timeNow: Date.now()
        };
        this.someAction = this.someAction.bind(this);
        this.anotherAction = this.anotherAction.bind(this);
        this.yetAnotherAction = this.yetAnotherAction.bind(this);
      }
    
      someAction() { /* some action */ }
      anotherAction() { /* another action */ }
      yetAnotherAction() { /* yet another action */ }
    
      render() {
        return (<div>
          {this.state.timeThen > this.state.timeNow ? (
            <>
              <button onClick={this.someAction} />
              <button onClick={this.anotherAction} />
            </>
          ) : (
            <button onClick={this.yetAnotherAction} />
          )}</div>);
      }
    }

    This would increase the initialization time of the component significantly as opposed to inline function declarations where only one or two functions are defined and used at a time based on the result of condition timeThen > timeNow.

    Concerns around render props: A render prop is a method that returns a React element and is used to share state among React components.

    Render props are meant to be invoked on each render since they share state between parent components and enclosed React elements. Inline functions are a good candidate for use in render prop and won’t cause any performance concern.

    render() {
      return (
        <ListView
          items={items}
          render={({ item }) => (<div>{item.label}</div>)}
        />
      )
    }

    Here, the render prop of the ListView component returns a label enclosed in a div. Since the enclosed component can never know what the value of the item variable is, it can never be a PureComponent or have a meaningful implementation of shouldComponentUpdate(). This eliminates the concerns around using inline functions in render props; in fact, it promotes them in most cases.

    In my experience, inline render props can sometimes be harder to maintain, especially when the render prop returns a larger, more complicated component in terms of code size. In such cases, breaking down the component further or having a separate method that gets passed down as the render prop has worked well for me.

    Concerns around PureComponents and shouldComponentUpdate(): Pure components and various implementations of shouldComponentUpdate both do a strict type comparison of props and state. These act as performance enhancers by letting React know when or when not to trigger a render based on changes to state and props. Since inline functions are created on every render, when such a method is passed down to a pure component or a component that implements the shouldComponentUpdate method, it can lead to an unnecessary render. This is because of the changed reference of the inline function.

    To overcome this, consider skipping checks on all function props in shouldComponentUpdate(). This assumes that inline functions passed to the component are only different in reference and not behavior. If there is a difference in the behavior of the function passed down, it will result in a missed render and eventually lead to bugs in the component’s state and effects.
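    The idea above can be sketched as a shallow prop comparison that deliberately ignores function-valued props. Here `shouldUpdate` is a hypothetical helper, under the stated assumption that function props differ only by reference, not behavior:

    ```javascript
    // Shallow-compare props, skipping any prop that is a function on both
    // sides. Returns true when a re-render is warranted.
    function shouldUpdate(prevProps, nextProps) {
      const keys = new Set([...Object.keys(prevProps), ...Object.keys(nextProps)]);
      for (const key of keys) {
        const prev = prevProps[key];
        const next = nextProps[key];
        // Assume functions only differ by reference; don't compare them.
        if (typeof prev === 'function' && typeof next === 'function') continue;
        if (prev !== next) return true; // a non-function prop changed
      }
      return false;
    }

    // Inside a class component this might be wired up as:
    //   shouldComponentUpdate(nextProps) { return shouldUpdate(this.props, nextProps); }

    const prev = { count: 1, onClick: () => {} };
    console.log(shouldUpdate(prev, { count: 1, onClick: () => {} })); // false
    console.log(shouldUpdate(prev, { count: 2, onClick: () => {} })); // true
    ```

    As the text warns, this trades correctness for speed: a function whose behavior genuinely changed between renders would be silently ignored.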

    Conclusion

    A common rule of thumb is to measure the performance of the app and only optimize if needed. The performance impact of inline functions, often categorized as a micro-optimization, is always a tradeoff between code readability, code organization, and performance gain that must be thought out carefully on a case-by-case basis; premature optimization should be avoided.

    In this blog post, we observed that inline functions don’t necessarily bring a lot of performance cost. They are widely used because they are easy to write, read, and organize, especially when the function definitions are short and simple.

  • Installing Redis Cluster with Persistent Storage on Mesosphere DC/OS

    In the first part of this blog, we saw how to install a standalone Redis service on DC/OS with persistent storage using RexRay and AWS EBS volumes.

    A single server is a single point of failure in any system, so to ensure high availability of the Redis database, we can deploy a master-slave cluster of Redis servers. In this blog, we will see how to set up such a 6-node (3 master, 3 slave) Redis cluster and persist data using RexRay and AWS EBS volumes. After that, we will see how to import existing data into this cluster.

    Redis Cluster

    It is a form of replicated Redis servers in a multi-master architecture. All data is sharded into 16384 slots; every master node is assigned a subset of those slots (generally evenly sharded), and each master is replicated by its slaves. It provides more resilience and scaling for production-grade deployments where a heavy workload is expected. Applications can connect to any node in cluster mode, and the request will be redirected to the respective master node.

    [Figure: Redis cluster architecture. Source: Octo]

    Objective: To create a Redis cluster from a number of services in a DCOS environment with persistent storage, and to import existing Redis dump.rdb data into the cluster.

    Prerequisites:

    • Make sure the rexray component is running and in a healthy state in the DCOS cluster.

    Steps:

    • As per the Redis docs, a minimal cluster should have at least 3 master and 3 slave nodes, making it a total of 6 Redis services.
    • All services will use a similar JSON configuration, except for changes in the names of the service, external volume, and port mappings.
    • We will deploy one Redis service for each Redis cluster node, and once all services are running, we will form the cluster among them.
    • We will use the host network for the Redis node containers, and for that we will restrict each Redis node to run on a particular DCOS node. This gives every Redis node a fixed IP, so we can restart it at any time without data loss, which helps in troubleshooting the cluster.
    • Using the host network adds a prerequisite: the number of DCOS nodes must be >= the number of Redis nodes.
    1. First create the Redis node services on DCOS:
    • Click on the Add button in the Services tab of the DCOS UI.
    • Click on JSON configuration.
    • Add the below JSON config for the Redis service, replacing the values which are written in BLOCK letters with # as prefix and suffix:
    • #NODENAME# – Name of the Redis node (Ex. redis-node-1)
    • #NODEHOSTIP# – IP of the DCOS node on which this Redis node will run. This IP must be unique for each Redis node. (Ex. 10.2.12.23)
    • #VOLUMENAME# – Name of the persistent volume; give a name that identifies the volume on AWS EBS (Ex. <dcos cluster name>-redis-node-<node number>)
    • #NODEVIP# – VIP for the Redis node. It must be ‘Redis’ for the first Redis node; for the others it can be the same as NODENAME (Ex. redis-node-2)
    {
       "id": "/#NODENAME#",
       "backoffFactor": 1.15,
       "backoffSeconds": 1,
       "constraints": [
         [
           "hostname",
           "CLUSTER",
           "#NODEHOSTIP#"
         ]
       ],
       "container": {
         "type": "DOCKER",
         "volumes": [
           {
             "external": {
               "name": "#VOLUMENAME#",
               "provider": "dvdi",
               "options": {
                 "dvdi/driver": "rexray"
               }
             },
             "mode": "RW",
             "containerPath": "/data"
           }
         ],
         "docker": {
           "image": "parvezkazi13/redis:latest",
           "forcePullImage": false,
           "privileged": false,
           "parameters": []
         }
       },
       "cpus": 0.5,
       "disk": 0,
       "fetch": [],
       "healthChecks": [],
       "instances": 1,
       "maxLaunchDelaySeconds": 3600,
       "mem": 4096,
       "gpus": 0,
       "networks": [
         {
           "mode": "host"
         }
       ],
       "portDefinitions": [
         {
           "labels": {
             "VIP_0": "/#NODEVIP#:6379"
           },
           "name": "#NODEVIP#",
           "protocol": "tcp",
           "port": 6379
         }
       ],
       "requirePorts": true,
       "upgradeStrategy": {
         "maximumOverCapacity": 0,
         "minimumHealthCapacity": 0.5
       },
       "killSelection": "YOUNGEST_FIRST",
       "unreachableStrategy": {
         "inactiveAfterSeconds": 300,
         "expungeAfterSeconds": 600
       }
     }

    • After updating the highlighted fields, copy the above JSON into the JSON configuration box and click the ‘Review & Run’ button in the right corner; this will start the service with the above configuration.
    • Once the above service is up and running, repeat the same steps for each Redis node with the respective values for the highlighted fields.
    • So if we go with a 6-node cluster, at the end we will have 6 Redis nodes up and running.

    Note: Since we are using an external volume for persistent storage, we cannot scale our services, i.e. each service can have at most one instance. If we try to scale, we will get an error.

    2. Form the Redis cluster between Redis node services:

    • To create or manage the Redis cluster, first deploy the redis-cluster-util container on DCOS using the below JSON config:
    {
     "id": "/infrastructure/redis-cluster-util",
     "backoffFactor": 1.15,
     "backoffSeconds": 1,
     "constraints": [],
     "container": {
       "type": "DOCKER",
       "volumes": [
         {
           "containerPath": "/backup",
           "hostPath": "backups",
           "mode": "RW"
         }
       ],
       "docker": {
         "image": "parvezkazi13/redis-util",
         "forcePullImage": true,
         "privileged": false,
         "parameters": []
       }
     },
     "cpus": 0.25,
     "disk": 0,
     "fetch": [],
     "instances": 1,
     "maxLaunchDelaySeconds": 3600,
     "mem": 4096,
     "gpus": 0,
     "networks": [
       {
         "mode": "host"
       }
     ],
     "portDefinitions": [],
     "requirePorts": true,
     "upgradeStrategy": {
       "maximumOverCapacity": 0,
       "minimumHealthCapacity": 0.5
     },
     "killSelection": "YOUNGEST_FIRST",
     "unreachableStrategy": {
       "inactiveAfterSeconds": 300,
       "expungeAfterSeconds": 600
     },
     "healthChecks": []
    }

    This will start the redis-cluster-util service.

    • Get the IP addresses of all Redis nodes to form the cluster, as a Redis cluster cannot be created with node hostnames / DNS. This is an open issue.

    Since we are using the host network, we need the IP of the DCOS node on which each Redis node is running.

    Get all Redis nodes IP using:

    NODE_BASE_NAME=redis-node
    dcos task $NODE_BASE_NAME | grep -E "$NODE_BASE_NAME-[0-9]" | awk '{print $2":6379"}' | paste -s -d' '

    Here redis-node is the prefix used for all Redis nodes.

    Note the output of this command, we will use it in further steps.

    • Get the node where the redis-cluster-util container is running and ssh into that DCOS node using:
    dcos node ssh --master-proxy --private-ip $(dcos task | grep "redis-cluster-util" | awk '{print $2}')

    • Now find the docker container id of redis-cluster-util and exec it using:
    docker exec -it $(docker ps -qf ancestor="parvezkazi13/redis-util") bash  

    • Now we are inside the redis-cluster-util container. Run the below command to form the Redis cluster.
    redis-trib.rb create --replicas 1 <Space separated IP address:PORT pair of all Redis nodes>

    • Here, use the Redis node IP addresses retrieved in the earlier step:
    redis-trib.rb create --replicas 1 10.0.1.90:6379 10.0.0.19:6379 10.0.9.203:6379 10.0.9.79:6379 10.0.3.199:6379 10.0.9.104:6379

    • Parameters:
    • The option --replicas 1 means that we want a slave for every master created.
    • The other arguments are the list of addresses (host:port) of the instances we want to use to create the new cluster.
    • Output:
    • Select ‘yes’ when it prompts to set the slot configuration shown.
    • Run the below command to check the status of the newly created cluster:
    redis-trib.rb check <Any redis node host:PORT>
    Ex:
    redis-trib.rb check 10.0.1.90:6379

    • Parameters:
    • host:port of any node from the cluster.
    • Output:
    • If all OK, it will show OK with status, else it will show ERR with the error message.

    3. Import existing dump.rdb to Redis cluster

    • At this point, all the Redis nodes should be empty and each one should have an ID and some assigned slots:

    Before reusing existing dump data, we have to reshard all slots to one instance. We specify the number of slots to move (all, so 16384), the ID of the node we move them to (here node 1, 10.0.1.90:6379), and where we take these slots from (all other nodes).

    redis-trib.rb reshard 10.0.1.90:6379  

    Parameters:

    host:port of any node from the cluster.

    Output:

    It will prompt for the number of slots to move – here all, i.e. 16384.

    Receiving node ID – here the ID of node 10.0.1.90:6379 (redis-node-1).

    Source node IDs – here all, as we want to move all slots to one node.

    And a prompt to proceed – enter ‘yes’.

    • Now check again node 10.0.1.90:6379  
    redis-trib.rb check 10.0.1.90:6379  

    Parameters: host:port of any node from the cluster.

    Output: it will show all (16384) slots moved to node 10.0.1.90:6379

    • The next step is importing our existing Redis dump data.

    Now copy the existing dump.rdb to our redis-cluster-util container using below steps:

    – Copy the existing dump.rdb to the DCOS node on which the redis-cluster-util container is running. You can use scp from any other public server to the DCOS node.

    – Now that we have dump.rdb on our DCOS node, copy it to the redis-cluster-util container using the below command:

    docker cp dump.rdb $(docker ps -qf ancestor="parvezkazi13/redis-util"):/data

    Now that we have dump.rdb in our redis-cluster-util container, we can import it to our Redis cluster. Exec into the redis-cluster-util container using:

    docker exec -it $(docker ps -qf ancestor="parvezkazi13/redis-util") bash

    It will exec into the already running redis-cluster-util container and start a bash shell.

    Run below command to import dump.rdb to Redis cluster:

    rdb --command protocol /data/dump.rdb | redis-cli --pipe -h 10.0.1.90 -p 6379

    Parameters:

    Path to dump.rdb

    host:port of any node from the cluster.

    Output:

    If successful, you’ll see something like:

    All data transferred. Waiting for the last reply...
    Last reply received from server.
    errors: 0, replies: 4341259

    as well as this in the Redis server logs:

    95086:M 01 Mar 21:53:42.071 * 10000 changes in 60 seconds. Saving...
    95086:M 01 Mar 21:53:42.072 * Background saving started by pid 98223
    98223:C 01 Mar 21:53:44.277 * DB saved on disk

    WARNING:
    Just as an Oracle DB instance can have multiple databases, Redis saves keys in keyspaces.
    When Redis is in cluster mode, it does not accept dumps that have more than one keyspace. As per the documentation:

    “Redis Cluster does not support multiple databases like the stand alone version of Redis. There is just database 0 and the SELECT command is not allowed.”

    So while importing such a multi-keyspace Redis dump, the server fails on startup with the below error:

    23049:M 16 Mar 17:21:17.772 * DB loaded from disk: 5.222 seconds
    23049:M 16 Mar 17:21:17.772 # You can't have keys in a DB different than DB 0 when in Cluster mode. Exiting.
    Solution / Workaround:

    There is a redis-cli command, MOVE, to move keys from one keyspace to another.

    You can also run the below command to move all the keys from keyspace 1 to keyspace 0:

    redis-cli -h "$HOST" -p "$PORT" -n 1 --raw keys "*" |  xargs -I{} redis-cli -h "$HOST" -p "$PORT" -n 1 move {} 0

    • Verify the import status using the below command (inside the redis-cluster-util container):
    redis-cli -h 10.0.1.90 -p 6379 info keyspace

    It will run Redis info command on node 10.0.1.90:6379 and fetch keyspace information, like below:

    # Keyspace
    db0:keys=33283,expires=0,avg_ttl=0

    • Now reshard all the slots to all instances evenly

    The reshard command will again list the existing nodes, their IDs and the assigned slots.

    redis-trib.rb reshard 10.0.1.90:6379

    Parameters:

    host:port of any node from the cluster.

    Output:

    It will prompt for the number of slots to move – here 16384 / 3 masters ≈ 5461.

    Receiving node id – here id of master node 2  

    Source node IDs  – id of first instance which has currently all the slots. (master 1)

    And prompt to proceed – press ‘yes’

    Repeat the above step, and for the receiving node ID, give the ID of master node 3.

    • After the above step, all 3 masters will have equal slots and imported keys will be distributed among the master nodes.
    • Put keys to cluster for verification
    redis-cli -h 10.0.1.90 -p 6379 set foo bar
    OK
    redis-cli -h 10.0.1.90 -p 6379 set foo bar
    (error) MOVED 4813 10.0.9.203:6379

    The above error shows that the cluster assigned this key’s slot to instance 10.0.9.203:6379, so the server asked the client to redirect. To follow the redirection, use the “-c” flag, which enables cluster mode, like:

    redis-cli -h 10.0.1.90 -p 6379 -c set foo bar
    OK

    Redis Entrypoint

    The application entrypoint for a Redis cluster mostly depends on how your Redis client handles cluster support. Generally, connecting to one of the master nodes should do the work.

    Use the below host:port in your applications:

    redis.marathon.l4lb.thisdcos.directory:6379

    Automation of Redis Cluster Creation

    We have an automation script in place to deploy a 6-node Redis cluster and form a cluster between the nodes.

    Script location: Github

    • It deploys 6 Marathon apps for the 6 Redis nodes. All nodes are deployed on different DCOS nodes, with CLUSTER_NAME as a prefix to the volume name.
    • Once all nodes are up and running, it deploys the redis-cluster-util app, which will be used to form the Redis cluster.
    • Then it prints the Redis nodes and their IP addresses and prompts the user to proceed with cluster creation.
    • If the user chooses to proceed, it runs the redis-cluster-util app and creates the cluster using the collected IP addresses. The util container will prompt for some input that the user has to select.

    Conclusion

    We learned about Redis cluster deployment on DCOS with persistent storage using RexRay. We also learned how RexRay automatically manages volumes over AWS EBS and how to integrate them into DCOS apps/services. We saw how to use the redis-cluster-util container to manage the Redis cluster for different purposes, like forming the cluster, resharding, and importing existing dump.rdb data. Finally, we looked at automating the whole cluster setup using the dcos CLI and bash.


  • Installing Redis Service in DC/OS With Persistent Storage Using AWS Volumes

    Redis is an open source (BSD licensed), in-memory data structure store, used as a database, cache, and message broker.

    It supports various data structures such as strings, hashes, lists, and sets. DCOS offers Redis as a service.

    Why Do We Use External Persistent Storage for Redis Mesos Containers?

    Since Redis is an in-memory database, an instance/service restart will result in loss of data. To counter this, it is always advisable to snapshot the Redis in-memory database from time to time.

    This helps the Redis instance recover from a point-in-time failure.

    In DCOS, Redis is deployed as a stateless service. To make it stateful and persist its data, we can configure local volumes or external volumes.

    The disadvantage of having a local volume mapped to Mesos containers is that when a slave node goes down, the local volume becomes unavailable and data loss occurs.

    However, with external persistent volumes, as they are available on each node of the DCOS cluster, a slave node failure does not impact the data availability.

    Rex-Ray

    REX-Ray is an open source, storage management solution designed to support container runtimes such as Docker and Mesos.

    REX-Ray enables stateful applications such as databases to persist and maintain their data after the life cycle of the container has ended. Built-in high availability enables orchestrators such as Docker Swarm and Kubernetes, and Mesos frameworks like Marathon, to automatically orchestrate storage tasks between hosts in a cluster.

    Built on top of the libStorage framework, REX-Ray’s simplified architecture consists of a single binary and runs as a stateless service on every host using a configuration file to orchestrate multiple storage platforms.

    Objective: To create a Redis service in DC/OS environment with persistent storage.

    Warning: The Persistent Volume feature is still in beta Phase for DC/OS Version 1.11.

    Prerequisites:

    • Make sure the rexray service is running and is in a healthy state for the cluster.

    Steps:

    • Click on the Add button in Services component of DC/OS GUI.
    • Click on JSON Configuration.  

    Note: For persistent storage, the below code should be added to the normal Redis service configuration JSON file to mount external persistent volumes.

    "volumes": [
          {
            "containerPath": "/data",
            "mode": "RW",
            "external": {
              "name": "redis4volume",
              "provider": "dvdi",
              "options": {
                "dvdi/driver": "rexray"
              }
            }
          }
        ],

    • Make sure the service is up and in a running state.

    If you look closely, the service was suspended and respawned on a different slave node. We had populated the database with dummy data and saved a snapshot in the data directory.

    When the service came up on a different node, 10.0.3.204, the data persisted and the volume was visible on the new node.

    core@ip-10-0-3-204 ~ $ /opt/mesosphere/bin/rexray volume list
    
    - name: datavolume
      volumeid: vol-00aacade602cf960c
      availabilityzone: us-east-1a
      status: in-use
      volumetype: standard
      iops: 0
      size: "16"
      networkname: ""
      attachments:
      - volumeid: vol-00aacade602cf960c
        instanceid: i-0d7cad91b62ec9a64
        devicename: /dev/xvdb
    

    • Check the volume tab:

    Note: For external volumes, the status will be unavailable. This is an issue with DC/OS.

    The Entire Service JSON file:

    {
      "id": "/redis4.0-new-failover-test",
      "instances": 1,
      "cpus": 1.001,
      "mem": 2,
      "disk": 0,
      "gpus": 0,
      "backoffSeconds": 1,
      "backoffFactor": 1.15,
      "maxLaunchDelaySeconds": 3600,
      "container": {
        "type": "DOCKER",
        "volumes": [
          {
            "containerPath": "/data",
            "mode": "RW",
            "external": {
              "name": "redis4volume",
              "provider": "dvdi",
              "options": {
                "dvdi/driver": "rexray"
              }
            }
          }
        ],
        "docker": {
          "image": "redis:4",
          "network": "BRIDGE",
          "portMappings": [
            {
              "containerPort": 6379,
              "hostPort": 0,
              "servicePort": 10101,
              "protocol": "tcp",
              "name": "default",
              "labels": {
                "VIP_0": "/redis4.0:6379"
              }
            }
          ],
          "privileged": false,
          "forcePullImage": false
        }
      },
      "healthChecks": [
        {
          "gracePeriodSeconds": 60,
          "intervalSeconds": 5,
          "timeoutSeconds": 5,
          "maxConsecutiveFailures": 3,
          "portIndex": 0,
          "protocol": "TCP"
        }
      ],
      "upgradeStrategy": {
        "minimumHealthCapacity": 0.5,
        "maximumOverCapacity": 0
      },
      "unreachableStrategy": {
        "inactiveAfterSeconds": 300,
        "expungeAfterSeconds": 600
      },
      "killSelection": "YOUNGEST_FIRST",
      "requirePorts": true
    }

    Redis entrypoint

To connect with the Redis service, use the following host:port in your applications:

    redis.marathon.l4lb.thisdcos.directory:6379
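To sanity-check connectivity from application code, a small Python sketch like the one below can help. The RESP framing and the 5-second timeout are illustrative assumptions on my part; any Redis client library (redis-py, for instance) works equally well, and the VIP hostname above only resolves from inside the DC/OS cluster.

```python
import socket

# The DC/OS layer-4 VIP from this post; resolvable only inside the cluster.
HOST, PORT = "redis.marathon.l4lb.thisdcos.directory", 6379

def resp_command(*args: str) -> bytes:
    """Encode a command as a RESP array of bulk strings (what clients send)."""
    parts = [f"*{len(args)}\r\n"]
    for arg in args:
        parts.append(f"${len(arg.encode())}\r\n{arg}\r\n")
    return "".join(parts).encode()

def ping(host: str = HOST, port: int = PORT) -> bytes:
    """Send PING and return the raw reply; a healthy service answers b'+PONG\\r\\n'."""
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(resp_command("PING"))
        return sock.recv(64)
```

Calling `ping()` from a container inside the cluster is a quick way to verify that the VIP routes to the Redis container regardless of which slave node it was respawned on.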

    Conclusion

We learned how to deploy a standalone Redis service from the DC/OS catalog. We also saw how to add persistent storage to it using RexRay, how RexRay automatically manages volumes over AWS EBS, and how to integrate those volumes into DC/OS apps and services. Finally, we saw how other applications can communicate with this Redis service.


  • Blockchain 101: The Simplest Guide You Will Ever Read

    Blockchain allows digital information to be distributed over multiple nodes in the network. It powers the backbone of bitcoin and cryptocurrency.

    The concept of a distributed ledger found its use case beyond crypto and is now used in other infrastructure.

    What is Blockchain?

Blockchain is a distributed ledger that powers bitcoin. Satoshi invented bitcoin, and blockchain was its key component. Blockchain is highly secure and is built around a decentralized consensus algorithm, so no single party can control it completely.

Let’s divide the word blockchain into two parts: block and chain. A block is a set of transactions that happen over the network. The chain links blocks to each other: each block contains the hash of the previous one. Even a small change in a previous block changes its hash and breaks the whole chain, making the data very difficult to tamper with.
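A tiny Python sketch makes the linking concrete. The block layout here (a dict with `transactions` and `prev_hash` fields) is a hypothetical simplification, not Bitcoin's actual block format:

```python
import hashlib
import json

def block_hash(block: dict) -> str:
    """Hash a block's full contents, including the previous block's hash."""
    return hashlib.sha256(json.dumps(block, sort_keys=True).encode()).hexdigest()

genesis = {"transactions": ["Alice pays Bob 1 BTC"], "prev_hash": "0" * 64}
block2 = {"transactions": ["Bob pays Carol 1 BTC"], "prev_hash": block_hash(genesis)}

# Tampering with the first block changes its hash, so the link stored in
# block2 no longer matches and the chain is visibly broken.
genesis["transactions"] = ["Alice pays Sam 1 BTC"]
chain_intact = block2["prev_hash"] == block_hash(genesis)
print(chain_intact)  # False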


    Blockchain prerequisites:

    These are some prerequisites that will help you understand the concepts better.

Public-key Cryptography: Used to verify the authenticity of a user. It involves a pair of public and private keys. The user creates a signature with the private key, and the network uses the user’s public key to validate that the content is untouched.

    Digital Signatures:

    Digital signatures employ asymmetric key cryptography.

    • Authentication: Digital signature makes the receiver believe that the data was created and sent by the claimed user.
    • Non-Repudiation: The sender cannot deny sending a message later on.
    • Integrity: This ensures that the message was not altered during the transfer.

    Cryptographic hash functions:

• One-way function: A mathematical function that takes an input and transforms it into an output, with no practical way to recover the message from the hash value.
    • Collision resistance: It is computationally infeasible for two or more messages to have the same hash (message digest). This ensures that no two account transactions can collide.
    • Fixed hash length: Irrespective of the input size, the function returns a hash of the same length.
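All three properties can be observed with Python's standard-library `hashlib` (SHA-256, the hash Bitcoin uses, chosen here for illustration):

```python
import hashlib

short_digest = hashlib.sha256(b"hi").hexdigest()
long_digest = hashlib.sha256(b"x" * 1_000_000).hexdigest()

# Fixed hash length: 64 hex characters (256 bits) regardless of input size.
print(len(short_digest), len(long_digest))  # 64 64

# One-way, with an avalanche effect: changing a single character yields a
# completely unrelated digest, and nothing about the input can be read back.
print(hashlib.sha256(b"hello").hexdigest() == hashlib.sha256(b"hellp").hexdigest())  # False
```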

    Why Blockchain?

    There are a few problem statements that we can quickly solve using a distributed consensus system rather than a conventional centralized system.

    Let me share some blockchain applications:

Consider an auction where people bid on artifacts, and the winner pays and takes the artifacts away. If we try to implement the same auction over the internet, there would be trust issues: what if someone wins the bid at $10,000 and then never responds when it is time to pay?

We can handle such events easily using blockchain. During bidding, a token amount is deducted from the bidder’s account and stored in the smart contract (business logic code deployed on Ethereum). Bid transactions are signed with the bidder’s private key, so no one can later deny that those transactions happened.

Another simple but amazing solution we can develop using Ethereum is online games like tic-tac-toe, where both players deposit an amount `X` in the smart contract. Each move a player makes is recorded on the blockchain (each move is digitally signed), and the smart contract logic verifies every move. In the end, the smart contract decides the winner, and the winner can claim the reward.

• No one controls your game.
    • There is no way anyone can cheat.
    • No fraud: the winner always gets the reward.

    Bitcoin is the biggest and most well-known implementation of blockchain technology.

    The list of applications based on distributed consensus systems goes on.

    Note: Ethereum smart contract is the code that is deployed over Ethereum blockchain. It is written as a transaction on the block so no one can alter the logic. This is also known as Code is Law.

    Check out some of the smart contract examples.

    Bitcoin is the base and ideal implementation for all other cryptocurrencies. Let’s dig deep into blockchain technology and cryptocurrency.

    Let’s reinvent Bitcoin:

    • Bitcoin is distributed ledger technology where the ledger is a set of transactions
    • No single entity controls the system
    • High level of trust 

We have to design our bitcoin to meet the above requirements.

    1) Consider that bitcoin is just a string that we will send from one node to the other. Here, the string is: “I, Alice, am giving Bob one bitcoin.” It shows Alice is sending Bob one bitcoin.

     

    2) Sam uses the fake identity of Alice and sends bitcoin on her behalf.

     

3) We can solve this fake-identity problem using digital signatures; Sam can no longer use Alice’s identity.

    But there is still one problem: double spending. This occurs when Alice’s transaction is sent multiple times. It is difficult to tell whether Alice wants to send multiple bitcoins or is just retrying a transaction because of high network latency or some other issue.

    4) But a simple solution is to add a unique transaction ID to each transaction.

    5) It’s time to add more complexity to our system. Let’s check how we can validate the transaction between Alice and Bob.

    In cryptocurrency, every node knows everything (nodes are the systems where blockchain clients are installed, like Geth for Ethereum).

     

Every node maintains a local ledger containing the whole blockchain data. Here, Alice, Sam, and Bob all know how many bitcoins everyone has. This helps validate all transactions happening over the network.

When Bob receives an event from Alice containing a bitcoin transaction, he checks his local copy of the blockchain and verifies that Alice owns the bitcoin she wants to send. If Bob finds the transaction valid, he broadcasts it to the whole network and waits for others to confirm. Other peers also check their local copies and acknowledge the transaction. If the majority of peers confirm that the transaction is valid, it gets added to the blockchain. Everyone then updates their copy of the ledger, and Alice has one less bitcoin.

Note: In an actual cryptocurrency, validation occurs at the block level rather than one transaction at a time. Bob validates a set of transactions, creates one block from them, and broadcasts that block over the network for validation.
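The validation-by-majority idea above can be sketched in a few lines of Python. The ledger layout (a plain dict of balances per node) is a toy simplification of my own, not how real clients store state:

```python
def is_valid(tx: dict, ledger: dict) -> bool:
    """A node checks its own local ledger: does the sender own enough bitcoin?"""
    return ledger.get(tx["from"], 0) >= tx["amount"]

def network_accepts(tx: dict, node_ledgers: list) -> bool:
    """The transaction is added only when a majority of nodes confirm it."""
    votes = sum(is_valid(tx, ledger) for ledger in node_ledgers)
    return votes > len(node_ledgers) / 2

nodes = [{"Alice": 1, "Bob": 2}] * 5  # five honest nodes, identical ledger copies
print(network_accepts({"from": "Alice", "to": "Bob", "amount": 1}, nodes))  # True
print(network_accepts({"from": "Alice", "to": "Sam", "amount": 2}, nodes))  # False
```

Note that this sketch already exposes the weakness discussed next: whoever controls a majority of the voting nodes controls what "valid" means.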

6) Still, there is one problem with this approach. Here, we are using Bob as a validator. But what if he is a fraudster? He might claim a transaction is valid even when it is invalid, and he may have thousands of automated bots to support him. The whole network would then follow the bots and accept the invalid transaction (majority wins).

In this example, Alice has one bitcoin. Still, she creates two transactions: one to Bob and another to Sam. Alice waits for the network to accept the transaction to Bob, after which she has 0 bitcoins. If Alice validates her own transaction to Sam and declares it valid (even though she has no bitcoin left to spend), and she has a large number of bots to support her, then eventually the whole network will accept that transaction, and Alice will have double-spent the bitcoin.

    7) We can solve this problem with the POW (Proof of Work) consensus algorithm.

This is a puzzle that a miner has to solve while validating the transactions present in a block.

A block has a size of around 1 MB. A miner appends a random number (the nonce) to the block and calculates the hash, repeating until the hash value starts with a required string of zeros, as shown in the image.

The network decides how many leading zeros are required, and the miner of the next block has to find a nonce that produces a hash with that many zeros. Solving this puzzle can require trying quadrillions of combinations, and because the process is so expensive, miners are rewarded after the block is validated.
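The nonce search can be sketched directly in Python. A very low difficulty is used here so the loop finishes instantly; real Bitcoin difficulty requires on the order of 19 leading zero hex digits, which is why mining needs enormous hash power:

```python
import hashlib

def mine(block_data: bytes, difficulty: int) -> int:
    """Try nonces until the block's SHA-256 hash starts with `difficulty` zeros."""
    prefix = "0" * difficulty
    nonce = 0
    while True:
        digest = hashlib.sha256(block_data + str(nonce).encode()).hexdigest()
        if digest.startswith(prefix):
            return nonce
        nonce += 1

nonce = mine(b"block of transactions", difficulty=4)
digest = hashlib.sha256(b"block of transactions" + str(nonce).encode()).hexdigest()
print(digest.startswith("0000"))  # True
```

Verifying the answer takes a single hash, while finding it takes thousands of attempts even at this toy difficulty; that asymmetry is the heart of Proof of Work.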

    But how can we solve the above problem using mining?

Suppose the blockchain network has 10,000 active mining nodes with the same computational power. The probability that any given node mines the next block is only 0.01%. Anyone who wants to push fraudulent transactions would need huge mining power to validate the block and convince other nodes to accept it. To do this, one needs to own more than 50% of the network’s computational power, which is very difficult.

    Now we have a prototype cryptocurrency model ready with us.

Note: Each blockchain node follows the majority. Even if a transaction is invalid, if more than 51% of nodes say it is valid, the whole network will be convinced and go rogue. This means that any group owning more than 50% of the computational power (hash power) controls the whole blockchain network. This is known as a 51% attack.

    Blockchain-based services:

    • Golem: Distributed computing platform.
    • Iexec: Distributed computing platform.
• Sia: Distributed storage whose SLA is managed over the blockchain.
    • Ethrise: Insurance platform developed on Ethereum ecosystem.
• Maecenas: Blockchain-based auctions of fine art.
    • More than 2000 cryptocurrency platforms.

    Disadvantages of Blockchain:

    There are a few limitations to blockchain-based solutions.

• Crime: Because of its encryption and anonymity, blockchain can facilitate crime.
    • Data size: Full nodes store all transactional data and require over 100 GB of disk space.
    • Throughput: Blockchain systems are very slow.

Bitcoin can perform only about 5 transactions per second, while a conventional payment network can handle more than 24,000.

  • A Quick Introduction to Data Analysis With Pandas

    Python is a great language for doing data analysis, primarily because of the fantastic ecosystem of data-centric Python packages. Pandas is one of those packages and makes importing and analyzing data much easier.

Pandas aims to integrate the functionality of NumPy and matplotlib to give you a convenient tool for data analytics and visualization. Besides the integration, it also makes them far easier to use.

    In this blog, I’ll give you a list of useful pandas snippets that can be reused over and over again. These will definitely save you some time that you may otherwise need to skim through the comprehensive Pandas docs.

Pandas provides two core data structures capable of holding elements of any type: the Series and the DataFrame.

    Series

    A one-dimensional object that can hold any data type such as integers, floats, and strings

A Series object can be created from different values, and can be thought of as similar to a Python list.

In the example below, NaN is NumPy’s nan symbol, a special floating-point value that marks an element as “not a number” while still behaving as a numerical type. The dtype of the series is object because the series mixes strings and numbers.

    >>> import pandas as pd
    >>> import numpy as np
    >>> series = pd.Series([12,32,54,2, np.nan, "a string", 6])
    >>> series
    0          12
    1          32
    2          54
    3           2
    4         NaN
    5    a string
    6           6
    dtype: object

    Now if we use only numerical values, we get the basic NumPy dtype – float for our series.

    >>> series = pd.Series([1,2,np.nan, 4])
    >>> series
    0    1.0
    1    2.0
    2    NaN
    3    4.0
    dtype: float64

    DataFrame

    A two-dimensional labeled data structure where columns can be of different types.

    Each column in a Pandas DataFrame represents a Series object in memory.

Converting a Python object (a dictionary, lists, etc.) to a DataFrame is extremely easy. For a Python dictionary, the keys map to column names while the values correspond to lists of column values.

    >>> d = {
        "stats": pd.Series(np.arange(10,15,1.0)),
        "year": pd.Series(["2012","2007","2012","2003"]),
        "intake": pd.Series(["SUMMER","WINTER","WINTER","SUMMER"]),
    }
    >>> df = pd.DataFrame(d)
    >>> df

    Reading CSV files

Pandas can read various file types; the only pattern you need to remember is:

    pd.read_filetype()

You only have to replace “filetype” with the actual type of the file, such as csv or excel, and pass the path of the file as the first argument inside the parentheses. You can also pass other arguments that relate to opening the file. (Reading a csv file? See this)

    >>> df = pd.read_csv('companies.csv')
    >>> df.head()

    Accessing Columns and Rows

A DataFrame comprises three sub-components: the index, the columns, and the data (also known as the values).

The index represents a sequence of values. In the DataFrame, it always appears on the left side, with its values in bold font. Each individual value of the index is called a label. Positions are integer offsets, while labels are the values at those positions; the index is sometimes also referred to as the row labels. In all the examples below, the labels and positions coincide: they are just integers from 0 up to n-1, where n is the number of rows in the table.

    Selecting rows is done using loc and iloc:

    • loc gets rows (or columns) with particular labels from the index. Raises KeyError when the items are not found.
    • iloc gets rows (or columns) at particular positions/index (so it only takes integers). Raises IndexError if a requested indexer is out-of-bounds.
    >>> df.loc[:5]              #similar to df.head()
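A small self-contained example (with made-up data and deliberately non-default index labels) makes the label/position distinction concrete:

```python
import pandas as pd

# Non-default index labels make the loc/iloc distinction visible.
df = pd.DataFrame(
    {"name": ["Acme", "Beta", "Corp"], "year": [2009, 2012, 2009]},
    index=[10, 20, 30],
)

first_by_label = df.loc[10]     # row whose LABEL is 10
first_by_position = df.iloc[0]  # row at POSITION 0 (the same row here)
# df.loc[0] would raise KeyError; df.iloc[10] would raise IndexError
```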

    Accessing the data using column names

    Pandas takes an extra step and allows us to access data through labels in DataFrames.

    >>> df.loc[:5, ["name","vertical", "url"]]

    In Pandas, selecting data is very easy and similar to accessing an element from a dictionary or a list.

Selecting a column (df[col_name]) returns the column with label col_name as a Series, because columns are stored as Series in a DataFrame. To access multiple columns, use df[[col_name_1, col_name_2]], which returns the columns as a new DataFrame.

    Filtering DataFrames with Conditional Logic

    Let’s say we want all the companies with the vertical as B2B, the logic would be:

    >>> df[(df['vertical'] == 'B2B')]

    If we want the companies for the year 2009, we would use:

    >>> df[(df['year'] == 2009)]

    Need to combine them both? Here’s how you would do it:

    >>> df[(df['vertical'] == 'B2B') & (df['year'] == 2009)]

    Get all companies with vertical as B2B for the year 2009

    Sort and Groupby

    Sorting

    Sort values by a certain column in ascending order by using:

    >>> df.sort_values(colname)

Pass ascending=False to sort in descending order:

    >>> df.sort_values(colname,ascending=False)

    Furthermore, it’s also possible to sort values by multiple columns with different orders. colname_1 is being sorted in ascending order and colname_2 in descending order by using:

    >>> df.sort_values([colname_1,colname_2],ascending=[True,False])

    Grouping

This operation involves 3 steps: splitting the data, applying a function to each of the groups, and combining the results into a data structure. This can be used to group large amounts of data and compute operations on those groups.

    df.groupby(colname) returns a groupby object for values from one column while df.groupby([col1,col2]) returns a groupby object for values from multiple columns.
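The split-apply-combine steps above can be sketched with a tiny made-up dataset (column names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "vertical": ["B2B", "B2C", "B2B", "B2C"],
    "year": [2009, 2009, 2012, 2012],
    "funding": [10, 5, 20, 15],
})

# Split by vertical, apply a sum to each group, combine into one Series:
per_vertical = df.groupby("vertical")["funding"].sum()
print(per_vertical["B2B"])  # 30

# Grouping by multiple columns yields one group per (vertical, year) pair:
per_pair = df.groupby(["vertical", "year"])["funding"].sum()
print(per_pair[("B2C", 2012)])  # 15
```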

    Data Cleansing

    Data cleaning is a very important step in data analysis.

    Checking missing values in the data

    Check null values in the DataFrame by using:

    >>> df.isnull()

This returns a boolean array (True for missing values and False for non-missing values).

    >>> df.isnull().sum()

Chaining .sum() gives the count of missing values per column. Check non-null values in the DataFrame using pd.notnull(); it returns a boolean array, the exact converse of df.isnull().

    Removing Empty Values

    Dropping empty values can be done easily by using:

    >>> df.dropna()

This drops the rows that have empty values; use df.dropna(axis=1) to drop columns instead.

    Also, if you wish to fill the missing values with other values, use df.fillna(x). This fills all the missing values with the value x (here you can put any value that you want) or s.fillna(s.mean()) which replaces null values with the mean (mean can be replaced with any function from the arithmetic section).
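A short worked example (with a made-up two-column frame) ties these cleaning steps together:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [np.nan, 2.0, 3.0]})

missing_per_column = df.isnull().sum()  # a -> 1, b -> 1
dropped = df.dropna()                   # keeps only the last, complete row
filled = df.fillna(df.mean())           # NaNs replaced by each column's mean

print(len(dropped))        # 1
print(filled.loc[1, "a"])  # 2.0  (mean of 1.0 and 3.0)
```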

    Operations on Complete Rows, Columns, or Even All Data

    >>> df["url_len"] = df["url"].map(len)

    The .map() operation applies a function to each element of a column.

    .apply() applies a function to columns. Use .apply(axis=1) to do it on the rows.
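Both can be seen side by side on a small made-up frame (the `label` column is an illustrative name of my own):

```python
import pandas as pd

df = pd.DataFrame({"url": ["a.com", "bb.org"], "year": [2009, 2012]})

df["url_len"] = df["url"].map(len)  # element-wise over one column
df["label"] = df.apply(             # axis=1 -> the function sees whole rows
    lambda row: f"{row['url']} ({row['year']})", axis=1
)
print(df["url_len"].tolist())  # [5, 6]
```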

    Iterating over rows

Iterating over rows is very similar to iterating over Python built-in types such as lists, tuples, and dictionaries.

    >>> for i, row in df.iterrows():
            print("Index {0}".format(i))
            print("Row {0}".format(row))

The .iterrows() method yields two variables together: the index of the row and the row itself. In the code above, the variable i is the index and the variable row is the row.

    Tips & Tricks

Using ufuncs (also known as universal functions). Pandas has .apply(), which applies a function to columns or rows. Similarly, ufuncs can be used while preprocessing. What is the difference between ufuncs and .apply()?

Ufuncs are NumPy functions implemented in C, which makes them highly efficient (often around 10 times faster than .apply()).

    A list of common Ufuncs:

    isinf: Element-wise checks for positive or negative infinity.

isnan: Element-wise checks for NaN and returns the result as a boolean array.

    isnat: Element-wise checks for NaT (not a time) and returns the result as a boolean array.

    trunc: Return the truncated value of the input, element-wise.

    .dt commands: Element-wise processing for date objects.
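A few of the ufuncs above in action, on a small array chosen to exercise each one:

```python
import numpy as np

arr = np.array([1.5, np.inf, np.nan, -2.5])

print(np.isinf(arr))   # [False  True False False]
print(np.isnan(arr))   # [False False  True False]
print(np.trunc(arr))   # [ 1. inf nan -2.]
```

Because these run as vectorized C loops, there is no per-element Python function call, which is where the speedup over .apply() comes from.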

    High-Performance Pandas

    Pandas performs various vectorized/broadcasted operations and grouping-type operations. These operations are efficient and effective.

    As of version 0.13, Pandas included tools that allow us to directly access C-speed operations without costly allocation of intermediate arrays. There are two functions, eval() and query().

    DataFrame.eval() for efficient operations:

    >>> import pandas as pd
    >>> nrows, ncols = 100000, 100
    >>> rng = np.random.RandomState(42)
    >>> df1, df2, df3, df4 = (pd.DataFrame(rng.rand(nrows, ncols))
                          for i in range(4))

    To compute the sum of df1, df2, df3, and df4 DataFrames using the typical Pandas approach, we can just write the sum:

    >>> %timeit df1 + df2 + df3 + df4
    
    10 loops, best of 3: 103.1 ms per loop

    A better and optimized approach for the same operation can be computed via pd.eval():

    >>> %timeit pd.eval('df1 + df2 + df3 + df4')
    
    10 loops, best of 3: 53.6 ms per loop

    %timeit — Measure execution time of small code snippets.

The eval() expression is about twice as fast (it also consumes much less memory).

And it produces the same result:

    >>> np.allclose(df1 + df2 + df3 + df4, pd.eval('df1 + df2 + df3 + df4'))
    
    True

    np.allclose() is a numpy function which returns True if two arrays are element-wise equal within a tolerance.

    Column-Wise & Assignment Operations Using df.eval()

A normal expression that takes the first character of a column and assigns it back to the same column:

    >>> df['batch'] = df['batch'].str[0]

By using df.eval(), the same expression can be performed much faster:

    >>> df.eval("batch=batch.str[0]")

    DataFrame.query() for efficient operations:

Similar to the conditional-logic filtering shown earlier, filtering rows with vertical as B2B and year as 2009 can be done using:

    >>> %timeit df[(df['vertical'] == 'B2B') & (df['year'] == 2009)]
    
    1.69 ms ± 57 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

With .query(), the same filtering can be performed roughly twice as fast:

    >>> %timeit df.query("vertical == 'B2B' and year == 2009")
    
    875 µs ± 24.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

    When to use eval() and query()? 

    Two aspects: computation time and memory usage. 

Memory usage: Every operation involving NumPy arrays or Pandas DataFrames results in the implicit creation of temporary variables. When the memory used by these temporaries is significant, eval() and query() are an appropriate choice to reduce memory usage.

    Computation time: Traditional method of performing NumPy/Pandas operations is faster for smaller arrays! The real benefit of eval()/query() is achieved mainly because of the saved memory, and also because of the cleaner syntax they offer.

    Conclusion

Pandas is a powerful and fun library for data manipulation and analysis. It comes with easy syntax and fast operations. This blog highlights the most used pandas features and optimizations. The best way to master pandas is to work with real datasets, beginning with Kaggle kernels, to learn how to use pandas for data analysis. Check out more on real-time text classification using Kafka and Scikit-learn and explanatory vs. predictive models in machine learning here.