Author: admin

  • Micro Frontends: Reinventing UI In The Microservices World

It is amazing how the software industry has evolved. Back in the day, software meant a simple program. Some of the earliest software, like the programs for the Apollo missions’ landing modules and the Manchester Baby, were basic stored programs, and software was used primarily for research and mathematical purposes.

The invention of the personal computer and the rise of the Internet changed the software world. Desktop applications like word processors, spreadsheets, and games flourished, and websites gradually emerged. Back then, simple pages were delivered to the client as static documents for viewing. By the mid-1990s, with Netscape introducing the client-side scripting language JavaScript and Macromedia bringing in Flash, the browser became more powerful, allowing websites to become richer and more interactive. In the late 1990s, Java introduced Servlets, and thus the Web Application was born. Nevertheless, these applications were still relatively simple. Engineers didn’t put much emphasis on structuring them and mostly built unstructured monolithic applications.

The advent of disruptive technologies like cloud computing and Big Data paved the way for more intricate, convoluted web and native mobile applications. From e-commerce and video streaming to social media and photo editing, applications were doing some of the most complicated data processing and storage tasks. The traditional monolithic approach now posed several challenges in terms of scalability, team collaboration and integration/deployment, and often led to huge, messy “Big Ball of Mud” codebases.

    Fig: Monolithic Application Problems – Source

To untangle this ball of software, a number of service-oriented architectures emerged. The most promising of them was Microservices: breaking an application into smaller chunks that can be developed, deployed and tested independently but work as a single cohesive unit. Its scalability and ease of deployment by multiple teams proved a panacea for most architectural problems. A few front-end architectures also came up, such as MVC, MVVM, and Web Components, but none of them were fully able to reap the benefits of Microservices.

    Fig: Microservice Architecture – Source

Micro Frontends: The Concept

Micro Frontends first came up in the ThoughtWorks Technology Radar, where the technique was assessed, trialled and eventually adopted after significant benefits were noticed. It is a Microservices approach to front-end web development, in which independently deliverable front-end applications are composed into a whole.

Together with Microservices, Micro Frontends break the last monolith, creating a complete micro-architecture design pattern for web applications. The application is composed of loosely coupled vertical slices of business functionality rather than horizontal layers. We can term these verticals ‘Microapps’. The concept is not new; it appeared in Scaling with Microservices and Vertical Decomposition, which presented the idea of every vertical being responsible for a single business domain and having its own presentation layer, persistence layer, and separate database. From the development perspective, every vertical is implemented by exactly one team, and no code is shared among the different systems.

    Fig: Micro Frontends with Microservices (Micro-architecture)

    Why Micro Frontends?

A Micro Frontend architecture has a whole slew of advantages when compared to a monolithic front-end architecture.

Ease of Upgrades – Micro Frontends create strict bounded contexts in the application. Each part can be updated in an incremental, isolated fashion without the risk of breaking another part of the application.

Scalability – Horizontal scaling is easy for Micro Frontends. Keeping each Micro Frontend stateless makes scaling even easier.

Ease of Deployability – Each Micro Frontend has its own CI/CD pipeline that builds, tests and deploys it to production. It doesn’t matter whether another team is mid-feature, has just pushed a bug fix, or is in the middle of a cutover or refactoring: as long as only one team works on a given Micro Frontend, there should be little risk in pushing its changes.

Team Collaboration and Ownership – The Scrum Guide says that the “optimal Development Team size is small enough to remain nimble and large enough to complete significant work within a Sprint”. Micro Frontends are perfect for multiple cross-functional teams, each completely owning one stack (Micro Frontend) of an application from UX to database design. In an e-commerce site, the Product team and the Payment team can work on the app concurrently without stepping on each other’s toes.

    Micro Frontend Integration Approaches

There is a multitude of ways to implement Micro Frontends. Whichever approach you pick, prefer runtime integration over build-time integration: with build-time integration, every single Micro Frontend has to be re-compiled and re-released in order to release a change to any one of them.
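To see why build-time integration couples releases, consider a container app that consumes each Micro Frontend as a package dependency; a `package.json` along these lines (the package names are illustrative) forces a rebuild and re-release of the container for every change to any one package:

```json
{
  "name": "petstore-container",
  "version": "1.0.0",
  "dependencies": {
    "@petstore/home": "^1.2.0",
    "@petstore/cart": "^2.0.1",
    "@petstore/checkout": "^1.0.3"
  }
}
```

With runtime integration, by contrast, the container fetches each Micro Frontend independently at load time, so each one can be released on its own schedule.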

We shall learn some of the prominent Micro Frontend approaches by building a simple Pet Store e-commerce site. The site has the following aspects (or Microapps, if you will): Home/Search, Cart, Checkout, Product, and Contact Us. We shall only be working on the front-end aspect of the site; you can assume that each Microapp has a dedicated microservice in the backend. You can view the project demo here and the code repository here. Each integration approach has its own branch in the repo that you can check out.

    Single Page Frontends –

    The simplest way (but not the most elegant) to implement Micro Frontends is to treat each Micro Frontend as a single page.

    Fig: Single Page Micro Frontends: Each HTML file is a frontend.
<!DOCTYPE html>
<html lang="zxx">
<head>
	<title>The MicroFrontend - eCommerce Template</title>
</head>
<body>
  <header class="header-section header-normal">
    <!-- Header is repeated in each frontend, which is difficult to maintain -->
    ....
    ....
  </header>
  <main>
  </main>
  <footer>
    <!-- Footer is repeated in each frontend, so one change means multiple edits across all frontends -->
  </footer>
  <script>
    // Cross-cutting features like notifications and authentication are replicated in all frontends
  </script>
</body>
</html>

It is one of the purest ways of doing Micro Frontends because no container or stitching element binds the frontends together into an application. Each Micro Frontend is a standalone app with every dependency encapsulated in it and no coupling with the others. The flip side of this approach is that each frontend duplicates cross-cutting concerns like headers and footers, which adds redundancy and maintenance burden.

JavaScript Rendering Components (or Web Components / Custom Elements) –

As we saw above, the single-page Micro Frontend architecture has its share of drawbacks. To overcome these, we can opt for an architecture with a container element that sets up the app’s context and cross-cutting concerns like authentication, and stitches all the Micro Frontends together into a cohesive application.

// A base class from which all micro-frontends extend
    class MicroFrontend {
      
      beforeMount() {
        // do things before the micro front-end mounts
      }
    
      onChange() {
        // do things when the attributes of a micro front-end changes
      }
    
      render() {
        // html of the micro frontend
        return '<div></div>';
      }
    
      onDismount() {
        // do things after the micro front-end dismounts 
      }
    }

    class Cart extends MicroFrontend {
      beforeMount() {
        // get previously saved cart from backend
      }
    
      render() {
        return `<!-- Page -->
        <div class="page-area cart-page spad">
          <div class="container">
            <div class="cart-table">
              <table>
                <thead>
                .....
                
         `
      }
    
      addItemToCart(){
        ...
      }
        
      deleteItemFromCart () {
        ...
      }
    
      applyCouponToCart() {
        ...
      }
        
      onDismount() {
        // save Cart for the user to get back to afterwards
      }
    }

    class Product extends MicroFrontend {
      static get productDetails() {
        return {
          '1': {
            name: 'Cat Table',
            img: 'img/product/cat-table.jpg'
          },
          '2': {
            name: 'Dog House Sofa',
            img: 'img/product/doghousesofa.jpg'
          },
        }
      }
      getProductDetails() {
        var urlParams = new URLSearchParams(window.location.search);
        const productId = urlParams.get('productId');
        return this.constructor.productDetails[productId];
      }
      render() {
        const product = this.getProductDetails();
        return `	<!-- Page -->
        <div class="page-area product-page spad">
          <div class="container">
            <div class="row">
              <div class="col-lg-6">
                <figure>
                  <img class="product-big-img" src="${product.img}" alt="">`
      }
      selectProductColor(color) {}
    
      selectProductSize(size) {}
     
      addToCart() {
        // delegate call to MicroFrontend Cart.addToCart function
      }
      
    }

    <!DOCTYPE html>
    <html lang="zxx">
    <head>
    	<title>PetStore - because Pets love pampering</title>
	<meta charset="UTF-8">
      <link rel="stylesheet" href="css/style.css"/>
    
    </head>
    <body>
    	<!-- Header section -->
    	<header class="header-section">
      ....
      </header>
    	<!-- Header section end -->
    	<main id='microfrontend'>
        <!-- This is where the Micro-frontend gets rendered by utility renderMicroFrontend.js-->
    	</main>
	<!-- Footer section -->
    	<footer class="header-section">
      ....
      </footer>
    	<!-- Footer section end -->
      	<script src="frontends/MicroFrontend.js"></script>
    	<script src="frontends/Home.js"></script>
    	<script src="frontends/Cart.js"></script>
    	<script src="frontends/Checkout.js"></script>
    	<script src="frontends/Product.js"></script>
    	<script src="frontends/Contact.js"></script>
	<script src="routes.js"></script>
	<script src="renderMicroFrontend.js"></script>
</body>
</html>

function renderMicroFrontend(pathname) {
  const microFrontend = routes[pathname || window.location.hash];
  const root = document.getElementById('microfrontend');
  root.innerHTML = microFrontend ? new microFrontend().render() : new Home().render();
  $(window).scrollTop(0);
}

$(window).bind('hashchange', function (e) { renderMicroFrontend(window.location.hash); });
renderMicroFrontend(window.location.hash);
    
Utility routes.js (a map from hash routes to Micro Frontend classes):
    const routes = {
      '#': Home,
      '': Home,
      '#home': Home,
      '#cart': Cart,
      '#checkout': Checkout,
      '#product': Product,
      '#contact': Contact,
    };

As you can see, this approach is pretty neat: a base MicroFrontend class encapsulates the common lifecycle, and all the other Micro Frontends extend from it. Notice how all the functionality related to a Microapp is encapsulated in its own Micro Frontend. This ensures that concurrent work on one Micro Frontend doesn’t mess up the others.

Everything works in a similar paradigm with Web Components and Custom Elements.
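As a rough sketch of the same lifecycle with standard Custom Elements (the element name and markup here are illustrative), connectedCallback and disconnectedCallback play the roles of beforeMount/render and onDismount:

```
<script>
  class CartFrontend extends HTMLElement {
    connectedCallback() {
      // fetch the previously saved cart, then render
      this.innerHTML = '<div class="cart-page">...</div>';
    }
    disconnectedCallback() {
      // save the cart for the user to come back to
    }
  }
  customElements.define('cart-frontend', CartFrontend);
</script>

<main id="microfrontend">
  <cart-frontend></cart-frontend>
</main>
```

The container’s router would then swap custom-element tags in and out of the `main` element instead of calling `render()` itself.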

    React

With client-side JavaScript frameworks being so popular, it is impossible to leave React out of any front-end discussion. React being a component-based JS library, much of what we discussed above also applies to it. Here are some of the technicalities and challenges of doing Micro Frontends with React.

    Styling

Since there should be minimal code sharing between Micro Frontends, styling React components can be challenging given the global, cascading nature of CSS. We should make sure styles target a specific Micro Frontend without spilling over to the others. Inline CSS, CSS-in-JS libraries like Radium, and CSS Modules can all be used with React.
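To make the scoping idea concrete, here is a minimal hand-rolled sketch (not a real library, and the function and class names are illustrative) of what tools like CSS Modules do under the hood: each Micro Frontend’s class names get a unique prefix, so its styles cannot leak into another Micro Frontend.

```javascript
// Minimal sketch of CSS-Modules-style scoping: prefix every class
// name with the Micro Frontend's name so selectors can't collide.
function scopedStyles(frontendName, rules) {
  const classes = {};
  let css = '';
  for (const [name, body] of Object.entries(rules)) {
    const scopedName = `${frontendName}__${name}`;
    classes[name] = scopedName;              // e.g. "cart__title"
    css += `.${scopedName} { ${body} }\n`;   // emit the scoped rule
  }
  return { classes, css };
}

// The Cart Micro Frontend's styles can never clash with Product's
const cart = scopedStyles('cart', { title: 'font-size: 18px;' });
console.log(cart.classes.title); // "cart__title"
```

Real CSS Modules implementations use hashes rather than readable prefixes, but the isolation guarantee is the same.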

    Redux

Using React with Redux is pretty much the norm in today’s front-end world. The convention is to use Redux as a single global store for the entire app, for cross-application communication. But a Micro Frontend should be self-contained, with no outside dependencies. Hence each Micro Frontend should have its own Redux store, moving towards a multiple-store architecture.
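To illustrate the multiple-store idea without pulling in Redux itself, here is a hand-rolled Redux-style store (a sketch of the pattern, not Redux’s actual API surface; the reducer and action names are illustrative). Each Micro Frontend creates its own instance instead of sharing one global store.

```javascript
// A tiny Redux-style store: state is only changed by dispatching
// actions through a reducer, and subscribers are notified on change.
function createStore(reducer, initialState) {
  let state = initialState;
  const listeners = [];
  return {
    getState: () => state,
    dispatch(action) {
      state = reducer(state, action);
      listeners.forEach((listener) => listener());
    },
    subscribe(listener) { listeners.push(listener); },
  };
}

// The Cart Micro Frontend owns its store; Product would create its own.
const cartStore = createStore((state, action) => {
  switch (action.type) {
    case 'ADD_ITEM':
      return { items: [...state.items, action.item] };
    default:
      return state;
  }
}, { items: [] });

cartStore.dispatch({ type: 'ADD_ITEM', item: 'Cat Table' });
console.log(cartStore.getState().items); // [ 'Cat Table' ]
```

In a real app each Micro Frontend would call Redux’s own `createStore` (or `configureStore`) the same way, keeping its state private.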

Other Noteworthy Integration Approaches –

Server-side Rendering – One can use a server to assemble Micro Frontend templates before dispatching them to the browser. Server Side Includes (SSI) techniques can be used too.
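With SSI, for instance, the web server replaces include directives with each Micro Frontend’s rendered fragment before the page leaves the server (the fragment URLs here are illustrative):

```
<body>
  <!--#include virtual="/fragments/header" -->
  <main>
    <!--#include virtual="/fragments/cart" -->
  </main>
  <!--#include virtual="/fragments/footer" -->
</body>
```

Each fragment URL can be served by a different team’s application, so the browser only ever sees one assembled page.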

iframes – Each Micro Frontend can be an iframe. iframes also offer a good degree of isolation: styles and global variables don’t interfere with each other.
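A minimal iframe container might look like this (the paths are illustrative); routing then just swaps the iframe’s `src`:

```
<main id="microfrontend">
  <!-- Each Micro Frontend is deployed at its own URL; the iframe
       isolates its styles and global variables from the container -->
  <iframe id="frontend-frame" src="/cart/index.html" title="cart"></iframe>
</main>
<script>
  // On navigation, e.g.:
  // document.getElementById('frontend-frame').src = '/checkout/index.html';
</script>
```

The trade-off is that iframes make layout, deep-linking and cross-frame communication harder, which is why this approach tends to suit dashboards more than flowing pages.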

    Summary

Together with Microservices, Micro Frontends promise to bring a lot of benefits when it comes to structuring a complex application and simplifying its development, deployment and maintenance.

But as the wonderful saying goes, “there is no one-size-fits-all approach that anyone can offer you. The same hot water that softens a carrot hardens an egg.” Micro Frontends are no silver bullet for your architectural problems and come with their own share of downsides. With more repositories, more tools, more build/deploy pipelines, more servers and more domains to maintain, Micro Frontends can increase the complexity of an app. They can make cross-application communication difficult to establish, and can lead to duplicated dependencies and an increase in application size.

Your decision to adopt this architecture will depend on many factors, like the size of your organization and the complexity of your application. Whether yours is a new or legacy codebase, it is advisable to apply the technique gradually and review its efficacy over time.

  • Chatbots With Google DialogFlow: Build a Fun Reddit Chatbot in 30 Minutes

    Google DialogFlow

If you’ve been keeping up with the current advancements in the world of chat and voice bots, you’ve probably come across Google’s newest acquisition – DialogFlow (formerly api.ai) – a platform that provides use-case-specific, engaging voice and text-based conversations, powered by AI. While understanding the intricacies of human conversation, where we say one thing but mean another, is still an art lost on machines, a domain-specific bot is the closest thing we can build.

    What is DialogFlow anyway?

    Natural language understanding (NLU) has always been the painful part while building a chatbot. How do you make sure your bot is actually understanding what the user says, and parsing their requests correctly? Well, here’s where DialogFlow comes in and fills the gap. It actually replaces the NLU parsing bit so that you can focus on other areas like your business logic!

    DialogFlow is simply a tool that allows you to make bots (or assistants or agents) that understand human conversation, string together a meaningful API call with appropriate parameters after parsing the conversation and respond with an adequate reply. You can then deploy this bot to any platform of your choosing – Facebook Messenger, Slack, Google Assistant, Twitter, Skype, etc. Or on your own app or website as well!

    The building blocks of DialogFlow

Agent: DialogFlow allows you to create NLU modules, called agents (basically the face of your bot). You connect the agent to your backend, which provides the business logic.

    Intent: An agent is made up of intents. Intents are simply actions that a user can perform on your agent. It maps what a user says to what action should be taken. They’re entry points into a conversation.

    In short, a user may request the same thing in many ways, re-structuring their sentences. But in the end, they should all resolve to a single intent.

    Examples of intents can be:
    “What’s the weather like in Mumbai today?” or “What is the recipe for an omelet?”

    You can create as many intents as your business logic desires, and even co-relate them, using contexts. An intent decides what API to call, with what parameters, and how to respond back, to a user’s request.

Entity: An agent wouldn’t know on its own what values to extract from a given user’s input. This is where entities come into play. Any information in a sentence that is critical to your business logic will be an entity. This includes things like dates, distances, currencies, etc. There are system entities, provided by DialogFlow, for simple things like numbers and dates, and then there are developer-defined entities, for example a “category” entity for a bot about Pokemon! We’ll dive into how to make a custom developer entity further in the post.

    Context: Final concept before we can get started with coding is “Context”. This is what makes the bot truly conversational. A context-aware bot can remember things, and hold a conversation like humans do. Consider the following conversation:

    “Hey, are you coming for piano practice tonight?”
    “Sorry, I’ve got dinner plans.”
    “Okay, what about tomorrow night then?”
    “That works!”

    Did you notice what just happened? The first question is straightforward to parse: The time is “tonight”, and the event, “piano practice”.

    However, the second question,  “Okay, what about tomorrow night then?” doesn’t specify anything about the actual event. It’s implied that we’re talking about “piano practice”. This sort of understanding comes naturally to us humans, but bots have to be explicitly programmed so that they understand the context across these sentences.

    Making a Reddit Chatbot using DialogFlow

    Now that we’re well equipped with the basics, let’s get started! We’re going to make a Reddit bot that tells a joke or an interesting fact from the day’s top threads on specific subreddits. We’ll also sprinkle in some context awareness so that the bot doesn’t feel “rigid”.

NOTE: You will need a billing-enabled account on Google Cloud Platform (GCP) if you want to follow along with this tutorial. It’s free and just needs your credit card details to set up.

    Creating an Agent 

    1. Log in to the DialogFlow dashboard using your Google account. Here’s the link for the lazy.
    2. Click on “Create Agent”
    3. Enter the details as below, and hit “Create”. You can select any other Google project if it has billing enabled on it as well.

    Setting up a “Welcome” Intent

    As soon as you create the agent, you see this intents page:

    The “Default Fallback” Intent exists in case the user says something unexpected and is outside the scope of your intents. We won’t worry too much about that right now. Go ahead and click on the “Default Welcome Intent”. We can notice a lot of options that we can tweak.
    Let’s start with a triggering phrase. Notice the “User Says” section? We want our bot to activate as soon as we say something along the lines of:

    Let’s fill that in. After that, scroll down to the “Responses” tab. You can see some generic welcome messages are provided. Get rid of them, and put in something more personalized to our bot, as follows:

    Now, this does a couple of things. Firstly, it lets the user know that they’re using our bot. It also guides the user to the next point in the conversation. Here, it is an “or” question.

    Hit “Save” and let’s move on.

    Creating a Custom Entity

Before we start playing around with Intents, I want to set up a Custom Entity real quick. If you remember, Entities are what we extract from the user’s input for further processing. I’m going to call our Entity “content”, since what the user requests will be a piece of content: either a joke or a fact. Let’s go ahead and create that. Click on the “Entities” tab in the left sidebar and click “Create Entity”.

    Fill in the following details:

As you can see, we have two possible values for our content: “joke” and “fact”. We have also entered synonyms for each of them, so that if the user says something like “I want to hear something funny”, we know they want a “joke”. Click “Save” and let’s proceed to the next section!

    Attaching our new Entity to the Intent

    Create a new Intent called “say-content”. Add a phrase “Let’s hear a joke” in the “User Says” section, like so:

    Right off the bat, we notice a couple of interesting things. Dialogflow parsed this input and associated the entity content to it, with the correct value (here, “joke”). Let’s add a few more inputs:

    PS: Make sure all the highlighted words are in the same color and have associated the same entity. Dialogflow’s NLU isn’t perfect and sometimes assigns different Entities. If that’s the case, just remove it, double-click the word and assign the correct Entity yourself!

    Let’s add a placeholder text response to see it work. To do that, scroll to the bottom section “Response”, and fill it like so:

    The “$content” is a variable having a value extracted from user’s response that we saw above.

Let’s see this in action. On the right side of every page on Dialogflow’s platform, you see a “Try It Now” box. Use it to test your work at any point in time. I’m going to go ahead and type “Tell a fact” into the box. Notice that the phrase “Tell a fact” wasn’t present in the samples we gave earlier. Dialogflow keeps training using its NLU modules and can extract data from similarly structured sentences:

    A Webhook to process requests

To keep things simple, I’m gonna write a JS app that fulfills the request by querying Reddit and returning the appropriate content. Luckily for us, Reddit doesn’t require authentication for reading in JSON format. Here’s the code:

'use strict';
const http = require('https');

exports.appWebhook = (req, res) => {
  let content = req.body.result.parameters['content'];
  getContent(content).then((output) => {
    res.setHeader('Content-Type', 'application/json');
    res.send(JSON.stringify({ 'speech': output, 'displayText': output }));
  }).catch((error) => {
    // If there is an error let the user know
    res.setHeader('Content-Type', 'application/json');
    res.send(JSON.stringify({ 'speech': error, 'displayText': error }));
  });
};

function getSubreddit(content) {
  if (content == "funny" || content == "joke" || content == "laugh") {
    return { sub: "jokes", displayText: "joke" };
  } else {
    return { sub: "todayILearned", displayText: "fact" };
  }
}

function getContent(content) {
  let subReddit = getSubreddit(content);
  return new Promise((resolve, reject) => {
    console.log('API Request: to Reddit');
    http.get(`https://www.reddit.com/r/${subReddit["sub"]}/top.json?sort=top&t=day`, (resp) => {
      let data = '';
      resp.on('data', (chunk) => {
        data += chunk;
      });
      resp.on('end', () => {
        let response = JSON.parse(data);
        let thread = response["data"]["children"][(Math.floor((Math.random() * 24) + 1))]["data"];
        let output = `Here's a ${subReddit["displayText"]}: ${thread["title"]}`;
        if (subReddit['sub'] == "jokes") {
          output += " " + thread["selftext"];
        }
        output += "\nWhat do you want to hear next, a joke or a fact?";
        console.log(output);
        resolve(output);
      });
    }).on("error", (err) => {
      console.log("Error: " + err.message);
      reject(err);
    });
  });
}

    Now, before going ahead, follow the steps 1-5 mentioned here religiously.

    NOTE: For step 1, select the same Google Project that you created/used, when creating the agent.

    Now, to deploy our function using gcloud:

$ gcloud beta functions deploy appWebhook --stage-bucket BUCKET_NAME --trigger-http

    To find the BUCKET_NAME, go to your Google project’s console and click on Cloud Storage under the Resources section.

After you run the command, make note of the httpsTrigger URL in the output. On the Dialogflow platform, find the “Fulfillment” tab on the sidebar. We need to enable webhooks and paste in the URL, like this:

Hit “Done” at the bottom of the page, and now for the final step. Visit the “say-content” Intent page and perform a couple of steps.

1. Make the “content” parameter mandatory. This will make the bot explicitly ask the user for the parameter if it isn’t clear:

2. Notice a new section has been added to the bottom of the screen called “Fulfillment”. Enable the “Use webhook” checkbox:

    Click “Save” and that’s it! Time to test this Intent out!

    Reddit’s crappy humor aside, this looks neat. Our replies always drive the conversation to places (Intents) that we want it to.

    Adding Context to our Bot

    Even though this works perfectly fine, there’s one more thing I’d like to add quickly. We want the user to be able to say, “More” or “Give me another one” and the bot to be able to understand what this means. This is done by emitting and absorbing contexts between intents.

First, to emit the context, scroll up on the “say-content” Intent’s page and find the “Contexts” section. We want to output this context with a count (lifespan) of, say, 5. The count makes sure the bot remembers what the “content” is in the current conversation for up to 5 back-and-forths.

Now, we want to create a new Intent that can absorb this context and make sense of phrases like “More please”:

Finally, since we want it to work the same way, we’ll make the “Action” and “Fulfillment” sections look the same as they do in the “say-content” Intent:

    And that’s it! Your bot is ready.

    Integrations

Dialogflow provides integrations with probably every messaging service in Silicon Valley, and more. But we’ll use the Web Demo. Go to the “Integrations” tab from the sidebar and enable the “Web Demo” settings. Your bot should work like this:

    And that’s it! Your bot is ready to face a real person! Now, you can easily keep adding more subreddits, like news, sports, bodypainting, dankmemes or whatever your hobbies in life are! Or make it understand a few more parameters. For example, “A joke about Donald Trump”.

    Consider that your homework. You can also add a “Bye” intent, and make the bot stop. Our bot currently isn’t so great with goodbyes, sort of like real people.

    Debugging and Tips

If you’re facing issues with no replies from the Reddit script, go to your Google project and check the Error Reporting tab to make sure everything’s fine under the hood. If outbound requests are throwing an error, you probably don’t have billing enabled.

Also, one caveat I found is that entities can take on any value from the synonyms you’ve provided. This means you HAVE to hardcode all of those in your business app as well. Which sucks right now, but maybe DialogFlow will provide a cleaner solution in the near future!

  • Managing Secrets Using AWS Systems Manager Parameter Store and IAM Roles

Amazon Web Services (AWS) has an extremely wide variety of services covering almost all our infrastructure requirements. Among them is AWS Systems Manager, a collection of services to manage AWS instances, hybrid environments, resources, and virtual machines through a common UI. Its services are divided into categories such as Resource Groups, Insights, Actions and Shared Resources. One of the Shared Resources is Parameter Store, which is our topic of discussion today. Many Systems Manager services require the SSM agent to be installed on the system, but Parameter Store can be used standalone as well.

    What is Parameter Store?

Parameter Store is a service which helps you arrange your data in a systematic, hierarchical format for better reference. The data can be of any kind: passwords, keys, URLs or strings. It can be stored encrypted or as plain text, in key-value format. Parameter Store comes integrated with AWS KMS: it provides a default key and gives you the option to change it. In this blog we will be using the default one.

    Why Parameter Store?

Let’s compare it with its competitors, which include HashiCorp Vault and AWS Secrets Manager.

Vault stores secrets in a database or file system, but requires you to manage the root token and unseal keys yourself, and it is not easy to use.

Next is the AWS-owned Secrets Manager. This service is not free, and secret rotation requires writing Lambda functions, which can become an overhead. Also, the hierarchy is treated as a plain string, which can’t be iterated.

    Some Key Features of Parameter Store include:

• As KMS is integrated, encryption takes place automatically without extra parameters.
• It arranges your data hierarchically, and it is pretty simple: just use “/” to form the hierarchy, and a recursive search fetches the required parameters.
• It helps us remove those big config files which previously held our secrets and posed a severe security risk, helping us modularize our applications.
• Simple data like a name can be stored as a String.
• Secured data can be stored as a SecureString.
• Even array data can be stored, using a StringList.
• Access control is manageable with IAM.
• Linked with other services like AWS ECS, Lambda, and CloudFormation.
• AWS backed.
• Easy to use.
• Free of cost.

Note: Parameter Store is a region-specific service and thus might not be available in all regions.

    How to Use it?

    Initial Setup:

    Parameter Store can be used both via GUI and terminal.

    AWS console:

1. Log in to your account and select your preferred region.
2. In Services, select Systems Manager and then select Parameter Store.
3. If some parameters have already been created, their keys will be displayed.
4. If not, you will be asked to “Create Parameter.”

    On CLI:

1. Download the AWS CLI; it comes with inbuilt support for Systems Manager (SSM).
2. Make sure your credentials file is configured.

    Use: Both on Console and CLI

    1. Create

a. Enter the name of the key you wish to store. If it is hierarchical, use “/” (without quotes) in the name, and enter the value in the Value field.

    Eg: This
      |- is
        | - Key : Value

Then in Name enter “/This/is/Key” and in Value write “Value”.

b. Select the type of storage: if the value can be stored as plain text, use String; if it is in array format, choose StringList and mention the complete array in the value; and if you want to secure it, use SecureString.

    c. CLI:

$aws ssm put-parameter --name "/This/is/Key" --value "Value" --type String
{
    "Version": 1
}

    d. If you want to make it secure:

$aws ssm put-parameter --name "/This" --value "SecureValue" --type SecureString
{
    "Version": 1
}

    2. Read

    a. Once Stored, parameters get listed on the console.

    b. To check any of them, just click on the key. If not secured, the value will be directly visible and if it is secured, then the value would be hidden and you will have to explicitly press “Show”.

    AWS Parameter Overview

    c. CLI:

$aws ssm get-parameter --name /This/is/Key
{
    "Parameter": {
        "Name": "/This/is/Key",
        "LastModifiedDate": 1535362148.994,
        "Value": "Value",
        "Version": 1,
        "Type": "String",
        "ARN": "arn:aws:ssm:us-east-1:275829625285:parameter/This/is/Key"
    }
}

    d. For a SecureString:

    $aws ssm get-parameter --name /This --with-decryption
    {
        "Parameter": {
            "Name": "/This",
            "LastModifiedDate": 1535362296.062,
            "Value": "SecureValue",
            "Version": 1,
            "Type": "SecureString",
            "ARN": "arn:aws:ssm:us-east-1:275829625285:parameter/This"
        }
    }

    e. If you observe the above command, you will notice that despite providing “/This” we did not receive the complete tree. To get the whole subtree, modify the command as follows:

    $aws ssm get-parameters-by-path --path /This --recursive
    {
        "Parameters": [
            {
                "Name": "/This/is/Key",
                "LastModifiedDate": 1535362148.994,
                "Value": "Value",
                "Version": 1,
                "Type": "String",
                "ARN": "arn:aws:ssm:us-east-1:275829625285:parameter/This/is/Key"
            }
        ]
    }
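    To make the difference between an exact-name lookup and a recursive path lookup concrete, here is a minimal Python sketch (an illustration only, not the AWS SDK; the store contents mirror the examples above):

```python
# Minimal illustration (not the AWS SDK): why `get-parameter --name /This`
# returns only the exact key, while `get-parameters-by-path --path /This
# --recursive` returns the whole subtree underneath it.

store = {
    "/This": "SecureValue",
    "/This/is/Key": "Value",
    "/Other/Key": "OtherValue",
}

def get_parameter(name):
    """Exact-name lookup, like `aws ssm get-parameter`."""
    return {"Name": name, "Value": store[name]}

def get_parameters_by_path(path, recursive=True):
    """Prefix lookup, like `aws ssm get-parameters-by-path`."""
    prefix = path.rstrip("/") + "/"
    hits = {k: v for k, v in store.items() if k.startswith(prefix)}
    if not recursive:
        # Non-recursive mode returns only direct children of the path
        hits = {k: v for k, v in hits.items() if "/" not in k[len(prefix):]}
    return [{"Name": k, "Value": v} for k, v in sorted(hits.items())]

print(get_parameter("/This")["Value"])                       # -> SecureValue
print([p["Name"] for p in get_parameters_by_path("/This")])  # -> ['/This/is/Key']
```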

    3. Rotate/Modify:

    a. Once a value is saved, it is automatically versioned as 1. If you click on the parameter and EDIT it, the version gets incremented and the new value is stored as version 2. In this way, we achieve rotation of credentials as well.

    b. The type of a parameter cannot be changed; you will have to create a new one.

    c. CLI:
    The command itself is clear, just observe the version:

    $aws ssm put-parameter --name "/This/is/Key" --value "NewValue" --type String --overwrite
    {
    "Version": 2
    }
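    The version bump can be sketched with a tiny in-memory model (an assumption for illustration, not the real service API), mirroring the `--overwrite` behavior and the type restriction described above:

```python
# In-memory sketch (illustration only, not the real service): each overwrite
# of an existing name bumps Version by 1, and the type of an existing
# parameter cannot be changed.

class ParamStore:
    def __init__(self):
        self._params = {}  # name -> {"Value", "Type", "Version"}

    def put_parameter(self, name, value, type_, overwrite=False):
        existing = self._params.get(name)
        if existing:
            if not overwrite:
                raise ValueError(f"{name} already exists; pass overwrite=True")
            if existing["Type"] != type_:
                raise ValueError("cannot change type; create a new parameter")
            version = existing["Version"] + 1
        else:
            version = 1
        self._params[name] = {"Value": value, "Type": type_, "Version": version}
        return {"Version": version}

store = ParamStore()
print(store.put_parameter("/This/is/Key", "Value", "String"))           # -> {'Version': 1}
print(store.put_parameter("/This/is/Key", "NewValue", "String", True))  # -> {'Version': 2}
```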

    4. Delete:

    a. Select the parameter (or all the required parameters) and click Delete.

    b. CLI:

    $aws ssm delete-parameter --name "/This/is/Key"

    As you can see, the commands are pretty simple, and the ARN information is also populated. Below, we will discuss the IAM roles we can configure to help us with access control.

    IAM (AWS Identity and Access Management)

    Remember that we are storing some very critical data in Parameter Store, so access to that data should be well controlled. If, by mistake, a new developer on the team is given full access over the parameters, chances are they might end up modifying or deleting production parameters. This is something we really don’t want.

    Generally, it is a good practice to have roles and policies predefined so that only the person responsible has access to the required data. Control over the parameters can be applied at a granular level, but for this blog we will take a simple use case. That said, we can take the policies mentioned below as a reference.

    By using the Resource field, we can specify the path of the parameters that a particular policy can access. For example, if only the System Admin should be able to fetch production credentials, we place “parameter/production” in the policy, where production is the top-level hierarchy. Anything stored under production then becomes accessible; if we want to fine-tune it further, we can do so by extending the path: parameter/production/<till>/<the>/<last>/<level>

    Below are some of the policies that can be applied to a group or user on a server level. Depending on the requirement, explicit deny can also be applied to Developers for Production.

    For Production Servers:

    SSMProdReadOnly:

    "ssm:GetParameterHistory",
    "ssm:ListTagsForResource",
    "ssm:GetParametersByPath",
    "ssm:GetParameters",
    "ssm:GetParameter"
    "Resource": "arn:aws:ssm:<Region>:<Account-ID>:parameter/production"

    SSMProdWriteOnly:

    "ssm:GetParameterHistory",
    "ssm:ListTagsForResource",
    "ssm:GetParametersByPath",
    "ssm:GetParameters",
    "ssm:GetParameter",
    "ssm:PutParameter",
    "ssm:DeleteParameter",
    "ssm:AddTagsToResource",
    "ssm:DeleteParameters"
    "Resource": "arn:aws:ssm:<Region>:<Account-ID>:parameter/production"

    For Dev Servers:

    SSMDevelopmentReadWrite

    "ssm:PutParameter",
    "ssm:DeleteParameter",
    "ssm:RemoveTagsFromResource",
    "ssm:AddTagsToResource",
    "ssm:GetParameterHistory",
    "ssm:ListTagsForResource",
    "ssm:GetParametersByPath",
    "ssm:GetParameters",
    "ssm:GetParameter"
    "Resource": "arn:aws:ssm:<Region>:<Account-ID>:parameter/development"
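    For context, the action lists above are fragments. As a sketch, the production read-only set would sit inside a complete IAM policy document like this (the <Region> and <Account-ID> placeholders and the trailing /* on the path are assumptions to adapt to your account):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "SSMProdReadOnly",
      "Effect": "Allow",
      "Action": [
        "ssm:GetParameterHistory",
        "ssm:ListTagsForResource",
        "ssm:GetParametersByPath",
        "ssm:GetParameters",
        "ssm:GetParameter"
      ],
      "Resource": "arn:aws:ssm:<Region>:<Account-ID>:parameter/production/*"
    }
  ]
}
```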

    Conclusion

    This was all about the AWS systems manager parameter store and the IAM roles. Now that you know what the parameter store is, why should you use it, and how to use it, I hope this helps you in kick-starting your credential management using AWS Parameter Store. Start using it already and share your experiences or suggestions in the comments section below.

  • Managing a TLS Certificate for Kubernetes Admission Webhook

    A Kubernetes admission controller is a great way of handling an incoming request, whether to add or modify fields or deny the request as per the rules/configuration defined. To extend the native functionalities, these admission webhook controllers call a custom-configured HTTP callback (webhook server) for additional checks. But the API server only communicates over HTTPS with the admission webhook servers and needs TLS cert’s CA information. This poses a problem for how we handle this webhook server certificate and how to pass CA information to the API server automatically.

    One way to handle the TLS certificate and CA is using Kubernetes cert-manager. However, cert-manager itself is a big application and consists of many CRDs to handle its operation; it is not a good idea to install cert-manager just to handle the admission webhook TLS certificate and CA. The second, and possibly easier, way is to use a self-signed certificate and handle the CA on our own using an init container. This eliminates the dependency on other applications, like cert-manager, and gives us the flexibility to control our application flow.

    How is a custom admission webhook written? We will not cover this in depth; only a basic overview of admission controllers and their working will be given. The main focus of this blog is to cover the second approach step-by-step: handling the admission webhook server TLS certificate and CA on our own using an init container so that the API server can communicate with our custom webhook.

    To understand the in-depth working of admission controllers, the official Kubernetes documentation on dynamic admission control is a great resource.

    Prerequisites:

    • Knowledge of Kubernetes admission controllers, MutatingAdmissionWebhook, and ValidatingAdmissionWebhook
    • Knowledge of Kubernetes resources like pods and volumes

    Basic Overview: 

    Admission controllers intercept requests to the Kubernetes API server before persistence of objects in the etcd. These controllers are bundled and compiled together into the kube-apiserver binary. They consist of a list of controllers, and in that list, there are two special controllers: MutatingAdmissionWebhook and ValidatingAdmissionWebhook. MutatingAdmissionWebhook, as the name suggests, mutates/adds/modifies some fields in the request object by creating a patch, and ValidatingAdmissionWebhook validates the request by checking if the request object fields are valid or if the operation is allowed, etc., as per custom logic.

    The main reason for these types of controllers is to dynamically add new checks along with the native existing checks in Kubernetes to allow a request, just like the plug-in model. To understand this more clearly, let’s say we want all the deployments in the cluster to have certain required labels. If the deployment does not have required labels, then the create deployment request should be denied. This functionality can be achieved in two ways: 

    1) Add these extra checks natively in Kubernetes API server codebase, compile a new binary, and run with the new binary. This is a tedious process, and every time new checks are needed, a new binary is required. 

    2) Create a custom admission webhook, a simple HTTP server, for these additional checks, and register this admission webhook with the API server using the AdmissionRegistration API. Two configurations, MutatingWebhookConfiguration and ValidatingWebhookConfiguration, are used to register. The second approach is recommended, and it’s quite easy as well. We will be discussing it here in detail.

    Custom Admission Webhook Server:

    As mentioned earlier, a custom admission webhook server is a simple HTTP server with TLS that exposes endpoints for mutation and validation. Depending upon the endpoint hit, the corresponding handler processes mutation or validation. Once a custom webhook server is ready and deployed in a cluster as a deployment along with a webhook service, the next part is to register it with the API server so that the API server can communicate with the custom webhook server. To register, MutatingWebhookConfiguration and ValidatingWebhookConfiguration are used. These configurations have a section to fill in custom-webhook-related information.

    apiVersion: admissionregistration.k8s.io/v1
    kind: MutatingWebhookConfiguration
    metadata:
      name: mutation-config
    webhooks:
      - admissionReviewVersions:
        - v1beta1
        name: mapplication.kb.io
        clientConfig:
          caBundle: ${CA_BUNDLE}
          service:
            name: webhook-service
            namespace: default
            path: /mutate
        rules:
          - apiGroups:
              - apps
        apiVersions:
          - v1
        operations:
          - CREATE
            resources:
              - deployments
        sideEffects: None

    Here, the service field gives the name, namespace, and endpoint path of the running webhook server. An important field to note is caBundle. A custom admission webhook is required to run its HTTP server with TLS because the API server communicates only over HTTPS. So the webhook server runs with a server certificate and key, and “caBundle” in the configuration carries the CA (Certificate Authority) information so that the API server can recognize the server certificate.

    The problem here is how to handle this server certificate and key, and how to get this CA bundle and pass it to the API server using MutatingWebhookConfiguration or ValidatingWebhookConfiguration. This will be the main focus of the following part.

    Here, we are going to use a self-signed certificate for the webhook server. Now, this self-signed certificate can be made available to the webhook server using different ways. Two possible ways are:

    • Create a Kubernetes secret containing the certificate and key and mount it as a volume onto the server pod
    • Create the certificate and key in a shared volume, e.g., an emptyDir volume, from which the server consumes them

    However, even after doing either of the above, the remaining important part is to add the CA bundle to the mutation/validation configs.

    So, instead of doing all these steps manually, we will make use of a Kubernetes init container to perform all of these functions for us.

    Custom Admission Webhook Server Init Container:

    The main function of this init container is to create a self-signed webhook server certificate and provide the CA bundle to the API server via the mutation/validation configs. How the webhook server consumes this certificate (via a secret volume or an emptyDir volume) depends on the use case. This init container runs a simple Go binary to perform these functions.

    package main
    
    import (
    	"bytes"
    	cryptorand "crypto/rand"
    	"crypto/rsa"
    	"crypto/x509"
    	"crypto/x509/pkix"
    	"encoding/pem"
    	"fmt"
    	log "github.com/sirupsen/logrus"
    	"math/big"
    	"os"
    	"time"
    )
    
    func main() {
    	var caPEM, serverCertPEM, serverPrivKeyPEM *bytes.Buffer
    	// CA config
    	ca := &x509.Certificate{
    		SerialNumber: big.NewInt(2020),
    		Subject: pkix.Name{
    			Organization: []string{"velotio.com"},
    		},
    		NotBefore:             time.Now(),
    		NotAfter:              time.Now().AddDate(1, 0, 0),
    		IsCA:                  true,
    		ExtKeyUsage:           []x509.ExtKeyUsage{x509.ExtKeyUsageClientAuth, x509.ExtKeyUsageServerAuth},
    		KeyUsage:              x509.KeyUsageDigitalSignature | x509.KeyUsageCertSign,
    		BasicConstraintsValid: true,
    	}
    
    	// CA private key
    	caPrivKey, err := rsa.GenerateKey(cryptorand.Reader, 4096)
    	if err != nil {
    		fmt.Println(err)
    	}
    
    	// Self signed CA certificate
    	caBytes, err := x509.CreateCertificate(cryptorand.Reader, ca, ca, &caPrivKey.PublicKey, caPrivKey)
    	if err != nil {
    		fmt.Println(err)
    	}
    
    	// PEM encode CA cert
    	caPEM = new(bytes.Buffer)
    	_ = pem.Encode(caPEM, &pem.Block{
    		Type:  "CERTIFICATE",
    		Bytes: caBytes,
    	})
    
    	dnsNames := []string{"webhook-service",
    		"webhook-service.default", "webhook-service.default.svc"}
    	commonName := "webhook-service.default.svc"
    
    	// server cert config
    	cert := &x509.Certificate{
    		DNSNames:     dnsNames,
    		SerialNumber: big.NewInt(1658),
    		Subject: pkix.Name{
    			CommonName:   commonName,
    			Organization: []string{"velotio.com"},
    		},
    		NotBefore:    time.Now(),
    		NotAfter:     time.Now().AddDate(1, 0, 0),
    		SubjectKeyId: []byte{1, 2, 3, 4, 6},
    		ExtKeyUsage:  []x509.ExtKeyUsage{x509.ExtKeyUsageClientAuth, x509.ExtKeyUsageServerAuth},
    		KeyUsage:     x509.KeyUsageDigitalSignature,
    	}
    
    	// server private key
    	serverPrivKey, err := rsa.GenerateKey(cryptorand.Reader, 4096)
    	if err != nil {
    		fmt.Println(err)
    	}
    
    	// sign the server cert
    	serverCertBytes, err := x509.CreateCertificate(cryptorand.Reader, cert, ca, &serverPrivKey.PublicKey, caPrivKey)
    	if err != nil {
    		fmt.Println(err)
    	}
    
    	// PEM encode the  server cert and key
    	serverCertPEM = new(bytes.Buffer)
    	_ = pem.Encode(serverCertPEM, &pem.Block{
    		Type:  "CERTIFICATE",
    		Bytes: serverCertBytes,
    	})
    
    	serverPrivKeyPEM = new(bytes.Buffer)
    	_ = pem.Encode(serverPrivKeyPEM, &pem.Block{
    		Type:  "RSA PRIVATE KEY",
    		Bytes: x509.MarshalPKCS1PrivateKey(serverPrivKey),
    	})
    
    	err = os.MkdirAll("/etc/webhook/certs/", 0755) // directories need the execute bit
    	if err != nil {
    		log.Panic(err)
    	}
    	err = WriteFile("/etc/webhook/certs/tls.crt", serverCertPEM)
    	if err != nil {
    		log.Panic(err)
    	}
    
    	err = WriteFile("/etc/webhook/certs/tls.key", serverPrivKeyPEM)
    	if err != nil {
    		log.Panic(err)
    	}
    
    }
    
    // WriteFile writes data in the file at the given path
    func WriteFile(filepath string, sCert *bytes.Buffer) error {
    	f, err := os.Create(filepath)
    	if err != nil {
    		return err
    	}
    	defer f.Close()
    
    	_, err = f.Write(sCert.Bytes())
    	if err != nil {
    		return err
    	}
    	return nil
    }

    The steps to generate self-signed CA and sign webhook server certificate using this CA in Golang:

    • Create a config for the CA, ca in the code above.
    • Create an RSA private key for this CA, caPrivKey in the code above.
    • Generate a self-signed CA, caBytes, and caPEM above. Here caPEM is the PEM encoded caBytes and will be the CA bundle given to the API server.
    • Create a config for the webhook server certificate, cert in the code above. The important fields in this config are DNSNames and CommonName. These must be the full webhook service DNS names so the API server can reach the webhook pod.
    • Create an RSA private key for the webhook server, serverPrivKey in the code above.
    • Create server certificate using ca and caPrivKey, serverCertBytes in the code above.
    • Now, PEM encode serverCertBytes and serverPrivKey. The resulting serverCertPEM and serverPrivKeyPEM are the TLS certificate and key and will be consumed by the webhook server.

    At this point, we have generated the required certificate, key, and CA bundle using init container. Now we will share this server certificate and key with the actual webhook server container in the same pod. 

    • One approach is to create an empty secret resource beforehand and create the webhook deployment passing the secret name as an environment variable. The init container generates the server certificate and key and populates the empty secret with them. This secret is mounted onto the webhook server container to start the HTTP server with TLS.
    • The second approach (used in the code above) is to use Kubernetes’ native pod-scoped emptyDir volume. This volume is shared between both containers. In the code above, we can see the init container writing the certificate and key to files on a particular path. This path is where the emptyDir volume is mounted, and the webhook server container reads the certificate and key for its TLS configuration from that path and starts the HTTP webhook server. Refer to the diagram below:

    The pod spec will look something like this:

    spec:
      initContainers:
        - image: <webhook init-image name>
          imagePullPolicy: IfNotPresent
          name: webhook-init
          volumeMounts:
            - mountPath: /etc/webhook/certs
              name: webhook-certs
      containers:
        - image: <webhook server image name>
          imagePullPolicy: IfNotPresent
          name: webhook-server
          volumeMounts:
            - mountPath: /etc/webhook/certs
              name: webhook-certs
              readOnly: true
      volumes:
        - name: webhook-certs
          emptyDir: {}

    The only part remaining is to give this CA bundle information to the API server using mutation/validation configs. This can be done in two ways:

    • Patch the CA bundle in the existing MutatingWebhookConfiguration or ValidatingWebhookConfiguration using Kubernetes go-client in the init container.
    • Create MutatingWebhookConfiguration or ValidatingWebhookConfiguration in the init container itself with CA bundle information in configs.

    Here, we will create the configs through the init container. To get certain parameters, like the mutation config name, webhook service name, and webhook namespace dynamically, we can pass them as the init container’s environment variables:

    initContainers:
      - image: <webhook init-image name>
        imagePullPolicy: IfNotPresent
        name: webhook-init
        volumeMounts:
          - mountPath: /etc/webhook/certs
            name: webhook-certs
        env:
          - name: MUTATE_CONFIG
            value: mutating-webhook-configuration
          - name: VALIDATE_CONFIG
            value: validating-webhook-configuration
          - name: WEBHOOK_SERVICE
            value: webhook-service
          - name: WEBHOOK_NAMESPACE
            value: default

    To create the MutatingWebhookConfiguration, we will add the below piece of code to the init container, after the certificate generation code.

    package main
    
    import (
    	"bytes"
    	admissionregistrationv1 "k8s.io/api/admissionregistration/v1"
    	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    	"k8s.io/client-go/kubernetes"
    	"os"
    	ctrl "sigs.k8s.io/controller-runtime"
    )
    
    func createMutationConfig(caCert *bytes.Buffer) {
    
    	var (
    		webhookNamespace, _ = os.LookupEnv("WEBHOOK_NAMESPACE")
    		mutationCfgName, _  = os.LookupEnv("MUTATE_CONFIG")
    		// validationCfgName, _ = os.LookupEnv("VALIDATE_CONFIG") Not used here in below code
    		webhookService, _ = os.LookupEnv("WEBHOOK_SERVICE")
    	)
    	config := ctrl.GetConfigOrDie()
    	kubeClient, err := kubernetes.NewForConfig(config)
    	if err != nil {
    		panic("failed to set up go-client")
    	}
    
    	path := "/mutate"
    	fail := admissionregistrationv1.Fail
    
    	mutateconfig := &admissionregistrationv1.MutatingWebhookConfiguration{
    		ObjectMeta: metav1.ObjectMeta{
    			Name: mutationCfgName,
    		},
    		Webhooks: []admissionregistrationv1.MutatingWebhook{{
    			Name: "mapplication.kb.io",
    			ClientConfig: admissionregistrationv1.WebhookClientConfig{
    				CABundle: caCert.Bytes(), // CA bundle created earlier
    				Service: &admissionregistrationv1.ServiceReference{
    					Name:      webhookService,
    					Namespace: webhookNamespace,
    					Path:      &path,
    				},
    			},
    			Rules: []admissionregistrationv1.RuleWithOperations{{Operations: []admissionregistrationv1.OperationType{
    				admissionregistrationv1.Create},
    				Rule: admissionregistrationv1.Rule{
    					APIGroups:   []string{"apps"},
    					APIVersions: []string{"v1"},
    					Resources:   []string{"deployments"},
    				},
    			}},
    			FailurePolicy: &fail,
    		}},
    	}
      
    	if _, err := kubeClient.AdmissionregistrationV1().MutatingWebhookConfigurations().Create(mutateconfig); err != nil {
    		panic(err)
    	}
    }

    The code above is just sample code to create a MutatingWebhookConfiguration. First, we import the required packages. Then, we read the environment variables like webhookNamespace. Next, we define the MutatingWebhookConfiguration struct with the CA bundle information (created earlier) and other required fields. Finally, we create the configuration using the go-client. The same approach can be followed for creating the ValidatingWebhookConfiguration. For cases of pod restart or deletion, we can add extra logic to the init container, like deleting the existing configs before creating them, or only updating the CA bundle if the configs already exist.

    For certificate rotation, the approach differs depending on how the certificate is served to the server container:

    • If we are using emptyDir volume, then the approach will be to just restart the webhook pod. As emptyDir volume is ephemeral and bound to the lifecycle of the pod, on restart, a new certificate will be generated and served to the server container. A new CA bundle will be added in configs if configs already exist.
    • If we are using secret volume, then, while restarting the webhook pod, the expiration of the existing certificate from the secret can be checked to decide whether to use the existing certificate for the server or create a new one.

    In both cases, a webhook pod restart is required to trigger the certificate rotation/renewal process. When and how the webhook pod is restarted will vary depending on the use case; a few possible ways are cron jobs, controllers, etc.
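    As a sketch of the cron-job option (all names, the schedule, and the kubectl image are assumptions, and the service account needs RBAC permission to patch the deployment), a CronJob could periodically restart the webhook deployment to renew the certificate:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: webhook-cert-rotator
spec:
  schedule: "0 0 1 * *"   # monthly; pick a period shorter than the cert lifetime
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: webhook-rotator   # assumed SA with rollout-restart RBAC
          restartPolicy: OnFailure
          containers:
            - name: kubectl
              image: bitnami/kubectl:latest
              command: ["kubectl", "rollout", "restart", "deployment/webhook-server", "-n", "default"]
```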

    Now, our custom webhook is registered, the API server can read CA bundle information through configs, and the webhook server is ready to serve the mutation/validation requests as per rules defined in configs. 

    Conclusion:

    We covered how to add additional mutation/validation checks by registering our own custom admission webhook server. We also covered how to automatically handle the webhook server TLS certificate and key using init containers, passing the CA bundle information to the API server through the mutation/validation configs.

    Related Articles:

    1. OPA On Kubernetes: An Introduction For Beginners

    2. Prow + Kubernetes – A Perfect Combination To Execute CI/CD At Scale

  • Mesosphere DC/OS Masterclass : Tips and Tricks to Make Life Easier

    DC/OS is an open-source operating system and distributed system for data centers, built on the Apache Mesos distributed systems kernel. As a distributed system, it is a cluster of master and private/public agent nodes, where each node also has its own host operating system that manages the underlying machine.

    It enables the management of multiple machines as if they were a single computer. It automates resource management, schedules process placement, facilitates inter-process communication, and simplifies the installation and management of distributed services. Its included web interface and available command-line interface (CLI) facilitate remote management and monitoring of the cluster and its services.

    • Distributed System : DC/OS is a distributed system with a group of private and public nodes coordinated by master nodes.
    • Cluster Manager : DC/OS is responsible for running tasks on agent nodes and providing the required resources to them. DC/OS uses Apache Mesos to provide cluster management functionality.
    • Container Platform : All DC/OS tasks are containerized. DC/OS uses two different container runtimes, Docker and Mesos, so containers can be started from Docker images or be native executables (binaries or scripts) that are containerized at runtime by Mesos.
    • Operating System : As the name suggests, DC/OS is an operating system that abstracts cluster hardware and software resources and provides common services to applications.

    Unlike Linux, DC/OS is not a host operating system. DC/OS spans multiple machines, but relies on each machine to have its own host operating system and host kernel.

    The high level architecture of DC/OS can be seen below :

    For the detailed architecture and components of DC/OS, please click here.

    Adoption and usage of Mesosphere DC/OS:

    Mesosphere customers include :

    • 30% of the Fortune 50 U.S. Companies
    • 5 of the top 10 North American Banks
    • 7 of the top 12 Worldwide Telcos
    • 5 of the top 10 Highest Valued Startups

    Some companies using DC/OS are :

    • Cisco
    • Yelp
    • Tommy Hilfiger
    • Uber
    • Netflix
    • Verizon
    • Cerner
    • NIO

    Installing and using DC/OS

    A guide to installing DC/OS can be found here. After installing DC/OS on any platform, install dcos cli by following documentation found here.

    Using the dcos CLI, we can manage cluster nodes, manage Marathon tasks and services, and install/remove packages from Universe; it also provides great support for automation, as each CLI command can produce JSON output.

    NOTE: The tasks below were executed and tested with the following tools:

    • DC/OS 1.11 Open Source
    • DC/OS cli 0.6.0
    • jq:1.5-1-a5b5cbe

    DC/OS commands and scripts

    Setup DC/OS cli with DC/OS cluster

    dcos cluster setup <CLUSTER URL>

    Example :

    dcos cluster setup http://dcos-cluster.com

    The above command will give you a link for OAuth authentication and prompt for an auth token. You can authenticate yourself with a Google, GitHub, or Microsoft account. Paste the token generated after authentication into the CLI prompt (provided OAuth is enabled).

    DC/OS authentication token

    dcos config show core.dcos_acs_token

    DC/OS cluster url

    dcos config show core.dcos_url

    DC/OS cluster name

    dcos config show cluster.name

    Access Mesos UI

    <DC/OS_CLUSTER_URL>/mesos

    Example:

    http://dcos-cluster.com/mesos

    Access Marathon UI

    <DC/OS_CLUSTER_URL>/service/marathon

    Example:

    http://dcos-cluster.com/service/marathon

    Access any DC/OS service, like Marathon, Kafka, Elastic, Spark etc.[DC/OS Services]

    <DC/OS_CLUSTER_URL>/service/<SERVICE_NAME>

    Example:

    http://dcos-cluster.com/service/marathon
    http://dcos-cluster.com/service/kafka

    Access DC/OS slaves info in json using Mesos API [Mesos Endpoints]

    curl -H "Authorization: Bearer $(dcos config show core.dcos_acs_token)" $(dcos config show core.dcos_url)/mesos/slaves | jq

    Access DC/OS slaves info in json using DC/OS cli

    dcos node --json

    Note: The DC/OS cli command ‘dcos node --json’ is equivalent to querying the Mesos slaves endpoint (/mesos/slaves)

    Access DC/OS private slaves info using DC/OS cli

    dcos node --json | jq '.[] | select(.type | contains("agent")) | select(.attributes.public_ip == null) | "Private Agent : " + .hostname ' -r

    Access DC/OS public slaves info using DC/OS cli

    dcos node --json | jq '.[] | select(.type | contains("agent")) | select(.attributes.public_ip != null) | "Public Agent : " + .hostname ' -r

    Access DC/OS private and public slaves info using DC/OS cli

    dcos node --json | jq '.[] | select(.type | contains("agent")) | if (.attributes.public_ip != null) then "Public Agent : " else "Private Agent : " end + " - " + .hostname ' -r | sort
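    The jq filters above can be mirrored in a few lines of Python, which may make the public/private distinction easier to see (the sample JSON below is made up for illustration; real `dcos node --json` output has many more fields):

```python
import json

# Sketch of what the jq filters compute: an agent node with a "public_ip"
# attribute is a public agent; an agent without it is a private agent.
nodes_json = '''
[
  {"type": "agent", "hostname": "10.0.1.10", "attributes": {"public_ip": "true"}},
  {"type": "agent", "hostname": "10.0.2.20", "attributes": {}},
  {"type": "master (leader)", "ip": "10.0.0.1", "attributes": {}}
]
'''

def classify_agents(nodes):
    out = []
    for node in nodes:
        if "agent" not in node.get("type", ""):
            continue  # skip masters, like `select(.type | contains("agent"))`
        kind = "Public Agent" if node["attributes"].get("public_ip") else "Private Agent"
        out.append(f"{kind} : {node['hostname']}")
    return sorted(out)

print(classify_agents(json.loads(nodes_json)))
```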

    Get public IP of all public agents

    #!/bin/bash
    
    for id in $(dcos node --json | jq --raw-output '.[] | select(.attributes.public_ip == "true") | .id'); 
    do 
          dcos node ssh --option StrictHostKeyChecking=no --option LogLevel=quiet --master-proxy --mesos-id=$id "curl -s ifconfig.co"
    done 2>/dev/null

    Note: ‘dcos node ssh’ requires a private key to be added to SSH. Make sure you add your private key as an SSH identity using:

    ssh-add </path/to/private/key/file/.pem>

    Get public IP of master leader

    dcos node ssh --option StrictHostKeyChecking=no --option LogLevel=quiet --master-proxy --leader "curl -s ifconfig.co" 2>/dev/null

    Get all master nodes and their private ip

    dcos node --json | jq '.[] | select(.type | contains("master")) | .ip + " = " + .type' -r

    Get list of all users who have access to DC/OS cluster

    curl -s -H "Authorization: Bearer $(dcos config show core.dcos_acs_token)" "$(dcos config show core.dcos_url)/acs/api/v1/users" | jq '.array[].uid' -r

    Add users to cluster using Mesosphere script (Run this on master)

    Users to add are given in list.txt, each user on new line

    for i in `cat list.txt`; do echo $i;
    sudo -i dcos-shell /opt/mesosphere/bin/dcos_add_user.py $i; done

    Add users to cluster using DC/OS API

    #!/bin/bash
    
    # Usage: dcosAddUsers.sh (users to add are given in users.list, each user on a new line)
    for i in `cat users.list`; 
    do 
      echo $i
      curl -X PUT -H "Authorization: Bearer $(dcos config show core.dcos_acs_token)" "$(dcos config show core.dcos_url)/acs/api/v1/users/$i" -d "{}"
    done

    Delete users from DC/OS cluster organization

    #!/bin/bash
    
    # Usage: dcosDeleteUsers.sh (users to delete are given in users.list, each user on a new line)
    
    for i in `cat users.list`; 
    do 
      echo $i
      curl -X DELETE -H "Authorization: Bearer $(dcos config show core.dcos_acs_token)" "$(dcos config show core.dcos_url)/acs/api/v1/users/$i" -d "{}"
    done

    Offers/resources from individual DC/OS agent

    In recent versions of many DC/OS services, a scheduler endpoint at

    http://yourcluster.com/service/<service-name>/v1/debug/offers

    will display an HTML table containing a summary of recently evaluated offers. The table’s contents are currently very similar to what can be found in the logs, but in a slightly more accessible format. Alternatively, we can look at the scheduler’s logs on stdout. An offer is a set of resources, all from one individual DC/OS agent.

    <DC/OS_CLUSTER_URL>/service/<service_name>/v1/debug/offers

    Example:

    http://dcos-cluster.com/service/kafka/v1/debug/offers
    http://dcos-cluster.com/service/elastic/v1/debug/offers

    Save JSON configs of all running Marathon apps

    #!/bin/bash
    
    # Save marathon configs in json format for all marathon apps
    # Usage : saveMarathonConfig.sh
    
    for service in `dcos marathon app list --quiet | tr -d "/" | sort`; do
      dcos marathon app show $service | jq '. | del(.tasks, .version, .versionInfo, .tasksHealthy, .tasksRunning, .tasksStaged, .tasksUnhealthy, .deployments, .executor, .lastTaskFailure, .args, .ports, .residency, .secrets, .storeUrls, .uris, .user)' >& $service.json
    done

    Get report of Marathon apps with details like container type, Docker image, tag or service version used by Marathon app.

    #!/bin/bash
    
    TMP_CSV_FILE=$(mktemp /tmp/dcos-config.XXXXXX.csv)
    TMP_CSV_FILE_SORT="${TMP_CSV_FILE}_sort"
    #dcos marathon app list --json | jq '.[] | if (.container.docker.image != null ) then .id + ",Docker Application," + .container.docker.image else .id + ",DCOS Service," + .labels.DCOS_PACKAGE_VERSION end' -r > $TMP_CSV_FILE
    dcos marathon app list --json | jq '.[] | .id + if (.container.type == "DOCKER") then ",Docker Container," + .container.docker.image else ",Mesos Container," + if(.labels.DCOS_PACKAGE_VERSION !=null) then .labels.DCOS_PACKAGE_NAME+":"+.labels.DCOS_PACKAGE_VERSION  else "[ CMD ]" end end' -r > $TMP_CSV_FILE
    sed -i "s|^/||g" $TMP_CSV_FILE
    sort -t "," -k2,2 -k3,3 -k1,1 $TMP_CSV_FILE > ${TMP_CSV_FILE_SORT}
    cnt=1
    printf '%.0s=' {1..150}
    printf "n  %-5s%-35s%-23s%-40s%-20sn" "No" "Application Name" "Container Type" "Docker Image" "Tag / Version"
    printf '%.0s=' {1..150}
    while IFS=, read -r app typ image; 
    do
            tag=`echo $image | awk -F':' -v im="$image" '{tag=(im=="[ CMD ]")?"NA":($2=="")?"latest":$2; print tag}'`
            image=`echo $image | awk -F':' '{print $1}'`
            printf "n  %-5s%-35s%-23s%-40s%-20s" "$cnt" "$app" "$typ" "$image" "$tag"
            cnt=$((cnt + 1))
            sleep 0.3
    done < $TMP_CSV_FILE_SORT
    printf "n"
    printf '%.0s=' {1..150}
    printf "n"

    Get DC/OS nodes with more information, like node type, node IP, attributes, number of running tasks, free memory, free CPU, etc.

    #!/bin/bash
    
    printf "n  %-15s %-18s%-18s%-10s%-15s%-10sn" "Node Type" "Node IP" "Attribute" "Tasks" "Mem Free (MB)" "CPU Free"
    printf '%.0s=' {1..90}
    printf "n"
    TAB=`echo -e "t"`
    dcos node --json | jq '.[] | if (.type | contains("leader")) then "Master (leader)" elif ((.type | contains("agent")) and .attributes.public_ip != null) then "Public Agent" elif ((.type | contains("agent")) and .attributes.public_ip == null) then "Private Agent" else empty end + "\t" + if(.type |contains("master")) then .ip else .hostname end + "\t" +  (if (.attributes | length !=0) then (.attributes | to_entries[] | join(" = ")) else "NA" end) + "\t" + if(.type |contains("agent")) then (.TASK_RUNNING|tostring) + "\t" + ((.resources.mem - .used_resources.mem)| tostring) + "\t\t" +  ((.resources.cpus - .used_resources.cpus)| tostring)  else "\t\tNA\tNA\t\tNA"  end' -r | sort -t"$TAB" -k1,1d -k3,3d -k2,2d
    printf '%.0s=' {1..90}
    printf "n"

    Framework Cleaner

    Uninstall a framework and clean up any reserved resources left behind after the framework is deleted/uninstalled. (Applicable to DC/OS 1.9 or older; on 1.10 and later, the uninstall CLI alone is sufficient.)

    SERVICE_NAME=
    dcos package uninstall $SERVICE_NAME
    dcos node ssh --option StrictHostKeyChecking=no --master-proxy --leader \
      "docker run mesosphere/janitor /janitor.py -r ${SERVICE_NAME}-role -p ${SERVICE_NAME}-principal -z dcos-service-${SERVICE_NAME}"

    Get DC/OS apps and their placement constraints

    dcos marathon app list --json | jq '.[] |
    if (.constraints != null) then .id, .constraints else empty end'

    Run shell command on all slaves

    #!/bin/bash
    
    # Run any shell command on all slave nodes (private and public)
    
    # Usage : dcosRunOnAllSlaves.sh <CMD= any shell command to run, Ex: ulimit -a >
    CMD=$1
    for i in `dcos node | egrep -v "TYPE|master" | awk '{print $1}'`; do 
       echo -e "n###> Running command [ $CMD ] on $i"
       dcos node ssh --option StrictHostKeyChecking=no --option LogLevel=quiet --master-proxy --private-ip=$i "$CMD"
       echo -e "======================================n"
    done

    Run shell command on master leader

    CMD=<shell command, Ex: ulimit -a >
    dcos node ssh --option StrictHostKeyChecking=no --option LogLevel=quiet --master-proxy --leader "$CMD"

    Run shell command on all master nodes

    #!/bin/bash
    
    # Run any shell command on all master nodes
    
    # Usage : dcosRunOnAllMasters.sh <CMD= any shell command to run, Ex: ulimit -a >
    CMD=$1
    for i in `dcos node | egrep -v "TYPE|agent" | awk '{print $2}'` 
    do 
      echo -e "n###> Running command [ $CMD ] on $i"
      dcos node ssh --option StrictHostKeyChecking=no --option LogLevel=quiet --master-proxy --private-ip=$i "$CMD"
     echo -e "======================================n"
    done

    Add node attributes to dcos nodes and run apps on nodes with required attributes using placement constraints

    #!/bin/bash
    
    #1. SSH on node 
    #2. Create or edit file /var/lib/dcos/mesos-slave-common
    #3. Add contents as :
    #    MESOS_ATTRIBUTES=<key>:<value>
    #    Example:
    #    MESOS_ATTRIBUTES=TYPE:DB;DB_TYPE:MONGO;
    #4. Stop dcos-mesos-slave service
    #    systemctl stop dcos-mesos-slave
    #5. Remove link for latest slave metadata
    #    rm -f /var/lib/mesos/slave/meta/slaves/latest
    #6. Start dcos-mesos-slave service
    #    systemctl start dcos-mesos-slave
    #7. Wait for some time, node will be in HEALTHY state again.
    #8. Add app placement constraint with field = key and value = value
    #9. Verify attributes, run on any node
    #    curl -s http://leader.mesos:5050/state | jq '.slaves[]| .hostname ,.attributes'
    #    OR Check DCOS cluster UI
    #    Nodes => Select any Node => Details Tab
    
    tmpScript=$(mktemp "/tmp/addDcosNodeAttributes-XXXXXXXX")
    
    # key:value paired attributes, separated by ;
    ATTRIBUTES=NODE_TYPE:GPU_NODE
    
    cat <<EOF > ${tmpScript}
    echo "MESOS_ATTRIBUTES=${ATTRIBUTES}" | sudo tee /var/lib/dcos/mesos-slave-common
    sudo systemctl stop dcos-mesos-slave
    sudo rm -f /var/lib/mesos/slave/meta/slaves/latest
    sudo systemctl start dcos-mesos-slave
    EOF
    
    # Add the private ip of nodes on which you want to add attributes, one ip per line.
    for i in `cat nodes.txt`; do 
        echo $i
        dcos node ssh --master-proxy --option StrictHostKeyChecking=no --private-ip $i <$tmpScript
        sleep 10
    done

    Install DC/OS Datadog metrics plugin on all DC/OS nodes

    #!/bin/bash
    
    # Usage : bash installDCOSDataDogMetricsPlugin.sh <Datadog API KEY>
    
    DDAPI=$1
    
    if [[ -z $DDAPI ]]; then
        echo "[Datadog Plugin] Need datadog API key as parameter."
        echo "[Datadog Plugin] Usage : bash installDCOSDataDogMetricsPlugin.sh <Datadog API KEY>."
        exit 1
    fi
    tmpScriptMaster=$(mktemp "/tmp/installDatadogPlugin-XXXXXXXX")
    tmpScriptAgent=$(mktemp "/tmp/installDatadogPlugin-XXXXXXXX")
    
    declare agent=$tmpScriptAgent
    declare master=$tmpScriptMaster
    
    for role in "agent" "master"
    do
    cat <<EOF > ${!role}
    curl -s -o /opt/mesosphere/bin/dcos-metrics-datadog -L https://downloads.mesosphere.io/dcos-metrics/plugins/datadog
    chmod +x /opt/mesosphere/bin/dcos-metrics-datadog
    echo "[Datadog Plugin] Downloaded dcos datadog metrics plugin."
    export DD_API_KEY=$DDAPI
    export AGENT_ROLE=$role
    sudo curl -s -o /etc/systemd/system/dcos-metrics-datadog.service https://downloads.mesosphere.io/dcos-metrics/plugins/datadog.service
    echo "[Datadog Plugin] Downloaded dcos-metrics-datadog.service."
    sudo sed -i "s/--dcos-role master/--dcos-role \$AGENT_ROLE/g;s/--datadog-key .*/--datadog-key \$DD_API_KEY/g" /etc/systemd/system/dcos-metrics-datadog.service
    echo "[Datadog Plugin] Updated dcos-metrics-datadog.service with DD API Key and agent role."
    sudo systemctl daemon-reload
    sudo systemctl start dcos-metrics-datadog.service
    echo "[Datadog Plugin] dcos-metrics-datadog.service is started !"
    servStatus=\$(sudo systemctl is-failed dcos-metrics-datadog.service)
    echo "[Datadog Plugin] dcos-metrics-datadog.service status : \${servStatus}"
    #sudo systemctl status dcos-metrics-datadog.service | head -3
    #sudo journalctl -u dcos-metrics-datadog
    EOF
    done
    
    echo "[Datadog Plugin] Temp script for master saved at : $tmpScriptMaster"
    echo "[Datadog Plugin] Temp script for agent saved at : $tmpScriptAgent"
    
    for i in `dcos node | egrep -v "TYPE|master" | awk '{print $1}'` 
    do 
        echo -e "\n###> Node - $i"
        dcos node ssh --option LogLevel=quiet --option StrictHostKeyChecking=no --master-proxy --private-ip=$i < $tmpScriptAgent
        echo -e "======================================================="
    done
    
    for i in `dcos node | egrep -v "TYPE|agent" | awk '{print $2}'` 
    do 
        echo -e "\n###> Master Node - $i"
        dcos node ssh --option LogLevel=quiet --option StrictHostKeyChecking=no --master-proxy --private-ip=$i < $tmpScriptMaster
        echo -e "======================================================="
    done
    
    # Check status of dcos-metrics-datadog.service on all nodes.
    #for i in `dcos node | egrep -v "TYPE|master" | awk '{print $1}'` ; do  echo -e "\n###> $i"; dcos node ssh --option StrictHostKeyChecking=no --option LogLevel=quiet --master-proxy --private-ip=$i "sudo systemctl is-failed dcos-metrics-datadog.service"; echo -e "======================================\n"; done

    Get app / node metrics fetched by dcos-metrics component using metrics API

    • Get DC/OS node id [dcos node]
    • Get node metrics (CPU, memory, local filesystems, networks, etc.): <DC/OS_CLUSTER_URL>/system/v1/agent/<agent_id>/metrics/v0/node
    • Get the IDs of all containers running on that agent: <DC/OS_CLUSTER_URL>/system/v1/agent/<agent_id>/metrics/v0/containers
    • Get resource allocation and usage for a given container ID: <DC/OS_CLUSTER_URL>/system/v1/agent/<agent_id>/metrics/v0/containers/<container_id>
    • Get application-level metrics from the container (shipped in StatsD format using the listener available at STATSD_UDP_HOST and STATSD_UDP_PORT): <DC/OS_CLUSTER_URL>/system/v1/agent/<agent_id>/metrics/v0/containers/<container_id>/app
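
    These endpoints take the same bearer-token authentication as the rest of the DC/OS API. A small sketch of the node-metrics call (the node_metrics_url helper is hypothetical; the agent ID can be looked up with dcos node):

    ```shell
    # Hypothetical helper: build the metrics API path for a given Mesos agent ID.
    node_metrics_url() {
      local cluster_url=$1 agent_id=$2
      echo "${cluster_url}/system/v1/agent/${agent_id}/metrics/v0/node"
    }

    # Against a live cluster:
    #   curl -s -H "Authorization: Bearer $(dcos config show core.dcos_acs_token)" \
    #        "$(node_metrics_url "$(dcos config show core.dcos_url)" "<agent_id>")"
    ```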

    Get app / node metrics fetched by dcos-metrics component using dcos cli

    • Summary of container metrics for a specific task
    dcos task metrics summary <task-id>

    • All metrics in details for a specific task
    dcos task metrics details <task-id>

    • Summary of Node metrics for a specific node
    dcos node metrics summary <mesos-node-id>

    • All Node metrics in details for a specific node
    dcos node metrics details <mesos-node-id>

    NOTE: All the above commands support a '--json' flag to use them programmatically.

    Launch / run command inside container for a task

    The DC/OS task exec CLI only supports Mesos containers; this script supports both Mesos and Docker containers.

    #!/bin/bash
    
    echo "DCOS Task Exec 2.0"
    if [ "$#" -eq 0 ]; then
            echo "Need task name or id as input. Exiting."
            exit 1
    fi
    taskName=$1
    taskCmd=${2:-bash}
    TMP_TASKLIST_JSON=/tmp/dcostasklist.json
    dcos task --json > $TMP_TASKLIST_JSON
    taskExist=`cat $TMP_TASKLIST_JSON | jq --arg tname $taskName '.[] | if(.name == $tname ) then .name else empty end' -r | wc -l`
    if [[ $taskExist -eq 0 ]]; then 
            echo "No task with name $taskName exists."
            echo "Do you mean ?"
            dcos task | grep $taskName | awk '{print $1}'
            exit 1
    fi
    taskType=`cat $TMP_TASKLIST_JSON | jq --arg tname $taskName '[.[] | select(.name == $tname)][0] | .container.type' -r`
    TaskId=`cat $TMP_TASKLIST_JSON | jq --arg tname $taskName '[.[] | select(.name == $tname)][0] | .id' -r`
    if [[ $taskExist -ne 1 ]]; then
            echo -e "More than one instances. Please select task ID for executing command.n"
            #allTaskIds=$(dcos task $taskName | tee /dev/tty | grep -v "NAME" | awk '{print $5}' | paste -s -d",")
            echo ""
            read TaskId
    fi
    if [[ $taskType !=  "DOCKER" ]]; then
            echo "Task [ $taskName ] is of type MESOS Container."
            execCmd="dcos task exec --interactive --tty $TaskId $taskCmd"
            echo "Running [$execCmd]"
            $execCmd
    else
            echo "Task [ $taskName ] is of type DOCKER Container."
            taskNodeIP=`dcos task $TaskId | awk 'FNR == 2 {print $2}'`
            echo "Task [ $taskName ] with task Id [ $TaskId ] is running on node [ $taskNodeIP ]."
            taskContID=`dcos node ssh --option LogLevel=quiet --option StrictHostKeyChecking=no --private-ip=$taskNodeIP --master-proxy "docker ps -q --filter label=MESOS_TASK_ID=$TaskId" 2> /dev/null`
            taskContID=`echo $taskContID | tr -d '\r'`
            echo "Task Docker Container ID : [ $taskContID ]"
            echo "Running [ docker exec -it $taskContID $taskCmd ]"
            dcos node ssh --option StrictHostKeyChecking=no --option LogLevel=quiet --private-ip=$taskNodeIP --master-proxy "docker exec -it $taskContID $taskCmd" 2>/dev/null
    fi

    Get DC/OS tasks by node

    #!/bin/bash 
    
    function tasksByNodeAPI
    {
        echo "DC/OS Tasks By Node"
        if [ "$#" -eq 0 ]; then
            echo "Need node ip as input. Exiting."
            exit 1
        fi
        nodeIp=$1
        mesosId=`dcos node | grep $nodeIp | awk '{print $3}'`
        if [ -z "mesosId" ]; then
            echo "No node found with ip $nodeIp. Exiting."
            exit 1
        fi
        curl -s -H "Authorization: Bearer $(dcos config show core.dcos_acs_token)" "$(dcos config show core.dcos_url)/mesos/tasks?limit=10000" | jq --arg mesosId $mesosId '.tasks[] | select (.slave_id == $mesosId and .state == "TASK_RUNNING") | .name + "ttt" + .id'  -r
    }
    
    function tasksByNodeCLI
    {
            echo "DC/OS Tasks By Node"
            if [ "$#" -eq 0 ]; then
                    echo "Need node ip as input. Exiting."
                    exit 1
            fi
            nodeIp=$1
            dcos task | egrep "HOST|$nodeIp"
    }

    Get cluster metadata – cluster Public IP and cluster ID

    curl -s -H "Authorization: Bearer $(dcos config show core.dcos_acs_token)"           
    $(dcos config show core.dcos_url)/metadata 

    Sample Output:

    {
      "PUBLIC_IPV4": "123.456.789.012",
      "CLUSTER_ID": "abcde-abcde-abcde-abcde-abcde-abcde"
    }

    Get DC/OS metadata – DC/OS version

    curl -s -H "Authorization: Bearer $(dcos config show core.dcos_acs_token)" \
      $(dcos config show core.dcos_url)/dcos-metadata/dcos-version.json

    Sample Output:

    {
      "version": "1.11.0",
      "dcos-image-commit": "b6d6ad4722600877fde2860122f870031d109da3",
      "bootstrap-id": "a0654657903fb68dff60f6e522a7f241c1bfbf0f"
    }

    Get Mesos version

    curl -s -H "Authorization: Bearer $(dcos config show core.dcos_acs_token)" \
      $(dcos config show core.dcos_url)/mesos/version

    Sample Output:

    {
      "build_date": "2018-02-27 21:31:27",
      "build_time": 1519767087.0,
      "build_user": "",
      "git_sha": "0ba40f86759307cefab1c8702724debe87007bb0",
      "version": "1.5.0"
    }

    Access DC/OS cluster exhibitor UI (Exhibitor supervises ZooKeeper and provides a management web interface)

    <CLUSTER_URL>/exhibitor
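
    Exhibitor also exposes a REST API behind the same path prefix; a sketch of querying its cluster-status endpoint with the usual auth header (the exhibitor_status_url helper is hypothetical, and the /cluster/status path is assumed from Exhibitor's control REST API):

    ```shell
    # Hypothetical helper: build the Exhibitor cluster-status URL.
    exhibitor_status_url() {
      echo "$1/exhibitor/exhibitor/v1/cluster/status"
    }

    # Against a live cluster:
    #   curl -s -H "Authorization: Bearer $(dcos config show core.dcos_acs_token)" \
    #        "$(exhibitor_status_url "$(dcos config show core.dcos_url)")"
    ```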

    Access DC/OS cluster data from cluster zookeeper using Zookeeper Python client – Run inside any node / container

    from kazoo.client import KazooClient
    
    zk = KazooClient(hosts='leader.mesos:2181', read_only=True)
    zk.start()
    
    clusterId = ""
    # Here we can give znode path to retrieve its decoded data,
    # for ex to get cluster-id, use
    # data, stat = zk.get("/cluster-id")
    # clusterId = data.decode("utf-8")
    
    # Get cluster Id
    if zk.exists("/cluster-id"):
        data, stat = zk.get("/cluster-id")
        clusterId = data.decode("utf-8")
    
    zk.stop()
    
    print (clusterId)

    Access DC/OS cluster data from cluster ZooKeeper using the Exhibitor REST API

    # Get znode data using endpoint :
    # /exhibitor/exhibitor/v1/explorer/node-data?key=/path/to/node
    # Example : Get znode data for path = /cluster-id
    curl -s -H "Authorization: Bearer $(dcos config show core.dcos_acs_token)" \
      $(dcos config show core.dcos_url)/exhibitor/exhibitor/v1/explorer/node-data?key=/cluster-id

    Sample Output:

    {
      "bytes": "3333-XXXXXX",
      "str": "abcde-abcde-abcde-abcde-abcde-",
      "stat": "XXXXXX"
    }

    Get cluster name using Mesos API

    curl -s -H "Authorization: Bearer $(dcos config show core.dcos_acs_token)" \
      $(dcos config show core.dcos_url)/mesos/state-summary | jq .cluster -r

    Mark Mesos node as decommissioned

    Sometimes the instances running as DC/OS nodes get terminated and cannot come back online; AWS EC2 instances, for example, cannot be started again once terminated. When Mesos detects that a node has stopped, it puts the node in the UNREACHABLE state, because Mesos does not know whether the node is temporarily down and will come back online or is permanently gone. In such cases, we can explicitly tell Mesos to put a node in the GONE state if we know it will not come back.

    dcos node decommission <mesos-agent-id>

    Conclusion

    We learned about Mesosphere DC/OS and its functionality and roles. We also learned how to set up and use the DC/OS CLI, how to use HTTP authentication to access the DC/OS APIs, and how to automate tasks with the CLI.

    We went through different API endpoints like Mesos, Marathon, DC/OS metrics, Exhibitor, and DC/OS cluster organization. Finally, we looked at different tricks and scripts to automate DC/OS, like fetching DC/OS node details, task exec, the Docker report, and DC/OS API HTTP authentication.

  • Build ML Pipelines at Scale with Kubeflow

    Setting up an ML stack requires many tools for analyzing data and training models in the ML pipeline. It is even harder to set up the same stack in multi-cloud environments. This is where Kubeflow comes into the picture, making it easy to develop, deploy, and manage ML pipelines.

    In this article, we are going to learn how to install Kubeflow on Kubernetes (GKE), train an ML model on Kubernetes, and publish the results. This introductory guide will be helpful for anyone who wants to understand how to use Kubernetes to run an ML pipeline in a simple, portable, and scalable way.

    Kubeflow Installation on GKE

    You can install Kubeflow onto any Kubernetes cluster no matter which cloud it is, but the cluster needs to fulfill the following minimum requirements:

    • 4 CPU
    • 50 GB storage
    • 12 GB memory

    The recommended Kubernetes version is 1.14 and above.

    You need to download kfctl from the Kubeflow website and untar the file:
    tar -xvf kfctl_v1.0.2_<platform>.tar.gz -C /home/velotio/kubeflow

    Also, install kustomize using these instructions.

    Start by exporting the following environment variables:

    export PATH=$PATH:/home/velotio/kubeflow/
    export KF_NAME=kubeml
    export BASE_DIR=/home/velotio/kubeflow/
    export KF_DIR=${BASE_DIR}/${KF_NAME}
    export CONFIG_URI="https://raw.githubusercontent.com/kubeflow/manifests/v1.0-branch/kfdef/kfctl_k8s_istio.v1.0.2.yaml"

    After we’ve exported these variables, we can build the deployment configuration and customize it according to our needs. Run the following commands:

    mkdir -p ${KF_DIR}
    cd ${KF_DIR}
    kfctl build -V -f ${CONFIG_URI}

    This will download the file kfctl_k8s_istio.v1.0.2.yaml and a kustomize folder. If you want to expose the UI with LoadBalancer, change the file $KF_DIR/kustomize/istio-install/base/istio-noauth.yaml and edit the service istio-ingressgateway from NodePort to LoadBalancer.

    Now, you can install KubeFlow using the following commands:

    export CONFIG_FILE=${KF_DIR}/kfctl_k8s_istio.v1.0.2.yaml
    kfctl apply -V -f ${CONFIG_FILE}

    This will install a bunch of services that are required to run the ML workflows.

    Once successfully deployed, you can access the Kubeflow UI dashboard on the istio-ingressgateway service. You can find the IP using the following command:

    kubectl get svc istio-ingressgateway -n istio-system -o jsonpath={.status.loadBalancer.ingress[0].ip}
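
    On a fresh cluster the LoadBalancer IP can take a few minutes to be assigned; a small polling sketch around the same kubectl call (the wait_for_kubeflow_ip helper is hypothetical):

    ```shell
    # Poll until istio-ingressgateway has an external IP, then print the dashboard URL.
    wait_for_kubeflow_ip() {
      local ip=""
      while [ -z "$ip" ]; do
        ip=$(kubectl get svc istio-ingressgateway -n istio-system \
             -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
        [ -z "$ip" ] && sleep 5
      done
      echo "Kubeflow UI: http://${ip}/"
    }
    ```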

    ML Workflow

    Developing your ML application consists of several stages:

    1. Gathering and analyzing data
    2. Researching the model for the type of data collected
    3. Training and testing the model
    4. Tuning the model
    5. Deploying the model

    These are the stages of any ML problem you’re trying to solve, but where does Kubeflow fit into this model?

    Kubeflow provides its own pipelines to solve this problem. A Kubeflow pipeline consists of the ML workflow description, the different stages of the workflow, and how they combine in the form of a graph.

    Kubeflow provides the ability to run your ML pipeline on any hardware, be it your laptop, the cloud, or a multi-cloud environment. Wherever you can run Kubernetes, you can run your ML pipeline.

    Training your ML Model on Kubeflow

    Once you’ve deployed Kubeflow in the first step, you should be able to access the Kubeflow UI, which would look like:

    The first step is to upload your pipeline. However, to do that, you first need to prepare the pipeline. We are going to use a financial time-series dataset and train our model. You can find the example code here:

    git clone https://github.com/kubeflow/examples.git
    cd examples/financial_time_series/
    export TRAIN_PATH=gcr.io/<project>/<image-name>/cpu:v1
    gcloud builds submit --tag $TRAIN_PATH .

    The command above builds the Docker image, and next we will create the bucket to store our data and model artifacts.

    # create storage bucket that will be used to store models
    BUCKET_NAME=<your-bucket-name>
    gsutil mb gs://$BUCKET_NAME/

    Once we have our image ready on the GCR repo, we can start our training job on Kubernetes. Please have a look at the tfjob resource in CPU/tfjob1.yaml and update the image and bucket reference.

    kubectl apply -f CPU/tfjob1.yaml
    POD_NAME=$(kubectl get pods -n kubeflow --selector=tf-job-name=tfjob-flat \
          --template '{{range .items}}{{.metadata.name}}{{"\n"}}{{end}}')
    kubectl logs -f $POD_NAME -n kubeflow

    Kubeflow Pipelines needs our pipeline compiled into its domain-specific language (DSL). We can compile our Python 3 file with a tool called dsl-compile, which comes with the Python 3 SDK. So, first, install that SDK:

    pip3 install python-dateutil kfp==0.1.36

    Next, inspect ml_pipeline.py and update it with the CPU image path that you built in the previous steps. Then, compile the DSL using:

    python3 ml_pipeline.py

    Now, a file ml_pipeline.py.tar.gz is generated, which we can upload to the Kubeflow Pipelines UI.

    Once the pipeline is uploaded, you can see the stages in a graph-like format.

    Next, we can click on the pipeline and create a run. For each run, you need to specify the params that you want to use. When the pipeline is running, you can inspect the logs:

    Run Jupyter Notebook in your ML Pipeline

    You can also interactively define your pipeline from the Jupyter notebook:

    1. Navigate to the Notebook Servers through the Kubeflow UI

    2. Select the namespace and click on “new server.”

    3. Give the server a name and provide the Docker image for the TensorFlow version on which you want to train your model. I took the TensorFlow 1.15 image.

    4. Once a notebook server is available, click on “connect” to connect to the server.

    5. This will open up a new window and a Jupyter terminal.

    6. Input the following command: pip install -U kfp.

    7. Download the notebook using following command: 

    curl -O https://raw.githubusercontent.com/kubeflow/examples/master/github_issue_summarization/pipelines/example_pipelines/pipelines-notebook.ipynb

    8. Now that you have the notebook, you can replace the environment variables like WORKING_DIR, PROJECT_NAME and GITHUB_TOKEN. Once you do that, you can run the notebook step-by-step (one cell at a time) by pressing shift+enter, or you can run the whole notebook by clicking on the menu and choosing the run-all option.

    Conclusion

    The ML world has its own challenges: the environments are tightly coupled, and the tools needed to build an ML stack were extremely hard to set up and configure. This becomes even harder in production environments, because you have to be extremely cautious not to break the components that are already present.

    Kubeflow makes getting started on ML highly accessible. You can run your ML workflows anywhere you can run Kubernetes. Kubeflow makes it possible to run your ML stack in multi-cloud environments, which enables ML engineers to train their models easily, backed by the scalability of Kubernetes.

    Related Articles

    1. The Ultimate Beginner’s Guide to Jupyter Notebooks
    2. Demystifying High Availability in Kubernetes Using Kubeadm
  • Monitoring a Docker Container with Elasticsearch, Kibana, and Metricbeat

    Since you are on this page, you have probably already started using Docker to deploy your applications and are enjoying it compared to virtual machines, because it is lightweight, easy to deploy, and has exceptional security management features.

    And, once the applications are deployed, monitoring your containers and tracking their activities in real time is essential. Imagine a scenario where you are managing one or many virtual machines; your pre-configured session handles everything, including monitoring. If you face any problems during production, then, with a handful of commands such as top, htop, and iotop, and flags like -o, %CPU, and %MEM, you are good to troubleshoot the issue.

    On the other hand, consider a scenario where the same workload is spread across 100-200 containers. You will need to see all activity in one place and query for information about what happened. This is where monitoring comes into the picture. We will discuss more of its benefits as we move further.

    This blog will cover Docker monitoring with Elasticsearch, Kibana, and Metricbeat. Basically, Elasticsearch is a platform that allows us to have distributed search and analysis of data in real-time along with visualization. We’ll be discussing how all these work interdependently as we move ahead. Like Elasticsearch, Kibana is also open-source software. Kibana is an interface mainly used to visualize the data sent from Elasticsearch. Metricbeat is a lightweight shipper of collected metrics from your system to the desired target (Elasticsearch in this case). 

    What is Docker Monitoring?

    In simple terms, monitoring containers means keeping track of metrics such as CPU, memory, network, and disk I/O and analyzing them to ensure the performance of applications built on microservices, and to keep track of issues so that they can be solved more easily. Monitoring is vital for performance improvement and optimization and for finding the root cause (RCA) of various issues.

    There is a lot of software available for monitoring Docker containers, both open-source and proprietary, like Prometheus, AppOptics, Metricbeat, Datadog, Sumo Logic, etc.

    You can choose any of these based on convenience. 

    Why is Docker Monitoring needed?

    1. Monitoring helps detect and fix issues early to avoid a breakdown during production.
    2. New feature additions/updates can be implemented safely, as the entire application is monitored.
    3. Docker monitoring is beneficial for developers, IT pros, and enterprises as well.
    • For developers, Docker monitoring tracks bugs and helps to resolve them quickly along with enhancing security.
    • For IT pros, it helps with flexible integration of existing processes and enterprise systems and satisfies all the requirements.
    • For enterprises, it helps to build the application within a certified container within a secured ecosystem that runs smoothly. 

    Elasticsearch is a platform that allows us to have distributed search and analysis of data in real-time, along with visualization. Elasticsearch is free and open-source software. It goes well with a huge number of technologies, like Metricbeat, Kibana, etc. Let’s move onto the installation of Elasticsearch.

    Installation of Elasticsearch:

    Prerequisite: Elasticsearch is built in Java. So, make sure that your system at least has Java8 to run Elasticsearch.

    For installing Elasticsearch for your OS, please follow the steps at Installing Elasticsearch | Elasticsearch Reference [7.11].

    After installing, check the status of Elasticsearch by sending an HTTP request to port 9200 on localhost.

    http://localhost:9200/

    This will give you a response as below:
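
    The response is a short JSON document containing the node name, cluster name, and Elasticsearch version. For scripting, the _cluster/health endpoint is also handy; a minimal sketch, assuming Elasticsearch is listening on localhost:9200 (the es_status helper is hypothetical):

    ```shell
    # Print the cluster status (green / yellow / red) from the _cluster/health endpoint.
    es_status() {
      curl -s "http://${1:-localhost:9200}/_cluster/health" \
        | sed -n 's/.*"status":"\([a-z]*\)".*/\1/p'
    }

    # es_status               # queries localhost:9200
    # es_status es-host:9200  # or any other host:port
    ```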

    You can configure Elasticsearch by editing $ES_HOME/config/elasticsearch.yml 

    Learn more about configuring Elasticsearch here.

    Now, we are done with the Elasticsearch setup and are ready to move onto Kibana.

    Kibana:

    Like Elasticsearch, Kibana is also open-source software. Kibana is an interface mainly used to visualize the data sent from Elasticsearch. Kibana lets you query your data and generate numerous visuals as per your requirements, visualizing enormous amounts of data as line graphs, gauges, and all other kinds of charts.

    Let’s cover the installation steps of Kibana.

    Installing Kibana

    Prerequisites: 

    • Must have Java1.8+ installed 
    • Elasticsearch v1.4.4+
    • Web browser such as Chrome, Firefox

    For installing Kibana with respect to your OS, please follow the steps at Install Kibana | Kibana Guide [7.11]

    Kibana runs on default port number 5601. Just send an HTTP request to port 5601 on localhost with http://localhost:5601/ 

    You should land on the Kibana dashboard, and it is now ready to use:

    You can configure Kibana by editing $KIBANA_HOME/config/kibana.yml. For more about configuring Kibana, visit here.

    Let’s move onto the final part—setting up with Metricbeat.

    Metricbeat

    Metricbeat is a lightweight shipper that periodically collects metrics from your system.

    You can simply install Metricbeat on your systems or servers to periodically collect metrics from the OS and the services running on them. The collected metrics are shipped to the output you specify, e.g., Elasticsearch or Logstash.

    Installing Metricbeat

    For installing Metricbeat according to your OS, follow the steps at Install Metricbeat | Metricbeat Reference [7.11]

    As soon as we start the Metricbeat service, it sends Docker metrics to the Elasticsearch index, which can be confirmed by curling Elasticsearch indexes with the command:

    curl -XGET 'localhost:9200/_cat/indices?v&pretty'

    How Are They Internally Connected?

    We have now installed all three, and they are up and running. At the period configured in docker.yml, Metricbeat hits the Docker API and sends the Docker metrics to Elasticsearch. Those metrics are then available in different Elasticsearch indexes. As mentioned earlier, Kibana queries the data in Elasticsearch and visualizes it in the form of graphs. This is how all three are connected.
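
    The docker.yml mentioned here is Metricbeat's Docker module file (modules.d/docker.yml under the Metricbeat config directory in a default install). A sketch of writing a minimal version of it; the write_docker_module helper is hypothetical, and the metricsets shown are a subset of the module's defaults:

    ```shell
    # Hypothetical helper: write a minimal Docker module config for Metricbeat.
    # Default package path for the target file: /etc/metricbeat/modules.d/docker.yml
    write_docker_module() {
      cat > "$1" <<'EOF'
    - module: docker
      metricsets: ["container", "cpu", "memory", "network"]
      hosts: ["unix:///var/run/docker.sock"]
      period: 10s
      enabled: true
    EOF
    }

    # write_docker_module /etc/metricbeat/modules.d/docker.yml
    # metricbeat modules enable docker   # or enable the module file shipped with Metricbeat
    ```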

    Please refer to the flow chart for more clarification:

    How to Create Dashboards?

    Now that we are aware of how these three tools work interdependently, let’s create dashboards to monitor our containers and understand those. 

    First of all, open the Dashboards section on Kibana (localhost:5601/) and click the Create dashboard button:

     

    You will be directed to the next page:

    Choose the type of visualization you want from all options:

    For example, let’s go with Lens

    (Learn more about Kibana Lens)

    Here, we will be looking for the number of containers vs. timestamps by selecting the timestamp on X-axis and the unique count of docker.container.created on Y-axis.

    As soon as we have selected both parameters, it will generate a graph as shown in the snapshot, and we will get the count of created containers (here Count=1). If you create more containers on your system, the graph and the counter will be updated when that metric data is sent to Kibana. In this way, you can monitor how many containers are created over time. In a similar fashion, depending on your monitoring needs, you can choose a parameter from the left panel showing available fields like:

    activemq.broker.connections.count

    docker.container.status

docker.container.tags

    Now, we will show one more example of how to create a bar graph:

As mentioned above, to create a bar graph, just choose “vertical bar” from the options in the snapshot above. Here, I’m plotting the count of documents against metricset names such as network, filesystem, and cpu. As shown on the left of the snapshot, choose count as the Y-axis parameter and metricset.name as the X-axis parameter, as shown on the right side of the snapshot.

    After hitting enter, a graph will be generated: 

Similarly, you can try out multiple parameters with different types of graphs. Now, let’s move on to the most important and widely used monitoring tool for tracking warnings, errors, and more: DISCOVER.

    Discover for Monitoring:

Basically, Discover provides deep insight into your data and lets you apply searches and filters. With it, you can surface only the processes that are taking the most time, filter out errors by setting a message filter with a value of ERROR, check the health of a container, or check for logged-in users. Queries like these return exactly the results you want, much like SQL queries, making for effective container monitoring.

    [More about Discover here.]

To apply filters, just click “filter by type” in the left panel, and you will see all available filtering options. From there, select one per your requirements and view the results in the central panel.

    Similar to filter, you can choose fields to be shown on the dashboard from the left panel with “Selected fields” right below the filters. (Here, we have only selected info for Source.)

    Now, if you take a look at the top part of the snapshot, you will find the search bar. This is the most useful part of Discover for monitoring.

In that bar, you just need to enter a query, and the logs will be filtered accordingly. For example, I will enter a query for error messages equal to “No memory stats data available”.

When we hit the Update button on the right side, only logs containing that error message will be shown, highlighted for differentiation, as in the snapshot. All other logs are hidden. In this way, you can track a particular error and confirm it no longer occurs after fixing it.
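For reference, such search-bar queries use the Kibana Query Language (KQL). Two sketches follow; the field names are assumptions and depend on the fields your Metricbeat setup actually ships. The first matches documents whose message field contains that exact phrase; the second combines conditions with a hypothetical container name:

```
message : "No memory stats data available"
log.level : "error" and container.name : "my-app"
```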

In addition to structured queries, Discover also provides keyword search. If you enter a word like warning, error, memory, or user, it will return the logs containing that word, like “memory” in the snapshot:


Besides Kibana, we also receive logs in the terminal. For example, the following highlighted portion describes the state of your cluster. In the terminal, you can use a simple grep command to pull out the logs you need.
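As a sketch of that grep workflow, the snippet below filters for cluster-health lines; synthetic log lines stand in for real terminal output (the messages are illustrative, not actual Elasticsearch logs), and in practice you would pipe something like `docker logs <container>` through the same grep:

```shell
# Synthetic stand-ins for lines you would see in the Elasticsearch terminal.
printf '%s\n' \
  '[INFO ] Cluster health status changed from [YELLOW] to [GREEN]' \
  '[WARN ] No memory stats data available' \
  '[INFO ] started' \
  | grep 'Cluster health'
# prints only the [INFO ] Cluster health status line
```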

With this, you can monitor Docker containers using multiple queries, including nested queries in Discover. There are many different graphs you can try, depending on your requirements, to keep your application running smoothly.

    Conclusion

    Monitoring requires a lot of time and effort. What we have seen here is a drop in the ocean. For some next steps, try:

    1. Monitoring network
    2. Aggregating logs from your different applications
    3. Aggregating logs from multiple containers
    4. Setting up and monitoring alerts
    5. Nested queries for logs
  • ClickHouse – The Newest Data Store in Your Big Data Arsenal

    ClickHouse

ClickHouse is an open-source column-oriented data warehouse for online analytical processing (OLAP) of queries. It is fast, scalable, flexible, cost-efficient, and easy to run. It delivers industry-leading query performance while significantly reducing storage requirements through innovative use of columnar storage and compression.

ClickHouse’s performance exceeds that of comparable column-oriented database management systems on the market. ClickHouse is a database management system, not a single database: it allows creating tables and databases at runtime, loading data, and running queries without reconfiguring or restarting the server.
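To illustrate creating tables at runtime, everything below can be run against a live server in plain SQL with no restart; the database, table, and column names are made up for the sketch:

```sql
-- Create a database and a MergeTree table at runtime; no server restart needed.
CREATE DATABASE IF NOT EXISTS metrics;

CREATE TABLE IF NOT EXISTS metrics.page_views
(
    event_time DateTime,
    url        String,
    user_id    UInt64
)
ENGINE = MergeTree
ORDER BY (event_time, url);

-- Load data and query it immediately.
INSERT INTO metrics.page_views VALUES (now(), '/home', 42);

SELECT count() FROM metrics.page_views WHERE url = '/home';
```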

ClickHouse processes from hundreds of millions to over a billion rows of data on clusters of hundreds of nodes. It utilizes all available hardware to process queries as fast as possible. Peak processing performance for a single query stands at more than two terabytes per second.

    What makes ClickHouse unique?

• Data Storage & Compression: ClickHouse is designed to work on regular hard drives but uses SSDs and additional RAM when available. Data compression plays a crucial role in its performance: it provides general-purpose compression codecs as well as specialized codecs for specific kinds of data. These codecs trade CPU consumption against disk space and help ClickHouse outperform other databases.
• High Performance: Using vectorized computation, data is processed in vectors (chunks of columns), achieving high CPU efficiency. ClickHouse supports parallel processing across multiple cores, so large queries are parallelized naturally. It also supports distributed query processing: data resides across shards, which are used for parallel execution of a query.
• Primary & Secondary Indexes: Data is physically sorted by the primary key, allowing low-latency extraction of specific values or ranges. Secondary indexes in ClickHouse let the database determine that a query’s filtering conditions allow some data parts to be skipped entirely; they are therefore also called data-skipping indexes.
• Support for Approximate Calculations: ClickHouse can trade accuracy for performance through approximate calculations. It provides aggregate functions for approximate estimates of the number of distinct values, medians, and quantiles, and it can run queries on a sample of the data, retrieving proportionally less data from disk to produce approximate results.
• Data Replication and Data Integrity Support: Data can be written to any available replica, and the remaining replicas retrieve their copy in the background; the system keeps identical data on all of them. Most failures are recovered automatically, or semi-automatically in complex scenarios.
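To make the approximate-calculation trade-off concrete, ClickHouse ships approximate aggregate functions such as uniq and quantile alongside exact counterparts; a sketch over a hypothetical table:

```sql
-- Approximate vs. exact distinct counts and quantiles (hypothetical table).
SELECT
    uniq(user_id)        AS approx_users,   -- fast, approximate distinct count
    uniqExact(user_id)   AS exact_users,    -- slower, exact distinct count
    quantile(0.5)(bytes) AS approx_median   -- approximate median of bytes
FROM access_log;
```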

But it can’t be all good, can it? There are some disadvantages to ClickHouse as well:

    • No full-fledged transactions.
• Inability to efficiently and precisely change or remove previously inserted data. For example, to comply with GDPR, data can be cleaned up or modified only through batch deletes and updates.
    • ClickHouse is less efficient for point queries that retrieve individual rows by their keys due to the sparse index.

    ClickHouse against its contemporaries

So, with all these distinctive features, how does ClickHouse compare with other industry-leading data storage tools? Being general-purpose, ClickHouse has a variety of use cases, each with its pros and cons, so here is a high-level comparison against the best tools in their respective domains. Each tool has unique traits depending on the use case, and a comparison around those would not be fair; what we care about most is performance, scalability, cost, and other key attributes that can be compared irrespective of the domain. So here we go:

    ClickHouse vs Snowflake:

    • With its decoupled storage & compute approach, Snowflake is able to segregate workloads and enhance performance. The search optimization service in Snowflake further enhances the performance for point lookups but has additional costs attached with it. ClickHouse, on the other hand, with local runtime and inherent support for multiple forms of indexing, drastically improves query performance.
• Regarding scalability, ClickHouse being on-prem makes it slightly more challenging to scale than cloud-based Snowflake. Managing hardware manually by provisioning clusters and migrating is doable but tedious. One way to tackle this is to deploy ClickHouse in the cloud, which is cheaper and, frankly, the most viable option.

    ClickHouse vs Redshift:

• Redshift is a managed, scalable cloud data warehouse offering both provisioned and serverless options. Its RA3 nodes scale compute independently and cache the necessary data. Even so, it does not isolate different workloads running on the same data, putting it at the lower end of decoupled compute & storage cloud architectures. ClickHouse’s local runtime, by contrast, is one of the fastest.
• Both Redshift and ClickHouse are columnar and sort data, allowing reads of only the specific data needed. Deploying ClickHouse is cheaper, though; and although Redshift is tailored to be a ready-to-use tool, ClickHouse is the better choice if you do not depend on Redshift’s built-in configuration, backup, and monitoring features.

    ClickHouse vs InfluxDB:

• InfluxDB, an open-source NoSQL database written in Go, is one of the most popular choices for time-series data and analysis. Despite being a general-purpose analytical DB, ClickHouse offers competitive write performance.
• ClickHouse’s data structures, such as AggregatingMergeTree, allow real-time data to be stored in a pre-aggregated format, which puts it on par with TSDBs in performance. It is significantly faster for heavy queries and comparable for light ones.

    ClickHouse vs PostgreSQL:

• Postgres is another very versatile DB and is widely used for various use cases, just like ClickHouse. Postgres, however, is an OLTP DB; unlike ClickHouse, analytics is not its primary aim, though it is still used for analytics to a certain extent.
• For transactional data, ClickHouse’s columnar nature puts it below Postgres. But for analytical workloads, even after tuning Postgres to its maximum potential (e.g., with materialized views, indexing, cache size, and buffers), ClickHouse stays ahead.

    ClickHouse vs Apache Druid:

    • Apache Druid is an open-source data store that is primarily used for OLAP. Both Druid & ClickHouse are very similar in terms of their approaches and use cases but differ in terms of their architecture. Druid is mainly used for real-time analytics with heavy ingestions and high uptime.
• Unlike Druid, ClickHouse has a much simpler deployment: it can run on a single server, while a Druid setup needs multiple types of nodes (master, broker, ingestion, etc.). ClickHouse, with its SQL-like query language, also provides better flexibility and is more performant when the deployment is small.

    To summarize the differences between ClickHouse and other data warehouses:

    ClickHouse Engines

Depending on the type of your table (internal or external), ClickHouse provides an array of engines that connect to different data stores and determine how data is stored and accessed, as well as other interactions with it.

    These engines are mainly categorized into two types:

    Database Engines:

    These allow us to work with different databases & tables.
ClickHouse uses the Atomic database engine by default and provides configurable engines for working with external databases; the popular ones include PostgreSQL, MySQL, and so on.

    Table Engines:

These determine:

    • how and where data is stored
    • where to read/write it from/to
    • which queries it supports
    • use of indexes
    • concurrent data access and so on.

    These engines are further classified into families based on the above parameters:

    MergeTree Engines:

These are the most universal and functional table engines for high-load tasks. Engines of this family support quick data insertion with subsequent background data processing, as well as data replication, partitioning, secondary data-skipping indexes, and other features. Some of the popular engines in this family are:

    • MergeTree
    • SummingMergeTree
    • AggregatingMergeTree

MergeTree engines, with indexing and partitioning support, allow data to be processed at tremendous speed. They can also back materialized views that store aggregated data, further improving performance.
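As a sketch of that materialized-view pattern, the view below keeps a running per-day count backed by a SummingMergeTree engine; the table and column names are hypothetical:

```sql
-- Pre-aggregate page views per day as rows arrive (hypothetical schema).
CREATE MATERIALIZED VIEW daily_views
ENGINE = SummingMergeTree
ORDER BY day
AS SELECT
    toDate(event_time) AS day,
    count()            AS views
FROM page_views
GROUP BY day;
```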

    Log Engines:

    These are lightweight engines with minimum functionality. These work the best when the requirement is to quickly write into many small tables and read them later as a whole. This family consists of:

    • Log
    • StripeLog
    • TinyLog

    These engines append data to the disk in a sequential fashion and support concurrent reading. They do not support indexing, updating, or deleting and hence are only useful when the data is small, sequential, and immutable.

    Integration Engines:

These are used for communicating with other data storage and processing systems. The family supports:

    • JDBC
    • MongoDB
    • HDFS
    • S3
    • Kafka and so on.

Using these engines, we can import and export data from external sources. With the Kafka engine we can ingest data directly from a topic into a ClickHouse table, and with the S3 engine we can work directly with S3 objects.
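As a sketch of the Kafka-ingestion pattern, a Kafka-engine table consumes the topic while a materialized view streams its rows into a durable MergeTree table; the broker address, topic, and schema below are assumptions:

```sql
-- Kafka-engine table that consumes a topic (hypothetical broker/topic).
CREATE TABLE events_queue
(
    event_time DateTime,
    payload    String
)
ENGINE = Kafka
SETTINGS kafka_broker_list = 'localhost:9092',
         kafka_topic_list  = 'events',
         kafka_group_name  = 'clickhouse_consumer',
         kafka_format      = 'JSONEachRow';

-- Durable MergeTree table for the ingested rows.
CREATE TABLE events
(
    event_time DateTime,
    payload    String
)
ENGINE = MergeTree
ORDER BY event_time;

-- Materialized view that moves rows from the queue into storage.
CREATE MATERIALIZED VIEW events_mv TO events AS
SELECT event_time, payload FROM events_queue;
```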

    Special Engines:

    ClickHouse offers some special engines that are specific to the use case. For example:

    • MaterializedView
    • Distributed
    • Merge
    • File and so on.

These special engines each have their own quirks; e.g., with File we can export data to a file, update the table’s data by updating the file, and so on.

    Summary

We learned that ClickHouse is a very powerful and versatile tool: one with stellar performance that is feature-packed, very cost-efficient, and open-source. We saw a high-level comparison of ClickHouse with some of the best choices across an array of use cases. Although it ultimately comes down to how specific and intense your use case is, ClickHouse’s generic nature measures up pretty well on many occasions.

ClickHouse’s applicability in web analytics, network management, log analysis, time-series analysis, asset valuation in financial markets, and security threat identification makes it tremendously versatile. Consistently solving business problems with low-latency responses over petabytes of data, ClickHouse is indeed one of the fastest data warehouses out there.

    Further Readings

  • Olli Salumeria Saved 80% on Disaster Recovery using AWS services and achieved better reliability, availability, and security

    R Systems helped Olli set up and configure Disaster Recovery using CloudEndure (AWS DR service) to ensure real-time, asynchronous, block-level replication from on-premises to AWS.