Blog

Hear from Sema’s Founder & CEO, Matt Van Itallie on the Golden Rules of Code Reviews
We had a wonderful time talking to Matt about Sema, a much-awaited online code review and developer portfolio tool, and how Velotio helped in building it. He also talks about the basics of writing more meaningful code reviews and his team’s experience of working with Velotio.

Spradha: Talk us through your entrepreneurial journey, and what made you build Sema?

Matt: I learned to code from my parents, who were both programmers, and then worked on the organizational side of technology. I decided to start Sema because of the challenges I saw with lower code quality and companies not managing the engineers’ careers the right way. And so, I wanted to build a company to help solve that.

At Sema, we are a 50-person team with engineers from all over the globe. Being a global team is one of my favorite things at Sema because I get to learn so much from people with so many different backgrounds. We have been around since 2017.

We are building products to help improve code quality and to help engineers improve their careers and their knowledge. We think that code is a craft. It’s not a competition. And so, to build tools for engineers to improve the code and skills must begin with treating code as a craft.

Spradha: What advice do you have for developers when it comes to code reviews?

Matt: Code reviews are a great way to improve the quality of code and help developers improve their skills. Anyone can benefit from doing code reviews, as well as, of course, receiving code reviews. Even junior developers reviewing the code of seniors can provide meaningful feedback. They can sometimes teach the seniors while having meaningful learning moments as a reviewer themselves.

And so, code reviews are incredibly important. There are six tips I would share.
- Treat code reviews as part of coding and prioritize it
It might be obvious, but developers and the engineering team, and the entire company should treat code reviews as part of coding, not as something to do when devs have time. The reason is that an incomplete code review is a blocker for another engineer. So, the faster we can get high-quality code reviews, the faster other engineers can progress.
- Always remember – it’s a human being on the other end of the review
Code reviews should be clear, concise, and communicated the right way. It’s also important to deliver the message with empathy. You can always consider reading your code review out loud and asking yourself, “Is this something I would want to be said to me?” If not, change the tone or content.
- Give clear recommendations and suggestions
Never tell someone that the code needs to be fixed without giving suggestions or recommendations on what to fix or how to fix it.
- Always assume good intent
Code may not be written how you would write it. Let’s say that more clearly: code is rarely written the same way by two different people. After all, code is a craft, not a task on an assembly line. Tap into a sense of curiosity and appreciation while reviewing – curiosity to understand what the reviewer had in mind and gratitude for what the coder did or tried to do.
- Clarify the action and the level of importance
If you are making an optional suggestion, for example, a “nit” that isn’t necessary before the code is approved for production, say so clearly.
- Don’t forget that code feedback – and all feedback – includes praise.
It goes without saying that a benefit of doing code reviews is to make the code better and fix issues. But that’s only half of it. On the flip side, code reviews present an excellent opportunity to appreciate your colleagues’ work. If someone has written particularly elegant or maintainable code or has made a great decision about using a library, let them know!

It’s always the right time to share positive feedback.

Spradha: How has Velotio helped you build the code review tool at Sema?

Matt: We’ve been working with Velotio for over 18 months. We have several amazing colleagues from Velotio, including software developers, DevOps engineers, and product managers.

Our Velotio colleagues have been instrumental in building our new product, a free code review assistant and developer portfolio tool. It includes a Chrome extension that makes code reviews more clear, more robust, and reusable. The information is available in dashboards for further future exploration too.

Developers can now create portfolios of their work that goes far way beyond a traditional developer portfolio. It is based on what other reviewers have said about your code and what you have said as a reviewer on other people’s code. It allows you to really tell a clear story about what you have worked on and how you have grown.

Spradha: How has your experience been working with Velotio?

Matt: We have had an extraordinary experience working with the Velotio team at Sema. We consider our colleagues from Velotio as core team members and leaders of our organization. I have been so impressed with the quality, the knowledge, the energy, and the commitment that Velotio colleagues have shown. And we would not have been able to achieve as much as we have without their contribution.

I can think of three crucial moments in particular when talking about the impact our Velotio colleagues have made. First, one of their engineers played a major role in designing and building the Chrome extension that we use.

Secondly, a DevOps engineer from Velotio has radically improved the setup, reliability, and ease of use of our DevOps systems.

And third, a product manager from Velotio has been an extraordinary project leader for a critical feature of the Sema tool, a library of over 20,000 best practices that coders can insert into code reviews, which saves time and helps make the code reviews more robust.

We literally would not be where we are if it was not for the great work of our colleagues from Velotio.

Spradha: How can people learn more about Sema?

Matt: You can visit our website at www.semasoftware.com. And for those interested in using our tool to help with code reviews, whether it’s a commercial project or an open-source project, you can sign up for free.
December 12, 2022
The Ultimate Beginner’s Guide to Jupyter Notebooks
Jupyter Notebooks offer a great way to write and iterate on your Python code. It is a powerful tool for developing data science projects in an interactive way. Jupyter Notebook allows to showcase the source code and its corresponding output at a single place helping combine narrative text, visualizations and other rich media.The intuitive workflow promotes iterative and rapid development, making notebooks the first choice for data scientists. Creating Jupyter Notebooks is completely free as it falls under Project Jupyter which is completely open source.

Project Jupyter is the successor to an earlier project IPython Notebook, which was first published as a prototype in 2010. Jupyter Notebook is built on top of iPython, an interactive tool for executing Python code in the terminal using REPL model(Read-Eval-Print-Loop). The iPython kernel executes the python code and communicates with the Jupyter Notebook front-end interface. Jupyter Notebooks also provide additional features like storing your code and output and keep the markdown by extending iPython.

Although Jupyter Notebooks support using various programming languages, we will focus on Python and its application in this article.

Getting Started with Jupyter Notebooks!

Installation

Prerequisites

As you would have surmised from the above abstract we need to have Python installed on your machine. Either Python 2.7 or Python 3.+ will do.

Install Using Anaconda

The simplest way to get started with Jupyter Notebooks is by installing it using Anaconda. Anaconda installs both Python3 and Jupyter and also includes quite a lot of packages commonly used in the data science and machine learning community. You can follow the latest guidelines from here.

Install Using Pip

If, for some reason, you decide not to use Anaconda, then you can install Jupyter manually using Python pip package, just follow the below code:
```
pip install jupyter
```
Launching First Notebook

Open your terminal, navigate to the directory where you would like to store you notebook and launch the Jupyter Notebooks. Then type the below command and the program will instantiate a local server at http://localhost:8888/tree.
```
jupyter notebook
```
A new window with the Jupyter Notebook interface will open in your internet browser. As you might have already noticed Jupyter starts up a local Python server to serve web apps in your browser, where you can access the Dashboard and work with the Jupyter Notebooks. The Jupyter Notebooks are platform independent which makes it easier to collaborate and share with others.

The list of all files is displayed under the Files tab whereas all the running processes can be viewed by clicking on the Running tab and the third tab, Clusters is extended from IPython parallel, IPython’s parallel computing framework. It helps you to control multiple engines, extended from the IPython kernel.

Let’s start by making a new notebook. We can easily do this by clicking on the New drop-down list in the top- right corner of the dashboard. You see that you have the option to make a Python 3 notebook as well as regular text file, a folder, and a terminal. Please select the Python 3 notebook option.

Your Jupyter Notebook will open in a new tab as shown in below image.

Now each notebook is opened in a new tab so that you can simultaneously work with multiple notebooks. If you go back to the dashboard tab, you will see the new file Untitled.ipynb and you should see some green icon to it’s left which indicates your new notebook is running.

Why a .ipynb file?

.ipynb is the standard file format for storing Jupyter Notebooks, hence the file name Untitled.ipynb. Let’s begin by first understanding what an .ipynb file is and what it might contain. Each .ipynb file is a text file that describes the content of your notebook in a JSON format. The content of each cell, whether it is text, code or image attachments that have been converted into strings, along with some additional metadata is stored in the .ipynb file. You can also edit the metadata by selecting “Edit > Edit Notebook Metadata” from the menu options in the notebook.

You can also view the content of your notebook files by selecting “Edit” from the controls on the dashboard, there’s no reason to do so unless you really want to edit the file manually.

Understanding the Notebook Interface

Now that you have created a notebook, let’s have a look at the various menu options and functions, which are readily available. Take some time out to scroll through the the list of commands that opens up when you click on the keyboard icon (or press Ctrl + Shift + P).

There are two prominent terminologies that you should care to learn about: cells and kernels are key both to understanding Jupyter and to what makes it more than just a content writing tool. Fortunately, these concepts are not difficult to understand.
- A kernel is a program that interprets and executes the user’s code. The Jupyter Notebook App has an inbuilt kernel for Python code, but there are also kernels available for other programming languages.
- A cell is a container which holds the executable code or normal text
Cells

Cells form the body of a notebook. If you look at the screenshot above for a new notebook (Untitled.ipynb), the text box with the green border is an empty cell. There are 4 types of cells:
- Code – This is where you type your code and when executed the kernel will display its output below the cell.
- Markdown – This is where you type your text formatted using Markdown and the output is displayed in place when it is run.
- Raw NBConvert – It’s a command line tool to convert your notebook into another format (like HTML, PDF etc.)
- Heading – This is where you add Headings to separate sections and make your notebook look tidy and neat. This has now been merged into the Markdown option itself. Adding a ‘#’ at the beginning ensures that whatever you type after that will be taken as a heading.
Let’s test out how the cells work with a basic “hello world” example. Type print(‘Hello World!’) in the cell and press Ctrl + Enter or click on the Run button in the toolbar at the top.
```
print("Hello World!")
```
Hello World!

When you run the cell, the output will be displayed below, and the label to its left changes from In[ ] to In[1] . Moreover, to signify that the cell is still running, Jupyter changes the label to In[*]

Additionally, it is important to note that the output of a code cell comes from any of the print statements in the code cell, as well as the value of the last line in the cell, irrespective of it being a variable, function call or some other code snippet.

Markdown

Markdown is a lightweight, markup language for formatting plain text. Its syntax has a one-to-one correspondence with HTML tags. As this article has been written in a Jupyter notebook, all of the narrative text and images you can see, are written in Markdown. Let’s go through the basics with the following example.
# This is a level 1 heading ### This is a level 3 heading This is how you write some plain text that would form a paragraph. You can emphasize the text by enclosing the text like "**" or "__" to make it bold and enclosing the text in "*" or "_" to make it italic. Paragraphs are separated by an empty line. * We can include lists. * And also indent them. 1. Getting Numbered lists is 2. Also easy. [To include hyperlinks enclose the text with square braces and then add the link url in round braces](http://www.example.com) Inline code uses single backticks: `foo()`, and code blocks use triple backticks: ``` foo() ``` Or can be indented by 4 spaces: foo() And finally, adding images is easy: ![Online Image](https://www.example.com/image.jpg) or ![Local Image](img/image.jpg) or ![Image Attachment](attachment:image.jpg)
```
# This is a level 1 heading 
### This is a level 3 heading
This is how you write some plain text that would form a paragraph.
You can emphasize the text by enclosing the text like "**" or "__" to make it bold and enclosing the text in "*" or "_" to make it italic. 
Paragraphs are separated by an empty line.
* We can include lists.
  * And also indent them.

1. Getting Numbered lists is
2. Also easy.

[To include hyperlinks enclose the text with square braces and then add the link url in round braces](http://www.example.com)

Inline code uses single backticks: `foo()`, and code blocks use triple backticks:

``` 
foo()
```

Or can be indented by 4 spaces: 

    foo()
    
And finally, adding images is easy: ![Online Image](https://www.example.com/image.jpg) or ![Local Image](img/image.jpg) or ![Image Attachment](attachment:image.jpg)
```
We have 3 different ways to attach images
- Link the URL of an image from the web.
- Use relative path of an image present locally
- Add an attachment to the notebook by using “Edit>Insert Image” option; This method converts the image into a string and store it inside your notebook
Note that adding an image as an attachment will make the .ipynb file much larger because it is stored inside the notebook in a string format.

There are a lot more features available in Markdown. To learn more about markdown, you can refer to the official guide from the creator, John Gruber, on his website.

Kernels

Every notebook runs on top of a kernel. Whenever you execute a code cell, the content of the cell is executed within the kernel and any output is returned back to the cell for display. The kernel’s state applies to the document as a whole and not individual cells and is persisted over time.

For example, if you declare a variable or import some libraries in a cell, they will be accessible in other cells. Now let’s understand this with the help of an example. First we’ll import a Python package and then define a function.
```
import os, binascii
def sum(x,y):
  return x+y
```
Once the cell above is executed, we can reference os, binascii and sum in any other cell.
```
rand_hex_string = binascii.b2a_hex(os.urandom(15)) 
print(rand_hex_string)
x = 1
y = 2
z = sum(x,y)
print('Sum of %d and %d is %d' % (x, y, z))
```
The output should look something like this:
```
c84766ca4a3ce52c3602bbf02a
d1f7 Sum of 1 and 2 is 3
```
The execution flow of a notebook is generally from top-to-bottom, but it’s common to go back to make changes. The order of execution is shown to the left of each cell, such as In [2] , will let you know whether any of your cells have stale output. Additionally, there are multiple options in the Kernel menu which often come very handy.
- Restart: restarts the kernel, thus clearing all the variables etc that were defined.
- Restart & Clear Output: same as above but will also wipe the output displayed below your code cells.
- Restart & Run All: same as above but will also run all your cells in order from top-to-bottom.
- Interrupt: If your kernel is ever stuck on a computation and you wish to stop it, you can choose the Interrupt option.
Naming Your Notebooks

It is always a best practice to give a meaningful name to your notebooks. You can rename your notebooks from the notebook app itself by double-clicking on the existing name at the top left corner. You can also use the dashboard or the file browser to rename the notebook file. We’ll head back to the dashboard to rename the file we created earlier, which will have the default notebook file name Untitled.ipynb.

Now that you are back on the dashboard, you can simply select your notebook and click “Rename” in the dashboard controls

Shutting Down your Notebooks

We can shutdown a running notebook by selecting “File > Close and Halt” from the notebook menu. However, we can also shutdown the kernel either by selecting the notebook in the dashboard and clicking “Shutdown” or by going to “Kernel > Shutdown” from within the notebook app (see images below).

Shutdown the kernel from Notebook App:

Shutdown the kernel from Dashboard:

Sharing Your Notebooks

When we talk about sharing a notebook, there are two things that might come to our mind. In most cases, we would want to share the end-result of the work, i.e. sharing non-interactive, pre-rendered version of the notebook, very much similar to this article; however, in some cases we might want to share the code and collaborate with others on notebooks with the aid of version control systems such as Git which is also possible.

Before You Start Sharing

The state of the shared notebook including the output of any code cells is maintained when exported to a file. Hence, to ensure that the notebook is share-ready, we should follow below steps before sharing.
1. Click “Cell > All Output > Clear”
2. Click “Kernel > Restart & Run All”
3. After the code cells have finished executing, validate the output.
This ensures that your notebooks don’t have a stale state or contain intermediary output.

Exporting Your Notebooks

Jupyter has built-in support for exporting to HTML, Markdown and PDF as well as several other formats, which you can find from the menu under “File > Download as” . It is a very convenient way to share the results with others. But if sharing exported files isn’t suitable for you, there are some other popular methods of sharing the notebooks directly on the web.
- GitHub
- With home to over 2 million notebooks, GitHub is the most popular place for sharing Jupyter projects with the world. GitHub has integrated support for rendering .ipynb files directly both in repositories and gists on its website.
- You can just follow the GitHub guides for you to get started on your own.
- Nbviewer
- NBViewer is one of the most prominent notebook renderers on the web.
- It also renders your notebook from GitHub and other such code storage platforms and provide a shareable URL along with it. nbviewer.jupyter.org provides a free rendering service as part of Project Jupyter.
Data Analysis in a Jupyter Notebook

Now that we’ve looked at what a Jupyter Notebook is, it’s time to look at how they’re used in practice, which should give you a clearer understanding of why they are so popular. As we walk through the sample analysis, you will be able to see how the flow of a notebook makes the task intuitive to work through ourselves, as well as for others to understand when we share it with them. We also hope to learn some of the more advanced features of Jupyter notebooks along the way. So let’s get started, shall we?

Analyzing the Revenue and Profit Trends of Fortune 500 US companies from 1955-2013

So, let’s say you’ve been tasked with finding out how the revenues and profits of the largest companies in the US changed historically over the past 60 years. We shall begin by gathering the data to analyze.

Gathering the DataSet

The data set that we will be using to analyze the revenue and profit trends of fortune 500 companies has been sourced from Fortune 500 Archives and Top Foreign Stocks. For your ease we have compiled the data from both the sources and created a CSV for you.

Importing the Required Dependencies

Let’s start off with a code cell specifically for imports and initial setup, so that if we need to add or change anything at a later point in time, we can simply edit and re-run the cell without having to change the other cells. We can start by importing Pandas to work with our data, Matplotlib to plot the charts and Seaborn to make our charts prettier.
```
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sys
```
Set the design styles for the charts
```
sns.set(style="darkgrid")
```
Load the Input Data to be Analyzed

As we plan on using pandas to aid in our analysis, let’s begin by importing our input data set into the most widely used pandas data-structure, DataFrame.
```
df = pd.read_csv('../data/fortune500_1955_2013.csv')
```
Now that we are done loading our input dataset, let us see how it looks like!
```
df.head()
```
Looking good. Each row corresponds to a single company per year and all the columns we need are present.

Exploring the Dataset

Next, let’s begin by exploring our data set. We will primarily look into the number of records imported and the data types for each of the different columns that were imported.

As we have 500 data points per year and since the data set has records between 1955 and 2012, the total number of records in the dataset looks good!

Now, let’s move on to the individual data types for each of the column.
```
df.columns = ['year', 'rank', 'company', 'revenue', 'profit']
len(df)
```
```
df.dtypes
```
As we can see from the output of the above command the data types for the columns revenue and profit are being shown as object whereas the expected data type should be float. It indicates that there may be some non-numeric values in the revenue and profit columns.

So let’s first look at the details of imported values for revenue.
```
non_numeric_revenues = df.revenue.str.contains('[^0-9.-]')
df.loc[non_numeric_revenues].head()
```
```
print("Number of Non-numeric revenue values: ", len(df.loc[non_numeric_revenues]))
```
```
Number of Non-numeric revenue values:	1
```
```
print("List of distinct Non-numeric revenue values: ", set(df.revenue[non_numeric_revenues]))
```
```
List of distinct Non-numeric revenue values:	{'N.A.'}
```
As the number of non-numeric revenue values is considerably less compared to the total size of our data set. Hence, it would be easier to just remove those rows.
```
df = df.loc[~non_numeric_revenues]
df.revenue = df.revenue.apply(pd.to_numeric)
eval(In[6])
```
Now that the data type issue for column revenue is resolved, let’s move on to values in column profit.
```
non_numeric_profits = df.profit.str.contains('[^0-9.-]')
df.loc[non_numeric_profits].head()
```
```
print("Number of Non-numeric profit values: ", len(df.loc[non_numeric_profits]))
```
```
Number of Non-numeric profit values:	374
```
```
print("List of distinct Non-numeric profit values: ", set(df.profit[non_numeric_profits]))
```
```
List of distinct Non-numeric profit values:	{'N.A.'}
```
As the number of non-numeric profit values is around 1.5% which is a small percentage of our data set, but not completely inconsequential. Let’s take a quick look at the distribution of values and if the rows having N.A. values are uniformly distributed over the years then it would be wise to just remove the rows with missing values.
```
bin_sizes, _, _ = plt.hist(df.year[non_numeric_profits], bins=range(1955, 2013))
```
As observed from the histogram above, majority of invalid values in single year is fewer than 25, removing these values would account for less than 4% of the data as there are 500 data points per year. Also, other than a surge around 1990, most years have fewer than less than 10 values missing. Let’s assume that this is acceptable for us and move ahead with removing these rows.
```
df = df.loc[~non_numeric_profits]
df.profit = df.profit.apply(pd.to_numeric)
```
We should validate if that worked!
```
eval(In[6])
```
Hurray! Our dataset has been cleaned up.

Time to Plot the graphs

Let’s begin with defining a function to plot the graph, set the title and add lables for the x-axis and y-axis.
# function to plot the graphs for average revenues or profits of the fortune 500 companies against year def plot(x, y, ax, title, y_label): ax.set_title(title) ax.set_ylabel(y_label) ax.plot(x, y) ax.margins(x=0, y=0) # function to plot the graphs with superimposed standard deviation def plot_with_std(x, y, stds, ax, title, y_label): ax.fill_between(x, y - stds, y + stds, alpha=0.2) plot(x, y, ax, title, y_label)
```
# function to plot the graphs for average revenues or profits of the fortune 500 companies against year
def plot(x, y, ax, title, y_label):
    ax.set_title(title)
    ax.set_ylabel(y_label)
    ax.plot(x, y)
    ax.margins(x=0, y=0)
    
# function to plot the graphs with superimposed standard deviation    
def plot_with_std(x, y, stds, ax, title, y_label):
    ax.fill_between(x, y - stds, y + stds, alpha=0.2)
    plot(x, y, ax, title, y_label)
```
Let’s plot the average profit by year and average revenue by year using Matplotlib.
```
group_by_year = df.loc[:, ['year', 'revenue', 'profit']].groupby('year')
avgs = group_by_year.mean()
x = avgs.index
y = avgs.profit

fig, ax = plt.subplots()
plot(x, y, ax, 'Increase in mean Fortune 500 company profits from 1955 to 2013', 'Profit (millions)')
```
```
y2 = avgs.revenue
fig, ax = plt.subplots()
plot(x, y2, ax, 'Increase in mean Fortune 500 company revenues from 1955 to 2013', 'Revenue (millions)')
```
Woah! The charts for profits has got some huge ups and downs. It seems like they correspond to the early 1990s recession, the dot-com bubble in the early 2000s and the Great Recession in 2008.

On the other hand, the Revenues are constantly growing and are comparatively stable. Also it does help to understand how the average profits recovered so quickly after the staggering drops because of the recession.

Let’s also take a look at how the average profits and revenues compare to their standard deviations.
fig, (ax1, ax2) = plt.subplots(ncols=2) title = 'Increase in mean and std Fortune 500 company %s from 1955 to 2013' stds1 = group_by_year.std().profit.values stds2 = group_by_year.std().revenue.values plot_with_std(x, y.values, stds1, ax1, title % 'profits', 'Profit (millions)') plot_with_std(x, y2.values, stds2, ax2, title % 'revenues', 'Revenue (millions)') fig.set_size_inches(14, 4) fig.tight_layout()
```
fig, (ax1, ax2) = plt.subplots(ncols=2)
title = 'Increase in mean and std Fortune 500 company %s from 1955 to 2013'
stds1 = group_by_year.std().profit.values
stds2 = group_by_year.std().revenue.values
plot_with_std(x, y.values, stds1, ax1, title % 'profits', 'Profit (millions)')
plot_with_std(x, y2.values, stds2, ax2, title % 'revenues', 'Revenue (millions)')
fig.set_size_inches(14, 4)
fig.tight_layout()
```
That’s astonishing, the standard deviations are huge. Some companies are making billions while some others are losing as much, and the risk certainly has increased along with rising profits and revenues over the years. Although we could keep on playing around with our data set and plot plenty more charts to analyze, it is time to bring this article to a close.

Conclusion

As part of this article we have seen various features of the Jupyter notebooks, from basics like installation, creating, and running code cells to more advanced features like plotting graphs. The power of Jupyter Notebooks to promote a productive working experience and provide an ease of use is evident from the above example, and I do hope that you feel confident to begin using Jupyter Notebooks in your own work and start exploring more advanced features. You can read more about data analytics using Pandas here.

If you’d like to further explore and want to look at more examples, Jupyter has put together A Gallery of Interesting Jupyter Notebooks that you may find helpful and the Nbviewer homepage provides a lot of examples for further references. Find the entire code here on Github.
December 12, 2022

Hacking Your Way Around AWS IAM Roles

Identity and Access Management (IAM) offers role-based access control (RBAC) to your AWS account users and resources, and you can granularize the permission set by defining the policy. If you are familiar or even a beginner with AWS cloud, you know how important IAM is.

“AWS Identity and Access Management (IAM) is a web service that helps you securely control access to AWS resources. You use IAM to control who is authenticated (signed in) and authorized (has permissions) to use resources.”

– AWS IAM User Guide

‍

With the emergence of cloud infrastructure services, the coolest thing you can do is write your infrastructure as code. AWS offers SDKs for various programming/scripting languages, and of course, like any other API call, you need to sign a request with tokens. The AWS IAM console lets you generate access_key and secret_access_key tokens. This token can then be configured with your SDK.

Alternatively, you can configure the token with your user profile via aws cli. This also means anyone with access_key and secret_access_key will have permissions configured as per the IAM policy. Thus, keeping credentials on the disk is insecure. You can implement a key rotation policy to keep the environment compliant. To even overcome this, you can use the AWS IAM role for services.

Let’s say if you are working on an AWS EC2 instance that needs access to some other AWS service, like S3. You can create an IAM role for EC2 with a policy that has appropriate permission to access the S3 bucket. In this case, your SDK doesn’t need a token (not at least on the disk or hardcoded in code). Let’s take a look at the hierarchy of how the AWS SDK looks for a token for signing requests.

1. Embedded in your code (very insecure). This is the very first place your SDK looks for. Below is a NodeJS example, where access_key and secret_access_key are part of the code itself.

const {S3} = require("aws-sdk");
const s3 = new S3({
   accessKeyId : "ABCDEFGHIJKLMNOPQRST",
   secretAccessKey : "7is/HVjA8lm9hRrJyZEPWAs5Bo8KyyvEqjjxIHoO"
  //sessionToken : "options_session_token_if_applicable"
});

const {S3} = require("aws-sdk");
const s3 = new S3({
   accessKeyId : "ABCDEFGHIJKLMNOPQRST",
   secretAccessKey : "7is/HVjA8lm9hRrJyZEPWAs5Bo8KyyvEqjjxIHoO"
  //sessionToken : "options_session_token_if_applicable"
});

2. AWS environment variables. If the token is not embedded in your code, your SDK looks for AWS environment variables available to process. These environment variables are AWS_ACCESS_KEY, AWS_SECRET_ACCESS_KEY, and optional AWS_SESSION_TOKEN. Below is an example where AWS credentials are exported and the aws cli command is used to list S3 buckets. Note that once credentials are exported, they are available to all the child processes. Therefore, these credentials are auto looked up by your AWS SDK.

3. The AWS credentials (default profile) file located at ~/.aws/credentials. This is the third place for the lookup. You can generate this file by running the command aws configure. You may also manually create this file with various profiles. If you happen to have multiple profiles, you can then export an environment variable called AWS_PROFILE. An example credentials file is given below:

[default] ; default profile
aws_access_key_id = <DEFAULT_ACCESS_KEY_ID>
aws_secret_access_key = <DEFAULT_SECRET_ACCESS_KEY>
  
[personal-account] ; personal account profile
aws_access_key_id = <PERSONAL_ACCESS_KEY_ID>
aws_secret_access_key = <PERSONAL_SECRET_ACCESS_KEY>
  
[work-account] ; work account profile
aws_access_key_id = <WORK_ACCESS_KEY_ID>
aws_secret_access_key = <WORK_SECRET_ACCESS_KEY>

[default] ; default profile
aws_access_key_id = <DEFAULT_ACCESS_KEY_ID>
aws_secret_access_key = <DEFAULT_SECRET_ACCESS_KEY>
  
[personal-account] ; personal account profile
aws_access_key_id = <PERSONAL_ACCESS_KEY_ID>
aws_secret_access_key = <PERSONAL_SECRET_ACCESS_KEY>
  
[work-account] ; work account profile
aws_access_key_id = <WORK_ACCESS_KEY_ID>
aws_secret_access_key = <WORK_SECRET_ACCESS_KEY>

4. The IAM role attached to your resource. Your resource could be EC2 Instance, Lambda function, AWS glue, ECS Container, RDS, etc. Now, this is a secure way of using credentials. Since your credentials are not stored anywhere on the disk, exported via an environment variable, or hardcoded in the code. You need not worry about key rotation at all.

TL;DR: IAM roles are a secure way of using credentials. However, they are only applicable to resources within AWS. You can not use them outside of AWS. So, the IAM role can only be attached to resources like EC2, Lambda, ECS, etc.

The problem statement:

Let’s say a group of developers needs access to a few S3 buckets and DynamoDB. The organization does not want developers to use access_key and secret_access_key on their local machine (laptop) as access_key and secret_access can be used anywhere or can be stolen.

Since IAM roles are more secure, they allocate EC2 with Windows OS and attach the IAM role with appropriate permission to access S3 buckets and DynamoDB and configure IDE and other essential dev tools. Developers then use RDP to connect to EC2 Instance. However, due to license restrictions, only two users can connect with RDP at a given time. So, they add more similar instances. This heavily increases cost. Wouldn’t it be nice, if somehow, IAM roles could be attached to local machines?

How do IAM roles for resources work?

Resources like EC2 or Lambda have the link-local address available. The link-local address 169.254.169.254 can be accessed over HTTP port 80 to retrieve instance metadata. For instance, to get the instance-id of an EC2 instance from the host itself, you can query with a GET request to curl -L http://169.254.169.254/latest/meta-data/instance-id/. Similarly, you can retrieve IAM credentials if the IAM role is attached to the EC2 instance. Let’s assume you have created an IAM role for an EC2 instance with the name “iam-role-for-ec2”. Your SDK will then automatically access credentials via a GET request to curl -L http://169.254.169.254/latest/meta-data/iam/security-credentials/iam-role-for-ec2/

$ curl -L 169.254.169.254/latest/meta-data/iam/security-credentials/iam-role-for-ec2/
{
 "Code" : "Success",
 "LastUpdated" : "2021-08-03T09:18:49Z",
 "Type" : "AWS-HMAC",
 "AccessKeyId" : "ASIASP26DFHDIOFNJFFX",
 "SecretAccessKey" : "EK1A7x9dntSzF9LlG7BK08C6zpTS/F6MHYTBo/+U",
 "Token" : "IQoJb3JpZ2luX2VjEPr//////////wEaCXVzLXdlc3QtMiJIMEYCIQCOCqHrHjEkYZUFsRtGXwa8gfGjsBmaU+WrL2Z0ihvA3QIhAIsGhJFiPetOod7IUUC++unWZfoUEgjEU0ULYwZUvGwwKvoDCBIQAhoMMTcxNDU5MTYwNTE4IgxFUXJfE/0cdJs2Gigq1wM8Ww8yAS2i2qUqsQ1t+yd4ATkE5fvIMDtHxzPQ2raVQb+cCgC/eJVQpeNET1SP01HnrN5W1QFID+xOPk3vZt6NrCy48OUf6+cCGrd63Jv/7glAsyQGaGM/Jt5ddi6593dgN7VLFHsEBAwqkZ3j/VjAzYbthP3clmRl++6k+vpiUp2j4uwM4zW/6f8faR6awPbPVmJsyh94pXaQXJU+H0w+9Hp0MlUvP6GRqBiuTwv/+EOiRfth1XGRxxOuR5X+fr0Ve4tede2x0ZvSLeUsUENHlOQnUkSGbu1Hiv1BhDEjhzbHi7PXhW1G9N1FZObE+wdF4hGYbe3LUUIrnp2xnIcxKzmume2YQvFE4DvJvBtF22DsdLP4GPmitofhV2FGcVxP1f5Nv76M6SfOQY65vSZQde4LIwcotRIrMgwEWup2Rplq6s56K93IYXp6QmnUWLgdtcMBTMVQsOFhCdj05P+VYqlKe5xRT4/8BucmIHn7+J4indNoL+3BvYvnpiISdcEhlyswNZOPhVQJjwJfKPPdu9NDEKQ+Jep4wpVvOSh+CAtxKtqwGz1wrKzqlRvzqBFaEQrD4WdPdf9YnTvmKIXgPuk74pZRlarVsREL0KmG6G0zzA2lRYow6JOkiAY6pAHIZGH+UH5RL79drKe86tUnWCORcX9omN2uUK7FemTENwyvholib4jLGY6HcjvDF10jqkcu1KEV20xNsPj87BP7irEH7xH//Jz2+rnSaN5PCqLezSsATPYhHFQjg6Oti+0E33F+F5MA25Pn2+u5TDP1VfFgYExwSor79gNtwbOMs76432ssHYFioYjHttPfVwyNXloLCwgphqJBwiNhMDMcKapK6Q==",
 "Expiration" : "2021-08-03T15:47:26Z"
}

$ curl -L 169.254.169.254/latest/meta-data/iam/security-credentials/iam-role-for-ec2/
{
 "Code" : "Success",
 "LastUpdated" : "2021-08-03T09:18:49Z",
 "Type" : "AWS-HMAC",
 "AccessKeyId" : "ASIASP26DFHDIOFNJFFX",
 "SecretAccessKey" : "EK1A7x9dntSzF9LlG7BK08C6zpTS/F6MHYTBo/+U",
 "Token" : "IQoJb3JpZ2luX2VjEPr//////////wEaCXVzLXdlc3QtMiJIMEYCIQCOCqHrHjEkYZUFsRtGXwa8gfGjsBmaU+WrL2Z0ihvA3QIhAIsGhJFiPetOod7IUUC++unWZfoUEgjEU0ULYwZUvGwwKvoDCBIQAhoMMTcxNDU5MTYwNTE4IgxFUXJfE/0cdJs2Gigq1wM8Ww8yAS2i2qUqsQ1t+yd4ATkE5fvIMDtHxzPQ2raVQb+cCgC/eJVQpeNET1SP01HnrN5W1QFID+xOPk3vZt6NrCy48OUf6+cCGrd63Jv/7glAsyQGaGM/Jt5ddi6593dgN7VLFHsEBAwqkZ3j/VjAzYbthP3clmRl++6k+vpiUp2j4uwM4zW/6f8faR6awPbPVmJsyh94pXaQXJU+H0w+9Hp0MlUvP6GRqBiuTwv/+EOiRfth1XGRxxOuR5X+fr0Ve4tede2x0ZvSLeUsUENHlOQnUkSGbu1Hiv1BhDEjhzbHi7PXhW1G9N1FZObE+wdF4hGYbe3LUUIrnp2xnIcxKzmume2YQvFE4DvJvBtF22DsdLP4GPmitofhV2FGcVxP1f5Nv76M6SfOQY65vSZQde4LIwcotRIrMgwEWup2Rplq6s56K93IYXp6QmnUWLgdtcMBTMVQsOFhCdj05P+VYqlKe5xRT4/8BucmIHn7+J4indNoL+3BvYvnpiISdcEhlyswNZOPhVQJjwJfKPPdu9NDEKQ+Jep4wpVvOSh+CAtxKtqwGz1wrKzqlRvzqBFaEQrD4WdPdf9YnTvmKIXgPuk74pZRlarVsREL0KmG6G0zzA2lRYow6JOkiAY6pAHIZGH+UH5RL79drKe86tUnWCORcX9omN2uUK7FemTENwyvholib4jLGY6HcjvDF10jqkcu1KEV20xNsPj87BP7irEH7xH//Jz2+rnSaN5PCqLezSsATPYhHFQjg6Oti+0E33F+F5MA25Pn2+u5TDP1VfFgYExwSor79gNtwbOMs76432ssHYFioYjHttPfVwyNXloLCwgphqJBwiNhMDMcKapK6Q==",
 "Expiration" : "2021-08-03T15:47:26Z"
}

Notice that the response payload is JSON with AccessKeyId, SecretAccessKey, and Token. Additionally, there is an Expiration key, which states the validity of the token. This means the token is autogenerated once they expire.

Solution:

Now that you know how IAM roles work and how important link-local address is, you have probably guessed what needs to be done so that you can access IAM role credentials from your local machine. The two solutions that popup in my mind are:

1. Host a lightweight reverse proxy server like Nginx and then write a wrapper around your SDK so that initial calls are made to EC2 and credentials are retrieved.

2. Route traffic originating from your system, targeting 169.254.169.254. Traffic should reach the EC2 instance and EC2 itself should take care of forwarding packets to the instance metadata server.

The second solution may sound pretty techy, but it is the ideal solution, and you don’t need to do additional tweaking in your SDK. The developer is transparent about what is being implemented. This blog will focus on implementing a second solution.

Implementation:

1. Launch a Linux (Ubuntu 20.04 LTS prefered) EC2 instance from AWS console and attach the IAM role with appropriate permissions. The instance should be in the public subnet and make sure to attach an Elastic IP address. Whitelist incoming port 1194 UDP (open to world) and port 22 (ssh, open to your IP address only) TCP in your instance security group.

2. Install OpenVPN and git package. apt update; apt install git openvpn.

3. Clone easy-rsa repository on your server. cd ~;git clone https://github.com/OpenVPN/easy-rsa.git

4. Generate certificates for OpenVPN server and client using easy-rsa.

#switch to easy-rsa directory
cd ~/easy-rsa/easyrsa3
#copy vars.example to vars
cp vars.example vars
#Find below variables in "vars" file and edit them according to your needs
set_var EASYRSA_REQ_COUNTRY    "US"
set_var EASYRSA_REQ_PROVINCE   "California"
set_var EASYRSA_REQ_CITY       "San Francisco"
set_var EASYRSA_REQ_ORG        "Copyleft Certificate Co"
set_var EASYRSA_REQ_EMAIL      "me@example.net"
set_var EASYRSA_REQ_OU         "My Organizational Unit"
#Also edit below two variables if you plan to run easyrsa in non-interactive mode
# EASYRSA_REQ_CN should be set to your ElasticIP Address.
# Note: If your are using openvpn behind a load balancer, or if you plan to map DNS to your server, then this should be set to your DNS name
set_var EASYRSA_REQ_CN         "Your Instance Elastic IP"
set_var EASYRSA_BATCH          "NONEMPTY"
#====================================================
#Generate certificate and keys for server and client
./easyrsa init-pki
./easyrsa build-ca nopass
./easyrsa gen-dh
./easyrsa build-server-full server nopass
./easyrsa build-client-full client nopass
#Copy certificates and keys to server configuration
cp -p ./pki/ca.crt /etc/openvpn/
cp -p ./pki/issued/server.crt /etc/openvpn/
cp -p ./pki/private/server.key /etc/openvpn/
cp -p ./pki/dh.pem /etc/openvpn/dh2049.pem
cd /etc/openvpn
openvpn --genkey --secret myvpn.tlsauth
echo "net.ipv4.ip_forward = 1" >>/etc/sysctl.conf
sysctl -p

#switch to easy-rsa directory
cd ~/easy-rsa/easyrsa3
#copy vars.example to vars
cp vars.example vars
#Find below variables in "vars" file and edit them according to your needs
set_var EASYRSA_REQ_COUNTRY    "US"
set_var EASYRSA_REQ_PROVINCE   "California"
set_var EASYRSA_REQ_CITY       "San Francisco"
set_var EASYRSA_REQ_ORG        "Copyleft Certificate Co"
set_var EASYRSA_REQ_EMAIL      "me@example.net"
set_var EASYRSA_REQ_OU         "My Organizational Unit"
#Also edit below two variables if you plan to run easyrsa in non-interactive mode
# EASYRSA_REQ_CN should be set to your ElasticIP Address.
# Note: If your are using openvpn behind a load balancer, or if you plan to map DNS to your server, then this should be set to your DNS name
set_var EASYRSA_REQ_CN         "Your Instance Elastic IP"
set_var EASYRSA_BATCH          "NONEMPTY"
#====================================================
#Generate certificate and keys for server and client
./easyrsa init-pki
./easyrsa build-ca nopass
./easyrsa gen-dh
./easyrsa build-server-full server nopass
./easyrsa build-client-full client nopass
#Copy certificates and keys to server configuration
cp -p ./pki/ca.crt /etc/openvpn/
cp -p ./pki/issued/server.crt /etc/openvpn/
cp -p ./pki/private/server.key /etc/openvpn/
cp -p ./pki/dh.pem /etc/openvpn/dh2049.pem
cd /etc/openvpn
openvpn --genkey --secret myvpn.tlsauth
echo "net.ipv4.ip_forward = 1" >>/etc/sysctl.conf
sysctl -p

5. Configure OpenVPN server.conf file:

port 1194
proto udp
dev tun
ca ca.crt
cert server.crt
key server.key # This file should be kept secret
dh dh2048.pem
topology subnet
server 10.8.0.0 255.255.255.0
ifconfig-pool-persist ipp.txt
push "redirect-gateway def1 bypass-dhcp"
push "dhcp-option DNS 8.8.8.8"
push "dhcp-option DNS 1.1.1.1"
push "route 169.254.169.254 255.255.255.255"
keepalive 10 120
tls-auth myvpn.tlsauth 0
cipher AES-256-CBC
comp-lzo
user nobody
group nogroup
persist-key
persist-tun
status openvpn-status.log
log-append  /var/log/openvpn.log
verb 4
explicit-exit-notify 1
remote-cert-eku "TLS Web Client Authentication"

port 1194
proto udp
dev tun
ca ca.crt
cert server.crt
key server.key # This file should be kept secret
dh dh2048.pem
topology subnet
server 10.8.0.0 255.255.255.0
ifconfig-pool-persist ipp.txt
push "redirect-gateway def1 bypass-dhcp"
push "dhcp-option DNS 8.8.8.8"
push "dhcp-option DNS 1.1.1.1"
push "route 169.254.169.254 255.255.255.255"
keepalive 10 120
tls-auth myvpn.tlsauth 0
cipher AES-256-CBC
comp-lzo
user nobody
group nogroup
persist-key
persist-tun
status openvpn-status.log
log-append  /var/log/openvpn.log
verb 4
explicit-exit-notify 1
remote-cert-eku "TLS Web Client Authentication"

In the above configuration file, make sure line number 9 is not conflicting with your AWS VPC CIDR. Line number 14 (push “route 169.254.169.254 255.255.255.255”) does a trick for us and is the heart of this blog post. This assures that when a client connects via OpenVPN, a route is added to the client machine so that packets targeting 168.254.169.254 are routed via OpenVPN tunnel. (Note: If you do not add this here, you can manually add a route to your client-side once OpenVPN is connected. ip route add 169.254.169.254/32 YOUR_TUNNEL_IP dev tun0)

6. Generate an OpenVPN client configuration file:

#These commands are executed on your EC2 (OopenvVpn)
cd ~/easy-rsa/easyrsa3
cat <<EOF >/tmp/client.ovpn
client
dev tun
proto udp
remote YOUR-ELASTIC-IP-ADDRESS 1194
resolv-retry infinite
nobind
persist-key
persist-tun
cipher AES-256-CBC
comp-lzo
verb 3
key-direction 1
EOF
#append ca certificate
echo '<ca>' >>/tmp/client.ovpn
cat ./pki/ca.crt >>/tmp/client.ovpn
echo '</ca>' >>/tmp/client.ovpn
#append client certificate
echo '<cert>' >>/tmp/client.ovpn
sed -n '/BEGIN CERTIFICATE/,/END CERTIFICATE/{p;/END CERTIFICATE/q}' ./pki/issued/client.crt >>/tmp/client.ovpn
echo '</cert>' >>/tmp/client.ovpn
#append client key
echo '<key>' >>/tmp/client.ovpn
cat ./pki/private/client.key >>/tmp/client.ovpn
echo '</key>' >>/tmp/client.ovpn
#append TLS auth key
echo '<tls-auth>' >>/tmp/client.ovpn
cat /etc/openvpn/myvpn.tlsauth >>/tmp/client.ovpn
echo '</tls-auth>' >>/tmp/client.ovpn

#These commands are executed on your EC2 (OopenvVpn)
cd ~/easy-rsa/easyrsa3
cat <<EOF >/tmp/client.ovpn
client
dev tun
proto udp
remote YOUR-ELASTIC-IP-ADDRESS 1194
resolv-retry infinite
nobind
persist-key
persist-tun
cipher AES-256-CBC
comp-lzo
verb 3
key-direction 1
EOF
#append ca certificate
echo '<ca>' >>/tmp/client.ovpn
cat ./pki/ca.crt >>/tmp/client.ovpn
echo '</ca>' >>/tmp/client.ovpn
#append client certificate
echo '<cert>' >>/tmp/client.ovpn
sed -n '/BEGIN CERTIFICATE/,/END CERTIFICATE/{p;/END CERTIFICATE/q}' ./pki/issued/client.crt >>/tmp/client.ovpn
echo '</cert>' >>/tmp/client.ovpn
#append client key
echo '<key>' >>/tmp/client.ovpn
cat ./pki/private/client.key >>/tmp/client.ovpn
echo '</key>' >>/tmp/client.ovpn
#append TLS auth key
echo '<tls-auth>' >>/tmp/client.ovpn
cat /etc/openvpn/myvpn.tlsauth >>/tmp/client.ovpn
echo '</tls-auth>' >>/tmp/client.ovpn

In the above configuration file, make sure to update line number 9. This could be your EC2 elastic IP address (or domain if mapped and configured).

7. Finally, download the /tmp/client.ovpn file to your local machine. Install the OpenVPN client software, import the client.ovpn file, and connect. If you are using a Linux machine, you may connect using sudo openvpn –config /path/to/client.ovpn.

Testing:

Let us say you have configured the IAM role with permission that lets you list S3 buckets. You should be able to access AWS resources once the OpenVPN client is connected. Your SDK should automatically look for credentials via metadata link-local address. You may install the aws-cli utility and run aws s3 ls to list S3 buckets.

Conclusion:

IAM roles are meant to be used with AWS resources like EC2, ECS, Lambda, etc. so that you don’t keep the credentials hardcoded in the code or in the configuration file left unsecured on the disk. Our goal was to use the IAM role directly from the local machine (laptop). We achieved this by using OpenVPN secure SSL tunnel. The VPN assures that we are in a private network, thus keeping the environment compliant. This guide is not meant for how one should set up an OpenVPN server/client. Therefore, you must harden the OpenVPN server. You may put the server behind the network load balancer and may enforce MAC binding features to your clients.

December 12, 2022

Implementing gRPC In Python: A Step-by-step Guide

In the last few years, we saw a great shift in technology, where projects are moving towards “microservice architecture” vs the old “monolithic architecture”. This approach has done wonders for us.

As we say, “smaller things are much easier to handle”, so here we have microservices that can be handled conveniently. We need to interact among different microservices. I handled it using the HTTP API call, which seems great and it worked for me.

But is this the perfect way to do things?

The answer is a resounding, “no,” because we compromised both speed and efficiency here.

Then came in the picture, the gRPC framework, that has been a game-changer.

What is gRPC?

Quoting the official documentation–

“gRPC or Google Remote Procedure Call is a modern open-source high-performance RPC framework that can run in any environment. It can efficiently connect services in and across data centers with pluggable support for load balancing, tracing, health checking and authentication.”

Credit: gRPC

RPC or remote procedure calls are the messages that the server sends to the remote system to get the task(or subroutines) done.

Google’s RPC is designed to facilitate smooth and efficient communication between the services. It can be utilized in different ways, such as:

Efficiently connecting polyglot services in microservices style architecture
Connecting mobile devices, browser clients to backend services
Generating efficient client libraries

Why gRPC?

– HTTP/2 based transport – It uses HTTP/2 protocol instead of HTTP 1.1. HTTP/2 protocol provides multiple benefits over the latter. One major benefit is multiple bidirectional streams that can be created and sent over TCP connections parallelly, making it swift.

– Auth, tracing, load balancing and health checking – gRPC provides all these features, making it a secure and reliable option to choose.

– Language independent communication– Two services may be written in different languages, say Python and Golang. gRPC ensures smooth communication between them.

– Use of Protocol Buffers – gRPC uses protocol buffers for defining the type of data (also called Interface Definition Language (IDL)) to be sent between the gRPC client and the gRPC server. It also uses it as the message interchange format.

Let’s dig a little more into what are Protocol Buffers.

Protocol Buffers

Protocol Buffers like XML, are an efficient and automated mechanism for serializing structured data. They provide a way to define the structure of data to be transmitted. Google says that protocol buffers are better than XML, as they are:

simpler
three to ten times smaller
20 to 100 times faster
less ambiguous
generates data access classes that make it easier to use them programmatically

Protobuf are defined in .proto files. It is easy to define them.

Types of gRPC implementation

1. Unary RPCs:- This is a simple gRPC which works like a normal function call. It sends a single request declared in the .proto file to the server and gets back a single response from the server.

rpc HelloServer(RequestMessage) returns (ResponseMessage);

rpc HelloServer(RequestMessage) returns (ResponseMessage);

2. Server streaming RPCs:- The client sends a message declared in the .proto file to the server and gets back a stream of message sequence to read. The client reads from that stream of messages until there are no messages.

rpc HelloServer(RequestMessage) returns (stream ResponseMessage);

rpc HelloServer(RequestMessage) returns (stream ResponseMessage);

3. Client streaming RPCs:- The client writes a message sequence using a write stream and sends the same to the server. After all the messages are sent to the server, the client waits for the server to read all the messages and return a response.

rpc HelloServer(stream RequestMessage) returns (ResponseMessage);

rpc HelloServer(stream RequestMessage) returns (ResponseMessage);

4. Bidirectional streaming RPCs:- Both gRPC client and the gRPC server use a read-write stream to send a message sequence. Both operate independently, so gRPC clients and gRPC servers can write and read in any order they like, i.e. the server can read a message then write a message alternatively, wait to receive all messages then write its responses, or perform reads and writes in any other combination.

rpc HelloServer(stream RequestMessage) returns (stream ResponseMessage);

rpc HelloServer(stream RequestMessage) returns (stream ResponseMessage);

**gRPC guarantees the ordering of messages within an individual RPC call. In the case of Bidirectional streaming, the order of messages is preserved in each stream.

Implementing gRPC in Python

Currently, gRPC provides support for many languages like Golang, C++, Java, etc. I will be focussing on its implementation using Python.

mkdir grpc_example
cd grpc_example
virtualenv -p python3 env
source env/bin/activate
pip install grpcio grpcio-tools

mkdir grpc_example
cd grpc_example
virtualenv -p python3 env
source env/bin/activate
pip install grpcio grpcio-tools

This will install all the required dependencies to implement gRPC.

Unary gRPC

For implementing gRPC services, we need to define three files:-

Proto file – Proto file comprises the declaration of the service that is used to generate stubs (<package_name>_pb2.py and <package_name>_pb2_grpc.py). These are used by the gRPC client and the gRPC server.</package_name></package_name>
gRPC client – The client makes a gRPC call to the server to get the response as per the proto file.‍
gRPC Server – The server is responsible for serving requests to the client.

syntax = "proto3";
package unary;
service Unary{
  // A simple RPC.
  //
  // Obtains the MessageResponse at a given position.
 rpc GetServerResponse(Message) returns (MessageResponse) {}
}
message Message{
 string message = 1;
}
message MessageResponse{
 string message = 1;
 bool received = 2;
}

syntax = "proto3";

package unary;

service Unary{
  // A simple RPC.
  //
  // Obtains the MessageResponse at a given position.
 rpc GetServerResponse(Message) returns (MessageResponse) {}

}

message Message{
 string message = 1;
}

message MessageResponse{
 string message = 1;
 bool received = 2;
}

In the above code, we have declared a service named Unary. It consists of a collection of services. For now, I have implemented a single service GetServerResponse(). This service takes an input of type Message and returns a MessageResponse. Below the service declaration, I have declared Message and Message Response.

Once we are done with the creation of the .proto file, we need to generate the stubs. For that, we will execute the below command:-

python -m grpc_tools.protoc --proto_path=. ./unary.proto --python_out=. --grpc_python_out=.

python -m grpc_tools.protoc --proto_path=. ./unary.proto --python_out=. --grpc_python_out=.

Two files are generated named unary_pb2.py and unary_pb2_grpc.py. Using these two stub files, we will implement the gRPC server and the client.

Implementing the Server

import grpc
from concurrent import futures
import time
import unary.unary_pb2_grpc as pb2_grpc
import unary.unary_pb2 as pb2
class UnaryService(pb2_grpc.UnaryServicer):
    def __init__(self, *args, **kwargs):
        pass
    def GetServerResponse(self, request, context):
        # get the string from the incoming request
        message = request.message
        result = f'Hello I am up and running received "{message}" message from you'
        result = {'message': result, 'received': True}
        return pb2.MessageResponse(**result)
def serve():
    server = grpc.server(futures.ThreadPoolExecutor(max_workers=10))
    pb2_grpc.add_UnaryServicer_to_server(UnaryService(), server)
    server.add_insecure_port('[::]:50051')
    server.start()
    server.wait_for_termination()
if __name__ == '__main__':
    serve()

import grpc
from concurrent import futures
import time
import unary.unary_pb2_grpc as pb2_grpc
import unary.unary_pb2 as pb2


class UnaryService(pb2_grpc.UnaryServicer):

    def __init__(self, *args, **kwargs):
        pass

    def GetServerResponse(self, request, context):

        # get the string from the incoming request
        message = request.message
        result = f'Hello I am up and running received "{message}" message from you'
        result = {'message': result, 'received': True}

        return pb2.MessageResponse(**result)


def serve():
    server = grpc.server(futures.ThreadPoolExecutor(max_workers=10))
    pb2_grpc.add_UnaryServicer_to_server(UnaryService(), server)
    server.add_insecure_port('[::]:50051')
    server.start()
    server.wait_for_termination()


if __name__ == '__main__':
    serve()

In the gRPC server file, there is a GetServerResponse() method which takes `Message` from the client and returns a `MessageResponse` as defined in the proto file.

server() function is called from the main function, and makes sure that the server is listening to all the time. We will run the unary_server to start the server

python3 unary_server.py

python3 unary_server.py

Implementing the Client

import grpc
import unary.unary_pb2_grpc as pb2_grpc
import unary.unary_pb2 as pb2
class UnaryClient(object):
    """
    Client for gRPC functionality
    """
    def __init__(self):
        self.host = 'localhost'
        self.server_port = 50051
        # instantiate a channel
        self.channel = grpc.insecure_channel(
            '{}:{}'.format(self.host, self.server_port))
        # bind the client and the server
        self.stub = pb2_grpc.UnaryStub(self.channel)
    def get_url(self, message):
        """
        Client function to call the rpc for GetServerResponse
        """
        message = pb2.Message(message=message)
        print(f'{message}')
        return self.stub.GetServerResponse(message)
if __name__ == '__main__':
    client = UnaryClient()
    result = client.get_url(message="Hello Server you there?")
    print(f'{result}')

import grpc
import unary.unary_pb2_grpc as pb2_grpc
import unary.unary_pb2 as pb2


class UnaryClient(object):
    """
    Client for gRPC functionality
    """

    def __init__(self):
        self.host = 'localhost'
        self.server_port = 50051

        # instantiate a channel
        self.channel = grpc.insecure_channel(
            '{}:{}'.format(self.host, self.server_port))

        # bind the client and the server
        self.stub = pb2_grpc.UnaryStub(self.channel)

    def get_url(self, message):
        """
        Client function to call the rpc for GetServerResponse
        """
        message = pb2.Message(message=message)
        print(f'{message}')
        return self.stub.GetServerResponse(message)


if __name__ == '__main__':
    client = UnaryClient()
    result = client.get_url(message="Hello Server you there?")
    print(f'{result}')

In the __init__func. we have initialized the stub using ` self.stub = pb2_grpc.UnaryStub(self.channel)’ And we have a get_url function which calls to server using the above-initialized stub

This completes the implementation of Unary gRPC service.

Let’s check the output:-

Run -> python3 unary_client.py

Output:-

message: “Hello Server you there?”

message: “Hello I am up and running. Received ‘Hello Server you there?’ message from you”

received: true

Bidirectional Implementation

syntax = "proto3";
package bidirectional;
service Bidirectional {
  // A Bidirectional streaming RPC.
  //
  // Accepts a stream of Message sent while a route is being traversed,
   rpc GetServerResponse(stream Message) returns (stream Message) {}
}
message Message {
  string message = 1;
}

syntax = "proto3";

package bidirectional;

service Bidirectional {
  // A Bidirectional streaming RPC.
  //
  // Accepts a stream of Message sent while a route is being traversed,
   rpc GetServerResponse(stream Message) returns (stream Message) {}
}

message Message {
  string message = 1;
}

In the above code, we have declared a service named Bidirectional. It consists of a collection of services. For now, I have implemented a single service GetServerResponse(). This service takes an input of type Message and returns a Message. Below the service declaration, I have declared Message.

Once we are done with the creation of the .proto file, we need to generate the stubs. To generate the stub, we need the execute the below command:-

python -m grpc_tools.protoc --proto_path=.  ./bidirecctional.proto --python_out=. --grpc_python_out=.

python -m grpc_tools.protoc --proto_path=.  ./bidirecctional.proto --python_out=. --grpc_python_out=.

Two files are generated named bidirectional_pb2.py and bidirectional_pb2_grpc.py. Using these two stub files, we will implement the gRPC server and client.

Implementing the Server

from concurrent import futures
import grpc
import bidirectional.bidirectional_pb2_grpc as bidirectional_pb2_grpc
class BidirectionalService(bidirectional_pb2_grpc.BidirectionalServicer):
    def GetServerResponse(self, request_iterator, context):
        for message in request_iterator:
            yield message
def serve():
    server = grpc.server(futures.ThreadPoolExecutor(max_workers=10))
    bidirectional_pb2_grpc.add_BidirectionalServicer_to_server(BidirectionalService(), server)
    server.add_insecure_port('[::]:50051')
    server.start()
    server.wait_for_termination()
if __name__ == '__main__':
    serve()

from concurrent import futures

import grpc
import bidirectional.bidirectional_pb2_grpc as bidirectional_pb2_grpc


class BidirectionalService(bidirectional_pb2_grpc.BidirectionalServicer):

    def GetServerResponse(self, request_iterator, context):
        for message in request_iterator:
            yield message


def serve():
    server = grpc.server(futures.ThreadPoolExecutor(max_workers=10))
    bidirectional_pb2_grpc.add_BidirectionalServicer_to_server(BidirectionalService(), server)
    server.add_insecure_port('[::]:50051')
    server.start()
    server.wait_for_termination()


if __name__ == '__main__':
    serve()

In the gRPC server file, there is a GetServerResponse() method which takes a stream of `Message` from the client and returns a stream of `Message` independent of each other. server() function is called from the main function and makes sure that the server is listening to all the time.

We will run the bidirectional_server to start the server:

python3 bidirectional_server.py

python3 bidirectional_server.py

Implementing the Client

from __future__ import print_function
import grpc
import bidirectional.bidirectional_pb2_grpc as bidirectional_pb2_grpc
import bidirectional.bidirectional_pb2 as bidirectional_pb2
def make_message(message):
    return bidirectional_pb2.Message(
        message=message
    )
def generate_messages():
    messages = [
        make_message("First message"),
        make_message("Second message"),
        make_message("Third message"),
        make_message("Fourth message"),
        make_message("Fifth message"),
    ]
    for msg in messages:
        print("Hello Server Sending you the %s" % msg.message)
        yield msg
def send_message(stub):
    responses = stub.GetServerResponse(generate_messages())
    for response in responses:
        print("Hello from the server received your %s" % response.message)
def run():
    with grpc.insecure_channel('localhost:50051') as channel:
        stub = bidirectional_pb2_grpc.BidirectionalStub(channel)
        send_message(stub)
if __name__ == '__main__':
    run()

from __future__ import print_function

import grpc
import bidirectional.bidirectional_pb2_grpc as bidirectional_pb2_grpc
import bidirectional.bidirectional_pb2 as bidirectional_pb2


def make_message(message):
    return bidirectional_pb2.Message(
        message=message
    )


def generate_messages():
    messages = [
        make_message("First message"),
        make_message("Second message"),
        make_message("Third message"),
        make_message("Fourth message"),
        make_message("Fifth message"),
    ]
    for msg in messages:
        print("Hello Server Sending you the %s" % msg.message)
        yield msg


def send_message(stub):
    responses = stub.GetServerResponse(generate_messages())
    for response in responses:
        print("Hello from the server received your %s" % response.message)


def run():
    with grpc.insecure_channel('localhost:50051') as channel:
        stub = bidirectional_pb2_grpc.BidirectionalStub(channel)
        send_message(stub)


if __name__ == '__main__':
    run()

In the run() function. we have initialised the stub using ` stub = bidirectional_pb2_grpc.BidirectionalStub(channel)’

And we have a send_message function to which the stub is passed and it makes multiple calls to the server and receives the results from the server simultaneously.

This completes the implementation of Bidirectional gRPC service.

Let’s check the output:-

Run -> python3 bidirectional_client.py

Output:-

Hello Server Sending you the First message

Hello Server Sending you the Second message

Hello Server Sending you the Third message

Hello Server Sending you the Fourth message

Hello Server Sending you the Fifth message

Hello from the server received your First message

Hello from the server received your Second message

Hello from the server received your Third message

Hello from the server received your Fourth message

Hello from the server received your Fifth message‍

For code reference, please visit here.

Conclusion‍

gRPC is an emerging RPC framework that makes communication between microservices smooth and efficient. I believe gRPC is currently confined to inter microservice but has many other utilities that we will see in the coming years. To know more about modern data communication solutions, check out this blog.

December 12, 2022

A Practical Guide to Deploying Multi-tier Applications on Google Container Engine (GKE)
Introduction

All modern era programmers can attest that containerization has afforded more flexibility and allows us to build truly cloud-native applications. Containers provide portability – ability to easily move applications across environments. Although complex applications comprise of many (10s or 100s) containers. Managing such applications is a real challenge and that’s where container orchestration and scheduling platforms like Kubernetes, Mesosphere, Docker Swarm, etc. come into the picture.
Kubernetes, backed by Google is leading the pack given that Redhat, Microsoft and now Amazon are putting their weight behind it.

Kubernetes can run on any cloud or bare metal infrastructure. Setting up & managing Kubernetes can be a challenge but Google provides an easy way to use Kubernetes through the Google Container Engine(GKE) service.

What is GKE?

Google Container Engine is a Management and orchestration system for Containers. In short, it is a hosted Kubernetes. The goal of GKE is to increase the productivity of DevOps and development teams by hiding the complexity of setting up the Kubernetes cluster, the overlay network, etc.

Why GKE? What are the things that GKE does for the user?
- GKE abstracts away the complexity of managing a highly available Kubernetes cluster.
- GKE takes care of the overlay network
- GKE also provides built-in authentication
- GKE also provides built-in auto-scaling.
- GKE also provides easy integration with the Google storage services.
In this blog, we will see how to create your own Kubernetes cluster in GKE and how to deploy a multi-tier application in it. The blog assumes you have a basic understanding of Kubernetes and have used it before. It also assumes you have created an account with Google Cloud Platform. If you are not familiar with Kubernetes, this guide from Deis is a good place to start.

Google provides a Command-line interface (gcloud) to interact with all Google Cloud Platform products and services. gcloud is a tool that provides the primary command-line interface to Google Cloud Platform. Gcloud tool can be used in the scripts to automate the tasks or directly from the command-line. Follow this guide to install the gcloud tool.

Now let’s begin! The first step is to create the cluster.

Basic Steps to create cluster

In this section, I would like to explain about how to create GKE cluster. We will use a command-line tool to setup the cluster.

Set the zone in which you want to deploy the cluster
```
$ gcloud config set compute/zone us-west1-a
```
Create the cluster using following command,
```
$ gcloud container --project <project-name> 
clusters create <cluster-name> 
--machine-type n1-standard-2 
--image-type "COS" --disk-size "50" 
--num-nodes 2 --network default 
--enable-cloud-logging --no-enable-cloud-monitoring
```
Let’s try to understand what each of these parameters mean:

–project: Project Name

–machine-type: Type of the machine like n1-standard-2, n1-standard-4

–image-type: OS image.”COS” i.e. Container Optimized OS from Google: More Info here.

–disk-size: Disk size of each instance.

–num-nodes: Number of nodes in the cluster.

–network: Network that users want to use for the cluster. In this case, we are using default network.

Apart from the above options, you can also use the following to provide specific requirements while creating the cluster:

–scopes: Scopes enable containers to direct access any Google service without needs credentials. You can specify comma separated list of scope APIs. For example:
- Compute: Lets you view and manage your Google Compute Engine resources
- Logging.write: Submit log data to Stackdriver.
You can find all the Scopes that Google supports here: .

–additional-zones: Specify additional zones to high availability. Eg. –additional-zones us-east1-b, us-east1-d . Here Kubernetes will create a cluster in 3 zones (1 specified at the beginning and additional 2 here).

–enable-autoscaling : To enable the autoscaling option. If you specify this option then you have to specify the minimum and maximum required nodes as follows; You can read more about how auto-scaling works here. Eg: –enable-autoscaling –min-nodes=15 –max-nodes=50

You can fetch the credentials of the created cluster. This step is to update the credentials in the kubeconfig file, so that kubectl will point to required cluster.
```
$ gcloud container clusters get-credentials my-first-cluster --project project-name
```
Now, your First Kubernetes cluster is ready. Let’s check the cluster information & health.
```
$ kubectl get nodes
NAME    STATUS    AGE   VERSION
gke-first-cluster-default-pool-d344484d-vnj1  Ready  2h  v1.6.4
gke-first-cluster-default-pool-d344484d-kdd7  Ready  2h  v1.6.4
gke-first-cluster-default-pool-d344484d-ytre2  Ready  2h  v1.6.4
```
After creating Cluster, now let’s see how to deploy a multi tier application on it. Let’s use simple Python Flask app which will greet the user, store employee data & get employee data.

Application Deployment

I have created simple Python Flask application to deploy on K8S cluster created using GKE. you can go through the source code here. If you check the source code then you will find directory structure as follows:
```
TryGKE/
├── Dockerfile
├── mysql-deployment.yaml
├── mysql-service.yaml
├── src    
  ├── app.py    
  └── requirements.txt    
  ├── testapp-deployment.yaml    
  └── testapp-service.yaml
```
In this, I have written a Dockerfile for the Python Flask application in order to build our own image to deploy. For MySQL, we won’t build an image of our own. We will use the latest MySQL image from the public docker repository.

Before deploying the application, let’s re-visit some of the important Kubernetes terms:

Pods:

The pod is a Docker container or a group of Docker containers which are deployed together on the host machine. It acts as a single unit of deployment.

Deployments:

Deployment is an entity which manages the ReplicaSets and provides declarative updates to pods. It is recommended to use Deployments instead of directly using ReplicaSets. We can use deployment to create, remove and update ReplicaSets. Deployments have the ability to rollout and rollback the changes.

Services:

Service in K8S is an abstraction which will connect you to one or more pods. You can connect to pod using the pod’s IP Address but since pods come and go, their IP Addresses change. Services get their own IP & DNS and those remain for the entire lifetime of the service.

Each tier in an application is represented by a Deployment. A Deployment is described by the YAML file. We have two YAML files – one for MySQL and one for the Python application.

1. MySQL Deployment YAML
apiVersion: extensions/v1beta1 kind: Deployment metadata: name: mysql spec: template: metadata: labels: app: mysql spec: containers: - env: - name: MYSQL_DATABASE value: admin - name: MYSQL_ROOT_PASSWORD value: admin image: 'mysql:latest' name: mysql ports: - name: mysqlport containerPort: 3306 protocol: TCP
```
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: mysql
spec:
  template:
    metadata:
      labels:
        app: mysql
    spec:
      containers:
        - env:
            - name: MYSQL_DATABASE
              value: admin
            - name: MYSQL_ROOT_PASSWORD
              value: admin
          image: 'mysql:latest'
          name: mysql
          ports:
            - name: mysqlport
              containerPort: 3306
              protocol: TCP
```
2. Python Application Deployment YAML
```
apiVersion: apps/v1beta1
kind: Deployment
metadata:
  name: test-app
spec:
  replicas: 1
  template:
    metadata:
      labels:
        app: test-app
    spec:
      containers:
      - name: test-app
        image: ajaynemade/pymy:latest
        imagePullPolicy: IfNotPresent
        ports:
        - containerPort: 5000
```
Each Service is also represented by a YAML file as follows:

1. MySQL service YAML
```
apiVersion: v1
kind: Service
metadata:
  name: mysql-service
spec:
  ports:
  - port: 3306
    targetPort: 3306
    protocol: TCP
    name: http
  selector:
    app: mysql
```
2. Python Application service YAML
```
apiVersion: v1
kind: Service
metadata:
  name: test-service
spec:
  type: LoadBalancer
  ports:
  - name: test-service
    port: 80
    protocol: TCP
    targetPort: 5000
  selector:
    app: test-app
```
You will find a ‘kind’ field in each YAML file. It is used to specify whether the given configuration is for deployment, service, pod, etc.

In the Python app service YAML, I am using type = LoadBalancer. In GKE, There are two types of cloud load balancers available to expose the application to outside world.
1. TCP load balancer: This is a TCP Proxy-based load balancer. We will use this in our example.
2. HTTP(s) load balancer: It can be created using Ingress. For more information, refer to this post that talks about Ingress in detail.
In the MySQL service, I’ve not specified any type, in that case, type ‘ClusterIP’ will get used, which will make sure that MySQL container is exposed to the cluster and the Python app can access it.

If you check the app.py, you can see that I have used “mysql-service.default” as a hostname. “Mysql-service.default” is a DNS name of the service. The Python application will refer to that DNS name while accessing the MySQL Database.

Now, let’s actually setup the components from the configurations. As mentioned above, we will first create services followed by deployments.

Services:
```
$ kubectl create -f mysql-service.yaml
$ kubectl create -f testapp-service.yaml
```
Deployments:
```
$ kubectl create -f mysql-deployment.yaml
$ kubectl create -f testapp-deployment.yaml
```
Check the status of the pods and services. Wait till all pods come to the running state and Python application service to get external IP like below:
```
$ kubectl get services
NAME            CLUSTER-IP      EXTERNAL-IP   PORT(S)        AGE
kubernetes      10.55.240.1     <none>        443/TCP        5h
mysql-service   10.55.240.57    <none>        3306/TCP       1m
test-service    10.55.246.105   35.185.225.67     80:32546/TCP   11s
```
Once you get the external IP, then you should be able to make APIs calls using simple curl requests.

Eg. To Store Data :
```
curl -H "Content-Type: application/x-www-form-urlencoded" -X POST  http://35.185.225.67:80/storedata -d id=1 -d name=NoOne
```
Eg. To Get Data :
```
curl 35.185.225.67:80/getdata/1
```
At this stage your application is completely deployed and is externally accessible.

Manual scaling of pods

Scaling your application up or down in Kubernetes is quite straightforward. Let’s scale up the test-app deployment.
```
$ kubectl scale deployment test-app --replicas=3
```
Deployment configuration for test-app will get updated and you can see 3 replicas of test-app are running. Verify it using,
```
kubectl get pods
```
In the same manner, you can scale down your application by reducing the replica count.

Cleanup :

Un-deploying an application from Kubernetes is also quite straightforward. All we have to do is delete the services and delete the deployments. The only caveat is that the deletion of the load balancer is an asynchronous process. You have to wait until it gets deleted.
```
$ kubectl delete service mysql-service
$ kubectl delete service test-service
```
The above command will deallocate Load Balancer which was created as a part of test-service. You can check the status of the load balancer with the following command.
```
$ gcloud compute forwarding-rules list
```
Once the load balancer is deleted, you can clean-up the deployments as well.
```
$ kubectl delete deployments test-app
$ kubectl delete deployments mysql
```
Delete the Cluster:
```
$ gcloud container clusters delete my-first-cluster
```
Conclusion

In this blog, we saw how easy it is to deploy, scale & terminate applications on Google Container Engine. Google Container Engine abstracts away all the complexity of Kubernetes and gives us a robust platform to run containerised applications. I am super excited about what the future holds for Kubernetes!

Check out some of Velotio’s other blogs on Kubernetes.
December 12, 2022
Idiot-proof Coding with Node.js and Express.js
Node.js has become the most popular framework for web development surpassing Ruby on Rails and Django in terms of the popularity.The growing popularity of full stack development along with the performance benefits of asynchronous programming has led to the rise of Node’s popularity. ExpressJs is a minimalistic, unopinionated and the most popular web framework built for Node which has become the de-facto framework for many projects.
Note — This article is about building a Restful API server with ExpressJs . I won’t be delving into a templating library like handlebars to manage the views.

A quick search on google will lead you a ton of articles agreeing with what I just said which could validate the theory. Your next step would be to go through a couple of videos about ExpressJS on Youtube, try hello world with a boilerplate template, choose few recommended middleware for Express (Helmet, Multer etc), an ORM (mongoose if you are using Mongo or Sequelize if you are using relational DB) and start building the APIs. Wow, that was so fast!

The problem starts to appear after a few weeks when your code gets larger and complex and you realise that there is no standard coding practice followed across the client and the server code, refactoring or updating the code breaks something else, versioning of the APIs becomes difficult, call backs have made your life hell (you are smart if you are using Promises but have you heard of async-await?).

Do you think you your code is not so idiot-proof anymore? Don’t worry! You aren’t the only one who thinks this way after reading this.

Let me break the suspense and list down the technologies and libraries used in our idiot-proof code before you get restless.
1. Node 8.11.3: This is the latest LTS release from Node. We are using all the ES6 features along with async-await. We have the latest version of ExpressJs (4.16.3).
2. Typescript: It adds an optional static typing interface to Javascript and also gives us familiar constructs like classes (Es6 also gives provides class as a construct) which makes it easy to maintain a large codebase.
3. Swagger: It provides a specification to easily design, develop, test and document RESTful interfaces. Swagger also provides many open source tools like codegen and editor that makes it easy to design the app.
4. TSLint: It performs static code analysis on Typescript for maintainability, readability and functionality errors.
5. Prettier: It is an opinionated code formatter which maintains a consistent style throughout the project. This only takes care of the styling like the indentation (2 or 4 spaces), should the arguments remain on the same line or go to the next line when the line length exceeds 80 characters etc.
6. Husky: It allows you to add git hooks (pre-commit, pre-push) which can trigger TSLint, Prettier or Unit tests to automatically format the code and to prevent the push if the lint or the tests fail.
Before you move to the next section I would recommend going through the links to ensure that you have a sound understanding of these tools.

Now I’ll talk about some of the challenges we faced in some of our older projects and how we addressed these issues in the newer projects with the tools/technologies listed above.

Formal API definition

A problem that everyone can relate to is the lack of formal documentation in the project. Swagger addresses a part of this problem with their OpenAPI specification which defines a standard to design REST APIs which can be discovered by both machines and humans. As a practice, we first design the APIs in swagger before writing the code. This has 3 benefits:
- It helps us to focus only on the design without having to worry about the code, scaffolder, naming conventions etc. Our API designs are consistent with the implementation because of this focused approach.
- We can leverage tools like swagger-express-mw to internally wire the routes in the API doc to the controller, validate request and response object from their definitions etc.
- Collaboration between teams becomes very easy, simple and standardised because of the Swagger specification.
Code Consistency

We wanted our code to look consistent across the stack (UI and Backend)and we use ESlint to enforce this consistency.
Example –
Node traditionally used “require” and the UI based frameworks used “import” based syntax to load the modules. We decided to follow ES6 style across the project and these rules are defined with ESLint.

Note — We have made slight adjustments to the TSlint for the backend and the frontend to make it easy for the developers. For example, we allow upto 120 characters in React as some of our DOM related code gets lengthy very easily.

Code Formatting

This is as important as maintaining the code consistency in the project. It’s easy to read a code which follows a consistent format like indentation, spaces, line breaks etc. Prettier does a great job at this. We have also integrated Prettier with Typescript to highlight the formatting errors along with linting errors. IDE like VS Code also has prettier plugin which supports features like auto-format to make it easy.

Strict Typing

Typescript can be leveraged to the best only if the application follows strict typing. We try to enforce it as much as possible with exceptions made in some cases (mostly when a third party library doesn’t have a type definition). This has the following benefits:
- Static code analysis works better when your code is strongly typed. We discover about 80–90% of the issues before compilation itself using the plugins mentioned above.
- Refactoring and enhancements becomes very simple with Typescript. We first update the interface or the function definition and then follow the errors thrown by Typescript compiler to refactor the code.
Git Hooks

Husky’s “pre-push” hook runs TSLint to ensure that we don’t push the code with linting issues. If you follow TDD (in the way it’s supposed to be done), then you can also run unit tests before pushing the code. We decided to go with pre-hooks because
– Not everyone has CI from the very first day. With a git hook, we at least have some code quality checks from the first day.
– Running lint and unit tests on the dev’s system will leave your CI with more resources to run integration and other complex tests which are not possible to do in local environment.
– You force the developer to fix the issues at the earliest which results in better code quality, faster code merges and release.

Async-await

We were using promises in our project for all the asynchronous operations. Promises would often lead to a long chain of then-error blocks which is not very comfortable to read and often result in bugs when it got very long (it goes without saying that Promises are much better than the call back function pattern). Async-await provides a very clean syntax to write asynchronous operations which just looks like sequential code. We have seen a drastic improvement in the code quality, fewer bugs and better readability after moving to async-await.

Hope this article gave you some insights into tools and libraries that you can use to build a scalable ExpressJS app.
December 12, 2022
Ensure Continuous Delivery On Kubernetes With GitOps’ Argo CD
What is GitOps?

GitOps is a Continuous Deployment model for cloud-native applications. In GitOps, the Git repositories which contain the declarative descriptions of the infrastructure are considered as the single source of truth for the desired state of the system and we need to have an automated way to ensure that the deployed state of the system always matches the state defined in the Git repository. All the changes (deployment/upgrade/rollback) on the environment are triggered by changes (commits) made on the Git repository

“The artifacts that we run on any environment always have a corresponding code for them on some Git repositories. Can we say the same thing for our infrastructure code?”

Infrastructure as code tools, completely declarative orchestration tools like Kubernetes allow us to represent the entire state of our system in a declarative way. GitOps intends to make use of this ability and make infrastructure-related operations more developer-centric.

Role of Infrastructure as Code (IaC) in GitOps

The ability to represent the infrastructure as code is at the core of GitOps. But just having versioned controlled infrastructure as code doesn’t mean GitOps, we also need to have a mechanism in place to keep (try to keep) our deployed state in sync with the state we define in the Git repository.

“Infrastructure as Code is necessary but not sufficient to achieve GitOps”

GitOps does pull-based deployments

Most of the deployment pipelines we see currently, push the changes in the deployed environment. For example, consider that we need to upgrade our application to a newer version then we will update its docker image tag in some repository which will trigger a deployment pipeline and update the deployed application. Here the changes were pushed on the environment. In GitOps, we just need to update the image tag on the Git repository for that environment and the changes will be pulled to the environment to match the updated state in the Git repository. The magic of keeping the deployed state in sync with state-defined on Git is achieved with the help of operators/agents. The operator is like a control loop which can identify differences between the deployed state and the desired state and make sure they are the same.

Key benefits of GitOps:
1. All the changes are verifiable and auditable as they make their way into the system through Git repositories.
2. Easy and consistent replication of the environment as Git repository is the single source of truth. This makes disaster recovery much quicker and simpler.
3. More developer-centric experience for operating infrastructure. Also a smaller learning curve for deploying dev environments.
4. Consistent rollback of application as well as infrastructure state.
Introduction to Argo CD

Argo CD is a continuous delivery tool that works on the principles of GitOps and is built specifically for Kubernetes. The product was developed and open-sourced by Intuit and is currently a part of CNCF.

Key components of Argo CD:
1. API Server: Just like K8s, Argo CD also has an API server that exposes APIs that other systems can interact with. The API server is responsible for managing the application, repository and cluster credentials, enforcing authentication and authorization, etc.
2. Repository server: The repository server keeps a local cache of the Git repository, which holds the K8s manifest files for the application. This service is called by other services to get the K8s manifests.
3. Application controller: The application controller continuously watches the deployed state of the application and compares it with the desired state of the application, reports the API server whenever they are not in sync with each other and seldom takes corrective actions as well. It is also responsible for executing user-defined hooks for various lifecycle events of the application.
Key objects/resources in Argo CD:
1. Application: Argo CD allows us to represent the instance of the application which we want to deploy in an environment by creating Kubernetes objects of a custom resource definition(CRD) named Application. In the specification of Application type objects, we specify the source (repository) of our application’s K8s manifest files, the K8s server where we want to deploy those manifests, namespace, and other information.
2. AppProject: Just like Application, Argo CD provides another CRD named AppProject. AppProjects are used to logically group related-applications.
3. Repo Credentials: In the case of private repositories, we need to provide access credentials. For credentials, Argo CD uses the K8s secrets and config map. First, we create objects of secret types and then we update a special-purpose configuration map named argocd-cm with the repository URL and the secret which contains the credentials.‍
4. Cluster Credentials: Along with Git repository credentials, we also need to provide the K8s cluster credentials. These credentials are also managed using K8s secret, we are required to add the label argocd.argoproj.io/secret-type: cluster to these secrets.
Demo:

Enough of theory, let’s try out the things we discussed above. For the demo, I have created a simple app named message-app. This app reads a message set in the environment variable named MESSAGE. We will populate the values of this environment variable using a K8s config map. I have kept the K8s manifest files for the app in a separate repository. We have the application and the K8s manifest files ready. Now we are all set to install Argo CD and deploy our application.

Installing Argo CD:

For installing Argo CD, we first need to create a namespace named argocd.
```
kubectl create namespace argocd
kubectl apply -n argocd -f https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/install.yaml
```
Applying the files from the argo-cd repo directly is fine for demo purposes, but in actual environments, you must copy the file in your repository before applying them.

We can see that this command has created the core components and CRDs we discussed earlier in the blog. There are some additional resources as well but we can ignore them for the time being.

Accessing the Argo CD GUI

We have the Argo CD running in our cluster, Argo CD also provides a GUI which gives us a graphical representation of our k8s objects. It allows us to view events, pod logs, and other configurations.

By default, the GUI service is not exposed outside the cluster. Let us update its service type to LoadBalancer so that we can access it from outside.
```
kubectl patch svc argocd-server -n argocd -p '{"spec": {"type": "LoadBalancer"}}'
```
After this, we can use the external IP of the argocd-server service and access the GUI.

The initial username is admin and the password is the name of the api-server pod. The password can be obtained by listing the pods in the argocd namespace or directly by this command.
```
kubectl get pods -n argocd -l app.kubernetes.io/name=argocd-server -o name | cut -d'/' -f 2 
```
Deploy the app:

Now let’s go ahead and create our application for the staging environment for our message app.
apiVersion: argoproj.io/v1alpha1 kind: Application metadata: name: message-app-staging namespace: argocd environment: staging finalizers: - resources-finalizer.argocd.argoproj.io spec: project: default # Source of the application manifests source: repoURL: https://github.com/akash-gautam/message-app-manifests.git targetRevision: HEAD path: manifests/staging # Destination cluster and namespace to deploy the application destination: server: https://kubernetes.default.svc namespace: staging syncPolicy: automated: prune: false selfHeal: false
```
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: message-app-staging
  namespace: argocd
  environment: staging
  finalizers:
    - resources-finalizer.argocd.argoproj.io
spec:
  project: default

  # Source of the application manifests
  source:
    repoURL: https://github.com/akash-gautam/message-app-manifests.git
    targetRevision: HEAD
    path: manifests/staging

  # Destination cluster and namespace to deploy the application
  destination:
    server: https://kubernetes.default.svc
    namespace: staging

  syncPolicy:
    automated:
      prune: false
      selfHeal: false
```
In the application spec, we have specified the repository, where our manifest files are stored and also the path of the files in the repository.

We want to deploy our app in the same k8s cluster where ArgoCD is running so we have specified the local k8s service URL in the destination. We want the resources to be deployed in the staging namespace, so we have set it accordingly.

In the sync policy, we have enabled automated sync. We have kept the project as default.

Adding the resources-finalizer.argocd.argoproj.io ensures that all the resources created for the application are deleted when the Application is deleted. This is fine for our demo setup but might not always be desirable in real-life scenarios.

Our git repos are public so we don’t need to create secrets for git repo credentials.

We are deploying in the same cluster where Argo CD itself is running. As this is a demo setup, we can use the admin user created by Argo CD, so we don’t need to create secrets for cluster credentials either.

Now let’s go ahead and create the application and see the magic happen.
```
kubectl apply -f message-app-staging.yaml
```
As soon as the application is created, we can see it on the GUI.

By clicking on the application, we can see all the Kubernetes objects created for it.

It also shows the objects which are indirectly created by the objects we create. In the above image, we can see the replica set and endpoint object which are created as a result of creating the deployment and service respectively.

We can also click on the individual objects and see their configuration. For pods, we can see events and logs as well.

As our app is deployed now, we can grab public IP of message-app service and access it on the browser.

We can see that our app is deployed and accessible.

Updating the app

For updating our application, all we need to do is commit our changes to the GitHub repository. We know the message-app just displays the message we pass to it via. Config map, so let’s update the message and push it to the repository.
```
apiVersion: v1
kind: ConfigMap
metadata:
  name: message-configmap
  labels:
    app: message-app
data:
  MESSAGE: "This too shall pass" #Put the message you want to display here.
```
Once the commit is done, Argo CD will start to sync again.

Once the sync is done, we will restart our message app pod, so that it picks up the latest values in the config map. Then we need to refresh the browser to see updated values.

As we discussed earlier, for making any changes to the environment, we just need to update the repo which is being used as the source for the environment and then the changes will get pulled in the environment.

We can follow an exact similar approach and deploy the application to the production environment as well. We just need to create a new application object and set the manifest path and deployment namespace accordingly.

Conclusion:

It’s still early days for GitOps, but it has already been successfully implemented at scale by many organizations. As the GitOps tools mature along with the ever-growing adoption of Kubernetes, I think many organizations will consider adopting GitOps soon. GitOps is not limited only to Kubernetes, but the completely declarative nature of Kubernetes makes it simpler to achieve GitOps. Argo CD is a deployment tool that’s tailored for Kubernetes and allows us to do deployments in a Kubernetes native way while following the principles of GitOps.I hope this blog helped you in understanding how what and why of GitOps and gave some insights to Argo CD.
December 12, 2022
A Primer on HTTP Load Balancing in Kubernetes using Ingress on Google Cloud Platform
Containerized applications and Kubernetes adoption in cloud environments is on the rise. One of the challenges while deploying applications in Kubernetes is exposing these containerized applications to the outside world. This blog explores different options via which applications can be externally accessed with focus on Ingress – a new feature in Kubernetes that provides an external load balancer. This blog also provides a simple hand-on tutorial on Google Cloud Platform (GCP).

Ingress is the new feature (currently in beta) from Kubernetes which aspires to be an Application Load Balancer intending to simplify the ability to expose your applications and services to the outside world. It can be configured to give services externally-reachable URLs, load balance traffic, terminate SSL, offer name based virtual hosting etc. Before we dive into Ingress, let’s look at some of the alternatives currently available that help expose your applications, their complexities/limitations and then try to understand Ingress and how it addresses these problems.

Current ways of exposing applications externally:

There are certain ways using which you can expose your applications externally. Lets look at each of them:

EXPOSE Pod:

You can expose your application directly from your pod by using a port from the node which is running your pod, mapping that port to a port exposed by your container and using the combination of your HOST-IP:HOST-PORT to access your application externally. This is similar to what you would have done when running docker containers directly without using Kubernetes. Using Kubernetes you can use hostPortsetting in service configuration which will do the same thing. Another approach is to set hostNetwork: true in service configuration to use the host’s network interface from your pod.

Limitations:
- In both scenarios you should take extra care to avoid port conflicts at the host, and possibly some issues with packet routing and name resolutions.
- This would limit running only one replica of the pod per cluster node as the hostport you use is unique and can bind with only one service.
EXPOSE Service:

Kubernetes services primarily work to interconnect different pods which constitute an application. You can scale the pods of your application very easily using services. Services are not primarily intended for external access, but there are some accepted ways to expose services to the external world.

Basically, services provide a routing, balancing and discovery mechanism for the pod’s endpoints. Services target pods using selectors, and can map container ports to service ports. A service exposes one or more ports, although usually, you will find that only one is defined.

A service can be exposed using 3 ServiceType choices:
- ClusterIP: Exposes the service on a cluster-internal IP. Choosing this value makes the service only reachable from within the cluster. This is the default ServiceType.
- NodePort: Exposes the service on each Node’s IP at a static port (the NodePort). A ClusterIP service, to which the NodePort service will route, is automatically created. You’ll be able to contact the NodePort service, from outside the cluster, by requesting <nodeip>:<nodeport>.Here NodePort remains fixed and NodeIP can be any node IP of your Kubernetes cluster.</nodeport></nodeip>
- LoadBalancer: Exposes the service externally using a cloud provider’s load balancer (eg. AWS ELB). NodePort and ClusterIP services, to which the external load balancer will route, are automatically created.
- ExternalName: Maps the service to the contents of the externalName field (e.g. foo.bar.example.com), by returning a CNAME record with its value. No proxying of any kind is set up. This requires version 1.7 or higher of kube-dns
Limitations:
- If we choose NodePort to expose our services, kubernetes will generate ports corresponding to the ports of your pods in the range of 30000-32767. You will need to add an external proxy layer that uses DNAT to expose more friendly ports. The external proxy layer will also have to take care of load balancing so that you leverage the power of your pod replicas. Also it would not be easy to add TLS or simple host header routing rules to the external service.
- ClusterIP and ExternalName similarly while easy to use have the limitation where we can add any routing or load balancing rules.
- Choosing LoadBalancer is probably the easiest of all methods to get your service exposed to the internet. The problem is that there is no standard way of telling a Kubernetes service about the elements that a balancer requires, again TLS and host headers are left out. Another limitation is reliance on an external load balancer (AWS’s ELB, GCP’s Cloud Load Balancer etc.)
Endpoints

Endpoints are usually automatically created by services, unless you are using headless services and adding the endpoints manually. An endpoint is a host:port tuple registered at Kubernetes, and in the service context it is used to route traffic. The service tracks the endpoints as pods, that match the selector are created, deleted and modified. Individually, endpoints are not useful to expose services, since they are to some extent ephemeral objects.

Summary

If you can rely on your cloud provider to correctly implement the LoadBalancer for their API, to keep up-to-date with Kubernetes releases, and you are happy with their management interfaces for DNS and certificates, then setting up your services as type LoadBalancer is quite acceptable.

On the other hand, if you want to manage load balancing systems manually and set up port mappings yourself, NodePort is a low-complexity solution. If you are directly using Endpoints to expose external traffic, perhaps you already know what you are doing (but consider that you might have made a mistake, there could be another option).

Given that none of these elements has been originally designed to expose services to the internet, their functionality may seem limited for this purpose.

Understanding Ingress

Traditionally, you would create a LoadBalancer service for each public application you want to expose. Ingress gives you a way to route requests to services based on the request host or path, centralizing a number of services into a single entrypoint.

Ingress is split up into two main pieces. The first is an Ingress resource, which defines how you want requests routed to the backing services and second is the Ingress Controller which does the routing and also keeps track of the changes on a service level.

Ingress Resources

The Ingress resource is a set of rules that map to Kubernetes services. Ingress resources are defined purely within Kubernetes as an object that other entities can watch and respond to.

Ingress Supports defining following rules in beta stage:
- host header: Forward traffic based on domain names.
- paths: Looks for a match at the beginning of the path.
- TLS: If the ingress adds TLS, HTTPS and a certificate configured through a secret will be used.
When no host header rules are included at an Ingress, requests without a match will use that Ingress and be mapped to the backend service. You will usually do this to send a 404 page to requests for sites/paths which are not sent to the other services. Ingress tries to match requests to rules, and forwards them to backends, which are composed of a service and a port.

Ingress Controllers

Ingress controller is the entity which grants (or remove) access, based on the changes in the services, pods and Ingress resources. Ingress controller gets the state change data by directly calling Kubernetes API.

Ingress controllers are applications that watch Ingresses in the cluster and configure a balancer to apply those rules. You can configure any of the third party balancers like HAProxy, NGINX, Vulcand or Traefik to create your version of the Ingress controller. Ingress controller should track the changes in ingress resources, services and pods and accordingly update configuration of the balancer.

Ingress controllers will usually track and communicate with endpoints behind services instead of using services directly. This way some network plumbing is avoided, and we can also manage the balancing strategy from the balancer. Some of the open source implementations of Ingress Controllers can be found here.

Now, let’s do an exercise of setting up a HTTP Load Balancer using Ingress on Google Cloud Platform (GCP), which has already integrated the ingress feature in it’s Container Engine (GKE) service.

Ingress-based HTTP Load Balancer in Google Cloud Platform

The tutorial assumes that you have your GCP account setup done and a default project created. We will first create a Container cluster, followed by deployment of a nginx server service and an echoserver service. Then we will setup an ingress resource for both the services, which will configure the HTTP Load Balancer provided by GCP

Basic Setup

Get your project ID by going to the “Project info” section in your GCP dashboard. Start the Cloud Shell terminal, set your project id and the compute/zone in which you want to create your cluster.
```
$ gcloud config set project glassy-chalice-129514$ 
gcloud config set compute/zone us-east1-d
# Create a 3 node cluster with name “loadbalancedcluster”$ 
gcloud container clusters create loadbalancedcluster  
```
Fetch the cluster credentials for the kubectl tool:
```
$ gcloud container clusters get-credentials loadbalancedcluster --zone us-east1-d --project glassy-chalice-129514
```
Step 1: Deploy an nginx server and echoserver service
```
$ kubectl run nginx --image=nginx --port=80
$ kubectl run echoserver --image=gcr.io/google_containers/echoserver:1.4 --port=8080
$ kubectl get deployments
NAME         DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
echoserver   1         1         1            1           15s
nginx        1         1         1            1           26m
```
Step 2: Expose your nginx and echoserver deployment as a service internally

Create a Service resource to make the nginx and echoserver deployment reachable within your container cluster:
```
$ kubectl expose deployment nginx --target-port=80  --type=NodePort
$ kubectl expose deployment echoserver --target-port=8080 --type=NodePort
```
When you create a Service of type NodePort with this command, Container Engine makes your Service available on a randomly-selected high port number (e.g. 30746) on all the nodes in your cluster. Verify the Service was created and a node port was allocated:
```
$ kubectl get service nginx
NAME      CLUSTER-IP     EXTERNAL-IP   PORT(S)        AGE
nginx     10.47.245.54   <nodes>       80:30746/TCP   20s
$ kubectl get service echoserver
NAME         CLUSTER-IP    EXTERNAL-IP   PORT(S)          AGE
echoserver   10.47.251.9   <nodes>       8080:32301/TCP   33s
```
In the output above, the node port for the nginx Service is 30746 and for echoserver service is 32301. Also, note that there is no external IP allocated for this Services. Since the Container Engine nodes are not externally accessible by default, creating this Service does not make your application accessible from the Internet. To make your HTTP(S) web server application publicly accessible, you need to create an Ingress resource.

Step 3: Create an Ingress resource

On Container Engine, Ingress is implemented using Cloud Load Balancing. When you create an Ingress in your cluster, Container Engine creates an HTTP(S) load balancer and configures it to route traffic to your application. Container Engine has internally defined an Ingress Controller, which takes the Ingress resource as input for setting up proxy rules and talk to Kubernetes API to get the service related information.

The following config file defines an Ingress resource that directs traffic to your nginx and echoserver server:
```
apiVersion: extensions/v1beta1
kind: Ingress
metadata: 
name: fanout-ingress
spec: 
rules: 
- http:     
paths:     
- path: /       
backend:         
serviceName: nginx         
servicePort: 80     
- path: /echo       
backend:         
serviceName: echoserver         
servicePort: 8080
```
To deploy this Ingress resource run in the cloud shell:
```
$ kubectl apply -f basic-ingress.yaml
```
Step 4: Access your application

Find out the external IP address of the load balancer serving your application by running:
```
$ kubectl get ingress fanout-ingres
NAME             HOSTS     ADDRESS          PORTS     AG
fanout-ingress   *         130.211.36.168   80        36s    
```
Use http://<external-ip-address> </external-ip-address>and http://<external-ip-address>/echo</external-ip-address> to access nginx and the echo-server.

Summary

Ingresses are simple and very easy to deploy, and really fun to play with. However, it’s currently in beta phase and misses some of the features that may restrict it from production use. Stay tuned to get updates in Ingress on Kubernetes page and their Github repo.

References
December 12, 2022
BigQuery 101: All the Basics You Need to Know
Google BigQuery is an enterprise data warehouse built using BigTable and Google Cloud Platform. It’s serverless and completely managed. BigQuery works great with all sizes of data, from a 100 row Excel spreadsheet to several Petabytes of data. Most importantly, it can execute a complex query on those data within a few seconds.

We need to note before we proceed, BigQuery is not a transactional database. It takes around 2 seconds to run a simple query like ‘SELECT * FROM bigquery-public-data.object LIMIT 10’ on a 100 KB table with 500 rows. Hence, it shouldn’t be thought of as OLTP (Online Transaction Processing) database. BigQuery is for Big Data!

BigQuery supports SQL-like query, which makes it user-friendly and beginner friendly. It’s accessible via its web UI, command-line tool, or client library (written in C#, Go, Java, Node.js, PHP, Python, and Ruby). You can also take advantage of its REST APIs and get our job` done by sending a JSON request.

Now, let’s dive deeper to understand it better. Suppose you are a data scientist (or a startup which analyzes data) and you need to analyze terabytes of data. If you choose a tool like MySQL, the first step before even thinking about any query is to have an infrastructure in place, that can store this magnitude of data.

Designing this setup itself will be a difficult task because you have to figure out what will be the RAM size, DCOS or Kubernetes, and other factors. And if you have streaming data coming, you will need to set up and maintain a Kafka cluster. In BigQuery, all you have to do is a bulk upload of your CSV/JSON file, and you are done. BigQuery handles all the backend for you. If you need streaming data ingestion, you can use Fluentd. Another advantage of this is that you can connect Google Analytics with BigQuery seamlessly.

BigQuery is serverless, highly available, and petabyte scalable service which allows you to execute complex SQL queries quickly. It lets you focus on analysis rather than handling infrastructure. The idea of hardware is completely abstracted and not visible to us, not even as virtual machines.

Architecture of Google BigQuery

You don’t need to know too much about the underlying architecture of BigQuery. That’s actually the whole idea of it – you don’t need to worry about architecture and operation.

However, understanding BigQuery Architecture helps us in controlling costs, optimizing query performance, and optimizing storage. BigQuery is built using the Google Dremel paper.

Quoting an Abstract from the Google Dremel Paper –

“Dremel is a scalable, interactive ad-hoc query system for analysis of read-only nested data. By combining multi-level execution trees and columnar data layout, it is capable of running aggregation queries over trillion-row tables in seconds. The system scales to thousands of CPUs and petabytes of data and has thousands of users at Google. In this paper, we describe the architecture and implementation of Dremel and explain how it complements MapReduce-based computing. We present a novel columnar storage representation for nested records and discuss experiments on few-thousand node instances of the system.”

Dremel was in production at Google since 2006. Google used it for the following tasks –
- Analysis of crawled web documents.
- Tracking install data for applications on Android Market.
- Crash reporting for Google products.
- OCR results from Google Books.
- Spam analysis.
- Debugging of map tiles on Google Maps.
- Tablet migrations in managed Bigtable instances.
- Results of tests run on Google’s distributed build system.
- Disk I/O statistics for hundreds of thousands of disks.
- Resource monitoring for jobs run in Google’s data centers.
- Symbols and dependencies in Google’s codebase.
BigQuery is much more than Dremel. Dremel is just a query execution engine, whereas Bigquery is based on interesting technologies like Borg (predecessor of Kubernetes) and Colossus. Colossus is the successor to the Google File System (GFS) as mentioned in Google Spanner Paper.

How BigQuery Stores Data?

BigQuery stores data in a columnar format – Capacitor (which is a successor of ColumnarIO). BigQuery achieves very high compression ratio and scan throughput. Unlike ColumnarIO, now on BigQuery, you can directly operate on compressed data without decompressing it.

Columnar storage has the following advantages:
- Traffic minimization – When you submit a query, the required column values on each query are scanned and only those are transferred on query execution. E.g., a query `SELECT title FROM Collection` would access the title column values only.
- Higher compression ratio – Columnar storage can achieve a compression ratio of 1:10, whereas ordinary row-based storage can compress at roughly 1:3.
(Image source: Google Dremel Paper)

Columnar storage has the disadvantage of not working efficiently when updating existing records. That is why Dremel doesn’t support any update queries.

How the Query Gets Executed?

BigQuery depends on Borg for data processing. Borg simultaneously instantiates hundreds of Dremel jobs across required clusters made up of thousands of machines. In addition to assigning compute capacity for Dremel jobs, Borg handles fault-tolerance as well.

Now, how do you design/execute a query which can run on thousands of nodes and fetches the result? This challenge was overcome by using the Tree Architecture. This architecture forms a gigantically parallel distributed tree for pushing down a query to the tree and aggregating the results from the leaves at a blazingly fast speed.

(Image source: Google Dremel Paper)

BigQuery vs. MapReduce

The key differences between BigQuery and MapReduce are –
- Dremel is designed as an interactive data analysis tool for large datasets
- MapReduce is designed as a programming framework to batch process large datasets
Moreover, Dremel finishes most queries within seconds or tens of seconds and can even be used by non-programmers, whereas MapReduce takes much longer (sometimes even hours or days) to process a query.

Following is a comparison on running MapReduce on a row and columnar DB:

(Image source: Google Dremel Paper)

Another important thing to note is that BigQuery is meant to analyze structured data (SQL) but in MapReduce, you can write logic for unstructured data as well.

Comparing BigQuery and Redshift

In Redshift, you need to allocate different instance types and create your own clusters. The benefit of this is that it lets you tune the compute/storage to meet your needs. However, you have to be aware of (virtualized) hardware limits and scale up/out based on that. Note that you are charged by the hour for each instance you spin up.

In BigQuery, you just upload the data and query it. It is a truly managed service. You are charged by storage, streaming inserts, and queries.

There are more similarities in both the data warehouses than the differences.

A smart user will definitely take advantage of the hybrid cloud (GCE+AWS) and leverage different services offered by both the ecosystems. Check out your quintessential guide to AWS Athena here.

Getting Started With Google BigQuery

Following is a quick example to show how you can quickly get started with BigQuery:
1. There are many public datasets available on bigquery, you are going to play with ‘bigquery-public-data:stackoverflow’ dataset. You can click on the “Add Data” button on the left panel and select datasets.
2. Next, find a language that has the best community, based on the response time. You can write the following query to do that.
WITH question_answers_join AS ( SELECT * , GREATEST(1, TIMESTAMP_DIFF(answers.first, creation_date, minute)) minutes_2_answer FROM ( SELECT id, creation_date, title , (SELECT AS STRUCT MIN(creation_date) first, COUNT(*) c FROM `bigquery-public-data.stackoverflow.posts_answers` WHERE a.id=parent_id ) answers , SPLIT(tags, '|') tags FROM `bigquery-public-data.stackoverflow.posts_questions` a WHERE EXTRACT(year FROM creation_date) > 2014 ) ) SELECT COUNT(*) questions, tag , ROUND(EXP(AVG(LOG(minutes_2_answer))), 2) mean_geo_minutes , APPROX_QUANTILES(minutes_2_answer, 100)[SAFE_OFFSET(50)] median FROM question_answers_join, UNNEST(tags) tag WHERE tag IN ('javascript', 'python', 'rust', 'java', 'scala', 'ruby', 'go', 'react', 'c', 'c++') AND answers.c > 0 GROUP BY tag ORDER BY mean_geo_minutes
```
WITH question_answers_join AS (
  SELECT *
    , GREATEST(1, TIMESTAMP_DIFF(answers.first, creation_date, minute)) minutes_2_answer
  FROM (
    SELECT id, creation_date, title
      , (SELECT AS STRUCT MIN(creation_date) first, COUNT(*) c
         FROM `bigquery-public-data.stackoverflow.posts_answers` 
         WHERE a.id=parent_id
      ) answers
      , SPLIT(tags, '|') tags
    FROM `bigquery-public-data.stackoverflow.posts_questions` a
    WHERE EXTRACT(year FROM creation_date) > 2014
  )
)
SELECT COUNT(*) questions, tag
  , ROUND(EXP(AVG(LOG(minutes_2_answer))), 2) mean_geo_minutes
  , APPROX_QUANTILES(minutes_2_answer, 100)[SAFE_OFFSET(50)] median
FROM question_answers_join, UNNEST(tags) tag
WHERE tag IN ('javascript', 'python', 'rust', 'java', 'scala', 'ruby', 'go', 'react', 'c', 'c++')
AND answers.c > 0
GROUP BY tag
ORDER BY mean_geo_minutes
```
3. Now you can execute the query and get results –

You can see that C has the best community followed by JavaScript!

How to do Machine Learning on BigQuery?

Now that you have a sound understanding of BigQuery. It’s time for some real action.

As discussed above, you can connect Google Analytics with BigQuery by going to the Google Analytics Admin panel, then enable BigQuery by clicking on PROPERTY column, click All Products, then click Link BigQuery. After that, you need to enter BigQuery ID (or project number) and then BigQuery will be linked to Google Analytics. Note – Right now BigQuery integration is only available to Google Analytics 360.

Assuming that you already have uploaded your google analytics data, here is how you can create a logistic regression model. Here, you are predicting whether a website visitor will make a transaction or not.
CREATE MODEL `velotio_tutorial.sample_model` OPTIONS(model_type='logistic_reg') AS SELECT IF(totals.transactions IS NULL, 0, 1) AS label, IFNULL(device.operatingSystem, "") AS os, device.isMobile AS is_mobile, IFNULL(geoNetwork.country, "") AS country, IFNULL(totals.pageviews, 0) AS pageviews FROM `bigquery-public-data.google_analytics_sample.ga_sessions_*` WHERE _TABLE_SUFFIX BETWEEN '20190401' AND '20180630'
```
CREATE MODEL `velotio_tutorial.sample_model`
OPTIONS(model_type='logistic_reg') AS
SELECT
  IF(totals.transactions IS NULL, 0, 1) AS label,
  IFNULL(device.operatingSystem, "") AS os,
  device.isMobile AS is_mobile,
  IFNULL(geoNetwork.country, "") AS country,
  IFNULL(totals.pageviews, 0) AS pageviews
FROM
  `bigquery-public-data.google_analytics_sample.ga_sessions_*`
WHERE
  _TABLE_SUFFIX BETWEEN '20190401' AND '20180630'
```
Create a model named ‘velotio_tutorial.sample_model’. Now set the ‘model_type’ as ‘logistic_reg’ because you want to train a logistic regression model. A logistic regression model splits input data into two classes and gives the probability that the data is in one of the classes. Usually, in “spam or not spam” type of problems, you use logistic regression. Here, the problem is similar – a transaction will be made or not.

The above query gets the total number of page views, the country from where the session originated, the operating system of visitors device, the total number of e-commerce transactions within the session, etc.

Now you just press run query to execute the query.

Conclusion

BigQuery is a query service that allows us to run SQL-like queries against multiple terabytes of data in a matter of seconds. If you have structured data, BigQuery is the best option to go for. It can help even a non-programmer to get the analytics right!

Learn how to build an ETL Pipeline for MongoDB & Amazon Redshift using Apache Airflow.

If you need help with using machine learning in product development for your organization, connect with experts at Velotio!
December 12, 2022
Building Your First AWS Serverless Application? Here’s Everything You Need to Know
A serverless architecture is a way to implement and run applications and services or micro-services without need to manage infrastructure. Your application still runs on servers, but all the servers management is done by AWS. Now we don’t need to provision, scale or maintain servers to run our applications, databases and storage systems. Services which are developed by developers who don’t let developers build application from scratch.

Why Serverless
1. More focus on development rather than managing servers.
2. Cost Effective.
3. Application which scales automatically.
4. Quick application setup.
Services For ServerLess

For implementing serverless architecture there are multiple services which are provided by cloud partners though we will be exploring most of the services from AWS. Following are the services which we can use depending on the application requirement.
1. Lambda: It is used to write business logic / schedulers / functions.
2. S3: It is mostly used for storing objects but it also gives the privilege to host WebApps. You can host a static website on S3.
3. API Gateway: It is used for creating, publishing, maintaining, monitoring and securing REST and WebSocket APIs at any scale.
4. Cognito: It provides authentication, authorization & user management for your web and mobile apps. Your users can sign in directly sign in with a username and password or through third parties such as Facebook, Amazon or Google.
5. DynamoDB: It is a fully managed NoSQL database service that provides fast and predictable performance with seamless scalability.
Three-tier Serverless Architecture

So, let’s take a use case in which you want to develop a three tier serverless application. The three tier architecture is a popular pattern for user facing applications, The tiers that comprise the architecture include the presentation tier, the logic tier and the data tier. The presentation tier represents the component that users directly interact with web page / mobile app UI. The logic tier contains the code required to translate user action at the presentation tier to the functionality that drives the application’s behaviour. The data tier consists of your storage media (databases, file systems, object stores) that holds the data relevant to the application. Figure shows the simple three-tier application.

Figure: Simple Three-Tier Architectural Pattern

Presentation Tier

The presentation tier of the three tier represents the View part of the application. Here you can use S3 to host static website. On a static website, individual web pages include static content and they also contain client side scripting.

The following is a quick procedure to configure an Amazon S3 bucket for static website hosting in the S3 console.

To configure an S3 bucket for static website hosting

1. Log in to the AWS Management Console and open the S3 console at

2. In the Bucket name list, choose the name of the bucket that you want to enable static website hosting for.

3. Choose Properties.

4. Choose Static Website Hosting

Once you enable your bucket for static website hosting, browsers can access all of your content through the Amazon S3 website endpoint for your bucket.

5. Choose Use this bucket to host.

A. For Index Document, type the name of your index document, which is typically named index.html. When you configure a S3 bucket for website hosting, you must specify an index document, which will be returned by S3 when requests are made to the root domain or any of the subfolders.

B. (Optional) For 4XX errors, you can optionally provide your own custom error document that provides additional guidance for your users. Type the name of the file that contains the custom error document. If an error occurs, S3 returns an error document.

C. (Optional) If you want to give advanced redirection rules, In the edit redirection rule text box, you have to XML to describe the rule.
E.g.
```
<RoutingRules>
    <RoutingRule>
        <Condition>
            <HttpErrorCodeReturnedEquals>403</HttpErrorCodeReturnedEquals>
        </Condition>
        <Redirect>
            <HostName>mywebsite.com</HostName>
            <ReplaceKeyPrefixWith>notfound/</ReplaceKeyPrefixWith>
        </Redirect>
    </RoutingRule>
</RoutingRules>
```
6. Choose Save

7. Add a bucket policy to the website bucket that grants access to the object in the S3 bucket for everyone. You must make the objects that you want to serve publicly readable, when you configure a S3 bucket as a website. To do so, you write a bucket policy that grants everyone S3:GetObject permission. The following bucket policy grants everyone access to the objects in the example-bucket bucket.
```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "PublicReadGetObject",
            "Effect": "Allow",
            "Principal": "*",
            "Action": [
                "s3:GetObject"
            ],
            "Resource": [
                "arn:aws:s3:::example-bucket/*"
            ]
        }
    ]
}
```
Note: If you choose Disable Website Hosting, S3 removes the website configuration from the bucket, so that the bucket no longer accessible from the website endpoint, but the bucket is still available at the REST endpoint.

Logic Tier

The logic tier represents the brains of the application. Here the two core services for serverless will be used i.e. API Gateway and Lambda to form your logic tier can be so revolutionary. The feature of the 2 services allow you to build a serverless production application which is highly scalable, available and secure. Your application could use number of servers, however by leveraging this pattern you do not have to manage a single one. In addition, by using these managed services together you get following benefits:
1. No operating system to choose, secure or manage.
2. No servers to right size, monitor.
3. No risk to your cost by over-provisioning.
4. No Risk to your performance by under-provisioning.
API Gateway

API Gateway is a fully managed service for defining, deploying and maintaining APIs. Anyone can integrate with the APIs using standard HTTPS requests. However, it has specific features and qualities that result it being an edge for your logic tier.

Integration with Lambda

API Gateway gives your application a simple way to leverage the innovation of AWS lambda directly (HTTPS Requests). API Gateway forms the bridge that connects your presentation tier and the functions you write in Lambda. After defining the client / server relationship using your API, the contents of the client’s HTTPS requests are passed to Lambda function for execution. The content include request metadata, request headers and the request body.

API Performance Across the Globe

Each deployment of API Gateway includes an Amazon CloudFront distribution under the covers. Amazon CloudFront is a content delivery web service that used Amazon’s global network of edge locations as connection points for clients integrating with API. This helps drive down the total response time latency of your API. Through its use of multiple edge locations across the world, Amazon CloudFront also provides you capabilities to combat distributed denial of service (DDoS) attack scenarios.

You can improve the performance of specific API requests by using API Gateway to store responses in an optional in-memory cache. This not only provides performance benefits for repeated API requests, but is also reduces backend executions, which can reduce overall cost.

Let’s dive into each step

1. Create Lambda Function
Login to Aws Console and head over to Lambda Service and Click on “Create A Function”

A. Choose first option “Author from scratch”
B. Enter Function Name
C. Select Runtime e.g. Python 2.7
D. Click on “Create Function”

As your function is ready, you can see your basic function will get generated in language you choose to write.
E.g.
```
import json

def lambda_handler(event, context):
    # TODO implement
    return {
        'statusCode': 200,
        'body': json.dumps('Hello from Lambda!')
    }
```
2. Testing Lambda Function

Click on “Test” button at the top right corner where we need to configure test event. As we are not sending any events, just give event a name, for example, “Hello World” template as it is and “Create” it.

Now, when you hit the “Test” button again, it runs through testing the function we created earlier and returns the configured value.

Create & Configure API Gateway connecting to Lambda

We are done with creating lambda functions but how to invoke function from outside world ? We need endpoint, right ?

Go to API Gateway & click on “Get Started” and agree on creating an Example API but we will not use that API we will create “New API”. Give it a name by keeping “Endpoint Type” regional for now.

Create the API and you will go on the page “resources” page of the created API Gateway. Go through the following steps:

A. Click on the “Actions”, then click on “Create Method”. Select Get method for our function. Then, “Tick Mark” on the right side of “GET” to set it up.
B. Choose “Lambda Function” as integration type.
C. Choose the region where we created earlier.
D. Write the name of Lambda Function we created
E. Save the method where it will ask you for confirmation of “Add Permission to Lambda Function”. Agree to that & that is done.
F. Now, we can test our setup. Click on “Test” to run API. It should give the response text we had on the lambda test screen.

Now, to get endpoint. We need to deploy the API. On the Actions dropdown, click on Deploy API under API Actions. Fill in the details of deployment and hit Deploy.

After that, we will get our HTTPS endpoint.

On the above screen you can see the things like cache settings, throttling, logging which can be configured. Save the changes and browse the invoke URL from which we will get the response which was earlier getting from Lambda. So, here is our logic tier of serverless application is to be done.

Data Tier

By using Lambda as your logic tier, you have a number of data storage options for your data tier. These options fall into broad categories: Amazon VPC hosted data stores and IAM-enabled data stores. Lambda has the ability to integrate with both securely.

Amazon VPC Hosted Data Stores
1. Amazon RDS
2. Amazon ElasticCache
3. Amazon Redshift
IAM-Enabled Data Stores
1. Amazon DynamoDB
2. Amazon S3
3. Amazon ElasticSearch Service
You can use any of those for storage purpose, But DynamoDB is one of best suited for ServerLess application.

Why DynamoDB ?
1. It is NoSQL DB, also that is fully managed by AWS.
2. It provides fast & prectable performance with seamless scalability.
3. DynamoDB lets you offload the administrative burden of operating and scaling a distributed system.
4. It offers encryption at rest, which eliminates the operational burden and complexity involved in protecting sensitive data.
5. You can scale up/down your tables throughput capacity without downtime/performance degradation.
6. It provides On-Demand backups as well as enable point in time recovery for your DynamoDB tables.
7. DynamoDB allows you to delete expired items from table automatically to help you reduce storage usage and the cost of storing data that is no longer relevant.
Following is the sample script for DynamoDB with Python which you can use with lambda.
from __future__ import print_function # Python 2/3 compatibility import boto3 import json import decimal from boto3.dynamodb.conditions import Key, Attr from botocore.exceptions import ClientError # Helper class to convert a DynamoDB item to JSON. class DecimalEncoder(json.JSONEncoder): def default(self, o): if isinstance(o, decimal.Decimal): if o % 1 > 0: return float(o) else: return int(o) return super(DecimalEncoder, self).default(o) dynamodb = boto3.resource("dynamodb", region_name='us-west-2', endpoint_url="http://localhost:8000") table = dynamodb.Table('Movies') title = "The Big New Movie" year = 2015 try: response = table.get_item( Key={ 'year': year, 'title': title } ) except ClientError as e: print(e.response['Error']['Message']) else: item = response['Item'] print("GetItem succeeded:") print(json.dumps(item, indent=4, cls=DecimalEncoder))
```
from __future__ import print_function # Python 2/3 compatibility
import boto3
import json
import decimal
from boto3.dynamodb.conditions import Key, Attr
from botocore.exceptions import ClientError

# Helper class to convert a DynamoDB item to JSON.
class DecimalEncoder(json.JSONEncoder):
    def default(self, o):
        if isinstance(o, decimal.Decimal):
            if o % 1 > 0:
                return float(o)
            else:
                return int(o)
        return super(DecimalEncoder, self).default(o)

dynamodb = boto3.resource("dynamodb", region_name='us-west-2', endpoint_url="http://localhost:8000")

table = dynamodb.Table('Movies')

title = "The Big New Movie"
year = 2015

try:
    response = table.get_item(
        Key={
            'year': year,
            'title': title
        }
    )
except ClientError as e:
    print(e.response['Error']['Message'])
else:
    item = response['Item']
    print("GetItem succeeded:")
    print(json.dumps(item, indent=4, cls=DecimalEncoder))
```
Note: To run the above script successfully you need to attach policy to your role for lambda. So in this case you need to attach policy for DynamoDB operations to take place & for CloudWatch if required to store your logs. Following is the policy which you can attach to your role for DB executions.
{ "Version": "2012-10-17", "Statement": [{ "Effect": "Allow", "Action": [ "dynamodb:BatchGetItem", "dynamodb:GetItem", "dynamodb:Query", "dynamodb:Scan", "dynamodb:BatchWriteItem", "dynamodb:PutItem", "dynamodb:UpdateItem" ], "Resource": "arn:aws:dynamodb:eu-west-1:123456789012:table/SampleTable" }, { "Effect": "Allow", "Action": [ "logs:CreateLogStream", "logs:PutLogEvents" ], "Resource": "arn:aws:logs:eu-west-1:123456789012:*" }, { "Effect": "Allow", "Action": "logs:CreateLogGroup", "Resource": "*" } ] }
```
{
	"Version": "2012-10-17",
	"Statement": [{
			"Effect": "Allow",
			"Action": [
				"dynamodb:BatchGetItem",
				"dynamodb:GetItem",
				"dynamodb:Query",
				"dynamodb:Scan",
				"dynamodb:BatchWriteItem",
				"dynamodb:PutItem",
				"dynamodb:UpdateItem"
			],
			"Resource": "arn:aws:dynamodb:eu-west-1:123456789012:table/SampleTable"
		},
		{
			"Effect": "Allow",
			"Action": [
				"logs:CreateLogStream",
				"logs:PutLogEvents"
			],
			"Resource": "arn:aws:logs:eu-west-1:123456789012:*"
		},
		{
			"Effect": "Allow",
			"Action": "logs:CreateLogGroup",
			"Resource": "*"
		}
	]
}
```
Sample Architecture Patterns

You can implement the following popular architecture patterns using API Gateway & Lambda as your logic tier, Amazon S3 for presentation tier, DynamoDB as your data tier. For each example, we will only use AWS Service that do not require users to manage their own infrastructure.

Mobile Backend

1. Presentation Tier: A mobile application running on each user’s smartphone.

2. Logic Tier: API Gateway & Lambda. The logic tier is globally distributed by the Amazon CloudFront distribution created as part of each API Gateway each API. A set of lambda functions can be specific to user / device identity management and authentication & managed by Amazon Cognito, which provides integration with IAM for temporary user access credentials as well as with popular third party identity providers. Other Lambda functions can define core business logic for your Mobile Back End.

3. Data Tier: The various data storage services can be leveraged as needed; options are given above in data tier.

Amazon S3 Hosted Website

1. Presentation Tier: Static website content hosted on S3, distributed by Amazon CLoudFront. Hosting static website content on S3 is a cost effective alternative to hosting content on server-based infrastructure. However, for a website to contain rich feature, the static content often must integrate with a dynamic back end.

2. Logic Tier: API Gateway & Lambda, static web content hosted in S3 can directly integrate with API Gateway, which can be CORS complaint.

3. Data Tier: The various data storage services can be leveraged based on your requirement.

ServerLess Costing

At the top of the AWS invoice, we can see the total costing of AWS Services. The bill was processed for 2.1 million API request & all of the infrastructure required to support them.

Following is the list of services with their costing.

Note: You can get your costing done from AWS Calculator using following links;
1. https://calculator.s3.amazonaws.com/index.html
2. AWS Pricing Calculator
Conclusion

The three-tier architecture pattern encourages the best practice of creating application component that are easy to maintain, develop, decoupled & scalable. Serverless Application services varies based on the requirements over development.
December 12, 2022