Serverless is an emerging trend in software architecture. 2016 was a very exciting year for serverless, and adoption will continue to explode in 2017. This post covers the interesting developments in the serverless space in 2016 and my thoughts on how the space will evolve in 2017.
What is serverless?
In my opinion, serverless means two things:
1. Serverless was initially used to describe fully hosted services or Backend-as-a-Service (BaaS) offerings, where you fully depend on a third party to manage the server-side logic and state. Examples include AWS S3 (storage), Auth0 (authentication), AWS DynamoDB (database), Firebase, etc.
2. The popular interpretation of serverless is Functions-as-a-Service (FaaS), where developers upload code that runs within stateless compute containers that are triggered by a variety of events, are ephemeral, and are fully managed by the cloud platform. FaaS obviates the need to provision, scale, or manage the availability of your own servers. The most popular FaaS offering is AWS Lambda, but Microsoft Azure Functions, Auth0 Webtask, and Google Cloud Functions are fast-maturing offerings as well. IBM OpenWhisk and Iron.io allow serverless environments to be set up on-premise too.
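To make the FaaS model concrete, here is a minimal sketch of a Lambda-style handler: a stateless function invoked with an event, holding no server state of its own. The event shape and function name are illustrative, not any specific service's API.

```python
# A minimal FaaS-style handler: stateless and event-driven.
# The platform (e.g. AWS Lambda) would call this per event; the
# event/context shapes here are illustrative placeholders.
def handler(event, context=None):
    # Pull a value out of the triggering event, with a default.
    name = event.get("name", "world")
    return {"statusCode": 200, "body": f"Hello, {name}!"}

if __name__ == "__main__":
    # Locally, we can invoke the handler directly with a fake event.
    print(handler({"name": "serverless"}))
```

Everything the function needs arrives in the event; there is no server to provision, and the platform scales by running more copies of the handler.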
This post will focus more on FaaS and especially AWS Lambda since it is the leader in this space with the largest feature set.
Serverless vs PaaS
People compare serverless with Platform-as-a-Service (PaaS), which is an invalid comparison in my opinion. PaaS platforms give you the ability to deploy and manage your entire application or micro-service, and they help you scale your application infrastructure. Anyone who has used a PaaS knows that while it reduces administration, it does not do away with it. PaaS does not really require you to re-evaluate your code design. In PaaS, your application needs to be always on, though it can be scaled up or down.
Serverless, on the other hand, is about “breaking up your application into discrete functions to obviate the need for any complex infrastructure or its management”. Serverless has several restrictions with respect to execution-duration limits and state. It requires developers to architect differently, thinking about the discrete functionality of each component of the application.
New features from AWS for Serverless
Lambda@Edge: This feature allows Lambda code to be executed on global AWS Edge locations. Imagine you deploy your code to one region of AWS and then are able to run it in any of the 10s and soon 100s of AWS Edge locations around the globe. Imagine the reduction in network latency for your end users.
Step Functions: Companies that adopt serverless soon end up with tens or hundreds of functions. The logic and workflow between these functions is hard to track and manage. Step Functions allow developers to create visual workflows to organize the various components, micro-services, events and conditions. This is essentially a Rapid Application Development (RAD) product, and I expect AWS to build a fully functional enterprise RAD offering on top of this feature.
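For flavor, a Step Functions workflow is declared in Amazon States Language (JSON). The sketch below chains two hypothetical Lambda functions; the state names, account ID, and function ARNs are placeholders, not from the original post:

```json
{
  "Comment": "Hypothetical two-step workflow chaining Lambda functions",
  "StartAt": "ResizeImage",
  "States": {
    "ResizeImage": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789012:function:resize-image",
      "Next": "NotifyUser"
    },
    "NotifyUser": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789012:function:notify-user",
      "End": true
    }
  }
}
```

The workflow, rather than the functions themselves, owns the sequencing and error handling, which is what makes large fleets of functions manageable.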
API Gateway Monetisation: This is a big deal in my opinion. Several trends converge here: a) startups are increasingly API-first; b) enterprises and startups alike build tens or hundreds of integrations with APIs; c) fine-grained, usage-based billing based on API usage; d) adoption of micro-services architectures built on API contracts. This feature allows companies to start monetising their APIs via the AWS Marketplace, and the APIs can be implemented in AWS Lambda on the backend. I expect to see a lot of “data integration”, “data pipeline” and “data marketplace” companies try out this approach.
AWS Greengrass: Extend AWS compute to devices. This feature enables running Lambda functions offline on devices in the field. This is another great feature which is extending the meaning of a “global cloud”. Most of the use-cases for this feature are in IoT space.
Continuous Deployment for Serverless: AWS CodePipeline, CodeCommit and CodeBuild support AWS Lambda and enable creation of continuous integration and deployment pipelines. Jenkins and other CI tools also have plugins for serverless which are improving all the time.
API Gateway Developer Portal: Easily build your own developer portal with API documentation, with support for Swagger documentation.
AWS X-Ray: Analyze and debug distributed applications, commonly used with micro-services architecture. X-Ray will soon get Lambda support and enable even easier debugging of Lambda logic flows across all the AWS services and triggers that are part of your workflow.
Predictions
Your code will increasingly run closer to clients. With Lambda@Edge, Greengrass & Snowball Edge, you can truly “deploy once and run around the globe”. This is not just scaling horizontally but scaling up or down geographically. I expect customers to leverage this feature in some very interesting ways.
Serverless frameworks will mature and allow easy creation of simple REST applications. I also expect vertical specific serverless frameworks to evolve, especially for IoT.
Monitoring, logging and debugging for serverless will improve, with cloud vendor solutions as well as frameworks providing capabilities. A good example is IOpipe, which provides monitoring and logging capabilities by instrumenting your Lambda code.
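The instrumentation idea is simple: wrap the handler so each invocation is counted and timed before the result is returned. The decorator below is a generic sketch of that pattern, similar in spirit to what such tools do; it is not IOpipe's actual API.

```python
import functools
import time

def instrument(fn):
    """Wrap a Lambda-style handler to record invocation count and
    duration. A real tool would ship these metrics to a backend;
    here we just store them on the wrapper (illustrative sketch)."""
    @functools.wraps(fn)
    def wrapper(event, context=None):
        start = time.perf_counter()
        try:
            return fn(event, context)
        finally:
            # Runs whether the handler returned or raised.
            wrapper.invocations += 1
            wrapper.last_duration_ms = (time.perf_counter() - start) * 1000.0
    wrapper.invocations = 0
    wrapper.last_duration_ms = None
    return wrapper

@instrument
def handler(event, context=None):
    # Hypothetical handler that counts records in the event.
    return {"ok": True, "items": len(event.get("records", []))}
```

Because the wrapper intercepts every call, the handler itself stays free of monitoring concerns.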
I expect all the continuous integration and deployment tool vendors to add increasing support for serverless architectures in 2017. Automated testing for serverless will become easier via frameworks and CI tools.
With CloudFormation and Step Functions, AWS will try and solve the versioning and discovery problem for Lambda functions.
Mature patterns will emerge for serverless usage. For example, some very common use-cases today are image or video conversion based on an S3 trigger, backends for messaging bots, data processing for IoT, etc. I expect to see more mature patterns and use-cases where serverless becomes an obvious solution. Expect lots of interesting white-papers and case studies from AWS to educate the market on emerging use-cases.
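The S3-trigger pattern boils down to unpacking bucket and key from each record in the notification event and processing the referenced object. A sketch, with the actual processing left as a placeholder:

```python
# Sketch of the common "run on S3 upload" pattern: each notification
# event carries one or more Records identifying the uploaded object.
# The record layout mirrors the S3 event notification format.
def handle_s3_event(event):
    processed = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        # Real code would fetch the object here and, e.g., convert
        # the image or transcode the video before storing the result.
        processed.append((bucket, key))
    return processed
```

The function does nothing until an upload happens, and the platform runs as many copies as there are events, which is what makes this a natural serverless fit.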
AWS will allow users to choose to keep some number of Lambda instances always-on for lower latency with some extra pricing. This will start addressing the latency issues that some heavy functions or frameworks entail.
API and data integration vendors will increasingly use Lambda and monetized API gateways. An early example is Cloud Elements, which has released a feature that allows publishing SaaS APIs as AWS Lambda functions.
Serverless Frameworks
There are open-source projects like Serverless, Chalice, Lambda Framework, Apex, Gomix which add a level of abstraction over vendor serverless platforms. These frameworks make it easier to develop and deploy serverless components.
Some of these frameworks plan to allow abstraction across FaaS offerings to avoid lock-in into a single platform. I do not expect to see any abstraction or standardisation that allows portability of serverless functions. Cloud vendors have unique FaaS offerings with different implementations, and the triggers/events are based on their own proprietary services (for example, AWS has S3, Kinesis, DynamoDB, etc. triggers).
Bots, mobile app backends, IoT backends, data processing/ETL, scheduled jobs and other event-driven use-cases are a perfect fit for serverless.
Conclusion
In my opinion, any service that adds agility while reducing IT operations and costs will be successful. Serverless definitely fits this philosophy and will continue to see increasing adoption. It will not “replace” existing architectures but augment them. The serverless movement is just getting started, and you can expect cloud vendors to invest heavily in improving the feature set and capabilities of serverless offerings. Serverless frameworks, DevOps teams, and the operational management and monitoring of FaaS will continue to mature in 2017. There are a lot of emergent trends at play here — containerization, serverless, LessOps, voice and chatbots, IoT proliferation, globalization and agility demands. All of these will accelerate serverless adoption.
Automation is everywhere, and it is better to adopt it as soon as possible. In this blog post, we are going to discuss creating the infrastructure. We will use AWS for hosting our deployment pipeline: Packer will be used to create AMIs, and Terraform will be used to create the master and slaves. We will discuss different ways of connecting the slaves and will also run a sample application through the pipeline.
Please remember that the intent of this blog is to bring all the different components together, which means some code that would normally live in a development code repo is also included here. Now that we have highlighted the required tools, the 10,000 ft view, and the intent of the blog, let's begin.
Using Packer to Create AMIs for Jenkins Master and Linux Slave
HashiCorp has given us some of the most amazing tools for simplifying our lives, and Packer is one of them. Packer can be used to create a custom AMI from an already available AMI. We just need to create a JSON file, pass an installation script as part of the build, and it takes care of producing the AMI for us. Install Packer from the Packer downloads page, depending on your requirements. For simplicity, we will use Linux machines for both the Jenkins master and the Linux slave. The JSON file for both will be the same, but it can be separated if needed.
Note: the user-data passed from Terraform will be different, which is what will eventually differentiate their usage.
We are using Amazon Linux 2 – JSON file for the same.
As you can see, the file is pretty simple. The only thing of interest is the install_amazon.bash script. In this blog post, we will deploy a Node-based application running inside a Docker container. The content of the bash file is as follows:
#!/bin/bash
set -x

# For Node
curl -sL https://rpm.nodesource.com/setup_10.x | sudo -E bash -
# For xmlstarlet
sudo yum install -y https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm
sudo yum update -y
sleep 10
# Setting up Docker
sudo yum install -y docker
sudo usermod -a -G docker ec2-user
# Just to be safe, removing previously available java if present
sudo yum remove -y java
sudo yum install -y python2-pip jq unzip vim tree biosdevname nc mariadb bind-utils at screen tmux xmlstarlet git java-1.8.0-openjdk nc gcc-c++ make nodejs
sudo -H pip install awscli bcrypt
sudo -H pip install --upgrade awscli
sudo -H pip install --upgrade aws-ec2-assign-elastic-ip
sudo npm install -g @angular/cli
sudo systemctl enable docker
sudo systemctl enable atd
sudo yum clean all
sudo rm -rf /var/cache/yum/
exit 0
Now, there are a lot of things in there, so let's check them out. As mentioned earlier, we will discuss different ways of connecting to a slave, and for one of them we need xmlstarlet. The rest are packages that we might need in one way or another.
Update ami_users with your actual account value. This can be found in the AWS console under Support > Support Center.
Validate whether what we have written is right by running packer validate amazon.json.
Once confirmed, build the packer image by running packer build amazon.json.
After completion, check your AWS console and you will find a new AMI created under “My AMIs”.
It's now time to start using Terraform to create the machines.
Prerequisites:
1. Please make sure you create a provider.tf file.
provider "aws" { region ="us-east-1" shared_credentials_file ="~/.aws/credentials" profile ="dev"}
The credentials file will contain aws_access_key_id and aws_secret_access_key.
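For reference, a minimal ~/.aws/credentials file with a dev profile matching the provider block above might look like this (the key values are placeholders):

```ini
[dev]
aws_access_key_id     = AKIAXXXXXXXXXXXXXXXX
aws_secret_access_key = xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
```

The section name must match the profile named in the provider configuration.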
2. Keep SSH keys handy for the server/slave machines. Here is a nice article highlighting how to create them, or else create them beforehand in the AWS console and reference them in the code.
3. VPC:
# lookup for the "default" VPC
data "aws_vpc" "default_vpc" {
  default = true
}

# subnet list in the "default" VPC
# The "default" VPC has all "public subnets"
data "aws_subnet_ids" "default_public" {
  vpc_id = "${data.aws_vpc.default_vpc.id}"
}
Creating Terraform Script for Spinning up Jenkins Master
Get Terraform from the Terraform downloads page.
We will need to set up the Security Group before setting up the instance.
# Security Group
resource "aws_security_group" "jenkins_server" {
  name        = "jenkins_server"
  description = "Jenkins Server: created by Terraform for [dev]"
  # legacy name of VPC ID
  vpc_id = "${data.aws_vpc.default_vpc.id}"
  tags {
    Name = "jenkins_server"
    env  = "dev"
  }
}

###############################################################################
# ALL INBOUND
###############################################################################
# ssh
resource "aws_security_group_rule" "jenkins_server_from_source_ingress_ssh" {
  type              = "ingress"
  from_port         = 22
  to_port           = 22
  protocol          = "tcp"
  security_group_id = "${aws_security_group.jenkins_server.id}"
  cidr_blocks       = ["<Your Public IP>/32", "172.0.0.0/8"]
  description       = "ssh to jenkins_server"
}

# web
resource "aws_security_group_rule" "jenkins_server_from_source_ingress_webui" {
  type              = "ingress"
  from_port         = 8080
  to_port           = 8080
  protocol          = "tcp"
  security_group_id = "${aws_security_group.jenkins_server.id}"
  cidr_blocks       = ["0.0.0.0/0"]
  description       = "jenkins server web"
}

# JNLP
resource "aws_security_group_rule" "jenkins_server_from_source_ingress_jnlp" {
  type              = "ingress"
  from_port         = 33453
  to_port           = 33453
  protocol          = "tcp"
  security_group_id = "${aws_security_group.jenkins_server.id}"
  cidr_blocks       = ["172.31.0.0/16"]
  description       = "jenkins server JNLP Connection"
}

###############################################################################
# ALL OUTBOUND
###############################################################################
resource "aws_security_group_rule" "jenkins_server_to_other_machines_ssh" {
  type              = "egress"
  from_port         = 22
  to_port           = 22
  protocol          = "tcp"
  security_group_id = "${aws_security_group.jenkins_server.id}"
  cidr_blocks       = ["0.0.0.0/0"]
  description       = "allow jenkins servers to ssh to other machines"
}

resource "aws_security_group_rule" "jenkins_server_outbound_all_80" {
  type              = "egress"
  from_port         = 80
  to_port           = 80
  protocol          = "tcp"
  security_group_id = "${aws_security_group.jenkins_server.id}"
  cidr_blocks       = ["0.0.0.0/0"]
  description       = "allow jenkins servers for outbound yum"
}

resource "aws_security_group_rule" "jenkins_server_outbound_all_443" {
  type              = "egress"
  from_port         = 443
  to_port           = 443
  protocol          = "tcp"
  security_group_id = "${aws_security_group.jenkins_server.id}"
  cidr_blocks       = ["0.0.0.0/0"]
  description       = "allow jenkins servers for outbound yum"
}
Now that we have a custom AMI and security groups, let's use them to create the instance with Terraform.
# AMI lookup for this Jenkins server
data "aws_ami" "jenkins_server" {
  most_recent = true
  owners      = ["self"]
  filter {
    name   = "name"
    values = ["amazon-linux-for-jenkins*"]
  }
}

resource "aws_key_pair" "jenkins_server" {
  key_name   = "jenkins_server"
  public_key = "${file("jenkins_server.pub")}"
}

# lookup the security group of the Jenkins server
data "aws_security_group" "jenkins_server" {
  filter {
    name   = "group-name"
    values = ["jenkins_server"]
  }
}

# userdata for the Jenkins server ...
data "template_file" "jenkins_server" {
  template = "${file("scripts/jenkins_server.sh")}"
  vars {
    env                    = "dev"
    jenkins_admin_password = "mysupersecretpassword"
  }
}

# the Jenkins server itself
resource "aws_instance" "jenkins_server" {
  ami                    = "${data.aws_ami.jenkins_server.image_id}"
  instance_type          = "t3.medium"
  key_name               = "${aws_key_pair.jenkins_server.key_name}"
  subnet_id              = "${data.aws_subnet_ids.default_public.ids[0]}"
  vpc_security_group_ids = ["${data.aws_security_group.jenkins_server.id}"]
  iam_instance_profile   = "dev_jenkins_server"
  user_data              = "${data.template_file.jenkins_server.rendered}"
  tags {
    "Name" = "jenkins_server"
  }
  root_block_device {
    delete_on_termination = true
  }
}

output "jenkins_server_ami_name" {
  value = "${data.aws_ami.jenkins_server.name}"
}

output "jenkins_server_ami_id" {
  value = "${data.aws_ami.jenkins_server.id}"
}

output "jenkins_server_public_ip" {
  value = "${aws_instance.jenkins_server.public_ip}"
}

output "jenkins_server_private_ip" {
  value = "${aws_instance.jenkins_server.private_ip}"
}
As mentioned before, we will discuss multiple ways to connect slaves to the Jenkins master. Every time a new Jenkins instance comes up, it generates a unique initial admin password. There are two ways to deal with this: wait for Jenkins to spin up and retrieve that password, or directly set the admin password while provisioning the Jenkins master. Here we will discuss how to change the password when configuring Jenkins. (If you need the script to retrieve the Jenkins password as soon as it gets created, leave a comment and I will share that with you as well.)
Below is the user data to install Jenkins master, configure its password and install required packages.
#!/bin/bash
set -x

function wait_for_jenkins() {
    while (( 1 )); do
        echo "waiting for Jenkins to launch on port [8080] ..."
        nc -zv 127.0.0.1 8080
        if (( $? == 0 )); then
            break
        fi
        sleep 10
    done
    echo "Jenkins launched"
}

function updating_jenkins_master_password() {
    cat >/tmp/jenkinsHash.py <<EOF
import bcrypt
import sys
if not sys.argv[1]:
    sys.exit(10)
plaintext_pwd = sys.argv[1]
encrypted_pwd = bcrypt.hashpw(sys.argv[1], bcrypt.gensalt(rounds=10, prefix=b"2a"))
isCorrect = bcrypt.checkpw(plaintext_pwd, encrypted_pwd)
if not isCorrect:
    sys.exit(20)
print "{}".format(encrypted_pwd)
EOF
    chmod +x /tmp/jenkinsHash.py
    # Wait till the /var/lib/jenkins/users/admin* folder gets created
    sleep 10
    cd /var/lib/jenkins/users/admin*
    pwd
    while (( 1 )); do
        echo "Waiting for Jenkins to generate admin user's config file ..."
        if [[ -f "./config.xml" ]]; then
            break
        fi
        sleep 10
    done
    echo "Admin config file created"
    admin_password=$(python /tmp/jenkinsHash.py ${jenkins_admin_password} 2>&1)
    # Please do not alter the quoting, as it keeps the hash syntax intact;
    # otherwise, during substitution, $<character> will be replaced by null
    xmlstarlet -q ed --inplace -u "/user/properties/hudson.security.HudsonPrivateSecurityRealm_-Details/passwordHash" -v '#jbcrypt:'"$admin_password" config.xml
    # Restart
    systemctl restart jenkins
    sleep 10
}

function install_packages() {
    wget -O /etc/yum.repos.d/jenkins.repo http://pkg.jenkins-ci.org/redhat-stable/jenkins.repo
    rpm --import https://jenkins-ci.org/redhat/jenkins-ci.org.key
    yum install -y jenkins
    # firewall
    #firewall-cmd --permanent --new-service=jenkins
    #firewall-cmd --permanent --service=jenkins --set-short="Jenkins Service Ports"
    #firewall-cmd --permanent --service=jenkins --set-description="Jenkins Service firewalld port exceptions"
    #firewall-cmd --permanent --service=jenkins --add-port=8080/tcp
    #firewall-cmd --permanent --add-service=jenkins
    #firewall-cmd --zone=public --add-service=http --permanent
    #firewall-cmd --reload
    systemctl enable jenkins
    systemctl restart jenkins
    sleep 10
}

function configure_jenkins_server() {
    # Jenkins cli
    echo "installing the Jenkins cli ..."
    cp /var/cache/jenkins/war/WEB-INF/jenkins-cli.jar /var/lib/jenkins/jenkins-cli.jar
    # Getting initial password
    # PASSWORD=$(cat /var/lib/jenkins/secrets/initialAdminPassword)
    PASSWORD="${jenkins_admin_password}"
    sleep 10
    jenkins_dir="/var/lib/jenkins"
    plugins_dir="$jenkins_dir/plugins"
    cd $jenkins_dir
    # Open JNLP port
    xmlstarlet -q ed --inplace -u "/hudson/slaveAgentPort" -v 33453 config.xml
    cd $plugins_dir || { echo "unable to chdir to [$plugins_dir]"; exit 1; }
    # List of plugins that need to be installed
    plugin_list="git-client git github-api github-oauth github MSBuild ssh-slaves workflow-aggregator ws-cleanup"
    # remove existing plugins, if any ...
    rm -rfv $plugin_list
    for plugin in $plugin_list; do
        echo "installing plugin [$plugin] ..."
        java -jar $jenkins_dir/jenkins-cli.jar -s http://127.0.0.1:8080/ -auth admin:$PASSWORD install-plugin $plugin
    done
    # Restart jenkins after installing plugins
    java -jar $jenkins_dir/jenkins-cli.jar -s http://127.0.0.1:8080 -auth admin:$PASSWORD safe-restart
}

### script starts here ###
install_packages
wait_for_jenkins
updating_jenkins_master_password
wait_for_jenkins
configure_jenkins_server
echo "Done"
exit 0
There is a lot covered here, but the trickiest bit is changing the Jenkins password. We use a Python script that uses bcrypt to hash the plain-text password in Jenkins' encryption format, and xmlstarlet to replace that password in its actual location. We also use xmlstarlet to edit the JNLP port for the Windows slave. Do remember the initial username for Jenkins is admin.
Commands to run: initialize Terraform with terraform init, then check and apply with terraform plan followed by terraform apply.
After the apply command completes successfully, go to the AWS console and check for the new instance coming up. Open http://<public-ip>:8080, enter the credentials you passed in, and your Jenkins master is ready to be used.
Note: I will be providing the terraform script and permission list of IAM roles for the user at the end of the blog.
Creating Terraform Script for Spinning up Linux Slave and Connecting It to the Master
We won't be creating a new image here; rather, we will use the same one that we used for the Jenkins master.
The VPC will be the same; the updated security groups for the slave are below:
resource "aws_security_group""dev_jenkins_worker_linux" { name ="dev_jenkins_worker_linux" description ="Jenkins Server: created by Terraform for [dev]"# legacy name ofVPCID vpc_id ="${data.aws_vpc.default_vpc.id}" tags { Name ="dev_jenkins_worker_linux" env ="dev" }}################################################################################ ALLINBOUND################################################################################ sshresource "aws_security_group_rule""jenkins_worker_linux_from_source_ingress_ssh" { type ="ingress" from_port =22 to_port =22 protocol ="tcp" security_group_id ="${aws_security_group.dev_jenkins_worker_linux.id}" cidr_blocks = ["<Your Public IP>/32"] description ="ssh to jenkins_worker_linux"}# sshresource "aws_security_group_rule""jenkins_worker_linux_from_source_ingress_webui" { type ="ingress" from_port =8080 to_port =8080 protocol ="tcp" security_group_id ="${aws_security_group.dev_jenkins_worker_linux.id}" cidr_blocks = ["0.0.0.0/0"] description ="ssh to jenkins_worker_linux"}################################################################################ ALLOUTBOUND###############################################################################resource "aws_security_group_rule""jenkins_worker_linux_to_all_80" { type ="egress" from_port =80 to_port =80 protocol ="tcp" security_group_id ="${aws_security_group.dev_jenkins_worker_linux.id}" cidr_blocks = ["0.0.0.0/0"] description ="allow jenkins worker to all 80"}resource "aws_security_group_rule""jenkins_worker_linux_to_all_443" { type ="egress" from_port =443 to_port =443 protocol ="tcp" security_group_id ="${aws_security_group.dev_jenkins_worker_linux.id}" cidr_blocks = ["0.0.0.0/0"] description ="allow jenkins worker to all 443"}resource "aws_security_group_rule""jenkins_worker_linux_to_other_machines_ssh" { type ="egress" from_port =22 to_port =22 protocol ="tcp" security_group_id ="${aws_security_group.dev_jenkins_worker_linux.id}" cidr_blocks = ["0.0.0.0/0"] description 
="allow jenkins worker linux to jenkins server"}resource "aws_security_group_rule""jenkins_worker_linux_to_jenkins_server_8080" { type ="egress" from_port =8080 to_port =8080 protocol ="tcp" security_group_id ="${aws_security_group.dev_jenkins_worker_linux.id}" source_security_group_id ="${aws_security_group.jenkins_server.id}" description ="allow jenkins workers linux to jenkins server"}
Now that we have the required security groups in place, it is time to bring in the Terraform script for the Linux slave.
And now the final piece of code, which is user-data of slave machine.
#!/bin/bash
set -x

function wait_for_jenkins() {
    echo "Waiting for jenkins to launch on 8080..."
    while (( 1 )); do
        echo "Waiting for Jenkins"
        nc -zv ${server_ip} 8080
        if (( $? == 0 )); then
            break
        fi
        sleep 10
    done
    echo "Jenkins launched"
}

function slave_setup() {
    # Wait till the jar files become available
    ret=1
    while (( $ret != 0 )); do
        wget -O /opt/jenkins-cli.jar http://${server_ip}:8080/jnlpJars/jenkins-cli.jar
        ret=$?
        echo "jenkins cli ret [$ret]"
    done
    ret=1
    while (( $ret != 0 )); do
        wget -O /opt/slave.jar http://${server_ip}:8080/jnlpJars/slave.jar
        ret=$?
        echo "jenkins slave ret [$ret]"
    done
    mkdir -p /opt/jenkins-slave
    chown -R ec2-user:ec2-user /opt/jenkins-slave
    # Register slave
    JENKINS_URL="http://${server_ip}:8080"
    USERNAME="${jenkins_username}"
    # PASSWORD=$(cat /tmp/secret)
    PASSWORD="${jenkins_password}"
    SLAVE_IP=$(ip -o -4 addr list ${device_name} | head -n1 | awk '{print $4}' | cut -d/ -f1)
    NODE_NAME=$(echo "jenkins-slave-linux-$SLAVE_IP" | tr '.' '-')
    NODE_SLAVE_HOME="/opt/jenkins-slave"
    EXECUTORS=2
    SSH_PORT=22
    CRED_ID="$NODE_NAME"
    LABELS="build linux docker"
    USERID="ec2-user"
    cd /opt
    # Creating CMD utility for jenkins-cli commands
    jenkins_cmd="java -jar /opt/jenkins-cli.jar -s $JENKINS_URL -auth $USERNAME:$PASSWORD"
    # Waiting for Jenkins to load all plugins
    while (( 1 )); do
        count=$($jenkins_cmd list-plugins 2>/dev/null | wc -l)
        ret=$?
        echo "count [$count] ret [$ret]"
        if (( $count > 0 )); then
            break
        fi
        sleep 30
    done
    # Delete credentials if present for respective slave machines
    $jenkins_cmd delete-credentials system::system::jenkins _ $CRED_ID
    # Generating cred.xml for creating credentials on the Jenkins server
    cat >/tmp/cred.xml <<EOF
<com.cloudbees.jenkins.plugins.sshcredentials.impl.BasicSSHUserPrivateKey plugin="ssh-credentials@1.16">
  <scope>GLOBAL</scope>
  <id>$CRED_ID</id>
  <description>Generated via Terraform for $SLAVE_IP</description>
  <username>$USERID</username>
  <privateKeySource class="com.cloudbees.jenkins.plugins.sshcredentials.impl.BasicSSHUserPrivateKey\$DirectEntryPrivateKeySource">
    <privateKey>${worker_pem}</privateKey>
  </privateKeySource>
</com.cloudbees.jenkins.plugins.sshcredentials.impl.BasicSSHUserPrivateKey>
EOF
    # Creating credential using cred.xml
    cat /tmp/cred.xml | $jenkins_cmd create-credentials-by-xml system::system::jenkins _
    # For deleting the node, used when testing
    $jenkins_cmd delete-node $NODE_NAME
    # Generating node.xml for creating the node on the Jenkins server
    cat >/tmp/node.xml <<EOF
<slave>
  <name>$NODE_NAME</name>
  <description>Linux Slave</description>
  <remoteFS>$NODE_SLAVE_HOME</remoteFS>
  <numExecutors>$EXECUTORS</numExecutors>
  <mode>NORMAL</mode>
  <retentionStrategy class="hudson.slaves.RetentionStrategy\$Always"/>
  <launcher class="hudson.plugins.sshslaves.SSHLauncher" plugin="ssh-slaves@1.5">
    <host>$SLAVE_IP</host>
    <port>$SSH_PORT</port>
    <credentialsId>$CRED_ID</credentialsId>
  </launcher>
  <label>$LABELS</label>
  <nodeProperties/>
  <userId>$USERID</userId>
</slave>
EOF
    sleep 10
    # Creating the node using node.xml
    cat /tmp/node.xml | $jenkins_cmd create-node $NODE_NAME
}

### script begins here ###
wait_for_jenkins
slave_setup
echo "Done"
exit 0
This will not only create a node on Jenkins master but also attach it.
Commands to run: initialize Terraform with terraform init, then check and apply with terraform plan followed by terraform apply.
One drawback of this approach: if the slave gets disconnected or goes down for any reason, it will remain on the Jenkins master as offline and will not automatically reattach itself.
Some solutions for them are:
1. Create a cron job on the slave which will run user-data after a certain interval.
2. Use swarm plugin.
3. As we are on AWS, we can even use Amazon EC2 Plugin.
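For option 1, a crontab entry on the slave could periodically re-run the registration portion of the user-data; the script path, interval, and log file below are illustrative, not from the original setup:

```
# Hypothetical crontab entry: re-run slave registration every 10 minutes
*/10 * * * * /opt/jenkins-slave/register.sh >> /var/log/jenkins-register.log 2>&1
```

Since the registration script deletes and recreates the node, re-running it is effectively idempotent, which is what makes the cron approach workable.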
Maybe in a future blog, we will cover using these plugins as well.
Using Packer to Create AMIs for Windows Slave
The Windows AMI will also be created using Packer. All the pointers for Windows remain as they were for Linux.
When it comes to Windows, one should know that it does not behave the same way Linux does. For us to be able to communicate with this image, an essential component is WinRM, which we set up at the very beginning as part of user_data_file. Windows also requires user input for a lot of things, and while automating, it is not possible to provide it without breaking the flow of execution, so we disable UAC and enable RDP so that we can connect to the machine from our local desktop for debugging if needed. Lastly, we execute the install_windows.ps1 file, which sets up our slave. Please note that at the end we call two PowerShell scripts to generate a random password every time a new machine is created. These are mandatory, or you will never be able to log in to your machines.
There are multiple user-data in the above code, let’s understand them in their order of appearance.
SetUpWinRM.ps1:
<powershell>
write-output "Running User Data Script"
write-host "(host) Running User Data Script"
Set-ExecutionPolicy Unrestricted -Scope LocalMachine -Force -ErrorAction Ignore
# Don't set this before Set-ExecutionPolicy as it throws an error
$ErrorActionPreference = "stop"
# Remove HTTP listener
Remove-Item -Path WSMan:\Localhost\listener\listener* -Recurse
$Cert = New-SelfSignedCertificate -CertstoreLocation Cert:\LocalMachine\My -DnsName "packer"
New-Item -Path WSMan:\LocalHost\Listener -Transport HTTPS -Address * -CertificateThumbPrint $Cert.Thumbprint -Force
# WinRM
write-output "Setting up WinRM"
write-host "(host) setting up WinRM"
cmd.exe /c winrm quickconfig -q
cmd.exe /c winrm set "winrm/config" '@{MaxTimeoutms="1800000"}'
cmd.exe /c winrm set "winrm/config/winrs" '@{MaxMemoryPerShellMB="1024"}'
cmd.exe /c winrm set "winrm/config/service" '@{AllowUnencrypted="true"}'
cmd.exe /c winrm set "winrm/config/client" '@{AllowUnencrypted="true"}'
cmd.exe /c winrm set "winrm/config/service/auth" '@{Basic="true"}'
cmd.exe /c winrm set "winrm/config/client/auth" '@{Basic="true"}'
cmd.exe /c winrm set "winrm/config/service/auth" '@{CredSSP="true"}'
cmd.exe /c winrm set "winrm/config/listener?Address=*+Transport=HTTPS" "@{Port=`"5986`";Hostname=`"packer`";CertificateThumbprint=`"$($Cert.Thumbprint)`"}"
cmd.exe /c netsh advfirewall firewall set rule group="remote administration" new enable=yes
cmd.exe /c netsh firewall add portopening TCP 5986 "Port 5986"
cmd.exe /c net stop winrm
cmd.exe /c sc config winrm start= auto
cmd.exe /c net start winrm
</powershell>
The content is pretty straightforward, as it just sets up WinRM. The only thing that matters here is the <powershell> and </powershell> tags; they are mandatory, as without them Packer will not be able to tell what type of script this is. Next, we come across disable-uac.ps1 and enable-rdp.ps1, whose purpose we discussed above. The last user-data is the actual script that installs all the required packages in the AMI.
Chocolatey: a blessing for Windows automation. Installing applications on Windows by scripting is a real headache, as you have to write a lot of code just to install a single application, but luckily we have Chocolatey. It works as a package manager for Windows and helps us install applications the way we install packages on Linux. install_windows.ps1 has the installation step for Chocolatey and shows how it can be used to install other applications on Windows.
See, such a small script and you can get all the components to run your Windows application in no time (Kidding… This script actually takes around 20 minutes to run :P)
Now that we have the image for ourselves let’s start with terraform script to make this machine a slave of your Jenkins master.
Creating Terraform Script for Spinning up Windows Slave and Connect it to Master
This time, too, we will first create the security groups and then create the slave machine from the same AMI that we built above.
resource "aws_security_group""dev_jenkins_worker_windows" { name ="dev_jenkins_worker_windows" description ="Jenkins Server: created by Terraform for [dev]" # legacy name ofVPCID vpc_id ="${data.aws_vpc.default_vpc.id}" tags { Name ="dev_jenkins_worker_windows" env ="dev" }}################################################################################ ALLINBOUND################################################################################ sshresource "aws_security_group_rule""jenkins_worker_windows_from_source_ingress_webui" { type ="ingress" from_port =8080 to_port =8080 protocol ="tcp" security_group_id ="${aws_security_group.dev_jenkins_worker_windows.id}" cidr_blocks = ["0.0.0.0/0"] description ="ssh to jenkins_worker_windows"}# rdpresource "aws_security_group_rule""jenkins_worker_windows_from_rdp" { type ="ingress" from_port =3389 to_port =3389 protocol ="tcp" security_group_id ="${aws_security_group.dev_jenkins_worker_windows.id}" cidr_blocks = ["<Your Public IP>/32"] description ="rdp to jenkins_worker_windows"}################################################################################ ALLOUTBOUND###############################################################################resource "aws_security_group_rule""jenkins_worker_windows_to_all_80" { type ="egress" from_port =80 to_port =80 protocol ="tcp" security_group_id ="${aws_security_group.dev_jenkins_worker_windows.id}" cidr_blocks = ["0.0.0.0/0"] description ="allow jenkins worker to all 80"}resource "aws_security_group_rule""jenkins_worker_windows_to_all_443" { type ="egress" from_port =443 to_port =443 protocol ="tcp" security_group_id ="${aws_security_group.dev_jenkins_worker_windows.id}" cidr_blocks = ["0.0.0.0/0"] description ="allow jenkins worker to all 443"}resource "aws_security_group_rule""jenkins_worker_windows_to_jenkins_server_33453" { type ="egress" from_port =33453 to_port =33453 protocol ="tcp" security_group_id ="${aws_security_group.dev_jenkins_worker_windows.id}" cidr_blocks = 
["172.31.0.0/16"] description ="allow jenkins worker windows to jenkins server"}resource "aws_security_group_rule""jenkins_worker_windows_to_jenkins_server_8080" { type ="egress" from_port =8080 to_port =8080 protocol ="tcp" security_group_id ="${aws_security_group.dev_jenkins_worker_windows.id}" source_security_group_id ="${aws_security_group.jenkins_server.id}" description ="allow jenkins workers windows to jenkins server"}resource "aws_security_group_rule""jenkins_worker_windows_to_all_22" { type ="egress" from_port =22 to_port =22 protocol ="tcp" security_group_id ="${aws_security_group.dev_jenkins_worker_windows.id}" cidr_blocks = ["0.0.0.0/0"] description ="allow jenkins worker windows to connect outbound from 22"}
Once the security groups are in place, we move on to the Terraform file for the Windows machine itself. Windows can't connect to the Jenkins master over SSH, the method we used for the Linux slave; instead, we have to use JNLP. A quick recap: when creating the Jenkins master, we used xmlstarlet to set the JNLP port and added security group rules to allow JNLP connections. We have also opened the RDP port so that if any issue occurs, you can get into the machine and debug it.
Terraform file:
# Setting Up Windows Slave
data "aws_ami" "jenkins_worker_windows" {
  most_recent = true
  owners      = ["self"]

  filter {
    name   = "name"
    values = ["windows-slave-for-jenkins*"]
  }
}

resource "aws_key_pair" "jenkins_worker_windows" {
  key_name   = "jenkins_worker_windows"
  public_key = "${file("jenkins_worker.pub")}"
}

data "template_file" "userdata_jenkins_worker_windows" {
  template = "${file("scripts/jenkins_worker_windows.ps1")}"

  vars {
    env              = "dev"
    region           = "us-east-1"
    datacenter       = "dev-us-east-1"
    node_name        = "us-east-1-jenkins_worker_windows"
    domain           = ""
    device_name      = "eth0"
    server_ip        = "${aws_instance.jenkins_server.private_ip}"
    worker_pem       = "${data.local_file.jenkins_worker_pem.content}"
    jenkins_username = "admin"
    jenkins_password = "mysupersecretpassword"
  }
}

# lookup the security group of the Jenkins worker
data "aws_security_group" "jenkins_worker_windows" {
  filter {
    name   = "group-name"
    values = ["dev_jenkins_worker_windows"]
  }
}

resource "aws_launch_configuration" "jenkins_worker_windows" {
  name_prefix                 = "dev-jenkins-worker-"
  image_id                    = "${data.aws_ami.jenkins_worker_windows.image_id}"
  instance_type               = "t3.medium"
  iam_instance_profile        = "dev_jenkins_worker_windows"
  key_name                    = "${aws_key_pair.jenkins_worker_windows.key_name}"
  security_groups             = ["${data.aws_security_group.jenkins_worker_windows.id}"]
  user_data                   = "${data.template_file.userdata_jenkins_worker_windows.rendered}"
  associate_public_ip_address = false

  root_block_device {
    delete_on_termination = true
    volume_size           = 100
  }

  lifecycle {
    create_before_destroy = true
  }
}

resource "aws_autoscaling_group" "jenkins_worker_windows" {
  name                      = "dev-jenkins-worker-windows"
  min_size                  = "1"
  max_size                  = "2"
  desired_capacity          = "2"
  health_check_grace_period = 60
  health_check_type         = "EC2"
  vpc_zone_identifier       = ["${data.aws_subnet_ids.default_public.ids}"]
  launch_configuration      = "${aws_launch_configuration.jenkins_worker_windows.name}"
  termination_policies      = ["OldestLaunchConfiguration"]
  wait_for_capacity_timeout = "10m"
  default_cooldown          = 60

  #lifecycle {
  #  create_before_destroy = true
  #}

  ## on replacement, gives new service time to spin up before moving on to destroy
  #provisioner "local-exec" {
  #  command = "sleep 60"
  #}

  tags = [
    {
      key                 = "Name"
      value               = "dev_jenkins_worker_windows"
      propagate_at_launch = true
    },
    {
      key                 = "class"
      value               = "dev_jenkins_worker_windows"
      propagate_at_launch = true
    },
  ]
}
Finally, we come to the user data for the Terraform plan. It downloads the required jar files, creates a node on the Jenkins master, and registers the machine as a slave.
<powershell>
function Wait-For-Jenkins {
  Write-Host "Waiting jenkins to launch on 8080..."
  Do {
    Write-Host "Waiting for Jenkins"
    Nc -zv ${server_ip} 8080
    If ( $? -eq $true ) { Break }
    Sleep 10
  } While (1)
  Do {
    Write-Host "Waiting for JNLP"
    Nc -zv ${server_ip} 33453
    If ( $? -eq $true ) { Break }
    Sleep 10
  } While (1)
  Write-Host "Jenkins launched"
}

function Slave-Setup() {
  # Register slave
  $JENKINS_URL = "http://${server_ip}:8080"
  $USERNAME = "${jenkins_username}"
  $PASSWORD = "${jenkins_password}"
  $AUTH = -join ("$USERNAME", ":", "$PASSWORD")
  echo $AUTH

  # The IP collection logic below works for Windows Server 2016 and needs testing for Windows Server 2008
  $SLAVE_IP = (ipconfig | findstr /r "[0-9][0-9]*\.[0-9][0-9]*\.[0-9][0-9]*\.[0-9][0-9]*" | findstr "IPv4 Address").substring(39) | findstr /B "172.31"
  $NODE_NAME = "jenkins-slave-windows-$SLAVE_IP"
  $NODE_SLAVE_HOME = "C:\Jenkins\"
  $EXECUTORS = 2
  $JNLP_PORT = 33453
  $CRED_ID = "$NODE_NAME"
  $LABELS = "build windows"

  # CMD utility for jenkins-cli commands
  # Aliasing is not working in Windows, therefore specify the full path
  $jenkins_cmd = "java -jar C:\Jenkins\jenkins-cli.jar -s $JENKINS_URL -auth admin:$PASSWORD"

  Sleep 20
  Write-Host "Downloading jenkins-cli.jar file"
  (New-Object System.Net.WebClient).DownloadFile("$JENKINS_URL/jnlpJars/jenkins-cli.jar", "C:\Jenkins\jenkins-cli.jar")
  Write-Host "Downloading slave.jar file"
  (New-Object System.Net.WebClient).DownloadFile("$JENKINS_URL/jnlpJars/slave.jar", "C:\Jenkins\slave.jar")
  Sleep 10

  # Waiting for Jenkins to load all plugins
  Do {
    $count = (java -jar C:\Jenkins\jenkins-cli.jar -s $JENKINS_URL -auth $AUTH list-plugins | Measure-Object -line).Lines
    $ret = $?
    Write-Host "count [$count] ret [$ret]"
    If ( $count -gt 0 ) { Break }
    sleep 30
  } While ( 1 )

  # For deleting the node, used when testing
  Write-Host "Deleting Node $NODE_NAME if present"
  java -jar C:\Jenkins\jenkins-cli.jar -s $JENKINS_URL -auth $AUTH delete-node $NODE_NAME

  # Generating node.xml for creating the node on the Jenkins server
  $NodeXml = @"
<slave>
  <name>$NODE_NAME</name>
  <description>Windows Slave</description>
  <remoteFS>$NODE_SLAVE_HOME</remoteFS>
  <numExecutors>$EXECUTORS</numExecutors>
  <mode>NORMAL</mode>
  <retentionStrategy class="hudson.slaves.RetentionStrategy`$Always"/>
  <launcher class="hudson.slaves.JNLPLauncher">
    <workDirSettings>
      <disabled>false</disabled>
      <internalDir>remoting</internalDir>
      <failIfWorkDirIsMissing>false</failIfWorkDirIsMissing>
    </workDirSettings>
  </launcher>
  <label>$LABELS</label>
  <nodeProperties/>
</slave>
"@
  $NodeXml | Out-File -FilePath C:\Jenkins\node.xml
  type C:\Jenkins\node.xml

  # Creating node using node.xml
  Write-Host "Creating $NODE_NAME"
  Get-Content -Path C:\Jenkins\node.xml | java -jar C:\Jenkins\jenkins-cli.jar -s $JENKINS_URL -auth $AUTH create-node $NODE_NAME

  Write-Host "Registering Node $NODE_NAME via JNLP"
  Start-Process java -ArgumentList "-jar C:\Jenkins\slave.jar -jnlpCredentials $AUTH -jnlpUrl $JENKINS_URL/computer/$NODE_NAME/slave-agent.jnlp"
}

### script begins here ###
Wait-For-Jenkins
Slave-Setup
echo "Done"
</powershell>
<persist>true</persist>
Commands to run: initialize Terraform with terraform init, then check and apply the changes with terraform plan followed by terraform apply.
The same drawbacks apply here, and the same solutions will work as well.
Congratulations! You now have a Jenkins master with Windows and Linux slaves attached to it.
This blog highlights one way to use Packer and Terraform to create AMIs that serve as a Jenkins master and slaves. We covered not only their creation but also how to associate security groups, and we looked at some of the basic IAM roles that can be applied. Although we have covered most of the common scenarios, the changes required for your specific use case should be small, so this can serve as boilerplate code when you begin planning your infrastructure in the cloud.
It has almost been a decade since Marc Andreessen made this prescient statement. Software is not only eating the world but doing so at an accelerating pace. There is no industry that hasn’t been challenged by technology startups with disruptive approaches.
Automakers are no longer just manufacturing companies: Tesla is disrupting the industry with their software approach to vehicle development and continuous over-the-air software delivery. Waymo’s autonomous cars have driven millions of miles and self-driving cars are a near-term reality. Uber is transforming the transportation industry into a service, potentially affecting the economics and incentives of almost 3–4% of the world GDP!
Social networks and media platforms had a significant and decisive impact on the US election results.
Banks and large financial institutions are being attacked by FinTech startups like WealthFront, Venmo, Affirm, Stripe, SoFi, etc. Bitcoin, Ethereum and the broader blockchain revolution can upend the core structure of banks and even sovereign currencies.
Traditional retail businesses are under tremendous pressure due to Amazon and other e-commerce vendors. Retail is now a customer ownership, recommendations, and optimization business rather than a brick and mortar one.
Enterprises need to adopt a new approach to software development and digital innovation. At Velotio, we are helping customers to modernize and transform their business with all of the approaches and best practices listed below.
Agility
In this fast-changing world, your business needs to be agile and fast-moving. You need to ship software faster, at a regular cadence, with high quality and be able to scale it globally.
Agile practices allow companies to rally diverse teams behind a defined process that helps to achieve inclusivity and drives productivity. Agile is about getting cross-functional teams to work in concert in planned short iterations with continuous learning and improvement.
Generally, teams that work in an Agile methodology will:
Conduct regular stand-ups and Scrum/Kanban planning meetings with the optimal use of tools like Jira, PivotalTracker, Rally, etc.
Use pair programming and code review practices to ensure better code quality.
Use continuous integration and delivery tools like Jenkins or CircleCI.
Design processes for all aspects of product management, development, QA, DevOps and SRE.
Use Slack, Hipchat or Teams for communication between team members and geographically diverse teams. Integrate all tools with Slack to ensure that it becomes the central hub for notifications and engagement.
Cloud-Native
Businesses need software that is purpose-built for the cloud model. What does that mean? Software teams now number in the hundreds or even thousands of engineers. The number of applications and software stacks is growing rapidly in most companies. All companies use various cloud providers, SaaS vendors, and best-of-breed hosted or on-premise software. Essentially, software complexity has increased exponentially, which requires a "cloud-native" approach to manage effectively. The Cloud Native Computing Foundation defines cloud native as a software stack that is:
Containerized: Each part (applications, processes, etc) is packaged in its own container. This facilitates reproducibility, transparency, and resource isolation.
Dynamically orchestrated: Containers are actively scheduled and managed to optimize resource utilization.
Microservices oriented: Applications are segmented into micro services. This significantly increases the overall agility and maintainability of applications.
You can deep-dive into cloud native with this blog by our CTO, Chirag Jog.
Cloud native is disrupting the traditional enterprise software vendors. Software is getting decomposed into specialized best of breed components — much like the micro-services architecture. See the Cloud Native landscape below from CNCF.
DevOps
Process and toolsets need to change to enable faster development and deployment of software. Enterprises cannot compete without mature DevOps strategies. DevOps is essentially a set of practices, processes, culture, tooling, and automation that focuses on delivering software continuously with high quality.
DevOps tool chains & process
As you begin or expand your DevOps journey, a few things to keep in mind:
Customize to your needs: There is no single DevOps process or toolchain that suits all needs. Take into account your organization structure, team capabilities, current software process, opportunities for automation and goals while making decisions. For example, your infrastructure team may have automated deployments but the main source of your quality issues could be the lack of code reviews in your development team. So identify the critical pain points and sources of delay to address those first.
Automation: Automate everything that can be automated. The less you depend on human intervention, the higher your chances of success.
Culture: Align incentives and goals across your development, ITOps, SecOps, and SRE teams. Ensure that they collaborate effectively and that ownership in the DevOps pipeline is well established.
Small wins: Pick one application or team and implement your DevOps strategy within it. That way you can focus your energies and refine your experiments before applying them broadly. Show success as measured by quantifiable parameters and use that to transform the rest of your teams.
Organizational dynamics & integrations: Adoption of new processes and tools will cause some disruptions and you may need to re-skill part of your team or hire externally. Ensure that compliance, SecOps & audit teams are aware of your DevOps journey and get their buy-in.
DevOps is a continuous journey: DevOps will never be done. Train your team to learn continuously and refine your DevOps practice to keep achieving your goal: delivering software reliably and quickly.
Micro-services
As the amount of software in an enterprise explodes, so does the complexity. The only way to manage this complexity is by splitting your software and teams into smaller, manageable units. Micro-services adoption is primarily about managing this complexity.
Development teams across the board are choosing micro services to develop new applications and break down legacy monoliths. Every micro-service can be deployed, upgraded, scaled, monitored and restarted independent of other services. Micro-services should ideally be managed by an automated system so that teams can easily update live applications without affecting end-users.
Some companies run hundreds of micro-services in production, which is only possible with mature DevOps, cloud-native, and agile practice adoption.
Interestingly, serverless platforms like Google Functions and AWS Lambda are taking the concept of micro-services to the extreme by allowing each function to act like an independent piece of the application. You can read about my thoughts on serverless computing in this blog: Serverless Computing Predictions for 2017.
Digital Transformation
Digital transformation involves making strategic changes to business processes, competencies, and models to leverage digital technologies. It is a very broad term, and every consulting vendor twists it in various ways. Let me give a couple of examples to drive home the point that digital transformation is about using technology to improve your business model, gain efficiencies, or build a moat around your business:
GE has done an excellent job transforming itself from a manufacturing company into an IoT/software company with Predix. GE builds airplane engines, medical equipment, oil & gas equipment, and much more. Predix is an IoT platform that is being embedded into all of GE's products. This enabled GE to charge airlines on a per-mile basis by taking ownership of maintenance and quality, instead of charging on a one-time basis. It also gives GE huge amounts of data that it can leverage to improve the business as a whole. So digital innovation has enabled a business model improvement leading to higher profits.
Car companies are exploring models where they can provide autonomous car fleets to cities where they will charge on a per-mile basis. This will convert them into a “service” & “data” company from a pure manufacturing one.
Insurance companies need to build digital capabilities to acquire and retain customers. They need to build data capabilities and provide ongoing value with services, rather than interact with the customer just once a year.
You will be better placed to compete in the market if you have automation and digital processes in place, so that you can build new products and pivot in an agile manner.
Big Data / Data Science
Businesses need to deal with increasing amounts of data from IoT, social media, mobile, and the adoption of software for various processes, and they need to use this data intelligently. Cloud platforms provide the services and solutions to accelerate your data science and machine learning strategies. AWS, Google Cloud, and open-source libraries like TensorFlow, SciPy, Keras, etc. offer a broad set of machine learning and big data services that can be leveraged. Companies need to build mature data processing pipelines to aggregate data from various sources and store it for quick and efficient access by various teams. Companies are leveraging these services and libraries to build solutions like:
Predictive analytics
Cognitive computing
Robotic Process Automation
Fraud detection
Customer churn and segmentation analysis
Recommendation engines
Forecasting
Anomaly detection
Companies are creating data science teams to build long term capabilities and moats around their business by using their data smartly.
Re-platforming & App Modernization
Enterprises want to modernize their legacy, often monolithic apps as they migrate to the cloud. The move can be triggered by hardware refresh cycles, license renewals, IT cost optimization, or the adoption of software-focused business models.
Benefits of modernization to customers and businesses
Intelligent Applications
Software is getting more intelligent, and to enable this, businesses need to integrate disparate datasets, distributed teams, and processes. This is best done on a scalable, global cloud platform with agile processes. Big data and data science enable the creation of intelligent applications.
How can smart applications help your business?
New intelligent systems of engagement: Intelligent apps surface insights to users, enabling them to be more effective and efficient. For example, CRMs and marketing software are getting intelligent and multi-platform, enabling sales and marketing reps to become more productive.
Personalisation: E-commerce, social networks, and now B2B software are getting personalized. To improve user experience and reduce churn, your applications should be personalized based on user preferences and traits.
Drive efficiencies: IoT is an excellent example where the efficiency of machines can be improved with data and cloud software. Real-time insights can help to optimize processes or can be used for preventive maintenance.
Creation of new business models: Traditional and modern industries can use AI to build new business models. For example, what if insurance companies allow you to pay insurance premiums only for the miles driven?
Security
Security threats to governments, enterprises, and data have never been greater. As businesses adopt cloud-native, DevOps, and micro-services practices, their security practices need to evolve.
In our experience, these are a few features of a mature cloud-native security practice:
Automated: Systems are updated automatically with the latest fixes. Another approach is immutable infrastructure through the adoption of containers and serverless.
Proactive: Automated security processes tend to be proactive. For example, if malware or a vulnerability is found in one environment, automation can fix it in all environments. Mature DevOps and CI/CD processes ensure that fixes can be deployed in hours or days instead of weeks or months.
Cloud Platforms: Businesses have realized that the mega-clouds are way more secure than their own data centers can be. Many of these cloud platforms have audit, security and compliance services which should be leveraged.
Protecting credentials: Use AWS KMS, Hashicorp Vault or other solutions for protecting keys, passwords and authorizations.
Bug bounties: Either set up bug bounties internally or through sites like HackerOne. You want the good guys working for you, and this is an easy way to do that.
Conclusion
As you can see, all of these approaches and best practices are intertwined and need to be implemented in concert to gain the desired results. It is best to start with one project, one group, or one application and build on early wins. Remember that this is a process, and you are looking for gradual improvements to achieve your final objectives.
Please let us know your thoughts and experiences by adding comments to this blog or reaching out to @kalpakshah or RSI. We would love to help your business adopt these best practices and help to build great software together. Drop me a note at kalpak (at) velotio (dot) com.
Web scraping is a process to crawl various websites and extract the required data using spiders. This data is processed in a data pipeline and stored in a structured format. Today, web scraping is widely used and has many use cases:
Using web scraping, Marketing & Sales companies can fetch lead-related information.
Web scraping is useful for Real Estate businesses to get the data of new projects, resale properties, etc.
Price comparison portals, like Trivago, extensively use web scraping to get the information of product and price from various e-commerce sites.
The process of web scraping usually involves spiders, which fetch the HTML documents from relevant websites, extract the needed content based on the business logic, and finally store it in a specific format. This blog is a primer on building highly scalable scrapers. We will cover the following items:
Ways to scrape: We’ll see basic ways to scrape data using techniques and frameworks in Python with some code snippets.
Scraping at scale: Scraping a single page is straightforward, but there are challenges in scraping millions of websites, including managing the spider code, collecting data, and maintaining a data warehouse. We’ll explore such challenges and their solutions to make scraping easy and accurate.
Scraping Guidelines: Scraping data from websites without the owner's permission can be deemed malicious. Certain guidelines need to be followed to ensure our scrapers are not blacklisted. We'll look at some of the best practices one should follow for crawling.
So let’s start scraping.
Different Techniques for Scraping
Here, we will discuss how to scrape a page and the different libraries available in Python.
Note: Python is the most popular language for scraping.
1. Requests – HTTP library in Python: To scrape a website or a page, first fetch the content of the HTML page in an HTTP response object. The requests library in Python is pretty handy and easy to use; it uses urllib3 under the hood. I like requests because it's simple and keeps the code readable.
# Example showing how to use the requests library
import requests

r = requests.get("https://velotio.com")  # Fetch HTML page
2. BeautifulSoup: Once you get the webpage, the next step is to extract the data. BeautifulSoup is a powerful Python library that helps you extract the data from the page. It’s easy to use and has a wide range of APIs that’ll help you extract the data. We use the requests library to fetch an HTML page and then use the BeautifulSoup to parse that page. In this example, we can easily fetch the page title and all links on the page. Check out the documentation for all the possible ways in which we can use BeautifulSoup.
from bs4 import BeautifulSoup
import requests

r = requests.get("https://velotio.com")      # Fetch HTML page
soup = BeautifulSoup(r.text, "html.parser")  # Parse HTML page
print("Webpage Title: " + soup.title.string)
print("Fetch All Links:")
print(soup.find_all('a'))
3. Python Scrapy Framework:
Scrapy is a Python-based web scraping framework that allows you to create different kinds of spiders to fetch the source code of the target website. Scrapy starts crawling the web pages present on a certain website, and then you can write the extraction logic to get the required data. Scrapy is built on the top of Twisted, a Python-based asynchronous library that performs the requests in an async fashion to boost up the spider performance. Scrapy is faster than BeautifulSoup. Moreover, it is a framework to write scrapers as opposed to BeautifulSoup, which is just a library to parse HTML pages.
Here is a simple example of how to use Scrapy. Install Scrapy via pip. Scrapy gives a shell after parsing a website:
$ pip install scrapy  # Install Scrapy
$ scrapy shell https://velotio.com
In [1]: response.xpath("//a").extract()  # Fetch all a hrefs
Now, let’s write a custom spider to parse a website.
$ cat > myspider.py <<EOF
import scrapy

class BlogSpider(scrapy.Spider):
    name = 'blogspider'
    start_urls = ['https://blog.scrapinghub.com']

    def parse(self, response):
        for title in response.css('h2.entry-title'):
            yield {'title': title.css('a ::text').extract_first()}
EOF
$ scrapy runspider myspider.py
That's it. Your first custom spider is created. Now, let's understand the code.
name: Name of the spider. In this case, it’s “blogspider”.
start_urls: A list of URLs where the spider will begin to crawl from.
parse(self, response): This function is called whenever the crawler successfully crawls a URL. The response object used earlier in the Scrapy shell is the same response object that is passed to parse().
When you run this, Scrapy starts from the start URLs, selects all elements matching h2.entry-title, and extracts the associated text. You can write your extraction logic in the parse method, or create a separate class for extraction and call its object from the parse method.
You’ve seen how to extract simple items from a website using Scrapy, but this is just the surface. Scrapy provides a lot of powerful features for making scraping easy and efficient. Here is a tutorial for Scrapy and the additional documentation for LinkExtractor by which you can instruct Scrapy to extract links from a web page.
4. Python lxml.html library: This is another Python library, much like BeautifulSoup; in fact, Scrapy uses lxml internally. It comes with a list of APIs you can use for data extraction. Why would you use it when Scrapy itself can extract the data? Suppose you want to iterate over every 'div' tag and perform some operation on each one: this library gives you a list of 'div' tags, and you can iterate over each one using the iter() function to traverse the child tags inside the parent div. Such traversal operations are difficult with plain extraction selectors. Here is the documentation for this library.
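As a minimal, self-contained sketch of that traversal (assuming lxml is installed, and using an inline HTML string in place of a fetched page):

```python
# Sketch: traversing div tags with lxml.html; the HTML snippet is made up
from lxml import html

doc = html.fromstring("""
<html><body>
  <div id="posts">
    <h2>First post</h2>
    <a href="/first">read</a>
  </div>
  <div id="about"><p>About us</p></div>
</body></html>
""")

# Collect every div, then walk each div's subtree with iter()
for div in doc.findall(".//div"):
    child_tags = [child.tag for child in div.iter() if child is not div]
    print(div.get("id"), child_tags)
```

Here findall(".//div") returns the list of div elements, and iter() walks each div's subtree in document order.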
Challenges while Scraping at Scale
Let’s look at the challenges and solutions while scraping at large scale, i.e., scraping 100-200 websites regularly:
1. Data warehousing: Data extraction at a large scale generates vast volumes of information. Fault tolerance, scalability, security, and high availability are must-have features for a data warehouse. If your data warehouse is not stable or accessible, operations like search and filter over the data become an overhead. To achieve this, instead of maintaining your own database or infrastructure, you can use Amazon Web Services (AWS): RDS (Relational Database Service) for a structured database and DynamoDB for a non-relational database. AWS takes care of data backups, automatically takes snapshots of the database, and gives you database error logs as well. This blog explains how to set up infrastructure in the cloud for scraping.
2. Pattern Changes: Scraping relies heavily on the target's user interface and its structure, i.e., CSS and XPath. If the target website changes, our scraper may crash completely or return random data that we don't want. This is a common scenario, and it's why maintaining scrapers is harder than writing them. To handle this case, we can write test cases for the extraction logic and run them daily, either manually or from CI tools like Jenkins, to track whether the target website has changed.
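Such a regression check can be sketched as below; the selector, sample HTML, and price field are hypothetical, and in practice the sample would be a recently saved copy of the target page:

```python
# Sketch: a daily regression check for extraction logic (selector and HTML are hypothetical)
from bs4 import BeautifulSoup

SAMPLE_HTML = """
<html><body>
  <div class="listing"><span class="price">120</span></div>
  <div class="listing"><span class="price">95</span></div>
</body></html>
"""

def extract_prices(html_text):
    soup = BeautifulSoup(html_text, "html.parser")
    return [int(tag.get_text()) for tag in soup.select("div.listing span.price")]

def test_extraction_still_matches():
    prices = extract_prices(SAMPLE_HTML)
    # If the site changed its markup, this fails loudly instead of
    # letting the spider silently scrape bad data.
    assert len(prices) == 2 and all(p > 0 for p in prices)

test_extraction_still_matches()
print("extraction logic OK")
```

Running a check like this from Jenkins on a schedule turns silent pattern changes into visible build failures.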
3. Anti-scraping Technologies: Web scraping is common these days, and many website hosts want to prevent their data from being scraped; anti-scraping technologies help them do that. For example, if you hit a particular website from the same IP address at a regular interval, the target website can block your IP. Adding a captcha to a website also helps. There are methods to bypass these anti-scraping measures; for example, we can use proxy servers to hide our original IP. Several proxy services rotate the IP before each request. It is also easy to add support for proxy servers in code, and in Python, the Scrapy framework supports it.
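Proxy rotation can be sketched as a simple round-robin helper; in Scrapy you would attach the chosen proxy to each request via request.meta["proxy"]. The proxy URLs below are placeholders:

```python
# Sketch: round-robin proxy rotation (proxy URLs are placeholders)
from itertools import cycle

PROXIES = cycle([
    "http://proxy1.example.com:8080",
    "http://proxy2.example.com:8080",
    "http://proxy3.example.com:8080",
])

def next_proxy_meta():
    # In a Scrapy spider you would do: request.meta["proxy"] = next(PROXIES)
    return {"proxy": next(PROXIES)}

for _ in range(4):
    print(next_proxy_meta())
```

Commercial rotating-proxy services do this server-side, so each outgoing request already leaves from a different IP.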
4. JavaScript-based dynamic content: Websites that rely heavily on JavaScript and Ajax to render dynamic content make data extraction difficult. Scrapy and related frameworks/libraries only extract what they find in the HTML document; Ajax calls and JavaScript execute at runtime, so they can't be scraped directly. This can be handled by rendering the web page in a headless browser such as Headless Chrome, which essentially allows running Chrome in a server environment. You can also use PhantomJS, which provides a headless WebKit-based environment.
5. Honeypot traps: Some websites place honeypot links on their pages to detect web crawlers. These are hard to spot, as the links are blended with the background color or have their CSS display property set to none. Setting up such traps requires a large coding effort on the server side, and detecting them requires effort on the crawler side, so this method is not frequently used.
6. Quality of data: Currently, AI and ML projects are in high demand, and these projects need data at large scale. Data integrity is also important, as one fault can cause serious problems in AI/ML algorithms. So, in scraping, it is very important not just to scrape the data but to verify its integrity as well. Doing this in real time is not always possible, so I prefer to write test cases for the extraction logic to make sure whatever your spiders are extracting is correct and they are not scraping any bad data.
7. More Data, More Time: This one is obvious. The larger a website is, the more data it contains and the longer it takes to scrape. This may be fine if your purpose for scanning the site isn't time-sensitive, but that often isn't the case. Stock prices don't stay the same over hours. Sales listings, currency exchange rates, media trends, and market prices are just a few examples of time-sensitive data. What to do in this case? One solution is to design your spiders carefully: if you're using a framework like Scrapy, apply proper LinkExtractor rules so that the spider doesn't waste time scraping unrelated URLs.
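The allow/deny idea behind LinkExtractor rules can be sketched in plain Python; the patterns and URLs below are hypothetical:

```python
# Sketch: allow/deny URL filtering in the spirit of Scrapy's LinkExtractor rules
import re

ALLOW = [r"/products/"]          # only follow product pages
DENY = [r"\.pdf$", r"/login"]    # skip binaries and auth pages

def should_follow(url):
    if any(re.search(pattern, url) for pattern in DENY):
        return False
    return any(re.search(pattern, url) for pattern in ALLOW)

urls = [
    "https://example.com/products/123",
    "https://example.com/products/catalog.pdf",
    "https://example.com/login",
]
print([u for u in urls if should_follow(u)])
```

In Scrapy itself, you would pass equivalent allow and deny regexes to LinkExtractor inside a CrawlSpider rule rather than filtering by hand.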
You may use multithreading scraping packages available in Python, such as Frontera and Scrapy Redis. Frontera lets you send out only one request per domain at a time, but can hit multiple domains at once, making it great for parallel scraping. Scrapy Redis lets you send out multiple requests to one domain. The right combination of these can result in a very powerful web spider that can handle both the bulk and variation for large websites.
8. Captchas: Captchas is a good way of keeping crawlers away from a website and it is used by many website hosts. So, in order to scrape the data from such websites, we need a mechanism to solve the captchas. There are packages, software that can solve the captcha and can act as a middleware between the target website and your spider. Also, you may use libraries like Pillow and Tesseract in Python to solve the simple image-based captchas.
9. Maintaining Deployment: Normally, we don't want to limit ourselves to scraping just a few websites. We may want as much of the data on the Internet as possible, which can mean scraping millions of websites; you can imagine the size of the code and the deployment. We can't run spiders at this scale from a single machine. What I prefer here is to dockerize the scrapers and take advantage of technologies like AWS ECS or Kubernetes to run the scraper containers. This keeps our scrapers highly available and easy to maintain, and we can also schedule them to run at regular intervals.
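As a sketch of that approach, a containerized spider image could be built from a Dockerfile like the one below; the project layout, requirements file, and spider name (blogspider, from the earlier example) are assumptions:

```dockerfile
# Hypothetical image for a Scrapy project; adjust paths and the spider name
FROM python:3.9-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt   # requirements.txt includes scrapy
COPY . .
# Run a single spider; ECS or Kubernetes can schedule this container at intervals
CMD ["scrapy", "crawl", "blogspider"]
```

Each spider then becomes an identical, disposable container, which is what makes scheduling and horizontal scaling on ECS or Kubernetes straightforward.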
Scraping Guidelines/ Best Practices
1. Respect the robots.txt file: Robots.txt is a text file that webmasters create to instruct search engine robots on how to crawl and index pages on their website. Before even planning the extraction logic, you should check this file; you can usually find it at the root of the website (e.g., https://example.com/robots.txt). It holds the rules for how crawlers should interact with the website. For example, if a website has a link to download critical information, it probably doesn't want to expose that to crawlers. Another important directive is the crawl delay, which means crawlers should hit the website only at the specified interval. If someone has asked not to crawl their website, then we'd better not do it, because if they catch your crawlers, it can lead to serious legal issues.
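Python's standard library ships a robots.txt parser, so checking these rules takes only a few lines. A minimal sketch (the rules below are made up for illustration; in practice you would point the parser at the live file with set_url and read):

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt content; real code would do:
#   rp.set_url("https://example.com/robots.txt"); rp.read()
rules = """User-agent: *
Disallow: /private/
Crawl-delay: 10""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

print(rp.can_fetch("*", "https://example.com/private/data"))  # disallowed
print(rp.can_fetch("*", "https://example.com/public/page"))   # allowed
print(rp.crawl_delay("*"))                                    # 10
```

Checking can_fetch before every request, and honoring crawl_delay between requests, covers the two rules discussed above.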
2. Do not hit the servers too frequently: As mentioned above, some websites specify a crawl delay for crawlers. We'd better use it wisely, because not every website is tested against high load. Hitting a server at a relentless, constant rate creates huge traffic on the server side, and it may crash or fail to serve other requests. This has a high impact on user experience, and users are more important than bots. So, we should make requests according to the interval specified in robots.txt, or use a conservative default delay (say, 10 seconds). This also helps you avoid getting blocked by the target website.
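In Scrapy this is just the DOWNLOAD_DELAY setting in settings.py; with plain scripts, a tiny helper can enforce the same politeness. A sketch using only the standard library (the class name and the 0.5-second demo delay are illustrative; use the robots.txt value or a conservative default in practice):

```python
import time

class PoliteThrottle:
    """Sleep as needed so consecutive requests to the same host
    are at least `delay` seconds apart."""

    def __init__(self, delay=10.0):
        self.delay = delay
        self._last = {}  # host -> monotonic timestamp of the last request

    def wait(self, host):
        now = time.monotonic()
        last = self._last.get(host)
        if last is not None:
            remaining = self.delay - (now - last)
            if remaining > 0:
                time.sleep(remaining)  # back off before the next hit
        self._last[host] = time.monotonic()

throttle = PoliteThrottle(delay=0.5)
throttle.wait("example.com")  # first call returns immediately
throttle.wait("example.com")  # second call sleeps ~0.5 seconds
```

Call throttle.wait(host) before each request; per-host bookkeeping means parallel crawls of different domains are not slowed down by each other.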
3. User-Agent Rotation and Spoofing: Every request carries a User-Agent string in its headers. This string identifies the browser you are using, its version, and the platform. If we use the same User-Agent in every request, it's easy for the target website to detect that the requests are coming from a crawler. So, to avoid this, rotate the User-Agent between requests. You can easily find examples of genuine User-Agent strings on the Internet; try them out. If you're using Scrapy, you can set the USER_AGENT property in settings.py.
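Rotation itself is simple; a standard-library sketch (the User-Agent strings below are illustrative placeholders — substitute genuine, current ones):

```python
import random
import urllib.request

# Illustrative pool; in practice, collect genuine, up-to-date UA strings.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0",
]

def build_request(url):
    """Return a urllib Request carrying a randomly chosen User-Agent."""
    ua = random.choice(USER_AGENTS)
    return urllib.request.Request(url, headers={"User-Agent": ua})

req = build_request("https://example.com")
print(req.get_header("User-agent"))  # one of the strings above
```

In Scrapy, the same idea is usually implemented as a downloader middleware that sets request.headers['User-Agent'] per request.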
4. Disguise your requests by rotating IPs and Proxy Services: We’ve discussed this in the challenges above. It’s always better to use rotating IPs and proxy service so that your spider won’t get blocked.
5. Do not follow the same crawling pattern: As you know, many websites use anti-scraping technologies, so it's easy for them to detect your spider if it crawls in the same pattern. As humans, we would not normally follow a fixed pattern on a particular website. So, to have your spiders run smoothly, we can introduce actions like mouse movements, clicking a random link, etc., which give the impression that your spider is human.
6. Scrape during off-peak hours: Off-peak hours are suitable for bots/crawlers as the traffic on the website is considerably less. These hours can be identified by the geolocation from where the site’s traffic originates. This also helps to improve the crawling rate and avoid the extra load from spider requests. Thus, it is advisable to schedule the crawlers to run in the off-peak hours.
7. Use the scraped data responsibly: We should always take responsibility for the data we scrape. Scraping data and then republishing it elsewhere is not acceptable; it can be considered a breach of copyright law and may lead to legal issues. So, it is advisable to check the target website's Terms of Service page before scraping.
8. Use Canonical URLs: When we scrape, we tend to scrape duplicate URLs, and hence duplicate data, which is the last thing we want. A single website can have multiple URLs serving the same data. In this situation, the duplicate URLs declare a canonical URL, which points to the parent or original URL. By using it, we make sure we don't scrape duplicate content. In frameworks like Scrapy, duplicate URLs are handled by default.
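Extracting a page's canonical URL needs nothing beyond the standard-library HTML parser. A sketch (the HTML and URL are made up for illustration):

```python
from html.parser import HTMLParser

class CanonicalParser(HTMLParser):
    """Collects the href of <link rel="canonical"> if the page has one."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        if tag == "link":
            d = dict(attrs)
            if d.get("rel") == "canonical":
                self.canonical = d.get("href")

html = """<html><head>
<link rel="canonical" href="https://example.com/product/42"/>
</head><body>duplicate listing page</body></html>"""

p = CanonicalParser()
p.feed(html)
print(p.canonical)  # https://example.com/product/42
```

Deduplicating on the canonical URL (when present) instead of the request URL collapses all variants of a page into one record.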
9. Be transparent: Don’t misrepresent your purpose or use deceptive methods to gain access. If you have a login and a password that identifies you to gain access to a source, use it. Don’t hide who you are. If possible, share your credentials.
Conclusion
We’ve seen the basics of scraping, frameworks, how to crawl, and the best practices of scraping. To conclude:
Follow the target website's rules while scraping. Don't give them a reason to block your spider.
Maintenance of data and spiders at scale is difficult. Use Docker/Kubernetes and public cloud providers, like AWS, to easily scale your web-scraping backend.
Always respect the rules of the websites you plan to crawl. If APIs are available, always use them first.
Asynchronous programming is a type of parallel programming in which a unit of work is allowed to run separately from the primary application thread. When the work is complete, it notifies the main thread about completion or failure of the worker thread. There are numerous benefits to using it, such as improved application performance and enhanced responsiveness.
Asynchronous programming has been gaining a lot of attention in the past few years, and for good reason. Although it can be more difficult than the traditional linear style, it is also much more efficient.
For example, instead of waiting for an HTTP request to finish before continuing execution, with Python async coroutines you can submit the request and then work through other queued tasks while the HTTP request completes.
Asynchronicity seems to be a big reason why Node.js is so popular for server-side programming. Much of the code we write, especially in I/O-heavy applications like websites, depends on external resources. This could be anything from a remote database call to a POST to a REST service. As soon as you ask for any of these resources, your code is waiting around with nothing to do. With asynchronous programming, you allow your code to handle other tasks while waiting for these resources to respond.
How Does Python Do Multiple Things At Once?
1. Multiple Processes
The most obvious way is to use multiple processes. From the terminal, you can start your script two, three, four... ten times, and all the scripts will run independently and at the same time. The operating system underneath will take care of sharing the CPU among all those instances. Alternatively, you can use the multiprocessing library, which supports spawning processes, as shown in the example below.
from multiprocessing import Process

def print_func(continent='Asia'):
    print('The name of continent is : ', continent)

if __name__ == "__main__":  # confirms that the code is under main function
    names = ['America', 'Europe', 'Africa']
    procs = []

    # instantiating without any argument
    proc = Process(target=print_func)
    procs.append(proc)
    proc.start()

    # instantiating process with arguments
    for name in names:
        proc = Process(target=print_func, args=(name,))
        procs.append(proc)
        proc.start()

    # complete the processes
    for proc in procs:
        proc.join()
Output:
The name of continent is : Asia
The name of continent is : America
The name of continent is : Europe
The name of continent is : Africa
2. Multiple Threads
The next way to run multiple things at once is to use threads. A thread is a line of execution, pretty much like a process, but you can have multiple threads in the context of one process, and they all share access to common resources. Because of this, it's harder to write correct threaded code. Again, the operating system does all the heavy lifting of sharing the CPU, but the global interpreter lock (GIL) allows only one thread to run Python code at a given time, even when you have multiple threads running. So, in CPython, the GIL prevents multi-core concurrency: you're effectively running on a single core even though you may have two, four, or more.
import threading

def print_cube(num):
    """ function to print cube of given num """
    print("Cube: {}".format(num * num * num))

def print_square(num):
    """ function to print square of given num """
    print("Square: {}".format(num * num))

if __name__ == "__main__":
    # creating threads
    t1 = threading.Thread(target=print_square, args=(10,))
    t2 = threading.Thread(target=print_cube, args=(10,))

    # starting thread 1
    t1.start()
    # starting thread 2
    t2.start()

    # wait until thread 1 is completely executed
    t1.join()
    # wait until thread 2 is completely executed
    t2.join()

    # both threads completely executed
    print("Done!")
Output:
Square: 100
Cube: 1000
Done!
3. Coroutines using yield:
Coroutines are a generalization of subroutines. They are used for cooperative multitasking, where a process voluntarily yields (gives away) control periodically, or when idle, in order to enable multiple applications to run simultaneously. Coroutines are similar to generators, but with a few extra methods and a slight change in how we use the yield statement. Generators produce data for iteration, while coroutines can also consume data.
def print_name(prefix):
    print("Searching prefix: {}".format(prefix))
    try:
        while True:
            # yield used to create coroutine
            name = (yield)
            if prefix in name:
                print(name)
    except GeneratorExit:
        print("Closing coroutine!!")

corou = print_name("Dear")
corou.__next__()          # advance to the first yield
corou.send("James")       # no "Dear" prefix, so nothing is printed
corou.send("Dear James")  # printed
corou.close()
The fourth way is asynchronous programming, where the OS is not participating. As far as the OS is concerned, you're going to have one process with a single thread inside it, but you'll be able to do multiple things at once. So, what's the trick?
The answer is asyncio
asyncio is the concurrency module introduced in Python 3.4. It is designed to use coroutines and futures to simplify asynchronous code and make it almost as readable as synchronous code, since there are no callbacks.
asyncio uses different constructs: event loops, coroutines and futures.
An event loop manages and distributes the execution of different tasks. It registers them and handles distributing the flow of control between them.
Coroutines (covered above) are special functions that work similarly to Python generators: on await they release the flow of control back to the event loop. A coroutine needs to be scheduled to run on the event loop; once scheduled, coroutines are wrapped in Tasks, which are a type of Future.
Futures represent the result of a task that may or may not have been executed. This result may be an exception.
Using asyncio, you can structure your code so subtasks are defined as coroutines, which lets you schedule them as you please, including simultaneously. Coroutines contain yield points that define where a context switch can happen if other tasks are pending; if no other task is pending, no switch occurs.
A context switch in asyncio represents the event loop yielding the flow of control from one coroutine to the next.
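These constructs come together in very few lines. A minimal, standard-library-only sketch (asyncio.sleep stands in for real I/O such as an HTTP request; asyncio.run is the Python 3.7+ entry point, a newer alternative to the loop.run_forever style used further below):

```python
import asyncio

async def fetch(name, delay):
    # Awaiting asyncio.sleep yields control back to the event loop,
    # exactly like awaiting a real network response would.
    await asyncio.sleep(delay)
    return name

async def main():
    # Schedule three coroutines concurrently; total runtime is roughly
    # the longest delay, not the sum of all three.
    results = await asyncio.gather(
        fetch("python", 0.3),
        fetch("programming", 0.2),
        fetch("compsci", 0.1),
    )
    print(results)  # ['python', 'programming', 'compsci']

asyncio.run(main())
```

The Reddit example that follows applies the same pattern, only with real HTTP requests in place of asyncio.sleep.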
In the example, we run three async tasks that query Reddit separately, extract the JSON, and print it. We leverage aiohttp, an HTTP client library, ensuring that even the HTTP requests run asynchronously.
import signal
import sys
import asyncio
import aiohttp
import json

loop = asyncio.get_event_loop()
client = aiohttp.ClientSession(loop=loop)

async def get_json(client, url):
    async with client.get(url) as response:
        assert response.status == 200
        return await response.read()

async def get_reddit_top(subreddit, client):
    data1 = await get_json(client, 'https://www.reddit.com/r/' + subreddit + '/top.json?sort=top&t=day&limit=5')
    j = json.loads(data1.decode('utf-8'))
    for i in j['data']['children']:
        score = i['data']['score']
        title = i['data']['title']
        link = i['data']['url']
        print(str(score) + ': ' + title + ' (' + link + ')')
    print('DONE:', subreddit + '\n')

def signal_handler(signal, frame):
    loop.stop()
    client.close()
    sys.exit(0)

signal.signal(signal.SIGINT, signal_handler)

asyncio.ensure_future(get_reddit_top('python', client))
asyncio.ensure_future(get_reddit_top('programming', client))
asyncio.ensure_future(get_reddit_top('compsci', client))
loop.run_forever()
Output:
50: Undershoot: Parsing theory in 1965 (http://jeffreykegler.github.io/Ocean-of-Awareness-blog/individual/2018/07/knuth_1965_2.html)
12: Question about best-prefix/failure function/primal match table in kmp algorithm (https://www.reddit.com/r/compsci/comments/8xd3m2/question_about_bestprefixfailure_functionprimal/)
1: Question regarding calculating the probability of failure of a RAID system (https://www.reddit.com/r/compsci/comments/8xbkk2/question_regarding_calculating_the_probability_of/)
DONE: compsci

336: /r/thanosdidnothingwrong -- banning people with python (https://clips.twitch.tv/AstutePluckyCocoaLitty)
175: PythonRobotics: Python sample codes for robotics algorithms (https://atsushisakai.github.io/PythonRobotics/)
23: Python and Flask Tutorial in VS Code (https://code.visualstudio.com/docs/python/tutorial-flask)
17: Started a new blog on Celery - what would you like to read about? (https://www.python-celery.com)
14: A Simple Anomaly Detection Algorithm in Python (https://medium.com/@mathmare_/pyng-a-simple-anomaly-detection-algorithm-2f355d7dc054)
DONE: python

1360: git bundle (https://dev.to/gabeguz/git-bundle-2l5o)
1191: Which hashing algorithm is best for uniqueness and speed? Ian Boyd's answer (top voted) is one of the best comments I've seen on Stackexchange. (https://softwareengineering.stackexchange.com/questions/49550/which-hashing-algorithm-is-best-for-uniqueness-and-speed)
430: ARM launches "Facts" campaign against RISC-V (https://riscv-basics.com/)
244: Choice of search engine on Android nuked by "Anonymous Coward" (2009) (https://android.googlesource.com/platform/packages/apps/GlobalSearch/+/592150ac00086400415afe936d96f04d3be3ba0c)
209: Exploiting freely accessible WhatsApp data or "Why does WhatsApp web know my phone's battery level?" (https://medium.com/@juan_cortes/exploiting-freely-accessible-whatsapp-data-or-why-does-whatsapp-know-my-battery-level-ddac224041b4)
DONE: programming
Using Redis and Redis Queue(RQ):
Using asyncio and aiohttp may not always be an option, especially if you are using an older version of Python. Also, there will be scenarios where you want to distribute your tasks across different servers. In that case, we can leverage RQ (Redis Queue), a simple Python library for queueing jobs and processing them in the background with workers. It is backed by Redis, a key/value data store.
In the example below, we have queued a simple function count_words_at_url using redis.
from mymodule import count_words_at_url
from redis import Redis
from rq import Queue

q = Queue(connection=Redis())
job = q.enqueue(count_words_at_url, 'http://nvie.com')

# ****** mymodule.py ******
import requests

def count_words_at_url(url):
    """Just an example function that's called async."""
    resp = requests.get(url)
    print(len(resp.text.split()))
    return len(resp.text.split())
Output:
15:10:45 RQ worker 'rq:worker:EMPID18030.9865' started, version 0.11.0
15:10:45 *** Listening on default...
15:10:45 Cleaning registries for queue: default
15:10:50 default: mymodule.count_words_at_url('http://nvie.com') (a2b7451e-731f-4f31-9232-2b7e3549051f)
322
15:10:51 default: Job OK (a2b7451e-731f-4f31-9232-2b7e3549051f)
15:10:51 Result is kept for 500 seconds
Conclusion:
Let's take the classic example of a chess exhibition, where one of the best chess players competes against many people. If there are 24 games with 24 opponents and the chess master plays all of them synchronously, it'll take at least 12 hours (assuming the average game takes 30 moves, the chess master thinks 5 seconds per move, and each opponent takes about 55 seconds). Playing asynchronously, though, the chess master can make a move, leave the opponent thinking, and go on to the next board to make a move there. This way, a move on all 24 games can be made in 2 minutes, and all of them can be won in just one hour.
So, this is what's meant when people talk about asynchronous being really fast. It's this kind of fast: the chess master doesn't play chess faster, the time is just better optimized and not wasted waiting around. This is how it works.
In this analogy, the chess master is our CPU, and the idea is that we want to make sure the CPU waits the least amount of time possible. It's about always finding something for it to do.
A practical definition of async is that it's a style of concurrent programming in which tasks release the CPU during waiting periods so that other tasks can use it. Python offers several ways to achieve concurrency; based on our requirements, code flow, data manipulation, architecture design, and use case, we can select any of these methods.
Containerized applications and Kubernetes adoption in cloud environments are on the rise. One of the challenges in deploying applications on Kubernetes is exposing these containerized applications to the outside world. This blog explores the different options via which applications can be externally accessed, with a focus on Ingress, a new feature in Kubernetes that provides an external load balancer. It also provides a simple hands-on tutorial on Google Cloud Platform (GCP).
Ingress is a new feature (currently in beta) of Kubernetes that aspires to be an application load balancer, aiming to simplify exposing your applications and services to the outside world. It can be configured to give services externally-reachable URLs, load balance traffic, terminate SSL, offer name-based virtual hosting, etc. Before we dive into Ingress, let's look at some of the alternatives currently available for exposing your applications, along with their complexities and limitations, and then try to understand Ingress and how it addresses these problems.
Current ways of exposing applications externally:
There are several ways to expose your applications externally. Let's look at each of them:
EXPOSE Pod:
You can expose your application directly from your pod by using a port on the node that runs the pod, mapping that port to a port exposed by your container, and using the combination HOST-IP:HOST-PORT to access your application externally. This is similar to what you would do when running Docker containers directly, without Kubernetes. In Kubernetes, you can use the hostPort setting in the container spec, which does the same thing. Another approach is to set hostNetwork: true in the pod spec to use the host's network interface from your pod.
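As a sketch, a hostPort is declared per container in the pod spec (the names, image, and port numbers below are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: web
spec:
  # hostNetwork: true   # alternative: share the node's network namespace
  containers:
  - name: nginx
    image: nginx
    ports:
    - containerPort: 80
      hostPort: 8080    # the app becomes reachable at <node-ip>:8080
```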
Limitations:
In both scenarios you should take extra care to avoid port conflicts at the host, and possibly some issues with packet routing and name resolutions.
This limits you to running only one replica of the pod per cluster node, as the hostPort you use is unique per node and can be bound only once.
EXPOSE Service:
Kubernetes services primarily work to interconnect different pods which constitute an application. You can scale the pods of your application very easily using services. Services are not primarily intended for external access, but there are some accepted ways to expose services to the external world.
Basically, services provide a routing, balancing and discovery mechanism for the pod’s endpoints. Services target pods using selectors, and can map container ports to service ports. A service exposes one or more ports, although usually, you will find that only one is defined.
A service can be exposed using 3 ServiceType choices:
ClusterIP: Exposes the service on a cluster-internal IP. Choosing this value makes the service only reachable from within the cluster. This is the default ServiceType.
NodePort: Exposes the service on each Node's IP at a static port (the NodePort). A ClusterIP service, to which the NodePort service will route, is automatically created. You'll be able to contact the NodePort service from outside the cluster by requesting <NodeIP>:<NodePort>. Here the NodePort remains fixed, and NodeIP can be any node IP of your Kubernetes cluster.
LoadBalancer: Exposes the service externally using a cloud provider’s load balancer (eg. AWS ELB). NodePort and ClusterIP services, to which the external load balancer will route, are automatically created.
ExternalName: Maps the service to the contents of the externalName field (e.g. foo.bar.example.com) by returning a CNAME record with its value. No proxying of any kind is set up. This requires kube-dns version 1.7 or higher.
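The ServiceType is a single field on the service manifest. A NodePort sketch (names, labels, and ports are illustrative):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: web
spec:
  type: NodePort        # ClusterIP | NodePort | LoadBalancer | ExternalName
  selector:
    app: web            # targets pods carrying this label
  ports:
  - port: 80            # the service's own port (ClusterIP)
    targetPort: 8080    # the container port behind it
    nodePort: 30080     # optional; auto-assigned from 30000-32767 if omitted
```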
Limitations:
If we choose NodePort to expose our services, Kubernetes will generate ports corresponding to the ports of your pods in the range 30000-32767. You will need to add an external proxy layer that uses DNAT to expose more friendly ports. The external proxy layer will also have to take care of load balancing so that you leverage the power of your pod replicas. Also, it would not be easy to add TLS or simple host-header routing rules to the external service.
ClusterIP and ExternalName, similarly, while easy to use, have the limitation that we cannot add any routing or load-balancing rules.
Choosing LoadBalancer is probably the easiest of all methods to get your service exposed to the internet. The problem is that there is no standard way of telling a Kubernetes service about the elements that a balancer requires, again TLS and host headers are left out. Another limitation is reliance on an external load balancer (AWS’s ELB, GCP’s Cloud Load Balancer etc.)
Endpoints
Endpoints are usually created automatically by services, unless you are using headless services and adding the endpoints manually. An endpoint is a host:port tuple registered with Kubernetes, and in the service context it is used to route traffic. The service tracks its endpoints as pods matching the selector are created, deleted, and modified. Individually, endpoints are not useful for exposing services, since they are to some extent ephemeral objects.
Summary
If you can rely on your cloud provider to correctly implement the LoadBalancer for their API, to keep up-to-date with Kubernetes releases, and you are happy with their management interfaces for DNS and certificates, then setting up your services as type LoadBalancer is quite acceptable.
On the other hand, if you want to manage load balancing systems manually and set up port mappings yourself, NodePort is a low-complexity solution. If you are directly using Endpoints to expose external traffic, perhaps you already know what you are doing (but consider that you might have made a mistake, there could be another option).
Given that none of these elements has been originally designed to expose services to the internet, their functionality may seem limited for this purpose.
Understanding Ingress
Traditionally, you would create a LoadBalancer service for each public application you want to expose. Ingress gives you a way to route requests to services based on the request host or path, centralizing a number of services into a single entrypoint.
Ingress is split into two main pieces. The first is the Ingress resource, which defines how you want requests routed to the backing services, and the second is the Ingress controller, which does the routing and also keeps track of changes at the service level.
Ingress Resources
The Ingress resource is a set of rules that map to Kubernetes services. Ingress resources are defined purely within Kubernetes as an object that other entities can watch and respond to.
Ingress supports defining the following rules in its beta stage:
host header: Forward traffic based on domain names.
paths: Looks for a match at the beginning of the path.
TLS: If the Ingress enables TLS, HTTPS is served using a certificate configured through a Secret.
When no host-header rules are included in an Ingress, requests without a match will use that Ingress and be mapped to its backend service. You will usually do this to send a 404 page for sites/paths that are not routed to the other services. Ingress tries to match requests to rules and forwards them to backends, which are composed of a service and a port.
Ingress Controllers
The Ingress controller is the entity that grants (or removes) access based on changes in the services, pods, and Ingress resources. The Ingress controller gets the state-change data by directly calling the Kubernetes API.
Ingress controllers are applications that watch Ingresses in the cluster and configure a balancer to apply those rules. You can configure any of the third party balancers like HAProxy, NGINX, Vulcand or Traefik to create your version of the Ingress controller. Ingress controller should track the changes in ingress resources, services and pods and accordingly update configuration of the balancer.
Ingress controllers will usually track and communicate with endpoints behind services instead of using services directly. This way some network plumbing is avoided, and we can also manage the balancing strategy from the balancer. Some of the open source implementations of Ingress Controllers can be found here.
Now, let's do an exercise of setting up an HTTP load balancer using Ingress on Google Cloud Platform (GCP), which has already integrated the Ingress feature into its Container Engine (GKE) service.
Ingress-based HTTP Load Balancer in Google Cloud Platform
The tutorial assumes that you have set up your GCP account and created a default project. We will first create a container cluster, followed by deployment of an nginx server service and an echoserver service. Then we will set up an Ingress resource for both services, which will configure the HTTP load balancer provided by GCP.
Basic Setup
Get your project ID by going to the “Project info” section in your GCP dashboard. Start the Cloud Shell terminal, set your project id and the compute/zone in which you want to create your cluster.
$ gcloud config set project glassy-chalice-129514
$ gcloud config set compute/zone us-east1-d

# Create a 3 node cluster with name "loadbalancedcluster"
$ gcloud container clusters create loadbalancedcluster
Fetch the cluster credentials for the kubectl tool:

$ gcloud container clusters get-credentials loadbalancedcluster
When you create a Service of type NodePort, Container Engine makes your Service available on a randomly-selected high port number (e.g. 30746) on all the nodes in your cluster. Verify that the Services were created and node ports were allocated:
$ kubectl get service nginx
NAME         CLUSTER-IP     EXTERNAL-IP   PORT(S)          AGE
nginx        10.47.245.54   <nodes>       80:30746/TCP     20s

$ kubectl get service echoserver
NAME         CLUSTER-IP     EXTERNAL-IP   PORT(S)          AGE
echoserver   10.47.251.9    <nodes>       8080:32301/TCP   33s
In the output above, the node port for the nginx Service is 30746, and for the echoserver Service it is 32301. Also, note that no external IP is allocated for these Services. Since Container Engine nodes are not externally accessible by default, creating these Services does not make your application accessible from the Internet. To make your HTTP(S) web server application publicly accessible, you need to create an Ingress resource.
Step 3: Create an Ingress resource
On Container Engine, Ingress is implemented using Cloud Load Balancing. When you create an Ingress in your cluster, Container Engine creates an HTTP(S) load balancer and configures it to route traffic to your application. Container Engine has an internally defined Ingress controller, which takes the Ingress resource as input for setting up proxy rules and talks to the Kubernetes API to get the service-related information.
The following config file defines an Ingress resource that directs traffic to your nginx and echoserver server:
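A sketch of what basic-ingress.yaml would contain, assuming the nginx (port 80) and echoserver (port 8080) Services above; the fanout-ingress name matches the kubectl output later in this tutorial, and the beta apiVersion matches the Kubernetes version of this era:

```yaml
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: fanout-ingress
spec:
  rules:
  - http:
      paths:
      - path: /
        backend:
          serviceName: nginx        # NodePort service created above
          servicePort: 80
      - path: /echo
        backend:
          serviceName: echoserver   # NodePort service created above
          servicePort: 8080
```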
To deploy this Ingress resource, run the following in the Cloud Shell:
$ kubectl apply -f basic-ingress.yaml
Step 4: Access your application
Find out the external IP address of the load balancer serving your application by running:
$ kubectl get ingress fanout-ingress
NAME             HOSTS     ADDRESS          PORTS     AGE
fanout-ingress   *         130.211.36.168   80        36s
Use http://<external-ip-address> and http://<external-ip-address>/echo to access nginx and the echoserver.
Summary
Ingresses are simple, very easy to deploy, and really fun to play with. However, Ingress is currently in a beta phase and lacks some features, which may restrict it from production use. Stay tuned for updates on the Kubernetes Ingress page and their GitHub repo.
All modern-era programmers can attest that containerization has afforded more flexibility and allows us to build truly cloud-native applications. Containers provide portability: the ability to easily move applications across environments. However, complex applications comprise many (tens or hundreds of) containers. Managing such applications is a real challenge, and that's where container orchestration and scheduling platforms like Kubernetes, Mesosphere, and Docker Swarm come into the picture. Kubernetes, backed by Google, is leading the pack, given that Red Hat, Microsoft, and now Amazon are putting their weight behind it.
Kubernetes can run on any cloud or bare-metal infrastructure. Setting up and managing Kubernetes can be a challenge, but Google provides an easy way to use Kubernetes through the Google Container Engine (GKE) service.
What is GKE?
Google Container Engine is a management and orchestration system for containers; in short, it is hosted Kubernetes. The goal of GKE is to increase the productivity of DevOps and development teams by hiding the complexity of setting up the Kubernetes cluster, the overlay network, etc.
Why GKE? What are the things that GKE does for the user?
GKE abstracts away the complexity of managing a highly available Kubernetes cluster.
GKE also provides easy integration with the Google storage services.
In this blog, we will see how to create your own Kubernetes cluster in GKE and how to deploy a multi-tier application in it. The blog assumes you have a basic understanding of Kubernetes and have used it before. It also assumes you have created an account with Google Cloud Platform. If you are not familiar with Kubernetes, this guide from Deis is a good place to start.
Google provides a command-line interface, gcloud, to interact with all Google Cloud Platform products and services. The gcloud tool can be used in scripts to automate tasks, or directly from the command line. Follow this guide to install the gcloud tool.
Now let’s begin! The first step is to create the cluster.
Basic Steps to create cluster
In this section, I would like to explain how to create a GKE cluster. We will use the command-line tool to set up the cluster.
Set the zone in which you want to deploy the cluster, then create the cluster. A representative invocation (substitute your own project ID and cluster name) looks like:

$ gcloud config set compute/zone us-east1-d
$ gcloud container clusters create my-cluster --project <project-id> --machine-type n1-standard-2 --image-type COS --disk-size 100 --num-nodes 3 --network default
Let’s try to understand what each of these parameters mean:
--project: Project name
--machine-type: Type of the machine, like n1-standard-2, n1-standard-4
--image-type: OS image. "COS" i.e. Container-Optimized OS from Google: more info here.
--disk-size: Disk size of each instance.
--num-nodes: Number of nodes in the cluster.
--network: Network that users want to use for the cluster. In this case, we are using the default network.
Apart from the above options, you can also use the following to provide specific requirements while creating the cluster:
--scopes: Scopes enable containers to directly access Google services without needing credentials. You can specify a comma-separated list of scope APIs. For example:
Compute: Lets you view and manage your Google Compute Engine resources
You can find all the scopes that Google supports here.
--additional-zones: Specify additional zones for high availability. E.g. --additional-zones us-east1-b,us-east1-d. Here Kubernetes will create the cluster across 3 zones (the one specified at the beginning plus the 2 additional ones).
--enable-autoscaling: Enables the autoscaling option. If you specify this option, then you also have to specify the minimum and maximum number of nodes. You can read more about how autoscaling works here. E.g.: --enable-autoscaling --min-nodes=15 --max-nodes=50
You can then fetch the credentials of the created cluster with gcloud container clusters get-credentials. This step updates the credentials in the kubeconfig file, so that kubectl points to the required cluster.
After creating the cluster, let's see how to deploy a multi-tier application on it. We'll use a simple Python Flask app that greets the user and stores and retrieves employee data.
Application Deployment
I have created a simple Python Flask application to deploy on the K8s cluster created using GKE. You can go through the source code here. If you check the source code, you will find the following directory structure:
In this, I have written a Dockerfile for the Python Flask application in order to build our own image to deploy. For MySQL, we won’t build an image of our own; we will use the latest MySQL image from the public Docker repository.
Before deploying the application, let’s re-visit some of the important Kubernetes terms:
Pods:
A pod is a single Docker container or a group of Docker containers that are deployed together on the same host machine. It acts as the smallest unit of deployment.
Deployments:
A Deployment is an entity that manages ReplicaSets and provides declarative updates to pods. It is recommended to use Deployments instead of using ReplicaSets directly. We can use a Deployment to create, remove and update ReplicaSets, and Deployments can roll out and roll back changes.
Services:
A Service in K8s is an abstraction that connects you to one or more pods. You can connect to a pod using its IP address, but since pods come and go, their IP addresses change. A Service gets its own IP and DNS name, and those remain stable for the entire lifetime of the Service.
Each tier in an application is represented by a Deployment. A Deployment is described by the YAML file. We have two YAML files – one for MySQL and one for the Python application.
You will find a ‘kind’ field in each YAML file. It is used to specify whether the given configuration is for deployment, service, pod, etc.
In the Python app service YAML, I am using type = LoadBalancer. In GKE, there are two types of cloud load balancers available to expose the application to the outside world.
TCP load balancer: a network (layer 4) TCP load balancer. We will use this in our example.
HTTP(s) load balancer: It can be created using Ingress. For more information, refer to this post that talks about Ingress in detail.
In the MySQL service, I have not specified any type. In that case, the default type ClusterIP is used, which makes sure the MySQL container is exposed within the cluster so the Python app can access it.
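As a sketch, the two Service definitions might look like this. The names, selectors and ports are assumptions based on the description above, not the exact files from the repo.

```yaml
# Hypothetical sketch of the two Services described above.
apiVersion: v1
kind: Service
metadata:
  name: mysql-service
spec:
  # No "type" is given, so this defaults to ClusterIP: reachable only inside the cluster.
  selector:
    app: mysql
  ports:
    - port: 3306
---
apiVersion: v1
kind: Service
metadata:
  name: test-service
spec:
  # LoadBalancer provisions an external TCP load balancer in GKE.
  type: LoadBalancer
  selector:
    app: test-app
  ports:
    - port: 80
      targetPort: 5000
```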
If you check app.py, you can see that I have used “mysql-service.default” as the hostname. “mysql-service.default” is the DNS name of the service; the Python application refers to this DNS name when accessing the MySQL database.
Now, let’s actually setup the components from the configurations. As mentioned above, we will first create services followed by deployments.
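Assuming the YAML files are named after the components they describe (the exact filenames in the repo may differ), the commands would be along these lines:

```shell
# Create the Services first, then the Deployments.
kubectl apply -f mysql-service.yaml
kubectl apply -f test-service.yaml
kubectl apply -f mysql-deployment.yaml
kubectl apply -f test-app-deployment.yaml

# Watch for the external IP assigned to the LoadBalancer service.
kubectl get service test-service --watch
```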
At this stage your application is completely deployed and is externally accessible.
Manual scaling of pods
Scaling your application up or down in Kubernetes is quite straightforward. Let’s scale up the test-app deployment.
$ kubectl scale deployment test-app --replicas=3
The deployment configuration for test-app will be updated, and you will see 3 replicas of test-app running. Verify it using:
kubectl get pods
In the same manner, you can scale down your application by reducing the replica count.
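For example, to go back down to a single replica:

```shell
# Scale the deployment down to one replica, then confirm the pod count.
kubectl scale deployment test-app --replicas=1
kubectl get pods
```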
Cleanup
Un-deploying an application from Kubernetes is also quite straightforward: delete the services and delete the deployments. The only caveat is that deletion of the load balancer is an asynchronous process; you have to wait until it gets deleted.
$ kubectl delete service mysql-service
$ kubectl delete service test-service
The above commands will deallocate the load balancer that was created as part of test-service. You can check the status of the load balancer with the following command.
$ gcloud compute forwarding-rules list
Once the load balancer is deleted, you can clean-up the deployments as well.
$ kubectl delete deployments test-app
$ kubectl delete deployments mysql
In this blog, we saw how easy it is to deploy, scale & terminate applications on Google Container Engine. Google Container Engine abstracts away all the complexity of Kubernetes and gives us a robust platform to run containerised applications. I am super excited about what the future holds for Kubernetes!
Check out some of Velotio’s other blogs on Kubernetes.
GraphQL is the new hype in the field of API technologies. We have been building and consuming REST APIs for quite some time now and only recently started hearing about GraphQL. GraphQL is usually described as a frontend-driven API technology, as it allows front-end developers to request data in a simpler way than ever before. The objective of this query language is to give client applications an intuitive and flexible format for describing their data requirements and interactions.
The Phoenix Framework runs on Elixir, which is built on top of Erlang. Elixir’s core strengths are scalability and concurrency. Phoenix is a powerful and productive web framework that does not compromise on speed or maintainability, and it comes with built-in support for WebSockets, enabling you to build real-time apps.
Prerequisites:
Elixir & Erlang: Phoenix is built on top of these
Phoenix Web Framework: Used for writing the server application. (A well-known, lightweight web framework for Elixir.)
Absinthe: A GraphQL library for Elixir, used for writing queries and mutations.
GraphiQL: A browser-based GraphQL IDE for testing your queries. Consider it similar to what Postman is for testing REST APIs.
Overview:
The application we will be developing is a simple blog application written using the Phoenix Framework, with two schemas, User and Post, defined in Accounts and Blog respectively. We will design the application to support APIs for blog creation and management. We assume you have Erlang, Elixir and mix installed.
Where to Start:
At first, we have to create a Phoenix web application using the following command:
mix phx.new graphql --no-brunch --no-html
• --no-brunch: do not generate Brunch files for static asset building. If you choose this option, you will need to handle JavaScript dependencies manually when building HTML apps.
• --no-html: do not generate HTML views.
Note: As we are mostly going to work with the API, we don’t need any web pages or HTML views, hence the command arguments above.
Dependencies:
After we create the project, we need to add dependencies in mix.exs to make GraphQL available for the Phoenix application.
We use the following components to design and structure our GraphQL application:
GraphQL schema: This goes inside lib/graphql_web/schema/schema.ex. The schema defines your queries and mutations.
Custom types: Your schema may include some custom types, which should be defined inside lib/graphql_web/schema/types.ex.
Resolvers: We have to write a resolver function that handles the business logic for each query or mutation it is mapped to. Resolvers should be defined in their own files; we define them inside lib/graphql/accounts/user_resolver.ex and lib/graphql/blog/post_resolver.ex.
Also, we need to update the router in lib/graphql_web/router.ex to be able to make queries using a GraphQL client, and we have to create a GraphQL pipeline to route API requests, which also goes inside lib/graphql_web/router.ex:
pipeline :graphql do
  plug Graphql.Context # custom plug written in lib/graphql_web/plug/context.ex
end

scope "/api" do
  pipe_through(:graphql) # pipeline through which the requests are routed

  forward("/graphiql", Absinthe.Plug.GraphiQL, schema: GraphqlWeb.Schema)
  forward("/", Absinthe.Plug, schema: GraphqlWeb.Schema)
end
Writing GraphQL Queries:
Let’s write some GraphQL queries, which can be considered equivalent to GET requests in REST. But before getting into queries, let’s take a look at the GraphQL schema we defined and its resolver mapping:
defmodule GraphqlWeb.Schema do
  use Absinthe.Schema
  import_types(GraphqlWeb.Schema.Types)

  query do
    field :blog_posts, list_of(:blog_post) do
      resolve(&Graphql.Blog.PostResolver.all/2)
    end

    field :blog_post, type: :blog_post do
      arg(:id, non_null(:id))
      resolve(&Graphql.Blog.PostResolver.find/2)
    end

    field :accounts_users, list_of(:accounts_user) do
      resolve(&Graphql.Accounts.UserResolver.all/2)
    end

    field :accounts_user, :accounts_user do
      arg(:email, non_null(:string))
      resolve(&Graphql.Accounts.UserResolver.find/2)
    end
  end
end
You can see that we have defined four queries in the schema above. Let’s pick one query and see what goes into it:
field :accounts_user, :accounts_user do
  arg(:email, non_null(:string))
  resolve(&Graphql.Accounts.UserResolver.find/2)
end
Above, we retrieve a particular user by email address through a GraphQL query.
arg(:email, non_null(:string)): defines a non-null incoming string argument, i.e. the user’s email.
Graphql.Accounts.UserResolver.find/2: the resolver function mapped via the schema, which contains the core business logic for retrieving a user.
:accounts_user: the custom type, defined inside lib/graphql_web/schema/types.ex as follows:
We need to write a separate resolver function for every query we define. Let’s go over the resolver for accounts_user, which lives in lib/graphql/accounts/user_resolver.ex:
defmodule Graphql.Accounts.UserResolver do
  alias Graphql.Accounts # imports lib/graphql/accounts/accounts.ex as Accounts

  def all(_args, _info) do
    {:ok, Accounts.list_users()}
  end

  def find(%{email: email}, _info) do
    case Accounts.get_user_by_email(email) do
      nil -> {:error, "User email #{email} not found!"}
      user -> {:ok, user}
    end
  end
end
These functions list all users or retrieve a particular user by email address. Let’s run the query now using the GraphiQL browser. You need to have the server running on port 4000; to start the Phoenix server, use:
mix deps.get # pulls all the dependencies
mix deps.compile # compiles the code
mix phx.server # starts the Phoenix server
Let’s retrieve a user by email address via a query:
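If you prefer the command line over the GraphiQL browser, the same query can be sent with curl. This is a sketch: the email address is a placeholder, and it assumes Absinthe’s default camelCase field naming (accountsUser) at the /api endpoint.

```shell
# Hypothetical example: run the accountsUser query against the local Phoenix server.
curl -X POST http://localhost:4000/api \
  -H "Content-Type: application/json" \
  -d '{"query": "{ accountsUser(email: \"jane@example.com\") { id name email } }"}'
```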
Above, we retrieved the id, email and name fields by executing the accountsUser query with an email address. GraphQL also allows us to define variables, which we will show later when writing the mutations.
Let’s execute another query to list all blog posts that we have defined:
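The blogPosts query takes no arguments, so a command-line sketch is even simpler. The selected fields are assumptions based on the post arguments (title, body) used in the schema.

```shell
# Hypothetical example: list all blog posts with their id, title and body.
curl -X POST http://localhost:4000/api \
  -H "Content-Type: application/json" \
  -d '{"query": "{ blogPosts { id title body } }"}'
```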
Writing GraphQL Mutations:
Let’s write some GraphQL mutations. If you have understood how GraphQL queries are written, mutations will look familiar: they are defined in the same form as queries, each with a resolver function. The mutations we are going to write are as follows:
create_post: creates a new blog post
update_post: updates an existing blog post
delete_post: deletes an existing blog post
The mutation looks as follows:
defmodule GraphqlWeb.Schema do
  use Absinthe.Schema
  import_types(GraphqlWeb.Schema.Types)

  query do
    # queries defined earlier
  end

  mutation do
    field :create_post, type: :blog_post do
      arg(:title, non_null(:string))
      arg(:body, non_null(:string))
      arg(:accounts_user_id, non_null(:id))
      resolve(&Graphql.Blog.PostResolver.create/2)
    end

    field :update_post, type: :blog_post do
      arg(:id, non_null(:id))
      arg(:post, :update_post_params)
      resolve(&Graphql.Blog.PostResolver.update/2)
    end

    field :delete_post, type: :blog_post do
      arg(:id, non_null(:id))
      resolve(&Graphql.Blog.PostResolver.delete/2)
    end
  end
end
Let’s run some mutations to create a post in GraphQL:
Notice that the HTTP method here is POST, not GET.
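As a sketch, the createPost mutation could be sent like this. The title, body and user id are placeholders, and camelCase argument names (accountsUserId) assume Absinthe’s default naming convention.

```shell
# Hypothetical example: create a blog post for the user with id 1.
curl -X POST http://localhost:4000/api \
  -H "Content-Type: application/json" \
  -d '{"query": "mutation { createPost(title: \"Hello\", body: \"First post\", accountsUserId: 1) { id title } }"}'
```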
Let’s dig into the update mutation:
field :update_post, type: :blog_post do
  arg(:id, non_null(:id))
  arg(:post, :update_post_params)
  resolve(&Graphql.Blog.PostResolver.update/2)
end
Here, update_post takes two arguments as input: a non-null id and a post parameter of type update_post_params that holds the values to update. The mutation is defined in lib/graphql_web/schema/schema.ex, while the input type is defined in lib/graphql_web/schema/types.ex:
The rise of containers has reshaped the way we develop, deploy and maintain software. Containers allow us to package the different services that constitute an application into separate containers, and to deploy those containers across a set of virtual and physical machines. This gives rise to container orchestration tools that automate the deployment, management, scaling and availability of container-based applications. Kubernetes allows deployment and management of container-based applications at scale. Learn more about backup and disaster recovery for your Kubernetes clusters.
One of the main advantages of Kubernetes is how it brings greater reliability and stability to the container-based distributed application, through the use of dynamic scheduling of containers. But, how do you make sure Kubernetes itself stays up when a component or its master node goes down?
Why do we need Kubernetes high availability?
Kubernetes high availability is about setting up Kubernetes, along with its supporting components, in such a way that there is no single point of failure. A single-master cluster can easily fail, while a multi-master cluster uses multiple master nodes, each of which has access to the same worker nodes. In a single-master cluster, important components like the API server and controller manager live only on the single master node, and if it fails you cannot create more services, pods, etc. In a Kubernetes HA environment, however, these important components are replicated across multiple masters (usually three), and if any master fails, the other masters keep the cluster up and running.
Advantages of multi-master
In a Kubernetes cluster, the master node manages the etcd database, API server, controller manager and scheduler, along with all the worker nodes. If we have only a single master node and that node fails, no new workloads can be scheduled on the worker nodes and the cluster is effectively lost.
In a multi-master setup, by contrast, high availability for a single cluster is provided by running multiple instances of the API server, etcd, controller manager and scheduler. This not only provides redundancy but also improves performance, because the masters divide the load among themselves.
A multi-master setup protects against a wide range of failure modes, from a loss of a single worker node to the failure of the master node’s etcd service. By providing redundancy, a multi-master cluster serves as a highly available system for your end-users.
Steps to Achieve Kubernetes HA
Before moving to steps to achieve high-availability, let us understand what we are trying to achieve through a diagram:
(Image Source: Kubernetes Official Documentation)
Master node: Each master node in a multi-master environment runs its own copy of the Kube API server, which can be used for load balancing among the master nodes. Each master node also runs its own copy of the etcd database, which stores all the data of the cluster. In addition to the API server and the etcd database, the master node runs the k8s controller manager, which handles replication, and the scheduler, which schedules pods to nodes.
Worker node: As in a single-master cluster, the worker nodes in a multi-master cluster run their own components, mainly orchestrating pods. We need 3 machines that satisfy the Kubernetes master requirements and 3 machines that satisfy the worker requirements.
For each master, that has been provisioned, follow the installation guide to install kubeadm and its dependencies. In this blog we will use k8s 1.10.4 to implement HA.
Note: The cgroup driver for Docker and the kubelet differs in some versions of k8s; make sure you change the cgroup driver to cgroupfs for both Docker and the kubelet. If the cgroup drivers for the kubelet and Docker differ, the master doesn’t come up after a reboot.
6. Create a directory /etc/kubernetes/pki/etcd on master-1 and master-2 and copy all the generated certificates into it.
7. On all masters, now generate peer and etcd certs in /etc/kubernetes/pki/etcd. To generate them, we need the previous CA certificates on all masters.
This will replace the default configuration with your machine’s hostname and IP address, so in case you encounter any problem, just check that the hostname and IP address are correct and rerun the cfssl command.
8. On all masters, install etcd and set its environment file.
9. Now we will create a 3-node etcd cluster across the 3 master nodes, running the etcd service under systemd. Create a file /etc/systemd/system/etcd.service on all masters.
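Once the unit file is in place on every master, the service can be enabled and the cluster health checked. This is a sketch: the certificate filenames below are assumptions based on the /etc/kubernetes/pki/etcd directory created earlier, and the cluster-health subcommand assumes the etcd v2 etcdctl.

```shell
# Reload systemd, then enable and start etcd on each master.
systemctl daemon-reload
systemctl enable etcd
systemctl start etcd

# Check that all three members are healthy (v2-style etcdctl flags).
etcdctl --ca-file /etc/kubernetes/pki/etcd/ca.pem \
        --cert-file /etc/kubernetes/pki/etcd/client.pem \
        --key-file /etc/kubernetes/pki/etcd/client-key.pem \
        cluster-health
```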
This will show that the cluster is healthy and connected across all three nodes.
Setup load balancer
There are multiple cloud-provider solutions for load balancing, like AWS Elastic Load Balancer, GCE load balancing, etc. When a physical load balancer is not available, we can set up a virtual-IP load balancer that always points to a healthy master node. We are using keepalived for load balancing; install keepalived on all master nodes:
$ yum install keepalived -y
Create the following configuration file /etc/keepalived/keepalived.conf on all master nodes:
Please ensure that the following placeholders are replaced:
<master-private-ip> with the private IPv4 address of the master server on which the config file resides.
<master0-ip-address>, <master1-ip-address> and <master2-ip-address> with the IP addresses of your three master nodes.
<podCIDR> with your Pod CIDR. Please read the CNI network section of the docs for more information. Some CNI providers do not require a value to be set. I am using weave-net as the pod network, hence the podCIDR will be 10.32.0.0/12.
<load-balancer-ip> with the virtual IP set up in the load balancer in the previous section.
$ kubeadm init --config=config.yaml
10. Run kubeadm init on master1 and master2:
First of all, copy /etc/kubernetes/pki/ca.crt, /etc/kubernetes/pki/ca.key, /etc/kubernetes/pki/sa.key and /etc/kubernetes/pki/sa.pub from master0 to the /etc/kubernetes/pki folder on master1 and master2.
Note: Copying these files is crucial; otherwise, the other two master nodes won’t reach the Ready state.
Copy the config file config.yaml from master0 to master1 and master2. We need to change <master-private-ip> to the current master host’s private IP.
$ kubeadm init --config=config.yaml
11. Now install a pod network on all three masters to bring them into the Ready state. I am using the weave-net pod network; to apply weave-net, run:
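The weave-net manifest can be applied with kubectl. This is the standard command from the Weave documentation, which pins the manifest to your cluster’s Kubernetes version:

```shell
# Apply the weave-net pod network (run once; it applies cluster-wide).
kubectl apply -f "https://cloud.weave.works/k8s/net?k8s-version=$(kubectl version | base64 | tr -d '\n')"

# Wait for the masters to reach the Ready state.
kubectl get nodes
```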
12. By default, k8s doesn’t schedule any workload on the masters, so if you want to schedule workloads on the master nodes as well, remove the master taint from all the master nodes using the command:
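The untainting command looks like this; the trailing minus sign removes the taint from every node that carries it:

```shell
# Remove the master taint so pods can be scheduled on the master nodes too.
kubectl taint nodes --all node-role.kubernetes.io/master-
```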
Even after one master node fails, all the important components remain up and running. The cluster is still accessible, and you can create more pods, deployments, services, etc.
High availability is an important part of reliability engineering, focused on making a system reliable and avoiding any single point of failure in the complete system. At first glance, its implementation might seem quite complex, but high availability brings tremendous advantages to systems that require increased stability and reliability. Using a highly available cluster is one of the most important aspects of building a solid infrastructure.
In the world of data centers with wings and wheels, there is an opportunity to offload some work from centralized cloud computing by moving less compute-intensive tasks to other components of the architecture. In this blog, we will explore the upcoming frontier of the web: Edge Computing.
What is the “Edge”?
The ‘Edge’ refers to having computing infrastructure closer to the source of data. It is a distributed framework in which data is processed as close to the originating data source as possible. This infrastructure requires effective use of resources that may not be continuously connected to a network, such as laptops, smartphones, tablets and sensors. Edge Computing covers a wide range of technologies, including wireless sensor networks, cooperative distributed peer-to-peer ad-hoc networking and processing, also classifiable as local cloud/fog computing, mobile edge computing, distributed data storage and retrieval, autonomic self-healing networks, remote cloud services, augmented reality, and more.
Cloud Computing is expected to go through a phase of decentralization. Edge Computing is coming up with an ideology of bringing compute, storage and networking closer to the consumer.
But Why?
Legit question! Why do we even need Edge Computing? What are the advantages of having this new infrastructure?
Imagine the case of a self-driving car that continuously sends a live stream to central servers. Now the car has to make a crucial decision. The consequences can be disastrous if the car waits for the central servers to process the data and respond. Although algorithms like YOLOv2 have sped up object detection, the latency lies in the part of the system where the car has to send terabytes to the central server, receive the response and then act. Hence, we need basic processing, like deciding when to stop or decelerate, to be done in the car itself.
The goal of Edge Computing is to minimize latency by bringing public cloud capabilities to the edge. This can be achieved in two forms: a custom software stack emulating the cloud services on existing hardware, or the public cloud seamlessly extended to multiple point-of-presence (PoP) locations.
Following are some promising reasons to use Edge Computing:
Privacy: Avoid sending all raw data to be stored and processed on cloud servers.
Real-time responsiveness: Sometimes the reaction time can be a critical factor.
Reliability: The system is capable of working even when disconnected from cloud servers, removing a single point of failure.
To understand the points mentioned above, let’s take the example of a device that responds to a hot keyword, for example Jarvis from Iron Man. Imagine if your personal Jarvis sent all of your private conversations to a remote server for analysis. Instead, it is intelligent enough to respond only when it is called, while remaining real-time and reliable.
Intel CEO Brian Krzanich said at an event that autonomous cars will generate 40 terabytes of data for every eight hours of driving. With that flood of data, transmission times go up substantially. For self-driving cars, real-time or quick decisions are essential, and this is where an edge computing infrastructure comes to the rescue: these cars need to decide in a split second whether to stop or not, otherwise the consequences can be disastrous.
Another example is drones or quadcopters. Let’s say we are using them to identify people or deliver relief packages; then the machines should be intelligent enough to make basic decisions locally, like changing their path to avoid obstacles.
This model of Edge Computing is basically an extension of the public cloud. Content delivery networks are a classic example of this topology, in which static content is cached and delivered through geographically distributed edge locations.
Vapor IO is an emerging player in this category, attempting to build infrastructure for the cloud edge. Vapor IO has various products, like the Vapor Chamber. These chambers are self-monitoring: embedded sensors allow them to be continuously monitored and evaluated by the Vapor Edge Controller (VEC) software. Vapor IO has also built OpenDCRE, which we will see later in this blog.
The fundamental difference between the device edge and the cloud edge lies in the deployment and pricing models. The deployment of these two models is specific to different use cases; sometimes it may be an advantage to deploy both.
Edges around you
Edge Computing examples can be increasingly found around us:
Smart street lights
Automated Industrial Machines
Mobile devices
Smart Homes
Automated Vehicles (cars, drones etc)
Data transmission is expensive. By bringing compute closer to the origin of the data, latency is reduced and end users get a better experience. Some of the evolving use cases of Edge Computing are Augmented Reality (AR), Virtual Reality (VR) and the Internet of Things. For example, the rush people got while playing an Augmented-Reality-based Pokémon game wouldn’t have been possible if “real-timeliness” were not present in the game; it was made possible because the smartphone itself was doing the AR, not the central servers. Even Machine Learning (ML) can benefit greatly from Edge Computing: all the heavy-duty training of ML algorithms can be done on the cloud, and the trained model can be deployed on the edge for near-real-time or even real-time predictions. In today’s data-driven world, edge computing is becoming a necessary component.
There is a lot of confusion between Edge Computing and IoT. Simply stated, Edge Computing is, in a way, the intelligent Internet of Things (IoT); it complements traditional IoT. In the traditional model of IoT, all the devices, like sensors, mobiles, laptops, etc., are connected to a central server. Now imagine a case where you tell your lamp to switch off: for this simple task, data needs to be transmitted to the cloud and analyzed there, and only then does the lamp receive the command to switch off. Edge Computing brings the computing closer to your home: either the fog layer between the lamp and the cloud servers is smart enough to process the data, or the lamp itself is.
If we look at the image below, it shows a standard IoT implementation where everything is centralized, while the Edge Computing philosophy talks about decentralizing the architecture.
The Fog
Sandwiched between the edge layer and the cloud layer is the fog layer, which bridges the connection between the other two.
The difference between fog and edge computing is described in this article:
Fog Computing – Fog computing pushes intelligence down to the local area network level of network architecture, processing data in a fog node or IoT gateway.
Edge Computing – Edge computing pushes the intelligence, processing power and communication capabilities of an edge gateway or appliance directly into devices like programmable automation controllers (PACs).
How do we manage Edge Computing?
Device Relationship Management (DRM) refers to managing and monitoring interconnected components over the internet. AWS offers IoT Core and Greengrass, Nebbiolo Technologies has developed the Fog Node and Fog OS, and Vapor IO has OpenDCRE, with which one can control and monitor data centers.
Following image (source – AWS) shows how to manage ML on Edge Computing using AWS infrastructure.
AWS Greengrass makes it possible for users to use Lambda functions to build IoT devices and application logic. Specifically, AWS Greengrass provides cloud-based management of applications that can be deployed for local execution. Locally deployed Lambda functions are triggered by local events, messages from the cloud, or other sources.
This GitHub repo demonstrates a traffic light example using two Greengrass devices, a light controller, and a traffic light.
Conclusion
We believe that next-gen computing will be influenced a lot by Edge Computing and will continue to explore new use-cases that will be made possible by the Edge.