When working with servers or command-line applications, we spend most of our time on the command line. A good-looking, productive terminal beats a GUI (Graphical User Interface) in many respects, since the command line is faster for most use cases. Today, we’ll look at some of the features that make a terminal cool and productive.
You can use the following steps on Ubuntu 20.04. If you are using a different operating system, your commands will likely differ. If you’re using Windows, you can choose between Cygwin, WSL, and Git Bash.
Prerequisites
Let’s upgrade the system and install some basic tools needed.
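On Ubuntu, that boils down to refreshing the package index and upgrading installed packages:

```shell
sudo apt update && sudo apt upgrade -y
```
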
Zsh is an extended Bourne shell with many improvements, including some features of Bash and other shells.
Let’s install Z-Shell:
sudo apt install zsh
Make it our default shell for our terminal:
chsh -s $(which zsh)
Now restart the system and open the terminal again to be welcomed by ZSH. Unlike shells such as Bash, ZSH requires some initial configuration: the first time it starts, it asks for some configuration options and saves them in a file called .zshrc in the home directory (/home/user, where user is the current system user).
For now, we’ll skip the manual work and get a head start with the default configuration. Press 2, and ZSH will populate the .zshrc file with some default options. We can change these later.
The initial configuration setup can be run again at any time with `autoload -Uz zsh-newuser-install && zsh-newuser-install -f`.
Oh-My-ZSH
Oh-My-ZSH is a community-driven, open-source framework for managing your ZSH configuration. It comes with many plugins and helpers, and it can be installed with a single command, as shown below.
Installation
sh -c "$(wget https://raw.github.com/ohmyzsh/ohmyzsh/master/tools/install.sh -O -)"
The installer backs up your existing .zshrc to a file named .zshrc.pre-oh-my-zsh, so if you ever uninstall Oh-My-ZSH, the backup is restored automatically.
Font
A good terminal needs good fonts. We’ll use the Terminess Nerd Font to make our terminal look awesome; it can be downloaded here. Once downloaded, extract the fonts and move them to ~/.local/share/fonts to make them available for the current user, or to /usr/share/fonts to make them available for all users.
unzip Terminess.zip
mv *.ttf ~/.local/share/fonts/
Once the fonts are installed, refresh the font cache with `fc-cache -f` and select Terminess in your terminal’s font settings.
Among all the things Oh-My-ZSH provides, two are community favorites: plugins and themes.
Theme
My go-to ZSH theme is powerlevel10k: it’s flexible, provides everything out of the box, and is easy to install.
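Per the powerlevel10k README, installation as an Oh-My-ZSH custom theme is a git clone plus one line in your .zshrc:

```shell
git clone --depth=1 https://github.com/romkatv/powerlevel10k.git \
  ${ZSH_CUSTOM:-$HOME/.oh-my-zsh/custom}/themes/powerlevel10k

# then set the theme in ~/.zshrc:
# ZSH_THEME="powerlevel10k/powerlevel10k"
```
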
Close the terminal and start it again. Powerlevel10k will welcome you with the initial setup, go through the setup with the options you want. You can run this setup again by executing the below command:
p10k configure
Tools and plugins we can’t live without
Plugins are enabled by adding them to the plugins array in the .zshrc file. For each plugin you want to use from the list below, add its name to that array.
ZSH-Syntax-Highlighting
This enables the highlighting of commands as you type and helps you catch syntax errors before you execute them:
As you can see, “ls” is in green but “lss” is in red.
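This plugin isn’t bundled with Oh-My-ZSH; per its README, clone it into the custom plugins directory before enabling it in the plugins array:

```shell
git clone https://github.com/zsh-users/zsh-syntax-highlighting.git \
  ${ZSH_CUSTOM:-$HOME/.oh-my-zsh/custom}/plugins/zsh-syntax-highlighting
```
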
Autojump
Autojump is a faster way of navigating the file system; it works by maintaining a database of the directories you visit most. More details can be found here.
sudo apt install autojump
You can also use the z plugin as an alternative if you’re unable to install autojump.
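Once autojump is installed and enabled, usage looks like this (the directory names are examples; with the z plugin the command is `z` instead of `j`):

```shell
cd ~/projects/my-app   # visit a directory once so autojump records it
cd ~
j app                  # jump back to ~/projects/my-app from anywhere
```
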
Internal Plugins
Some plugins come bundled with Oh-My-ZSH, and they can be enabled directly in the .zshrc file without any installation.
copyfile
It copies the content of a file to the clipboard.
copyfile test.txt
copypath
It copies the absolute path of the current directory to the clipboard.
copybuffer
This plugin copies the command that is currently typed in the command prompt to the clipboard. It works with the keyboard shortcut CTRL + o.
sudo
Sometimes, we forget to prefix a command with sudo, but that can be done in just a second with this plugin. When you hit the ESC key twice, it will prefix the command you’ve typed in the terminal with sudo.
web-search
This adds aliases for searching with Google, Wikipedia, and other engines directly from the terminal.
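For example, the google alias (one of several the plugin defines) opens a Google search for whatever follows it:

```shell
google "oh-my-zsh plugins"
```
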
Remember, you’d have to add each of these plugins to the plugins array in the .zshrc file as well.
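Putting it together, the plugins array in .zshrc might look like this (keep whichever plugins you actually use):

```shell
plugins=(
  git
  zsh-syntax-highlighting
  autojump
  copyfile
  copypath
  copybuffer
  sudo
  web-search
)
```
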
You can add more plugins, like docker, heroku, kubectl, npm, jsontools, etc., if you’re a developer. There are plugins for system admins as well or for anything else you need. You can explore them here.
Enhancd
Enhancd is a next-gen way to navigate the file system from the CLI. It works with a fuzzy finder, so we’ll install fzf for this purpose.
sudo apt install fzf
Enhancd can be installed with the zplug plugin manager for ZSH, so first we’ll install zplug.
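zplug can be installed with the one-liner from its README:

```shell
curl -sL --proto-redir -all,https https://raw.githubusercontent.com/zplug/installer/master/installer.zsh | zsh
```
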
Now close your terminal, open it again, add the following line to your .zshrc, and run `zplug install` to install enhancd:
zplug "b4b4r07/enhancd", use:init.sh
Aliases
As a developer, I need to execute git commands many times a day, and typing each command in full every time is cumbersome, so we can use aliases for them. Aliases need to be added to .zshrc.
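A few example git aliases to add to ~/.zshrc (the names here are my own preference; note the Oh-My-ZSH git plugin already ships similar ones):

```shell
alias gs='git status'
alias ga='git add'
alias gc='git commit -m'
alias gp='git push'
alias gl='git log --oneline --graph'
```
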
Colorls is a Ruby gem that beautifies the ls output with colors and icons. Install it with `gem install colorls`, then restart your terminal and run `colorls` to see the magic!
Bonus: we can also add aliases if we want the Colorls output whenever we execute the plain ls command. Note that we add another alias (cl) to keep the original ls available as well.
alias cl='ls'
alias ls='colorls'
alias la='colorls -a'
alias ll='colorls -l'
alias lla='colorls -la'
These are the tools and plugins I can’t live without. Let me know if I’ve missed anything.
Automation
Would you want to repeat this whole process if, say, you bought a new laptop and wanted the same setup?
If your answer is no, you can automate all of it, and that’s why I’ve created Project Automator. The project does much more than just setting up a terminal: it currently targets Arch Linux, but you can take the parts you need and adapt them to almost any *nix system.
Explaining how it works is beyond the scope of this article, so I’ll have to leave you guys here to explore it on your own.
Conclusion
We need to perform many tasks on our systems, and using a GUI (Graphical User Interface) tool can consume a lot of time, especially for tasks you repeat daily, like converting a media stream or setting up tools on a system.
Using a command-line tool can save you a lot of time and you can automate repetitive tasks with scripting. It can be a great tool for your arsenal.
Zappa is a powerful open-source Python project that lets you easily build, deploy, and update WSGI apps hosted on AWS Lambda and API Gateway. This blog is a detailed, step-by-step guide focusing on the challenges faced while deploying a Django application on AWS Lambda using Zappa as the deployment tool.
Building Your Application
If you do not have a Django application already you can build one by cloning this GitHub repository.
Once you have cloned the repository, you will need a virtual environment, which provides an isolated Python environment for your application. I prefer virtualenvwrapper for creating one.
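With virtualenvwrapper installed, creating and entering the environment might look like this (the environment name and the presence of a requirements.txt are assumptions about your project):

```shell
mkvirtualenv zappa-env           # create and activate the virtual environment
pip install -r requirements.txt  # install the application's dependencies
```
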
Now, if you run the server directly, it will log a warning, as the database has not been set up yet.
$ python manage.py runserver
Performing system checks...

System check identified no issues (0 silenced).

You have 13 unapplied migration(s). Your project may not work properly until you apply the migrations for app(s): admin, auth, contenttypes, sessions.
Run 'python manage.py migrate' to apply them.

May 20, 2018 - 14:47:32
Django version 1.11.11, using settings 'django_zappa_sample.settings'
Starting development server at http://127.0.0.1:8000/
Quit the server with CONTROL-C.
Also, trying to access the admin page (http://localhost:8000/admin/) will throw an “OperationalError” exception, with the following log at the server end.
Internal Server Error: /admin/
Traceback (most recent call last):
  File "/home/velotio/Envs/django_zappa_sample/local/lib/python2.7/site-packages/django/core/handlers/exception.py", line 41, in inner
    response = get_response(request)
  ... (intermediate frames through the admin views, auth middleware, and session backend omitted) ...
  File "/home/velotio/Envs/django_zappa_sample/local/lib/python2.7/site-packages/django/db/models/sql/compiler.py", line 899, in execute_sql
    raise original_exception
OperationalError: no such table: django_session
[20/May/2018 14:59:23] "GET /admin/ HTTP/1.1" 500 153553
Not Found: /favicon.ico
To fix this, you need to run the migrations on your database so that essential tables like auth_user, django_session, etc. are created before any request is made to the server.
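As the server warning suggests, apply the migrations locally:

```shell
python manage.py migrate
```
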
NOTE: Use DATABASES in the project settings file to configure the database you want your Django application to use once hosted on AWS Lambda. By default, it’s configured to create a local SQLite database file as the backend.
You can run the server again and it should now load the admin panel of your website.
Verify that the zappa Python package is installed in your virtual environment before moving forward.
Configuring Zappa Settings
Deploying with Zappa is simple: it only needs a configuration file, and the rest is managed by Zappa. To create this configuration file, run the following from your project root directory:
$ zappa init
Welcome to Zappa!

Zappa is a system for running server-less Python web applications on AWS Lambda and AWS API Gateway.
This `init` command will help you create and configure your new Zappa deployment.
Let's get started!

Your Zappa configuration can support multiple production stages, like 'dev', 'staging', and 'production'.
What do you want to call this environment (default 'dev'):

AWS Lambda and API Gateway are only available in certain regions. Let's check to make sure you have a profile set up in one that will work.
We found the following profiles: default, and hdx. Which would you like us to use? (default 'default'):

Your Zappa deployments will need to be uploaded to a private S3 bucket.
If you don't have a bucket yet, we'll create one for you too.
What do you want to call your bucket? (default 'zappa-108wqhyn4'): django-zappa-sample-bucket

It looks like this is a Django application!
What is the module path to your project's Django settings?
We discovered: django_zappa_sample.settings
Where are your project's settings? (default 'django_zappa_sample.settings'):

You can optionally deploy to all available regions in order to provide fast global service.
If you are using Zappa for the first time, you probably don't want to do this!
Would you like to deploy this application globally? (default 'n') [y/n/(p)rimary]: n

Okay, here's your zappa_settings.json:

{
    "dev": {
        "aws_region": "us-east-1",
        "django_settings": "django_zappa_sample.settings",
        "profile_name": "default",
        "project_name": "django-zappa-sa",
        "runtime": "python2.7",
        "s3_bucket": "django-zappa-sample-bucket"
    }
}

Does this look okay? (default 'y') [y/n]: y

Done! Now you can deploy your Zappa application by executing:

    $ zappa deploy dev

After that, you can update your application code with:

    $ zappa update dev

To learn more, check out our project page on GitHub here: https://github.com/Miserlou/Zappa
and stop by our Slack channel here: https://slack.zappa.io

Enjoy!,
 ~ Team Zappa!
You can inspect the zappa_settings.json file generated in your project root directory.
TIP: The virtual environment name should not be the same as the Zappa project name, as this may cause errors.
Additionally, you can specify other settings in the zappa_settings.json file as required; see Advanced Settings.
Now, you’re ready to deploy!
IAM Permissions
In order to deploy the Django application to Lambda/API Gateway, set up an IAM role (e.g., ZappaLambdaExecutionRole) with the following permissions:
Before deploying the application, ensure that the IAM role is set in the config JSON as follows:
{
    "dev": {
        ...
        "manage_roles": false, // Disable Zappa client managing roles.
        "role_name": "MyLambdaRole", // Name of your Zappa execution role. Optional, default: <project_name>-<stage>-ZappaExecutionRole.
        "role_arn": "arn:aws:iam::12345:role/app-ZappaLambdaExecutionRole", // ARN of your Zappa execution role. Optional.
        ...
    },
    ...
}
Once your settings are configured, you can package and deploy your application to a stage called “dev” with a single command:
$ zappa deploy dev
Calling deploy for stage dev..
Downloading and installing dependencies..
Packaging project as zip.
Uploading django-zappa-sa-dev-1526831069.zip (10.9MiB)..
100%|...| 11.4M/11.4M [01:02<00:00, 75.3KB/s]
Scheduling..
Scheduled django-zappa-sa-dev-zappa-keep-warm-handler.keep_warm_callback with expression rate(4 minutes)!
Uploading django-zappa-sa-dev-template-1526831157.json (1.6KiB)..
100%|...| 1.60K/1.60K [00:02<00:00, 792B/s]
Waiting for stack django-zappa-sa-dev to create (this can take a bit)..
100%|...| 4/4 [00:11<00:00, 2.92s/res]
Deploying API Gateway..
Deployment complete!: https://akg59b222b.execute-api.us-east-1.amazonaws.com/dev
You should see that your Zappa deployment completed successfully, along with the URL of the API Gateway created for your application.
Troubleshooting
1. If you see the following error during deployment, you probably lack sufficient privileges to deploy to AWS Lambda. Ensure your IAM role has all the permissions described above, or set “manage_roles” to true so that Zappa can create and manage the IAM role for you.
Calling deploy for stage dev..
Creating django-zappa-sa-dev-ZappaLambdaExecutionRole IAM Role..
Error: Failed to manage IAM roles!
You may lack the necessary AWS permissions to automatically manage a Zappa execution role.
To fix this, see here: https://github.com/Miserlou/Zappa#using-custom-aws-iam-roles-and-policies
2. The error below occurs when you have not listed “events.amazonaws.com” as a Trusted Entity for your IAM role. Either add it, or set the “keep_warm” parameter to false in your Zappa settings file. In this case, the Zappa deployment is only partially completed because it terminated abnormally.
Downloading and installing dependencies..
100%|...| 44/44 [00:05<00:00, 7.92pkg/s]
Packaging project as zip..
Uploading django-zappa-sample-dev-1482817370.zip (8.8MiB)..
100%|...| 9.22M/9.22M [00:17<00:00, 527KB/s]
Scheduling..
Oh no! An error occurred! :(

==============
Traceback (most recent call last):
  ... (Zappa and botocore frames omitted) ...
ClientError: An error occurred (ValidationError) when calling the PutRole operation: Provided role 'arn:aws:iam:484375727565:role/lambda_basic_execution' cannot be assumed by principal 'events.amazonaws.com'.
==============
3. Adding the parameter and running zappa update can then cause an error saying “Stack django-zappa-sa-dev does not exist”, because the previous deployment was unsuccessful. To fix this, delete the Lambda function from the console and rerun the deployment.
4. If you run into a distribution error like the one below, try downgrading your pip version to 9.0.1.
$ pip install pip==9.0.1
Calling deploy for stage dev..
Downloading and installing dependencies..
Oh no! An error occurred! :(

==============
Traceback (most recent call last):
  ... (Zappa frames omitted) ...
  File "/home/velotio/Envs/django_zappa_sample/local/lib/python2.7/site-packages/zappa/core.py", line 751, in get_installed_packages
    pip.get_installed_distributions()
AttributeError: 'module' object has no attribute 'get_installed_distributions'
==============
or,
If you run into a NotFoundException (Invalid REST API identifier), try undeploying the Zappa stage and deploying again.
Calling deploy for stage dev..
Downloading and installing dependencies..
Packaging project as zip.
Uploading django-zappa-sa-dev-1526830532.zip (10.9MiB)..
Scheduling..
Scheduled django-zappa-sa-dev-zappa-keep-warm-handler.keep_warm_callback with expression rate(4 minutes)!
Uploading django-zappa-sa-dev-template-1526830690.json (1.6KiB)..
Oh no! An error occurred! :(

==============
Traceback (most recent call last):
  ... (Zappa and botocore frames omitted) ...
NotFoundException: An error occurred (NotFoundException) when calling the GetRestApi operation: Invalid REST API identifier specified 484375727565:akg59b222b
==============
TIP: To understand how your application works in a serverless environment, please visit this link.
Post Deployment Setup
Migrate database
At this point, you should have an empty database for your Django application to fill up with a schema.
$ zappa manage dev migrate
Once you run the above command, the database migrations will be applied to the database specified in your Django settings.
Creating Superuser of Django Application
You also might need to create a superuser in the database. You can do this with a standard Django command run from your project directory.
Note that your local application must be connected to the same database, as this is run as a standard Django administration command (not a Zappa command).
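With your local settings pointing at that same database, the standard command is:

```shell
python manage.py createsuperuser
```
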
Managing static files
Your Django application will depend on static files; the Django admin panel, for example, uses a combination of JS, CSS, and image files.
NOTE: Zappa is for running your application code, not for serving static web assets. If you plan on serving custom static assets in your web application (CSS/JavaScript/images/etc.), you’ll likely want to use a combination of AWS S3 and AWS CloudFront.
You will need to add the django-storages and boto packages to your virtual environment; they are required for moving files to and from S3.
$ pip install django-storages boto

Add django-storages to your INSTALLED_APPS in settings.py:

INSTALLED_APPS = (
    ...,
    'storages',
)

Configure django-storages in settings.py as:

AWS_STORAGE_BUCKET_NAME = 'django-zappa-sample-bucket'
AWS_S3_CUSTOM_DOMAIN = '%s.s3.amazonaws.com' % AWS_STORAGE_BUCKET_NAME
STATIC_URL = "https://%s/" % AWS_S3_CUSTOM_DOMAIN
STATICFILES_STORAGE = 'storages.backends.s3boto.S3BotoStorage'
Once you have set up the Django application to serve its static files from AWS S3, run the following command to upload the static files from your project to S3.
$ python manage.py collectstatic --noinput
or
$ zappa update dev
$ zappa manage dev "collectstatic --noinput"
Check that at least 61 static files were moved to the S3 bucket; the admin panel alone is built from 61 static files.
NOTE: STATICFILES_DIRS must be configured properly to collect your files from the appropriate location.
Tip: You need to render static files in your templates by loading the static template tag library ({% load static %}) and then using the {% static %} tag.
Setting Up API Gateway
To connect to your Django application, you also need to ensure an API Gateway is set up for your AWS Lambda function. You need GET methods set up for all the URL resources used in your Django application. Alternatively, you can set up a proxy method so that all subresources are processed through one API method.
Go to AWS Lambda function console and add API Gateway from ‘Add triggers’.
1. Configure API, Deployment Stage, and Security for API Gateway. Click Save once it is done.
2. Go to API Gateway console and,
a. Recreate ANY method for / resource.
i. Check `Use Lambda Proxy integration`
ii. Set `Lambda Region` and `Lambda Function` and `Save` it.
b. Recreate the ANY method for the /{proxy+} resource.
i. Select `Lambda Function Proxy`.
ii. Set `Lambda Region` and `Lambda Function`, and `Save` it.
3. Click on Action and select Deploy API. Set Deployment Stage and click Deploy
4. Ensure that the GET and POST methods for / and the proxy resource are set as “Override for this method”.
Setting Up Custom SSL Endpoint
Optionally, you can also set up your own custom SSL endpoint with Zappa and install a certificate for your domain by running `zappa certify`.
Now you are ready to launch your Django Application hosted on AWS Lambda.
Additional Notes:
Once deployed, run `zappa update <stage-name>` to update your already-hosted AWS Lambda function.
You can check server logs for debugging by running the `zappa tail` command.
To undeploy your application, simply run `zappa undeploy <stage-name>`.
You’ve seen how to deploy a Django application on AWS Lambda using Zappa. If you are creating your Django application for the first time, you might also want to read Edgar Roman’s Django Zappa Guide.
Start building your Django application and let us know in the comments if you need any help during your application deployment over AWS Lambda.
GraphQL has revolutionized how a client queries a server. With the thin layer of GraphQL middleware, the client has the ability to query the data more comprehensively than what’s provided by the usual REST APIs.
One of the key principles of GraphQL involves having a single data graph of the implementing services that will allow the client to have a unified interface to access more data and services through a single query. Having said that, it can be challenging to follow this principle for an enterprise-level application on a single, monolith GraphQL server.
The Need for Federated Services
James Baxley III, an Engineering Manager at Apollo, puts forward the rationale behind choosing an independently managed, federated set of services very well in his talk here.
To summarize his point, let’s consider a very complex enterprise product. Such a product would have multiple teams responsible for maintaining different modules. Now, if we’re implementing a GraphQL layer at the backend, it would only make sense to follow the one-graph principle of GraphQL, which says that to maximize the value of GraphQL, we should have a single unified data graph operating at the data layer of the product. That way, a client can query one graph and get all the data, without having to query different graphs for different portions of the data.
However, it would be challenging to have all of the huge enterprise data graphs’ layer logic residing on a single codebase. In addition, we want teams to be able to independently implement, maintain, and ship different schemas of the data graph on their own release cycles.
Though there is only one graph, the implementation of that graph should be federated across multiple teams.
Now, let’s consider a massive enterprise e-commerce platform as an example. The different schemas of the e-commerce platform look something like:
Fig:- E-commerce platform set of schemas
Considering the above example, it would be chaotic to maintain the graph implementation logic for all these schemas in a single codebase. Another overhead is having to scale a huge monolith that implements all these services.
Thus, one solution is a federation of services behind a single distributed data graph. Each service can be implemented independently by an individual team, which maintains its own release cycles and iterates on its service on its own. A federated set of services still follows the one-graph principle of GraphQL, allowing the client to query a single endpoint to fetch any part of the data graph.
To further demonstrate, suppose the client asks for the top five products, their reviews, and the vendors selling them. On a monolith GraphQL server, this query would involve writing a resolver that meshes together the data sources of these individual schemas, forcing teams to collaborate on a single implementation. Now consider a federated approach with separate services implementing products, reviews, and vendors. Each service is responsible for resolving only the part of the data graph covered by its own schema and data source. This makes it much more streamlined for the teams managing different schemas to collaborate.
Another advantage is scaling individual services rather than maintaining one compute-heavy monolith for a huge data graph. For example, suppose the products service is the most used on the platform while the vendors service is scarcely used. With a monolith, scaling would have to happen at the level of the overall server. Federated services eliminate this: we can independently maintain and scale individual services like the products service.
Federated Implementation of GraphQL Services
A monolith GraphQL server that implements a lot of services for different schemas can be challenging to scale. Instead of implementing the complete data graph on a single codebase, the responsibilities of different parts of the data graph can be split across multiple composable services. Each one will contain the implementation of only the part of the data graph it is responsible for. Apollo Federation allows this division of services and follows a declarative programming model to allow splitting of concerns.
Architecture Overview
This article will not cover the basics of GraphQL, such as writing resolvers and schemas. If you’re not acquainted with the basics of GraphQL and setting up a basic GraphQL server using Apollo, I would highly recommend reading about it here. Then, you can come back here to understand the implementation of federated services using Apollo Federation.
Apollo Federation has two principal parts to it:
A collection of services that distinctly define separate GraphQL schemas
A gateway that composes the federated data graph and routes incoming queries to the appropriate services
Fig:- Apollo Federation Architecture
Separation of Concerns
The usual way of going about implementing federated services would be to split an existing monolith based on the schemas already defined. Although this seems like a clear approach, it quickly causes problems when multiple schemas are involved.
To illustrate, this is a typical way to split services from a monolith based on the existing defined schemas:
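For illustration, a naive split might carve the monolith into two services like this (a sketch; the type and field names follow the Twitter-style example discussed below):

```graphql
# User service
type User {
  id: ID!
  username: String!
  tweets: [Tweet]   # problematic: needs access to the Tweet datastore
}

# Tweet service
type Tweet {
  id: ID!
  text: String!
  creator: User     # problematic: needs access to the User datastore
}
```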
In the example above, although the tweets field belongs to the User schema, it wouldn't make sense to populate this field in the User service. The tweets field of a User should be declared and resolved in the Tweet service itself. Similarly, it wouldn't be right to resolve the User details behind the creator field inside the Tweet service.
The reason behind this approach is the separation of concerns. The User service might not even have access to the Tweet datastore to be able to resolve the tweets field of a user. On the other hand, the Tweet service might not have access to the User datastore to resolve the creator field of the Tweet schema.
Considering the above schemas, each service is responsible for resolving the respective fields of each schema it owns.
Implementation
To illustrate Apollo Federation, we'll be considering a Node.js server built with TypeScript. The packages used are provided by the Apollo libraries.
npm i --save apollo-server @apollo/federation @apollo/gateway
Some additional libraries to help run the services in parallel:
npm i --save nodemon ts-node concurrently
Let’s go ahead and write the structure for the gateway service first. Let’s create a file gateway.ts:
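As a sketch under the setup above (apollo-server 2.x with @apollo/gateway, services still to come), gateway.ts might look like this:

```typescript
import { ApolloServer } from "apollo-server";
import { ApolloGateway } from "@apollo/gateway";

// The gateway composes the federated data graph from the listed services.
// The serviceList stays empty until we implement the individual services.
const gateway = new ApolloGateway({
  serviceList: [],
});

const server = new ApolloServer({
  gateway,
  // Apollo Federation does not support subscriptions, so disable them.
  subscriptions: false,
});

server.listen({ port: 4000 }).then(({ url }) => {
  console.log(`Gateway ready at ${url}`);
});
```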
Note the serviceList is an empty array for now since we’ve yet to implement the individual services. In addition, we pass the subscriptions: false option to the apollo server config because currently, Apollo Federation does not support subscriptions.
Next, let’s add the User service in a separate file user.ts using:
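The service's type definitions might look like the following sketch (the username field and the createUser mutation are assumptions based on the examples later in the article):

```graphql
type User @key(fields: "id") {
  id: ID!
  username: String!
}

extend type Query {
  users: [User]
  user(id: ID!): User
}

extend type Mutation {
  createUser(username: String!): User
}
```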
The @key directive tells other services that the User type is, in fact, an entity that can be extended within other individual services. The fields argument tells other services how to uniquely identify individual instances of the User entity, based on the id.
The Query and the Mutation types need to be extended by all implementing services according to the Apollo Federation documentation since they are always defined on a gateway level.
As a side note, the User model imported with import User from './datasources/models/User' is essentially a Mongoose ORM model for MongoDB that will help in all the CRUD operations of a User entity in a MongoDB database. In addition, the mongoStore() function is responsible for establishing a connection to the MongoDB database server.
The User model implementation internally in Mongoose ORM looks something like this:
In the Query type, the users and user(id: ID!) queries fetch a list of users or the details of an individual user.
In the resolvers, we define a __resolveReference function responsible for returning an instance of the User entity to all other implementing services that hold only a reference id of a User entity and need a full instance back. The ref parameter is an object of the form { id: 'userEntityId' } containing the id of a User instance, passed down from other implementing services that need the reference of a User entity resolved. Internally, we fire a Mongoose .findOne query to return an instance of the User from the users database based on the reference id.
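A minimal sketch of such a resolver, with a hypothetical in-memory object standing in for the Mongoose model so the shape is visible in isolation:

```typescript
// Hypothetical in-memory stand-in for the Mongoose User model.
const users: Record<string, { id: string; username: string }> = {
  "1": { id: "1", username: "@elonmusk" },
};

const resolvers = {
  User: {
    // The gateway calls this with e.g. { id: "1" } when another
    // service returns a User reference. In the real service this
    // would be: return User.findOne({ _id: ref.id });
    __resolveReference(ref: { id: string }) {
      return users[ref.id];
    },
  },
};

console.log(resolvers.User.__resolveReference({ id: "1" }));
```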
At the end of the file, we make sure the service runs on a unique port, 4001, which we pass as an option while running the Apollo server. That concludes the User service.
Next, let’s add the tweet service by creating a file tweet.ts using:
touch tweet.ts
The following code goes as a part of the tweet service:
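As a sketch, the Tweet service's type definitions might look like this (field names are assumptions based on the discussion below):

```graphql
type Tweet @key(fields: "id") {
  id: ID!
  text: String!
  creator: User
}

extend type User @key(fields: "id") {
  id: ID! @external
  tweets: [Tweet]
}

extend type Query {
  tweets: [Tweet]
  tweet(id: ID!): Tweet
}

extend type Mutation {
  createTweet(text: String!, creatorId: ID!): Tweet
}
```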
The Tweet schema has the text field, which is the content of the tweet, a unique id of the tweet, and a creator field, which is of the User entity type and resolves into the details of the user that created the tweet:
We extend the User entity schema in this service, which has the id field with an @external directive. This helps the Tweet service understand that based on the given id field of the User entity schema, the instance of the User entity needs to be derived from another service (user service in this case).
As we discussed previously, the tweets field of the extended User schema for the user entity should be resolved in the Tweet service since all the resolvers and access to the data sources with respect to the Tweets entity resides in this service.
The Query and Mutation types of the Tweet service are pretty straightforward; we have tweets and tweet(id: ID!) queries to resolve a list or an individual instance of the Tweet entity.
To resolve the creator field of the Tweet entity, the Tweet service needs to tell the gateway that this field will be resolved by the User service. Hence, we pass the id of the User and a __typename for the gateway to be able to call the right service to resolve the User entity instance. In the User service earlier, we wrote a __resolveReference resolver, which will resolve the reference of a User based on an id.
Now, we need to resolve the tweets field of the User entity extended in the Tweet service. We need to write a resolver where we get the parent user entity reference in the first argument of the resolver using which we can fire a Mongoose ORM query to return all the tweets created by the user given its id.
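The two resolvers just described can be sketched as follows, again with a hypothetical in-memory array standing in for the Mongoose Tweet model (field names are assumptions):

```typescript
// Hypothetical in-memory stand-in for the Mongoose Tweet model.
const tweets = [
  { id: "t1", text: "I own Tesla", creatorId: "1" },
  { id: "t2", text: "I own SpaceX", creatorId: "1" },
];

const resolvers = {
  Tweet: {
    // Return a reference; the gateway forwards it to the User
    // service's __resolveReference to fill in the User fields.
    creator(tweet: { creatorId: string }) {
      return { __typename: "User", id: tweet.creatorId };
    },
  },
  User: {
    // Resolve the tweets field extended onto User in this service.
    // In the real service this would be a Mongoose query such as
    // Tweet.find({ creatorId: user.id }).
    tweets(user: { id: string }) {
      return tweets.filter((t) => t.creatorId === user.id);
    },
  },
};

console.log(resolvers.User.tweets({ id: "1" }).length); // → 2
```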
At the end of the file, similar to the User service, we make sure the Tweet service runs on a different port by adding the port: 4002 option to the Apollo server config. That concludes both our implementing services.
Now that we have our services ready, let’s update our gateway.ts file to reflect the added services:
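The change amounts to filling in the serviceList with the two running services, for example:

```typescript
const gateway = new ApolloGateway({
  serviceList: [
    { name: "user", url: "http://localhost:4001" },
    { name: "tweet", url: "http://localhost:4002" },
  ],
});
```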
The concurrently library helps run three separate scripts in parallel. The server:* scripts spin up a dev server, using nodemon to watch and reload the server on changes and ts-node to execute the TypeScript with Node.
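The scripts section of package.json might look something like this sketch (the script names and the pattern passed to concurrently are assumptions):

```json
{
  "scripts": {
    "start": "concurrently npm:server:*",
    "server:gateway": "nodemon --exec ts-node gateway.ts",
    "server:user": "nodemon --exec ts-node user.ts",
    "server:tweet": "nodemon --exec ts-node tweet.ts"
  }
}
```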
Let’s spin up our server:
npm start
On visiting http://localhost:4000, you should see the GraphQL query playground served by the Apollo server:
Querying and Mutation from the Client
Initially, let’s fire some mutations to create two users and some tweets by those users.
Mutations
Fire the following mutation in the GraphQL playground; it creates a user with the username "@elonmusk" and returns the id of the user:
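Assuming a createUser(username) mutation field on the User service (the field name is an assumption), the mutation might look like:

```graphql
mutation {
  createUser(username: "@elonmusk") {
    id
    username
  }
}
```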
Similarly, we will create another user with the username "@billgates" and take note of its id.
Now that we have created two users, let's fire some mutations to create tweets by those users. Here is a simple mutation to create a tweet by the user "@elonmusk":
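Assuming a createTweet(text, creatorId) mutation field on the Tweet service (names are assumptions), and using the id returned for "@elonmusk" as a placeholder:

```graphql
mutation {
  createTweet(text: "I own Tesla", creatorId: "<id-of-elonmusk>") {
    id
    text
  }
}
```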
Here is another mutation that creates a tweet by the user "@billgates".
After adding a couple of those, we are good to fire our queries, which will allow the gateway to compose the data by resolving fields through different services.
Queries
Initially, let’s list all the tweets along with their creator, which is of type User. The query will look something like:
{
  tweets {
    text
    creator {
      username
    }
  }
}
When the gateway encounters a query asking for tweet data, it forwards that query to the Tweet service, since the Tweet service extends the Query type with the tweets query defined in it.
On encountering the creator field of the Tweet schema, which is of the type User, the creator resolver within the Tweet service is invoked. It essentially just returns a __typename and an id, which tells the gateway to resolve this reference via another service.
In the User service, we have a __resolveReference function, which returns the complete instance of a user given its id passed from the Tweet service. It also helps all other implementing services that need the reference of a User entity resolved.
On firing the query, the response should look something like:
{
  "data": {
    "tweets": [
      { "text": "I own Tesla", "creator": { "username": "@elonmusk" } },
      { "text": "I own SpaceX", "creator": { "username": "@elonmusk" } },
      { "text": "I own PayPal", "creator": { "username": "@elonmusk" } },
      { "text": "I own Microsoft", "creator": { "username": "@billgates" } },
      { "text": "I own XBOX", "creator": { "username": "@billgates" } }
    ]
  }
}
Now, let’s try it the other way round. Let’s list all users and add the field tweets that will be an array of all the tweets created by that user. The query should look something like:
{
  users {
    username
    tweets {
      text
    }
  }
}
When the gateway encounters the query of type users, it passes down that query to the user service. The User service is responsible for resolving the username field of the query.
On encountering the tweets field of the users query, the gateway checks if any other implementing service has extended the User entity and has a resolver written within the service to resolve any additional fields of the type User.
The Tweet service has extended the type User and has a resolver for the User type to resolve the tweets field, which will fetch all the tweets created by the user given the id of the user.
On firing the query, the response should be something like:
{
  "data": {
    "users": [
      {
        "username": "@elonmusk",
        "tweets": [
          { "text": "I own Tesla" },
          { "text": "I own SpaceX" },
          { "text": "I own PayPal" }
        ]
      },
      {
        "username": "@billgates",
        "tweets": [
          { "text": "I own Microsoft" },
          { "text": "I own XBOX" }
        ]
      }
    ]
  }
}
Conclusion
Scaling an enterprise data graph on a monolithic GraphQL service brings a lot of challenges. Being able to distribute the data graph into implementing services that can be individually maintained and scaled using Apollo Federation helps quell those concerns.
There are further advantages to federated services. Considering our example above, we could have two different kinds of datastores for the User and the Tweet services. While the User data could reside in a NoSQL database like MongoDB, the Tweet data could live in a SQL database like Postgres. This is easy to implement, since each service is responsible for resolving references only for the type it owns.
Final Thoughts
One of the key advantages of having different services that can be maintained individually is the ability to deploy each service separately. In addition, this also enables deployment of different services independently to different platforms such as Firebase, Lambdas, etc.
A single monolith GraphQL server deployed on an instance or a single serverless platform can have some challenges with respect to scaling an instance or handling high concurrency as mentioned above.
By splitting out the services, we could have a separate serverless function for each implementing service that can be maintained or scaled individually and also a separate function on which the gateway can be deployed.
One popular use of GraphQL federation can be seen in this Netflix Technology blog, where they explain how they solved a bottleneck with the GraphQL APIs in Netflix Studio. They created a federated GraphQL microservices architecture, along with a schema store, using Apollo Federation. This solution helped them build a unified schema with distributed ownership and implementation.
These days, the field of Machine Learning is evolving rapidly. It is so vast and open that everyone has their own independent thoughts about it; here I am putting down mine. This blog is about my experience with learning algorithms. In it, we will get to know the basic difference between Artificial Intelligence, Machine Learning, and Deep Learning. We will also get to know a foundational Machine Learning algorithm, i.e., Univariate Linear Regression.
Intermediate knowledge of Python and its libraries (NumPy, Pandas, Matplotlib) is a good starting point. For the mathematics, a little knowledge of algebra, calculus, and graphs will help in understanding the tricks behind the algorithm.
A Way to Artificial Intelligence, Machine Learning, and Deep Learning
These are three buzzwords of today's Internet world, where we see the future of programming. This is the place where the science domain meets programming: we use scientific concepts and mathematics, together with a programming language, to simulate the decision-making process. Artificial Intelligence is the ability of a machine to make decisions much as humans do. Machine Learning supports Artificial Intelligence: it helps the machine observe patterns and learn from them to make decisions. Here, programming helps in observing the patterns, not in making the decisions. Machine Learning requires more and more information from various sources to cover all of the variables of any given pattern and make more accurate decisions. Deep Learning, in turn, supports Machine Learning by creating a network (a neural network) to fetch all the required information and provide it to the Machine Learning algorithms.
What is Machine Learning?
Definition: Machine Learning gives machines the ability to learn autonomously, based on experience, observations, and the analysis of patterns within a given data set, without being explicitly programmed.
This is a two-part process. In the first part, we observe and analyze the patterns in the given data and make an informed guess at a mathematical function that follows the pattern closely. There are various families of functions for this: linear, non-linear, logistic, etc. Using the guessed mathematical function and the given data, we calculate an error function. In the second part, we minimize the error function. The minimized function is then used to predict the pattern.
Here are the general steps to understand the process of Machine Learning:
Plot the given dataset on the x-y axes.
By looking at the graph, guess a mathematical function that closely fits the data.
Derive the error function from the given dataset and the guessed mathematical function.
Minimize the error function using an optimization algorithm.
The minimized error function gives a more accurate mathematical function for the given pattern.
Getting Started with the First Algorithm: Univariate Linear Regression
Linear Regression is a very basic algorithm; we can call it the first, foundational algorithm for understanding the concepts of ML. We will try to understand it with an example dataset of plot prices for given areas.
With this data, we can easily determine the price of plots of the given areas. But what if we want the price of a plot with an area of 5.0 * 10 sq mtr? There is no direct price for this in our given dataset. So how can we get the price of a plot whose area is not in the dataset? This is what we can do with Linear Regression.
So at first, we will plot this data on a graph.
The graph below shows the area of plots (in 10 sq mtr) on the x-axis and their prices (in lakhs INR) on the y-axis.
Definition of Linear Regression
The objective of a linear regression model is to find a relationship between one or more features (independent variables) and a continuous target variable (dependent variable). When there is only one feature, it is called Univariate Linear Regression, and if there are multiple features, it is called Multiple Linear Regression.
Hypothesis function:
Here we will try to find the relation between price and area of plots. As this is an example of univariate, we can see that the price is only dependent on the area of the plot.
By observing this pattern we can have our hypothesis function as below:
f(x) = w * x + b
where w is the weight and b is the bias.
For different sets of values of (w, b), multiple lines are possible, but for one particular set of values, the line will lie closest to the pattern.
When we generalize this function to multiple variables, there will be a set of values of w; these constants are also termed the model parameters.
Note: There is a range of mathematical functions that could relate to this pattern, and the selection of the function is entirely up to us. But take care that it neither underfits nor overfits, and that the function is continuous, so that we can easily differentiate it and it has a global minimum or maximum.
Error for a point
As our hypothesis function is continuous, for every Xi (area point) there will be one predicted price f(Xi), while Yi is the actual price.
So the error at any point is
Ei = f(Xi) - Yi
These errors are also called residuals. A residual can be positive (if the actual point lies below the predicted line) or negative (if the actual point lies above the predicted line). Our aim is to minimize this residual at each of the points.
Note: While observing the patterns, it is possible that a few points lie very far from the pattern. For these far points, the residuals will be much larger, so if these points are few in number, we can drop them, treating them as errors in the dataset. Such points are termed outliers.
Energy Functions
As there are m training points, we can calculate the average energy function
E(w, b) = (1/m) * Σ(i = 1 to m) Ei
and our motive is to minimize this energy function:
min over (w, b) of E(w, b)
A little calculus: For any continuous, differentiable function, the points where the first derivative is zero are points of either a minimum or a maximum. If the second derivative is negative, it is a maximum; if it is positive, it is a minimum.
Here we will do a trick: we convert our energy function into an upward-opening parabola by squaring the error terms. This ensures that our energy function has only one global minimum (the point of our concern). It also simplifies the calculation: the point where the first derivative of the energy function is zero is the point we need, and the value of (w, b) at that point is our required answer.
So our final Energy function is
E(w, b) = (1/2m) * Σ(i = 1 to m) Ei^2
Dividing by 2 does not affect our result, and it cancels out at the time of differentiation, since the first derivative of x^2 is 2x.
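As a quick sketch, the squared-error energy function can be computed in NumPy like this (the area and price values below are made-up toy numbers for illustration):

```python
import numpy as np

def energy(w, b, x, y):
    """Average squared-error energy: E(w, b) = 1/(2m) * sum((w*x_i + b - y_i)^2)."""
    m = len(x)
    residuals = w * x + b - y
    return np.sum(residuals ** 2) / (2 * m)

# Toy area (in 10 sq mtr) and price (in lakhs INR) values, made up for illustration.
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])

print(energy(2.0, 0.0, x, y))  # the line f(x) = 2x fits exactly, so E = 0.0
print(energy(1.0, 0.0, x, y))  # a worse fit gives a larger energy
```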
Gradient Descent Method
Gradient descent is a generic optimization algorithm. It iteratively adjusts the parameters of the model in order to minimize the energy function.
In the above picture, on the right side, we can see:
w0 and w1 are the random initialization, and by following gradient descent the parameters move towards the global minimum.
The number of turns of the black line is the number of iterations, so it must be neither too many nor too few.
The distance between the turns is alpha, i.e., the learning parameter.
By solving the equation on the left side, we will be able to get the model parameters at the global minimum of the energy function.
Points to consider at the time of Gradient Descent calculations:
Random initialization: We start the algorithm at a random point, that is, a random set of (w, b) values. As it moves along, the algorithm decides in which direction the next trial has to be taken. Since the energy function is an upward-opening parabola, by moving in the right direction (towards the global minimum) we get a smaller value compared to the previous point.
Number of iterations: The number of iterations must be neither too large nor too small. If it is too small, we will not reach the global minimum; if it is too large, we will do extra calculations around the global minimum.
Alpha, the learning parameter: When alpha is too small, gradient descent is slow, as it takes unnecessarily many steps to reach the global minimum. If alpha is too big, it might overshoot the global minimum; in that case it may fail to converge, or even diverge.
Implementation of Gradient Descent in Python
"""Method to read the csv file using Pandas and later use this data for linear regression."""
"""Better run with Python 3+."""
# Library to read the csv file effectively
import pandas
import matplotlib.pyplot as plt
import numpy as np

# Method to read the csv file
def load_data(file_name):
    column_names = ['area', 'price']  # To read columns
    io = pandas.read_csv(file_name, names=column_names, header=None)
    x_val = io.values[1:, 0]
    y_val = io.values[1:, 1]
    size_array = len(y_val)
    for i in range(size_array):
        x_val[i] = float(x_val[i])
        y_val[i] = float(y_val[i])
    return x_val, y_val

# Call the method for a specific file
x_raw, y_raw = load_data('area-price.csv')
x_raw = x_raw.astype(float)
y_raw = y_raw.astype(float)
y = y_raw

# Modeling
w, b = 0.1, 0.1
num_epoch = 100
converge_rate = np.zeros([num_epoch, 1], dtype=float)
learning_rate = 1e-3
for e in range(num_epoch):
    # Calculate the gradient of the loss function with respect to
    # the model parameters manually.
    y_predicted = w * x_raw + b
    grad_w, grad_b = (y_predicted - y).dot(x_raw), (y_predicted - y).sum()
    # Update parameters.
    w, b = w - learning_rate * grad_w, b - learning_rate * grad_b
    converge_rate[e] = np.mean(np.square(y_predicted - y))

print(w, b)
print(f"predicted function f(x) = x * {w} + {b}")
calculated_price = (10 * w) + b
print(f"price of plot with area 10 sqmtr = 10 * {w} + {b} = {calculated_price}")
This is a basic implementation of the Gradient Descent algorithm using NumPy and Pandas. It reads the area-price.csv file. Here we normalize the x-axis for better readability of the data points on the graph. We have taken (w, b) = (0.1, 0.1) as the random initialization, 100 as the number of iterations, and 0.001 as the learning rate.
In every iteration, we calculate the w and b values and track the convergence rate.
We can repeat this calculation of (w, b) for different values of the random initialization, the number of iterations, and the learning rate (alpha).
Note: There is another Python library, TensorFlow, which is preferable for such calculations since it has built-in functions for Gradient Descent. But for better understanding, we have used NumPy and Pandas here.
RMSE (Root Mean Square Error)
RMSE: This is a method to verify to what extent our calculation of (w, b) is accurate. Below is the basic formula for calculating RMSE, where f is the predicted value and o is the observed value.
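In symbols, RMSE = sqrt((1/n) * Σ(i = 1 to n) (f_i - o_i)^2), where the f_i are the predicted values and the o_i are the observed values. A quick NumPy sketch:

```python
import numpy as np

def rmse(predicted, observed):
    """Root mean square error between predicted (f) and observed (o) values."""
    predicted = np.asarray(predicted, dtype=float)
    observed = np.asarray(observed, dtype=float)
    return float(np.sqrt(np.mean((predicted - observed) ** 2)))

print(rmse([2.0, 4.0, 6.0], [2.0, 4.0, 6.0]))  # → 0.0 (perfect fit)
print(rmse([3.0, 5.0, 7.0], [2.0, 4.0, 6.0]))  # → 1.0 (off by 1 everywhere)
```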
Note: There is no absolute good or bad threshold value for RMSE; we can only judge it relative to our observed values. For observed values ranging from 0 to 1000, an RMSE of 0.7 is small, but if the range goes from 0 to 1, it is not that small.
Conclusion
As part of this article, we had a short introduction to Machine Learning and the need for it. Then, with the help of a very basic example, we learned about Linear Regression (for the univariate case only), which can also be generalized to the multivariate case. We then used the Gradient Descent method, one of various optimization algorithms, to calculate the predicted data model in Linear Regression, and we learned the basic flow of Gradient Descent, with a Python example showing Linear Regression via Gradient Descent.
Amazon API Gateway is a fully managed service that allows you to create, secure, publish, test, and monitor your APIs. We often come across scenarios where the customers of these APIs expect a platform on which to learn about and discover the APIs available to them (often with examples).
The Serverless Developer Portal is one such application that is used for developer engagement by making your APIs available to your customers. Further, your customers can use the developer portal to subscribe to an API, browse API documentation, test published APIs, monitor their API usage, and submit their feedback.
This blog is a detailed step-by-step guide for deploying the Serverless Developer Portal for APIs that are managed via Amazon API Gateway.
Advantages
The users of Amazon API Gateway can be broadly categorized as:
API Publishers – They can use the Serverless Developer Portal to expose and secure their APIs for customers which can be integrated with AWS Marketplace for monetary benefits. Furthermore, they can customize the developer portal, including content, styling, logos, custom domains, etc.
API Consumers – They could be Frontend/Backend developers, third party customers, or simply students. They can explore available APIs, invoke the APIs, and go through the documentation to get an insight into how each API works with different requests.
Developer Portal Architecture
We first need to establish a basic understanding of how the developer portal works. The Serverless Developer Portal is a serverless application built on a microservice architecture using Amazon API Gateway, Amazon Cognito, AWS Lambda, Amazon S3 (Simple Storage Service), and Amazon CloudFront.
The developer portal comprises multiple microservices and components as described in the following figure.
There are a few key pieces in the above architecture –
Identity Management: Amazon Cognito is the secure user directory of the developer portal, responsible for user management. It allows you to configure triggers for registration, authentication, and confirmation, giving you more control over the authentication process.
Business Logic: Amazon CloudFront is configured to serve your static content hosted in a private S3 bucket. The static content is built using the React JS framework, which interacts with the backend APIs dictating the business logic for various events.
Catalog Management: The developer portal uses a catalog for rendering the APIs with Swagger specifications on the APIs page. The catalog file (catalog.json in the S3 artifact bucket) is updated whenever an API is published or removed. This is achieved by an S3 trigger on an AWS Lambda function responsible for reading the content of the catalog directory and generating the catalog for the developer portal.
API Key Creation: An API key is created for consumers at the time of registration. Whenever you subscribe to an API, the associated usage plans are attached to your API key, giving you access to those APIs as defined by the usage plan. The Cognito user to API key mapping is stored in a DynamoDB table, along with other registration-related details.
Static Asset Uploader: AWS Lambda (Static-Asset-Uploader) is responsible for updating/deploying static assets for the developer portal. Static assets include – content, logos, icons, CSS, JavaScripts, and other media files.
Let’s move forward to building and deploying a simple Serverless Developer Portal.
Building Your API
Start by deploying an API that can be accessed via Amazon API Gateway.
If you do not have any such API available, create a simple application by jumping to the section "API Performance Across the Globe" in this blog.
Set up a custom domain name
For professional projects, I recommend that you create a custom domain name as they provide simpler and more intuitive URLs you can provide to your API users.
Make sure your API Gateway domain name is updated in the Route53 record set created after you set up your custom domain name.
Enable CORS
There are two ways you can enable CORS on a resource:
Enable CORS Using the Console
Enable CORS on a resource using the import API from Amazon API Gateway
Let's discuss the easiest way to do it, using the console:
Open the API Gateway console.
Select your API from the list.
Choose a resource to enable CORS for all the methods under that resource. Alternatively, you could choose a method under the resource to enable CORS for just that method.
Select Enable CORS from the Actions drop-down menu.
In the Enable CORS form, leave the Access-Control-Allow-Headers and Access-Control-Allow-Origin headers at their default values, then click on Enable CORS and replace existing CORS headers.
Review the changes in the Confirm method changes popup and choose Yes, overwrite existing values to apply your CORS settings.
Once enabled, you will see a mock integration on the OPTIONS method for the selected resource. You must enable CORS for the {proxy+} resources too.
To verify that CORS is enabled on the API resource, try a curl request with the OPTIONS method:
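For example (the API id, region, stage, and resource path below are placeholders; substitute your own invoke URL). The -i flag prints the response headers so you can check for the Access-Control-Allow-* headers returned by the mock OPTIONS integration:

```shell
curl -i -X OPTIONS \
  -H "Origin: https://example.com" \
  -H "Access-Control-Request-Method: GET" \
  https://your-api-id.execute-api.us-east-1.amazonaws.com/prod/your-resource
```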
Deploying the Developer Portal
There are two ways to deploy the developer portal for your API.
Using SAR
An easy way is to deploy api-gateway-dev-portal directly from the AWS Serverless Application Repository.
Note: If you intend to upgrade your developer portal to a major version, you need to refer to the upgrading instructions, which are currently under development.
Using AWS SAM
Ensure that you have the latest AWS CLI and AWS SAM CLI installed and configured.
Update the CloudFormation template file cloudformation/template.yaml.
Parameters you must configure and verify include:
ArtifactsS3BucketName
DevPortalSiteS3BucketName
DevPortalCustomersTableName
DevPortalPreLoginAccountsTableName
DevPortalAdminEmail
DevPortalFeedbackTableName
CognitoIdentityPoolName
CognitoDomainNameOrPrefix
CustomDomainName
CustomDomainNameAcmCertArn
UseRoute53Nameservers
AccountRegistrationMode
You can view your template file in AWS CloudFormation Designer to get a better idea of all the components and services involved and how they are connected.
Replace the static files in your project with the ones you would like to use, under dev-portal/public/custom-content and lambdas/static-asset-uploader/build:
api-logo contains the logos you would like to show on the API page (in png format). The portal checks for an api-id_stage.png file when rendering the API page; if not found, it chooses the default logo, default.png.
content-fragments includes various markdown files comprising the content of the different pages in the portal.
Other static assets, including favicon.ico, home-image.png, and nav-logo.png, appear on your portal.
Let's create a ZIP file of your code and dependencies and upload it to Amazon S3. Running the command below creates an AWS SAM template, packaged.yaml, replacing references to local artifacts with the Amazon S3 locations where the command uploaded them:
sam package \
  --template-file ./cloudformation/template.yaml \
  --output-template-file ./cloudformation/packaged.yaml \
  --s3-bucket {your-lambda-artifacts-bucket-name}
Run the following command from the project root to deploy your portal, replacing {your-template-bucket-name} with the name of your Amazon S3 bucket, {custom-prefix} with a prefix that is globally unique, and {cognito-domain-or-prefix} with a unique string.
sam deploy \
  --template-file ./cloudformation/packaged.yaml \
  --s3-bucket {your-template-bucket-name} \
  --stack-name "{custom-prefix}-dev-portal" \
  --capabilities CAPABILITY_NAMED_IAM
Note: Ensure that you have the required privileges to make deployments, as the deployment process attempts to create various resources such as AWS Lambda functions, a Cognito user pool, IAM roles, an API Gateway, a CloudFront distribution, etc.
After your developer portal has been fully deployed, you can get its URL as follows:
Open the AWS CloudFormation console.
Select the stack you created above.
Open the Outputs section. The URL for the developer portal is specified in the WebSiteURL property.
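The same output can be read from the command line. Below is a sketch using the AWS CLI; the stack name is a placeholder for whatever you passed to sam deploy, and the call assumes configured credentials:

```shell
# Hypothetical stack name; use the value you chose with `sam deploy`.
STACK_NAME="custom-prefix-dev-portal"

# JMESPath query that picks out only the WebSiteURL output.
QUERY="Stacks[0].Outputs[?OutputKey=='WebSiteURL'].OutputValue"

# Runs only where the AWS CLI is installed and configured.
if command -v aws >/dev/null 2>&1; then
  aws cloudformation describe-stacks \
    --stack-name "$STACK_NAME" \
    --query "$QUERY" \
    --output text || true
fi
```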
Create Usage Plan
Create a usage plan to list your API under the subscribable APIs category, allowing consumers to access the API using their API keys in the developer portal. Ensure that the API Gateway stage is associated with the usage plan.
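This can also be done from the command line. A hedged sketch with the AWS CLI follows; the API id, stage, and plan name are placeholders, and the call assumes configured credentials:

```shell
# Placeholder API id and stage; take these from the API Gateway console.
API_ID="abc123"
STAGE_NAME="prod"

if command -v aws >/dev/null 2>&1; then
  # Create a usage plan attached to the API's stage so it becomes subscribable.
  aws apigateway create-usage-plan \
    --name "dev-portal-plan" \
    --api-stages "apiId=$API_ID,stage=$STAGE_NAME" || true
fi
```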
Publishing an API
Only Administrators have permission to publish an API. To create an Administrator account for your developer portal:
1. Go to the WebSiteURL obtained after the successful deployment.
2. At the top right of the home page, click Register and complete the sign-up form.
3. Enter the confirmation code sent to the email address you provided in the previous step.
4. Promote the user to Administrator by adding it to the AdminGroup:
– Open the Amazon Cognito User Pool console.
– Select the User Pool created for your developer portal.
– From the General Settings > Users and Groups page, select the user you want to promote to Administrator.
– Click Add to group, select the Admin group from the dropdown, and confirm.
5. Log in again to be recognized as an Administrator. Click Admin Panel and choose the API you wish to publish from the APIs list.
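The console steps for promoting a user can also be scripted. The sketch below uses the AWS CLI; the pool id and username are placeholders, and the group name is assumed to match the AdminGroup mentioned above:

```shell
# Placeholder identifiers; look them up in the Cognito console.
USER_POOL_ID="us-east-1_EXAMPLE"
USER_NAME="you@example.com"
GROUP_NAME="AdminGroup"

if command -v aws >/dev/null 2>&1; then
  # Add the registered portal user to the admin group.
  aws cognito-idp admin-add-user-to-group \
    --user-pool-id "$USER_POOL_ID" \
    --username "$USER_NAME" \
    --group-name "$GROUP_NAME" || true
fi
```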
Setting up an account
The signup process depends on the registration mode selected for the developer portal.
For request registration mode, you need to wait for the Administrator to approve your registration request.
For invite registration mode, you can only register on the portal when invited by the portal administrator.
Subscribing to an API
Sign in to the developer portal.
Navigate to the Dashboard page and copy your API Key.
Go to APIs Page to see a list of published APIs.
Select an API you wish to subscribe to and hit the Subscribe button.
Tips
When a user subscribes to an API, every API associated with that usage plan becomes accessible with the same API key, whether or not it is visible in the portal.
When an API is published, its catalog entry is exported from the API Gateway resource documentation. You can customize this workflow or override the catalog's Swagger definition JSON in the S3 bucket defined by ArtifactsS3BucketName, under /catalog/{api-id}_{stage}.json.
For backend APIs, CORS requests are allowed only from the custom domain name selected for your developer portal.
Make sure the published APIs set the appropriate CORS response headers so they can be invoked from the developer portal.
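The catalog override mentioned in the tips can be sketched as a one-line upload; the bucket, file name, API id, and stage below are placeholders, and the call assumes configured credentials:

```shell
# Placeholder values; ArtifactsS3BucketName comes from your stack's parameters.
BUCKET="my-artifacts-bucket"
API_ID="abc123"
STAGE="prod"
KEY="catalog/${API_ID}_${STAGE}.json"

if command -v aws >/dev/null 2>&1; then
  # Upload a hand-edited Swagger definition to override the generated catalog entry.
  aws s3 cp ./my-swagger.json "s3://$BUCKET/$KEY" || true
fi
```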
Summary
You’ve seen how to deploy a Serverless Developer Portal and publish an API. If you are creating a serverless application for the first time, you might want to read more on Serverless Computing and Amazon API Gateway before you get started.
Start building your own developer portal. To learn more about distributing your API Gateway APIs to your customers, follow this AWS guide.
Automation is everywhere, and it is better to adopt it as soon as possible. In this blog post, we are going to discuss creating the infrastructure for a deployment pipeline hosted on AWS. Packer will be used to create AMIs, and Terraform will be used to create the Jenkins master and slaves. We will discuss different ways of connecting the slaves and will also run a sample application through the pipeline.
Please remember that the intent of this blog is to bring all the different components together, which means some code that would normally live in a development repo is also included here. Now that we have highlighted the required tools, the 10,000-ft view, and the intent of the blog, let’s begin.
Using Packer to Create AMIs for the Jenkins Master and Linux Slave
HashiCorp has given us some of the most useful tools for simplifying our lives, and Packer is one of them. Packer can be used to create custom AMIs from already available ones. We just need to create a JSON template and pass an installation script as part of the build, and Packer takes care of producing the AMI for us. Install Packer from the Packer downloads page, depending on your platform. For simplicity, we will use a Linux machine for creating both the Jenkins master and the Linux slave; the JSON file for both will be the same, but it can be split if needed.
Note: the user-data passed from Terraform will be different, which is what eventually differentiates their usage.
We are using Amazon Linux 2; the JSON file for it is below.
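The original template is not reproduced here, so the following is a minimal sketch of what such a Packer template might look like. The AMI name matches the amazon-linux-for-jenkins* filter used by the Terraform lookup later in the post; the region, instance type, and source-AMI filter values are assumptions:

```json
{
  "builders": [
    {
      "type": "amazon-ebs",
      "region": "us-east-1",
      "instance_type": "t3.medium",
      "ssh_username": "ec2-user",
      "ami_name": "amazon-linux-for-jenkins-{{timestamp}}",
      "ami_users": ["<your-account-id>"],
      "source_ami_filter": {
        "filters": {
          "name": "amzn2-ami-hvm-*-x86_64-gp2",
          "virtualization-type": "hvm",
          "root-device-type": "ebs"
        },
        "owners": ["amazon"],
        "most_recent": true
      }
    }
  ],
  "provisioners": [
    {
      "type": "shell",
      "script": "install_amazon.bash"
    }
  ]
}
```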
As you can see, the file is pretty simple. The only thing of interest here is the install_amazon.bash script. In this blog post, we will deploy a Node-based application running inside a Docker container. The content of the bash file is as follows:
#!/bin/bash
set -x
# For Node
curl -sL https://rpm.nodesource.com/setup_10.x | sudo -E bash -
# For xmlstarlet
sudo yum install -y https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm
sudo yum update -y
sleep 10
# Setting up Docker
sudo yum install -y docker
sudo usermod -a -G docker ec2-user
# Just to be safe, removing previously available java if present
sudo yum remove -y java
sudo yum install -y python2-pip jq unzip vim tree biosdevname nc mariadb bind-utils at screen tmux xmlstarlet git java-1.8.0-openjdk nc gcc-c++ make nodejs
sudo -H pip install awscli bcrypt
sudo -H pip install --upgrade awscli
sudo -H pip install --upgrade aws-ec2-assign-elastic-ip
sudo npm install -g @angular/cli
sudo systemctl enable docker
sudo systemctl enable atd
sudo yum clean all
sudo rm -rf /var/cache/yum/
exit 0
Now, there are a lot of things mentioned here, so let’s check them out. As mentioned earlier, we will be discussing different ways of connecting to a slave, and for one of them we need xmlstarlet. The rest are packages that we might need in one way or another.
Update ami_users with your actual account ID. This can be found on the AWS console under Support > Support Center.
Validate what we have written by running packer validate amazon.json.
Once confirmed, build the Packer image by running packer build amazon.json.
After completion, check your AWS console and you will find a new AMI under “My AMIs”.
It’s now time to start using terraform for creating the machines.
Prerequisite:
1. Please make sure you create a provider.tf file.
provider "aws" {
  region                  = "us-east-1"
  shared_credentials_file = "~/.aws/credentials"
  profile                 = "dev"
}
The credentials file will contain aws_access_key_id and aws_secret_access_key.
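For reference, a typical ~/.aws/credentials file with the dev profile used above looks like this (the key values are placeholders):

```ini
[dev]
aws_access_key_id = AKIAXXXXXXXXXXXXXXXX
aws_secret_access_key = xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
```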
2. Keep SSH keys handy for the master/slave machines. Here is a nice article highlighting how to create them, or else create them beforehand on the AWS console and reference them in the code.
3. VPC:
# lookup for the "default" VPC
data "aws_vpc" "default_vpc" {
  default = true
}

# subnet list in the "default" VPC
# The "default" VPC has all "public subnets"
data "aws_subnet_ids" "default_public" {
  vpc_id = "${data.aws_vpc.default_vpc.id}"
}
Creating Terraform Script for Spinning up Jenkins Master
Get Terraform from the Terraform downloads page.
We will need to set up the Security Group before setting up the instance.
# Security Group:
resource "aws_security_group" "jenkins_server" {
  name        = "jenkins_server"
  description = "Jenkins Server: created by Terraform for [dev]"
  # legacy name of VPC ID
  vpc_id = "${data.aws_vpc.default_vpc.id}"
  tags {
    Name = "jenkins_server"
    env  = "dev"
  }
}

###############################################################################
# ALL INBOUND
###############################################################################
# ssh
resource "aws_security_group_rule" "jenkins_server_from_source_ingress_ssh" {
  type              = "ingress"
  from_port         = 22
  to_port           = 22
  protocol          = "tcp"
  security_group_id = "${aws_security_group.jenkins_server.id}"
  cidr_blocks       = ["<Your Public IP>/32", "172.0.0.0/8"]
  description       = "ssh to jenkins_server"
}

# web
resource "aws_security_group_rule" "jenkins_server_from_source_ingress_webui" {
  type              = "ingress"
  from_port         = 8080
  to_port           = 8080
  protocol          = "tcp"
  security_group_id = "${aws_security_group.jenkins_server.id}"
  cidr_blocks       = ["0.0.0.0/0"]
  description       = "jenkins server web"
}

# JNLP
resource "aws_security_group_rule" "jenkins_server_from_source_ingress_jnlp" {
  type              = "ingress"
  from_port         = 33453
  to_port           = 33453
  protocol          = "tcp"
  security_group_id = "${aws_security_group.jenkins_server.id}"
  cidr_blocks       = ["172.31.0.0/16"]
  description       = "jenkins server JNLP Connection"
}

###############################################################################
# ALL OUTBOUND
###############################################################################
resource "aws_security_group_rule" "jenkins_server_to_other_machines_ssh" {
  type              = "egress"
  from_port         = 22
  to_port           = 22
  protocol          = "tcp"
  security_group_id = "${aws_security_group.jenkins_server.id}"
  cidr_blocks       = ["0.0.0.0/0"]
  description       = "allow jenkins servers to ssh to other machines"
}

resource "aws_security_group_rule" "jenkins_server_outbound_all_80" {
  type              = "egress"
  from_port         = 80
  to_port           = 80
  protocol          = "tcp"
  security_group_id = "${aws_security_group.jenkins_server.id}"
  cidr_blocks       = ["0.0.0.0/0"]
  description       = "allow jenkins servers for outbound yum"
}

resource "aws_security_group_rule" "jenkins_server_outbound_all_443" {
  type              = "egress"
  from_port         = 443
  to_port           = 443
  protocol          = "tcp"
  security_group_id = "${aws_security_group.jenkins_server.id}"
  cidr_blocks       = ["0.0.0.0/0"]
  description       = "allow jenkins servers for outbound yum"
}
Now that we have a custom AMI and security groups, let’s use them to create the Jenkins master instance with Terraform.
# AMI lookup for this Jenkins Server
data "aws_ami" "jenkins_server" {
  most_recent = true
  owners      = ["self"]
  filter {
    name   = "name"
    values = ["amazon-linux-for-jenkins*"]
  }
}

resource "aws_key_pair" "jenkins_server" {
  key_name   = "jenkins_server"
  public_key = "${file("jenkins_server.pub")}"
}

# lookup the security group of the Jenkins Server
data "aws_security_group" "jenkins_server" {
  filter {
    name   = "group-name"
    values = ["jenkins_server"]
  }
}

# userdata for the Jenkins server ...
data "template_file" "jenkins_server" {
  template = "${file("scripts/jenkins_server.sh")}"
  vars {
    env                    = "dev"
    jenkins_admin_password = "mysupersecretpassword"
  }
}

# the Jenkins server itself
resource "aws_instance" "jenkins_server" {
  ami                    = "${data.aws_ami.jenkins_server.image_id}"
  instance_type          = "t3.medium"
  key_name               = "${aws_key_pair.jenkins_server.key_name}"
  subnet_id              = "${data.aws_subnet_ids.default_public.ids[0]}"
  vpc_security_group_ids = ["${data.aws_security_group.jenkins_server.id}"]
  iam_instance_profile   = "dev_jenkins_server"
  user_data              = "${data.template_file.jenkins_server.rendered}"
  tags {
    "Name" = "jenkins_server"
  }
  root_block_device {
    delete_on_termination = true
  }
}

output "jenkins_server_ami_name" {
  value = "${data.aws_ami.jenkins_server.name}"
}

output "jenkins_server_ami_id" {
  value = "${data.aws_ami.jenkins_server.id}"
}

output "jenkins_server_public_ip" {
  value = "${aws_instance.jenkins_server.public_ip}"
}

output "jenkins_server_private_ip" {
  value = "${aws_instance.jenkins_server.private_ip}"
}
As mentioned before, we will discuss multiple ways to connect slaves to the Jenkins master. It is well known that every time a new Jenkins instance comes up, it generates a unique initial password. There are two ways to deal with this: wait for Jenkins to spin up and retrieve that password, or directly set the admin password while provisioning the master. Here we discuss how to change the password while configuring Jenkins. (If you need the script that retrieves the Jenkins password as soon as it gets created, leave a comment and I will share that as well.)
Below is the user data that installs the Jenkins master, configures its password, and installs the required packages.
#!/bin/bash
set -x

function wait_for_jenkins() {
  while (( 1 )); do
    echo "waiting for Jenkins to launch on port [8080] ..."
    nc -zv 127.0.0.1 8080
    if (( $? == 0 )); then
      break
    fi
    sleep 10
  done
  echo "Jenkins launched"
}

function updating_jenkins_master_password() {
  cat > /tmp/jenkinsHash.py <<EOF
import bcrypt
import sys
if not sys.argv[1]:
    sys.exit(10)
plaintext_pwd = sys.argv[1]
encrypted_pwd = bcrypt.hashpw(sys.argv[1], bcrypt.gensalt(rounds=10, prefix=b"2a"))
isCorrect = bcrypt.checkpw(plaintext_pwd, encrypted_pwd)
if not isCorrect:
    sys.exit(20)
print "{}".format(encrypted_pwd)
EOF
  chmod +x /tmp/jenkinsHash.py
  # Wait till the /var/lib/jenkins/users/admin* folder gets created
  sleep 10
  cd /var/lib/jenkins/users/admin*
  pwd
  while (( 1 )); do
    echo "Waiting for Jenkins to generate admin user's config file ..."
    if [[ -f "./config.xml" ]]; then
      break
    fi
    sleep 10
  done
  echo "Admin config file created"
  admin_password=$(python /tmp/jenkinsHash.py ${jenkins_admin_password} 2>&1)
  # Please do not remove the single quotes, as they keep the hash syntax intact;
  # otherwise, during substitution, $<character> will be replaced by null
  xmlstarlet -q ed --inplace -u "/user/properties/hudson.security.HudsonPrivateSecurityRealm_-Details/passwordHash" -v '#jbcrypt:'"$admin_password" config.xml
  # Restart
  systemctl restart jenkins
  sleep 10
}

function install_packages() {
  wget -O /etc/yum.repos.d/jenkins.repo http://pkg.jenkins-ci.org/redhat-stable/jenkins.repo
  rpm --import https://jenkins-ci.org/redhat/jenkins-ci.org.key
  yum install -y jenkins
  # firewall
  #firewall-cmd --permanent --new-service=jenkins
  #firewall-cmd --permanent --service=jenkins --set-short="Jenkins Service Ports"
  #firewall-cmd --permanent --service=jenkins --set-description="Jenkins Service firewalld port exceptions"
  #firewall-cmd --permanent --service=jenkins --add-port=8080/tcp
  #firewall-cmd --permanent --add-service=jenkins
  #firewall-cmd --zone=public --add-service=http --permanent
  #firewall-cmd --reload
  systemctl enable jenkins
  systemctl restart jenkins
  sleep 10
}

function configure_jenkins_server() {
  # Jenkins cli
  echo "installing the Jenkins cli ..."
  cp /var/cache/jenkins/war/WEB-INF/jenkins-cli.jar /var/lib/jenkins/jenkins-cli.jar
  # Getting initial password
  # PASSWORD=$(cat /var/lib/jenkins/secrets/initialAdminPassword)
  PASSWORD="${jenkins_admin_password}"
  sleep 10
  jenkins_dir="/var/lib/jenkins"
  plugins_dir="$jenkins_dir/plugins"
  cd $jenkins_dir
  # Open JNLP port
  xmlstarlet -q ed --inplace -u "/hudson/slaveAgentPort" -v 33453 config.xml
  cd $plugins_dir || { echo "unable to chdir to [$plugins_dir]"; exit 1; }
  # List of plugins that need to be installed
  plugin_list="git-client git github-api github-oauth github MSBuild ssh-slaves workflow-aggregator ws-cleanup"
  # remove existing plugins, if any ...
  rm -rfv $plugin_list
  for plugin in $plugin_list; do
    echo "installing plugin [$plugin] ..."
    java -jar $jenkins_dir/jenkins-cli.jar -s http://127.0.0.1:8080/ -auth admin:$PASSWORD install-plugin $plugin
  done
  # Restart jenkins after installing plugins
  java -jar $jenkins_dir/jenkins-cli.jar -s http://127.0.0.1:8080 -auth admin:$PASSWORD safe-restart
}

### script starts here ###
install_packages
wait_for_jenkins
updating_jenkins_master_password
wait_for_jenkins
configure_jenkins_server
echo "Done"
exit 0
There is a lot of stuff covered here, but the trickiest bit is changing the Jenkins password. We use a Python script that hashes the plain text with bcrypt in Jenkins’s encryption format, and xmlstarlet to replace the password hash at the right location in config.xml. We also use xmlstarlet to set the JNLP port for the Windows slave. Remember that the initial username for Jenkins is admin.
Commands to run: initialize Terraform with terraform init, then check and apply with terraform plan followed by terraform apply.
After the apply command succeeds, go to the AWS console and check for the new instance coming up. Hit http://<public-ip>:8080, enter the credentials you passed, and you will have your Jenkins master ready to be used.
Note: I will be providing the terraform script and permission list of IAM roles for the user at the end of the blog.
Creating Terraform Script for Spinning up a Linux Slave and Connecting It to the Master
We won’t be creating a new image here; rather, we will use the same one that we used for the Jenkins master.
The VPC will be the same, and the updated security groups for the slave are below:
resource "aws_security_group" "dev_jenkins_worker_linux" {
  name        = "dev_jenkins_worker_linux"
  description = "Jenkins Server: created by Terraform for [dev]"
  # legacy name of VPC ID
  vpc_id = "${data.aws_vpc.default_vpc.id}"
  tags {
    Name = "dev_jenkins_worker_linux"
    env  = "dev"
  }
}

###############################################################################
# ALL INBOUND
###############################################################################
# ssh
resource "aws_security_group_rule" "jenkins_worker_linux_from_source_ingress_ssh" {
  type              = "ingress"
  from_port         = 22
  to_port           = 22
  protocol          = "tcp"
  security_group_id = "${aws_security_group.dev_jenkins_worker_linux.id}"
  cidr_blocks       = ["<Your Public IP>/32"]
  description       = "ssh to jenkins_worker_linux"
}

# web
resource "aws_security_group_rule" "jenkins_worker_linux_from_source_ingress_webui" {
  type              = "ingress"
  from_port         = 8080
  to_port           = 8080
  protocol          = "tcp"
  security_group_id = "${aws_security_group.dev_jenkins_worker_linux.id}"
  cidr_blocks       = ["0.0.0.0/0"]
  description       = "ssh to jenkins_worker_linux"
}

###############################################################################
# ALL OUTBOUND
###############################################################################
resource "aws_security_group_rule" "jenkins_worker_linux_to_all_80" {
  type              = "egress"
  from_port         = 80
  to_port           = 80
  protocol          = "tcp"
  security_group_id = "${aws_security_group.dev_jenkins_worker_linux.id}"
  cidr_blocks       = ["0.0.0.0/0"]
  description       = "allow jenkins worker to all 80"
}

resource "aws_security_group_rule" "jenkins_worker_linux_to_all_443" {
  type              = "egress"
  from_port         = 443
  to_port           = 443
  protocol          = "tcp"
  security_group_id = "${aws_security_group.dev_jenkins_worker_linux.id}"
  cidr_blocks       = ["0.0.0.0/0"]
  description       = "allow jenkins worker to all 443"
}

resource "aws_security_group_rule" "jenkins_worker_linux_to_other_machines_ssh" {
  type              = "egress"
  from_port         = 22
  to_port           = 22
  protocol          = "tcp"
  security_group_id = "${aws_security_group.dev_jenkins_worker_linux.id}"
  cidr_blocks       = ["0.0.0.0/0"]
  description       = "allow jenkins worker linux to jenkins server"
}

resource "aws_security_group_rule" "jenkins_worker_linux_to_jenkins_server_8080" {
  type                     = "egress"
  from_port                = 8080
  to_port                  = 8080
  protocol                 = "tcp"
  security_group_id        = "${aws_security_group.dev_jenkins_worker_linux.id}"
  source_security_group_id = "${aws_security_group.jenkins_server.id}"
  description              = "allow jenkins workers linux to jenkins server"
}
Now that we have the required security groups in place, it is time to bring in the Terraform script for the Linux slave.
And now the final piece of code: the user-data of the slave machine.
#!/bin/bash
set -x

function wait_for_jenkins() {
  echo "Waiting jenkins to launch on 8080..."
  while (( 1 )); do
    echo "Waiting for Jenkins"
    nc -zv ${server_ip} 8080
    if (( $? == 0 )); then
      break
    fi
    sleep 10
  done
  echo "Jenkins launched"
}

function slave_setup() {
  # Wait till jar file gets available
  ret=1
  while (( $ret != 0 )); do
    wget -O /opt/jenkins-cli.jar http://${server_ip}:8080/jnlpJars/jenkins-cli.jar
    ret=$?
    echo "jenkins cli ret [$ret]"
  done
  ret=1
  while (( $ret != 0 )); do
    wget -O /opt/slave.jar http://${server_ip}:8080/jnlpJars/slave.jar
    ret=$?
    echo "jenkins slave ret [$ret]"
  done
  mkdir -p /opt/jenkins-slave
  chown -R ec2-user:ec2-user /opt/jenkins-slave
  # Register slave
  JENKINS_URL="http://${server_ip}:8080"
  USERNAME="${jenkins_username}"
  # PASSWORD=$(cat /tmp/secret)
  PASSWORD="${jenkins_password}"
  SLAVE_IP=$(ip -o -4 addr list ${device_name} | head -n1 | awk '{print $4}' | cut -d/ -f1)
  NODE_NAME=$(echo "jenkins-slave-linux-$SLAVE_IP" | tr '.' '-')
  NODE_SLAVE_HOME="/opt/jenkins-slave"
  EXECUTORS=2
  SSH_PORT=22
  CRED_ID="$NODE_NAME"
  LABELS="build linux docker"
  USERID="ec2-user"
  cd /opt
  # Creating CMD utility for jenkins-cli commands
  jenkins_cmd="java -jar /opt/jenkins-cli.jar -s $JENKINS_URL -auth $USERNAME:$PASSWORD"
  # Waiting for Jenkins to load all plugins
  while (( 1 )); do
    count=$($jenkins_cmd list-plugins 2>/dev/null | wc -l)
    ret=$?
    echo "count [$count] ret [$ret]"
    if (( $count > 0 )); then
      break
    fi
    sleep 30
  done
  # Delete credentials if present for respective slave machines
  $jenkins_cmd delete-credentials system::system::jenkins _ $CRED_ID
  # Generating cred.xml for creating credentials on Jenkins server
  cat > /tmp/cred.xml <<EOF
<com.cloudbees.jenkins.plugins.sshcredentials.impl.BasicSSHUserPrivateKey plugin="ssh-credentials@1.16">
  <scope>GLOBAL</scope>
  <id>$CRED_ID</id>
  <description>Generated via Terraform for $SLAVE_IP</description>
  <username>$USERID</username>
  <privateKeySource class="com.cloudbees.jenkins.plugins.sshcredentials.impl.BasicSSHUserPrivateKey\$DirectEntryPrivateKeySource">
    <privateKey>${worker_pem}</privateKey>
  </privateKeySource>
</com.cloudbees.jenkins.plugins.sshcredentials.impl.BasicSSHUserPrivateKey>
EOF
  # Creating credential using cred.xml
  cat /tmp/cred.xml | $jenkins_cmd create-credentials-by-xml system::system::jenkins _
  # For deleting node, used when testing
  $jenkins_cmd delete-node $NODE_NAME
  # Generating node.xml for creating node on Jenkins server
  cat > /tmp/node.xml <<EOF
<slave>
  <name>$NODE_NAME</name>
  <description>Linux Slave</description>
  <remoteFS>$NODE_SLAVE_HOME</remoteFS>
  <numExecutors>$EXECUTORS</numExecutors>
  <mode>NORMAL</mode>
  <retentionStrategy class="hudson.slaves.RetentionStrategy\$Always"/>
  <launcher class="hudson.plugins.sshslaves.SSHLauncher" plugin="ssh-slaves@1.5">
    <host>$SLAVE_IP</host>
    <port>$SSH_PORT</port>
    <credentialsId>$CRED_ID</credentialsId>
  </launcher>
  <label>$LABELS</label>
  <nodeProperties/>
  <userId>$USERID</userId>
</slave>
EOF
  sleep 10
  # Creating node using node.xml
  cat /tmp/node.xml | $jenkins_cmd create-node $NODE_NAME
}

### script begins here ###
wait_for_jenkins
slave_setup
echo "Done"
exit 0
This will not only create a node on Jenkins master but also attach it.
Commands to run: initialize Terraform with terraform init, then check and apply with terraform plan followed by terraform apply.
One drawback of this approach: if the slave gets disconnected or goes down, it will remain on the Jenkins master as offline, and it will not automatically re-attach itself to the master.
Some solutions for them are:
1. Create a cron job on the slave which will run user-data after a certain interval.
2. Use the Jenkins Swarm plugin.
3. As we are on AWS, we can even use Amazon EC2 Plugin.
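The cron-based option can be as simple as registering a crontab entry on the slave; the script path below is a hypothetical wrapper around the registration logic in the user-data above:

```shell
# Hypothetical wrapper around the slave registration user-data; re-running it
# periodically lets a disconnected slave re-register itself with the master.
REATTACH_SCRIPT="/opt/reattach-slave.sh"
CRON_LINE="*/5 * * * * $REATTACH_SCRIPT"

# Print the entry; install it on the slave with: echo "$CRON_LINE" | crontab -
echo "$CRON_LINE"
```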
Maybe in a future blog, we will cover using both of these plugins as well.
Using Packer to Create AMIs for the Windows Slave
The Windows AMI will also be created using Packer. All the pointers for Windows remain as they were for Linux.
When it comes to Windows, one should know that it does not behave the same way Linux does. For us to be able to communicate with this image, an essential component is WinRM, which we set up at the very beginning as part of user_data_file. Also, Windows asks for user input for a lot of things, which would break the flow of an automated run, so we disable UAC, and we enable RDP so that we can connect to the machine from our local desktop for debugging if needed. Finally, we execute the install_windows.ps1 file, which sets up our slave. Please note that at the end we call two PowerShell scripts to generate a random password every time a new machine is created; they are mandatory, or you will never be able to log in to your machines.
There are multiple user-data in the above code, let’s understand them in their order of appearance.
SetUpWinRM.ps1:
<powershell>
write-output "Running User Data Script"
write-host "(host) Running User Data Script"
Set-ExecutionPolicy Unrestricted -Scope LocalMachine -Force -ErrorAction Ignore
# Don't set this before Set-ExecutionPolicy as it throws an error
$ErrorActionPreference = "stop"
# Remove HTTP listener
Remove-Item -Path WSMan:\Localhost\listener\listener* -Recurse
$Cert = New-SelfSignedCertificate -CertstoreLocation Cert:\LocalMachine\My -DnsName "packer"
New-Item -Path WSMan:\LocalHost\Listener -Transport HTTPS -Address * -CertificateThumbPrint $Cert.Thumbprint -Force
# WinRM
write-output "Setting up WinRM"
write-host "(host) setting up WinRM"
cmd.exe /c winrm quickconfig -q
cmd.exe /c winrm set "winrm/config" '@{MaxTimeoutms="1800000"}'
cmd.exe /c winrm set "winrm/config/winrs" '@{MaxMemoryPerShellMB="1024"}'
cmd.exe /c winrm set "winrm/config/service" '@{AllowUnencrypted="true"}'
cmd.exe /c winrm set "winrm/config/client" '@{AllowUnencrypted="true"}'
cmd.exe /c winrm set "winrm/config/service/auth" '@{Basic="true"}'
cmd.exe /c winrm set "winrm/config/client/auth" '@{Basic="true"}'
cmd.exe /c winrm set "winrm/config/service/auth" '@{CredSSP="true"}'
cmd.exe /c winrm set "winrm/config/listener?Address=*+Transport=HTTPS" "@{Port=`"5986`";Hostname=`"packer`";CertificateThumbprint=`"$($Cert.Thumbprint)`"}"
cmd.exe /c netsh advfirewall firewall set rule group="remote administration" new enable=yes
cmd.exe /c netsh firewall add portopening TCP 5986 "Port 5986"
cmd.exe /c net stop winrm
cmd.exe /c sc config winrm start= auto
cmd.exe /c net start winrm
</powershell>
The content is pretty straightforward, as it is just setting up WinRM. The only thing that matters here is the <powershell> and </powershell> tags; they are mandatory, since without them Packer cannot tell what type of script this is. Next we come across disable-uac.ps1 and enable-rdp.ps1, whose purpose we discussed before. The last user-data is the actual script that installs all the required packages in the AMI.
Chocolatey: a blessing in disguise. Installing applications on Windows from a script is a real headache, as you have to write a lot of boilerplate just to install a single application, but luckily we have Chocolatey. It works as a package manager for Windows and lets us install applications the same way we install packages on Linux. install_windows.ps1 has the installation step for Chocolatey and shows how it can be used to install other applications on Windows.
See, with such a small script you can get all the components to run your Windows application in no time. (Kidding… this script actually takes around 20 minutes to run.)
Now that we have the image, let’s move on to the Terraform script that makes this machine a slave of your Jenkins master.
Creating Terraform Script for Spinning up Windows Slave and Connect it to Master
This time, too, we will first create the security groups and then create the slave machine from the AMI we built above.
resource "aws_security_group" "dev_jenkins_worker_windows" {
  name        = "dev_jenkins_worker_windows"
  description = "Jenkins Server: created by Terraform for [dev]"
  # legacy name of VPC ID
  vpc_id = "${data.aws_vpc.default_vpc.id}"
  tags {
    Name = "dev_jenkins_worker_windows"
    env  = "dev"
  }
}

###############################################################################
# ALL INBOUND
###############################################################################
# web
resource "aws_security_group_rule" "jenkins_worker_windows_from_source_ingress_webui" {
  type              = "ingress"
  from_port         = 8080
  to_port           = 8080
  protocol          = "tcp"
  security_group_id = "${aws_security_group.dev_jenkins_worker_windows.id}"
  cidr_blocks       = ["0.0.0.0/0"]
  description       = "ssh to jenkins_worker_windows"
}

# rdp
resource "aws_security_group_rule" "jenkins_worker_windows_from_rdp" {
  type              = "ingress"
  from_port         = 3389
  to_port           = 3389
  protocol          = "tcp"
  security_group_id = "${aws_security_group.dev_jenkins_worker_windows.id}"
  cidr_blocks       = ["<Your Public IP>/32"]
  description       = "rdp to jenkins_worker_windows"
}

###############################################################################
# ALL OUTBOUND
###############################################################################
resource "aws_security_group_rule" "jenkins_worker_windows_to_all_80" {
  type              = "egress"
  from_port         = 80
  to_port           = 80
  protocol          = "tcp"
  security_group_id = "${aws_security_group.dev_jenkins_worker_windows.id}"
  cidr_blocks       = ["0.0.0.0/0"]
  description       = "allow jenkins worker to all 80"
}

resource "aws_security_group_rule" "jenkins_worker_windows_to_all_443" {
  type              = "egress"
  from_port         = 443
  to_port           = 443
  protocol          = "tcp"
  security_group_id = "${aws_security_group.dev_jenkins_worker_windows.id}"
  cidr_blocks       = ["0.0.0.0/0"]
  description       = "allow jenkins worker to all 443"
}

resource "aws_security_group_rule" "jenkins_worker_windows_to_jenkins_server_33453" {
  type              = "egress"
  from_port         = 33453
  to_port           = 33453
  protocol          = "tcp"
  security_group_id = "${aws_security_group.dev_jenkins_worker_windows.id}"
  cidr_blocks       = ["172.31.0.0/16"]
  description       = "allow jenkins worker windows to jenkins server"
}

resource "aws_security_group_rule" "jenkins_worker_windows_to_jenkins_server_8080" {
  type                     = "egress"
  from_port                = 8080
  to_port                  = 8080
  protocol                 = "tcp"
  security_group_id        = "${aws_security_group.dev_jenkins_worker_windows.id}"
  source_security_group_id = "${aws_security_group.jenkins_server.id}"
  description              = "allow jenkins workers windows to jenkins server"
}

resource "aws_security_group_rule" "jenkins_worker_windows_to_all_22" {
  type              = "egress"
  from_port         = 22
  to_port           = 22
  protocol          = "tcp"
  security_group_id = "${aws_security_group.dev_jenkins_worker_windows.id}"
  cidr_blocks       = ["0.0.0.0/0"]
  description       = "allow jenkins worker windows to connect outbound from 22"
}
Once the security groups are in place, we move on to the Terraform file for the Windows machine itself. Windows can’t connect to the Jenkins master over SSH, the method we used for the Linux slave; instead, we have to use JNLP. A quick recap: when creating the Jenkins master, we used xmlstarlet to set the JNLP port and added security group rules to allow JNLP connections. We have also opened the RDP port so that if any issue occurs, you can get into the machine and debug it.
Terraform file:
# Setting Up Windows Slave
data "aws_ami" "jenkins_worker_windows" {
  most_recent = true
  owners      = ["self"]
  filter {
    name   = "name"
    values = ["windows-slave-for-jenkins*"]
  }
}

resource "aws_key_pair" "jenkins_worker_windows" {
  key_name   = "jenkins_worker_windows"
  public_key = "${file("jenkins_worker.pub")}"
}

data "template_file" "userdata_jenkins_worker_windows" {
  template = "${file("scripts/jenkins_worker_windows.ps1")}"
  vars {
    env              = "dev"
    region           = "us-east-1"
    datacenter       = "dev-us-east-1"
    node_name        = "us-east-1-jenkins_worker_windows"
    domain           = ""
    device_name      = "eth0"
    server_ip        = "${aws_instance.jenkins_server.private_ip}"
    worker_pem       = "${data.local_file.jenkins_worker_pem.content}"
    jenkins_username = "admin"
    jenkins_password = "mysupersecretpassword"
  }
}

# lookup the security group of the Jenkins worker
data "aws_security_group" "jenkins_worker_windows" {
  filter {
    name   = "group-name"
    values = ["dev_jenkins_worker_windows"]
  }
}

resource "aws_launch_configuration" "jenkins_worker_windows" {
  name_prefix                 = "dev-jenkins-worker-"
  image_id                    = "${data.aws_ami.jenkins_worker_windows.image_id}"
  instance_type               = "t3.medium"
  iam_instance_profile        = "dev_jenkins_worker_windows"
  key_name                    = "${aws_key_pair.jenkins_worker_windows.key_name}"
  security_groups             = ["${data.aws_security_group.jenkins_worker_windows.id}"]
  user_data                   = "${data.template_file.userdata_jenkins_worker_windows.rendered}"
  associate_public_ip_address = false
  root_block_device {
    delete_on_termination = true
    volume_size           = 100
  }
  lifecycle {
    create_before_destroy = true
  }
}

resource "aws_autoscaling_group" "jenkins_worker_windows" {
  name                      = "dev-jenkins-worker-windows"
  min_size                  = "1"
  max_size                  = "2"
  desired_capacity          = "2"
  health_check_grace_period = 60
  health_check_type         = "EC2"
  vpc_zone_identifier       = ["${data.aws_subnet_ids.default_public.ids}"]
  launch_configuration      = "${aws_launch_configuration.jenkins_worker_windows.name}"
  termination_policies      = ["OldestLaunchConfiguration"]
  wait_for_capacity_timeout = "10m"
  default_cooldown          = 60

  #lifecycle {
  #  create_before_destroy = true
  #}

  ## on replacement, gives new service time to spin up before moving on to destroy
  #provisioner "local-exec" {
  #  command = "sleep 60"
  #}

  tags = [
    {
      key                 = "Name"
      value               = "dev_jenkins_worker_windows"
      propagate_at_launch = true
    },
    {
      key                 = "class"
      value               = "dev_jenkins_worker_windows"
      propagate_at_launch = true
    },
  ]
}
Finally, we reach the user-data for this Terraform plan. It downloads the required jar files, creates a node on Jenkins, and registers itself as a slave.
```powershell
<powershell>
function Wait-For-Jenkins {
    Write-Host "Waiting for Jenkins to launch on 8080..."
    Do {
        Write-Host "Waiting for Jenkins"
        nc -zv ${server_ip} 8080
        If ( $? -eq $true ) { Break }
        Sleep 10
    } While (1)
    Do {
        Write-Host "Waiting for JNLP"
        nc -zv ${server_ip} 33453
        If ( $? -eq $true ) { Break }
        Sleep 10
    } While (1)
    Write-Host "Jenkins launched"
}

function Slave-Setup() {
    # Register slave
    $JENKINS_URL = "http://${server_ip}:8080"
    $USERNAME = "${jenkins_username}"
    $PASSWORD = "${jenkins_password}"
    $AUTH = -join ("$USERNAME", ":", "$PASSWORD")
    echo $AUTH

    # The IP collection logic below works on Windows Server 2016 and needs testing on Windows Server 2008
    $SLAVE_IP = (ipconfig | findstr /r "[0-9][0-9]*\.[0-9][0-9]*\.[0-9][0-9]*\.[0-9][0-9]*" | findstr "IPv4 Address").substring(39) | findstr /B "172.31"
    $NODE_NAME = "jenkins-slave-windows-$SLAVE_IP"
    $NODE_SLAVE_HOME = "C:\Jenkins\"
    $EXECUTORS = 2
    $JNLP_PORT = 33453
    $CRED_ID = "$NODE_NAME"
    $LABELS = "build windows"

    # CLI invocation for jenkins-cli commands; on Windows the full path must be specified
    $jenkins_cmd = "java -jar C:\Jenkins\jenkins-cli.jar -s $JENKINS_URL -auth admin:$PASSWORD"

    Sleep 20
    Write-Host "Downloading jenkins-cli.jar file"
    (New-Object System.Net.WebClient).DownloadFile("$JENKINS_URL/jnlpJars/jenkins-cli.jar", "C:\Jenkins\jenkins-cli.jar")
    Write-Host "Downloading slave.jar file"
    (New-Object System.Net.WebClient).DownloadFile("$JENKINS_URL/jnlpJars/slave.jar", "C:\Jenkins\slave.jar")
    Sleep 10

    # Wait for Jenkins to load all plugins
    Do {
        $count = (java -jar C:\Jenkins\jenkins-cli.jar -s $JENKINS_URL -auth $AUTH list-plugins | Measure-Object -line).Lines
        $ret = $?
        Write-Host "count [$count] ret [$ret]"
        If ( $count -gt 0 ) { Break }
        Sleep 30
    } While (1)

    # Delete the node if present; used when testing
    Write-Host "Deleting Node $NODE_NAME if present"
    java -jar C:\Jenkins\jenkins-cli.jar -s $JENKINS_URL -auth $AUTH delete-node $NODE_NAME

    # Generate node.xml for creating the node on the Jenkins server
    $NodeXml = @"
<slave>
  <name>$NODE_NAME</name>
  <description>Windows Slave</description>
  <remoteFS>$NODE_SLAVE_HOME</remoteFS>
  <numExecutors>$EXECUTORS</numExecutors>
  <mode>NORMAL</mode>
  <retentionStrategy class="hudson.slaves.RetentionStrategy`$Always"/>
  <launcher class="hudson.slaves.JNLPLauncher">
    <workDirSettings>
      <disabled>false</disabled>
      <internalDir>remoting</internalDir>
      <failIfWorkDirIsMissing>false</failIfWorkDirIsMissing>
    </workDirSettings>
  </launcher>
  <label>$LABELS</label>
  <nodeProperties/>
</slave>
"@
    $NodeXml | Out-File -FilePath C:\Jenkins\node.xml
    type C:\Jenkins\node.xml

    # Create the node using node.xml
    Write-Host "Creating $NODE_NAME"
    Get-Content -Path C:\Jenkins\node.xml | java -jar C:\Jenkins\jenkins-cli.jar -s $JENKINS_URL -auth $AUTH create-node $NODE_NAME

    Write-Host "Registering Node $NODE_NAME via JNLP"
    Start-Process java -ArgumentList "-jar C:\Jenkins\slave.jar -jnlpCredentials $AUTH -jnlpUrl $JENKINS_URL/computer/$NODE_NAME/slave-agent.jnlp"
}

### script begins here ###
Wait-For-Jenkins
Slave-Setup
echo "Done"
</powershell>
<persist>true</persist>
```
Commands to run: initialize Terraform with terraform init, then check and apply the changes with terraform plan followed by terraform apply.
The same drawbacks apply here, and the same solutions will work as well.
Congratulations! You now have a Jenkins master with Windows and Linux slaves attached to it.
This blog highlighted one way to use Packer and Terraform to create AMIs that serve as a Jenkins master and slaves. We covered not only their creation but also how to associate security groups, and we looked at some basic IAM roles that can be applied. Although we have covered most of the common scenarios, the changes required for your particular use case should be small, and this can serve as boilerplate code when you begin planning your infrastructure in the cloud.
Web scraping is a process to crawl various websites and extract the required data using spiders. This data is processed in a data pipeline and stored in a structured format. Today, web scraping is widely used and has many use cases:
Using web scraping, Marketing & Sales companies can fetch lead-related information.
Web scraping is useful for Real Estate businesses to get the data of new projects, resale properties, etc.
Price comparison portals, like Trivago, extensively use web scraping to get the information of product and price from various e-commerce sites.
The process of web scraping usually involves spiders, which fetch the HTML documents from relevant websites, extract the needed content based on the business logic, and finally store it in a specific format. This blog is a primer on building highly scalable scrapers. We will cover the following items:
Ways to scrape: We’ll see basic ways to scrape data using techniques and frameworks in Python with some code snippets.
Scraping at scale: Scraping a single page is straightforward, but there are challenges in scraping millions of websites, including managing the spider code, collecting data, and maintaining a data warehouse. We’ll explore such challenges and their solutions to make scraping easy and accurate.
Scraping Guidelines: Scraping data from websites without the owner’s permission can be deemed malicious. Certain guidelines need to be followed to ensure our scrapers are not blacklisted. We’ll look at some of the best practices one should follow for crawling.
So let’s start scraping.
Different Techniques for Scraping
Here, we will discuss how to scrape a page and the different libraries available in Python.
Note: Python is the most popular language for scraping.
1. Requests – HTTP Library in Python: To scrape a website or a page, first fetch the content of the HTML page into an HTTP response object. The requests library in Python is pretty handy and easy to use; it is built on top of urllib3. I like requests because it’s simple and keeps the code readable.
```python
# Example showing how to use the requests library
import requests

r = requests.get("https://velotio.com")  # Fetch HTML page
```
2. BeautifulSoup: Once you get the webpage, the next step is to extract the data. BeautifulSoup is a powerful Python library that helps you extract data from a page. It’s easy to use and has a wide range of APIs to help you extract the data. We use the requests library to fetch the HTML page and then use BeautifulSoup to parse it. In this example, we can easily fetch the page title and all the links on the page. Check out the documentation for all the possible ways in which we can use BeautifulSoup.
```python
from bs4 import BeautifulSoup
import requests

r = requests.get("https://velotio.com")      # Fetch HTML page
soup = BeautifulSoup(r.text, "html.parser")  # Parse HTML page
print("Webpage Title: " + soup.title.string)
print("Fetch All Links:")
print(soup.find_all('a'))
```
3. Python Scrapy Framework:
Scrapy is a Python-based web scraping framework that allows you to create different kinds of spiders to fetch the source code of the target website. Scrapy starts crawling the web pages present on a certain website, and then you can write the extraction logic to get the required data. Scrapy is built on top of Twisted, a Python-based asynchronous networking library that performs requests asynchronously to boost spider performance. Scrapy is faster than BeautifulSoup. Moreover, it is a framework for writing scrapers, as opposed to BeautifulSoup, which is just a library for parsing HTML pages.
Here is a simple example of how to use Scrapy. Install Scrapy via pip. Scrapy gives a shell after parsing a website:
```shell
$ pip install scrapy  # Install Scrapy
$ scrapy shell https://velotio.com
In [1]: response.xpath("//a").extract()  # Fetch all <a> hrefs
```
Now, let’s write a custom spider to parse a website.
```shell
$ cat > myspider.py <<EOF
import scrapy

class BlogSpider(scrapy.Spider):
    name = 'blogspider'
    start_urls = ['https://blog.scrapinghub.com']

    def parse(self, response):
        for title in response.css('h2.entry-title'):
            yield {'title': title.css('a ::text').extract_first()}
EOF
$ scrapy runspider myspider.py
```
That’s it. Your first custom spider is created. Now, let’s understand the code.
name: Name of the spider. In this case, it’s “blogspider”.
start_urls: A list of URLs where the spider will begin to crawl from.
parse(self, response): This function is called whenever the crawler successfully crawls a URL. The response object used earlier in the Scrapy shell is the same response object that is passed to the parse(..).
When you run this, Scrapy fetches the start URLs, finds all the h2 elements with the entry-title class, and extracts the associated text from them. You can write your extraction logic directly in the parse method, or create a separate class for extraction and call its object from the parse method.
You’ve seen how to extract simple items from a website using Scrapy, but this is just the surface. Scrapy provides a lot of powerful features for making scraping easy and efficient. Here is a tutorial for Scrapy and the additional documentation for LinkExtractor by which you can instruct Scrapy to extract links from a web page.
4. Python lxml.html library: This is another Python library, much like BeautifulSoup; in fact, Scrapy uses lxml internally. It comes with a list of APIs you can use for data extraction. Why would you use it when Scrapy itself can extract the data? Say you want to iterate over every ‘div’ tag and perform some operation on each tag present under a “div”; this library gives you a list of ‘div’ tags, which you can iterate over using the iter() function to traverse each child tag inside the parent div. Such traversal operations are difficult with plain scraping. Here is the documentation for this library.
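As a small illustration of that traversal, here is a sketch using lxml.html; the inline HTML stands in for a fetched page:

```python
# Iterate over every <div> and traverse its children with lxml.html.
import lxml.html

html = """
<html><body>
  <div id="a"><p>first</p><span>one</span></div>
  <div id="b"><p>second</p></div>
</body></html>
"""

doc = lxml.html.fromstring(html)
for div in doc.iter("div"):      # every <div> in the document
    for child in div.iter():     # the div itself plus all its descendants
        print(child.tag, (child.text or "").strip())
```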
Challenges while Scraping at Scale
Let’s look at the challenges and solutions while scraping at large scale, i.e., scraping 100-200 websites regularly:
1. Data warehousing: Data extraction at a large scale generates vast volumes of information. Fault tolerance, scalability, security, and high availability are must-have features for a data warehouse. If your data warehouse is not stable or accessible, operations like searching and filtering over the data become an overhead. To achieve this, instead of maintaining your own database and infrastructure, you can use Amazon Web Services (AWS): RDS (Relational Database Service) for a structured database and DynamoDB for a non-relational database. AWS takes care of backing up the data, automatically takes snapshots of the database, and gives you database error logs as well. This blog explains how to set up infrastructure in the cloud for scraping.
2. Pattern Changes: Scraping relies heavily on the user interface and its structure, i.e., CSS and XPath. If the target website changes, our scraper may crash completely or return random data that we don’t want. This is a common scenario, and that’s why maintaining scrapers is more difficult than writing them. To handle this case, we can write test cases for the extraction logic and run them daily, either manually or from CI tools like Jenkins, to track whether the target website has changed.
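One way to catch such pattern changes early is a small regression test that runs the extraction logic against a saved snapshot of the page; extract_titles and the fixture HTML below are illustrative names, not from the original post:

```python
# Regression test: fail loudly if the selectors stop matching the page layout.
from bs4 import BeautifulSoup

def extract_titles(html):
    soup = BeautifulSoup(html, "html.parser")
    return [a.get_text(strip=True) for a in soup.select("h2.entry-title a")]

# Saved fixture standing in for a snapshot of the target page
FIXTURE = """
<div class="post"><h2 class="entry-title"><a href="/p1">First post</a></h2></div>
<div class="post"><h2 class="entry-title"><a href="/p2">Second post</a></h2></div>
"""

def test_extraction_still_matches():
    titles = extract_titles(FIXTURE)
    # If the site layout changes, this assertion flags it before bad data lands
    assert titles == ["First post", "Second post"]

test_extraction_still_matches()
print("extraction test passed")
```

Running the same test from a CI job like Jenkins turns silent scraper breakage into a visible build failure.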
3. Anti-scraping Technologies: Web scraping is commonplace these days, and every website host would like to prevent their data from being scraped; anti-scraping technologies help them do that. For example, if you hit a particular website from the same IP address at a regular interval, the target website can block your IP. Adding a captcha to a website also helps. There are methods by which we can bypass these anti-scraping measures; for example, we can use proxy servers to hide our original IP. Several proxy services keep rotating the IP before each request. It is also easy to add support for proxy servers in code, and in Python, the Scrapy framework supports it.
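A minimal sketch of the IP-rotation idea using only the standard library; the proxy addresses are placeholders, and the resulting mapping is what you would pass to requests via its proxies parameter:

```python
# Rotate through a pool of proxies, one per outgoing request.
from itertools import cycle

PROXIES = [
    "http://10.0.0.1:8080",
    "http://10.0.0.2:8080",
    "http://10.0.0.3:8080",
]
proxy_pool = cycle(PROXIES)

def next_proxy_mapping():
    proxy = next(proxy_pool)
    # This mapping is the shape requests expects:
    #   requests.get(url, proxies=next_proxy_mapping())
    return {"http": proxy, "https": proxy}

for _ in range(4):
    print(next_proxy_mapping()["http"])  # wraps back to the first proxy
```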
4. JavaScript-based dynamic content: Websites that heavily rely on JavaScript and Ajax to render dynamic content make data extraction difficult. Scrapy and related frameworks/libraries can only work with what they find in the HTML document; Ajax calls and JavaScript are executed at runtime, so these tools can’t scrape that content. This can be handled by rendering the web page in a headless browser such as Headless Chrome, which essentially allows running Chrome in a server environment. You can also use PhantomJS, which provides a headless WebKit-based environment.
5. Honeypot traps: Some websites place honeypot traps on their webpages to detect web crawlers. These are hard to detect, as most such links are blended with the background color or have their CSS display property set to none. Implementing honeypots requires large coding efforts on both the server and crawler sides, so this method is not frequently used.
6. Quality of data: Currently, AI and ML projects are in high demand, and these projects need data at a large scale. Data integrity is also important, as one fault can cause serious problems in AI/ML algorithms. So, in scraping, it is very important not just to scrape the data but to verify its integrity as well. Doing this in real time is not always possible, so I prefer to write test cases for the extraction logic to make sure whatever your spiders are extracting is correct and that they are not scraping any bad data.
7. More Data, More Time: This one is obvious. The larger a website is, the more data it contains, and the longer it takes to scrape that site. This may be fine if your purpose for scanning the site isn’t time-sensitive, but that often isn’t the case. Stock prices don’t stay the same over hours. Sales listings, currency exchange rates, media trends, and market prices are just a few examples of time-sensitive data. What to do in this case? One solution is to design your spiders carefully. If you’re using a framework like Scrapy, apply proper LinkExtractor rules so that the spider doesn’t waste time scraping unrelated URLs.
You can also use distributed scraping packages available in Python, such as Frontera and Scrapy-Redis. Frontera lets you send out only one request per domain at a time but can hit multiple domains at once, making it great for parallel scraping. Scrapy-Redis lets you send out multiple requests to one domain. The right combination of these can result in a very powerful web spider that can handle both the bulk and the variation of large websites.
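If you go the scrapy-redis route, enabling it is mostly configuration. A hedged settings.py sketch; the Redis URL is an assumption for a local instance:

```python
# Scrapy settings.py fragment (a sketch): share the request queue and the
# duplicate filter through Redis so several spider processes can cooperate.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
SCHEDULER_PERSIST = True              # keep the queue across spider restarts
REDIS_URL = "redis://localhost:6379"  # assumed local Redis instance
```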
8. Captchas: Captchas are a good way of keeping crawlers away from a website, and many website hosts use them. To scrape data from such websites, we need a mechanism to solve the captchas. There are packages and services that can solve captchas and act as middleware between the target website and your spider. You can also use libraries like Pillow and Tesseract in Python to solve simple image-based captchas.
9. Maintaining Deployment: Normally, we don’t want to limit ourselves to scraping just a few websites. We may need the maximum amount of data present on the Internet, which can mean scraping millions of websites. You can imagine the size of the code and the deployment. We can’t run spiders at this scale from a single machine. What I prefer here is to dockerize the scrapers and take advantage of technologies like AWS ECS and Kubernetes to run the scraper containers. This keeps our scrapers highly available and easy to maintain, and we can also schedule them to run at regular intervals.
Scraping Guidelines/ Best Practices
1. Respect the robots.txt file: robots.txt is a text file that webmasters create to instruct search engine robots on how to crawl and index pages on the website, and it generally contains instructions for crawlers. Before even planning the extraction logic, you should check this file; you can usually find it at the root of the website (e.g., https://example.com/robots.txt). It holds all the rules for how crawlers should interact with the website. For example, if a website has a link to download critical information, it probably doesn’t want to expose that to crawlers. Another important factor is the crawl frequency interval, which means crawlers may only hit the website at specified intervals. If someone has asked us not to crawl their website, we’d better not do it: if they catch your crawlers, it can lead to serious legal issues.
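Python’s standard library can parse robots.txt for you. A small sketch with an inline rule set; in practice you would point set_url() at the site’s robots.txt and call read():

```python
# Check robots.txt rules before crawling, using the standard library.
import urllib.robotparser

rules = """
User-agent: *
Disallow: /admin/
Crawl-delay: 10
""".splitlines()

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules)

print(rp.can_fetch("*", "https://example.com/blog/post-1"))   # allowed
print(rp.can_fetch("*", "https://example.com/admin/users"))   # disallowed
print(rp.crawl_delay("*"))                                    # seconds to wait
```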
2. Do not hit the servers too frequently: As mentioned above, some websites specify a crawl frequency interval for crawlers. We’d better use it wisely, because not every website is tested against high load. Hitting a server at a constant, tight interval creates huge traffic on the server side, and the server may crash or fail to serve other requests. This has a high impact on user experience, and users are more important than bots. So, we should make requests according to the interval specified in robots.txt, or use a standard delay of 10 seconds. This also helps you avoid getting blocked by the target website.
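In Scrapy, both the delay and robots.txt compliance are plain settings. A conservative settings.py sketch; the values are examples, not recommendations from the original post:

```python
# Scrapy settings.py fragment: throttle requests so the target server
# is not overwhelmed.
ROBOTSTXT_OBEY = True            # honor robots.txt rules
DOWNLOAD_DELAY = 10              # seconds between requests to the same domain
RANDOMIZE_DOWNLOAD_DELAY = True  # jitter the delay so hits are not perfectly periodic
CONCURRENT_REQUESTS_PER_DOMAIN = 1
AUTOTHROTTLE_ENABLED = True      # back off automatically when the server slows down
```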
3. User-Agent Rotation and Spoofing: Every request carries a User-Agent string in its headers. This string identifies the browser you are using, its version, and the platform. If we use the same User-Agent in every request, it’s easy for the target website to detect that the requests are coming from a crawler. So, to avoid this, rotate the User-Agent between requests. You can easily find examples of genuine User-Agent strings on the Internet; try them out. If you’re using Scrapy, you can set the USER_AGENT property in settings.py.
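A minimal sketch of rotating the User-Agent per request; the UA strings are examples, and the returned dict is what you would pass as the headers argument to requests:

```python
# Pick a different User-Agent string for each outgoing request.
import random

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15",
]

def random_headers():
    # Usage: requests.get(url, headers=random_headers())
    return {"User-Agent": random.choice(USER_AGENTS)}

print(random_headers()["User-Agent"] in USER_AGENTS)  # → True
```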
4. Disguise your requests by rotating IPs and Proxy Services: We’ve discussed this in the challenges above. It’s always better to use rotating IPs and proxy service so that your spider won’t get blocked.
5. Do not follow the same crawling pattern: As you know, many websites use anti-scraping technologies, so it’s easy for them to detect your spider if it always crawls in the same pattern. As humans, we don’t normally follow a fixed pattern on a particular website. So, to have your spiders run smoothly, we can introduce actions like mouse movements, clicking a random link, etc., which give the impression that your spider is human.
6. Scrape during off-peak hours: Off-peak hours are suitable for bots/crawlers as the traffic on the website is considerably less. These hours can be identified by the geolocation from where the site’s traffic originates. This also helps to improve the crawling rate and avoid the extra load from spider requests. Thus, it is advisable to schedule the crawlers to run in the off-peak hours.
7. Use the scraped data responsibly: We should always take responsibility for the scraped data. Scraping data and then republishing it somewhere else is not acceptable; it can be considered a breach of copyright laws and may lead to legal issues. So, it is advisable to check the target website’s Terms of Service page before scraping.
8. Use Canonical URLs: When we scrape, we tend to hit duplicate URLs, and hence scrape duplicate data, which is the last thing we want. A single website may expose multiple URLs with the same data. In that situation, the duplicate URLs usually carry a canonical URL, which points to the parent or original URL. By checking it, we make sure we don’t scrape duplicate content. Frameworks like Scrapy handle duplicate URLs by default.
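A quick way to check for duplicates is to read the page’s canonical link before storing it. A sketch using BeautifulSoup on a stand-in page:

```python
# Read the <link rel="canonical"> tag so duplicate URLs can be skipped.
from bs4 import BeautifulSoup

html = """
<html><head>
  <link rel="canonical" href="https://example.com/products/widget"/>
</head><body>...</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
link = soup.find("link", rel="canonical")
canonical = link["href"] if link else None
print(canonical)  # → https://example.com/products/widget
```

If the canonical URL has already been scraped, the current URL can be dropped without fetching its content again.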
9. Be transparent: Don’t misrepresent your purpose or use deceptive methods to gain access. If you have a login and a password that identifies you to gain access to a source, use it. Don’t hide who you are. If possible, share your credentials.
Conclusion
We’ve seen the basics of scraping, frameworks, how to crawl, and the best practices of scraping. To conclude:
Follow target URLs rules while scraping. Don’t make them block your spider.
Maintenance of data and spiders at scale is difficult. Use Docker/ Kubernetes and public cloud providers, like AWS to easily scale your web-scraping backend.
Always respect the rules of the websites you plan to crawl. If APIs are available, always use them first.
With the evolving architectural design of web applications, microservices have become a successful trend in architecting the application landscape. Along with advancements in application architecture, transport protocols such as REST and gRPC are becoming more efficient and faster. Containerizing microservice applications also helps greatly with agile development and high-speed delivery.
In this blog, I will try to showcase how simple it is to build a cloud-native application on the microservices architecture using Go.
We will break the solution into multiple steps. We will learn how to:
1) Build a microservice and a set of other containerized services, each with a very specific set of independent tasks related only to its own logical component.
2) Use go-kit as the framework for developing and structuring the components of each service.
3) Build APIs that will use HTTP (REST) and Protobuf (gRPC) as the transport mechanisms, PostgreSQL for databases and finally deploy it on Azure stack for API management and CI/CD.
Note: Deployment, setting up the CI-CD and API-Management on Azure or any other cloud is not in the scope of the current blog.
Prerequisites:
A beginner’s level of understanding of web services, REST APIs and gRPC
GoLand/ VS Code
Properly installed and configured Go. If not, check it out here
Set up a new project directory under the GOPATH
Understanding of the standard Golang project. For reference, visit here
PostgreSQL client installed
Go kit
What are we going to do?
We will develop a simple web application working on the following problem statement:
A global publishing company that publishes books and journals wants to develop a service to watermark their documents. A document (book or journal) has a title, an author, and a watermark property.
The watermark operation can be in Started, InProgress and Finished status
Only a specific set of users should be able to watermark a document
Once the watermark is done, the document can never be re-marked
Example of a document:
{content: “book”, title: “The Dark Code”, author: “Bruce Wayne”, topic: “Science”}
For a detailed understanding of the requirement, please refer to this.
Architecture:
In this project, we will have 3 microservices: Authentication Service, Database Service and the Watermark Service. We have a PostgreSQL database server and an API-Gateway.
Authentication Service:
The application is supposed to have a role-based and user-based access control mechanism. This service will authenticate the user according to their specific role and return only HTTP status codes: 200 when the user is authorized and 401 for unauthorized users.
APIs:
/user/access, Method: GET, Secured: True, payload: user: <name>. It will take the user name as input, and the auth service will return the roles and the privileges assigned to it
/authenticate, Method: GET, Secured: True, payload: user: <name>, operation: <op>. It will authenticate the user for the passed operation if that operation is accessible to the user’s role
/healthz, Method: GET, Secured: True. It will return the status of the service
Database Service:
We will need databases for our application to store the users, their roles, and the access privileges of each role. The documents will also be stored in the database without the watermark, since it is a requirement that no document can have a watermark at the time of creation. A document is created successfully only when the data inputs are valid and the database service returns a success status.
We will use a separate database for each of the two services that consume them. This design is not strictly necessary; it simply follows the “Single Database per Service” rule of the microservice architecture.
APIs:
/get, Method: GET, Secured: True, payload: filters: []filter{“field-name”: “value”}. It will return the list of documents matching the passed filters
/update, Method: POST, Secured: True, payload: title: <id>, document: {“field”: “value”, …}. It will update the document for the given title ID
/add, Method: POST, Secured: True, payload: document: {“field”: “value”, …}. It will add the document and return the title ID
/remove, Method: POST, Secured: True, payload: title: <id>. It will remove the document entry for the passed title ID
/healthz, Method: GET, Secured: True. It will return the status of the service
Watermark Service:
This is the main service that exposes the API calls to watermark a document. Every time a user needs to watermark a document, it passes the TicketID in the watermark API request along with the appropriate mark. Internally, the service calls the database update API with the provided request and returns the status of the watermark process: initially “Started”, then after some time “InProgress”, and finally “Finished” if the request was valid, or “Error” if it was not.
APIs:
/get, Method: GET, Secured: True, payload: filters: []filter{“field-name”: “value”}. It will return the list of documents matching the passed filters
/status, Method: GET, Secured: True, payload: ticket: <id>. It will return the watermark operation status of the document for the passed ticket ID
/addDocument, Method: POST, Secured: True, payload: document: {“field”: “value”, …}. It will add the document and return the title ID
/watermark, Method: POST, Secured: True, payload: title: <id>, mark: “string”. It is the main watermark operation API, which accepts the mark string
/healthz, Method: GET, Secured: True. It will return the status of the service
Operations and Flow:
Watermark Service APIs are the only ones that will be used by the user/actor to request watermark or add the document. Authentication and Database service APIs are the private ones that will be called by other services internally. The only URL accessible to the user is the API Gateway URL.
The user will access the API Gateway URL with the required user name, the ticket-id, and the mark the user wants applied to the document
The user should not know about the authentication or database services
Once the request is made by the user, it will be accepted by the API Gateway. The gateway will validate the request along with the payload
An API forwarding rule, which routes the traffic of a specific request to a service, should be defined in the gateway. Once validated, the request will be forwarded to the appropriate service according to that rule.
We will define an API forwarding rule where the request made for any watermark will be first forwarded to the authentication service which will authenticate the request, check for authorized users and return the appropriate status code.
The authorization service will look up the user that made the request in the user database, along with its roles and permissions, and send the response accordingly
Once the request has been authorized by the service, it will be forwarded back to the actual watermark service
The watermark service then performs the appropriate operation: putting the watermark on the document, adding a new document entry, or any other request
The Get, Watermark, or AddDocument operation in the watermark service is performed by calling the database CRUD APIs, and the result is forwarded to the user
If the request is AddDocument, the service returns the TicketID; if it is a watermark request, it returns the status of the operation
Note:
Each user will have some specific roles, based on which the access controls will be identified for the user. For the sake of simplicity, the roles will be based on the type of document only, not the specific name of the book or journal
Getting Started:
Let’s start by creating a folder for our application in the $GOPATH. This will be the root folder containing our set of services.
Project Layout:
The project will follow the standard Golang project layout. If you want the full working code, please refer here
api: Stores the versions of the APIs swagger files and also the proto and pb files for the gRPC protobuf interface.
cmd: This will contain the entry point (main.go) files for all the services and also any other container images if any
docs: This will contain the documentation for the project
config: All the sample files or any specific configuration files should be stored here
deploy: This directory will contain the deployment files used to deploy the application
internal: This package is the conventional internal package identified by the Go compiler. It contains all the packages which need to be private and imported by its child directories and immediate parent directory. All the packages from this directory are common across the project
pkg: This directory will have the complete executing code of all the services in separate packages.
tests: It will have all the integration and E2E tests
vendor: This directory stores all the third-party dependencies locally so that the version doesn’t mismatch later
We are going to use the Go kit framework for developing the set of services. The official Go kit examples of services are very good, though the documentation is not that great.
Watermark Service:
1. Under the Go kit framework, a service should always be represented by an interface.
Create a package named watermark in the pkg folder. Create a new service.go file in that package. This file is the blueprint of our service.
```go
package watermark

import (
	"context"

	"github.com/velotiotech/watermark-service/internal"
)

type Service interface {
	// Get the list of all documents
	Get(ctx context.Context, filters ...internal.Filter) ([]internal.Document, error)
	Status(ctx context.Context, ticketID string) (internal.Status, error)
	Watermark(ctx context.Context, ticketID, mark string) (int, error)
	AddDocument(ctx context.Context, doc *internal.Document) (string, error)
	ServiceStatus(ctx context.Context) (int, error)
}
```
2. As per the functions defined in the interface, we will need five endpoints to handle the requests for the above methods. If you are wondering why we are using the context package, please refer here. Contexts enable microservices to handle multiple concurrent requests; we don’t make heavy use of them in this blog, but passing one is simply the idiomatic way to work with Go services.
3. Implementing our service:
```go
package watermark

import (
	"context"
	"net/http"
	"os"

	"github.com/velotiotech/watermark-service/internal"

	"github.com/go-kit/kit/log"
	"github.com/lithammer/shortuuid/v3"
)

type watermarkService struct{}

func NewService() Service { return &watermarkService{} }

func (w *watermarkService) Get(_ context.Context, filters ...internal.Filter) ([]internal.Document, error) {
	// query the database using the filters and return the list of documents
	// return an error if a filter (key) is invalid, and also if no item is found
	doc := internal.Document{
		Content: "book",
		Title:   "Harry Potter and Half Blood Prince",
		Author:  "J.K. Rowling",
		Topic:   "Fiction and Magic",
	}
	return []internal.Document{doc}, nil
}

func (w *watermarkService) Status(_ context.Context, ticketID string) (internal.Status, error) {
	// query the database using the ticketID and return the document info
	// return an error if the ticketID is invalid or no document exists for that ticketID
	return internal.InProgress, nil
}

func (w *watermarkService) Watermark(_ context.Context, ticketID, mark string) (int, error) {
	// update the database entry with a non-empty watermark field
	// first check that the watermark status is not already InProgress, Started, or Finished;
	// if it is, return an invalid request error
	// return an error if no item is found for the ticketID
	return http.StatusOK, nil
}

func (w *watermarkService) AddDocument(_ context.Context, doc *internal.Document) (string, error) {
	// add the document entry in the database by calling the database service
	// return an error if the doc is invalid and/or the database rejects the entry
	newTicketID := shortuuid.New()
	return newTicketID, nil
}

func (w *watermarkService) ServiceStatus(_ context.Context) (int, error) {
	logger.Log("Checking the Service health...")
	return http.StatusOK, nil
}

var logger log.Logger

func init() {
	logger = log.NewLogfmtLogger(log.NewSyncWriter(os.Stderr))
	logger = log.With(logger, "ts", log.DefaultTimestampUTC)
}
```
We have defined a new empty struct type, watermarkService, which implements the service interface defined above. Because the type is unexported, its implementation is hidden from the rest of the world.
NewService() is created as the constructor of our “object”. This is the only function available outside this package to instantiate the service.
4. Now we will create the endpoints package, which will contain two files: one where we store all the request and response types, and another, endpoints, which holds the actual implementation that parses requests and calls the appropriate service function.
– Create a file named reqJSONMap.go. In this file, we will define all the request and response structs, such as GetRequest, GetResponse, StatusRequest, StatusResponse, etc., with the fields we want to accept as input in a request or return as output in a response.
In this file, we have a struct Set, which is a collection of all the endpoints, along with a constructor for it. We also have internal constructor functions, such as MakeGetEndpoint() and MakeStatusEndpoint(), which return objects implementing Go kit's generic endpoint.Endpoint interface.
In order to expose the Get, Status, Watermark, ServiceStatus, and AddDocument APIs, we need to create endpoints for all of them. These functions handle the incoming requests and call the specific service methods.
5. Add the transport layer to expose the services. Our services will support HTTP, exposed as REST APIs, as well as protobuf and gRPC.
Create a separate transport package in the watermark directory. This package will hold all the handlers, decoders, and encoders for each specific transport mechanism.
6. Create a file http.go: This file will have the transport functions and handlers for HTTP with a separate path as the API routes.
This file maps the JSON payloads to their requests and responses. It contains the HTTP handler constructor, which registers the API routes to the specific handler functions (endpoints), and also wires the decoders and encoders for the requests and responses into a server object for each route. The decoders and encoders exist to translate the requests and responses into the desired form for processing; in our case, we simply convert them into the appropriate request and response structs using the JSON encoder and decoder.
We have the generic encoder for the response output, which is a simple JSON encoder.
7. Create another file in the same transport package with the name grpc.go. Similar to the above, the name of the file is self-explanatory: it maps the protobuf payloads to their requests and responses. We create a gRPC handler constructor, which creates the set of grpcServers and registers the appropriate endpoint with the decoders and encoders of the requests and responses.
– Before moving on to the implementation, we have to create a proto file that defines our service interface and the request/response structs, so that the protobuf (.pb) files can be generated and used as the interface over which the services communicate.
– Create package pb in the api/v1 package path. Create a new file watermarksvc.proto. Firstly, we will create our service interface, which represents the remote functions to be called by the client. Refer to this for syntax and deep understanding of the protobuf.
We will translate the Go service interface into a service definition in the proto file. We also recreate the request and response structs as messages in the proto file so that they can be understood by the RPCs defined in the service.
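As a sketch, the service definition and a few of its messages might look like this (the RPC and message names follow the Go interface and the pb types used later in the transport code; the real file in the repo may differ in field numbering and naming):

```proto
syntax = "proto3";

package watermark;

service Watermark {
  rpc Get(GetRequest) returns (GetReply) {}
  rpc Status(StatusRequest) returns (StatusReply) {}
  rpc Watermark(WatermarkRequest) returns (WatermarkReply) {}
  rpc AddDocument(AddDocumentRequest) returns (AddDocumentReply) {}
  rpc ServiceStatus(ServiceStatusRequest) returns (ServiceStatusReply) {}
}

message Document {
  string content = 1;
  string title = 2;
  string author = 3;
  string topic = 4;
  string watermark = 5;
}

message Filter {
  string key = 1;
  string value = 2;
}

message GetRequest {
  repeated Filter filters = 1;
}

message GetReply {
  repeated Document documents = 1;
  string err = 2;
}

message StatusRequest {
  string ticketID = 1;
}

message StatusReply {
  int32 status = 1;
  string err = 2;
}

// ... the remaining request/reply messages follow the same pattern.
```

Each Go request/response struct becomes a message, and the err string field carries business errors across the wire, matching the endpoint convention.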
Note: Creating proto files and generating pb files using protoc is out of the scope of this blog. We assume that you already know how to create a proto file and generate a pb file from it. If not, please refer to the protobuf and protoc-gen documentation.
I have also created a script to generate the pb file, which just needs the path with the name of the proto file.
#!/usr/bin/env sh
# Install proto3 from source
#   brew install autoconf automake libtool
#   git clone https://github.com/google/protobuf
#   ./autogen.sh ; ./configure ; make ; make install
#
# Update protoc Go bindings via
#   go get -u github.com/golang/protobuf/{proto,protoc-gen-go}
#
# See also
#   https://github.com/grpc/grpc-go/tree/master/examples

REPO_ROOT="${REPO_ROOT:-$(cd "$(dirname "$0")/../.." && pwd)}"
PB_PATH="${REPO_ROOT}/api/v1/pb"
PROTO_FILE=${1:-"watermarksvc.proto"}

echo "Generating pb files for ${PROTO_FILE} service"
protoc -I="${PB_PATH}" "${PB_PATH}/${PROTO_FILE}" --go_out=plugins=grpc:"${PB_PATH}"
8. Now, once the pb file is generated in the api/v1/pb/watermark package, we will create a new struct, grpcServer, grouping all the endpoints for gRPC. This struct should implement pb.WatermarkServer, which is the server interface referenced by the services.
To implement these services, we define functions such as func (g *grpcServer) Get(ctx context.Context, r *pb.GetRequest) (*pb.GetReply, error). This function takes the request param, runs the ServeGRPC() function, and returns the response. We implement the rest of the functions the same way.
These functions are the actual Remote Procedures to be called by the service.
We will also need to add the decode and encode functions for the request and response structs from protobuf structs. These functions will map the proto Request/Response struct to the endpoint req/resp structs. For example: func decodeGRPCGetRequest(_ context.Context, grpcReq interface{}) (interface{}, error). This will assert the grpcReq to pb.GetRequest and use its fields to fill the new struct of type endpoints.GetRequest{}. The decoding and encoding functions should be implemented similarly for the other requests and responses.
9. Finally, we just have to create the entry-point files (main) in cmd for each service. We have already mapped the appropriate routes to the endpoints, which call the service functions, and mapped the proto service server to the endpoints via the ServeGRPC() functions; now we call the HTTP and gRPC server constructors here and start the servers.
Create a package watermark in the cmd directory and create a file watermark.go which will hold the code to start and stop the HTTP and gRPC server for the service
package main

import (
	"fmt"
	"net"
	"net/http"
	"os"
	"os/signal"
	"syscall"

	pb "github.com/velotiotech/watermark-service/api/v1/pb/watermark"
	"github.com/velotiotech/watermark-service/pkg/watermark"
	"github.com/velotiotech/watermark-service/pkg/watermark/endpoints"
	"github.com/velotiotech/watermark-service/pkg/watermark/transport"

	"github.com/go-kit/kit/log"
	kitgrpc "github.com/go-kit/kit/transport/grpc"
	"github.com/oklog/oklog/pkg/group"
	"google.golang.org/grpc"
)

const (
	defaultHTTPPort = "8081"
	defaultGRPCPort = "8082"
)

func main() {
	var (
		logger   log.Logger
		httpAddr = net.JoinHostPort("localhost", envString("HTTP_PORT", defaultHTTPPort))
		grpcAddr = net.JoinHostPort("localhost", envString("GRPC_PORT", defaultGRPCPort))
	)

	logger = log.NewLogfmtLogger(log.NewSyncWriter(os.Stderr))
	logger = log.With(logger, "ts", log.DefaultTimestampUTC)

	var (
		service     = watermark.NewService()
		eps         = endpoints.NewEndpointSet(service)
		httpHandler = transport.NewHTTPHandler(eps)
		grpcServer  = transport.NewGRPCServer(eps)
	)

	var g group.Group
	{
		// The HTTP listener mounts the Go kit HTTP handler we created.
		httpListener, err := net.Listen("tcp", httpAddr)
		if err != nil {
			logger.Log("transport", "HTTP", "during", "Listen", "err", err)
			os.Exit(1)
		}
		g.Add(func() error {
			logger.Log("transport", "HTTP", "addr", httpAddr)
			return http.Serve(httpListener, httpHandler)
		}, func(error) {
			httpListener.Close()
		})
	}
	{
		// The gRPC listener mounts the Go kit gRPC server we created.
		grpcListener, err := net.Listen("tcp", grpcAddr)
		if err != nil {
			logger.Log("transport", "gRPC", "during", "Listen", "err", err)
			os.Exit(1)
		}
		g.Add(func() error {
			logger.Log("transport", "gRPC", "addr", grpcAddr)
			// We add the Go kit gRPC Interceptor to our gRPC service as it is used by
			// the zipkin tracing middleware demonstrated here.
			baseServer := grpc.NewServer(grpc.UnaryInterceptor(kitgrpc.Interceptor))
			pb.RegisterWatermarkServer(baseServer, grpcServer)
			return baseServer.Serve(grpcListener)
		}, func(error) {
			grpcListener.Close()
		})
	}
	{
		// This function just sits and waits for ctrl-C.
		cancelInterrupt := make(chan struct{})
		g.Add(func() error {
			c := make(chan os.Signal, 1)
			signal.Notify(c, syscall.SIGINT, syscall.SIGTERM)
			select {
			case sig := <-c:
				return fmt.Errorf("received signal %s", sig)
			case <-cancelInterrupt:
				return nil
			}
		}, func(error) {
			close(cancelInterrupt)
		})
	}
	logger.Log("exit", g.Run())
}

func envString(env, fallback string) string {
	e := os.Getenv(env)
	if e == "" {
		return fallback
	}
	return e
}
Let’s walk through the above code. First, we use fixed default ports for the servers to listen on: 8081 for the HTTP server and 8082 for the gRPC server. Then, in these code stubs, we create the service, the endpoints of the service backend, and the HTTP and gRPC servers.
service     = watermark.NewService()
eps         = endpoints.NewEndpointSet(service)
grpcServer  = transport.NewGRPCServer(eps)
httpHandler = transport.NewHTTPHandler(eps)
Now the next step is interesting. We create a variable of type group.Group from the oklog package. If you are new to this term, please refer here. A Group helps you elegantly manage a group of goroutines. We create three goroutines: one for the HTTP server, a second for the gRPC server, and a last one watching for cancel interrupts.
Similarly to the HTTP server, we start the gRPC server and a cancel-interrupt watcher. Great! We are done here. Now, let’s run the service:
go run ./cmd/watermark/watermark.go
The server has started locally. Now, just open Postman or run curl against one of the endpoints. See below: we ran the HTTP server to check the service status.
We have successfully created a service and ran the endpoints.
Further:
I always like to make a project complete with all the maintenance pieces around it: a proper README, .gitignore, .dockerignore, Makefile, Dockerfiles, golangci-lint config files, CI/CD config files, etc.
I have created a separate Dockerfile for each of the three services in path /images/.
I have created a multi-stage Dockerfile to build the binary of the service and run it. We copy the appropriate code directories into the Docker image and build it all in one stage, then create a new image in the same file and copy the binary into it from the previous stage. The Dockerfiles for the other services are created similarly.
In the Dockerfile, we have given the CMD as go run watermark; this command is the entry point of the container. I have also created a Makefile with two main targets: build-image and build-push. The first builds the image, and the second pushes it.
Note: I am keeping this blog concise as it is difficult to cover all the things. The code in the repo that I have shared in the beginning covers most of the important concepts around services. I am still working and continue committing improvements and features.
Let’s see how we can deploy:
We will see how to deploy all these services on a container orchestration tool (e.g., Kubernetes). We assume you have worked with Kubernetes before, with at least a beginner’s understanding.
In deploy dir, create a sample deployment having three containers: auth, watermark and database. Since for each container, the entry point commands are already defined in the dockerfiles, we don’t need to send any args or cmd in the deployment.
We will also need a Service, which will route external request traffic from a LoadBalancer-type or NodePort-type service. To make it work for now, we can create a NodePort-type service to expose the watermark-service.
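A minimal sketch of such a deployment and NodePort service might look like the following (the image names, labels, and port numbers here are illustrative, not taken from the repo):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: watermark-deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      app: watermark
  template:
    metadata:
      labels:
        app: watermark
    spec:
      containers:
      - name: watermark
        image: velotio/watermark-service:latest   # illustrative image name
        ports:
        - containerPort: 8081   # HTTP
        - containerPort: 8082   # gRPC
      - name: auth
        image: velotio/auth-service:latest
      - name: database
        image: velotio/database-service:latest
---
apiVersion: v1
kind: Service
metadata:
  name: watermark-service
spec:
  type: NodePort
  selector:
    app: watermark
  ports:
  - name: http
    port: 8081
    targetPort: 8081
    nodePort: 30081
```

In a production setup, each service would more likely get its own Deployment and Service rather than sharing a pod, but this keeps the sample small.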
Another important and very interesting part is deploying the API gateway. It requires at least some knowledge of a cloud provider’s stack. I have used the Azure stack to deploy an API gateway using the resource called “API Management” in the Azure portal. Refer to the rules config files for the Azure APIM API gateway.
Further, only a proper CI/CD setup is remaining which is one of the most essential parts of a project after development. I would definitely like to discuss all the above deployment-related stuff in more detail but that is not in the scope of my current blog. Maybe I will post another blog for the same.
Wrapping up:
We have learned how to build a complete project with three microservices in Golang using one of the best distributed-system development frameworks: Go kit. We have also used the PostgreSQL database through GORM, which is used heavily in the Go community. We did not stop at development; we also covered, in theory, the development lifecycle of the project by understanding what, how, and where to deploy.
We created one microservice completely from scratch. Go kit makes it very simple to write the relationship between endpoints, service implementations and the communication/transport mechanisms. Now, go and try to create other services from the problem statement.
A serverless architecture is a way to implement and run applications, services, or microservices without needing to manage infrastructure. Your application still runs on servers, but all of the server management is done by AWS. We no longer need to provision, scale, or maintain servers to run our applications, databases, and storage systems, so developers can focus on building the application itself rather than the machinery underneath it.
Why Serverless
More focus on development rather than managing servers.
Cost Effective.
Application which scales automatically.
Quick application setup.
Services for Serverless
For implementing a serverless architecture, there are multiple services provided by the various cloud vendors, though we will be exploring mostly the services from AWS. The following are the services we can use, depending on the application requirements.
Lambda: It is used to write business logic / schedulers / functions.
S3: It is mostly used for storing objects but it also gives the privilege to host WebApps. You can host a static website on S3.
API Gateway: It is used for creating, publishing, maintaining, monitoring and securing REST and WebSocket APIs at any scale.
Cognito: It provides authentication, authorization, and user management for your web and mobile apps. Your users can sign in directly with a username and password or through third parties such as Facebook, Amazon, or Google.
DynamoDB: It is a fully managed NoSQL database service that provides fast and predictable performance with seamless scalability.
Three-tier Serverless Architecture
So, let’s take a use case in which you want to develop a three-tier serverless application. The three-tier architecture is a popular pattern for user-facing applications. The tiers that comprise the architecture are the presentation tier, the logic tier, and the data tier. The presentation tier is the component that users directly interact with, such as a web page or mobile app UI. The logic tier contains the code required to translate user actions at the presentation tier into the functionality that drives the application’s behaviour. The data tier consists of your storage media (databases, file systems, object stores) that holds the data relevant to the application. The figure shows a simple three-tier application.
Figure: Simple Three-Tier Architectural Pattern
Presentation Tier
The presentation tier represents the view part of the application. Here you can use S3 to host a static website. On a static website, individual web pages include static content, and they may also contain client-side scripting.
The following is a quick procedure to configure an Amazon S3 bucket for static website hosting in the S3 console.
To configure an S3 bucket for static website hosting
1. Log in to the AWS Management Console and open the S3 console at
2. In the Bucket name list, choose the name of the bucket that you want to enable static website hosting for.
3. Choose Properties.
4. Choose Static Website Hosting
Once you enable your bucket for static website hosting, browsers can access all of your content through the Amazon S3 website endpoint for your bucket.
5. Choose Use this bucket to host.
A. For Index Document, type the name of your index document, which is typically named index.html. When you configure an S3 bucket for website hosting, you must specify an index document, which S3 returns when requests are made to the root domain or any of the subfolders.
B. (Optional) For 4XX errors, you can optionally provide your own custom error document that provides additional guidance for your users. Type the name of the file that contains the custom error document. If an error occurs, S3 returns an error document.
C. (Optional) If you want to set advanced redirection rules, you describe the rules in XML in the Edit redirection rules text box. E.g.
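As an example (adapted from the standard sample in the S3 documentation; the prefixes are illustrative), the following rule redirects requests for the docs/ prefix to documents/:

```xml
<RoutingRules>
  <RoutingRule>
    <Condition>
      <KeyPrefixEquals>docs/</KeyPrefixEquals>
    </Condition>
    <Redirect>
      <ReplaceKeyPrefixWith>documents/</ReplaceKeyPrefixWith>
    </Redirect>
  </RoutingRule>
</RoutingRules>
```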
7. Add a bucket policy to the website bucket that grants everyone access to the objects in the S3 bucket. When you configure an S3 bucket as a website, you must make the objects that you want to serve publicly readable. To do so, you write a bucket policy that grants everyone the s3:GetObject permission. The following bucket policy grants everyone access to the objects in the example-bucket bucket.
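This is the standard S3 public-read policy, shown here with the example-bucket name from the text:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "PublicReadGetObject",
      "Effect": "Allow",
      "Principal": "*",
      "Action": "s3:GetObject",
      "Resource": "arn:aws:s3:::example-bucket/*"
    }
  ]
}
```

The `/*` suffix on the Resource ARN matters: the permission applies to the objects in the bucket, not to the bucket itself.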
Note: If you choose Disable Website Hosting, S3 removes the website configuration from the bucket, so that the bucket is no longer accessible from the website endpoint, but it is still available at the REST endpoint.
Logic Tier
The logic tier represents the brains of the application. Here, the two core serverless services, API Gateway and Lambda, are used to form your logic tier, and this combination can be revolutionary. The features of these two services allow you to build a serverless production application that is highly scalable, available, and secure. Your application could otherwise require a number of servers, yet by leveraging this pattern you do not have to manage a single one. In addition, by using these managed services together, you get the following benefits:
No operating system to choose, secure or manage.
No servers to right-size or monitor.
No risk to your cost by over-provisioning.
No risk to your performance by under-provisioning.
API Gateway
API Gateway is a fully managed service for defining, deploying, and maintaining APIs. Anyone can integrate with the APIs using standard HTTPS requests. Moreover, it has specific features and qualities that give it an edge for your logic tier.
Integration with Lambda
API Gateway gives your application a simple way to leverage the power of AWS Lambda directly over HTTPS requests. API Gateway forms the bridge that connects your presentation tier and the functions you write in Lambda. After you define the client/server relationship using your API, the contents of the client’s HTTPS requests are passed to the Lambda function for execution. The contents include request metadata, request headers, and the request body.
API Performance Across the Globe
Each deployment of API Gateway includes an Amazon CloudFront distribution under the covers. Amazon CloudFront is a content delivery web service that uses Amazon’s global network of edge locations as connection points for clients integrating with your API. This helps drive down the total response-time latency of your API. Through its use of multiple edge locations across the world, Amazon CloudFront also gives you capabilities to combat distributed denial of service (DDoS) attack scenarios.
You can improve the performance of specific API requests by using API Gateway to store responses in an optional in-memory cache. This not only provides performance benefits for repeated API requests, but it also reduces backend executions, which can reduce your overall cost.
Let’s dive into each step
1. Create a Lambda function: log in to the AWS Console, head over to the Lambda service, and click on “Create a function”.
A. Choose the first option, “Author from scratch”. B. Enter a function name. C. Select a runtime, e.g., Python 2.7. D. Click on “Create Function”.
Once your function is ready, you can see that a basic function is generated in the language you chose. E.g.:
import json

def lambda_handler(event, context):
    # TODO implement
    return {
        'statusCode': 200,
        'body': json.dumps('Hello from Lambda!')
    }
2. Testing Lambda Function
Click on the “Test” button at the top right corner, where we need to configure a test event. As we are not sending any events, just give the event a name, for example “HelloWorld”, leave the template as it is, and “Create” it.
Now, when you hit the “Test” button again, it runs through testing the function we created earlier and returns the configured value.
Create & Configure API Gateway connecting to Lambda
We are done with creating the Lambda function, but how do we invoke the function from the outside world? We need an endpoint, right?
Go to API Gateway and click on “Get Started”. It suggests creating an Example API, but we will not use that; we will create a “New API”. Give it a name, keeping the “Endpoint Type” regional for now.
Create the API, and you will land on the “Resources” page of the created API Gateway. Go through the following steps:
A. Click on “Actions”, then click on “Create Method”. Select the GET method for our function. Then tick the checkmark on the right side of “GET” to set it up. B. Choose “Lambda Function” as the integration type. C. Choose the region where we created the function earlier. D. Write the name of the Lambda function we created. E. Save the method; it will ask you to confirm “Add Permission to Lambda Function”. Agree to that, and that is done. F. Now we can test our setup. Click on “Test” to run the API. It should give the response text we saw on the Lambda test screen.
Now, to get endpoint. We need to deploy the API. On the Actions dropdown, click on Deploy API under API Actions. Fill in the details of deployment and hit Deploy.
After that, we will get our HTTPS endpoint.
On the above screen, you can see settings like caching, throttling, and logging, which can be configured. Save the changes and browse the invoke URL, from which we will get the response we earlier got from Lambda. With that, the logic tier of our serverless application is done.
Data Tier
By using Lambda as your logic tier, you have a number of data storage options for your data tier. These options fall into two broad categories: Amazon VPC-hosted data stores and IAM-enabled data stores. Lambda can integrate securely with both.
Amazon VPC Hosted Data Stores
Amazon RDS
Amazon ElastiCache
Amazon Redshift
IAM-Enabled Data Stores
Amazon DynamoDB
Amazon S3
Amazon Elasticsearch Service
You can use any of these for storage, but DynamoDB is one of the best suited for serverless applications.
Why DynamoDB?
It is a NoSQL DB that is fully managed by AWS.
It provides fast and predictable performance with seamless scalability.
DynamoDB lets you offload the administrative burden of operating and scaling a distributed system.
It offers encryption at rest, which eliminates the operational burden and complexity involved in protecting sensitive data.
You can scale your tables’ throughput capacity up or down without downtime or performance degradation.
It provides on-demand backups and lets you enable point-in-time recovery for your DynamoDB tables.
DynamoDB allows you to delete expired items from table automatically to help you reduce storage usage and the cost of storing data that is no longer relevant.
Following is the sample script for DynamoDB with Python which you can use with lambda.
from __future__ import print_function  # Python 2/3 compatibility
import boto3
import json
import decimal
from boto3.dynamodb.conditions import Key, Attr
from botocore.exceptions import ClientError

# Helper class to convert a DynamoDB item to JSON.
class DecimalEncoder(json.JSONEncoder):
    def default(self, o):
        if isinstance(o, decimal.Decimal):
            if o % 1 > 0:
                return float(o)
            else:
                return int(o)
        return super(DecimalEncoder, self).default(o)

dynamodb = boto3.resource("dynamodb", region_name='us-west-2',
                          endpoint_url="http://localhost:8000")
table = dynamodb.Table('Movies')

title = "The Big New Movie"
year = 2015

try:
    response = table.get_item(
        Key={
            'year': year,
            'title': title
        }
    )
except ClientError as e:
    print(e.response['Error']['Message'])
else:
    item = response['Item']
    print("GetItem succeeded:")
    print(json.dumps(item, indent=4, cls=DecimalEncoder))
Note: To run the above script successfully, you need to attach a policy to your Lambda execution role. In this case, you need to attach a policy that allows the DynamoDB operations to take place, and one for CloudWatch if you need to store your logs. The following is a policy you can attach to your role for the DB operations.
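A sketch of such a policy, assuming the Movies table from the script plus basic CloudWatch Logs permissions (the action list and resource ARNs are illustrative; scope them down to your account and region):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "dynamodb:GetItem",
        "dynamodb:PutItem",
        "dynamodb:Query",
        "dynamodb:Scan",
        "dynamodb:UpdateItem",
        "dynamodb:DeleteItem"
      ],
      "Resource": "arn:aws:dynamodb:*:*:table/Movies"
    },
    {
      "Effect": "Allow",
      "Action": [
        "logs:CreateLogGroup",
        "logs:CreateLogStream",
        "logs:PutLogEvents"
      ],
      "Resource": "arn:aws:logs:*:*:*"
    }
  ]
}
```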
You can implement the following popular architecture patterns using API Gateway and Lambda as your logic tier, Amazon S3 for your presentation tier, and DynamoDB as your data tier. For each example, we will only use AWS services that do not require users to manage their own infrastructure.
Mobile Backend
1. Presentation Tier: A mobile application running on each user’s smartphone.
2. Logic Tier: API Gateway and Lambda. The logic tier is globally distributed by the Amazon CloudFront distribution created as part of each API Gateway API. A set of Lambda functions can be specific to user/device identity management and authentication, managed by Amazon Cognito, which provides integration with IAM for temporary user access credentials as well as with popular third-party identity providers. Other Lambda functions can define the core business logic for your mobile back end.
3. Data Tier: The various data storage services can be leveraged as needed; options are given above in data tier.
Amazon S3 Hosted Website
1. Presentation Tier: Static website content hosted on S3, distributed by Amazon CloudFront. Hosting static website content on S3 is a cost-effective alternative to hosting content on server-based infrastructure. However, for a website to offer rich features, the static content often must integrate with a dynamic back end.
2. Logic Tier: API Gateway and Lambda. Static web content hosted in S3 can directly integrate with API Gateway, which can be made CORS compliant.
3. Data Tier: The various data storage services can be leveraged based on your requirement.
Serverless Costing
At the top of the AWS invoice, we can see the total cost of the AWS services. The bill shown was for 2.1 million API requests and all of the infrastructure required to support them.
Following is the list of services with their costing.
Note: You can estimate your costs with the AWS Calculator using the following links:
The three-tier architecture pattern encourages the best practice of creating application components that are easy to maintain and develop, decoupled, and scalable. Which serverless application services you use will vary based on your requirements.
In this blog, we will try to understand Istio and its YAML configurations. You will also learn why Istio is great for managing traffic and how to set it up using Google Kubernetes Engine (GKE). I’ve also shed some light on deploying Istio in various environments and applications like intelligent routing, traffic shifting, injecting delays, and testing the resiliency of your application.
What is Istio?
Istio’s website says it is “An open platform to connect, manage, and secure microservices.”
As a network of microservices, known as a ‘service mesh’, grows in size and complexity, it can become harder to understand and manage. Its requirements can include discovery, load balancing, failure recovery, metrics, and monitoring, and often more complex operational requirements such as A/B testing, canary releases, rate limiting, access control, and end-to-end authentication. Istio claims to provide a complete end-to-end solution to these problems.
Why Istio?
It provides automatic load balancing for various protocols like HTTP, gRPC, WebSocket, and TCP. This means you can cater to the needs of web services as well as frameworks like TensorFlow (which uses gRPC).
To control the flow of traffic and API calls between services, make calls more reliable, and make the network more robust in the face of adverse conditions.
To gain an understanding of the dependencies between services and the nature and flow of traffic between them, providing the ability to quickly identify issues.
Let’s explore the architecture of Istio.
Istio’s service mesh is split logically into two components:
Data plane – a set of intelligent proxies (Envoy) deployed as sidecars to the microservices; they control communication between the microservices.
Control plane – manages and configures proxies to route traffic. It also enforces policies.
Envoy – Istio uses an extended version of Envoy (an L7 proxy and communication bus designed for large, modern service-oriented architectures) written in C++. It manages inbound and outbound traffic for the service mesh.
Enough theory; now let us set up Istio to see things in action. A notable point is that Istio is pretty fast: it’s written in Go and adds very little overhead to your system.
Setup Istio on GKE
You can set up Istio either via the command line or via the UI. We have used the command-line installation for this blog.
Sample Book Review Application
Following this link, you can easily deploy the sample Bookinfo application.
The Bookinfo application is broken into four separate microservices:
productpage. The productpage microservice calls the details and reviews microservices to populate the page.
details. The details microservice contains book information.
reviews. The reviews microservice contains book reviews. It also calls the ratings microservice.
ratings. The ratings microservice contains book ranking information that accompanies a book review.
There are 3 versions of the reviews microservice:
Version v1 doesn’t call the ratings service.
Version v2 calls the ratings service and displays each rating as 1 to 5 black stars.
Version v3 calls the ratings service and displays each rating as 1 to 5 red stars.
The end-to-end architecture of the application is shown below.
If everything goes well, you will have a web app like this (served at http://GATEWAY_URL/productpage):
Let’s take a case where 50% of traffic is routed to v1 and the remaining 50% to v3.
This is how the config file looks (/path/to/istio-0.2.12/samples/bookinfo/kube/route-rule-reviews-50-v3.yaml):
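Reconstructed here for reference, in the v1alpha2 RouteRule format that Istio 0.2.x used (the rule name and exact fields are a best-effort sketch; newer Istio versions express this with VirtualService and DestinationRule instead):

```yaml
apiVersion: config.istio.io/v1alpha2
kind: RouteRule
metadata:
  name: reviews-50-v3
spec:
  destination:
    name: reviews
  precedence: 1
  route:
  - labels:
      version: v1
    weight: 50
  - labels:
      version: v3
    weight: 50
```

The two weights must sum to 100; Istio then splits the traffic to the reviews service between the pods labeled version: v1 and version: v3.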
Istio provides a simple Domain-specific language (DSL) to control how API calls and layer-4 traffic flow across various services in the application deployment.
In the above configuration, we are adding a “route rule”, meaning we will route the traffic coming to a destination. The destination is the name of the service to which the traffic is being routed. The route labels identify the specific service instances that will receive the traffic.
In this Kubernetes deployment of Istio, the route labels “version: v1” and “version: v3” indicate that only pods carrying the label “version: v1” or “version: v3” will receive traffic, 50% each.
Now multiple route rules could be applied to the same destination. The order of evaluation of rules corresponding to a given destination, when there is more than one, can be specified by setting the precedence field of the rule.
The precedence field is an optional integer value, 0 by default. Rules with higher precedence values are evaluated first. If there is more than one rule with the same precedence value the order of evaluation is undefined.
When is precedence useful? Whenever the routing story for a particular service is purely weight based, it can be specified in a single rule.
Once a rule is found that applies to the incoming request, it will be executed and the rule-evaluation process will terminate. That’s why it’s very important to carefully consider the priorities of each rule when there is more than one.
In short, if a rule routing to “version: v1” carried a higher precedence value than a rule routing to “version: v3”, the v1 rule would be evaluated first and take preference.
Intelligent Routing Using Istio
We will demonstrate an example in which we aim to get finer control over the traffic coming to our app. Before reading ahead, make sure that you have installed Istio and the Bookinfo application.
First, we will set a default version for all microservices.
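In the Istio 0.2.x samples this is done by applying the bundled route-rule-all-v1.yaml, which contains one default rule per Bookinfo service; the rule for reviews looks roughly like this (a sketch of the stock sample):

```yaml
apiVersion: config.istio.io/v1alpha2
kind: RouteRule
metadata:
  name: reviews-default
spec:
  destination:
    name: reviews
  precedence: 1
  route:
  - labels:
      version: v1     # every request goes to the v1 pods
```

Applied with `istioctl create -f samples/bookinfo/kube/route-rule-all-v1.yaml`.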
Then wait a few seconds for the rules to propagate to all pods before attempting to access the application. This sets the default route to the v1 version (which doesn’t call the ratings service). Now we want a specific user, say Velotio, to see the v2 version, so we write a YAML file (test-velotio.yaml).
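Modeled on the stock route-rule-reviews-test-v2.yaml sample (which matches the user “jason”), a rule matching the “velotio” user might look like this. This is a sketch: the cookie regex assumes the Bookinfo login sets a `user=<name>` cookie.

```yaml
apiVersion: config.istio.io/v1alpha2
kind: RouteRule
metadata:
  name: reviews-test-velotio
spec:
  destination:
    name: reviews
  precedence: 2                  # checked before the precedence-1 default
  match:
    request:
      headers:
        cookie:
          regex: "^(.*?;)?(user=velotio)(;.*)?$"   # only the logged-in "velotio" user
  route:
  - labels:
      version: v2                # this user is routed to v2
```

Apply it with `istioctl create -f test-velotio.yaml`.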
Now if any other user logs in, they won’t see any ratings (they get the v1 version), but when the “velotio” user logs in, they will see the v2 version!
This is how we can intelligently do content-based routing. We used Istio to send 100% of the traffic to the v1 version of each of the Bookinfo services, then set a rule to selectively send traffic to version v2 of the reviews service based on a header (a user cookie) in the request.
Traffic Shifting
Now let’s take a case in which we have to shift traffic from an old service to a new one.
We can use Istio to gradually transfer traffic from one microservice version to another; for example, we could move 10%, 20%, 25%, and so on up to 100% of the traffic. For simplicity, we will move traffic from reviews:v1 to reviews:v3 in two steps: 40%, then 100%.
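A sketch of the first step (hypothetical rule, v1alpha2 syntax): replace the default reviews rule with one that keeps 60% of traffic on v1 and shifts 40% to v3.

```yaml
apiVersion: config.istio.io/v1alpha2
kind: RouteRule
metadata:
  name: reviews-default
spec:
  destination:
    name: reviews
  precedence: 1
  route:
  - labels:
      version: v1
    weight: 60      # 60% of requests stay on v1
  - labels:
      version: v3
    weight: 40      # 40% shift to v3
```

Since a rule for the destination already exists, use `istioctl replace -f ...` rather than `istioctl create`.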
Now refresh the productpage in your browser, and you should see red-colored star ratings approximately 40% of the time. Once that is stable, we transfer all of the traffic to v3.
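The final step replaces the weighted rule with one that sends everything to v3 (the Istio 0.2.x samples ship a file like this as route-rule-reviews-v3.yaml; treat the exact name as an assumption):

```yaml
apiVersion: config.istio.io/v1alpha2
kind: RouteRule
metadata:
  name: reviews-default
spec:
  destination:
    name: reviews
  precedence: 1
  route:
  - labels:
      version: v3     # all traffic now goes to v3
```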
Inject Delays and Test the Resiliency of Your Application
Here we will test fault injection using an HTTP delay. To test our Bookinfo application microservices for resiliency, we will inject a 7s delay between the reviews:v2 and ratings microservices for the user “Jason”. Since the reviews:v2 service has a 10s timeout for its calls to the ratings service, we expect the end-to-end flow to continue without any errors.
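The delay rule corresponds to the stock route-rule-ratings-test-delay.yaml sample; roughly (a sketch in v1alpha2 syntax):

```yaml
apiVersion: config.istio.io/v1alpha2
kind: RouteRule
metadata:
  name: ratings-test-delay
spec:
  destination:
    name: ratings
  precedence: 2
  match:
    request:
      headers:
        cookie:
          regex: "^(.*?;)?(user=jason)(;.*)?$"   # only the logged-in user "jason"
  route:
  - labels:
      version: v1
  httpFault:
    delay:
      percent: 100       # delay every matching request
      fixedDelay: 7s     # by seven seconds
```

Create it with `istioctl create -f route-rule-ratings-test-delay.yaml`, then confirm it with the command below.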
> istioctl get routerule ratings-test-delay -o yaml
Now we allow several seconds for the rule to propagate to all pods, then log in as user “Jason”. If the application’s front page is set up to correctly handle delays, we expect it to load within approximately 7 seconds.
Conclusion
In this blog, we explored only the routing capabilities of Istio. We found that Istio gives us a good amount of control over routing, fault injection, etc. in microservices. Istio has a lot more to offer, such as load balancing and security. We encourage you to toy around with Istio and tell us about your experiences.