Tag: programming how tos

  • An Introduction to Asynchronous Programming in Python

    Introduction

    Asynchronous programming is a type of parallel programming in which a unit of work is allowed to run separately from the primary application thread. When the work is complete, it notifies the main thread about completion or failure of the worker thread. There are numerous benefits to using it, such as improved application performance and enhanced responsiveness.

    Asynchronous programming has been gaining a lot of attention in the past few years, and for good reason. Although it can be more difficult than the traditional linear style, it can also be much more efficient for programs that spend most of their time waiting on I/O.

    For example, instead of waiting for an HTTP request to finish before continuing execution, with Python async coroutines you can submit the request and do other work that’s waiting in a queue while waiting for the HTTP request to finish.

    Asynchronicity seems to be a big reason why Node.js is so popular for server-side programming. Much of the code we write, especially in I/O-heavy applications like websites, depends on external resources. This could be anything from a remote database call to POSTing to a REST service. As soon as you ask for any of these resources, your code is waiting around with nothing to do. With asynchronous programming, you allow your code to handle other tasks while waiting for these resources to respond.

    How Does Python Do Multiple Things At Once?

    1. Multiple Processes

    The most obvious way is to use multiple processes. From the terminal, you can start your script two, three, four…ten times, and all the scripts will run independently and at the same time. The underlying operating system takes care of sharing your CPU resources among all those instances. Alternatively, you can use the multiprocessing library, which supports spawning processes, as shown in the example below.

    from multiprocessing import Process
    
    
    def print_func(continent='Asia'):
        print('The name of continent is : ', continent)
    
    if __name__ == "__main__":  # run only when executed as a script, not when imported
        names = ['America', 'Europe', 'Africa']
        procs = []
        proc = Process(target=print_func)  # instantiating without any argument
        procs.append(proc)
        proc.start()
    
        # instantiating process with arguments
        for name in names:
            # print(name)
            proc = Process(target=print_func, args=(name,))
            procs.append(proc)
            proc.start()
    
        # complete the processes
        for proc in procs:
            proc.join()

    Output:

    The name of continent is :  Asia
    The name of continent is :  America
    The name of continent is :  Europe
    The name of continent is :  Africa

    2. Multiple Threads

    The next way to run multiple things at once is to use threads. A thread is a line of execution, much like a process, but you can have multiple threads in the context of one process, and they all share access to common resources. Because of this sharing, threaded code is difficult to write correctly. Again, the operating system does all the heavy lifting of sharing the CPU, but the global interpreter lock (GIL) allows only one thread to run Python code at a given time, even when you have multiple threads running code. So, in CPython, the GIL prevents multi-core concurrency for Python code. Basically, you’re running on a single core even though you may have two, four, or more.

    import threading
     
    def print_cube(num):
        """
        function to print cube of given num
        """
        print("Cube: {}".format(num * num * num))
     
    def print_square(num):
        """
        function to print square of given num
        """
        print("Square: {}".format(num * num))
     
    if __name__ == "__main__":
        # creating thread
        t1 = threading.Thread(target=print_square, args=(10,))
        t2 = threading.Thread(target=print_cube, args=(10,))
     
        # starting thread 1
        t1.start()
        # starting thread 2
        t2.start()
     
        # wait until thread 1 is completely executed
        t1.join()
        # wait until thread 2 is completely executed
        t2.join()
     
        # both threads completely executed
        print("Done!")

    Output:

    Square: 100
    Cube: 1000
    Done!
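
    The difficulty with shared resources mentioned above can be made concrete with a small sketch. The counter, the add_many and run_workers helpers, and the iteration counts are illustrative assumptions rather than part of the original example; the lock is what keeps the read-modify-write on the shared counter safe.

```python
import threading

counter = 0                      # shared state visible to every thread
counter_lock = threading.Lock()  # guards updates to counter


def add_many(times):
    """Increment the shared counter `times` times, one locked step at a time."""
    global counter
    for _ in range(times):
        # Without the lock, two threads could read the same old value
        # and one of the increments would be lost.
        with counter_lock:
            counter += 1


def run_workers(num_threads=4, times=10_000):
    """Start num_threads workers, wait for them, and return the final count."""
    global counter
    counter = 0
    threads = [threading.Thread(target=add_many, args=(times,))
               for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter


if __name__ == "__main__":
    print("Final count:", run_workers())
```

    With the lock in place, the final count always equals the number of threads multiplied by the number of increments per thread.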

    3. Coroutines using yield:

    Coroutines are a generalization of subroutines. They are used for cooperative multitasking, where a process voluntarily yields (gives away) control periodically or when idle in order to enable multiple applications to run simultaneously. Coroutines are similar to generators, but with a few extra methods and a slight change in how we use the yield statement. Generators produce data for iteration, while coroutines can also consume data.

    def print_name(prefix):
        print("Searching prefix:{}".format(prefix))
        try:
            while True:
                # yield used to create coroutine
                name = (yield)
                if prefix in name:
                    print(name)
        except GeneratorExit:
            print("Closing coroutine!!")
     
    corou = print_name("Dear")
    next(corou)  # prime the coroutine: run it up to the first yield
    corou.send("James")
    corou.send("Dear James")
    corou.close()

    Output:

    Searching prefix:Dear
    Dear James
    Closing coroutine!!
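
    The example above only consumes data. As a further sketch of the point that coroutines can both consume and produce, the hypothetical running_average coroutine below receives a number through send() and hands back the average of everything received so far. The function name and the sample values are assumptions made for illustration.

```python
def running_average():
    """Coroutine that receives numbers and yields the average so far."""
    total = 0.0
    count = 0
    average = None
    while True:
        # yield hands back the current average and waits for the next value
        value = yield average
        total += value
        count += 1
        average = total / count


if __name__ == "__main__":
    averager = running_average()
    next(averager)  # prime the coroutine: run it up to the first yield
    for number in (10, 20, 60):
        print(averager.send(number))
    averager.close()
```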

    4. Asynchronous Programming

    The fourth way is asynchronous programming, where the OS does not participate. As far as the OS is concerned, you have one process with a single thread in it, yet you’re able to do multiple things at once. So, what’s the trick?

    The answer is asyncio.

    asyncio is a concurrency module introduced in Python 3.4. It is designed to use coroutines and futures to simplify asynchronous code and make it almost as readable as synchronous code, as there are no callbacks.

    asyncio uses three main constructs: event loops, coroutines, and futures.

    • An event loop manages and distributes the execution of different tasks. It registers them and handles distributing the flow of control between them.
    • Coroutines (covered above) are special functions that work similarly to Python generators: on await, they release the flow of control back to the event loop. A coroutine needs to be scheduled to run on the event loop. Once scheduled, coroutines are wrapped in Tasks, which are a type of Future.
    • Futures represent the result of a task that may or may not have been executed. This result may be an exception.

    Using asyncio, you can structure your code so that subtasks are defined as coroutines, which you can schedule as you please, including simultaneously. Coroutines contain yield points, which mark where a context switch can happen if other tasks are pending; if no other task is pending, no switch occurs.

    A context switch in asyncio represents the event loop yielding the flow of control from one coroutine to the next.
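
    Before moving to real HTTP requests, here is a minimal sketch of such context switches. It assumes Python 3.7 or later (for asyncio.run), uses asyncio.sleep as a stand-in for waiting on a network response, and the names fetch and main are made up for illustration.

```python
import asyncio


async def fetch(name, delay, log):
    """Pretend to wait on I/O for `delay` seconds, then report back."""
    log.append("start " + name)
    # await hands control back to the event loop until the sleep is over
    await asyncio.sleep(delay)
    log.append("done " + name)
    return name


async def main():
    log = []
    # gather wraps each coroutine in a Task and runs them concurrently
    results = await asyncio.gather(
        fetch("slow", 0.2, log),
        fetch("fast", 0.1, log),
    )
    return results, log


if __name__ == "__main__":
    results, log = asyncio.run(main())
    print(results)
    print(log)
```

    Both coroutines start before either finishes, and the shorter wait completes first, even though everything runs in a single thread. The results come back in the order the coroutines were passed to gather.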

    In the example below, we run three async tasks that each query Reddit, then extract and print the JSON. We use aiohttp, an HTTP client library, so that even the HTTP requests run asynchronously.

    import asyncio
    import json

    import aiohttp


    async def get_json(client, url):
        # Fetch the raw response body; fail loudly on a non-200 status
        async with client.get(url) as response:
            assert response.status == 200
            return await response.read()


    async def get_reddit_top(subreddit, client):
        data1 = await get_json(client, 'https://www.reddit.com/r/' + subreddit + '/top.json?sort=top&t=day&limit=5')

        j = json.loads(data1.decode('utf-8'))
        for i in j['data']['children']:
            score = i['data']['score']
            title = i['data']['title']
            link = i['data']['url']
            print(str(score) + ': ' + title + ' (' + link + ')')

        print('DONE:', subreddit + '\n')


    async def main():
        # One session is shared by all three tasks and closed on exit
        async with aiohttp.ClientSession() as client:
            # Run the three queries concurrently on the event loop
            await asyncio.gather(
                get_reddit_top('python', client),
                get_reddit_top('programming', client),
                get_reddit_top('compsci', client),
            )


    if __name__ == '__main__':
        asyncio.run(main())

    Output:

    50: Undershoot: Parsing theory in 1965 (http://jeffreykegler.github.io/Ocean-of-Awareness-blog/individual/2018/07/knuth_1965_2.html)
    12: Question about best-prefix/failure function/primal match table in kmp algorithm (https://www.reddit.com/r/compsci/comments/8xd3m2/question_about_bestprefixfailure_functionprimal/)
    1: Question regarding calculating the probability of failure of a RAID system (https://www.reddit.com/r/compsci/comments/8xbkk2/question_regarding_calculating_the_probability_of/)
    DONE: compsci
    
    336: /r/thanosdidnothingwrong -- banning people with python (https://clips.twitch.tv/AstutePluckyCocoaLitty)
    175: PythonRobotics: Python sample codes for robotics algorithms (https://atsushisakai.github.io/PythonRobotics/)
    23: Python and Flask Tutorial in VS Code (https://code.visualstudio.com/docs/python/tutorial-flask)
    17: Started a new blog on Celery - what would you like to read about? (https://www.python-celery.com)
    14: A Simple Anomaly Detection Algorithm in Python (https://medium.com/@mathmare_/pyng-a-simple-anomaly-detection-algorithm-2f355d7dc054)
    DONE: python
    
    1360: git bundle (https://dev.to/gabeguz/git-bundle-2l5o)
    1191: Which hashing algorithm is best for uniqueness and speed? Ian Boyd's answer (top voted) is one of the best comments I've seen on Stackexchange. (https://softwareengineering.stackexchange.com/questions/49550/which-hashing-algorithm-is-best-for-uniqueness-and-speed)
    430: ARM launches “Facts” campaign against RISC-V (https://riscv-basics.com/)
    244: Choice of search engine on Android nuked by “Anonymous Coward” (2009) (https://android.googlesource.com/platform/packages/apps/GlobalSearch/+/592150ac00086400415afe936d96f04d3be3ba0c)
    209: Exploiting freely accessible WhatsApp data or “Why does WhatsApp web know my phones battery level?” (https://medium.com/@juan_cortes/exploiting-freely-accessible-whatsapp-data-or-why-does-whatsapp-know-my-battery-level-ddac224041b4)
    DONE: programming

    Using Redis and Redis Queue (RQ):

    Using asyncio and aiohttp may not always be an option, especially if you are using older versions of Python. Also, there will be scenarios where you want to distribute your tasks across different servers. In that case, we can leverage RQ (Redis Queue). It is a simple Python library for queueing jobs and processing them in the background with workers. It is backed by Redis, a key/value data store.

    In the example below, we have queued a simple function count_words_at_url using redis.

    from mymodule import count_words_at_url
    from redis import Redis
    from rq import Queue
    
    
    q = Queue(connection=Redis())
    job = q.enqueue(count_words_at_url, 'http://nvie.com')
    
    
    # ****** mymodule.py ******

    import requests


    def count_words_at_url(url):
        """Just an example function that's called async."""
        resp = requests.get(url)
        word_count = len(resp.text.split())

        print(word_count)
        return word_count

    Output:

    15:10:45 RQ worker 'rq:worker:EMPID18030.9865' started, version 0.11.0
    15:10:45 *** Listening on default...
    15:10:45 Cleaning registries for queue: default
    15:10:50 default: mymodule.count_words_at_url('http://nvie.com') (a2b7451e-731f-4f31-9232-2b7e3549051f)
    322
    15:10:51 default: Job OK (a2b7451e-731f-4f31-9232-2b7e3549051f)
    15:10:51 Result is kept for 500 seconds

    Conclusion:

    Let’s take a classic example: a chess exhibition where one of the best chess players competes against a lot of people. If there are 24 games against 24 opponents and the chess master plays them synchronously, one game at a time, it will take at least 12 hours (assuming that the average game takes 30 moves, the chess master thinks for 5 seconds to come up with a move, and the opponent thinks for approximately 55 seconds). In asynchronous mode, the chess master makes a move and leaves the opponent thinking while going to the next board and making a move there. This way, a move on all 24 games can be made in 2 minutes, and all of them can be won in just one hour.
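
    The arithmetic behind these figures can be checked with a few lines of Python. This is only a back-of-the-envelope sketch using the numbers from the paragraph above; the constant and function names are made up for illustration, and the time spent walking between boards is ignored.

```python
GAMES = 24
MOVES_PER_GAME = 30    # each "move" is one master move plus one opponent reply
MASTER_SECONDS = 5     # thinking time per master move
OPPONENT_SECONDS = 55  # thinking time per opponent move


def synchronous_hours():
    """One game at a time: the master waits out every opponent move."""
    per_game = MOVES_PER_GAME * (MASTER_SECONDS + OPPONENT_SECONDS)
    return GAMES * per_game / 3600


def asynchronous_hours():
    """Round-robin: the master moves on every board while opponents think."""
    per_round = GAMES * MASTER_SECONDS  # one pass over all 24 boards
    return MOVES_PER_GAME * per_round / 3600


if __name__ == "__main__":
    print("synchronous:", synchronous_hours(), "hours")
    print("asynchronous:", asynchronous_hours(), "hours")
```

    One pass over the boards takes 24 × 5 = 120 seconds, which is longer than the 55 seconds an opponent needs, so every opponent has already replied by the time the master returns.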

    So, this is what’s meant when people talk about asynchronous being really fast. It’s this kind of fast. The chess master doesn’t play chess faster; the time is just used better, and none of it gets wasted on waiting around. This is how it works.

    In this analogy, the chess master is our CPU, and the idea is to make sure that the CPU doesn’t wait, or waits for the least amount of time possible. It’s about always finding something to do.

    A practical definition of async is that it’s a style of concurrent programming in which tasks release the CPU during waiting periods so that other tasks can use it. In Python, there are several ways to achieve concurrency; based on our requirements, code flow, data manipulation, architecture design, and use cases, we can select any of these methods.

  • Web Scraping: Introduction, Best Practices & Caveats

    Web scraping is a process to crawl various websites and extract the required data using spiders. This data is processed in a data pipeline and stored in a structured format. Today, web scraping is widely used and has many use cases:

    • Using web scraping, Marketing & Sales companies can fetch lead-related information.
    • Web scraping is useful for Real Estate businesses to get the data of new projects, resale properties, etc.
    • Price comparison portals, like Trivago, extensively use web scraping to get the information of product and price from various e-commerce sites.

    The process of web scraping usually involves spiders, which fetch the HTML documents from relevant websites, extract the needed content based on the business logic, and finally store it in a specific format. This blog is a primer on building highly scalable scrapers. We will cover the following items:

    1. Ways to scrape: We’ll see basic ways to scrape data using techniques and frameworks in Python with some code snippets.
    2. Scraping at scale: Scraping a single page is straightforward, but there are challenges in scraping millions of websites, including managing the spider code, collecting data, and maintaining a data warehouse. We’ll explore such challenges and their solutions to make scraping easy and accurate.
    3. Scraping Guidelines: Scraping data from websites without the owner’s permission can be deemed malicious. Certain guidelines need to be followed to ensure our scrapers are not blacklisted. We’ll look at some of the best practices one should follow for crawling.

    So let’s start scraping. 

    Different Techniques for Scraping

    Here, we will discuss how to scrape a page and the different libraries available in Python.

    Note: Python is one of the most popular languages for scraping.

    1. Requests – HTTP Library in Python: To scrape a website or a page, first fetch the content of the HTML page in an HTTP response object. The requests library for Python is pretty handy and easy to use. It uses urllib3 internally. I like ‘requests’ as it’s easy and the code becomes readable too.

    #Example showing how to use the requests library
    import requests
    r = requests.get("https://velotio.com") #Fetch HTML Page

    2. BeautifulSoup: Once you get the webpage, the next step is to extract the data. BeautifulSoup is a powerful Python library that helps you extract the data from the page. It’s easy to use and has a wide range of APIs that’ll help you extract the data. We use the requests library to fetch an HTML page and then use BeautifulSoup to parse that page. In this example, we fetch the page title and all links on the page. Check out the documentation for all the possible ways in which we can use BeautifulSoup.

    from bs4 import BeautifulSoup
    import requests
    r = requests.get("https://velotio.com") #Fetch HTML Page
    soup = BeautifulSoup(r.text, "html.parser") #Parse HTML Page
    print("Webpage Title:", soup.title.string)
    print("Fetch All Links:", soup.find_all('a'))

    3. Python Scrapy Framework:

    Scrapy is a Python-based web scraping framework that allows you to create different kinds of spiders to fetch the source code of the target website. Scrapy starts crawling the web pages present on a certain website, and then you can write the extraction logic to get the required data. Scrapy is built on top of Twisted, a Python-based asynchronous library that performs the requests in an async fashion to boost spider performance. Because requests are issued asynchronously, a Scrapy spider is typically faster than a script that fetches pages one at a time and parses them with BeautifulSoup. Moreover, it is a framework for writing scrapers, as opposed to BeautifulSoup, which is just a library for parsing HTML pages.

    Here is a simple example of how to use Scrapy. Install Scrapy via pip. Scrapy gives a shell after parsing a website:

    $ pip install scrapy #Install Scrapy
    $ scrapy shell https://velotio.com
    In [1]: response.xpath("//a").extract() #Fetch all a hrefs

    Now, let’s write a custom spider to parse a website.

    $ cat > myspider.py <<EOF
    import scrapy

    class BlogSpider(scrapy.Spider):
        name = 'blogspider'
        start_urls = ['https://blog.scrapinghub.com']

        def parse(self, response):
            for title in response.css('h2.entry-title'):
                yield {'title': title.css('a ::text').extract_first()}
    EOF
    $ scrapy runspider myspider.py

    That’s it. Your first custom spider is created. Now, let’s understand the code.

    • name: Name of the spider. In this case, it’s “blogspider”.
    • start_urls: A list of URLs where the spider will begin to crawl from.
    • parse(self, response): This function is called whenever the crawler successfully crawls a URL. The response object used earlier in the Scrapy shell is the same response object that is passed to the parse(..).

    When you run this, Scrapy will request the start URL, select all the h2 elements with the entry-title class, and extract the associated link text from each. Alternatively, you can keep your extraction logic in the parse method or create a separate class for extraction and call its object from the parse method.

    You’ve seen how to extract simple items from a website using Scrapy, but this is just the surface. Scrapy provides a lot of powerful features for making scraping easy and efficient. Here is a tutorial for Scrapy and the additional documentation for LinkExtractor by which you can instruct Scrapy to extract links from a web page.

    4. Python lxml.html library: This is another Python library, just like BeautifulSoup. Scrapy internally uses lxml. It comes with a list of APIs you can use for data extraction. Why would you use this when Scrapy itself can extract the data? Let’s say you want to iterate over the ‘div’ tags and perform some operation on each tag nested under a ‘div’. This library will give you a list of ‘div’ elements, and you can iterate over them using the iter() function to traverse each child tag inside the parent div. Such traversal operations are otherwise difficult in scraping. Here is the documentation for this library.

    Challenges while Scraping at Scale

    Let’s look at the challenges and solutions while scraping at large scale, i.e., scraping 100-200 websites regularly:

    1. Data warehousing: Data extraction at a large scale generates vast volumes of information. Fault tolerance, scalability, security, and high availability are must-have features for a data warehouse. If your data warehouse is not stable or accessible, then operations like search and filter over the data become an overhead. To achieve this, instead of maintaining your own database or infrastructure, you can use Amazon Web Services (AWS). You can use RDS (Relational Database Service) for a structured database and DynamoDB for a non-relational database. AWS takes care of data backups: it automatically takes snapshots of the database. It gives you database error logs as well. This blog explains how to set up infrastructure in the cloud for scraping.

    2. Pattern Changes: Scraping relies heavily on the user interface and its structure, i.e., CSS and XPath. If the target website changes its layout, our scraper may crash completely or return random data that we don’t want. This is a common scenario, and it is why maintaining scrapers is harder than writing them. To handle this case, we can write test cases for the extraction logic and run them daily, either manually or from CI tools like Jenkins, to track whether the target website has changed.

    3. Anti-scraping Technologies: Web scraping is common these days, and many website hosts want to prevent their data from being scraped. Anti-scraping technologies help them do this. For example, if you hit a particular website from the same IP address at a regular interval, the target website can block your IP. Adding a captcha to a website also helps. There are ways to get around these anti-scraping measures. For example, we can use proxy servers to hide our original IP. There are several proxy services that rotate the IP before each request. It is also easy to add support for proxy servers in the code, and in Python, the Scrapy framework supports it.

    4. JavaScript-based dynamic content: Websites that rely heavily on JavaScript and Ajax to render dynamic content make data extraction difficult. Scrapy and related frameworks and libraries only extract what they find in the HTML document. Ajax calls and JavaScript are executed at runtime, so these tools can’t scrape the content they produce. This can be handled by rendering the web page in a headless browser such as Headless Chrome, which essentially allows running Chrome in a server environment. You can also use PhantomJS, which provides a headless WebKit-based environment.

    5. Honeypot traps: Some websites place honeypot traps on their webpages to detect web crawlers. These are links that a human visitor never sees, because they are blended with the background color or their CSS display property is set to none, so they are hard for a crawler to recognize. Setting up such traps, and avoiding them, requires a large coding effort on both the server and the crawler side, hence this method is not frequently used.

    6. Quality of data: Currently, AI and ML projects are in high demand, and these projects need data at a large scale. Data integrity is also important, as one fault can cause serious problems in AI/ML algorithms. So, in scraping, it is very important to not just scrape the data, but verify its integrity as well. Doing this in real time is not always possible, so I prefer to write test cases for the extraction logic to make sure that whatever your spiders extract is correct and that they are not scraping any bad data.

    7. More Data, More Time: This one is obvious. The larger a website is, the more data it contains and the longer it takes to scrape. This may be fine if your purpose for scraping the site isn’t time-sensitive, but that isn’t often the case. Stock prices don’t stay the same over hours. Sales listings, currency exchange rates, media trends, and market prices are just a few examples of time-sensitive data. What to do in this case? One solution is to design your spiders carefully. If you’re using a framework like Scrapy, apply proper LinkExtractor rules so that the spider does not waste time scraping unrelated URLs.

    You may also use Python packages for parallel and distributed crawling, such as Frontera and Scrapy Redis. Frontera lets you send out only one request per domain at a time, but can hit multiple domains at once, making it great for parallel scraping. Scrapy Redis lets you send out multiple requests to one domain. The right combination of these can result in a very powerful web spider that can handle both the bulk and the variation of large websites.

    8. Captchas: Captchas are a good way of keeping crawlers away from a website, and they are used by many website hosts. So, in order to scrape data from such websites, we need a mechanism to solve the captchas. There are packages and software that can solve captchas and act as middleware between the target website and your spider. You may also use libraries like Pillow and Tesseract in Python to solve simple image-based captchas.

    9. Maintaining Deployment: Normally, we don’t want to limit ourselves to scraping just a few websites. We need the maximum amount of data available on the Internet, and that may mean scraping millions of websites. Now, you can imagine the size of the code and the deployment. We can’t run spiders at this scale from a single machine. What I prefer here is to dockerize the scrapers and take advantage of technologies like AWS ECS and Kubernetes to run our scraper containers. This helps us keep our scrapers highly available, and it’s easy to maintain. We can also schedule the scrapers to run at regular intervals.
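
    The test-case idea from the Pattern Changes and Quality of data items can be sketched as follows. To stay self-contained, this sketch uses Python’s built-in html.parser instead of Scrapy or BeautifulSoup, and the sample markup, the TitleExtractor class, and the function names are assumptions made for illustration. In a daily job, the saved sample would be replaced by a freshly fetched page.

```python
from html.parser import HTMLParser

# Saved sample of the target page (hypothetical markup for illustration)
FIXTURE = """
<html><body>
  <h2 class="entry-title"><a href="/post-1">First post</a></h2>
  <h2 class="entry-title"><a href="/post-2">Second post</a></h2>
  <h2 class="sidebar">Not a post</h2>
</body></html>
"""


class TitleExtractor(HTMLParser):
    """Collects the text of every <h2 class="entry-title"> element."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "h2" and "entry-title" in classes:
            self._in_title = True
            self.titles.append("")

    def handle_endtag(self, tag):
        if tag == "h2":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.titles[-1] += data.strip()


def extract_titles(html):
    """Return the post titles found in html, in document order."""
    parser = TitleExtractor()
    parser.feed(html)
    return parser.titles


def test_extract_titles():
    # Fails if the extraction logic breaks or the page layout has changed
    assert extract_titles(FIXTURE) == ["First post", "Second post"]


if __name__ == "__main__":
    test_extract_titles()
    print("extraction test passed")
```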

    Scraping Guidelines/ Best Practices

    1. Respect the robots.txt file: Robots.txt is a text file that webmasters create to instruct search engine robots on how to crawl and index pages on the website. Before even planning the extraction logic, you should check this file. You can usually find it at the root of the website, at /robots.txt. This file has all the rules on how crawlers should interact with the website. For example, if a website has a link to download critical information, the owners probably don’t want to expose that to crawlers. Another important factor is the frequency interval for crawling, which means that crawlers can only hit the website at specified intervals. If someone has asked us not to crawl their website, we had better not do it, because if they catch your crawlers, it can lead to serious legal issues.

    2. Do not hit the servers too frequently: As I mentioned above, some websites specify a frequency interval for crawlers. We had better respect it, because not every website is tested against high load. If you hit a server too frequently, it creates heavy traffic on the server side, and the server may crash or fail to serve other requests. This has a high impact on the user experience, and users are more important than bots. So, we should make requests according to the interval specified in robots.txt, or use a standard delay of 10 seconds. This also helps you avoid getting blocked by the target website.

    3. User Agent Rotation and Spoofing: Every request carries a User-Agent string in its headers. This string helps identify the browser you are using, its version, and the platform. If we use the same User-Agent in every request, it’s easy for the target website to tell that the requests are coming from a crawler. To avoid this, rotate the User-Agent string between requests. You can easily find examples of genuine User-Agent strings on the Internet; try them out. If you’re using Scrapy, you can set the USER_AGENT property in settings.py.

    4. Disguise your requests by rotating IPs and Proxy Services: We’ve discussed this in the challenges above. It’s always better to use rotating IPs and proxy service so that your spider won’t get blocked.

    5. Do not follow the same crawling pattern: As you know, many websites use anti-scraping technologies, so it’s easy for them to detect your spider if it crawls in the same pattern every time. A human would not normally follow a fixed pattern on a particular website. So, to keep your spiders running smoothly, we can introduce actions like mouse movements, clicking a random link, etc., which give the impression that your spider is a human.

    6. Scrape during off-peak hours: Off-peak hours are suitable for bots/crawlers as the traffic on the website is considerably less. These hours can be identified by the geolocation from where the site’s traffic originates. This also helps to improve the crawling rate and avoid the extra load from spider requests. Thus, it is advisable to schedule the crawlers to run in the off-peak hours.

    7. Use the scraped data responsibly: We should always take responsibility for the scraped data. It is not acceptable to scrape data and then republish it somewhere else. This can be considered a breach of copyright law and may lead to legal issues. So, it is advisable to check the target website’s Terms of Service page before scraping.

    8. Use Canonical URLs: When we scrape, we tend to scrape duplicate URLs, and hence duplicate data, which is the last thing we want to do. Within a single website, multiple URLs may serve the same data. In this situation, the duplicate URLs will have a canonical URL, which points to the parent or original URL. By scraping only the canonical URL, we make sure we don’t scrape duplicate content. In frameworks like Scrapy, duplicate URLs are handled by default.

    9. Be transparent: Don’t misrepresent your purpose or use deceptive methods to gain access. If you have a login and a password that identifies you to gain access to a source, use it. Don’t hide who you are. If possible, share your credentials.
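
    The first two guidelines can be combined in a short sketch built on urllib.robotparser from the standard library. The robots.txt content, the crawler name my-spider, and the PoliteScheduler class are hypothetical; a real crawler would download robots.txt from the root of the target site before crawling.

```python
import time
from urllib.robotparser import RobotFileParser

USER_AGENT = "my-spider"  # hypothetical crawler name
DEFAULT_DELAY = 10        # seconds, the fallback delay suggested above

# Hypothetical robots.txt body, used here instead of a live download
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""


def load_rules(robots_txt):
    """Parse robots.txt text into a RobotFileParser."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules


class PoliteScheduler:
    """Decides whether and when a URL on one site may be requested."""

    def __init__(self, rules, clock=time.monotonic, sleep=time.sleep):
        self.rules = rules
        self._clock = clock
        self._sleep = sleep
        self._last_hit = None
        # Prefer the site's own Crawl-delay, else fall back to the default
        self.delay = rules.crawl_delay(USER_AGENT) or DEFAULT_DELAY

    def allowed(self, url):
        """Return True if robots.txt permits fetching url."""
        return self.rules.can_fetch(USER_AGENT, url)

    def wait_turn(self):
        """Sleep until the delay since the previous request has passed."""
        if self._last_hit is not None:
            remaining = self.delay - (self._clock() - self._last_hit)
            if remaining > 0:
                self._sleep(remaining)
        self._last_hit = self._clock()


if __name__ == "__main__":
    scheduler = PoliteScheduler(load_rules(ROBOTS_TXT))
    for url in ("https://example.com/blog/post-1",
                "https://example.com/private/report"):
        print(url, "allowed" if scheduler.allowed(url) else "disallowed")
```

    A spider would call allowed() before queueing a URL and wait_turn() immediately before each request. The clock and sleep parameters exist only so that the waiting logic can be tested without real delays.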

    Conclusion

    We’ve seen the basics of scraping, frameworks, how to crawl, and the best practices of scraping. To conclude:

    • Follow the target website’s rules while scraping. Don’t make them block your spider.
    • Maintenance of data and spiders at scale is difficult. Use Docker/Kubernetes and public cloud providers like AWS to easily scale your web-scraping backend.
    • Always respect the rules of the websites you plan to crawl. If APIs are available, always use them first.
  • Building Your First AWS Serverless Application? Here’s Everything You Need to Know

    A serverless architecture is a way to implement and run applications, services, or microservices without the need to manage infrastructure. Your application still runs on servers, but all the server management is done by AWS. We no longer need to provision, scale, or maintain servers to run our applications, databases, and storage systems. Instead, we build on managed services, so developers don’t have to build every part of an application from scratch.

    Why Serverless

    1. More focus on development rather than managing servers.
    2. Cost Effective.
    3. Application which scales automatically.
    4. Quick application setup.

    Services for Serverless

    Multiple services for implementing a serverless architecture are provided by cloud providers, though we will mostly be exploring the services from AWS. The following are the services we can use, depending on the application requirements.

    1. Lambda: It is used to write business logic / schedulers / functions.
    2. S3: It is mostly used for storing objects, but it can also host web apps: you can host a static website on S3.
    3. API Gateway: It is used for creating, publishing, maintaining, monitoring and securing REST and WebSocket APIs at any scale.
    4. Cognito: It provides authentication, authorization & user management for your web and mobile apps. Your users can sign in directly with a username and password or through third parties such as Facebook, Amazon, or Google.
    5. DynamoDB: It is a fully managed NoSQL database service that provides fast and predictable performance with seamless scalability.

    Three-tier Serverless Architecture

    So, let’s take a use case in which you want to develop a three-tier serverless application. The three-tier architecture is a popular pattern for user-facing applications. The tiers that comprise the architecture are the presentation tier, the logic tier and the data tier. The presentation tier represents the component that users directly interact with, such as a web page or mobile app UI. The logic tier contains the code required to translate user actions at the presentation tier into the functionality that drives the application’s behaviour. The data tier consists of the storage media (databases, file systems, object stores) that hold the data relevant to the application. The figure shows a simple three-tier application.

     Figure: Simple Three-Tier Architectural Pattern

    Presentation Tier

    The presentation tier represents the View part of the application. Here you can use S3 to host a static website. On a static website, individual web pages contain static content and may also contain client-side scripts.

    The following is a quick procedure to configure an Amazon S3 bucket for static website hosting in the S3 console.

    To configure an S3 bucket for static website hosting

    1. Log in to the AWS Management Console and open the S3 console at

    2. In the Bucket name list, choose the name of the bucket that you want to enable static website hosting for.

    3. Choose Properties.

    4. Choose Static Website Hosting

    Once you enable your bucket for static website hosting, browsers can access all of your content through the Amazon S3 website endpoint for your bucket.

    5. Choose Use this bucket to host.

    A. For Index Document, type the name of your index document, which is typically named index.html. When you configure an S3 bucket for website hosting, you must specify an index document, which S3 returns when requests are made to the root domain or any of the subfolders.

    B. (Optional) For 4XX errors, you can provide your own custom error document that gives additional guidance to your users. Type the name of the file that contains the custom error document. If an error occurs, S3 returns this error document.

    C. (Optional) If you want to specify advanced redirection rules, use XML to describe the rules in the edit redirection rules text box.
    E.g.

    <RoutingRules>
        <RoutingRule>
            <Condition>
                <HttpErrorCodeReturnedEquals>403</HttpErrorCodeReturnedEquals>
            </Condition>
            <Redirect>
                <HostName>mywebsite.com</HostName>
                <ReplaceKeyPrefixWith>notfound/</ReplaceKeyPrefixWith>
            </Redirect>
        </RoutingRule>
    </RoutingRules>

    6. Choose Save

    7. Add a bucket policy to the website bucket that grants everyone access to the objects in the bucket. When you configure an S3 bucket as a website, you must make the objects that you want to serve publicly readable. To do so, you write a bucket policy that grants everyone the s3:GetObject permission. The following bucket policy grants everyone read access to the objects in the example-bucket bucket.

    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "PublicReadGetObject",
                "Effect": "Allow",
                "Principal": "*",
                "Action": [
                    "s3:GetObject"
                ],
                "Resource": [
                    "arn:aws:s3:::example-bucket/*"
                ]
            }
        ]
    }

    Note: If you choose Disable Website Hosting, S3 removes the website configuration from the bucket, so the bucket is no longer accessible from the website endpoint, but it is still available at the REST endpoint.
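
    The console steps above can also be written down as data. The following is a minimal sketch, not the only way to do it: the helper names, the bucket name example-bucket and the file names index.html and error.html are assumptions made for illustration. It only builds the dictionaries; the commented lines at the end show where boto3’s put_bucket_website and put_bucket_policy calls would take them.

```python
import json


def website_configuration(index_doc="index.html", error_doc=None):
    """Build the website configuration from step 5 (index and error documents)."""
    config = {"IndexDocument": {"Suffix": index_doc}}
    if error_doc:
        config["ErrorDocument"] = {"Key": error_doc}
    return config


def public_read_policy(bucket_name):
    """Build the public-read bucket policy from step 7 for the given bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "PublicReadGetObject",
                "Effect": "Allow",
                "Principal": "*",
                "Action": ["s3:GetObject"],
                "Resource": ["arn:aws:s3:::{}/*".format(bucket_name)],
            }
        ],
    }


# Hypothetical bucket and file names, used only for illustration.
config = website_configuration("index.html", "error.html")
policy = public_read_policy("example-bucket")
print(json.dumps(policy, indent=4))

# With boto3 installed and AWS credentials configured, these could be applied as:
#   s3 = boto3.client("s3")
#   s3.put_bucket_website(Bucket="example-bucket", WebsiteConfiguration=config)
#   s3.put_bucket_policy(Bucket="example-bucket", Policy=json.dumps(policy))
```

    Keeping the policy in a function makes it harder to forget the trailing /* in the resource ARN, which is what makes the policy apply to the objects rather than to the bucket itself.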

    Logic Tier

    The logic tier represents the brains of the application. This is where the two core serverless services, API Gateway and Lambda, come together to form your logic tier. The features of these two services allow you to build a serverless production application that is highly scalable, available and secure. Your application could use any number of servers, yet by leveraging this pattern you do not have to manage a single one. In addition, by using these managed services together you get the following benefits:

    1. No operating system to choose, secure or manage.
    2. No servers to right-size or monitor.
    3. No risk to your cost from over-provisioning.
    4. No risk to your performance from under-provisioning.

    API Gateway

    API Gateway is a fully managed service for defining, deploying and maintaining APIs. Anyone can integrate with the APIs using standard HTTPS requests. In addition, it has specific features and qualities that make it well suited to sit at the edge of your logic tier.

    Integration with Lambda

    API Gateway gives your application a simple way to invoke AWS Lambda functions directly through HTTPS requests. API Gateway forms the bridge that connects your presentation tier and the functions you write in Lambda. After you define the client/server relationship using your API, the contents of the client’s HTTPS requests are passed to the Lambda function for execution. The contents include request metadata, request headers and the request body.
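
    To make the hand-off concrete, here is a minimal sketch of a handler that reads the request metadata, headers and body. It assumes the method uses Lambda proxy integration, in which case the event carries keys such as httpMethod, headers, queryStringParameters and body; with a non-proxy integration the event shape depends on your mapping template. The sample event and the name field are made up for illustration.

```python
import json


def lambda_handler(event, context):
    """Read parts of the incoming HTTPS request and echo them back."""
    headers = event.get("headers") or {}
    query = event.get("queryStringParameters") or {}
    raw_body = event.get("body")
    body = json.loads(raw_body) if raw_body else {}

    # Prefer the request body, then the query string, then a default.
    name = body.get("name") or query.get("name") or "world"
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps({
            "message": "Hello, {}!".format(name),
            "method": event.get("httpMethod"),
            "userAgent": headers.get("User-Agent"),
        }),
    }


# A hand-written event shaped like a proxy-integration request (illustrative only).
sample_event = {
    "httpMethod": "POST",
    "headers": {"User-Agent": "curl/8.0"},
    "queryStringParameters": None,
    "body": json.dumps({"name": "Ada"}),
}
response = lambda_handler(sample_event, None)
print(response["body"])
```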

    API Performance Across the Globe

    Each edge-optimized deployment of API Gateway includes an Amazon CloudFront distribution under the covers. Amazon CloudFront is a content delivery web service that uses Amazon’s global network of edge locations as connection points for clients integrating with your API. This helps drive down the total response latency of your API. Through its use of multiple edge locations across the world, Amazon CloudFront also gives you capabilities to combat distributed denial of service (DDoS) attacks.

    You can improve the performance of specific API requests by using API Gateway to store responses in an optional in-memory cache. This not only provides performance benefits for repeated API requests, but also reduces backend executions, which can reduce overall cost.
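
    The idea behind response caching can be illustrated with a few lines of Python. This is a conceptual sketch only, not how API Gateway implements its cache: the TTLCache class, the 300-second lifetime and the fake backend function are all assumptions chosen for the example.

```python
import time


class TTLCache:
    """Tiny in-memory cache whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl_seconds = ttl_seconds
        self.clock = clock
        self._entries = {}  # key -> (expires_at, value)

    def get_or_compute(self, key, compute):
        now = self.clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # cache hit: the backend is not called
        value = compute()  # cache miss or expired entry: call the backend
        self._entries[key] = (now + self.ttl_seconds, value)
        return value


backend_calls = []


def backend(path):
    """Stand-in for a Lambda invocation; records every call it receives."""
    backend_calls.append(path)
    return "response for " + path


cache = TTLCache(ttl_seconds=300)
first = cache.get_or_compute("GET /items", lambda: backend("/items"))
second = cache.get_or_compute("GET /items", lambda: backend("/items"))
print(len(backend_calls))
```

    The second request is served from the cache, so the backend runs once for two identical requests. That is the cost saving the paragraph above describes.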

    Let’s dive into each step

    1. Create Lambda Function
    Log in to the AWS Console, head over to the Lambda service and click on “Create A Function”.

    A. Choose the first option, “Author from scratch”
    B. Enter a function name
    C. Select a runtime, e.g. a Python 3 runtime (Lambda no longer offers Python 2.7)
    D. Click on “Create Function”

    Once your function is ready, you will see a basic function generated in the language you chose.
    E.g.

    import json
    
    def lambda_handler(event, context):
        # TODO implement
        return {
            'statusCode': 200,
            'body': json.dumps('Hello from Lambda!')
        }

    2. Testing Lambda Function

    Click on the “Test” button at the top right corner, where we need to configure a test event. As we are not sending any event data, keep the “Hello World” template as it is, give the event a name and click “Create”.

    Now, when you hit the “Test” button again, it runs through testing the function we created earlier and returns the configured value.
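
    The same check can be run on your own machine before uploading anything, because a handler is just a Python function. The sketch below repeats the generated handler and calls it directly; the test event contents are arbitrary, and passing None as the context is an assumption that only works because this handler never touches it.

```python
import json


def lambda_handler(event, context):
    # Same body as the function generated by the console.
    return {
        'statusCode': 200,
        'body': json.dumps('Hello from Lambda!')
    }


# Lambda normally supplies a context object; None is enough here
# because the handler ignores it.
result = lambda_handler({"key1": "value1"}, None)
assert result['statusCode'] == 200
print(json.loads(result['body']))
```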

    Create & Configure API Gateway connecting to Lambda

    We are done creating the Lambda function, but how do we invoke it from the outside world? We need an endpoint, right?

    Go to API Gateway and click on “Get Started”. Agree to create the Example API if prompted; we will not use that API and will create a “New API” instead. Give it a name and keep the “Endpoint Type” as Regional for now.

    Create the API and you will land on the “Resources” page of the created API. Go through the following steps:

    A. Click on “Actions”, then click on “Create Method”. Select the GET method for our function. Then click the tick mark on the right side of “GET” to set it up.
    B. Choose “Lambda Function” as the integration type.
    C. Choose the region where we created the function earlier.
    D. Enter the name of the Lambda function we created.
    E. Save the method. It will ask you to confirm “Add Permission to Lambda Function”. Agree to that and you are done.
    F. Now we can test our setup. Click on “Test” to run the API. It should return the response text we saw on the Lambda test screen.

    Now, to get an endpoint, we need to deploy the API. On the Actions dropdown, click on Deploy API under API Actions. Fill in the deployment details and hit Deploy.

    After that, we will get our HTTPS endpoint.

    On the resulting screen you can see settings such as caching, throttling and logging, which can be configured. Save the changes and browse to the invoke URL; it returns the same response we got earlier from Lambda. With that, the logic tier of our serverless application is done.
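
    An invoke URL for a deployed REST API follows the pattern https://{api-id}.execute-api.{region}.amazonaws.com/{stage}/. The sketch below builds such a URL and shows how it could be called from Python. The API ID abc123, the region and the stage name dev are placeholders, and the helper names are assumptions for this example.

```python
from urllib.request import urlopen


def invoke_url(api_id, region, stage, resource_path=""):
    """Build the invoke URL of a deployed API Gateway stage."""
    base = "https://{}.execute-api.{}.amazonaws.com/{}".format(api_id, region, stage)
    return base + "/" + resource_path.lstrip("/")


def call_api(url, timeout=10):
    """Send a GET request and return the response body as text."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


# Placeholder values; use the ones shown for your own deployment.
url = invoke_url("abc123", "us-east-1", "dev")
print(url)

# Once the API is deployed, call_api(url) would return the Lambda response.
```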

    Data Tier

    By using Lambda as your logic tier, you have a number of data storage options for your data tier. These options fall into broad categories: Amazon VPC hosted data stores and IAM-enabled data stores. Lambda has the ability to integrate with both securely.

    Amazon VPC Hosted Data Stores

    1. Amazon RDS
    2. Amazon ElastiCache
    3. Amazon Redshift

    IAM-Enabled Data Stores

    1. Amazon DynamoDB
    2. Amazon S3
    3. Amazon Elasticsearch Service

    You can use any of these for storage, but DynamoDB is one of the best suited for serverless applications.

    Why DynamoDB?

    1. It is a NoSQL database that is fully managed by AWS.
    2. It provides fast & predictable performance with seamless scalability.
    3. DynamoDB lets you offload the administrative burden of operating and scaling a distributed system.
    4. It offers encryption at rest, which eliminates the operational burden and complexity involved in protecting sensitive data.
    5. You can scale your tables’ throughput capacity up or down without downtime or performance degradation.
    6. It provides on-demand backups as well as point-in-time recovery for your DynamoDB tables.
    7. DynamoDB allows you to delete expired items from table automatically to help you reduce storage usage and the cost of storing data that is no longer relevant.
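
    Automatic expiry (point 7) relies on a Time to Live attribute: a Number attribute on each item that holds the expiry time as a Unix epoch timestamp in seconds. The following sketch shows one way to stamp an item before writing it. The attribute name expires_at and the helper with_ttl are assumptions; the name must match whatever TTL attribute is configured on your table.

```python
import time


def with_ttl(item, days, now=None, attribute="expires_at"):
    """Return a copy of item with an expiry timestamp in epoch seconds."""
    now = int(time.time()) if now is None else int(now)
    expiring = dict(item)
    expiring[attribute] = now + days * 24 * 60 * 60
    return expiring


# A fixed "now" keeps the example reproducible; omit it to use the current time.
session = with_ttl({"session_id": "abc"}, days=7, now=1700000000)
print(session)

# The result can be written with table.put_item(Item=session) using boto3.
```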

    The following is a sample script for DynamoDB with Python, which you can use with Lambda.

    from __future__ import print_function # Python 2/3 compatibility
    import boto3
    import json
    import decimal
    from botocore.exceptions import ClientError

    # Helper class to convert a DynamoDB item to JSON.
    # DynamoDB returns numbers as decimal.Decimal, which json cannot serialize by default.
    class DecimalEncoder(json.JSONEncoder):
        def default(self, o):
            if isinstance(o, decimal.Decimal):
                # "!= 0" also handles negative fractions, where the remainder is negative.
                if o % 1 != 0:
                    return float(o)
                else:
                    return int(o)
            return super(DecimalEncoder, self).default(o)

    # endpoint_url targets a DynamoDB instance running locally on port 8000;
    # drop that argument to talk to the real DynamoDB service, e.g. from Lambda.
    dynamodb = boto3.resource("dynamodb", region_name='us-west-2', endpoint_url="http://localhost:8000")

    table = dynamodb.Table('Movies')

    title = "The Big New Movie"
    year = 2015

    try:
        response = table.get_item(
            Key={
                'year': year,
                'title': title
            }
        )
    except ClientError as e:
        print(e.response['Error']['Message'])
    else:
        # 'Item' is absent from the response when no item matches the key.
        item = response.get('Item')
        if item is None:
            print("GetItem succeeded, but no matching item was found")
        else:
            print("GetItem succeeded:")
            print(json.dumps(item, indent=4, cls=DecimalEncoder))
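
    The encoder is needed because boto3 returns DynamoDB numbers as decimal.Decimal, which the json module refuses to serialize. The sketch below demonstrates that without touching AWS. The item is hand-written to resemble what get_item might return; its values are made up for illustration.

```python
import decimal
import json


class DecimalEncoder(json.JSONEncoder):
    def default(self, o):
        if isinstance(o, decimal.Decimal):
            # Whole numbers become int, everything else becomes float.
            return int(o) if o == o.to_integral_value() else float(o)
        return super().default(o)


# Shaped like a DynamoDB item; the values are illustrative only.
item = {
    "year": decimal.Decimal("2015"),
    "title": "The Big New Movie",
    "info": {"rating": decimal.Decimal("8.5")},
}

try:
    json.dumps(item)
except TypeError as error:
    print("plain json.dumps fails:", error)

encoded = json.dumps(item, cls=DecimalEncoder)
print(encoded)
```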

    Note: To run the above script successfully, you need to attach a policy to the role used by your Lambda function. In this case the policy must allow the DynamoDB operations and, if you want to store logs, CloudWatch Logs as well. The following is a policy you can attach to your role; replace the region, account ID and table name in the resource ARNs with your own.

    {
    	"Version": "2012-10-17",
    	"Statement": [{
    			"Effect": "Allow",
    			"Action": [
    				"dynamodb:BatchGetItem",
    				"dynamodb:GetItem",
    				"dynamodb:Query",
    				"dynamodb:Scan",
    				"dynamodb:BatchWriteItem",
    				"dynamodb:PutItem",
    				"dynamodb:UpdateItem"
    			],
    			"Resource": "arn:aws:dynamodb:eu-west-1:123456789012:table/SampleTable"
    		},
    		{
    			"Effect": "Allow",
    			"Action": [
    				"logs:CreateLogStream",
    				"logs:PutLogEvents"
    			],
    			"Resource": "arn:aws:logs:eu-west-1:123456789012:*"
    		},
    		{
    			"Effect": "Allow",
    			"Action": "logs:CreateLogGroup",
    			"Resource": "*"
    		}
    	]
    }
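
    The resource ARN in the policy has to name the table your function actually uses. As a sketch, the helpers below build the ARN for the Movies table in us-west-2 from the script above and a narrower statement for a function that only reads. The account ID is the placeholder from the policy, and both helper names are assumptions for this example.

```python
import json


def table_arn(region, account_id, table_name):
    """Build the ARN of a DynamoDB table."""
    return "arn:aws:dynamodb:{}:{}:table/{}".format(region, account_id, table_name)


def read_only_statement(arn):
    """A narrower policy statement for functions that only read from the table."""
    return {
        "Effect": "Allow",
        "Action": ["dynamodb:GetItem", "dynamodb:Query"],
        "Resource": arn,
    }


# Placeholder account ID; region and table name match the sample script.
arn = table_arn("us-west-2", "123456789012", "Movies")
statement = read_only_statement(arn)
print(json.dumps(statement, indent=4))
```

    Granting only the actions a function needs keeps the blast radius small if the function is ever misused.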

    Sample Architecture Patterns

    You can implement the following popular architecture patterns using API Gateway & Lambda as your logic tier, Amazon S3 as your presentation tier and DynamoDB as your data tier. For each example, we will only use AWS services that do not require users to manage their own infrastructure.

    Mobile Backend

    1. Presentation Tier: A mobile application running on each user’s smartphone.

    2. Logic Tier: API Gateway & Lambda. The logic tier is globally distributed by the Amazon CloudFront distribution created as part of each API Gateway API. A set of Lambda functions can be dedicated to user/device identity management and authentication, managed by Amazon Cognito, which provides integration with IAM for temporary user access credentials as well as with popular third-party identity providers. Other Lambda functions can define the core business logic for your mobile back end.

    3. Data Tier: The various data storage services can be leveraged as needed; options are given above in data tier.

    Amazon S3 Hosted Website

    1. Presentation Tier: Static website content hosted on S3, distributed by Amazon CloudFront. Hosting static website content on S3 is a cost-effective alternative to hosting content on server-based infrastructure. However, for a website to offer rich features, the static content often must integrate with a dynamic back end.

    2. Logic Tier: API Gateway & Lambda. Static web content hosted in S3 can integrate directly with API Gateway, which can be CORS compliant.

    3. Data Tier: The various data storage services can be leveraged based on your requirement.

    Serverless Costing

    At the top of the AWS invoice, we can see the total cost of the AWS services. The bill was processed for 2.1 million API requests & all of the infrastructure required to support them.

    Following is the list of services with their cost.

    Note: You can estimate your costs with the AWS calculators using the following links:

    1. https://calculator.s3.amazonaws.com/index.html
    2. AWS Pricing Calculator

    Conclusion

    The three-tier architecture pattern encourages the best practice of creating application components that are decoupled, scalable and easy to develop and maintain. The serverless services you choose will vary based on your application’s requirements.

  • A Practical Guide to Deploying Multi-tier Applications on Google Container Engine (GKE)

    Introduction

    All modern era programmers can attest that containerization has afforded more flexibility and allows us to build truly cloud-native applications. Containers provide portability: the ability to easily move applications across environments. However, complex applications comprise many (tens or hundreds of) containers. Managing such applications is a real challenge, and that’s where container orchestration and scheduling platforms like Kubernetes, Mesosphere, Docker Swarm, etc. come into the picture.
    Kubernetes, backed by Google, is leading the pack given that Red Hat, Microsoft and now Amazon are putting their weight behind it.

    Kubernetes can run on any cloud or bare metal infrastructure. Setting up & managing Kubernetes can be a challenge, but Google provides an easy way to use Kubernetes through the Google Container Engine (GKE) service.

    What is GKE?

    Google Container Engine is a management and orchestration system for containers. In short, it is hosted Kubernetes. The goal of GKE is to increase the productivity of DevOps and development teams by hiding the complexity of setting up the Kubernetes cluster, the overlay network, etc.

    Why GKE? What are the things that GKE does for the user?

    • GKE abstracts away the complexity of managing a highly available Kubernetes cluster.
    • GKE takes care of the overlay network.
    • GKE provides built-in authentication.
    • GKE provides built-in auto-scaling.
    • GKE provides easy integration with the Google storage services.

    In this blog, we will see how to create your own Kubernetes cluster in GKE and how to deploy a multi-tier application in it. The blog assumes you have a basic understanding of Kubernetes and have used it before. It also assumes you have created an account with Google Cloud Platform. If you are not familiar with Kubernetes, this guide from Deis is a good place to start.

    Google provides a command-line tool, gcloud, which is the primary command-line interface to all Google Cloud Platform products and services. The gcloud tool can be used in scripts to automate tasks or directly from the command line. Follow this guide to install the gcloud tool.

    Now let’s begin! The first step is to create the cluster.

    Basic Steps to create cluster

    In this section, I will explain how to create a GKE cluster. We will use the command-line tool to set up the cluster.

    Set the zone in which you want to deploy the cluster

    $ gcloud config set compute/zone us-west1-a

    Create the cluster using the following command:

    $ gcloud container --project <project-name> \
        clusters create <cluster-name> \
        --machine-type n1-standard-2 \
        --image-type "COS" --disk-size "50" \
        --num-nodes 2 --network default \
        --enable-cloud-logging --no-enable-cloud-monitoring

    Let’s try to understand what each of these parameters means:

    --project: Project name.

    --machine-type: Type of the machine, like n1-standard-2 or n1-standard-4.

    --image-type: OS image. “COS” is the Container-Optimized OS from Google: More Info here.

    --disk-size: Disk size of each instance.

    --num-nodes: Number of nodes in the cluster.

    --network: Network that you want to use for the cluster. In this case, we are using the default network.

    Apart from the above options, you can also use the following to provide specific requirements while creating the cluster:

    --scopes: Scopes enable containers to directly access Google services without needing credentials. You can specify a comma-separated list of scope APIs. For example:

    • Compute: Lets you view and manage your Google Compute Engine resources
    • Logging.write: Submit log data to Stackdriver.

    You can find all the Scopes that Google supports here: .

    --additional-zones: Specify additional zones for high availability. E.g. --additional-zones us-east1-b,us-east1-d. Here Kubernetes will create a cluster in 3 zones (the one specified at the beginning and the additional 2 here).

    --enable-autoscaling: Enables the autoscaling option. If you specify this option, you also have to specify the minimum and maximum number of nodes. You can read more about how auto-scaling works here. E.g. --enable-autoscaling --min-nodes=15 --max-nodes=50
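
    If you create clusters from scripts, it can help to assemble the command from parameters instead of editing a long string by hand. The sketch below only builds and prints the argument list, using flags that appear above; the function name and the project and cluster names are assumptions for illustration.

```python
import shlex


def create_cluster_command(project, cluster, machine_type="n1-standard-2",
                           num_nodes=2, min_nodes=None, max_nodes=None):
    """Build the argument list for `gcloud container clusters create`."""
    args = [
        "gcloud", "container", "--project", project,
        "clusters", "create", cluster,
        "--machine-type", machine_type,
        "--num-nodes", str(num_nodes),
    ]
    if min_nodes is not None or max_nodes is not None:
        # Autoscaling needs both bounds, and they must be ordered.
        if min_nodes is None or max_nodes is None or min_nodes > max_nodes:
            raise ValueError("autoscaling needs min_nodes <= max_nodes")
        args += [
            "--enable-autoscaling",
            "--min-nodes={}".format(min_nodes),
            "--max-nodes={}".format(max_nodes),
        ]
    return args


# Hypothetical project and cluster names.
command = create_cluster_command("my-project", "my-first-cluster",
                                 min_nodes=2, max_nodes=5)
print(" ".join(shlex.quote(arg) for arg in command))

# subprocess.run(command, check=True) would execute it if gcloud is installed.
```

    Passing the arguments as a list rather than one shell string avoids quoting problems when names contain spaces or special characters.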

    Next, fetch the credentials of the created cluster. This step updates the credentials in the kubeconfig file so that kubectl points to the required cluster.

    $ gcloud container clusters get-credentials my-first-cluster --project project-name

    Now your first Kubernetes cluster is ready. Let’s check the cluster information & health.

    $ kubectl get nodes
    NAME                                            STATUS   AGE   VERSION
    gke-first-cluster-default-pool-d344484d-vnj1    Ready    2h    v1.6.4
    gke-first-cluster-default-pool-d344484d-kdd7    Ready    2h    v1.6.4
    gke-first-cluster-default-pool-d344484d-ytre2   Ready    2h    v1.6.4

    After creating the cluster, let’s see how to deploy a multi-tier application on it. Let’s use a simple Python Flask app which will greet the user, store employee data & get employee data.

    Application Deployment

    I have created a simple Python Flask application to deploy on the K8S cluster created using GKE. You can go through the source code here. If you check the source code, you will find the following directory structure:

    TryGKE/
    ├── Dockerfile
    ├── mysql-deployment.yaml
    ├── mysql-service.yaml
    ├── src
    │   ├── app.py
    │   └── requirements.txt
    ├── testapp-deployment.yaml
    └── testapp-service.yaml

    In this, I have written a Dockerfile for the Python Flask application in order to build our own image to deploy. For MySQL, we won’t build an image of our own. We will use the latest MySQL image from the public docker repository.

    Before deploying the application, let’s re-visit some of the important Kubernetes terms:

    Pods:

    The pod is a Docker container or a group of Docker containers which are deployed together on the host machine. It acts as a single unit of deployment.

    Deployments:

    Deployment is an entity which manages the ReplicaSets and provides declarative updates to pods. It is recommended to use Deployments instead of directly using ReplicaSets. We can use deployment to create, remove and update ReplicaSets. Deployments have the ability to rollout and rollback the changes.

    Services:

    Service in K8S is an abstraction which will connect you to one or more pods. You can connect to pod using the pod’s IP Address but since pods come and go, their IP Addresses change.  Services get their own IP & DNS and those remain for the entire lifetime of the service. 

    Each tier in an application is represented by a Deployment. A Deployment is described by the YAML file. We have two YAML files – one for MySQL and one for the Python application.

    1. MySQL Deployment YAML

    apiVersion: extensions/v1beta1
    kind: Deployment
    metadata:
      name: mysql
    spec:
      template:
        metadata:
          labels:
            app: mysql
        spec:
          containers:
            - env:
                - name: MYSQL_DATABASE
                  value: admin
                - name: MYSQL_ROOT_PASSWORD
                  value: admin
              image: 'mysql:latest'
              name: mysql
              ports:
                - name: mysqlport
                  containerPort: 3306
                  protocol: TCP

    2. Python Application Deployment YAML

    apiVersion: apps/v1beta1
    kind: Deployment
    metadata:
      name: test-app
    spec:
      replicas: 1
      template:
        metadata:
          labels:
            app: test-app
        spec:
          containers:
          - name: test-app
            image: ajaynemade/pymy:latest
            imagePullPolicy: IfNotPresent
            ports:
            - containerPort: 5000

    Each Service is also represented by a YAML file as follows:

    1. MySQL service YAML

    apiVersion: v1
    kind: Service
    metadata:
      name: mysql-service
    spec:
      ports:
      - port: 3306
        targetPort: 3306
        protocol: TCP
        name: mysql
      selector:
        app: mysql

    2. Python Application service YAML

    apiVersion: v1
    kind: Service
    metadata:
      name: test-service
    spec:
      type: LoadBalancer
      ports:
      - name: test-service
        port: 80
        protocol: TCP
        targetPort: 5000
      selector:
        app: test-app

    You will find a ‘kind’ field in each YAML file. It is used to specify whether the given configuration is for deployment, service, pod, etc.

    In the Python app service YAML, I am using type = LoadBalancer. In GKE, there are two types of cloud load balancers available to expose the application to the outside world.

    1. TCP load balancer: This is a network (layer 4) load balancer that forwards TCP traffic to the cluster nodes. We will use this in our example.
    2. HTTP(s) load balancer: It can be created using Ingress. For more information, refer to this post that talks about Ingress in detail.

    In the MySQL service, I’ve not specified any type. In that case the default type, ‘ClusterIP’, is used, which makes sure that the MySQL container is exposed inside the cluster so the Python app can access it.

    If you check app.py, you can see that I have used “mysql-service.default” as the hostname. “mysql-service.default” is the DNS name of the service. The Python application refers to that DNS name while accessing the MySQL database.
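
    The name has the form <service>.<namespace>; inside the cluster it resolves the same way as the fully qualified <service>.<namespace>.svc.<cluster-domain>. The sketch below shows how an application might assemble its database settings from that name. It is not the code from app.py: the function names and the MYSQL_* environment variable overrides are assumptions, and cluster.local is only the usual default cluster domain.

```python
import os


def service_dns_name(service, namespace="default"):
    """Short in-cluster DNS name, as used in the article."""
    return "{}.{}".format(service, namespace)


def fully_qualified_name(service, namespace="default", cluster_domain="cluster.local"):
    """Fully qualified in-cluster DNS name of a Service."""
    return "{}.{}.svc.{}".format(service, namespace, cluster_domain)


def mysql_settings(env=None):
    """Connection settings, overridable through (hypothetical) environment variables."""
    env = os.environ if env is None else env
    return {
        "host": env.get("MYSQL_HOST", service_dns_name("mysql-service")),
        "port": int(env.get("MYSQL_PORT", "3306")),
        "database": env.get("MYSQL_DATABASE", "admin"),
    }


# An empty mapping is passed so the example does not depend on the local environment.
settings = mysql_settings(env={})
print(settings)
print(fully_qualified_name("mysql-service"))
```

    Because the application refers to the Service name rather than a pod IP, the MySQL pod can be rescheduled without any change on the application side.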

    Now, let’s actually set up the components from the configurations. We will first create the services, followed by the deployments.

    Services:

    $ kubectl create -f mysql-service.yaml
    $ kubectl create -f testapp-service.yaml

    Deployments:

    $ kubectl create -f mysql-deployment.yaml
    $ kubectl create -f testapp-deployment.yaml

    Check the status of the pods and services. Wait until all pods reach the Running state and the Python application service gets an external IP, like below:

    $ kubectl get services
    NAME            CLUSTER-IP      EXTERNAL-IP     PORT(S)        AGE
    kubernetes      10.55.240.1     <none>          443/TCP        5h
    mysql-service   10.55.240.57    <none>          3306/TCP       1m
    test-service    10.55.246.105   35.185.225.67   80:32546/TCP   11s

    Once you get the external IP, you should be able to make API calls using simple curl requests.

    E.g. to store data:

    curl -H "Content-Type: application/x-www-form-urlencoded" -X POST  http://35.185.225.67:80/storedata -d id=1 -d name=NoOne

    E.g. to get data:

    curl 35.185.225.67:80/getdata/1
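
    The same two calls can be made from Python with the standard library. The sketch below builds the requests without sending them, so it runs anywhere; the function names are assumptions, while the /storedata and /getdata paths and the form fields come from the curl commands above.

```python
from urllib.parse import urlencode
from urllib.request import Request


def store_request(base_url, employee_id, name):
    """Build the form-encoded POST request that stores an employee."""
    data = urlencode({"id": employee_id, "name": name}).encode("ascii")
    return Request(
        base_url + "/storedata",
        data=data,
        method="POST",
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )


def get_request(base_url, employee_id):
    """Build the GET request that fetches an employee by id."""
    return Request("{}/getdata/{}".format(base_url, employee_id), method="GET")


# The external IP from the example output; yours will differ.
base_url = "http://35.185.225.67"
post = store_request(base_url, 1, "NoOne")
get = get_request(base_url, 1)
print(post.get_method(), post.full_url, post.data)
print(get.get_method(), get.full_url)

# urllib.request.urlopen(post) would send the request to a running service.
```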

    At this stage your application is completely deployed and is externally accessible.

    Manual scaling of pods

    Scaling your application up or down in Kubernetes is quite straightforward. Let’s scale up the test-app deployment.

    $ kubectl scale deployment test-app --replicas=3

    The deployment configuration for test-app will get updated and you can see that 3 replicas of test-app are running. Verify it using:

    $ kubectl get pods

    In the same manner, you can scale down your application by reducing the replica count.

    Cleanup:

    Un-deploying an application from Kubernetes is also quite straightforward. All we have to do is delete the services and delete the deployments. The only caveat is that the deletion of the load balancer is an asynchronous process. You have to wait until it gets deleted.

    $ kubectl delete service mysql-service
    $ kubectl delete service test-service

    The above commands will deallocate the load balancer which was created as a part of test-service. You can check the status of the load balancer with the following command.

    $ gcloud compute forwarding-rules list

    Once the load balancer is deleted, you can clean-up the deployments as well.

    $ kubectl delete deployments test-app
    $ kubectl delete deployments mysql

    Delete the Cluster:

    $ gcloud container clusters delete my-first-cluster

    Conclusion

    In this blog, we saw how easy it is to deploy, scale & terminate applications on Google Container Engine. Google Container Engine abstracts away all the complexity of Kubernetes and gives us a robust platform to run containerised applications. I am super excited about what the future holds for Kubernetes!

    Check out some of Velotio’s other blogs on Kubernetes.

  • The Ultimate Beginner’s Guide to Jupyter Notebooks

    Jupyter Notebooks offer a great way to write and iterate on your Python code. They are a powerful tool for developing data science projects interactively. A Jupyter Notebook shows the source code and its corresponding output in a single place, helping you combine narrative text, visualizations and other rich media. The intuitive workflow promotes iterative and rapid development, making notebooks the first choice for data scientists. Creating Jupyter Notebooks is completely free, as it falls under Project Jupyter, which is completely open source.

    Project Jupyter is the successor to an earlier project, IPython Notebook, which was first published as a prototype in 2010. Jupyter Notebook is built on top of IPython, an interactive tool for executing Python code in the terminal using the REPL (Read-Eval-Print Loop) model. The IPython kernel executes the Python code and communicates with the Jupyter Notebook front-end interface. By extending IPython, Jupyter Notebooks also provide additional features such as storing your code and output and keeping Markdown notes alongside them.

    Although Jupyter Notebooks support using various programming languages, we will focus on Python and its application in this article.

    Getting Started with Jupyter Notebooks!

    Installation

    Prerequisites

    As you would have surmised from the above, you need to have Python installed on your machine. Either Python 2.7 or Python 3.x will do.

    Install Using Anaconda

    The simplest way to get started with Jupyter Notebooks is by installing it using Anaconda. Anaconda installs both Python3 and Jupyter and also includes quite a lot of packages commonly used in the data science and machine learning community. You can follow the latest guidelines from here.

    Install Using Pip

    If, for some reason, you decide not to use Anaconda, you can install Jupyter manually using pip, the Python package installer. Just run the command below:

    pip install jupyter

    Launching First Notebook

    Open your terminal and navigate to the directory where you would like to store your notebook. Then type the command below, and the program will start a local server at http://localhost:8888/tree.

    jupyter notebook

    A new window with the Jupyter Notebook interface will open in your internet browser. As you might have already noticed, Jupyter starts up a local Python server to serve web apps in your browser, where you can access the Dashboard and work with the Jupyter Notebooks. Jupyter Notebooks are platform independent, which makes it easier to collaborate and share with others.

    The list of all files is displayed under the Files tab, whereas all the running processes can be viewed by clicking on the Running tab. The third tab, Clusters, comes from IPython parallel, IPython’s parallel computing framework. It helps you control multiple engines, which are extended from the IPython kernel.

    Let’s start by making a new notebook. We can easily do this by clicking on the New drop-down list in the top-right corner of the dashboard. You will see that you have the option to make a Python 3 notebook as well as a regular text file, a folder and a terminal. Please select the Python 3 notebook option.

    Your Jupyter Notebook will open in a new tab, as shown in the image below.

    Each notebook opens in a new tab so that you can work with multiple notebooks simultaneously. If you go back to the dashboard tab, you will see the new file Untitled.ipynb with a green icon to its left, which indicates that your new notebook is running.

     

    Why a .ipynb file?

    .ipynb is the standard file format for storing Jupyter Notebooks, hence the file name Untitled.ipynb. Let’s begin by understanding what an .ipynb file is and what it might contain. Each .ipynb file is a text file that describes the content of your notebook in JSON format. The content of each cell, whether it is text, code or image attachments that have been converted into strings, is stored in the .ipynb file along with some additional metadata. You can also edit the metadata by selecting “Edit > Edit Notebook Metadata” from the menu options in the notebook.
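
    Since a notebook is plain JSON, it can be inspected with nothing more than the json module. The sketch below builds a small notebook-shaped dictionary by hand and lists its cell types. The two cells are made up for illustration, and only the commonly seen top-level keys of the version 4 format are included.

```python
import json

# A hand-written, minimal notebook structure (illustrative only).
notebook = {
    "cells": [
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": ["# My first notebook"],
        },
        {
            "cell_type": "code",
            "execution_count": None,
            "metadata": {},
            "outputs": [],
            "source": ["print(\"Hello World!\")"],
        },
    ],
    "metadata": {},
    "nbformat": 4,
    "nbformat_minor": 4,
}

# This is essentially what gets written to Untitled.ipynb.
text = json.dumps(notebook, indent=1)

# Reading it back is ordinary JSON parsing.
loaded = json.loads(text)
cell_types = [cell["cell_type"] for cell in loaded["cells"]]
print(cell_types)
```

    To look at a real notebook, replace the json.loads(text) call with json.load on an open .ipynb file.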

    You can also view the content of your notebook files by selecting “Edit” from the controls on the dashboard, but there’s no reason to do so unless you really want to edit the file manually.

    Understanding the Notebook Interface

    Now that you have created a notebook, let’s have a look at the various menu options and functions that are readily available. Take some time to scroll through the list of commands that opens up when you click on the keyboard icon (or press Ctrl + Shift + P).

    There are two prominent terms that you should learn about: cells and kernels. They are key both to understanding Jupyter and to what makes it more than just a content writing tool. Fortunately, these concepts are not difficult to understand.

    • A kernel is a program that interprets and executes the user’s code. The Jupyter Notebook App has an inbuilt kernel for Python code, but there are also kernels available for other programming languages.
    • A cell is a container which holds executable code or normal text.

    Cells

    Cells form the body of a notebook. If you look at the screenshot above for a new notebook (Untitled.ipynb), the text box with the green border is an empty cell. There are 4 types of cells:

    • Code – This is where you type your code and when executed the kernel will display its output below the cell.
    • Markdown – This is where you type your text formatted using Markdown and the output is displayed in place when it is run.
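    • Raw NBConvert – This is where you type content that should be passed through unmodified when the notebook is converted into another format (like HTML, PDF etc.) with nbconvert, the command line conversion tool.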
    • Heading – This is where you add Headings to separate sections and make your notebook look tidy and neat. This has now been merged into the Markdown option itself. Adding a ‘#’ at the beginning ensures that whatever you type after that will be taken as a heading.

    Let’s test out how the cells work with a basic “hello world” example. Type print("Hello World!") in the cell and press Ctrl + Enter or click on the Run button in the toolbar at the top.

    print("Hello World!")

    Hello World!

    When you run the cell, the output is displayed below it, and the label to its left changes from In[ ] to In[1]. While a cell is still running, Jupyter shows the label as In[*].

    Additionally, it is important to note that the output of a code cell comes from any of the print statements in the code cell, as well as the value of the last line in the cell, irrespective of it being a variable, function call or some other code snippet.
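
    The rule above (printed text plus the value of the last line) can be imitated in plain Python. The following is only a rough sketch of the idea, not how Jupyter itself is implemented; run_cell is a hypothetical helper written for this illustration.

```python
import ast
import contextlib
import io


def run_cell(source, namespace):
    """Run `source` roughly like a notebook cell; return (printed_text, last_value)."""
    tree = ast.parse(source)
    last_expr = None
    # If the cell ends with a bare expression, split it off so it can be evaluated.
    if tree.body and isinstance(tree.body[-1], ast.Expr):
        last_expr = ast.Expression(tree.body.pop().value)

    printed = io.StringIO()
    value = None
    with contextlib.redirect_stdout(printed):
        # Execute everything except the trailing expression.
        exec(compile(tree, "<cell>", "exec"), namespace)
        if last_expr is not None:
            # Evaluate the trailing expression to obtain the cell's "result".
            value = eval(compile(last_expr, "<cell>", "eval"), namespace)
    return printed.getvalue(), value


printed, value = run_cell("x = 20\nprint('working...')\nx * 2", {})
```

    Here printed collects whatever the print calls wrote, while value holds the result of the final line x * 2, mirroring the two sources of output a code cell has.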

    Markdown

    Markdown is a lightweight markup language for formatting plain text. Its syntax maps closely onto HTML tags. As this article has been written in a Jupyter notebook, all of the narrative text and images you can see are written in Markdown. Let’s go through the basics with the following example.

    # This is a level 1 heading 
    ### This is a level 3 heading
    This is how you write some plain text that would form a paragraph.
    You can emphasize the text by enclosing the text like "**" or "__" to make it bold and enclosing the text in "*" or "_" to make it italic. 
    Paragraphs are separated by an empty line.
    * We can include lists.
      * And also indent them.
    
    1. Getting Numbered lists is
    2. Also easy.
    
    [To include hyperlinks enclose the text with square braces and then add the link url in round braces](http://www.example.com)
    
    Inline code uses single backticks: `foo()`, and code blocks use triple backticks:
    
    ``` 
    foo()
    ```
    
    Or can be indented by 4 spaces: 
    
        foo()
        
    And finally, adding images is easy: ![Online Image](https://www.example.com/image.jpg) or ![Local Image](img/image.jpg) or ![Image Attachment](attachment:image.jpg)

    We have 3 different ways to attach images:

    • Link the URL of an image from the web.
    • Use the relative path of an image present locally.
    • Add an attachment to the notebook by using the “Edit>Insert Image” option. This method converts the image into a string and stores it inside your notebook.

    Note that adding an image as an attachment will make the .ipynb file much larger because it is stored inside the notebook in a string format.

    There are a lot more features available in Markdown. To learn more about markdown, you can refer to the official guide from the creator, John Gruber, on his website.

    Kernels

    Every notebook runs on top of a kernel. Whenever you execute a code cell, the content of the cell is executed within the kernel and any output is returned to the cell for display. The kernel’s state applies to the document as a whole rather than to individual cells, and it persists between cell executions.

    For example, if you declare a variable or import some libraries in a cell, they will be accessible in other cells. Now let’s understand this with the help of an example. First we’ll import a Python package and then define a function.

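    import os
    import binascii

    # Note: this name shadows Python's built-in sum(); fine for a demo, best avoided in real code.
    def sum(x, y):
        return x + y

    Once the cell above is executed, we can reference os, binascii and sum in any other cell.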

    rand_hex_string = binascii.b2a_hex(os.urandom(15)) 
    print(rand_hex_string)
    x = 1
    y = 2
    z = sum(x,y)
    print('Sum of %d and %d is %d' % (x, y, z))

    The output should look something like this:

    b'c84766ca4a3ce52c3602bbf02ad1f7'
    Sum of 1 and 2 is 3
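
    The way cells share state through the kernel can be mimicked with a single dictionary that every “cell” runs against. This is a simplified sketch under that assumption, not the actual kernel machinery; the cell contents and the add function are made up for the illustration.

```python
# One dictionary plays the role of the kernel's memory for the whole notebook.
kernel_namespace = {}

cell_1 = """
def add(x, y):
    return x + y
"""

cell_2 = """
z = add(1, 2)
"""

# Running the cells in order: the second cell sees what the first one defined.
exec(cell_1, kernel_namespace)
exec(cell_2, kernel_namespace)
z_before_restart = kernel_namespace["z"]

# A restart throws the memory away, so the same cell now fails on its own.
kernel_namespace = {}
try:
    exec(cell_2, kernel_namespace)
    restart_error = None
except NameError as error:
    restart_error = str(error)
```

    After the simulated restart, the second cell raises a NameError because add no longer exists, which is the same situation you hit when you run a notebook cell before the cell it depends on.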

    The execution flow of a notebook is generally top-to-bottom, but it’s common to go back and make changes. The order of execution shown to the left of each cell, such as In [2], lets you know whether any of your cells have stale output. Additionally, there are multiple options in the Kernel menu which often come in very handy.

    • Restart: restarts the kernel, thus clearing all the variables etc that were defined.
    • Restart & Clear Output: same as above but will also wipe the output displayed below your code cells.
    • Restart & Run All: same as above but will also run all your cells in order from top-to-bottom.
    • Interrupt: If your kernel is ever stuck on a computation and you wish to stop it, you can choose the Interrupt option.

    Naming Your Notebooks

    It is always a best practice to give a meaningful name to your notebooks. You can rename your notebooks from the notebook app itself by double-clicking on the existing name at the top left corner. You can also use the dashboard or the file browser to rename the notebook file. We’ll head back to the dashboard to rename the file we created earlier, which will have the default notebook file name Untitled.ipynb.

    Now that you are back on the dashboard, you can simply select your notebook and click “Rename” in the dashboard controls.


    Shutting Down your Notebooks

    We can shut down a running notebook by selecting “File > Close and Halt” from the notebook menu. We can also shut down the kernel either by selecting the notebook in the dashboard and clicking “Shutdown” or by going to “Kernel > Shutdown” from within the notebook app (see images below).

    Shutdown the kernel from Notebook App:

     

    Shutdown the kernel from Dashboard:

     

     

    Sharing Your Notebooks

    When we talk about sharing a notebook, there are two things that might come to mind. In most cases, we want to share the end result of the work, i.e. a non-interactive, pre-rendered version of the notebook, much like this article. In some cases, however, we might want to share the code and collaborate with others on notebooks with the aid of version control systems such as Git, which is also possible.

    Before You Start Sharing

    The state of the shared notebook, including the output of any code cells, is maintained when it is exported to a file. Hence, to ensure that the notebook is share-ready, we should follow the steps below before sharing.

    1. Click “Cell > All Output > Clear”
    2. Click “Kernel > Restart & Run All”
    3. After the code cells have finished executing, validate the output. 

    This ensures that your notebooks don’t have a stale state or contain intermediary output.
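
    If you ever need to do the first step outside the browser, clearing outputs amounts to a small change in the notebook's JSON. The sketch below assumes the version 4 notebook layout and works on an already-loaded dictionary; clear_all_outputs is a hypothetical helper, and it does not re-run anything, so steps 2 and 3 still have to happen in Jupyter.

```python
import copy


def clear_all_outputs(notebook):
    """Return a copy of a notebook dict with every code cell's output removed."""
    cleaned = copy.deepcopy(notebook)
    for cell in cleaned.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []            # drop rendered results
            cell["execution_count"] = None  # reset the In[n] counter
    return cleaned


# A made-up two-cell notebook used only to exercise the helper.
sample = {
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["Some notes"]},
        {
            "cell_type": "code",
            "execution_count": 7,
            "metadata": {},
            "outputs": [{"name": "stdout", "output_type": "stream", "text": ["3\n"]}],
            "source": ["print(1 + 2)"],
        },
    ],
    "metadata": {},
    "nbformat": 4,
    "nbformat_minor": 2,
}

cleaned = clear_all_outputs(sample)
```

    Working on a deep copy keeps the original dictionary untouched, so the helper is safe to try on a notebook you have just loaded.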

    Exporting Your Notebooks

    Jupyter has built-in support for exporting to HTML, Markdown and PDF as well as several other formats, which you can find from the menu under “File > Download as” . It is a very convenient way to share the results with others. But if sharing exported files isn’t suitable for you, there are some other popular methods of sharing the notebooks directly on the web.

    • GitHub
      • Home to over 2 million notebooks, GitHub is the most popular place for sharing Jupyter projects with the world. GitHub has integrated support for rendering .ipynb files directly, both in repositories and in gists on its website.
      • You can follow the GitHub guides to get started on your own.
    • Nbviewer
      • Nbviewer is one of the most prominent notebook renderers on the web.
      • It renders your notebook from GitHub and other such code storage platforms and provides a shareable URL along with it. nbviewer.jupyter.org provides a free rendering service as part of Project Jupyter.

    Data Analysis in a Jupyter Notebook

    Now that we’ve looked at what a Jupyter Notebook is, it’s time to look at how they’re used in practice, which should give you a clearer understanding of why they are so popular. As we walk through the sample analysis, you will be able to see how the flow of a notebook makes the task intuitive to work through ourselves, as well as for others to understand when we share it with them. We also hope to learn some of the more advanced features of Jupyter notebooks along the way. So let’s get started, shall we?

    Analyzing the Revenue and Profit Trends of Fortune 500 US companies from 1955-2013

    So, let’s say you’ve been tasked with finding out how the revenues and profits of the largest companies in the US changed historically over the past 60 years. We shall begin by gathering the data to analyze.

    Gathering the DataSet

    The data set that we will be using to analyze the revenue and profit trends of Fortune 500 companies has been sourced from Fortune 500 Archives and Top Foreign Stocks. For your ease we have compiled the data from both sources and created a CSV for you.

    Importing the Required Dependencies

    Let’s start off with a code cell specifically for imports and initial setup, so that if we need to add or change anything at a later point in time, we can simply edit and re-run the cell without having to change the other cells. We can start by importing Pandas to work with our data, Matplotlib to plot the charts and Seaborn to make our charts prettier.

    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns
    import sys

    Set the design styles for the charts

    sns.set(style="darkgrid")

    Load the Input Data to be Analyzed

    As we plan on using pandas to aid in our analysis, let’s begin by importing our input data set into the most widely used pandas data-structure, DataFrame.

    df = pd.read_csv('../data/fortune500_1955_2013.csv')

    Now that we are done loading our input dataset, let us see what it looks like!

    df.head()

    Looking good. Each row corresponds to a single company per year and all the columns we need are present.

    Exploring the Dataset

    Next, let’s begin by exploring our data set. We will primarily look into the number of records imported and the data types for each of the different columns that were imported.

    df.columns = ['year', 'rank', 'company', 'revenue', 'profit']
    len(df)

    As we have 500 data points per year and the data set has records between 1955 and 2012, the total number of records in the dataset looks good!

    Now, let’s move on to the individual data types for each of the columns.

    df.dtypes

    As we can see from the output of the above command the data types for the columns revenue and profit are being shown as object whereas the expected data type should be float. It indicates that there may be some non-numeric values in the revenue and profit columns.

    So let’s first look at the details of imported values for revenue.

    non_numeric_revenues = df.revenue.str.contains('[^0-9.-]')
    df.loc[non_numeric_revenues].head()

    print("Number of Non-numeric revenue values: ", len(df.loc[non_numeric_revenues]))

    Number of Non-numeric revenue values:	1

    print("List of distinct Non-numeric revenue values: ", set(df.revenue[non_numeric_revenues]))

    List of distinct Non-numeric revenue values:	{'N.A.'}
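
    To see what the pattern '[^0-9.-]' used above actually flags, here is a small standard-library sketch. The sample strings are made up; only 'N.A.' comes from the data set described in this article.

```python
import re

# Same character class as in the pandas call: anything that is not a digit, dot or minus.
NON_NUMERIC = re.compile(r"[^0-9.-]")

sample_values = ["9823.5", "-12.7", "N.A.", "1,024", "300"]  # hypothetical cell values

# A value is flagged as soon as one offending character is found anywhere in it.
flags = [bool(NON_NUMERIC.search(value)) for value in sample_values]

# Keep only the clean values and convert them to numbers.
kept = [float(value) for value, bad in zip(sample_values, flags) if not bad]
print(flags, kept)
```

    Note that the pattern only checks which characters are present. A string such as '1.2.3' would pass this check and still fail to convert, so the conversion step remains the final judge.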

    The number of non-numeric revenue values is tiny compared to the total size of our data set, so it is easiest to just remove those rows.

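    # Drop the rows with non-numeric revenue, then convert the column to numbers
    df = df.loc[~non_numeric_revenues].copy()
    df.revenue = pd.to_numeric(df.revenue)
    df.dtypes  # re-check the column types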

    Now that the data type issue for column revenue is resolved, let’s move on to values in column profit.

    non_numeric_profits = df.profit.str.contains('[^0-9.-]')
    df.loc[non_numeric_profits].head()

    print("Number of Non-numeric profit values: ", len(df.loc[non_numeric_profits]))

    Number of Non-numeric profit values:	374

    print("List of distinct Non-numeric profit values: ", set(df.profit[non_numeric_profits]))

    List of distinct Non-numeric profit values:	{'N.A.'}

    The non-numeric profit values make up around 1.5% of our data set, which is a small percentage but not completely inconsequential. Let’s take a quick look at how they are distributed. If the rows having N.A. values are spread roughly uniformly over the years, it would be reasonable to just remove the rows with missing values.

    bin_sizes, _, _ = plt.hist(df.year[non_numeric_profits], bins=range(1955, 2013))

    As observed from the histogram above, the largest number of invalid values in a single year is fewer than 25. As there are 500 data points per year, removing these values would account for less than 5% of the data even in the worst year. Also, other than a surge around 1990, most years have fewer than 10 values missing. Let’s assume that this is acceptable for us and move ahead with removing these rows.

    # Drop the rows with non-numeric profit, then convert the column to numbers
    df = df.loc[~non_numeric_profits].copy()
    df.profit = pd.to_numeric(df.profit)

    We should validate if that worked!

    df.dtypes  # re-check the column types

    Hurray! Our dataset has been cleaned up.

    Time to Plot the graphs

    Let’s begin by defining a function to plot the graph, set the title and add labels for the x-axis and y-axis.

    # function to plot the graphs for average revenues or profits of the fortune 500 companies against year
    def plot(x, y, ax, title, y_label):
        ax.set_title(title)
        ax.set_ylabel(y_label)
        ax.plot(x, y)
        ax.margins(x=0, y=0)
        
    # function to plot the graphs with superimposed standard deviation    
    def plot_with_std(x, y, stds, ax, title, y_label):
        ax.fill_between(x, y - stds, y + stds, alpha=0.2)
        plot(x, y, ax, title, y_label)

    Let’s plot the average profit by year and average revenue by year using Matplotlib.

    group_by_year = df.loc[:, ['year', 'revenue', 'profit']].groupby('year')
    avgs = group_by_year.mean()
    x = avgs.index
    y = avgs.profit
    
    fig, ax = plt.subplots()
    plot(x, y, ax, 'Increase in mean Fortune 500 company profits from 1955 to 2013', 'Profit (millions)')

    y2 = avgs.revenue
    fig, ax = plt.subplots()
    plot(x, y2, ax, 'Increase in mean Fortune 500 company revenues from 1955 to 2013', 'Revenue (millions)')

    Whoa! The chart for profits has some huge ups and downs. They seem to correspond to the early 1990s recession, the dot-com bubble in the early 2000s and the Great Recession in 2008.

    On the other hand, the Revenues are constantly growing and are comparatively stable. Also it does help to understand how the average profits recovered so quickly after the staggering drops because of the recession.
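
    For readers who want to see what the per-year averages amount to without pandas, the grouping can be sketched with the standard library alone. The rows below are made-up numbers, not values from the Fortune 500 data. statistics.stdev computes the sample standard deviation, which matches the default behaviour of pandas' std().

```python
from collections import defaultdict
from statistics import mean, stdev

# (year, profit) pairs; invented values for illustration only.
rows = [(1955, 10.0), (1955, 30.0), (1956, 20.0), (1956, 40.0), (1956, 60.0)]

# Group the profits by year, which is what groupby('year') does conceptually.
by_year = defaultdict(list)
for year, profit in rows:
    by_year[year].append(profit)

# One summary number per year, like .mean() and .std() on the grouped data.
yearly_mean = {year: mean(values) for year, values in by_year.items()}
yearly_std = {year: stdev(values) for year, values in by_year.items()}
print(yearly_mean)
```

    The keys of yearly_mean play the role of avgs.index in the plotting code, and its values play the role of avgs.profit.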

    Let’s also take a look at how the average profits and revenues compare to their standard deviations.

    fig, (ax1, ax2) = plt.subplots(ncols=2)
    title = 'Increase in mean and std Fortune 500 company %s from 1955 to 2013'
    stds1 = group_by_year.std().profit.values
    stds2 = group_by_year.std().revenue.values
    plot_with_std(x, y.values, stds1, ax1, title % 'profits', 'Profit (millions)')
    plot_with_std(x, y2.values, stds2, ax2, title % 'revenues', 'Revenue (millions)')
    fig.set_size_inches(14, 4)
    fig.tight_layout()

     

    That’s astonishing, the standard deviations are huge. Some companies are making billions while some others are losing as much, and the risk certainly has increased along with rising profits and revenues over the years. Although we could keep on playing around with our data set and plot plenty more charts to analyze, it is time to bring this article to a close.

    Conclusion

    As part of this article we have seen various features of the Jupyter notebooks, from basics like installation, creating, and running code cells to more advanced features like plotting graphs. The power of Jupyter Notebooks to promote a productive working experience and provide an ease of use is evident from the above example, and I do hope that you feel confident to begin using Jupyter Notebooks in your own work and start exploring more advanced features. You can read more about data analytics using Pandas here.

    If you’d like to further explore and want to look at more examples, Jupyter has put together A Gallery of Interesting Jupyter Notebooks that you may find helpful and the Nbviewer homepage provides a lot of examples for further references. Find the entire code here on Github.

  • Learn How to Quickly Setup Istio Using GKE and its Applications

    In this blog, we will try to understand Istio and its YAML configurations. You will also learn why Istio is great for managing traffic and how to set it up using Google Kubernetes Engine (GKE). I’ve also shed some light on deploying Istio in various environments and applications like intelligent routing, traffic shifting, injecting delays, and testing the resiliency of your application.

    What is Istio?

    Istio’s website says it is “An open platform to connect, manage, and secure microservices”.

    As a network of microservices, known as a ‘service mesh’, grows in size and complexity, it can become tougher to understand and manage. Its requirements can include discovery, load balancing, failure recovery, metrics, and monitoring, and often more complex operational requirements such as A/B testing, canary releases, rate limiting, access control, and end-to-end authentication. Istio claims to provide a complete end-to-end solution to these problems.

    Why Istio?

    • Provides automatic load balancing for various protocols like HTTP, gRPC, WebSocket, and TCP traffic. It means you can cater to the needs of web services and also frameworks like Tensorflow (it uses gRPC).
    • To control the flow of traffic and API calls between services, make calls more reliable, and make the network more robust in the face of adverse conditions.
    • To gain understanding of the dependencies between services and the nature and flow of traffic between them, providing the ability to quickly identify issues etc.

    Let’s explore the architecture of Istio.

    Istio’s service mesh is split logically into two components:

    1. Data plane – a set of intelligent proxies (Envoy) deployed as sidecars to the microservices; they control communication between microservices.
    2. Control plane – manages and configures proxies to route traffic. It also enforces policies.

    Envoy – Istio uses an extended version of Envoy (an L7 proxy and communication bus designed for large modern service-oriented architectures), which is written in C++. It manages inbound and outbound traffic for the service mesh.

    Enough theory; now let us set up Istio to see things in action. A notable point is that Istio is pretty fast. It’s written in Go and adds very little overhead to your system.

    Setup Istio on GKE

    You can set up Istio either via the command line or via the UI. We have used the command line installation for this blog.

    Sample Book Review Application

    Following this link, you can easily

    The Bookinfo application is broken into four separate microservices:

    • productpage. The productpage microservice calls the details and reviews microservices to populate the page.
    • details. The details microservice contains book information.
    • reviews. The reviews microservice contains book reviews. It also calls the ratings microservice.
    • ratings. The ratings microservice contains book ranking information that accompanies a book review.

    There are 3 versions of the reviews microservice:

    • Version v1 doesn’t call the ratings service.
    • Version v2 calls the ratings service and displays each rating as 1 to 5 black stars.
    • Version v3 calls the ratings service and displays each rating as 1 to 5 red stars.

    The end-to-end architecture of the application is shown below.

    If everything goes well, you will have a web app like this (served at http://GATEWAY_URL/productpage)

    Let’s take a case where 50% of traffic is routed to v1 and the remaining 50% to v3.

    This is what the config file looks like (/path/to/istio-0.2.12/samples/bookinfo/kube/route-rule-reviews-50-v3.yaml):

    apiVersion: config.istio.io/v1alpha2
    kind: RouteRule
    metadata:
      name: reviews-default
    spec:
      destination:
        name: reviews
      precedence: 1
      route:
      - labels:
          version: v1
        weight: 50
      - labels:
          version: v3
        weight: 50

    Let’s try to understand the config file above.

    Istio provides a simple Domain-specific language (DSL) to control how API calls and layer-4 traffic flow across various services in the application deployment.

    In the above configuration, we are adding a “Route Rule”, which controls how traffic coming to a destination is routed. The destination is the name of the service to which the traffic is being routed. The route labels identify the specific service instances that will receive traffic.

    In this Kubernetes deployment of Istio, the route labels “version: v1” and “version: v3” indicate that only pods carrying the label “version: v1” or “version: v3” will receive traffic, 50% each.
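
    To build some intuition for what the weights mean, the following sketch simulates a weighted choice between the two versions. It only illustrates the arithmetic of a 50/50 split and makes no claim about how Envoy actually picks a backend; pick_version and the fixed random seed are assumptions made for the example.

```python
import random
from collections import Counter

# (version label, weight) pairs taken from the route rule above.
routes = [("v1", 50), ("v3", 50)]


def pick_version(rng):
    """Pick one version at random, proportionally to its weight."""
    versions = [version for version, _ in routes]
    weights = [weight for _, weight in routes]
    return rng.choices(versions, weights=weights, k=1)[0]


rng = random.Random(42)  # fixed seed so repeated runs give the same counts
counts = Counter(pick_version(rng) for _ in range(10_000))
print(counts)
```

    Over many requests each version ends up with roughly half of the traffic, while any individual request may land on either version.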

    Now multiple route rules could be applied to the same destination. The order of evaluation of rules corresponding to a given destination, when there is more than one, can be specified by setting the precedence field of the rule.

    The precedence field is an optional integer value, 0 by default. Rules with higher precedence values are evaluated first. If there is more than one rule with the same precedence value the order of evaluation is undefined.

    When is precedence useful? Whenever the routing story for a particular service is purely weight based, it can be specified in a single rule.

    Once a rule is found that applies to the incoming request, it will be executed and the rule-evaluation process will terminate. That’s why it’s very important to carefully consider the priorities of each rule when there is more than one.

    In short, when more than one rule applies to the same destination, the rule with the higher precedence value is given preference.
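
    The evaluation order described above can be sketched in a few lines: sort the rules by precedence, highest first, and stop at the first rule whose condition matches. The rule names, the match functions and the request shape below are hypothetical and exist only for this illustration.

```python
# Two made-up rules for the same destination; higher precedence is checked first.
rules = [
    {"name": "catch-all", "precedence": 1, "match": lambda request: True, "route": "v1"},
    {
        "name": "special-user",
        "precedence": 2,
        "match": lambda request: request.get("user") == "alice",
        "route": "v2",
    },
]


def choose_route(request, rules):
    """Return the route of the first matching rule, highest precedence first."""
    for rule in sorted(rules, key=lambda rule: rule["precedence"], reverse=True):
        if rule["match"](request):
            return rule["route"]  # evaluation stops at the first match
    return None


special = choose_route({"user": "alice"}, rules)
default = choose_route({"user": "bob"}, rules)
```

    If the two precedence values were swapped, the catch-all rule would match every request first and the user-specific rule would never be reached, which is why the priorities deserve care.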

    Intelligent Routing Using Istio

    We will demonstrate an example in which we will be aiming to get more control over routing the traffic coming to our app. Before reading ahead, make sure that you have installed Istio and book review application.

    First, we will set a default version for all microservices.

    > kubectl create -f samples/bookinfo/kube/route-rule-all-v1.yaml

    Then wait a few seconds for the rules to propagate to all pods before attempting to access the application. This will set the default route to v1 version (which doesn’t call rating service). Now we want a specific user, say Velotio, to see v2 version. We write a yaml (test-velotio.yaml) file.

    apiVersion: config.istio.io/v1alpha2
    kind: RouteRule
    metadata:
      name: test-velotio
      namespace: default
      ...
    spec:
      destination:
        name: reviews
      match:
        request:
          headers:
            cookie:
              regex: ^(.*?;)?(user=velotio)(;.*)?$
      precedence: 2
      route:
      - labels:
          version: v2

    We then set this rule:

    > kubectl create -f path/to/test-velotio.yaml

    Now if any other user logs in, they won’t see any ratings (they will see the v1 version), but when the “velotio” user logs in, they will see the v2 version!

    This is how we can intelligently do content-based routing. We first used Istio to send 100% of the traffic to the v1 version of each of the Bookinfo services. We then set a rule to selectively send traffic to version v2 of the reviews service based on a header (i.e., a user cookie) in a request.
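
    To check which cookie headers the regular expression from the rule would accept, you can try it in Python. This is a sketch that assumes the pattern behaves the same way in Python's re module as in the proxy's regex engine; routed_version is a hypothetical helper, not part of Istio.

```python
import re

# The pattern from the route rule: "user=velotio" as a whole cookie entry.
COOKIE_RULE = re.compile(r"^(.*?;)?(user=velotio)(;.*)?$")


def routed_version(cookie_header):
    """Return the reviews version a request with this cookie header would see."""
    return "v2" if COOKIE_RULE.match(cookie_header) else "v1"


examples = {
    "user=velotio": routed_version("user=velotio"),
    "session=abc;user=velotio": routed_version("session=abc;user=velotio"),
    "user=velotio2": routed_version("user=velotio2"),
    "user=jason": routed_version("user=jason"),
}
print(examples)
```

    The pattern expects entries to be separated by a bare semicolon, so a header written as "session=abc; user=velotio" (with a space after the semicolon) would not match in this sketch.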

    Traffic Shifting

    Now let’s take a case in which we have to shift traffic from an old service to a new service.

    We can use Istio to gradually transfer traffic from one microservice to another. For example, we can move 10, 20, 25..100% of traffic. Here, for simplicity, we will move traffic from reviews:v1 to reviews:v3 in two steps: first 40%, then 100%.

    First, we set the default version v1.

    > kubectl create -f samples/bookinfo/kube/route-rule-all-v1.yaml

    We write a yaml file route-rule-reviews-40-v3.yaml

    apiVersion: config.istio.io/v1alpha2
    kind: RouteRule
    metadata:
      name: reviews-default
      namespace: default
    spec:
      destination:
        name: reviews
      precedence: 1
      route:
      - labels:
          version: v1
        weight: 60
      - labels:
          version: v3
        weight: 40

    Then we apply a new rule.

    > kubectl create -f path/to/route-rule-reviews-40-v3.yaml

    Now refresh the productpage in your browser; you should see red star ratings approximately 40% of the time. Once that is stable, we transfer all the traffic to v3.

    > istioctl replace -f samples/bookinfo/kube/route-rule-reviews-v3.yaml

    Inject Delays and Test the Resiliency of Your Application

    Here we will check fault injection using HTTP delay. To test our Bookinfo application microservices for resiliency, we will inject a 7s delay between the reviews:v2 and ratings microservices, for user “Jason”. Since the reviews:v2 service has a 10s timeout for its calls to the ratings service, we expect the end-to-end flow to continue without any errors.

    > istioctl create -f samples/bookinfo/kube/route-rule-ratings-test-delay.yaml

    Now we check if the rule was applied correctly,

    > istioctl get routerule ratings-test-delay -o yaml

    Now we allow several seconds to account for rule propagation delay to all pods. Log in as user “Jason”. If the application’s front page was set to correctly handle delays, we expect it to load within approximately 7 seconds.
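
    The reasoning behind the experiment (a 7s delay against a 10s timeout) can be tried out locally with a scaled-down simulation. In the sketch below, 0.7 seconds stands in for 7 seconds and 1.0 second for 10 seconds; call_with_timeout and the stand-in ratings function are hypothetical and do not talk to Istio at all.

```python
import concurrent.futures
import time


def call_with_timeout(delay_seconds, timeout_seconds):
    """Call a stand-in 'ratings' service that answers after delay_seconds."""

    def slow_ratings():
        time.sleep(delay_seconds)  # the injected delay
        return "ratings"

    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(slow_ratings)
        try:
            # The caller gives up once its timeout budget is used up.
            return future.result(timeout=timeout_seconds)
        except concurrent.futures.TimeoutError:
            return "timed out"


# Delay shorter than the timeout: the call still succeeds, just slowly.
within_budget = call_with_timeout(delay_seconds=0.7, timeout_seconds=1.0)
# Delay longer than the timeout: the caller stops waiting.
over_budget = call_with_timeout(delay_seconds=0.3, timeout_seconds=0.1)
```

    The same comparison applies at every hop: a delay is only harmless if each caller in the chain has a timeout longer than the delay it has to sit through.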

    Conclusion

    In this blog we only explored the routing capabilities of Istio. We found that Istio gives us a good amount of control over routing, fault injection and more in microservices. Istio has a lot more to offer, such as load balancing and security. We encourage you to experiment with Istio and tell us about your experiences.

    Happy Coding!

  • Creating GraphQL APIs Using Elixir Phoenix and Absinthe

    Introduction

    GraphQL is the new hype in the field of API technologies. We have been building and using REST APIs for quite some time now and started hearing about GraphQL only recently. GraphQL is usually described as a frontend-directed API technology, as it allows front-end developers to request data in a simpler way than ever before. The objective of this query language is to let client applications describe their data requirements and interactions in an intuitive and flexible format.

    The Phoenix Framework runs on Elixir, which is built on top of Erlang. Elixir’s core strengths are scaling and concurrency. Phoenix is a powerful and productive web framework that does not compromise speed and maintainability. Phoenix comes with built-in support for web sockets, enabling you to build real-time apps.

    Prerequisites:

    1. Elixir & Erlang: Phoenix is built on top of these.
    2. Phoenix Web Framework: Used for writing the server application. (It’s a well-known and lightweight framework in Elixir.)
    3. Absinthe: GraphQL library written for Elixir, used for writing queries and mutations.
    4. GraphiQL: Browser-based GraphQL IDE for testing your queries. Consider it similar to what Postman is for testing REST APIs.

    Overview:

    The application we will be developing is a simple blog application written using the Phoenix Framework with two schemas, User and Post, defined in Accounts and Blog respectively. We will design the application to support APIs related to blog creation and management. We assume you have Erlang, Elixir and mix installed.

    Where to Start:

    At first, we have to create a Phoenix web application using the following command:

    mix phx.new  --no-brunch --no-html

    • --no-brunch – do not generate brunch files for static asset building. When choosing this option, you will need to manually handle JavaScript dependencies if building HTML apps.
    • --no-html – do not generate HTML views.

    Note: As we are going to work mostly with the API, we don’t need any web pages or HTML views, hence the command args --no-brunch and --no-html.

    Dependencies:

    After we create the project, we need to add dependencies in mix.exs to make GraphQL available for the Phoenix application.

    defp deps do
      [
        {:absinthe, "~> 1.3.1"},
        {:absinthe_plug, "~> 1.3.0"},
        {:absinthe_ecto, "~> 0.1.3"}
      ]
    end

    Structuring the Application:

    We can use the following components to design/structure our GraphQL application:

    1. GraphQL Schemas: This has to go inside lib/graphql_web/schema/schema.ex. The schema defines your queries and mutations.
    2. Custom types: Your schema may include some custom properties which should be defined inside lib/graphql_web/schema/types.ex
    3. Resolvers: We have to write the respective resolver functions that handle the business logic and have to be mapped to the respective query or mutation. Resolvers should be defined in their own files. We defined them inside lib/graphql/accounts/user_resolver.ex and lib/graphql/blog/post_resolver.ex.

    Also, we need to update the router in lib/graphql_web/router.ex to be able to make queries using the GraphQL client, and create a GraphQL pipeline, in the same file, to route the API requests:

    pipeline :graphql do
      plug Graphql.Context  # custom plug defined in lib/graphql_web/plug/context.ex
    end
    
    scope "/api" do
      pipe_through(:graphql)  #pipeline through which the request have to be routed
    
      forward("/",  Absinthe.Plug, schema: GraphqlWeb.Schema)
      forward("/graphiql", Absinthe.Plug.GraphiQL, schema: GraphqlWeb.Schema)
    end

    Writing GraphQL Queries:

    Let’s write some GraphQL queries, which can be considered equivalent to GET requests in REST. But before getting into queries, let’s take a look at the GraphQL schema we defined and its equivalent resolver mapping:

    defmodule GraphqlWeb.Schema do
      use Absinthe.Schema
      import_types(GraphqlWeb.Schema.Types)
    
      query do
        field :blog_posts, list_of(:blog_post) do
          resolve(&Graphql.Blog.PostResolver.all/2)
        end
    
        field :blog_post, type: :blog_post do
          arg(:id, non_null(:id))
          resolve(&Graphql.Blog.PostResolver.find/2)
        end
    
        field :accounts_users, list_of(:accounts_user) do
          resolve(&Graphql.Accounts.UserResolver.all/2)
        end
    
        field :accounts_user, :accounts_user do
          arg(:email, non_null(:string))
          resolve(&Graphql.Accounts.UserResolver.find/2)
        end
      end
    end

    You can see above we have defined four queries in the schema. Lets pick a query and see what goes into it :

    field :accounts_user, :accounts_user do
      arg(:email, non_null(:string))
      resolve(&Graphql.Accounts.UserResolver.find/2)
    end

    This query retrieves a particular user by email address. Its parts are:

    1. arg(:email, non_null(:string)): defines a non-null incoming string argument, in our case the user's email.
    2. Graphql.Accounts.UserResolver.find/2: the resolver function mapped via the schema, which contains the core business logic for retrieving a user.
    3. :accounts_user: the custom type, which is defined inside lib/graphql_web/schema/types.ex as follows:

    object :accounts_user do
      field(:id, :id)
      field(:name, :string)
      field(:email, :string)
      field(:posts, list_of(:blog_post), resolve: assoc(:blog_posts))
    end

    We need to write a separate resolver function for every query we define. Let's go over the resolver functions for accounts_user, which live in lib/graphql/accounts/user_resolver.ex:

    defmodule Graphql.Accounts.UserResolver do
      alias Graphql.Accounts                    # refer to the module in lib/graphql/accounts/accounts.ex as Accounts
    
      def all(_args, _info) do
        {:ok, Accounts.list_users()}
      end
    
      def find(%{email: email}, _info) do
        case Accounts.get_user_by_email(email) do
          nil -> {:error, "User email #{email} not found!"}
          user -> {:ok, user}
        end
      end
    end

    These functions list all users or retrieve a particular user by email address. Let's run a query now using the GraphiQL browser interface. You need to have the server running on port 4000. To start the Phoenix server, use:

    mix deps.get      # fetch all the dependencies
    mix deps.compile  # compile the dependencies
    mix phx.server    # start the Phoenix server

    Let's retrieve a user by email address with a query:

    Here we retrieve the id, email and name fields by executing the accountsUser query with an email address. GraphQL also allows us to define variables, which we will show later when writing mutations.
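
    The same query can also be sent without GraphiQL, as a plain HTTP POST with a JSON body. The following is a minimal Python sketch of what such a request might look like. It assumes the Phoenix server is running locally on port 4000 with the endpoint mounted at /api, as in the router above, and that the endpoint parses JSON request bodies. The helper name and the sample email address are made up for illustration.

```python
import json
from urllib.request import Request

# Assumption: the Phoenix server runs locally on port 4000 and the
# Absinthe endpoint is mounted at /api, as in the router above.
GRAPHQL_URL = "http://localhost:4000/api"

# accountsUser is the camel-cased form of the :accounts_user field.
ACCOUNTS_USER_QUERY = """
query ($email: String!) {
  accountsUser(email: $email) {
    id
    email
    name
  }
}
"""


def build_graphql_request(query, variables, url=GRAPHQL_URL):
    """Wrap a GraphQL document and its variables in an HTTP POST request."""
    payload = json.dumps({"query": query, "variables": variables}).encode("utf-8")
    return Request(url, data=payload, headers={"Content-Type": "application/json"})


# "jane@example.com" is a made-up address used only for illustration.
request = build_graphql_request(ACCOUNTS_USER_QUERY, {"email": "jane@example.com"})
print(request.get_method())          # POST, because the request carries a body
print(request.data.decode("utf-8"))  # the JSON document that would be sent
```

    Passing the request object to urllib.request.urlopen would send it to the running server. The sketch stops short of that so it can be read and run without a server.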

    Let’s execute another query to list all blog posts that we have defined:

    Writing GraphQL Mutations:

    Let's write some GraphQL mutations. If you have understood how queries are written, mutations will feel familiar: they are defined in the same form as queries, each with its own resolver function. We are going to write the following mutations:

    1. create_post: create a new blog post
    2. update_post: update an existing blog post
    3. delete_post: delete an existing blog post

    The mutation looks as follows:

    defmodule GraphqlWeb.Schema do
      use Absinthe.Schema
      import_types(GraphqlWeb.Schema.Types)
    
      # The query block shown earlier stays as it is and is omitted here for brevity.

      # mutation is a top-level block of the schema, a sibling of query rather than nested inside it.
      mutation do
        field :create_post, type: :blog_post do
          arg(:title, non_null(:string))
          arg(:body, non_null(:string))
          arg(:accounts_user_id, non_null(:id))

          resolve(&Graphql.Blog.PostResolver.create/2)
        end

        field :update_post, type: :blog_post do
          arg(:id, non_null(:id))
          arg(:post, :update_post_params)

          resolve(&Graphql.Blog.PostResolver.update/2)
        end

        field :delete_post, type: :blog_post do
          arg(:id, non_null(:id))
          resolve(&Graphql.Blog.PostResolver.delete/2)
        end
      end
    end

    Let’s run some mutations to create a post in GraphQL:

    Notice the method is POST and not GET over here.
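
    Outside GraphiQL, a mutation travels the same way as a query: a JSON body with a query document and its variables, posted to the endpoint. Here is a small Python sketch of the createPost payload and of reading the standard GraphQL response shape (data and errors). The helper names are hypothetical, selecting only id is an assumption because the fields of the blog_post type are not listed here, and the sample responses are written by hand rather than captured from a server.

```python
import json

# createPost and accountsUserId are the camel-cased forms of :create_post and
# :accounts_user_id. Selecting only `id` is an assumption, since the fields of
# the :blog_post type are not listed in this post.
CREATE_POST_MUTATION = """
mutation ($title: String!, $body: String!, $userId: ID!) {
  createPost(title: $title, body: $body, accountsUserId: $userId) {
    id
  }
}
"""


def mutation_payload(title, body, user_id):
    """Return the JSON body for a createPost request."""
    variables = {"title": title, "body": body, "userId": user_id}
    return json.dumps({"query": CREATE_POST_MUTATION, "variables": variables})


def unwrap(response, field):
    """Return response["data"][field], or raise if the server reported errors."""
    if response.get("errors"):
        messages = [error.get("message", "") for error in response["errors"]]
        raise RuntimeError("; ".join(messages))
    return response["data"][field]


# Hand-written sample responses in the standard GraphQL shape, not real output.
ok = {"data": {"createPost": {"id": "1"}}}
failed = {"data": {"createPost": None}, "errors": [{"message": "something went wrong"}]}

print(mutation_payload("Hello", "First post", "1"))
print(unwrap(ok, "createPost"))
```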

    Let's dig into the update mutation:

    field :update_post, type: :blog_post do
      arg(:id, non_null(:id))
      arg(:post, :update_post_params)

      resolve(&Graphql.Blog.PostResolver.update/2)
    end

    Here, update_post takes two arguments as input: a non-null id and a post parameter of type update_post_params that holds the values to update. The mutation is defined in lib/graphql_web/schema/schema.ex, while the input parameter type is defined in lib/graphql_web/schema/types.ex:

    input_object :update_post_params do
      field(:title, :string)
      field(:body, :string)
      field(:accounts_user_id, :id)
    end

    The difference from the previous type definitions is that it is defined as an input_object instead of an object.

    The corresponding resolver function is defined as follows:

    def update(%{id: id, post: post_params}, info) do
      # info is passed on to find/2, so it should not carry the "unused" underscore prefix
      case find(%{id: id}, info) do
        {:ok, post} -> post |> Blog.update_post(post_params)
        {:error, _} -> {:error, "Post id #{id} not found"}
      end
    end

         

    Here we have defined a query parameter to specify the id of the blog post to be updated.

    Conclusion

    This is all you need to write a basic GraphQL server for any Phoenix application using Absinthe.

    References:

    1. https://www.howtographql.com/graphql-elixir/0-introduction/
    2. https://pragprog.com/book/wwgraphql/craft-graphql-apis-in-elixir-with-absinthe
    3. https://itnext.io/graphql-with-elixir-phoenix-and-absinthe-6b0ffd260094
  • Deploy Serverless, Event-driven Python Applications Using Zappa

    Introduction

    Zappa is a powerful open source Python project that lets you build, deploy and update a WSGI app hosted on AWS Lambda + API Gateway. This post is a detailed step-by-step guide that focuses on the challenges faced while deploying a Django application on AWS Lambda with Zappa as the deployment tool.

    Building Your Application

    If you do not have a Django application already, you can build one by cloning this GitHub repository:

    $ git clone https://github.com/velotiotech/django-zappa-sample.git    

    Cloning into 'django-zappa-sample'...
    remote: Counting objects: 18, done.
    remote: Compressing objects: 100% (13/13), done.
    remote: Total 18 (delta 1), reused 15 (delta 1), pack-reused 0
    Unpacking objects: 100% (18/18), done.
    Checking connectivity... done.

    Once you have cloned the repository, you will need a virtual environment, which provides an isolated Python environment for your application. I prefer virtualenvwrapper to create one:

    $ mkvirtualenv django_zappa_sample 

    Installing setuptools, pip, wheel...done.
    virtualenvwrapper.user_scripts creating /home/velotio/Envs/django_zappa_sample/bin/predeactivate
    virtualenvwrapper.user_scripts creating /home/velotio/Envs/django_zappa_sample/bin/postdeactivate
    virtualenvwrapper.user_scripts creating /home/velotio/Envs/django_zappa_sample/bin/preactivate
    virtualenvwrapper.user_scripts creating /home/velotio/Envs/django_zappa_sample/bin/postactivate
    virtualenvwrapper.user_scripts creating /home/velotio/Envs/django_zappa_sample/bin/get_env_details

    Install dependencies from requirements.txt.

    $ pip install -r requirements.txt

    Collecting Django==1.11.11 (from -r requirements.txt (line 1))
      Downloading https://files.pythonhosted.org/packages/d5/bf/2cd5eb314aa2b89855c01259c94dc48dbd9be6c269370c1f7ae4979e6e2f/Django-1.11.11-py2.py3-none-any.whl (6.9MB)
        100% |████████████████████████████████| 7.0MB 772kB/s 
    Collecting zappa==0.45.1 (from -r requirements.txt (line 2))
    Collecting pytz (from Django==1.11.11->-r requirements.txt (line 1))
      Downloading https://files.pythonhosted.org/packages/dc/83/15f7833b70d3e067ca91467ca245bae0f6fe56ddc7451aa0dc5606b120f2/pytz-2018.4-py2.py3-none-any.whl (510kB)
        100% |████████████████████████████████| 512kB 857kB/s 
    Collecting future==0.16.0 (from zappa==0.45.1->-r requirements.txt (line 2))
    Collecting toml>=0.9.3 (from zappa==0.45.1->-r requirements.txt (line 2))
    Collecting docutils>=0.12 (from zappa==0.45.1->-r requirements.txt (line 2))
      Using cached https://files.pythonhosted.org/packages/50/09/c53398e0005b11f7ffb27b7aa720c617aba53be4fb4f4f3f06b9b5c60f28/docutils-0.14-py2-none-any.whl
    Collecting PyYAML==3.12 (from zappa==0.45.1->-r requirements.txt (line 2))
    Collecting futures==3.1.1 (from zappa==0.45.1->-r requirements.txt (line 2))
      Using cached https://files.pythonhosted.org/packages/a6/1c/72a18c8c7502ee1b38a604a5c5243aa8c2a64f4bba4e6631b1b8972235dd/futures-3.1.1-py2-none-any.whl
    Requirement already satisfied: wheel>=0.30.0 in /home/velotio/Envs/django_zappa_sample/lib/python2.7/site-packages (from zappa==0.45.1->-r requirements.txt (line 2)) (0.31.1)
    Collecting base58==0.2.4 (from zappa==0.45.1->-r requirements.txt (line 2))
    Collecting durationpy==0.5 (from zappa==0.45.1->-r requirements.txt (line 2))
    Collecting kappa==0.6.0 (from zappa==0.45.1->-r requirements.txt (line 2))
      Using cached https://files.pythonhosted.org/packages/ed/cf/a8aa5964557c8a4828da23d210f8827f9ff190318838b382a4fb6f118f5d/kappa-0.6.0-py2-none-any.whl
    Collecting Werkzeug==0.12 (from zappa==0.45.1->-r requirements.txt (line 2))
      Using cached https://files.pythonhosted.org/packages/ae/c3/f59f6ade89c811143272161aae8a7898735e7439b9e182d03d141de4804f/Werkzeug-0.12-py2.py3-none-any.whl
    Collecting boto3>=1.4.7 (from zappa==0.45.1->-r requirements.txt (line 2))
      Downloading https://files.pythonhosted.org/packages/cd/a3/4d1caf76d8f5aac8ab1ffb4924ecf0a43df1572f6f9a13465a482f94e61c/boto3-1.7.24-py2.py3-none-any.whl (128kB)
        100% |████████████████████████████████| 133kB 1.1MB/s 
    Collecting six>=1.11.0 (from zappa==0.45.1->-r requirements.txt (line 2))
      Using cached https://files.pythonhosted.org/packages/67/4b/141a581104b1f6397bfa78ac9d43d8ad29a7ca43ea90a2d863fe3056e86a/six-1.11.0-py2.py3-none-any.whl
    Collecting tqdm==4.19.1 (from zappa==0.45.1->-r requirements.txt (line 2))
      Using cached https://files.pythonhosted.org/packages/c0/d3/7f930cbfcafae3836be39dd3ed9b77e5bb177bdcf587a80b6cd1c7b85e74/tqdm-4.19.1-py2.py3-none-any.whl
    Collecting argcomplete==1.9.2 (from zappa==0.45.1->-r requirements.txt (line 2))
      Using cached https://files.pythonhosted.org/packages/0f/ee/625763d848016115695942dba31a9937679a25622b6f529a2607d51bfbaa/argcomplete-1.9.2-py2.py3-none-any.whl
    Collecting hjson==3.0.1 (from zappa==0.45.1->-r requirements.txt (line 2))
    Collecting troposphere>=1.9.0 (from zappa==0.45.1->-r requirements.txt (line 2))
    Collecting python-dateutil==2.6.1 (from zappa==0.45.1->-r requirements.txt (line 2))
      Using cached https://files.pythonhosted.org/packages/4b/0d/7ed381ab4fe80b8ebf34411d14f253e1cf3e56e2820ffa1d8844b23859a2/python_dateutil-2.6.1-py2.py3-none-any.whl
    Collecting botocore>=1.7.19 (from zappa==0.45.1->-r requirements.txt (line 2))
      Downloading https://files.pythonhosted.org/packages/65/98/12aa979ca3215d69111026405a9812d7bb0c9ae49e2800b00d3bd794705b/botocore-1.10.24-py2.py3-none-any.whl (4.2MB)
        100% |████████████████████████████████| 4.2MB 768kB/s 
    Collecting requests>=2.10.0 (from zappa==0.45.1->-r requirements.txt (line 2))
      Using cached https://files.pythonhosted.org/packages/49/df/50aa1999ab9bde74656c2919d9c0c085fd2b3775fd3eca826012bef76d8c/requests-2.18.4-py2.py3-none-any.whl
    Collecting jmespath==0.9.3 (from zappa==0.45.1->-r requirements.txt (line 2))
      Using cached https://files.pythonhosted.org/packages/b7/31/05c8d001f7f87f0f07289a5fc0fc3832e9a57f2dbd4d3b0fee70e0d51365/jmespath-0.9.3-py2.py3-none-any.whl
    Collecting wsgi-request-logger==0.4.6 (from zappa==0.45.1->-r requirements.txt (line 2))
    Collecting lambda-packages==0.19.0 (from zappa==0.45.1->-r requirements.txt (line 2))
    Collecting python-slugify==1.2.4 (from zappa==0.45.1->-r requirements.txt (line 2))
      Using cached https://files.pythonhosted.org/packages/9f/77/ab7134b731d0e831cf82861c1ab0bb318e80c41155fa9da18958f9d96057/python_slugify-1.2.4-py2.py3-none-any.whl
    Collecting placebo>=0.8.1 (from kappa==0.6.0->zappa==0.45.1->-r requirements.txt (line 2))
    Collecting click>=5.1 (from kappa==0.6.0->zappa==0.45.1->-r requirements.txt (line 2))
      Using cached https://files.pythonhosted.org/packages/34/c1/8806f99713ddb993c5366c362b2f908f18269f8d792aff1abfd700775a77/click-6.7-py2.py3-none-any.whl
    Collecting s3transfer<0.2.0,>=0.1.10 (from boto3>=1.4.7->zappa==0.45.1->-r requirements.txt (line 2))
      Using cached https://files.pythonhosted.org/packages/d7/14/2a0004d487464d120c9fb85313a75cd3d71a7506955be458eebfe19a6b1d/s3transfer-0.1.13-py2.py3-none-any.whl
    Collecting cfn-flip>=0.2.5 (from troposphere>=1.9.0->zappa==0.45.1->-r requirements.txt (line 2))
    Collecting certifi>=2017.4.17 (from requests>=2.10.0->zappa==0.45.1->-r requirements.txt (line 2))
      Using cached https://files.pythonhosted.org/packages/7c/e6/92ad559b7192d846975fc916b65f667c7b8c3a32bea7372340bfe9a15fa5/certifi-2018.4.16-py2.py3-none-any.whl
    Collecting chardet<3.1.0,>=3.0.2 (from requests>=2.10.0->zappa==0.45.1->-r requirements.txt (line 2))
      Using cached https://files.pythonhosted.org/packages/bc/a9/01ffebfb562e4274b6487b4bb1ddec7ca55ec7510b22e4c51f14098443b8/chardet-3.0.4-py2.py3-none-any.whl
    Collecting idna<2.7,>=2.5 (from requests>=2.10.0->zappa==0.45.1->-r requirements.txt (line 2))
      Using cached https://files.pythonhosted.org/packages/27/cc/6dd9a3869f15c2edfab863b992838277279ce92663d334df9ecf5106f5c6/idna-2.6-py2.py3-none-any.whl
    Collecting urllib3<1.23,>=1.21.1 (from requests>=2.10.0->zappa==0.45.1->-r requirements.txt (line 2))
      Using cached https://files.pythonhosted.org/packages/63/cb/6965947c13a94236f6d4b8223e21beb4d576dc72e8130bd7880f600839b8/urllib3-1.22-py2.py3-none-any.whl
    Collecting Unidecode>=0.04.16 (from python-slugify==1.2.4->zappa==0.45.1->-r requirements.txt (line 2))
      Using cached https://files.pythonhosted.org/packages/59/ef/67085e30e8bbcdd76e2f0a4ad8151c13a2c5bce77c85f8cad6e1f16fb141/Unidecode-1.0.22-py2.py3-none-any.whl
    Installing collected packages: pytz, Django, future, toml, docutils, PyYAML, futures, base58, durationpy, jmespath, six, python-dateutil, botocore, s3transfer, boto3, placebo, click, kappa, Werkzeug, tqdm, argcomplete, hjson, cfn-flip, troposphere, certifi, chardet, idna, urllib3, requests, wsgi-request-logger, lambda-packages, Unidecode, python-slugify, zappa
    Successfully installed Django-1.11.11 PyYAML-3.12 Unidecode-1.0.22 Werkzeug-0.12 argcomplete-1.9.2 base58-0.2.4 boto3-1.7.24 botocore-1.10.24 certifi-2018.4.16 cfn-flip-1.0.3 chardet-3.0.4 click-6.7 docutils-0.14 durationpy-0.5 future-0.16.0 futures-3.1.1 hjson-3.0.1 idna-2.6 jmespath-0.9.3 kappa-0.6.0 lambda-packages-0.19.0 placebo-0.8.1 python-dateutil-2.6.1 python-slugify-1.2.4 pytz-2018.4 requests-2.18.4 s3transfer-0.1.13 six-1.11.0 toml-0.9.4 tqdm-4.19.1 troposphere-2.2.1 urllib3-1.22 wsgi-request-logger-0.4.6 zappa-0.45.1

    If you run the server now, it will log a warning because the database is not set up yet.

    $ python manage.py runserver  

    Performing system checks...
    
    System check identified no issues (0 silenced).
    
    You have 13 unapplied migration(s). Your project may not work properly until you apply the migrations for app(s): admin, auth, contenttypes, sessions.
    Run 'python manage.py migrate' to apply them.
    
    May 20, 2018 - 14:47:32
    Django version 1.11.11, using settings 'django_zappa_sample.settings'
    Starting development server at http://127.0.0.1:8000/
    Quit the server with CONTROL-C.

    Also, trying to access the admin page (http://localhost:8000/admin/) will throw an “OperationalError” exception, with the log below on the server side.

    Internal Server Error: /admin/
    Traceback (most recent call last):
      File "/home/velotio/Envs/django_zappa_sample/local/lib/python2.7/site-packages/django/core/handlers/exception.py", line 41, in inner
        response = get_response(request)
      File "/home/velotio/Envs/django_zappa_sample/local/lib/python2.7/site-packages/django/core/handlers/base.py", line 187, in _get_response
        response = self.process_exception_by_middleware(e, request)
      File "/home/velotio/Envs/django_zappa_sample/local/lib/python2.7/site-packages/django/core/handlers/base.py", line 185, in _get_response
        response = wrapped_callback(request, *callback_args, **callback_kwargs)
      File "/home/velotio/Envs/django_zappa_sample/local/lib/python2.7/site-packages/django/contrib/admin/sites.py", line 242, in wrapper
        return self.admin_view(view, cacheable)(*args, **kwargs)
      File "/home/velotio/Envs/django_zappa_sample/local/lib/python2.7/site-packages/django/utils/decorators.py", line 149, in _wrapped_view
        response = view_func(request, *args, **kwargs)
      File "/home/velotio/Envs/django_zappa_sample/local/lib/python2.7/site-packages/django/views/decorators/cache.py", line 57, in _wrapped_view_func
        response = view_func(request, *args, **kwargs)
      File "/home/velotio/Envs/django_zappa_sample/local/lib/python2.7/site-packages/django/contrib/admin/sites.py", line 213, in inner
        if not self.has_permission(request):
      File "/home/velotio/Envs/django_zappa_sample/local/lib/python2.7/site-packages/django/contrib/admin/sites.py", line 187, in has_permission
        return request.user.is_active and request.user.is_staff
      File "/home/velotio/Envs/django_zappa_sample/local/lib/python2.7/site-packages/django/utils/functional.py", line 238, in inner
        self._setup()
      File "/home/velotio/Envs/django_zappa_sample/local/lib/python2.7/site-packages/django/utils/functional.py", line 386, in _setup
        self._wrapped = self._setupfunc()
      File "/home/velotio/Envs/django_zappa_sample/local/lib/python2.7/site-packages/django/contrib/auth/middleware.py", line 24, in <lambda>
        request.user = SimpleLazyObject(lambda: get_user(request))
      File "/home/velotio/Envs/django_zappa_sample/local/lib/python2.7/site-packages/django/contrib/auth/middleware.py", line 12, in get_user
        request._cached_user = auth.get_user(request)
      File "/home/velotio/Envs/django_zappa_sample/local/lib/python2.7/site-packages/django/contrib/auth/__init__.py", line 211, in get_user
        user_id = _get_user_session_key(request)
      File "/home/velotio/Envs/django_zappa_sample/local/lib/python2.7/site-packages/django/contrib/auth/__init__.py", line 61, in _get_user_session_key
        return get_user_model()._meta.pk.to_python(request.session[SESSION_KEY])
      File "/home/velotio/Envs/django_zappa_sample/local/lib/python2.7/site-packages/django/contrib/sessions/backends/base.py", line 57, in __getitem__
        return self._session[key]
      File "/home/velotio/Envs/django_zappa_sample/local/lib/python2.7/site-packages/django/contrib/sessions/backends/base.py", line 207, in _get_session
        self._session_cache = self.load()
      File "/home/velotio/Envs/django_zappa_sample/local/lib/python2.7/site-packages/django/contrib/sessions/backends/db.py", line 35, in load
        expire_date__gt=timezone.now()
      File "/home/velotio/Envs/django_zappa_sample/local/lib/python2.7/site-packages/django/db/models/manager.py", line 85, in manager_method
        return getattr(self.get_queryset(), name)(*args, **kwargs)
      File "/home/velotio/Envs/django_zappa_sample/local/lib/python2.7/site-packages/django/db/models/query.py", line 374, in get
        num = len(clone)
      File "/home/velotio/Envs/django_zappa_sample/local/lib/python2.7/site-packages/django/db/models/query.py", line 232, in __len__
        self._fetch_all()
      File "/home/velotio/Envs/django_zappa_sample/local/lib/python2.7/site-packages/django/db/models/query.py", line 1118, in _fetch_all
        self._result_cache = list(self._iterable_class(self))
      File "/home/velotio/Envs/django_zappa_sample/local/lib/python2.7/site-packages/django/db/models/query.py", line 53, in __iter__
        results = compiler.execute_sql(chunked_fetch=self.chunked_fetch)
      File "/home/velotio/Envs/django_zappa_sample/local/lib/python2.7/site-packages/django/db/models/sql/compiler.py", line 899, in execute_sql
        raise original_exception
    OperationalError: no such table: django_session
    [20/May/2018 14:59:23] "GET /admin/ HTTP/1.1" 500 153553
    Not Found: /favicon.ico

    To fix this, you need to run the migrations against your database so that essential tables like auth_user, django_session, etc. are created before any request is made to the server.

    $ python manage.py migrate 

    Operations to perform:
      Apply all migrations: admin, auth, contenttypes, sessions
    Running migrations:
      Applying contenttypes.0001_initial... OK
      Applying auth.0001_initial... OK
      Applying admin.0001_initial... OK
      Applying admin.0002_logentry_remove_auto_add... OK
      Applying contenttypes.0002_remove_content_type_name... OK
      Applying auth.0002_alter_permission_name_max_length... OK
      Applying auth.0003_alter_user_email_max_length... OK
      Applying auth.0004_alter_user_username_opts... OK
      Applying auth.0005_alter_user_last_login_null... OK
      Applying auth.0006_require_contenttypes_0002... OK
      Applying auth.0007_alter_validators_add_error_messages... OK
      Applying auth.0008_alter_user_username_max_length... OK
      Applying sessions.0001_initial... OK

    NOTE: Use the DATABASES setting in the project settings file to configure the database that your Django application should use once it is hosted on AWS Lambda. By default, it is configured to create a local SQLite database file as the backend.
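
    As an illustration of the note above, a DATABASES setting that points at an external database server could look roughly like this. It is only a sketch: PostgreSQL is an assumed choice (it needs a driver such as psycopg2 installed), and the environment variable names and defaults are hypothetical.

```python
import os

# Sketch of a settings.py fragment. The DB_* environment variable names and
# their fallback values are hypothetical; pick whatever fits your deployment.
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.postgresql",
        "NAME": os.environ.get("DB_NAME", "django_zappa_sample"),
        "USER": os.environ.get("DB_USER", ""),
        "PASSWORD": os.environ.get("DB_PASSWORD", ""),
        "HOST": os.environ.get("DB_HOST", "localhost"),
        "PORT": os.environ.get("DB_PORT", "5432"),
    }
}
```

    Reading the values from the environment keeps credentials out of the repository and lets the same settings file serve both local runs and the deployed stage.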

    You can run the server again and it should now load the admin panel of your website.

    Before moving forward, verify that the zappa Python package is installed in your virtual environment.

    Configuring Zappa Settings

    Deploying with Zappa is simple: it only needs a configuration file, and the rest is managed by Zappa. To create this configuration file, run the following from your project root directory:

    $ zappa init 

    ███████╗ █████╗ ██████╗ ██████╗  █████╗
    ╚══███╔╝██╔══██╗██╔══██╗██╔══██╗██╔══██╗
      ███╔╝ ███████║██████╔╝██████╔╝███████║
     ███╔╝  ██╔══██║██╔═══╝ ██╔═══╝ ██╔══██║
    ███████╗██║  ██║██║     ██║     ██║  ██║
    ╚══════╝╚═╝  ╚═╝╚═╝     ╚═╝     ╚═╝  ╚═╝
    
    Welcome to Zappa!
    
    Zappa is a system for running server-less Python web applications on AWS Lambda and AWS API Gateway.
    This `init` command will help you create and configure your new Zappa deployment.
    Let's get started!
    
    Your Zappa configuration can support multiple production stages, like 'dev', 'staging', and 'production'.
    What do you want to call this environment (default 'dev'): 
    
    AWS Lambda and API Gateway are only available in certain regions. Let's check to make sure you have a profile set up in one that will work.
    We found the following profiles: default, and hdx. Which would you like us to use? (default 'default'): 
    
    Your Zappa deployments will need to be uploaded to a private S3 bucket.
    If you don't have a bucket yet, we'll create one for you too.
    What do you want call your bucket? (default 'zappa-108wqhyn4'): django-zappa-sample-bucket
    
    It looks like this is a Django application!
    What is the module path to your projects's Django settings?
    We discovered: django_zappa_sample.settings
    Where are your project's settings? (default 'django_zappa_sample.settings'): 
    
    You can optionally deploy to all available regions in order to provide fast global service.
    If you are using Zappa for the first time, you probably don't want to do this!
    Would you like to deploy this application globally? (default 'n') [y/n/(p)rimary]: n
    
    Okay, here's your zappa_settings.json:
    
    {
        "dev": {
            "aws_region": "us-east-1", 
            "django_settings": "django_zappa_sample.settings", 
            "profile_name": "default", 
            "project_name": "django-zappa-sa", 
            "runtime": "python2.7", 
            "s3_bucket": "django-zappa-sample-bucket"
        }
    }
    
    Does this look okay? (default 'y') [y/n]: y
    
    Done! Now you can deploy your Zappa application by executing:
    
    	$ zappa deploy dev
    
    After that, you can update your application code with:
    
    	$ zappa update dev
    
    To learn more, check out our project page on GitHub here: https://github.com/Miserlou/Zappa
    and stop by our Slack channel here: https://slack.zappa.io
    
    Enjoy!,
     ~ Team Zappa!

    You can verify that zappa_settings.json was generated in your project root directory.
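
    One way to check the generated file is to load it and confirm that the stage contains the keys printed by zappa init. The sketch below does this for a settings document passed in as text. The function name is hypothetical, and treating every key from the generated file as required is an assumption made for this example, not a rule taken from Zappa.

```python
import json

# Keys that `zappa init` wrote for the "dev" stage above. Treating all of them
# as required is an assumption made for this sketch, not a rule from Zappa.
EXPECTED_KEYS = {
    "aws_region", "django_settings", "profile_name",
    "project_name", "runtime", "s3_bucket",
}


def missing_keys(settings_text, stage="dev"):
    """Return the expected keys that are absent from the given stage, sorted."""
    settings = json.loads(settings_text)
    if stage not in settings:
        raise KeyError("stage %r not found in settings" % stage)
    return sorted(EXPECTED_KEYS - set(settings[stage]))


sample = """
{
    "dev": {
        "aws_region": "us-east-1",
        "django_settings": "django_zappa_sample.settings",
        "profile_name": "default",
        "project_name": "django-zappa-sa",
        "runtime": "python2.7",
        "s3_bucket": "django-zappa-sample-bucket"
    }
}
"""
print(missing_keys(sample))  # [] when every expected key is present
```

    To check the real file, pass its contents instead of the sample, for example missing_keys(open("zappa_settings.json").read()).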

    TIP: The virtual environment name should not be the same as the Zappa project name, as this may cause errors.

    Additionally, you can specify other settings in the zappa_settings.json file as required, using Zappa's Advanced Settings.

    Now, you’re ready to deploy!

    IAM Permissions

    In order to deploy the Django application to Lambda/API Gateway, set up an IAM role (e.g. ZappaLambdaExecutionRole) with the following permissions:

    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "iam:AttachRolePolicy",
                    "iam:CreateRole",
                    "iam:GetRole",
                    "iam:PutRolePolicy"
                ],
                "Resource": [
                    "*"
                ]
            },
            {
                "Effect": "Allow",
                "Action": [
                    "iam:PassRole"
                ],
                "Resource": [
                    "arn:aws:iam::<account-id>:role/*-ZappaLambdaExecutionRole"
                ]
            },
            {
                "Effect": "Allow",
                "Action": [
                    "apigateway:DELETE",
                    "apigateway:GET",
                    "apigateway:PATCH",
                    "apigateway:POST",
                    "apigateway:PUT",
                    "events:DeleteRule",
                    "events:DescribeRule",
                    "events:ListRules",
                    "events:ListTargetsByRule",
                    "events:ListRuleNamesByTarget",
                    "events:PutRule",
                    "events:PutTargets",
                    "events:RemoveTargets",
                    "lambda:AddPermission",
                    "lambda:CreateFunction",
                    "lambda:DeleteFunction",
                    "lambda:GetFunction",
                    "lambda:GetPolicy",
                    "lambda:ListVersionsByFunction",
                    "lambda:RemovePermission",
                    "lambda:UpdateFunctionCode",
                    "lambda:UpdateFunctionConfiguration",
                    "cloudformation:CreateStack",
                    "cloudformation:DeleteStack",
                    "cloudformation:DescribeStackResource",
                    "cloudformation:DescribeStacks",
                    "cloudformation:ListStackResources",
                    "cloudformation:UpdateStack",
                    "logs:DescribeLogStreams",
                    "logs:FilterLogEvents",
                    "route53:ListHostedZones",
                    "route53:ChangeResourceRecordSets",
                    "route53:GetHostedZone",
                    "s3:CreateBucket"
                ],
                "Resource": [
                    "*"
                ]
            },
            {
                "Effect": "Allow",
                "Action": [
                    "s3:ListBucket"
                ],
                "Resource": [
                    "arn:aws:s3:::<bucket-name>"
                ]
            },
            {
                "Effect": "Allow",
                "Action": [
                    "s3:DeleteObject",
                    "s3:GetObject",
                    "s3:PutObject",
                    "s3:CreateMultipartUpload",
                    "s3:AbortMultipartUpload",
                    "s3:ListMultipartUploadParts",
                    "s3:ListBucketMultipartUploads"
                ],
                "Resource": [
                    "arn:aws:s3:::<bucket-name>/*"
                ]
            }
        ]
    }
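
    The last two statements of the policy are meant to be scoped to the S3 bucket that Zappa uploads to, so their Resource values need the bucket name filled in. The following Python sketch shows how those two values are formed for a given bucket. The function name is hypothetical, and the bucket passed in is the one chosen during zappa init earlier in this post.

```python
import json


def s3_statements(bucket_name):
    """Build the two bucket-scoped statements of the policy for one bucket."""
    bucket_arn = "arn:aws:s3:::" + bucket_name
    return [
        # Listing applies to the bucket itself.
        {"Effect": "Allow", "Action": ["s3:ListBucket"], "Resource": [bucket_arn]},
        # Object-level actions apply to everything inside the bucket.
        {
            "Effect": "Allow",
            "Action": [
                "s3:DeleteObject",
                "s3:GetObject",
                "s3:PutObject",
                "s3:CreateMultipartUpload",
                "s3:AbortMultipartUpload",
                "s3:ListMultipartUploadParts",
                "s3:ListBucketMultipartUploads",
            ],
            "Resource": [bucket_arn + "/*"],
        },
    ]


print(json.dumps(s3_statements("django-zappa-sample-bucket"), indent=4))
```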

    Deploying Django Application

    Before deploying the application, ensure that the IAM role is set in the config JSON as follows:

    {
        "dev": {
            ...
            "manage_roles": false, // Disable Zappa client managing roles.
            "role_name": "MyLambdaRole", // Name of your Zappa execution role. Optional, default: <project_name>-<env>-ZappaExecutionRole.
            "role_arn": "arn:aws:iam::12345:role/app-ZappaLambdaExecutionRole", // ARN of your Zappa execution role. Optional.
            ...
        },
        ...
    }

    Once your settings are configured, you can package and deploy your application to a stage called “dev” with a single command:

    $ zappa deploy dev

    Calling deploy for stage dev..
    Downloading and installing dependencies..
    Packaging project as zip.
    Uploading django-zappa-sa-dev-1526831069.zip (10.9MiB)..
    100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 11.4M/11.4M [01:02<00:00, 75.3KB/s]
    Scheduling..
    Scheduled django-zappa-sa-dev-zappa-keep-warm-handler.keep_warm_callback with expression rate(4 minutes)!
    Uploading django-zappa-sa-dev-template-1526831157.json (1.6KiB)..
    100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.60K/1.60K [00:02<00:00, 792B/s]
    Waiting for stack django-zappa-sa-dev to create (this can take a bit)..
    100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:11<00:00,  2.92s/res]
    Deploying API Gateway..
    Deployment complete!: https://akg59b222b.execute-api.us-east-1.amazonaws.com/dev

    You should see that your Zappa deployment completed successfully, along with the API Gateway URL created for your application.
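
    A quick way to confirm the deployment responds is to request a known page, such as the Django admin, under the URL that Zappa printed. The sketch below is one possible smoke test using only the standard library. The function names are hypothetical, and the opener parameter exists only so the check can be exercised without network access.

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def admin_url(base_url):
    """Append the Django admin path to the stage URL printed by Zappa."""
    return base_url.rstrip("/") + "/admin/"


def fetch_status(url, opener=urlopen):
    """Return the HTTP status code for url, or None if it cannot be reached."""
    try:
        response = opener(url)
        try:
            return response.getcode()
        finally:
            response.close()
    except HTTPError as error:
        # The server answered, but with an error status such as 403 or 500.
        return error.code
    except URLError:
        # No answer at all (DNS failure, refused connection, and so on).
        return None


# Replace this with the URL from your own `zappa deploy` output.
BASE_URL = "https://akg59b222b.execute-api.us-east-1.amazonaws.com/dev"
print(admin_url(BASE_URL))
# fetch_status(admin_url(BASE_URL)) would perform the actual request.
```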

    Troubleshooting

    1. If you see the following error during deployment, it is probably because you do not have sufficient privileges to deploy to AWS Lambda. Ensure your IAM role has all the permissions described above, or set “manage_roles” to true so that Zappa can create and manage the IAM role for you.

    Calling deploy for stage dev..
    Creating django-zappa-sa-dev-ZappaLambdaExecutionRole IAM Role..
    Error: Failed to manage IAM roles!
    You may lack the necessary AWS permissions to automatically manage a Zappa execution role.
    To fix this, see here: https://github.com/Miserlou/Zappa#using-custom-aws-iam-roles-and-policies

    2. The error below occurs when “events.amazonaws.com” is not listed as a Trusted Entity for your IAM role. You can add it, or set the “keep_warm” parameter to false in your Zappa settings file. Note that the deployment is left only partially deployed, because it terminated abnormally.

    Downloading and installing dependencies..
    100%|████████████████████████████████████████████| 44/44 [00:05<00:00, 7.92pkg/s]
    Packaging project as zip..
    Uploading django-zappa-sample-dev-1482817370.zip (8.8MiB)..
    100%|█████████████████████████████████████████| 9.22M/9.22M [00:17<00:00, 527KB/s]
    Scheduling...
    Oh no! An error occurred! :(
    
    ==============
    
    Traceback (most recent call last):
      File "/home/velotio/Envs/django_zappa_sample/local/lib/python2.7/site-packages/zappa/cli.py", line 2610, in handle
        sys.exit(cli.handle())
      File "/home/velotio/Envs/django_zappa_sample/local/lib/python2.7/site-packages/zappa/cli.py", line 505, in handle
        self.dispatch_command(self.command, stage)
      File "/home/velotio/Envs/django_zappa_sample/local/lib/python2.7/site-packages/zappa/cli.py", line 539, in dispatch_command
        self.deploy(self.vargs['zip'])
      File "/home/velotio/Envs/django_zappa_sample/local/lib/python2.7/site-packages/zappa/cli.py", line 800, in deploy
        self.zappa.add_binary_support(api_id=api_id, cors=self.cors)
      File "/home/velotio/Envs/django_zappa_sample/local/lib/python2.7/site-packages/zappa/core.py", line 1490, in add_binary_support
        restApiId=api_id
      File "/home/velotio/Envs/django_zappa_sample/local/lib/python2.7/site-packages/botocore/client.py", line 314, in _api_call
        return self._make_api_call(operation_name, kwargs)
      File "/home/velotio/Envs/django_zappa_sample/local/lib/python2.7/site-packages/botocore/client.py", line 612, in _make_api_call
        raise error_class(parsed_response, operation_name)
    ClientError: An error occurred (ValidationError) when calling the PutRole operation: Provided role 'arn:aws:iam:484375727565:role/lambda_basic_execution' cannot be assumed by principal
    'events.amazonaws.com'.
    
    ==============
    
    Need help? Found a bug? Let us know! :D
    File bug reports on GitHub here: https://github.com/Miserlou/Zappa
    And join our Slack channel here: https://slack.zappa.io
    Love!,
    ~ Team Zappa!
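
    The Trusted Entity mentioned in this item lives in the role's trust policy, which is a separate JSON document from the permissions policy. As a rough sketch, such a document can be built as follows. Only events.amazonaws.com comes from the error message; the function name is hypothetical, and including the Lambda and API Gateway principals is an assumption about what the role needs in a Zappa deployment.

```python
import json


def trust_policy(services):
    """Build an IAM trust policy that lets the given AWS services assume a role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": sorted(services)},
                "Action": "sts:AssumeRole",
            }
        ],
    }


# events.amazonaws.com is the principal named in the error above. Including
# lambda.amazonaws.com and apigateway.amazonaws.com is an assumption about
# which other services need to assume the role in a Zappa deployment.
policy = trust_policy(
    ["lambda.amazonaws.com", "apigateway.amazonaws.com", "events.amazonaws.com"]
)
print(json.dumps(policy, indent=4))
```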

    3. Adding the “keep_warm” parameter and then running zappa update will cause the error below. As you can see, it says “Stack 'django-zappa-sa-dev' does not exist”, because the previous deployment was unsuccessful. To fix this, delete the Lambda function from the console and rerun the deployment.

    Downloading and installing dependencies..
    100%|████████████████████████████████████████████| 44/44 [00:05<00:00, 7.92pkg/s]
    Packaging project as zip..
    Uploading django-zappa-sample-dev-1482817370.zip (8.8MiB)..
    100%|█████████████████████████████████████████| 9.22M/9.22M [00:17<00:00, 527KB/s]
    Updating Lambda function code..
    Updating Lambda function configuration..
    Uploading django-zappa-sample-dev-template-1482817403.json (1.5KiB)..
    100%|████████████████████████████████████████| 1.56K/1.56K [00:00<00:00, 6.56KB/s]
    CloudFormation stack missing, re-deploy to enable updates
    ERROR:Could not get API ID.
    Traceback (most recent call last):
      File "/home/velotio/Envs/django_zappa_sample/local/lib/python2.7/site-packages/zappa/cli.py", line 2610, in handle
        sys.exit(cli.handle())
      File "/home/velotio/Envs/django_zappa_sample/local/lib/python2.7/site-packages/zappa/cli.py", line 505, in handle
        self.dispatch_command(self.command, stage)
      File "/home/velotio/Envs/django_zappa_sample/local/lib/python2.7/site-packages/zappa/cli.py", line 539, in dispatch_command
        self.deploy(self.vargs['zip'])
      File "/home/velotio/Envs/django_zappa_sample/local/lib/python2.7/site-packages/zappa/cli.py", line 800, in deploy
        self.zappa.add_binary_support(api_id=api_id, cors=self.cors)
      File "/home/velotio/Envs/django_zappa_sample/local/lib/python2.7/site-packages/zappa/core.py", line 1490, in add_binary_support
        restApiId=api_id
      File "/home/velotio/Envs/django_zappa_sample/local/lib/python2.7/site-packages/botocore/client.py", line 314, in _api_call
        return self._make_api_call(operation_name, kwargs)
      File "/home/velotio/Envs/django_zappa_sample/local/lib/python2.7/site-packages/botocore/client.py", line 612, in _make_api_call
        raise error_class(parsed_response, operation_name)
    ClientError: An error occurred (ValidationError) when calling the DescribeStackResource operation: Stack 'django-zappa-sa-dev' does not exist
    Deploying API Gateway..
    Oh no! An error occurred! :(
    
    ==============
    
    Traceback (most recent call last):
      File "/home/velotio/Envs/django_zappa_sample/local/lib/python2.7/site-packages/zappa/cli.py", line 1847, in handle
        sys.exit(cli.handle())
      File "/home/velotio/Envs/django_zappa_sample/local/lib/python2.7/site-packages/zappa/cli.py", line 345, in handle
        self.dispatch_command(self.command, environment)
      File "/home/velotio/Envs/django_zappa_sample/local/lib/python2.7/site-packages/zappa/cli.py", line 379, in dispatch_command
        self.update()
      File "/home/velotio/Envs/django_zappa_sample/local/lib/python2.7/site-packages/zappa/cli.py", line 605, in update
        endpoint_url = self.deploy_api_gateway(api_id)
      File "/home/velotio/Envs/django_zappa_sample/local/lib/python2.7/site-packages/zappa/cli.py", line 1816, in deploy_api_gateway
        cloudwatch_metrics_enabled=self.zappa_settings[self.api_stage].get('cloudwatch_metrics_enabled', False),
      File "/home/velotio/Envs/django_zappa_sample/local/lib/python2.7/site-packages/zappa/zappa.py", line 1014, in deploy_api_gateway
        variables=variables or {}
      File "/home/velotio/Envs/django_zappa_sample/local/lib/python2.7/site-packages/botocore/client.py", line 251, in _api_call
        return self._make_api_call(operation_name, kwargs)
      File "/home/velotio/Envs/django_zappa_sample/local/lib/python2.7/site-packages/botocore/client.py", line 513, in _make_api_call
        api_params, operation_model, context=request_context)
      File "/home/velotio/Envs/django_zappa_sample/local/lib/python2.7/site-packages/botocore/client.py", line 566, in _convert_to_request_dict
        api_params, operation_model)
      File "/home/velotio/Envs/django_zappa_sample/local/lib/python2.7/site-packages/botocore/validate.py", line 270, in serialize_to_request
        raise ParamValidationError(report=report.generate_report())
    ParamValidationError: Parameter validation failed:
    Invalid type for parameter restApiId, value: None, type: <type 'NoneType'>, valid types: <type 'basestring'>
    
    ==============
    
    Need help? Found a bug? Let us know! :D
    File bug reports on GitHub here: https://github.com/Miserlou/Zappa
    And join our Slack channel here: https://slack.zappa.io
    Love!,
    ~ Team Zappa!

    4. If you run into a distribution error such as the AttributeError in the traceback below, try downgrading pip to version 9.0.1.

    $ pip install pip==9.0.1   

    Calling deploy for stage dev..
    Downloading and installing dependencies..
    Oh no! An error occurred! :(
    
    ==============
    
    Traceback (most recent call last):
      File "/home/velotio/Envs/django_zappa_sample/local/lib/python2.7/site-packages/zappa/cli.py", line 2610, in handle
        sys.exit(cli.handle())
      File "/home/velotio/Envs/django_zappa_sample/local/lib/python2.7/site-packages/zappa/cli.py", line 505, in handle
        self.dispatch_command(self.command, stage)
      File "/home/velotio/Envs/django_zappa_sample/local/lib/python2.7/site-packages/zappa/cli.py", line 539, in dispatch_command
        self.deploy(self.vargs['zip'])
      File "/home/velotio/Envs/django_zappa_sample/local/lib/python2.7/site-packages/zappa/cli.py", line 709, in deploy
        self.create_package()
      File "/home/velotio/Envs/django_zappa_sample/local/lib/python2.7/site-packages/zappa/cli.py", line 2171, in create_package
        disable_progress=self.disable_progress
      File "/home/velotio/Envs/django_zappa_sample/local/lib/python2.7/site-packages/zappa/core.py", line 595, in create_lambda_zip
        installed_packages = self.get_installed_packages(site_packages, site_packages_64)
      File "/home/velotio/Envs/django_zappa_sample/local/lib/python2.7/site-packages/zappa/core.py", line 751, in get_installed_packages
        pip.get_installed_distributions()
    AttributeError: 'module' object has no attribute 'get_installed_distributions'
    
    ==============
    
    Need help? Found a bug? Let us know! :D
    File bug reports on GitHub here: https://github.com/Miserlou/Zappa
    And join our Slack channel here: https://slack.zappa.io
    Love!,
     ~ Team Zappa!
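The AttributeError in this traceback arises because Zappa calls pip.get_installed_distributions(), which pip stopped exposing at the top level from version 10 onward. Here is a minimal sketch for checking a version string before deploying. The helper name needs_pip_downgrade is hypothetical and is not part of Zappa or pip.

```python
def needs_pip_downgrade(pip_version: str) -> bool:
    """Return True if this pip version lacks pip.get_installed_distributions.

    Assumption: the top-level function disappeared in pip 10, so any major
    version of 10 or above is treated as incompatible with this Zappa release.
    """
    major = int(pip_version.split(".")[0])
    return major >= 10


if __name__ == "__main__":
    for version in ("9.0.1", "10.0.1", "18.1"):
        print(version, "-> downgrade needed:", needs_pip_downgrade(version))
```

In practice you would pass in pip.__version__ from the virtual environment that Zappa packages.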

    Likewise, if you run into a NotFoundException (Invalid REST API identifier), undeploy the Zappa stage and then deploy again.

    Calling deploy for stage dev..
    Downloading and installing dependencies..
    Packaging project as zip.
    Uploading django-zappa-sa-dev-1526830532.zip (10.9MiB)..
    100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 11.4M/11.4M [00:42<00:00, 331KB/s]
    Scheduling..
    Scheduled django-zappa-sa-dev-zappa-keep-warm-handler.keep_warm_callback with expression rate(4 minutes)!
    Uploading django-zappa-sa-dev-template-1526830690.json (1.6KiB)..
    100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.60K/1.60K [00:01<00:00, 801B/s]
    Oh no! An error occurred! :(
    
    ==============
    
    Traceback (most recent call last):
      File "/home/velotio/Envs/django_zappa_sample/local/lib/python2.7/site-packages/zappa/cli.py", line 2610, in handle
        sys.exit(cli.handle())
      File "/home/velotio/Envs/django_zappa_sample/local/lib/python2.7/site-packages/zappa/cli.py", line 505, in handle
        self.dispatch_command(self.command, stage)
      File "/home/velotio/Envs/django_zappa_sample/local/lib/python2.7/site-packages/zappa/cli.py", line 539, in dispatch_command
        self.deploy(self.vargs['zip'])
      File "/home/velotio/Envs/django_zappa_sample/local/lib/python2.7/site-packages/zappa/cli.py", line 800, in deploy
        self.zappa.add_binary_support(api_id=api_id, cors=self.cors)
      File "/home/velotio/Envs/django_zappa_sample/local/lib/python2.7/site-packages/zappa/core.py", line 1490, in add_binary_support
        restApiId=api_id
      File "/home/velotio/Envs/django_zappa_sample/local/lib/python2.7/site-packages/botocore/client.py", line 314, in _api_call
        return self._make_api_call(operation_name, kwargs)
      File "/home/velotio/Envs/django_zappa_sample/local/lib/python2.7/site-packages/botocore/client.py", line 612, in _make_api_call
        raise error_class(parsed_response, operation_name)
    NotFoundException: An error occurred (NotFoundException) when calling the GetRestApi operation: Invalid REST API identifier specified 484375727565:akg59b222b
    
    ==============
    
    Need help? Found a bug? Let us know! :D
    File bug reports on GitHub here: https://github.com/Miserlou/Zappa
    And join our Slack channel here: https://slack.zappa.io
    Love!,
     ~ Team Zappa!

    TIP: To understand how your application works in a serverless environment, please visit this link.

    Post Deployment Setup

    Migrate database

    At this point, you should have an empty database for your Django application to fill up with a schema.

    $ zappa manage dev migrate

    Once you run the above command, the migrations are applied to the database specified in your Django settings.

    Creating Superuser of Django Application

    You might also need to create a new superuser in the database. You can use the following command from your project directory.

    $ zappa invoke --raw dev "from django.contrib.auth.models import User; User.objects.create_superuser('username', 'username@yourdomain.com', 'password')"

    Alternatively,

    $ python manage.py createsuperuser

    Note that your local settings must point to the same database, because this runs as a standard Django administration command (not a Zappa command).

    Managing static files

    Your Django application will depend on static files; the Django admin panel, for example, uses a combination of JS, CSS and image files.

    NOTE: Zappa is for running your application code, not for serving static web assets. If you plan on serving custom static assets in your web application (CSS/JavaScript/images/etc.), you’ll likely want to use a combination of AWS S3 and AWS CloudFront.

    You will need to add the django-storages and boto packages to your virtual environment; they manage file transfers to and from S3.

    $ pip install django-storages boto
    
    Add django-storages to your INSTALLED_APPS in settings.py:
    
    INSTALLED_APPS = (
        ...,
        'storages',
    )
    
    Configure django-storages in settings.py as follows:
    
    AWS_STORAGE_BUCKET_NAME = 'django-zappa-sample-bucket'
    AWS_S3_CUSTOM_DOMAIN = '%s.s3.amazonaws.com' % AWS_STORAGE_BUCKET_NAME
    STATIC_URL = "https://%s/" % AWS_S3_CUSTOM_DOMAIN
    STATICFILES_STORAGE = 'storages.backends.s3boto.S3BotoStorage'
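To see which URLs these settings yield, the string formatting can be reproduced outside Django. This is only a sketch: the helper static_file_url and the example path admin/css/base.css are illustrative assumptions and are not part of django-storages.

```python
# Same values as in the settings above.
AWS_STORAGE_BUCKET_NAME = "django-zappa-sample-bucket"
AWS_S3_CUSTOM_DOMAIN = "%s.s3.amazonaws.com" % AWS_STORAGE_BUCKET_NAME
STATIC_URL = "https://%s/" % AWS_S3_CUSTOM_DOMAIN


def static_file_url(relative_path: str) -> str:
    """Join STATIC_URL and a relative asset path (hypothetical helper)."""
    return STATIC_URL + relative_path.lstrip("/")


if __name__ == "__main__":
    print(static_file_url("admin/css/base.css"))
```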

    Once you have set up the Django application to serve static files from AWS S3, run the following command to upload the static files from your project to S3.

    $ python manage.py collectstatic --noinput

    or

    $ zappa update dev
    $ zappa manage dev "collectstatic --noinput"

    Check that at least 61 static files have been moved to the S3 bucket; the admin panel alone is built on 61 static files.

    NOTE: STATICFILES_DIRS must be configured properly so that your files are collected from the appropriate location.

    Tip: To render static files in your templates, load the static tag library with {% load static %} and build each URL with the {% static %} tag.

    Setting Up API Gateway

    To connect to your Django application, you also need to make sure an API Gateway is set up for your AWS Lambda function. You need GET methods set up for all the URL resources used in your Django application. Alternatively, you can set up a proxy method so that all subresources are processed through one API method.

    Go to the AWS Lambda function console and add API Gateway from ‘Add triggers’.

    1. Configure API, Deployment Stage, and Security for API Gateway. Click Save once it is done.

    2. Go to the API Gateway console and:

    a. Recreate the ANY method for the / resource.

    i. Check `Use Lambda Proxy integration`.

    ii. Set `Lambda Region` and `Lambda Function`, then `Save`.

    b. Recreate the ANY method for the /{proxy+} resource.

    i. Select `Lambda Function Proxy`.

    ii. Set `Lambda Region` and `Lambda Function`, then `Save`.

    3. Click on Action and select Deploy API. Set the Deployment Stage and click Deploy.

    4. Ensure that the GET and POST methods for / and Proxy are set as Override for this method.
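The two resources configured in these steps split incoming paths between them: / handles only the root, while the greedy /{proxy+} resource handles every sub-path. Here is a rough sketch of that routing rule. matching_resource is a hypothetical helper, not an AWS API.

```python
def matching_resource(path: str) -> str:
    """Return which of the two API Gateway resources would receive `path`.

    Assumption: only "/" and "/{proxy+}" are defined, as in the steps here.
    """
    if path in ("", "/"):
        return "/"
    # Any non-empty sub-path is captured by the greedy proxy variable.
    return "/{proxy+}"


if __name__ == "__main__":
    for p in ("/", "/admin/", "/static/css/site.css"):
        print(p, "->", matching_resource(p))
```

This is why both resources need a Lambda integration: without the root method, requests to the site's home page would not reach Django.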

    Setting Up Custom SSL Endpoint

    Optionally, you could also set up your own custom defined SSL endpoint with Zappa and install your certificate with your domain by running certify with Zappa. 

    $ zappa certify dev
    
    ...
    "certificate_arn": "arn:aws:acm:us-east-1:xxxxxxxxxxxx:certificate/xxxxxxxxxxxx-xxxxxx-xxxx-xxxx-xxxxxxxxxxxxxx",
    "domain": "django-zappa-sample.com"

    Now you are ready to launch your Django Application hosted on AWS Lambda.

    Additional Notes:

    • Once deployed, you must run “zappa update <stage-name>” to update your already hosted AWS Lambda function.
    • You can check server logs for investigation by running the “zappa tail” command.
    • To un-deploy your application, simply run: `zappa undeploy <stage-name>`

    You’ve seen how to deploy a Django application on AWS Lambda using Zappa. If you are creating your Django application for the first time, you might also want to read Edgar Roman’s Django Zappa Guide.

    Start building your Django application and let us know in the comments if you need any help during your application deployment over AWS Lambda.

  • Exploring OpenAI Gym: A Platform for Reinforcement Learning Algorithms

    Introduction 

    According to the OpenAI Gym GitHub repository “OpenAI Gym is a toolkit for developing and comparing reinforcement learning algorithms. This is the gym open-source library, which gives you access to a standardized set of environments.”

    OpenAI Gym has an environment-agent arrangement: an “agent” performs specific actions in an “environment” that Gym provides and, in return, receives an observation and a reward as a consequence of each action.

    There are four values that are returned by the environment for every “step” taken by the agent.

    1. Observation (object): an environment-specific object representing your observation of the environment, for example the board state in a board game.
    2. Reward (float): the amount of reward/score achieved by the previous action. The scale varies between environments, but the goal is always to increase your total reward/score.
    3. Done (boolean): whether it’s time to reset the environment again, e.g. you lost your last life in the game.
    4. Info (dict): diagnostic information useful for debugging. However, official evaluations of your agent are not allowed to use this for learning.
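To make the four return values concrete without installing Gym, here is a small sketch of the same step contract. CoinGuessEnv and run_episode are hypothetical stand-ins written for this illustration; they are not part of the Gym library.

```python
import random


class CoinGuessEnv:
    """Hypothetical toy environment that mimics the four-value step contract."""

    def __init__(self, max_steps=5, seed=0):
        self.max_steps = max_steps
        self.rng = random.Random(seed)
        self.steps_taken = 0

    def reset(self):
        self.steps_taken = 0
        return 0  # initial observation: nothing flipped yet

    def step(self, action):
        coin = self.rng.randrange(2)
        self.steps_taken += 1
        observation = coin                           # what the agent sees
        reward = 1.0 if action == coin else 0.0      # score for this action
        done = self.steps_taken >= self.max_steps    # episode finished?
        info = {"steps_taken": self.steps_taken}     # debugging details only
        return observation, reward, done, info


def run_episode(env):
    """Play one episode with random guesses and return the total reward."""
    env.reset()
    total, done = 0.0, False
    while not done:
        action = random.randrange(2)
        observation, reward, done, info = env.step(action)
        total += reward
    return total


if __name__ == "__main__":
    print("total reward:", run_episode(CoinGuessEnv()))
```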

    The following environments are available in Gym:

    1. Classic control and toy text
    2. Algorithmic
    3. Atari
    4. 2D and 3D robots

    Here you can find a full list of environments.

    Cart-Pole Problem

    Here we will try to solve a classic control problem from the Reinforcement Learning literature, “The Cart-pole Problem”.

    The Cart-pole problem is defined as follows:
    “A pole is attached by an un-actuated joint to a cart, which moves along a frictionless track. The system is controlled by applying a force of +1 or -1 to the cart. The pendulum starts upright, and the goal is to prevent it from falling over. A reward of +1 is provided for every timestep that the pole remains upright. The episode ends when the pole is more than 15 degrees from vertical, or the cart moves more than 2.4 units from the center.”
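The termination and reward rules in this description can be written down directly. The sketch below takes its thresholds from the quoted text (15 degrees and 2.4 units); the function names are illustrative, and the limits coded inside a given Gym release may differ.

```python
ANGLE_LIMIT_DEGREES = 15.0   # from the quoted problem statement
POSITION_LIMIT = 2.4         # distance from the center, same source


def episode_over(pole_angle_degrees: float, cart_position: float) -> bool:
    """True once the pole tilts too far or the cart leaves the track bounds."""
    return (abs(pole_angle_degrees) > ANGLE_LIMIT_DEGREES
            or abs(cart_position) > POSITION_LIMIT)


def total_reward(states):
    """Add +1 for every (angle, position) timestep before the episode ends."""
    reward = 0
    for angle, position in states:
        if episode_over(angle, position):
            break
        reward += 1
    return reward


if __name__ == "__main__":
    trajectory = [(1.0, 0.0), (5.0, 0.1), (16.0, 0.2), (0.0, 0.0)]
    print("reward:", total_reward(trajectory))
```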

    The following code quickly lets you see what the problem looks like on your computer.

    import gym
    env = gym.make('CartPole-v0')
    env.reset()
    for _ in range(1000):
        env.render()
        env.step(env.action_space.sample())

    This is what the output will look like:

    Coding the neural network 

    #We first import the necessary libraries and define hyperparameters - 
    
    import gym
    import random
    import numpy as np
    import tflearn
    from tflearn.layers.core import input_data, dropout, fully_connected
    from tflearn.layers.estimator import regression
    from statistics import median, mean
    from collections import Counter
    
    LR = 2.33e-4
    env = gym.make("CartPole-v0")
    observation = env.reset()
    goal_steps = 500
    score_requirement = 50
    initial_games = 10000
    
    #Now we will define a function to generate training data - 
    
    def initial_population():
        # [OBS, MOVES]
        training_data = []
        # all scores:
        scores = []
        # scores above our threshold:
        accepted_scores = []
        # number of episodes
        for _ in range(initial_games):
            score = 0
            # moves specifically from this episode:
            episode_memory = []
            # previous observation that we saw
            prev_observation = []
            for _ in range(goal_steps):
                # choose random action left or right i.e (0 or 1)
                action = random.randrange(0,2)
                observation, reward, done, info = env.step(action)
                # since that the observation is returned FROM the action
                # we store previous observation and corresponding action
                if len(prev_observation) > 0 :
                    episode_memory.append([prev_observation, action])
                prev_observation = observation
                score+=reward
                if done: break
    
            # reinforcement methodology here.
            # IF our score is higher than our threshold, we save
            # all we're doing is reinforcing the score, we're not trying
            # to influence the machine in any way as to HOW that score is
            # reached.
            if score >= score_requirement:
                accepted_scores.append(score)
                for data in episode_memory:
                    # convert to one-hot (this is the output layer for our neural network)
                    if data[1] == 1:
                        output = [0,1]
                    elif data[1] == 0:
                        output = [1,0]
    
                    # saving our training data
                    training_data.append([data[0], output])
    
            # reset env to play again
            env.reset()
            # save overall scores
            scores.append(score)
    
        # hand the collected samples back to the caller
        return training_data
    
    # Now using tflearn we will define our neural network 
    
    def neural_network_model(input_size):
    
        network = input_data(shape=[None, input_size, 1], name='input')
    
        network = fully_connected(network, 128, activation='relu')
        network = dropout(network, 0.8)
    
        network = fully_connected(network, 256, activation='relu')
        network = dropout(network, 0.8)
    
        network = fully_connected(network, 512, activation='relu')
        network = dropout(network, 0.8)
    
        network = fully_connected(network, 256, activation='relu')
        network = dropout(network, 0.8)
    
        network = fully_connected(network, 128, activation='relu')
        network = dropout(network, 0.8)
    
        network = fully_connected(network, 2, activation='softmax')
        network = regression(network, optimizer='adam', learning_rate=LR, loss='categorical_crossentropy', name='targets')
        model = tflearn.DNN(network, tensorboard_dir='log')
    
        return model
    
    #It is time to train the model now -
    
    def train_model(training_data, model=False):
    
        X = np.array([i[0] for i in training_data]).reshape(-1,len(training_data[0][0]),1)
        y = [i[1] for i in training_data]
    
        if not model:
            model = neural_network_model(input_size = len(X[0]))
    
        model.fit({'input': X}, {'targets': y}, n_epoch=5, snapshot_step=500, show_metric=True, run_id='openai_CartPole')
        return model
    
    training_data = initial_population()
    
    model = train_model(training_data)
    
    #Training complete, now we should play the game to see how the output looks like 
    
    scores = []
    choices = []
    for each_game in range(10):
        score = 0
        game_memory = []
        prev_obs = []
        env.reset()
        for _ in range(goal_steps):
            env.render()
    
            if len(prev_obs)==0:
                action = random.randrange(0,2)
            else:
                action = np.argmax(model.predict(prev_obs.reshape(-1,len(prev_obs),1))[0])
    
            choices.append(action)
    
            new_observation, reward, done, info = env.step(action)
            prev_obs = new_observation
            game_memory.append([new_observation, action])
            score+=reward
            if done: break
    
        scores.append(score)
    
    print('Average Score:',sum(scores)/len(scores))
    print('choice 1:{}  choice 0:{}'.format(float((choices.count(1))/float(len(choices)))*100,float((choices.count(0))/float(len(choices)))*100))
    print(score_requirement)

    This is what the result will look like:
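The data-selection step inside initial_population can also be tried in isolation, without Gym or tflearn. The sketch below keeps only the episodes that meet the score threshold and one-hot encodes each action. select_training_data, one_hot and the sample episodes are made up for illustration.

```python
def one_hot(action, num_actions=2):
    """Encode an action index as a one-hot list, e.g. 1 -> [0, 1]."""
    encoded = [0] * num_actions
    encoded[action] = 1
    return encoded


def select_training_data(episodes, score_requirement):
    """Keep samples from episodes whose score reaches the threshold.

    `episodes` is a list of (score, [(observation, action), ...]) tuples.
    """
    training_data = []
    for score, memory in episodes:
        if score < score_requirement:
            continue  # discard episodes below the threshold
        for observation, action in memory:
            training_data.append([observation, one_hot(action)])
    return training_data


if __name__ == "__main__":
    sample = [
        (60.0, [([0.1, 0.2, 0.3, 0.4], 1)]),   # kept
        (10.0, [([0.0, 0.0, 0.0, 0.0], 0)]),   # dropped
    ]
    print(select_training_data(sample, score_requirement=50))
```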

    Conclusion

    Though we haven’t used a Reinforcement Learning model in this blog, a plain fully connected neural network gave us a satisfactory accuracy of 60%. We used tflearn, a higher-level API on top of TensorFlow that speeds up experimentation. We hope that this blog will give you a head start in using OpenAI Gym.

    We are waiting to see exciting implementations using Gym and Reinforcement Learning. Happy Coding!

  • Extending Kubernetes APIs with Custom Resource Definitions (CRDs)

    Introduction:

    Custom resource definition (CRD) is a powerful feature introduced in Kubernetes 1.7 that enables users to add their own custom objects to a Kubernetes cluster and use them like any other native Kubernetes object. In this blog post, we will see how to add a custom resource to a Kubernetes cluster using the command line as well as the Golang client library, thereby also learning how to interact with a Kubernetes cluster programmatically.

    What is a Custom Resource Definition (CRD)?

    In the Kubernetes API, every resource is an endpoint that stores API objects of a certain kind. For example, the built-in service resource contains a collection of service objects. The standard Kubernetes distribution ships with many built-in API objects/resources. CRDs come into the picture when we want to introduce our own object into the Kubernetes cluster to fulfill our requirements. Once we create a CRD in Kubernetes, we can use it like any other native Kubernetes object, leveraging all the features of Kubernetes such as its CLI, security, API services, RBAC etc.

    The custom resource created is also stored in the etcd cluster with proper replication and lifecycle management. CRD allows us to use all the functionalities provided by a Kubernetes cluster for our custom objects and saves us the overhead of implementing them on our own.

    How to register a CRD using command line interface (CLI)

    Step-1: Create a CRD definition file sslconfig-crd.yaml

    apiVersion: "apiextensions.k8s.io/v1beta1"
    kind: "CustomResourceDefinition"
    metadata:
      name: "sslconfigs.blog.velotio.com"
    spec:
      group: "blog.velotio.com"
      version: "v1alpha1"
      scope: "Namespaced"
      names:
        plural: "sslconfigs"
        singular: "sslconfig"
        kind: "SslConfig"
      validation:
        openAPIV3Schema:
          required: ["spec"]
          properties:
            spec:
              required: ["cert","key","domain"]
              properties:
                cert:
                  type: "string"
                  minLength: 1
                key:
                  type: "string"
                  minLength: 1
                domain:
                  type: "string"
                  minLength: 1

    Here we are creating a custom resource definition for an object of kind SslConfig, which lets us store the SSL configuration for a domain. As the validation section shows, cert, key and domain are mandatory when creating objects of this kind; beyond these we can store other information, such as the provider of the certificate. The metadata name that we specify must be spec.names.plural + "." + spec.group.
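The naming rule is easy to check mechanically. Here is a small sketch with hypothetical helper names that builds the expected metadata name from the plural and the group and compares it with a given name:

```python
def crd_name(plural: str, group: str) -> str:
    """Build the CRD metadata name: <spec.names.plural>.<spec.group>."""
    return plural + "." + group


def is_valid_crd_name(name: str, plural: str, group: str) -> bool:
    """Check a metadata name against the plural and group it should combine."""
    return name == crd_name(plural, group)


if __name__ == "__main__":
    print(crd_name("sslconfigs", "blog.velotio.com"))
    print(is_valid_crd_name("sslconfig.blog.velotio.com",
                            "sslconfigs", "blog.velotio.com"))
```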

    An API group (blog.velotio.com here) is a collection of API objects that are logically related to each other. We have also specified a version for our custom objects (spec.version). Because the definition of the object is expected to evolve, it is better to start with alpha so that users of the object know the definition might change later. For scope we have specified Namespaced, so SslConfig objects live inside a namespace; the CustomResourceDefinition object itself is always cluster-scoped.

    # kubectl create -f sslconfig-crd.yaml
    # kubectl get crd
    NAME                          AGE
    sslconfigs.blog.velotio.com   5s

    Step-2: Create objects using the definition we created above, in a file crd-obj.yaml

    apiVersion: "blog.velotio.com/v1alpha1"
    kind: "SslConfig"
    metadata:
      name: "sslconfig-velotio.com"
    spec:
      cert: "my cert file"
      key: "my private key"
      domain: "*.velotio.com"
      provider: "digicert"

    # kubectl create -f crd-obj.yaml
    # kubectl get sslconfig
    NAME                    AGE
    sslconfig-velotio.com   12s

    Along with the mandatory fields cert, key and domain, we have also stored the provider (certifying authority) of the cert.
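Before submitting an object, it can help to check the mandatory fields locally. The sketch below approximates the required list from the CRD's validation section; the API server remains the authority. missing_spec_fields is a hypothetical helper, and treating empty strings as missing is an assumption made here for convenience.

```python
REQUIRED_SPEC_FIELDS = ("cert", "key", "domain")


def missing_spec_fields(obj: dict) -> list:
    """Return the required spec fields that are absent or empty.

    Local approximation only; the real check is done by the API server.
    """
    spec = obj.get("spec") or {}
    return [field for field in REQUIRED_SPEC_FIELDS if not spec.get(field)]


if __name__ == "__main__":
    ssl_config = {
        "apiVersion": "blog.velotio.com/v1alpha1",
        "kind": "SslConfig",
        "metadata": {"name": "sslconfig-velotio.com"},
        "spec": {
            "cert": "my cert file",
            "key": "my private key",
            "domain": "*.velotio.com",
            "provider": "digicert",  # optional extra field
        },
    }
    print("missing:", missing_spec_fields(ssl_config))
```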

    How to register a CRD programmatically using client-go

    The client-go project provides packages with which we can easily create a Go client and access a Kubernetes cluster. To create a client, we first need to establish a connection with the API server. How we connect depends on whether our code runs inside the Kubernetes cluster itself or outside it (locally).

    If the code is running outside the cluster, we need to provide either the path of the kubeconfig file or the URL of a Kubernetes proxy server running on the cluster.

    kubeconfig := filepath.Join(
    	os.Getenv("HOME"), ".kube", "config",
    )
    config, err := clientcmd.BuildConfigFromFlags("", kubeconfig)
    if err != nil {
    	log.Fatal(err)
    }

    OR

    var (
    	// Set during build
    	version string
    
    	proxyURL = flag.String("proxy", "",
    `If specified, it is assumed that a kubectl proxy server is running on the
    given url and creates a proxy client. In case it is not given InCluster
    kubernetes setup will be used`)
    )
    if *proxyURL != "" {
    	config, err = clientcmd.NewNonInteractiveDeferredLoadingClientConfig(
    		&clientcmd.ClientConfigLoadingRules{},
    		&clientcmd.ConfigOverrides{
    			ClusterInfo: clientcmdapi.Cluster{
    				Server: *proxyURL,
    			},
    		}).ClientConfig()
    	if err != nil {
    		glog.Fatalf("error creating client configuration: %v", err)
    	}
    }

    When the code runs as part of the cluster, we can simply use:

    import "k8s.io/client-go/rest"
    ...
    rest.InClusterConfig()
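The choice between the three connection modes can be summarized as a small decision function, sketched here in Python. The precedence order (proxy URL first, then in-cluster, then kubeconfig) is an assumption made for this illustration, and choose_config_source is a hypothetical name. The in-cluster check relies on the KUBERNETES_SERVICE_HOST and KUBERNETES_SERVICE_PORT variables that Kubernetes sets inside pods.

```python
import os


def choose_config_source(proxy_url="", env=None):
    """Return a (mode, location) pair describing how to reach the API server."""
    env = os.environ if env is None else env
    if proxy_url:
        # An explicit kubectl proxy URL wins.
        return ("proxy", proxy_url)
    if env.get("KUBERNETES_SERVICE_HOST") and env.get("KUBERNETES_SERVICE_PORT"):
        # Running inside a pod: use the in-cluster configuration.
        return ("in-cluster", None)
    # Otherwise fall back to the kubeconfig file in the home directory.
    home = env.get("HOME", "")
    return ("kubeconfig", os.path.join(home, ".kube", "config"))


if __name__ == "__main__":
    print(choose_config_source())
    print(choose_config_source(proxy_url="http://127.0.0.1:8001"))
```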

    Once the connection is established, we can use it to create a clientset. For accessing Kubernetes objects, the clientset from the client-go project is generally used, but for CRD-related operations we need the clientset from the apiextensions-apiserver project:

    apiextension "k8s.io/apiextensions-apiserver/pkg/client/clientset/clientset"

    kubeClient, err := apiextension.NewForConfig(config)
    if err != nil {
    	glog.Fatalf("Failed to create client: %v.", err)
    }

    Now we can use the client to make the API call which will create the CRD for us.

    package v1alpha1
    
    import (
    	"reflect"
    
    	apiextensionv1beta1 "k8s.io/apiextensions-apiserver/pkg/apis/apiextensions/v1beta1"
    	apiextension "k8s.io/apiextensions-apiserver/pkg/client/clientset/clientset"
    	apierrors "k8s.io/apimachinery/pkg/api/errors"
    	meta_v1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    )
    
    const (
    	CRDPlural   string = "sslconfigs"
    	CRDGroup    string = "blog.velotio.com"
    	CRDVersion  string = "v1alpha1"
    	FullCRDName string = CRDPlural + "." + CRDGroup
    )
    
    func CreateCRD(clientset apiextension.Interface) error {
    	crd := &apiextensionv1beta1.CustomResourceDefinition{
    		ObjectMeta: meta_v1.ObjectMeta{Name: FullCRDName},
    		Spec: apiextensionv1beta1.CustomResourceDefinitionSpec{
    			Group:   CRDGroup,
    			Version: CRDVersion,
    			Scope:   apiextensionv1beta1.NamespaceScoped,
    			Names: apiextensionv1beta1.CustomResourceDefinitionNames{
    				Plural: CRDPlural,
    				Kind:   reflect.TypeOf(SslConfig{}).Name(),
    			},
    		},
    	}
    
    	_, err := clientset.ApiextensionsV1beta1().CustomResourceDefinitions().Create(crd)
    	if err != nil && apierrors.IsAlreadyExists(err) {
    		return nil
    	}
    	return err
    }

    In the CreateCRD function, we first create the definition of our custom object and then pass it to the Create method, which creates it in our cluster. Just as we did when creating our definition using the CLI, here too we set parameters such as version, group and kind.

    Once our definition is ready we can create objects of its type just like we did earlier using the CLI. First we need to define our object.

    package v1alpha1
    
    import meta_v1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    
    type SslConfig struct {
    	meta_v1.TypeMeta   `json:",inline"`
    	meta_v1.ObjectMeta `json:"metadata"`
    	Spec               SslConfigSpec   `json:"spec"`
    	Status             SslConfigStatus `json:"status,omitempty"`
    }
    type SslConfigSpec struct {
    	Cert   string `json:"cert"`
    	Key    string `json:"key"`
    	Domain string `json:"domain"`
    }
    
    type SslConfigStatus struct {
    	State   string `json:"state,omitempty"`
    	Message string `json:"message,omitempty"`
    }
    
    type SslConfigList struct {
    	meta_v1.TypeMeta `json:",inline"`
    	meta_v1.ListMeta `json:"metadata"`
    	Items            []SslConfig `json:"items"`
    }

    Kubernetes API conventions suggest that each object must have two nested object fields that govern the object’s configuration: the object spec and the object status. Objects must also have metadata associated with them. The custom objects that we define here comply with these standards. It is also recommended to create a list type for every type, so we have also created an SslConfigList struct.

    Now we need to write a function which will create a custom client which is aware of the new resource that we have created.

    package v1alpha1
    
    import (
    	meta_v1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    	"k8s.io/apimachinery/pkg/runtime"
    	"k8s.io/apimachinery/pkg/runtime/schema"
    	"k8s.io/apimachinery/pkg/runtime/serializer"
    	"k8s.io/client-go/rest"
    )
    
    var SchemeGroupVersion = schema.GroupVersion{Group: CRDGroup, Version: CRDVersion}
    
    func addKnownTypes(scheme *runtime.Scheme) error {
    	scheme.AddKnownTypes(SchemeGroupVersion,
    		&SslConfig{},
    		&SslConfigList{},
    	)
    	meta_v1.AddToGroupVersion(scheme, SchemeGroupVersion)
    	return nil
    }
    
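    // NewClient returns a REST client that knows about the SslConfig types.
    func NewClient(cfg *rest.Config) (*SslConfigV1Alpha1Client, error) {
    	// Register our types in a fresh scheme so they can be encoded and decoded.
    	scheme := runtime.NewScheme()
    	schemeBuilder := runtime.NewSchemeBuilder(addKnownTypes)
    	if err := schemeBuilder.AddToScheme(scheme); err != nil {
    		return nil, err
    	}
    	// Work on a copy so the caller's config is left untouched.
    	config := *cfg
    	config.GroupVersion = &SchemeGroupVersion
    	config.APIPath = "/apis" // custom resources are served under /apis
    	config.ContentType = runtime.ContentTypeJSON
    	// Note: newer apimachinery releases name this type WithoutConversionCodecFactory.
    	config.NegotiatedSerializer = serializer.DirectCodecFactory{CodecFactory: serializer.NewCodecFactory(scheme)}
    	client, err := rest.RESTClientFor(&config)
    	if err != nil {
    		return nil, err
    	}
    	return &SslConfigV1Alpha1Client{restClient: client}, nil
    }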
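
    The APIPath and GroupVersion set on the config determine the URL that every request for our resource goes to. As a rough sketch of that layout for a namespaced custom resource (the resourcePath helper and the example.com group are hypothetical; the real group comes from the CRDGroup constant):

```go
package main

import (
	"fmt"
	"path"
)

// resourcePath is a hypothetical helper that mirrors the URL layout used for
// namespaced custom resources:
//   /apis/<group>/<version>/namespaces/<namespace>/<resource>/<name>
// An empty name yields the collection path.
func resourcePath(apiPath, group, version, namespace, resource, name string) string {
	return path.Join("/", apiPath, group, version, "namespaces", namespace, resource, name)
}

func main() {
	// "example.com" is an assumed group name used only for illustration.
	fmt.Println(resourcePath("/apis", "example.com", "v1alpha1", "default", "sslconfigs", "sslconfigobj"))
	// prints /apis/example.com/v1alpha1/namespaces/default/sslconfigs/sslconfigobj
}
```

    This is why the client methods only need to supply the namespace, the resource name "sslconfigs" and, for single objects, the object name.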

    Building the custom client library

    Once the custom resource definition is registered with the Kubernetes cluster, we can create objects of its type using the Kubernetes CLI, as we did earlier. To write controllers for these objects, or to build custom functionality around them, we also need a client library through which we can access them from the Go API. For native Kubernetes objects, such a library is already provided for each object.

    package v1alpha1
    
    import (
    	meta_v1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    	"k8s.io/client-go/rest"
    )
    
    func (c *SslConfigV1Alpha1Client) SslConfigs(namespace string) SslConfigInterface {
    	return &sslConfigclient{
    		client: c.restClient,
    		ns:     namespace,
    	}
    }
    
    type SslConfigV1Alpha1Client struct {
    	restClient rest.Interface
    }
    
    type SslConfigInterface interface {
    	Create(obj *SslConfig) (*SslConfig, error)
    	Update(obj *SslConfig) (*SslConfig, error)
    	Delete(name string, options *meta_v1.DeleteOptions) error
    	Get(name string) (*SslConfig, error)
    }
    
    type sslConfigclient struct {
    	client rest.Interface
    	ns     string
    }
    
    func (c *sslConfigclient) Create(obj *SslConfig) (*SslConfig, error) {
    	result := &SslConfig{}
    	err := c.client.Post().
    		Namespace(c.ns).Resource("sslconfigs").
    		Body(obj).Do().Into(result)
    	return result, err
    }
    
    func (c *sslConfigclient) Update(obj *SslConfig) (*SslConfig, error) {
    	result := &SslConfig{}
    	err := c.client.Put().
    		Namespace(c.ns).Resource("sslconfigs").
    		Body(obj).Do().Into(result)
    	return result, err
    }
    
    func (c *sslConfigclient) Delete(name string, options *meta_v1.DeleteOptions) error {
    	return c.client.Delete().
    		Namespace(c.ns).Resource("sslconfigs").
    		Name(name).Body(options).Do().
    		Error()
    }
    
    func (c *sslConfigclient) Get(name string) (*SslConfig, error) {
    	result := &SslConfig{}
    	err := c.client.Get().
    		Namespace(c.ns).Resource("sslconfigs").
    		Name(name).Do().Into(result)
    	return result, err
    }

    We can add more methods, such as watch or update status; their implementation would be similar to the methods defined above. To see the methods available for native Kubernetes objects like pod, node, etc., we can refer to the v1 package.
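
    One benefit of exposing the client through an interface is that code built on top of it can be exercised without a cluster. The following is a minimal sketch of an in-memory implementation for unit tests. Everything in it is an assumption made for illustration: SslConfig is trimmed to two fields, the interface is a simplified variant named SslConfigStore (Delete takes no options), and the error values are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// SslConfig is a trimmed stand-in for the article's type.
type SslConfig struct {
	Name   string
	Domain string
}

// Hypothetical error values for this sketch.
var (
	ErrNotFound = errors.New("sslconfig not found")
	ErrExists   = errors.New("sslconfig already exists")
)

// SslConfigStore is a simplified variant of SslConfigInterface.
type SslConfigStore interface {
	Create(obj *SslConfig) (*SslConfig, error)
	Update(obj *SslConfig) (*SslConfig, error)
	Delete(name string) error
	Get(name string) (*SslConfig, error)
}

// fakeClient keeps objects in a map guarded by a mutex.
type fakeClient struct {
	mu    sync.Mutex
	items map[string]SslConfig
}

// Compile-time check that fakeClient satisfies the interface.
var _ SslConfigStore = (*fakeClient)(nil)

func newFakeClient() *fakeClient {
	return &fakeClient{items: make(map[string]SslConfig)}
}

func (f *fakeClient) Create(obj *SslConfig) (*SslConfig, error) {
	f.mu.Lock()
	defer f.mu.Unlock()
	if _, ok := f.items[obj.Name]; ok {
		return nil, ErrExists
	}
	f.items[obj.Name] = *obj
	out := *obj // return a copy so callers cannot mutate stored state
	return &out, nil
}

func (f *fakeClient) Update(obj *SslConfig) (*SslConfig, error) {
	f.mu.Lock()
	defer f.mu.Unlock()
	if _, ok := f.items[obj.Name]; !ok {
		return nil, ErrNotFound
	}
	f.items[obj.Name] = *obj
	out := *obj
	return &out, nil
}

func (f *fakeClient) Delete(name string) error {
	f.mu.Lock()
	defer f.mu.Unlock()
	if _, ok := f.items[name]; !ok {
		return ErrNotFound
	}
	delete(f.items, name)
	return nil
}

func (f *fakeClient) Get(name string) (*SslConfig, error) {
	f.mu.Lock()
	defer f.mu.Unlock()
	item, ok := f.items[name]
	if !ok {
		return nil, ErrNotFound
	}
	return &item, nil
}

func main() {
	var store SslConfigStore = newFakeClient()
	if _, err := store.Create(&SslConfig{Name: "sslconfigobj", Domain: "*.velotio.com"}); err != nil {
		fmt.Println("create failed:", err)
		return
	}
	obj, err := store.Get("sslconfigobj")
	fmt.Println(obj.Domain, err)
}
```

    A controller written against the interface can then be tested with newFakeClient() and run in production with the REST-backed client.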

    Putting all things together

    Now, in our main function, we will put all of this together.

    package main
    
    import (
    	"flag"
    	"fmt"
    	"time"
    
    	"blog.velotio.com/crd-blog/v1alpha1"
    	"github.com/golang/glog"
    	apiextension "k8s.io/apiextensions-apiserver/pkg/client/clientset/clientset"
    	meta_v1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    	"k8s.io/client-go/rest"
    	"k8s.io/client-go/tools/clientcmd"
    	clientcmdapi "k8s.io/client-go/tools/clientcmd/api"
    )
    
    var (
    	// Set during build
    	version string
    
    	proxyURL = flag.String("proxy", "",
    		`If specified, it is assumed that a kubectl proxy server is running on the
    		given url and creates a proxy client. In case it is not given InCluster
    		kubernetes setup will be used`)
    )
    
    func main() {
    
    	flag.Parse()
    	var err error
    
    	var config *rest.Config
    	if *proxyURL != "" {
    		config, err = clientcmd.NewNonInteractiveDeferredLoadingClientConfig(
    			&clientcmd.ClientConfigLoadingRules{},
    			&clientcmd.ConfigOverrides{
    				ClusterInfo: clientcmdapi.Cluster{
    					Server: *proxyURL,
    				},
    			}).ClientConfig()
    		if err != nil {
    			glog.Fatalf("error creating client configuration: %v", err)
    		}
    	} else {
    		if config, err = rest.InClusterConfig(); err != nil {
    			glog.Fatalf("error creating client configuration: %v", err)
    		}
    	}
    
    	kubeClient, err := apiextension.NewForConfig(config)
    	if err != nil {
    		glog.Fatalf("Failed to create client: %v", err)
    	}
    	// Create the CRD
    	err = v1alpha1.CreateCRD(kubeClient)
    	if err != nil {
    		glog.Fatalf("Failed to create crd: %v", err)
    	}
    
    	// Wait for the CRD to be created before we use it.
    	time.Sleep(5 * time.Second)
    
    	// Create a new clientset which includes our CRD schema
    	crdclient, err := v1alpha1.NewClient(config)
    	if err != nil {
    		panic(err)
    	}
    
    	// Create a new SslConfig object
    
    	SslConfig := &v1alpha1.SslConfig{
    		ObjectMeta: meta_v1.ObjectMeta{
    			Name:   "sslconfigobj",
    			Labels: map[string]string{"mylabel": "test"},
    		},
    		Spec: v1alpha1.SslConfigSpec{
    			Cert:   "my-cert",
    			Key:    "my-key",
    			Domain: "*.velotio.com",
    		},
    		Status: v1alpha1.SslConfigStatus{
    			State:   "created",
    			Message: "Created, not processed yet",
    		},
    	}
    	// Create the SslConfig object defined above in the k8s cluster
    	resp, err := crdclient.SslConfigs("default").Create(SslConfig)
    	if err != nil {
    		fmt.Printf("error while creating object: %v\n", err)
    	} else {
    		fmt.Printf("object created: %v\n", resp)
    	}
    
    	obj, err := crdclient.SslConfigs("default").Get(SslConfig.ObjectMeta.Name)
    	if err != nil {
    		glog.Infof("error while getting the object %v\n", err)
    	}
    	fmt.Printf("SslConfig Objects Found: \n%v\n", obj)
    	select {}
    }

    Now if we run our code, the custom resource definition will be created in the Kubernetes cluster, along with an object of its type, just as with the CLI. The Docker image akash125/crdblog is built from the code discussed above; it can be pulled directly from Docker Hub and run in a Kubernetes cluster. Once the image is running successfully, the CRD that we discussed above will be created in the cluster along with an object of its type. We can verify this using the CLI the way we did earlier, and we can also check the logs of the pod running the Docker image. The complete code is available here.

    Conclusion and future work

    We learned how to create a custom resource definition and its objects using the Kubernetes command line interface as well as the Golang client. We also learned how to access a Kubernetes cluster programmatically, which lets us build some really cool stuff on Kubernetes. We can now also create custom controllers for our resources that continuously watch the cluster for the various life cycle events of our objects and take the desired action accordingly. To read more about CRDs, refer to the following links: