Tag: AWS

  • Serverless Computing: Predictions for 2017

    Serverless is the emerging trend in software architecture. 2016 was a very exciting year for serverless, and adoption will continue to explode in 2017. This post covers the interesting developments in the serverless space in 2016 and my thoughts on how this space will evolve in 2017.

    What is serverless?

    In my opinion, Serverless means two things:
    1. Serverless was initially used to describe fully hosted services or Backend-as-a-Service (BaaS) offerings where you fully depend on a 3rd party to manage the server-side logic and state. Examples include AWS S3 (storage), Auth0 (authentication), AWS DynamoDB (database), Firebase, etc.
    2. The popular interpretation of Serverless is Functions-as-a-Service (FaaS), where developers can upload code that is run within stateless compute containers that are triggered by a variety of events, are ephemeral, and are fully managed by the cloud platform. FaaS obviates the need to provision, manage, scale, or ensure the availability of your own servers. The most popular FaaS offering is AWS Lambda, but Microsoft Azure Functions, Auth0 Webtask, and Google Cloud Functions are fast-maturing offerings as well. IBM OpenWhisk and Iron.io also allow serverless environments to be set up on-premises.

    This post will focus more on FaaS and especially AWS Lambda since it is the leader in this space with the largest feature set.
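
    To make the FaaS model concrete, here is a minimal sketch of a Lambda handler written in Python, assuming a hypothetical S3 upload trigger (the processing logic is a placeholder; the event shape follows AWS's documented S3 event format):

    # Minimal AWS Lambda handler, invoked whenever a file lands in an S3 bucket
    def handler(event, context):
        for record in event["Records"]:
            bucket = record["s3"]["bucket"]["name"]
            key = record["s3"]["object"]["key"]
            # Placeholder business logic: resize an image, parse a log file, etc.
            print("New object uploaded: s3://{}/{}".format(bucket, key))
        return {"status": "done"}

    There is no server to provision here: the platform spins up a container when the event fires, runs the handler, and tears it down afterwards.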

    Serverless vs PaaS

    People are comparing serverless and Platform-as-a-Service (PaaS), which is an invalid comparison in my opinion. PaaS platforms give you the ability to deploy and manage your entire application or micro-service, and they help you scale your application infrastructure. Anyone who has used a PaaS knows that while it reduces administration, it does not do away with it. PaaS does not really require you to re-evaluate your code design. In PaaS, your application needs to be always on, though it can be scaled up or down.

    Serverless, on the other hand, is about “breaking up your application into discrete functions to obviate the need for any complex infrastructure or its management”. Serverless has several restrictions with respect to execution duration limits and state. It requires developers to architect differently, thinking about the discrete functionality of each component of the application.

    New features from AWS for Serverless

    1. Lambda@Edge: This feature allows Lambda code to be executed on global AWS Edge locations. Imagine you deploy your code to one region of AWS and are then able to run it in any of the 10s, and soon 100s, of AWS Edge locations around the globe. Imagine the reduction in network latency for your end users.
    2. Step Functions: Companies that adopt serverless soon end up with 10s or 100s of functions. The logic and workflow between these functions is hard to track and manage. Step Functions allows developers to create visual workflows to organize the various components, micro-services, events, and conditions. This is essentially a Rapid Application Development (RAD) product, and I expect AWS to build a fully functional enterprise RAD offering on top of this feature.
    3. API Gateway Monetisation: This is a big deal in my opinion. Several trends are at play: a) startups are increasingly API-first, b) enterprises and startups alike build 10s or 100s of integrations with APIs, c) billing is becoming fine-grained and based on API usage, and d) micro-services architectures rely on API contracts. This feature allows companies to start monetising their APIs via the AWS Marketplace, and the APIs can be implemented with AWS Lambda on the backend. I expect to see a lot of “data integration”, “data pipeline” and “data marketplace” companies try out this approach.
    4. AWS Greengrass: Extend AWS compute to devices. This feature enables running Lambda functions offline on devices in the field. This is another great feature which is extending the meaning of a “global cloud”. Most of the use-cases for this feature are in IoT space.
    5. Continuous Deployment for Serverless: AWS CodePipeline, CodeCommit, and CodeBuild support AWS Lambda and enable the creation of continuous integration and deployment pipelines. Jenkins and other CI tools also have some plugins for serverless which are improving all the time.
    6. API Gateway Developer Portal: Easily build your own developer portal with API documentation, with support for Swagger documentation.
    7. AWS X-Ray: Analyze and debug distributed applications, commonly used with micro-services architecture. X-Ray will soon get Lambda support and enable even easier debugging of Lambda logic flows across all the AWS services and triggers that are part of your workflow.

    Predictions

    • Your code will increasingly run closer to clients. With Lambda@Edge, Greengrass & Snowball Edge, you can truly “deploy once and run around the globe”. This is not just scaling horizontally but scaling up or down geographically. I expect customers to leverage this feature in some very interesting ways.
    • Serverless frameworks will mature and allow easy creation of simple REST applications. I also expect vertical specific serverless frameworks to evolve, especially for IoT.
    • Monitoring, logging, and debugging for serverless will improve, with both cloud vendor solutions and frameworks providing capabilities. A good example is IOPipe, which provides monitoring and logging capabilities by instrumenting your Lambda code.
    • I expect all the continuous integration and deployment tool vendors to add increasing support for serverless architectures in 2017. Automated testing for serverless will become easier via frameworks and CI tools.
    • With CloudFormation and Step Functions, AWS will try and solve the versioning and discovery problem for Lambda functions.
    • Mature patterns will emerge for serverless usage. For example, some very common use-cases today are image or video conversion based on S3 triggers, backends for messaging bots, data processing for IoT, etc. I expect to see more mature patterns and use-cases where serverless becomes an obvious solution. Expect lots of interesting white-papers and case studies from AWS to educate the market on emerging use-cases.
    • AWS will allow users to choose to keep some number of Lambda instances always-on for lower latency, at some extra cost. This will start addressing the latency issues that some heavy functions or frameworks entail.
    • API and data integration vendors will increasingly use Lambda and monetized API gateways. An early example is Cloud Elements, which has released a feature that allows publishing SaaS APIs as AWS Lambda functions.

    Serverless Frameworks

    There are open-source projects like Serverless, Chalice, Lambda Framework, Apex, and Gomix which add a level of abstraction over vendor serverless platforms. These frameworks make it easier to develop and deploy serverless components.

    Some of these frameworks plan to allow abstraction across FaaS offerings to avoid lock-in into a single platform. I do not expect to see any abstraction or standardisation that allows portability of serverless functions. Cloud vendors have unique FaaS offerings with different implementations, and the triggers/events are based on their own proprietary services (for example, AWS has S3, Kinesis, DynamoDB, etc. triggers).

    Bots, mobile app backends, IoT backends, data processing/ETL, scheduled jobs, and any other event-driven use-cases are a perfect fit for serverless.

    Conclusion

    In my opinion, any service that adds agility along with reduction in IT operations and costs will become successful. Serverless definitely fits into this philosophy and in my opinion, will continue to see increasing adoption. It will not “replace” existing architectures but augment them. The serverless movement is just getting started and you can expect cloud vendors to invest heavily in improving the feature set and capabilities of serverless offerings. The serverless frameworks, DevOps teams, operational management and monitoring of FaaS will continue to mature in 2017. There are a lot of emergent trends at play here — containerization, serverless, LessOps, voice and chatbots, IoT proliferation, globalization and agility demands. All of these will accelerate serverless adoption.

  • Unit Testing Data at Scale using Deequ and Apache Spark

    Everyone knows the importance of knowledge and how critical it is to progress. In today’s world, data is knowledge. But that’s only when the data is “good” and correctly interpreted. Let’s focus on the “good” part. What do we really mean by “good data”?

    Its definition can change from use case to use case but, in general terms, good data can be defined by its accuracy, legitimacy, reliability, consistency, completeness, and availability.

    Bad data can lead to failures in production systems, unexpected outputs, and wrong inferences, leading to poor business decisions.

    It’s important to have something in place that can tell us about the quality of the data we have, how close it is to our expectations, and whether we can rely on it.

    This is basically the problem we’re trying to solve.

    The Problem and the Potential Solutions

    A manual approach to data quality testing is definitely one of the solutions and can work well.

    We’ll need to write code to compute various statistical measures, run it manually on different columns, maybe draw some plots, and then conduct some spot checks to see if there’s something not right or unexpected. The overall process can get tedious and time-consuming if we need to do it on a daily basis.

    Certain tools can make life easier for us here. In this blog, we’ll be focusing on Amazon Deequ.

    Amazon Deequ

    Amazon Deequ is an open-source tool developed and used at Amazon. It’s built on top of Apache Spark, so it’s great at handling big data. Deequ computes data quality metrics regularly, based on the checks and validations set, and generates relevant reports.

    Deequ provides a lot of interesting features, and we’ll be discussing them in detail. Here’s a look at its main components:

    [Image: Overview of Deequ’s main components. Source: AWS]

    Prerequisites

    Working with Deequ requires having Apache Spark up and running with Deequ as one of the dependencies.

    As of this blog, the latest version of Deequ, 1.1.0, supports Spark 2.2.x to 2.4.x and Spark 3.0.x.

    Sample Dataset

    For learning more about Deequ and its features, we’ll be using an open-source IMDb dataset which has the following schema: 

    root
     |-- tconst: string (nullable = true)
     |-- titleType: string (nullable = true)
     |-- primaryTitle: string (nullable = true)
     |-- originalTitle: string (nullable = true)
     |-- isAdult: integer (nullable = true)
     |-- startYear: string (nullable = true)
     |-- endYear: string (nullable = true)
     |-- runtimeMinutes: string (nullable = true)
     |-- genres: string (nullable = true)
     |-- averageRating: double (nullable = true)
     |-- numVotes: integer (nullable = true)

    Here, tconst is the primary key, and the rest of the columns are pretty much self-explanatory.

    Data Analysis and Validation

    Before we start defining checks on the data, if we want to compute some basic stats on the dataset, Deequ provides us with an easy way to do that. They’re called metrics.

    Deequ provides support for the following metrics:

    ApproxCountDistinct, ApproxQuantile, ApproxQuantiles, Completeness, Compliance, Correlation, CountDistinct, DataType, Distance, Distinctness, Entropy, Histogram, Maximum, MaxLength, Mean, Minimum, MinLength, MutualInformation, PatternMatch, Size, StandardDeviation, Sum, UniqueValueRatio, Uniqueness

    Let’s go ahead and apply some metrics to our dataset.

    import com.amazon.deequ.analyzers._
    import com.amazon.deequ.analyzers.runners.{AnalysisRunner, AnalyzerContext}
    import com.amazon.deequ.analyzers.runners.AnalyzerContext.successMetricsAsDataFrame
    
    // Run a set of analyzers over the dataset and collect the computed metrics
    val runAnalyzer: AnalyzerContext = { AnalysisRunner
      .onData(data)
      .addAnalyzer(Size()) // number of rows
      .addAnalyzer(Completeness("averageRating")) // fraction of non-null values
      .addAnalyzer(Uniqueness("tconst"))
      .addAnalyzer(Mean("averageRating"))
      .addAnalyzer(StandardDeviation("averageRating"))
      .addAnalyzer(Compliance("top rating", "averageRating >= 7.0")) // fraction of rows satisfying the predicate
      .addAnalyzer(Correlation("numVotes", "averageRating")) // Pearson correlation
      .addAnalyzer(Distinctness("tconst"))
      .addAnalyzer(Maximum("averageRating"))
      .addAnalyzer(Minimum("averageRating"))
      .run()
    }
    
    val metricsResult = successMetricsAsDataFrame(spark, runAnalyzer)
    metricsResult.show(false)

    We get the following output by running the code above:

    +-----------+----------------------+-----------------+--------------------+
    |entity     |instance              |name             |value               |
    +-----------+----------------------+-----------------+--------------------+
    |Mutlicolumn|numVotes,averageRating|Correlation      |0.013454113877394851|
    |Column     |tconst                |Uniqueness       |1.0                 |
    |Column     |tconst                |Distinctness     |1.0                 |
    |Dataset    |*                     |Size             |7339583.0           |
    |Column     |averageRating         |Completeness     |0.14858528066240276 |
    |Column     |averageRating         |Mean             |6.886130810579155   |
    |Column     |averageRating         |StandardDeviation|1.3982924856469208  |
    |Column     |averageRating         |Maximum          |10.0                |
    |Column     |averageRating         |Minimum          |1.0                 |
    |Column     |top rating            |Compliance       |0.080230443609671   |
    +-----------+----------------------+-----------------+--------------------+

    Let’s try to quickly understand what this tells us.

    • The dataset has 7,339,583 rows.
    • The distinctness and uniqueness of the tconst column is 1.0, which means that all the values in the column are distinct and unique, which should be expected as it’s the primary key column.
    • The averageRating column has a min of 1 and a max of 10 with a mean of 6.88 and a standard deviation of 1.39, which tells us about the variation in the average rating values across the data.
    • The completeness of the averageRating column is 0.148, which tells us that we have an average rating available for around 15% of the dataset’s records.
    • Then, we checked whether there’s any correlation between the numVotes and averageRating columns. This metric calculates the Pearson correlation coefficient, which here has a value of 0.01, meaning there’s essentially no correlation between the two columns, which is expected.

    This feature of Deequ can be really helpful if we want to quickly do some basic analysis on a dataset.

    Let’s move on to defining and running tests and checks on the data.

    Data Validation

    For writing tests for our dataset, we use Deequ’s VerificationSuite and add checks on attributes of the dataset.

    Deequ has a big, handy list of validators available to use:

    hasSize, isComplete, hasCompleteness, isUnique, isPrimaryKey, hasUniqueness, hasDistinctness, hasUniqueValueRatio, hasNumberOfDistinctValues, hasHistogramValues, hasEntropy, hasMutualInformation, hasApproxQuantile, hasMinLength, hasMaxLength, hasMin, hasMax, hasMean, hasSum, hasStandardDeviation, hasApproxCountDistinct, hasCorrelation, satisfies, hasPattern, containsCreditCardNumber, containsEmail, containsURL, containsSocialSecurityNumber, hasDataType, isNonNegative, isPositive, isLessThan, isLessThanOrEqualTo, isGreaterThan, isGreaterThanOrEqualTo, isContainedIn

    Let’s apply some checks to our dataset.

    import com.amazon.deequ.{VerificationResult, VerificationSuite}
    import com.amazon.deequ.VerificationResult.checkResultsAsDataFrame
    import com.amazon.deequ.checks.{Check, CheckLevel}
    import com.amazon.deequ.constraints.ConstrainableDataTypes
    
    val validationResult: VerificationResult = { VerificationSuite()
      .onData(data)
      .addCheck(
        Check(CheckLevel.Error, "Review Check") 
          .hasSize(_ >= 100000) // check if the data has at least 100k records
          .hasMin("averageRating", _ > 0.0) // min rating should not be less than 0
          .hasMax("averageRating", _ < 9.0) // max rating should not be greater than 9
          .containsURL("titleType") // verify that titleType column has URLs
          .isComplete("primaryTitle") // primaryTitle should never be NULL
          .isNonNegative("numVotes") // should not contain negative values
          .isPrimaryKey("tconst") // verify that tconst is the primary key column
          .hasDataType("isAdult", ConstrainableDataTypes.Integral) 
          // column contains Integer values only, expected as this col only has 0 or 1
          )
      .run()
    }
    
    val results = checkResultsAsDataFrame(spark, validationResult)
    results.select("constraint", "constraint_status", "constraint_message").show(false)

    We have added some checks to our dataset, and the details about the check can be seen as comments in the above code.

    We expect all checks to pass for our dataset except the containsURL and hasMax ones.

    That’s because the titleType column doesn’t have URLs, and we know that the max rating is 10.0, but we are checking against 9.0.

    We can see the output below:

    +--------------------------------------------------------------------------------------------+-----------------+-----------------------------------------------------+
    |constraint                                                                                  |constraint_status|constraint_message                                   |
    +--------------------------------------------------------------------------------------------+-----------------+-----------------------------------------------------+
    |SizeConstraint(Size(None))                                                                  |Success          |                                                     |
    |MinimumConstraint(Minimum(averageRating,None))                                              |Success          |                                                     |
    |MaximumConstraint(Maximum(averageRating,None))                                              |Failure          |Value: 10.0 does not meet the constraint requirement!|
    |containsURL(titleType)                                                                      |Failure          |Value: 0.0 does not meet the constraint requirement! |
    |CompletenessConstraint(Completeness(primaryTitle,None))                                     |Success          |                                                     |
    |ComplianceConstraint(Compliance(numVotes is non-negative,COALESCE(numVotes, 0.0) >= 0,None))|Success          |                                                     |
    |UniquenessConstraint(Uniqueness(List(tconst),None))                                         |Success          |                                                     |
    |AnalysisBasedConstraint(DataType(isAdult,None),<function1>,Some(<function1>),None)          |Success          |                                                     |
    +--------------------------------------------------------------------------------------------+-----------------+-----------------------------------------------------+

    To perform these checks, Deequ calculated, behind the scenes, the same kind of metrics that we saw in the previous section.

    To look at the metrics Deequ computed for the checks we defined, we can use: 

    VerificationResult.successMetricsAsDataFrame(spark,validationResult)
                      .show(truncate=false)

    Automated Constraint Suggestion

    Automated constraint suggestion is a really interesting and useful feature provided by Deequ.

    Adding validation checks on a dataset with hundreds of columns or on a large number of datasets can be challenging. With this feature, Deequ tries to make our task easier. Deequ analyses the data distribution and, based on that, suggests potentially useful constraints that can be used as validation checks.

    Let’s see how this works.

    This piece of code can automatically generate constraint suggestions for us:

    import com.amazon.deequ.suggestions.{ConstraintSuggestionRunner, Rules}
    import spark.implicits._ // for .toDS()
    
    val constraintResult = { ConstraintSuggestionRunner()
      .onData(data)
      .addConstraintRules(Rules.DEFAULT)
      .run()
    }
    
    // Flatten the per-column suggestions into (column, description, code) tuples
    val suggestionsDF = constraintResult.constraintSuggestions.flatMap { 
      case (column, suggestions) => 
        suggestions.map { constraint =>
          (column, constraint.description, constraint.codeForConstraint)
        } 
    }.toSeq.toDS()
    
    suggestionsDF.select("_1", "_2").show(false)

    Let’s look at constraint suggestions generated by Deequ:

    +--------------+---------------------------------------------------------------------------------------
    |_1            |_2
    +--------------+---------------------------------------------------------------------------------------
    |runtimeMinutes|'runtimeMinutes' has less than 72% missing values
    |tconst        |'tconst' is not null
    |titleType     |'titleType' is not null
    |titleType     |'titleType' has value range 'tvEpisode', 'short', 'movie', 'video', 'tvSeries', 'tvMovie', 'tvMiniSeries', 'tvSpecial', 'videoGame', 'tvShort'
    |titleType     |'titleType' has value range 'tvEpisode', 'short', 'movie' for at least 90.0% of values
    |averageRating |'averageRating' has no negative values
    |originalTitle |'originalTitle' is not null
    |startYear     |'startYear' has less than 9% missing values
    |startYear     |'startYear' has type Integral
    |startYear     |'startYear' has no negative values
    |endYear       |'endYear' has type Integral
    |endYear       |'endYear' has value range '2017', '2018', '2019', '2016', '2015', '2020', '2014', '2013', '2012', '2011', '2010',......
    |endYear       |'endYear' has value range '' for at least 99.0% of values
    |endYear       |'endYear' has no negative values
    |numVotes      |'numVotes' has no negative values
    |primaryTitle  |'primaryTitle' is not null
    |isAdult       |'isAdult' is not null
    |isAdult       |'isAdult' has no negative values
    |genres        |'genres' has less than 7% missing values
    +--------------+---------------------------------------------------------------------------------------

    We shouldn’t expect the constraint suggestions generated by Deequ to always make sense. They should always be verified before use.

    This is because the algorithm that generates the constraint suggestions just works on the data distribution and isn’t exactly “intelligent.”

    We can see that most of the suggestions generated make sense even though they might be really trivial.

    For the endYear column, one of the suggestions is that endYear should be contained in a given list of years, which is indeed true for our dataset. However, it can’t be generalized, since with every passing year new endYear values appear.

    But on the other hand, the suggestion that titleType can take the following values: ‘tvEpisode,’ ‘short,’ ‘movie,’ ‘video,’ ‘tvSeries,’ ‘tvMovie,’ ‘tvMiniSeries,’ ‘tvSpecial,’ ‘videoGame,’ and ‘tvShort’ makes sense and can be generalized, which makes it a great suggestion.

    And this is why we should not blindly use the constraints suggested by Deequ and always cross-check them.

    Something we can do to improve the constraint suggestions is to use the useTrainTestSplitWithTestsetRatio method in ConstraintSuggestionRunner. It makes a lot of sense to use this on large datasets.

    How does this work? If we use the config useTrainTestSplitWithTestsetRatio(0.1), Deequ would compute constraint suggestions on 90% of the data and evaluate the suggested constraints on the remaining 10%, which would improve the quality of the suggested constraints.

    Anomaly Detection

    Deequ also supports anomaly detection for data quality metrics.

    The idea behind Deequ’s anomaly detection is that often we have a sense of how much change in certain metrics of our data can be expected. Say we are getting new data every day, and we know that the number of records we get on a daily basis is around 8k to 12k. If on a random day we get 40k records, we know something went wrong with the data ingestion job or some other upstream job.

    Deequ will regularly store the metrics of our data in a MetricsRepository. Once that’s done, anomaly detection checks can be run. These compare the current values of the metrics to the historical values stored in the MetricsRepository, and that helps Deequ to detect anomalous changes that are a red flag.

    One of Deequ’s anomaly detection strategies is the RateOfChangeStrategy, which limits the maximum change in the metrics by some numerical factor that can be passed as a parameter.

    Deequ supports other strategies that can be found here. And code examples for anomaly detection can be found here.

    Conclusion

    We learned about the main features and capabilities of AWS Labs’ Deequ.

    It might feel a little daunting to people unfamiliar with Scala or Spark, but using Deequ is very easy and straightforward. Someone with a basic understanding of Scala or Spark should be able to work with Deequ’s primary features without any friction.

    For someone who rarely deals with data quality checks, manual test runs might be a good enough option. However, for someone dealing with new datasets frequently, as in multiple times in a day or a week, using a tool like Deequ to perform automated data quality testing makes a lot of sense in terms of time and effort.

    We hope this article gave you a good deep dive into data quality testing and into using Deequ for this type of engineering practice.

  • Web Scraping: Introduction, Best Practices & Caveats

    Web scraping is a process to crawl various websites and extract the required data using spiders. This data is processed in a data pipeline and stored in a structured format. Today, web scraping is widely used and has many use cases:

    • Using web scraping, Marketing & Sales companies can fetch lead-related information.
    • Web scraping is useful for Real Estate businesses to get the data of new projects, resale properties, etc.
    • Price comparison portals, like Trivago, extensively use web scraping to get product and price information from various e-commerce sites.

    The process of web scraping usually involves spiders, which fetch the HTML documents from relevant websites, extract the needed content based on the business logic, and finally store it in a specific format. This blog is a primer on building highly scalable scrapers. We will cover the following items:

    1. Ways to scrape: We’ll see basic ways to scrape data using techniques and frameworks in Python with some code snippets.
    2. Scraping at scale: Scraping a single page is straightforward, but there are challenges in scraping millions of websites, including managing the spider code, collecting data, and maintaining a data warehouse. We’ll explore such challenges and their solutions to make scraping easy and accurate.
    3. Scraping Guidelines: Scraping data from websites without the owner’s permission can be deemed malicious. Certain guidelines need to be followed to ensure our scrapers are not blacklisted. We’ll look at some of the best practices one should follow for crawling.

    So let’s start scraping. 

    Different Techniques for Scraping

    Here, we will discuss how to scrape a page and the different libraries available in Python.

    Note: Python is the most popular language for scraping.  

    1. Requests – HTTP Library in Python: To scrape a website or a page, first fetch the content of the HTML page into an HTTP response object. The requests library from Python is pretty handy and easy to use; it uses urllib3 under the hood. I like ‘requests’ because it’s easy and the code stays readable too.

    #Example showing how to use the requests library
    import requests
    r = requests.get("https://velotio.com") #Fetch HTML Page

    2. BeautifulSoup: Once you get the webpage, the next step is to extract the data. BeautifulSoup is a powerful Python library that helps you extract data from a page. It’s easy to use and has a wide range of APIs that help with data extraction. We use the requests library to fetch the HTML page and then use BeautifulSoup to parse it. In this example, we can easily fetch the page title and all the links on the page. Check out the documentation for all the possible ways in which we can use BeautifulSoup.

    from bs4 import BeautifulSoup
    import requests
    
    r = requests.get("https://velotio.com") #Fetch HTML Page
    soup = BeautifulSoup(r.text, "html.parser") #Parse HTML Page
    print("Webpage Title: " + soup.title.string)
    print("Fetch All Links:", soup.find_all('a'))

    3. Python Scrapy Framework:

    Scrapy is a Python-based web scraping framework that allows you to create different kinds of spiders to fetch the source code of the target website. Scrapy starts crawling the web pages present on a certain website, and then you write the extraction logic to get the required data. Scrapy is built on top of Twisted, a Python-based asynchronous networking library that performs requests asynchronously to boost spider performance. Scrapy is faster than BeautifulSoup. Moreover, it is a framework for writing scrapers, as opposed to BeautifulSoup, which is just a library for parsing HTML pages.

    Here is a simple example of how to use Scrapy. Install Scrapy via pip. Scrapy provides a shell for interactively exploring a parsed website:

    $ pip install scrapy #Install Scrapy
    $ scrapy shell https://velotio.com
    In [1]: response.xpath("//a").extract() #Fetch all a hrefs

    Now, let’s write a custom spider to parse a website.

    $ cat > myspider.py <<EOF
    import scrapy
    
    class BlogSpider(scrapy.Spider):
        name = 'blogspider'
        start_urls = ['https://blog.scrapinghub.com']
    
        def parse(self, response):
            for title in response.css('h2.entry-title'):
                yield {'title': title.css('a ::text').extract_first()}
    EOF
    $ scrapy runspider myspider.py

    That’s it. Your first custom spider is created. Now, let’s understand the code.

    • name: Name of the spider. In this case, it’s “blogspider”.
    • start_urls: A list of URLs where the spider will begin to crawl from.
    • parse(self, response): This function is called whenever the crawler successfully crawls a URL. The response object used earlier in the Scrapy shell is the same response object that is passed to the parse(..).

    When you run this, Scrapy will visit the start URLs, find all the h2 elements with the entry-title class, and extract the associated text from them. You can write your extraction logic directly in the parse method, or create a separate class for extraction and call its object from the parse method.

    You’ve seen how to extract simple items from a website using Scrapy, but this is just the surface. Scrapy provides a lot of powerful features for making scraping easy and efficient. Here is a tutorial for Scrapy and the additional documentation for LinkExtractor by which you can instruct Scrapy to extract links from a web page.

    4. Python lxml.html library: This is another Python library, just like BeautifulSoup; in fact, Scrapy uses lxml internally. It comes with a list of APIs you can use for data extraction. Why would you use this when Scrapy itself can extract the data? Say you want to iterate over the ‘div’ tags and perform some operation on each tag present under a “div”: this library gives you a list of ‘div’ tags, which you can iterate over using the iter() function to traverse each child tag inside the parent div tag. Such traversal operations are tricky in scraping. Here is the documentation for this library.
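
    A rough sketch of the div traversal described above (the URL is just a placeholder):

    # Traverse every div tag and visit each tag nested under it
    import requests
    from lxml import html

    r = requests.get("https://velotio.com") #Fetch HTML page
    tree = html.fromstring(r.content) #Parse into an element tree

    for div in tree.xpath("//div"): #List of all div tags
        for child in div.iter(): #The div itself and every tag inside it
            print(child.tag)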

    Challenges while Scraping at Scale

    Let’s look at the challenges and solutions while scraping at large scale, i.e., scraping 100-200 websites regularly:

    1. Data warehousing: Data extraction at a large scale generates vast volumes of information. Fault tolerance, scalability, security, and high availability are must-have features for a data warehouse. If your data warehouse is not stable or accessible, then operations like searching and filtering over the data become an overhead. To achieve this, instead of maintaining your own database or infrastructure, you can use Amazon Web Services (AWS). You can use RDS (Relational Database Service) for structured databases and DynamoDB for non-relational databases. AWS takes care of backing up the data, automatically takes snapshots of the database, and gives you database error logs as well. This blog explains how to set up infrastructure in the cloud for scraping.

    2. Pattern Changes: Scraping relies heavily on the target’s user interface and its structure, i.e., CSS and XPath. Now, if the target website changes, our scraper may crash completely or return random data that we don’t want. This is a common scenario, and that’s why maintaining scrapers is harder than writing them. To handle this case, we can write test cases for the extraction logic and run them daily, either manually or from CI tools like Jenkins, to track whether the target website has changed, as sketched below.
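
    As a sketch, such a daily test might look like this (parse_titles is a hypothetical extraction helper, and the target URL is just an example):

    # test_extraction.py -- run daily (manually or from Jenkins) with pytest
    import requests
    from bs4 import BeautifulSoup

    def parse_titles(html_text): #Hypothetical extraction logic under test
        soup = BeautifulSoup(html_text, "html.parser")
        return [h2.get_text(strip=True)
                for h2 in soup.find_all("h2", class_="entry-title")]

    def test_target_markup_unchanged():
        html_text = requests.get("https://blog.scrapinghub.com").text
        # If the site changes its markup, the selector returns nothing and
        # the test fails, flagging the change before bad data piles up.
        assert len(parse_titles(html_text)) > 0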

    3. Anti-scraping Technologies: Web scraping is common these days, and every website host would want to prevent their data from being scraped; anti-scraping technologies help them do that. For example, if you hit a particular website from the same IP address at a regular interval, the target website can block your IP. Adding a captcha to a website also helps. There are methods by which we can bypass these anti-scraping measures. For example, we can use proxy servers to hide our original IP, and several proxy services keep rotating the IP before each request. It is also easy to add support for proxy servers in code; in Python, the Scrapy framework supports it.
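
    A minimal sketch with the requests library (the proxy endpoints are placeholders for whatever your proxy provider gives you):

    # Route a request through a proxy to hide the original IP
    import requests

    proxies = {
        "http": "http://10.10.1.10:3128", #placeholder proxy endpoints
        "https": "http://10.10.1.10:1080",
    }
    r = requests.get("https://velotio.com", proxies=proxies)

    In Scrapy, the equivalent is to set request.meta['proxy'] on each request or plug in a rotating-proxy middleware.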

    4. JavaScript-based dynamic content: Websites that heavily rely on JavaScript and Ajax to render dynamic content make data extraction difficult. Scrapy and related frameworks/libraries will only work with what they find in the HTML document; Ajax calls and JavaScript are executed at runtime, so they can’t scrape that. This can be handled by rendering the web page in a headless browser such as Headless Chrome, which essentially allows running Chrome in a server environment. You can also use PhantomJS, which provides a headless WebKit-based environment.
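
    A rough sketch using Selenium to drive headless Chrome (assumes Selenium and chromedriver are installed; the URL is a placeholder):

    # Render a JavaScript-heavy page before extracting data from it
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument("--headless") #run Chrome without a visible window

    driver = webdriver.Chrome(options=options)
    driver.get("https://velotio.com")
    rendered_html = driver.page_source #HTML after JavaScript has executed
    driver.quit()

    # rendered_html can now be parsed with BeautifulSoup or lxml as usual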

    5. Honeypot traps: Some websites place honeypot traps on their webpages to detect web crawlers. These are hard to detect, as most such links are blended with the background color or have their CSS display property set to none. Setting this up requires a large coding effort on both the server and the crawler side, hence this method is not frequently used.

    6. Quality of data: AI and ML projects are currently in high demand, and these projects need data at a large scale. Data integrity is important, too, as one fault can cause serious problems in AI/ML algorithms. So, in scraping, it is very important not just to scrape the data but to verify its integrity as well. Doing this in real time is not always possible, so I prefer to write test cases for the extraction logic to make sure that whatever your spiders extract is correct and that they are not scraping any bad data.

    7. More Data, More Time: This one is obvious. The larger a website is, the more data it contains and the longer it takes to scrape. This may be fine if your purpose for scanning the site isn’t time-sensitive, but that often isn’t the case. Stock prices don’t stay the same over hours. Sales listings, currency exchange rates, media trends, and market prices are just a few examples of time-sensitive data. What to do in this case, then? One solution is to design your spiders carefully: if you’re using a framework like Scrapy, apply proper LinkExtractor rules so that the spider does not waste time scraping unrelated URLs.

    You may also use the distributed scraping packages available in Python, such as Frontera and Scrapy Redis. Frontera lets you send out only one request per domain at a time but can hit multiple domains at once, making it great for parallel scraping. Scrapy Redis lets you send out multiple requests to one domain. The right combination of these can result in a very powerful web spider that can handle both the bulk and the variation of large websites.

    8. Captchas: Captchas are a good way of keeping crawlers away from a website, and they are used by many website hosts. To scrape data from such websites, we need a mechanism to solve the captchas. There are packages and services that can solve captchas and act as middleware between the target website and your spider. You may also use libraries like Pillow and Tesseract in Python to solve simple image-based captchas.

    9. Maintaining Deployment: Normally, we don’t want to limit ourselves to scraping just a few websites. We want the maximum amount of data present on the Internet, which may mean scraping millions of websites. You can imagine the size of the code and the deployment; we can’t run spiders at this scale from a single machine. What I prefer here is to dockerize the scrapers and take advantage of technologies like AWS ECS and Kubernetes to run the scraper containers. This keeps our scrapers highly available and easy to maintain, and we can schedule them to run at regular intervals.

    Scraping Guidelines/Best Practices

    1. Respect the robots.txt file: robots.txt is a text file that webmasters create to instruct search engine robots on how to crawl and index pages on the website; it contains the rules for how crawlers should interact with the site. Before even planning the extraction logic, you should check this file; you can usually find it at the root of the website (e.g., example.com/robots.txt). For example, if a website has a link to download critical information, they probably don’t want to expose that to crawlers. Another important directive is the crawl delay, which tells crawlers how frequently they may hit the website. If someone has asked not to crawl their website, then we’d better not do it, because if they catch your crawlers, it can lead to serious legal issues.

    2. Do not hit the servers too frequently: As mentioned above, some websites specify a crawl delay for crawlers. We’d better use it wisely, because not every website is tested against high load. Hitting a server relentlessly creates huge traffic on the server side, and it may crash or fail to serve other requests. This has a high impact on user experience, and real users are more important than bots. So, we should make requests according to the interval specified in robots.txt or use a standard delay of around 10 seconds. This also helps you avoid getting blocked by the target website, as shown in the sketch below.
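
    In Scrapy, for example, throttling is a couple of standard settings (the values here are just examples):

    # settings.py -- throttle the spider so it doesn't hammer the server
    DOWNLOAD_DELAY = 10 #seconds to wait between requests to the same site
    AUTOTHROTTLE_ENABLED = True #let Scrapy adapt the delay to server load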

    3. User Agent Rotation and Spoofing: Every request carries a User-Agent string in its headers. This string identifies the browser you are using, its version, and the platform. If we use the same User-Agent in every request, it’s easy for the target website to detect that the requests are coming from a crawler. So, to avoid this, rotate the User-Agent between requests. You can easily find examples of genuine User-Agent strings on the Internet; try them out. If you’re using Scrapy, you can set the USER_AGENT property in settings.py. A rough sketch with plain requests follows.
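
    # Pick a different User-Agent for every request; the strings are examples
    import random
    import requests

    USER_AGENTS = [
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12) AppleWebKit/537.36",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36",
    ]

    headers = {"User-Agent": random.choice(USER_AGENTS)}
    r = requests.get("https://velotio.com", headers=headers)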

    4. Disguise your requests by rotating IPs and proxy services: We’ve discussed this in the challenges above. It’s always better to use rotating IPs and a proxy service so that your spider won’t get blocked.

    5. Do not follow the same crawling pattern: As you know, many websites use anti-scraping technologies, so it’s easy for them to detect your spider if it crawls in the same pattern every time. A human would normally not follow a fixed pattern on a particular website. So, to keep your spiders running smoothly, we can introduce actions like mouse movements, clicking a random link, etc., which make the spider look more human.

    6. Scrape during off-peak hours: Off-peak hours are suitable for bots/crawlers as the traffic on the website is considerably less. These hours can be identified by the geolocation from where the site’s traffic originates. This also helps to improve the crawling rate and avoid the extra load from spider requests. Thus, it is advisable to schedule the crawlers to run in the off-peak hours.

    7. Use the scraped data responsibly: We should always take responsibility for the scraped data. Scraping data and then republishing it somewhere else is not acceptable; it can be considered a copyright violation and may lead to legal issues. So, it is advisable to check the target website’s Terms of Service page before scraping.

    8. Use Canonical URLs: When we scrape, we tend to scrape duplicate URLs, and hence duplicate data, which is the last thing we want. Within a single website, multiple URLs may serve the same data. In this situation, the duplicate URLs will declare a canonical URL, which points to the parent or original URL. By honoring it, we make sure we don’t scrape duplicate content. In frameworks like Scrapy, duplicate URLs are handled by default. A quick way to read the canonical URL is sketched below.
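
    For example, reading a page’s declared canonical URL with BeautifulSoup (the URL is a placeholder):

    # Check a page's canonical URL before queueing it for scraping
    import requests
    from bs4 import BeautifulSoup

    r = requests.get("https://velotio.com")
    soup = BeautifulSoup(r.text, "html.parser")

    canonical = soup.find("link", rel="canonical")
    if canonical is not None:
        print("Canonical URL:", canonical.get("href"))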

    9. Be transparent: Don’t misrepresent your purpose or use deceptive methods to gain access. If you have a login and a password that identifies you to gain access to a source, use it.  Don’t hide who you are. If possible, share your credentials.

    Conclusion

    We’ve seen the basics of scraping, frameworks, how to crawl, and the best practices of scraping. To conclude:

    • Follow target URLs rules while scraping. Don’t make them block your spider.
    • Maintenance of data and spiders at scale is difficult. Use Docker/Kubernetes and public cloud providers, like AWS, to easily scale your web-scraping backend.
    • Always respect the rules of the websites you plan to crawl. If APIs are available, always use them first.
  • Your Quintessential Guide to AWS Athena

    Introduction

    Serverless has become a new trend today and is surely here to stay! When you think of wireless internet, you know that it still has some wires, but you don’t need to worry about them because you don’t have to maintain them. Similarly, serverless has servers, but you don’t have to keep worrying about handling or maintaining them. All you need to do is focus on your code, and you’re good to go.

    It has some more benefits, such as:

    • Zero administration: You can deploy code without provisioning anything beforehand, or managing anything later. There is no concept of a fleet, an instance, or even an operating system.
    • Auto-scaling: It lets your service provider manage the scaling challenges. You don’t need to set up alerts or write scripts to scale up and down; quick bursts of traffic and weekend lulls are handled the same way.
    • Pay-per-use: The function-as-a-service compute and managed services are charged based on usage rather than pre-provisioned capacity. You can have complete resource utilization without paying a cent for idle time. The results? 90% cost-savings over a cloud VM, and the satisfaction of knowing that you never pay for resources you don’t use.

    What is AWS Athena?

    AWS Athena is a serverless service in the same spirit, though it is more of an interactive query service than a code deployment service.

    Using Athena, one can directly query data stored in S3 buckets using standard ANSI SQL.

    As mentioned earlier, it works on the serverless principle: there is no infrastructure to manage, and you pay only for the queries that you run.

    Athena is easy to use. You can simply point to your data in Amazon S3, define the schema, and start querying using standard SQL. Most results are delivered within seconds. With Athena, there’s no need for complex ETL jobs to prepare your data for analysis. This makes it easy for anyone with SQL skills to quickly analyze large-scale datasets.

    It is based on Facebook’s PrestoDB and can be used to query structured and semi-structured data.
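
    Since Athena is exposed through a regular AWS API, queries can also be run programmatically. Below is a minimal sketch using boto3; the region, database, table, and output location are placeholders for your own values, and the polling loop reflects Athena's asynchronous execution model.

    import time
    import boto3
    
    athena = boto3.client("athena", region_name="us-east-1")  # placeholder region
    
    # Submit the query; Athena writes results to the given S3 location.
    qid = athena.start_query_execution(
        QueryString="SELECT first_name FROM basic_details LIMIT 10",
        QueryExecutionContext={"Database": "sampledb"},  # placeholder database
        ResultConfiguration={"OutputLocation": "s3://athena-blog/results/"},
    )["QueryExecutionId"]
    
    # Athena runs queries asynchronously, so poll until this one finishes.
    while True:
        state = athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(1)
    
    if state == "SUCCEEDED":
        for row in athena.get_query_results(QueryExecutionId=qid)["ResultSet"]["Rows"]:
            print([col.get("VarCharValue") for col in row["Data"]])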

    Some Exciting Features of Athena are:

    • Serverless, with no ETL: there are no servers or data warehouses to set up and manage.
    • You only pay for the data that each query scans.
    • You can improve performance by compressing, partitioning, and converting your data into columnar formats.
    • It can handle complex analysis, including large joins, window functions, and arrays.
    • Athena automatically executes queries in parallel.
    • You only need to provide a path to an S3 folder; when new files are added, they are automatically reflected in the table.
    • Supports:
      • CSV, JSON, Parquet, ORC, and Avro data formats
      • Complex joins and data types
      • View creation
    • Does not support:
      • User-defined functions and stored procedures
      • Hive or Presto transactions
      • LZO compression (Snappy is supported)

    Pricing of Athena

    • AWS Athena is priced at $5 per TB of data scanned.
    • Each query is rounded up to the nearest MB, with a 10 MB minimum; for example, a query that scans 3 MB is billed for 10 MB, i.e. roughly $0.00005.
    • You pay for the underlying stored data at regular S3 rates.
    • Amazon advises users to compress their data files, store data in columnar formats, and routinely delete old result sets to keep charges low. Partitioning data in tables can both speed up queries and reduce query bills.
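
    As a back-of-the-envelope helper, the billing rule above can be expressed in a few lines of Python (a sketch that models only the per-query scan charge, not S3 storage):

    def athena_query_cost(scanned_mb, price_per_tb=5.0):
        # Each query is billed per TB scanned, rounded up to a 10 MB minimum.
        billable_mb = max(scanned_mb, 10)
        return billable_mb / (1024 * 1024) * price_per_tb
    
    print(athena_query_cost(3))        # small scan hits the 10 MB minimum: ~$0.00005
    print(athena_query_cost(1048576))  # a full 1 TB scan: 5.0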

    Athena vs. Redshift Spectrum

    AWS also offers Redshift as a data warehouse service, and Redshift Spectrum can likewise query data in S3, so why should you use Athena?

    Advantages of Redshift Spectrum:

    • It allows the creation of Redshift tables, and you can efficiently join Redshift tables with Redshift Spectrum tables.

    If you do not need those things, then you should consider Athena. Here is how Athena differs from Redshift Spectrum:

    • Billing: This is a major difference; depending on your use case, you may find one much cheaper than the other.
    • Performance: Athena is slightly faster.
    • SQL syntax and features: Athena is derived from Presto and is a bit different from Redshift, which has its roots in Postgres.
    • Connectivity: It's easy enough to connect to Athena using the API, JDBC, or ODBC, but many more products offer standard out-of-the-box connections to Redshift.
    • Athena has GIS functions and lambda expressions.

    So, in a nutshell: if you already have Redshift clusters, you would probably go for Redshift Spectrum; if not, you can opt for Athena for querying the data. In some cases, you can use both in tandem.

    Example

    Here are sample queries that create a database with three tables, basic_details, contact_details, and bill_details, over CSV files uploaded to S3:

    Basic_details:

    CREATE EXTERNAL TABLE `basic_details`(
      `id` int COMMENT '', 
      `first_name` string COMMENT '', 
      `gender` string COMMENT '')
    ROW FORMAT DELIMITED 
      FIELDS TERMINATED BY ',' 
    STORED AS INPUTFORMAT 
      'org.apache.hadoop.mapred.TextInputFormat' 
    OUTPUTFORMAT 
      'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
    LOCATION
      's3://athena-blog/basic-details'
    TBLPROPERTIES (
      'has_encrypted_data'='false', 
      'skip.header.line.count'='1')

    Bill_details:

    CREATE EXTERNAL TABLE `bill_details`(
      `id` int COMMENT '', 
      `amount_paid` string COMMENT '', 
      `amount_due` string COMMENT '')
    ROW FORMAT DELIMITED 
      FIELDS TERMINATED BY ',' 
    STORED AS INPUTFORMAT 
      'org.apache.hadoop.mapred.TextInputFormat' 
    OUTPUTFORMAT 
      'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
    LOCATION
      's3://athena-blog/bill-details'
    TBLPROPERTIES (
      'has_encrypted_data'='false', 
      'skip.header.line.count'='1')

    Contact_details:

    CREATE EXTERNAL TABLE `contact_details`(
      `id` int COMMENT '', 
      `street` string COMMENT '', 
      `city` string COMMENT '', 
      `state` string COMMENT '', 
      `country` string COMMENT '', 
      `zip` string COMMENT '')
    ROW FORMAT DELIMITED 
      FIELDS TERMINATED BY ',' 
    STORED AS INPUTFORMAT 
      'org.apache.hadoop.mapred.TextInputFormat' 
    OUTPUTFORMAT 
      'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
    LOCATION
      's3://athena-blog/contact-details'
    TBLPROPERTIES (
      'has_encrypted_data'='false', 
      'skip.header.line.count'='1')

    Sample query: first names of male customers from Minnesota with amount_due > $100

    WITH basic AS 
        (SELECT id,
             first_name
        FROM basic_details
        WHERE lower(gender) = 'male' ), bill AS 
        (SELECT id
    FROM bill_details
        WHERE CAST(amount_due AS INTEGER) > 100 ), contact AS 
        (SELECT contact_details.id
        FROM contact_details
        JOIN bill
            ON contact_details.id = bill.id
        WHERE state= 'Minnesota' )
    SELECT basic.first_name
    FROM basic
    JOIN contact
        ON basic.id = contact.id 

    Output:

    Some Other Sample Queries:

    1. Searching for Values in JSON

    WITH dataset AS (
      SELECT * FROM (VALUES
        (JSON '{"name": "Bob Smith", "org": "legal", "projects": ["project1"]}'),
        (JSON '{"name": "Susan Smith", "org": "engineering", "projects": ["project1", "project2", "project3"]}'),
        (JSON '{"name": "Jane Smith", "org": "finance", "projects": ["project1", "project2"]}')
      ) AS t (users)
    )
    SELECT json_extract_scalar(users, '$.name') AS user
    FROM dataset
    WHERE json_array_contains(json_extract(users, '$.projects'), 'project2')

    Output:

    2. Extracting properties

    WITH dataset AS (
      SELECT '{"name": "Susan Smith",
               "org": "engineering",
               "projects": [{"name":"project1", "completed":false},
               {"name":"project2", "completed":true}]}'
        AS blob
    )
    SELECT
      json_extract(blob, '$.name') AS name,
      json_extract(blob, '$.projects') AS projects
    FROM dataset

    Output:

    3. Converting JSON to Athena Data Types

    WITH dataset AS (
      SELECT
        CAST(JSON '"HELLO ATHENA"' AS VARCHAR) AS hello_msg,
        CAST(JSON '12345' AS INTEGER) AS some_int,
        CAST(JSON '{"a":1,"b":2}' AS MAP(VARCHAR, INTEGER)) AS some_map
    )
    SELECT * FROM dataset

    Output:

    Conclusion

    Hence, AWS Athena gives us an efficient way to query our raw data in different formats directly in S3 object storage, without spinning up dedicated infrastructure and at minimal cost.

    Need help with setting up AWS Athena for your organization? Connect with the experts at Velotio!

  • Acquiring Temporary AWS Credentials with Browser Navigated Authentication

    In one of my previous blog posts (Hacking your way around AWS IAM Roles), we demonstrated how users can access AWS resources without having to store AWS credentials on disk. This was achieved by setting up an OpenVPN server and a client-side route that gets pushed automatically when the user connects to the VPN. To this date, I find it a low-friction solution that doesn't force users to do any manual configuration on their systems. It also makes sense that access to AWS resources lasts only as long as users are connected to the VPN. One downside of this method is maintaining the OpenVPN server, keeping it secure, and running it in a highly available (HA) state. If the OpenVPN server is compromised, our credentials are at stake. Secondly, all users connected to the VPN get the same level of access.

    In this blog post, we present a CLI utility written in Rust that writes temporary AWS credentials to a user profile (the ~/.aws/credentials file) using browser-navigated Google authentication. The utility is inspired by gimme-aws-creds (written in Python for Okta-authenticated AWS farms) and the Heroku CLI (written in Node.js on the oclif framework). We will refer to our utility as auth-awscreds throughout this post.

    “If you have an apple and I have an apple and we exchange these apples then you and I will still each have one apple. But if you have an idea and I have an idea and we exchange these ideas, then each of us will have two ideas.”

    – George Bernard Shaw

    What does this CLI utility (auth-awscreds) do?

    When the user runs the auth-awscreds command in a terminal, the program reads its configuration from the .auth-awscreds file in the user's home directory. If this file is not present, the utility prompts for the configuration on first run. The configuration file is in INI format. The program then opens the default web browser and navigates to the URL read from the configuration. At this point, the utility waits for the browser to complete navigation and authorization. The web UI redirects to Google authentication; if authentication succeeds, a callback is sent to the CLI utility along with temporary AWS credentials, which are then written to the ~/.aws/credentials file.
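
    For illustration, the configuration file might look like the sketch below. The section and key names here are assumptions for this example, not the utility's documented schema.

    ; ~/.auth-awscreds (INI format) -- key names are hypothetical
    [default]
    profile_name = default
    app_domain = https://auth.example.com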

    Block Diagram

    Tech Stack Used

    As stated earlier, we wrote this utility in Rust. One of the reasons for choosing Rust is that we wanted a statically linked binary (ELF) that runs independently of an interpreter and ships as-is once compiled. Programs written in Python or Node.js, by contrast, need a language interpreter and supporting libraries installed. Go would also have suited our purpose, but I prefer Rust.

    Software Stack:

    • Rust (for CLI utility)
    • Actix Web – HTTP Server
    • Node.js, Express, ReactJS, serverless-http, aws-sdk, AWS Amplify, axios
    • Terraform and serverless framework

    Infrastructure Stack:

    • AWS Cognito (User Pool and Federated Identities)
    • AWS API Gateway (HTTP API)
    • AWS Lambda
    • AWS S3 Bucket (React App)
    • AWS CloudFront (For Serving React App)
    • AWS ACM (SSL Certificate)

    Recipe

    Architecture Diagram

    CLI Utility: auth-awscreds

    Our goal is that when the auth-awscreds command is run, we first check whether the ~/.aws/credentials file exists in the user's home directory. If not, we create the ~/.aws directory. This is the default AWS credentials directory, where the AWS SDK usually looks for credentials (unless explicitly overridden by the env var AWS_SHARED_CREDENTIALS_FILE). The next step is to check whether a ~/.auth-awscreds file exists. If it doesn't, we prompt the user for two inputs:

    1. AWS credentials profile name (used by SDK, default is preferred) 

    2. Application domain URL (Our backend app domain is used for authentication)

    let app_profile_file = format!("{}/.auth-awscreds",&user_home_dir);
     
       let config_exist : bool = Path::new(&app_profile_file).exists();
     
       let mut profile_name = String::new();
       let mut app_domain = String::new();
     
       if !config_exist {
           //ask the series of questions
           print!("Which profile to write AWS Credentials [default] : ");
           io::stdout().flush().unwrap();
           io::stdin()
               .read_line(&mut profile_name)
               .expect("Failed to read line");
     
           print!("App Domain : ");
           io::stdout().flush().unwrap();
          
           io::stdin()
               .read_line(&mut app_domain)
               .expect("Failed to read line");
          
           profile_name=String::from(profile_name.trim());
           app_domain=String::from(app_domain.trim());
          
           config_profile(&profile_name,&app_domain);
          
       }
       else {
           (profile_name,app_domain) = read_profile();
       }

    These two properties are written to ~/.auth-awscreds under the default section. The utility then generates a 1024-bit RSA asymmetric keypair; both keys are base64-encoded.

    pub fn genkeypairs() -> (String,String) {
       let rsa = Rsa::generate(1024).unwrap();
     
       let private_key: Vec<u8> = rsa.private_key_to_pem_passphrase(Cipher::aes_128_cbc(),"Sagar Barai".as_bytes()).unwrap();
       let public_key: Vec<u8> = rsa.public_key_to_pem().unwrap();
     
       (base64::encode(private_key) , base64::encode(public_key))
    }

    We then launch a browser window and navigate to the configured app domain URL. At this stage, the utility starts a temporary web server with the help of the Actix Web framework, listening on localhost port 63442.

    println!("Opening web ui for authentication...!");
       open::that(&app_domain).unwrap();
     
       HttpServer::new(move || {
           //let stopper = tx.clone();
           let cors = Cors::permissive();
           App::new()
           .wrap(cors)
           //.app_data(stopper)
           .app_data(crypto_data.clone())
           .service(get_public_key)
           .service(set_aws_creds)
       })
       .bind(("127.0.0.1",63442))?
       .run()
       .await

    The localhost web server has two endpoints.

    1. GET Endpoint (/publickey): This endpoint is called by our React app after authentication and returns the public key created during initialization. Since the web server hosted by the Rust application is insecure (non-SSL), the actual AWS credentials, when received, should be posted as a string encrypted with this public key.

    #[get("/publickey")]
    pub async fn get_public_key(data: web::Data<AppData>) -> impl Responder {
       let public_key = &data.public_key;
      
       web::Json(HTTPResponseData{
           status: 200,
           msg: String::from("Ok"),
           success: true,
           data: String::from(public_key)
       })
    }

    2. POST Endpoint (/setcreds): This endpoint is called when the React app has successfully retrieved credentials from API Gateway. The credentials are decrypted with the private key and then written to the ~/.aws/credentials file under the profile name defined in the utility configuration.

    let encrypted_data = payload["data"].as_array().unwrap();
       let username = payload["username"].as_str().unwrap();
     
       let mut decypted_payload = vec![];
     
       for str in encrypted_data.iter() {
           //println!("{}",str.to_string());
           let s = str.as_str().unwrap();
           let decrypted = decrypt_data(&private_key, &s.to_string());
           decypted_payload.extend_from_slice(&decrypted);
       }
     
       let credentials : serde_json::Value = serde_json::from_str(&String::from_utf8(decypted_payload).unwrap()).unwrap();
     
       let aws_creds = AWSCreds{
           profile_name: String::from(profile_name),
           aws_access_key_id: String::from(credentials["AccessKeyId"].as_str().unwrap()),
           aws_secret_access_key: String::from(credentials["SecretAccessKey"].as_str().unwrap()),
           aws_session_token: String::from(credentials["SessionToken"].as_str().unwrap())
       };
     
       println!("Authenticated as {}",username);
       println!("Updating AWS Credentials File...!");
     
       configcreds(&aws_creds);

    One of the interesting parts of this code is the decryption process, which iterates through an array of strings joined by the method decypted_payload.extend_from_slice(&decrypted);. RSA-1024 works on 128-byte blocks, and we used OAEP padding, which consumes 42 bytes, leaving 128 - 42 = 86 bytes for data. So at most 86 bytes can be encrypted per block. The credentials therefore arrive as an array of base64-encoded 128-byte chunks; each base64 string has to be decoded to a data buffer and decrypted piece by piece.

    To generate the statically linked release binary, run: cargo build --release

    AWS Cognito and Google Authentication

    This guide does not cover how to set up Cognito and its integration with Google authentication. You can refer to our older post for a detailed guide on setting up authentication and authorization (see the sections Setup Authentication and Setup Authorization).

    React App:

    The React app is launched via our Rust CLI utility and is served directly from the S3 bucket via CloudFront. When the React app loads, it checks whether the current session is authenticated. If not, then with the help of the AWS Amplify framework, the app redirects to the Cognito-hosted UI authentication, which in turn auto-redirects to the Google login page.

    render(){
       return (
         <div className="centerdiv">
           {
             this.state.appInitialised ?
               this.state.user === null ? Auth.federatedSignIn({provider: 'Google'}) :
               <Aux>
                 {this.state.pageContent}
               </Aux>
             :
             <Loader/>
           }
         </div>
       )
     }

    Once the session is authenticated, we set the React state variables and then retrieve the public key from the Actix web server (the auth-awscreds CLI app) by calling the /publickey GET method. After that, an Ajax POST request (/auth-creds) is made via the axios library to API Gateway. The payload contains the public key and a JWT token for authentication. The expected response from API Gateway is the encrypted temporary AWS credentials, which are then proxied to our CLI application.

    To ease this deployment, we have written Terraform code (available in the repository) that takes care of creating the S3 bucket, CloudFront distribution, and ACM certificate, building the React app, and deploying it to the S3 bucket. Navigate to the vars.tf file and change the respective default variables. The Terraform script will fail on first launch, since ACM needs DNS record validation; create a CNAME record for DNS validation and re-run the script to continue the deployment. The React app expects a few environment variables. Below is a sample .env file; update the respective values for your environment.

    REACT_APP_IDENTITY_POOL_ID=
    REACT_APP_COGNITO_REGION=
    REACT_APP_COGNITO_USER_POOL_ID=
    REACT_APP_COGNTIO_DOMAIN_NAME=
    REACT_APP_DOMAIN_NAME=
    REACT_APP_CLIENT_ID=
    REACT_APP_CLI_APP_URL=
    REACT_APP_API_APP_URL=

    Finally, deploy the React app using below sample commands.

    $ terraform plan -out plan     #creates plan for revision
    $ terraform apply plan         #apply plan and deploy

    API Gateway HTTP API and Lambda Function

    When a request is first intercepted by API Gateway, it validates the JWT token on its own; API Gateway natively supports Cognito integration. Thus, any payload with an invalid Authorization header is rejected at API Gateway itself, which simplifies our authentication process and validates the identity. If the request is valid, it is passed on to our Lambda function. The Lambda function is written in Node.js, with the serverless-http framework wrapping an Express app. The Express app has only one endpoint.

    /auth-creds (POST): Once the request is received, it retrieves the identity ID from Cognito and logs it to stdout for audit purposes.

    let identityParams = {
               IdentityPoolId: process.env.IDENTITY_POOL_ID,
               Logins: {}
           };
      
           identityParams.Logins[`${process.env.COGNITOIDP}`] = req.headers.authorization;
      
           const ci = new CognitoIdentity({region : process.env.AWSREGION});
      
           let idpResponse = await ci.getId(identityParams).promise();
      
           console.log("Auth Creds Request Received from ",JSON.stringify(idpResponse));

    The app then extracts and decodes the base64-encoded public key. Next, an STS (Security Token Service) API call is made and temporary credentials are derived. These credentials are then encrypted with the public key in chunks of 86 bytes.

    const pemPublicKey = Buffer.from(public_key,'base64').toString();
     
           const authdata=await sts.assumeRole({
               ExternalId: process.env.STS_EXTERNAL_ID,
               RoleArn: process.env.IAM_ROLE_ARN,
               RoleSessionName: "DemoAWSAuthSession"
           }).promise();
     
           const creds = JSON.stringify(authdata.Credentials);
           const splitData = creds.match(/.{1,86}/g);
          
           const encryptedData = splitData.map(d=>{
               return publicEncrypt(pemPublicKey,Buffer.from(d)).toString('base64');
           });

    Here, assumeRole assumes the IAM role, which has the appropriate policy documents attached. For the sake of this demo, we attached the Administrator policy. However, you should consider hardening the policy document and avoid attaching the Administrator policy directly to the role.

    resources:
     Resources:
       AuthCredsAssumeRole:
         Type: AWS::IAM::Role
         Properties:
           AssumeRolePolicyDocument:
             Version: "2012-10-17"
             Statement:
               -
                 Effect: Allow
                 Principal:
                   AWS: !GetAtt IamRoleLambdaExecution.Arn
                 Action: sts:AssumeRole
                 Condition:
                   StringEquals:
                     sts:ExternalId: ${env:STS_EXTERNAL_ID}
           RoleName: auth-awscreds-api
           ManagedPolicyArns:
             - arn:aws:iam::aws:policy/AdministratorAccess

    Finally, the response is sent to the React app. 

    We have used the Serverless framework to deploy the API. It creates the API Gateway, Lambda function, Lambda layer, and IAM role, and takes care of deploying the code to the Lambda function.

    To deploy this application, follow the below steps.

    1. cd layer/nodejs && npm install && cd ../.. && npm install

    2. npm install -g serverless (on mac you can skip this step and use the npx serverless command instead) 

    3. Create a .env file, add the environment variables below, and set the respective values.

    AWSREGION=ap-south-1
    COGNITO_USER_POOL_ID=
    IDENTITY_POOL_ID=
    COGNITOIDP=
    APP_CLIENT_ID=
    STS_EXTERNAL_ID=
    IAM_ROLE_ARN=
    DEPLOYMENT_BUCKET=
    APP_DOMAIN=

    4. serverless deploy or npx serverless deploy

    The entire codebase for the CLI app, React app, and backend API is available in the GitHub repository.

    Testing:

    Assuming you have the compiled binary (auth-awscreds) available on your local machine and, for the sake of testing, have installed `aws-cli`, you can run /path/to/your/auth-awscreds.

    App Testing

    If you selected “demo-awscreds” as your AWS profile name, export the AWS_PROFILE environment variable (export AWS_PROFILE=demo-awscreds). If you prefer the “default” profile, you don't need to export anything, as the AWS SDK selects the “default” profile on its own.

    [demo-awscreds]
    aws_access_key_id=ASIAUAOF2CHC77SJUPZU
    aws_secret_access_key=r21J4vwPDnDYWiwdyJe3ET+yhyzFEj7Wi1XxdIaq
    aws_session_token=FwoGZXIvYXdzEIj//////////wEaDHVLdvxSNEqaQZPPQyK2AeuaSlfAGtgaV1q2aKBCvK9c8GCJqcRLlNrixCAFga9n+9Vsh/5AWV2fmea6HwWGqGYU9uUr3mqTSFfh+6/9VQH3RTTwfWEnQONuZ6+E7KT9vYxPockyIZku2hjAUtx9dSyBvOHpIn2muMFmizZH/8EvcZFuzxFrbcy0LyLFHt2HI/gy9k6bLCMbcG9w7Ej2l8vfF3dQ6y1peVOQ5Q8dDMahhS+CMm1q/T1TdNeoon7mgqKGruO4KJrKiZoGMi1JZvXeEIVGiGAW0ro0/Vlp8DY1MaL7Af8BlWI1ZuJJwDJXbEi2Y7rHme5JjbA=

    To validate, run “aws s3 ls”; you should see the S3 buckets in your AWS account listed. Note that these credentials are only valid for 60 minutes, after which you will have to re-run the command and acquire a new pair of AWS credentials. Of course, you can configure your IAM role to extend the expiry for the assumed role.

    auth-awscreds in Action:

    Summary

    Currently, “auth-awscreds” is at an early development stage. This post demonstrates how AWS credentials can be acquired temporarily without having to worry about key rotation. One feature we are currently working on is RBAC, with the help of AWS Cognito. Since the tool doesn't yet support command-line arguments, the utility configuration can't be changed in place; you can manually edit or delete the configuration file, which triggers the configuration prompt on the next run. We also want to add multiple profiles so that multiple AWS accounts can be used.

  • Building Your First AWS Serverless Application? Here’s Everything You Need to Know

    A serverless architecture is a way to implement and run applications, services, or microservices without needing to manage infrastructure. Your application still runs on servers, but all the server management is done by AWS. We no longer need to provision, scale, or maintain servers to run our applications, databases, and storage systems, which lets developers focus on building the application rather than on the plumbing beneath it.

    Why Serverless

    1. More focus on development rather than managing servers.
    2. Cost Effective.
    3. Applications that scale automatically.
    4. Quick application setup.

    Services for Serverless

    For implementing a serverless architecture, there are multiple services provided by the various cloud providers, though we will mostly explore services from AWS. The following are services you can use, depending on the application's requirements.

    1. Lambda: It is used to write business logic / schedulers / functions.
    2. S3: It is mostly used for storing objects, but it can also host web apps; you can host a static website on S3.
    3. API Gateway: It is used for creating, publishing, maintaining, monitoring and securing REST and WebSocket APIs at any scale.
    4. Cognito: It provides authentication, authorization, and user management for your web and mobile apps. Your users can sign in directly with a username and password or through third parties such as Facebook, Amazon, or Google.
    5. DynamoDB: It is a fully managed NoSQL database service that provides fast and predictable performance with seamless scalability.

    Three-tier Serverless Architecture

    So, let's take a use case in which you want to develop a three-tier serverless application. The three-tier architecture is a popular pattern for user-facing applications. The tiers that comprise it are the presentation tier, the logic tier, and the data tier. The presentation tier is the component users directly interact with: the web page or mobile app UI. The logic tier contains the code that translates user actions at the presentation tier into the functionality that drives the application's behaviour. The data tier consists of your storage media (databases, file systems, object stores) that hold the data relevant to the application. The figure below shows a simple three-tier application.

     Figure: Simple Three-Tier Architectural Pattern

    Presentation Tier

    The presentation tier represents the view part of the application. Here you can use S3 to host a static website. On a static website, individual web pages contain static content, and they may also contain client-side scripting.

    The following is a quick procedure to configure an Amazon S3 bucket for static website hosting in the S3 console.

    To configure an S3 bucket for static website hosting

    1. Log in to the AWS Management Console and open the Amazon S3 console.

    2. In the Bucket name list, choose the name of the bucket that you want to enable static website hosting for.

    3. Choose Properties.

    4. Choose Static Website Hosting

    Once you enable your bucket for static website hosting, browsers can access all of your content through the Amazon S3 website endpoint for your bucket.

    5. Choose Use this bucket to host.

    A. For Index Document, type the name of your index document, which is typically named index.html. When you configure an S3 bucket for website hosting, you must specify an index document, which S3 returns when requests are made to the root domain or any of the subfolders.

    B. (Optional) For 4XX errors, you can provide your own custom error document that gives additional guidance to your users. Type the name of the file that contains the custom error document; if an error occurs, S3 returns that document.

    C. (Optional) If you want to add advanced redirection rules, use XML in the Edit redirection rules text box to describe them.
    E.g.

    <RoutingRules>
        <RoutingRule>
            <Condition>
                <HttpErrorCodeReturnedEquals>403</HttpErrorCodeReturnedEquals>
            </Condition>
            <Redirect>
                <HostName>mywebsite.com</HostName>
                <ReplaceKeyPrefixWith>notfound/</ReplaceKeyPrefixWith>
            </Redirect>
        </RoutingRule>
    </RoutingRules>

    6. Choose Save

    7. Add a bucket policy to the website bucket that grants everyone access to the objects in it. When you configure an S3 bucket as a website, you must make the objects you want to serve publicly readable. To do so, you write a bucket policy that grants everyone the s3:GetObject permission. The following bucket policy grants everyone access to the objects in the example-bucket bucket.

    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "PublicReadGetObject",
                "Effect": "Allow",
                "Principal": "*",
                "Action": [
                    "s3:GetObject"
                ],
                "Resource": [
                    "arn:aws:s3:::example-bucket/*"
                ]
            }
        ]
    }
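
    The same setup can also be scripted. Below is a sketch using boto3, where example-bucket is the placeholder bucket name from the policy above and your AWS credentials are assumed to be configured already.

    import json
    import boto3
    
    s3 = boto3.client("s3")
    
    # Enable static website hosting with index and error documents.
    s3.put_bucket_website(
        Bucket="example-bucket",
        WebsiteConfiguration={
            "IndexDocument": {"Suffix": "index.html"},
            "ErrorDocument": {"Key": "error.html"},
        },
    )
    
    # Attach the public-read bucket policy shown above.
    policy = {
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "PublicReadGetObject",
            "Effect": "Allow",
            "Principal": "*",
            "Action": ["s3:GetObject"],
            "Resource": ["arn:aws:s3:::example-bucket/*"],
        }],
    }
    s3.put_bucket_policy(Bucket="example-bucket", Policy=json.dumps(policy))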

    Note: If you choose Disable Website Hosting, S3 removes the website configuration from the bucket, so the bucket is no longer accessible from the website endpoint; it is, however, still available at the REST endpoint.

    Logic Tier

    The logic tier represents the brains of the application, and this is where using the two core serverless services, API Gateway and Lambda, can be so revolutionary. The features of these two services allow you to build a serverless production application that is highly scalable, available, and secure. Your application could effectively use any number of servers, yet by leveraging this pattern you do not have to manage a single one. In addition, by using these managed services together, you get the following benefits:

    1. No operating system to choose, secure or manage.
    2. No servers to right-size or monitor.
    3. No risk to your cost from over-provisioning.
    4. No risk to your performance from under-provisioning.

    API Gateway

    API Gateway is a fully managed service for defining, deploying, and maintaining APIs. Anyone can integrate with your APIs using standard HTTPS requests. Beyond that, it has specific features and qualities that give it an edge for your logic tier.

    Integration with Lambda

    API Gateway gives your application a simple way to leverage AWS Lambda directly over HTTPS requests. API Gateway forms the bridge that connects your presentation tier to the functions you write in Lambda. After you define the client/server relationship in your API, the contents of each client HTTPS request are passed to the Lambda function for execution; these contents include the request metadata, request headers, and the request body.
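
    As a sketch, assuming the Lambda proxy integration (where API Gateway passes the whole request through), a Python handler can pick those pieces out of the event object:

    import json
    
    def lambda_handler(event, context):
        # With the proxy integration, API Gateway forwards the request as-is.
        method = event.get("httpMethod")       # request metadata
        headers = event.get("headers") or {}   # request headers
        body = event.get("body")               # raw request body (a string)
    
        return {
            "statusCode": 200,
            "body": json.dumps({
                "method": method,
                "user_agent": headers.get("User-Agent"),
                "received_bytes": len(body) if body else 0,
            }),
        }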

    API Performance Across the Globe

    Each deployment of API Gateway includes an Amazon CloudFront distribution under the covers. Amazon CloudFront is a content delivery web service that uses Amazon's global network of edge locations as connection points for clients integrating with your API. This helps drive down the total response-time latency of your API. Through its many edge locations across the world, Amazon CloudFront also gives you capabilities to combat distributed denial of service (DDoS) attack scenarios.

    You can improve the performance of specific API requests by using API Gateway to store responses in an optional in-memory cache. This not only provides performance benefits for repeated API requests, it also reduces backend executions, which can lower your overall cost.

    Let’s dive into each step

    1. Create Lambda Function
    Log in to the AWS Console, head over to the Lambda service, and click on “Create a function”.

    A. Choose first option “Author from scratch”
    B. Enter Function Name
    C. Select Runtime e.g. Python 2.7
    D. Click on “Create Function”

    Once your function is ready, you will see a basic function generated in the language you chose.
    E.g.

    import json
    
    def lambda_handler(event, context):
        # TODO implement
        return {
            'statusCode': 200,
            'body': json.dumps('Hello from Lambda!')
        }

    2. Testing Lambda Function

    Click on the “Test” button at the top-right corner, where we need to configure a test event. As we are not sending any events, just give the event a name, keep the “Hello World” template as it is, and click “Create”.

    Now, when you hit the “Test” button again, it runs through testing the function we created earlier and returns the configured value.

    Create & Configure API Gateway connecting to Lambda

    We are done creating the Lambda function, but how do we invoke it from the outside world? We need an endpoint, right?

    Go to API Gateway and click on “Get Started”. It offers to create an Example API, but we will not use that; instead, create a “New API”. Give it a name, keeping the “Endpoint Type” regional for now.

    Create the API and you will land on the “Resources” page of the newly created API Gateway. Go through the following steps:

    A. Click on “Actions”, then click on “Create Method”. Select the GET method for our function, then click the tick mark on the right side of “GET” to set it up.
    B. Choose “Lambda Function” as the integration type.
    C. Choose the region where we created the function earlier.
    D. Write the name of the Lambda function we created.
    E. Save the method; it will ask you to confirm “Add Permission to Lambda Function”. Agree to that, and that is done.
    F. Now we can test our setup. Click on “Test” to run the API. It should return the same response text we saw on the Lambda test screen.

    Now, to get the endpoint, we need to deploy the API. On the “Actions” dropdown, click on “Deploy API” under API Actions, fill in the deployment details, and hit Deploy.

    After that, we will get our HTTPS endpoint.
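
    You can then call the endpoint like any other HTTPS URL. Here is a sketch in Python; the invoke URL below is a placeholder that you copy from the API Gateway console after deployment.

    import requests
    
    # Placeholder: substitute your own invoke URL and stage name.
    url = "https://abc123.execute-api.us-east-1.amazonaws.com/prod"
    
    response = requests.get(url)
    print(response.status_code, response.json())  # e.g. 200 'Hello from Lambda!'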

    On the stage settings screen you can configure things like caching, throttling, and logging. Save the changes and browse to the invoke URL; you should get the same response we earlier got from Lambda. With that, the logic tier of our serverless application is done.

    Data Tier

    By using Lambda as your logic tier, you have a number of data storage options for your data tier. These options fall into two broad categories: Amazon VPC-hosted data stores and IAM-enabled data stores. Lambda can integrate securely with both.

    Amazon VPC Hosted Data Stores

    1. Amazon RDS
    2. Amazon ElastiCache
    3. Amazon Redshift

    IAM-Enabled Data Stores

    1. Amazon DynamoDB
    2. Amazon S3
    3. Amazon Elasticsearch Service

    You can use any of these for storage, but DynamoDB is one of the best suited for serverless applications.

    Why DynamoDB?

    1. It is a NoSQL database that is fully managed by AWS.
    2. It provides fast and predictable performance with seamless scalability.
    3. DynamoDB lets you offload the administrative burden of operating and scaling a distributed system.
    4. It offers encryption at rest, which eliminates the operational burden and complexity involved in protecting sensitive data.
    5. You can scale your tables' throughput capacity up or down without downtime or performance degradation.
    6. It provides on-demand backups as well as point-in-time recovery for your DynamoDB tables.
    7. DynamoDB can automatically delete expired items from a table, helping you reduce storage usage and the cost of storing data that is no longer relevant.

    Following is a sample script using DynamoDB with Python, which you can use with Lambda.

    from __future__ import print_function # Python 2/3 compatibility
    import boto3
    import json
    import decimal
    from boto3.dynamodb.conditions import Key, Attr
    from botocore.exceptions import ClientError
    
    # Helper class to convert a DynamoDB item to JSON.
    class DecimalEncoder(json.JSONEncoder):
        def default(self, o):
            if isinstance(o, decimal.Decimal):
                if o % 1 > 0:
                    return float(o)
                else:
                    return int(o)
            return super(DecimalEncoder, self).default(o)
    
    # endpoint_url points at DynamoDB Local; remove it to use the live AWS
    # service (for example, when running inside Lambda).
    dynamodb = boto3.resource("dynamodb", region_name='us-west-2', endpoint_url="http://localhost:8000")
    
    table = dynamodb.Table('Movies')
    
    title = "The Big New Movie"
    year = 2015
    
    try:
        response = table.get_item(
            Key={
                'year': year,
                'title': title
            }
        )
    except ClientError as e:
        print(e.response['Error']['Message'])
    else:
        item = response['Item']
        print("GetItem succeeded:")
        print(json.dumps(item, indent=4, cls=DecimalEncoder))
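
    For the read above to return anything, the item has to exist first. A matching put_item call could seed it (a sketch assuming the same Movies table; the info attribute here is illustrative):

    # Seed the item that the script above reads back.
    table.put_item(
        Item={
            'year': 2015,
            'title': "The Big New Movie",
            'info': {'plot': "Nothing happens at all.", 'rating': 0}
        }
    )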

    Note: To run the above script successfully, you need to attach a policy to your Lambda role: in this case, a policy allowing the DynamoDB operations and, if you want to store logs, CloudWatch Logs access. The following is a policy you can attach to your role for the DB executions.

    {
    	"Version": "2012-10-17",
    	"Statement": [{
    			"Effect": "Allow",
    			"Action": [
    				"dynamodb:BatchGetItem",
    				"dynamodb:GetItem",
    				"dynamodb:Query",
    				"dynamodb:Scan",
    				"dynamodb:BatchWriteItem",
    				"dynamodb:PutItem",
    				"dynamodb:UpdateItem"
    			],
    			"Resource": "arn:aws:dynamodb:eu-west-1:123456789012:table/SampleTable"
    		},
    		{
    			"Effect": "Allow",
    			"Action": [
    				"logs:CreateLogStream",
    				"logs:PutLogEvents"
    			],
    			"Resource": "arn:aws:logs:eu-west-1:123456789012:*"
    		},
    		{
    			"Effect": "Allow",
    			"Action": "logs:CreateLogGroup",
    			"Resource": "*"
    		}
    	]
    }

    Sample Architecture Patterns

    You can implement the following popular architecture patterns using API Gateway and Lambda as your logic tier, Amazon S3 for the presentation tier, and DynamoDB as your data tier. For each example, we only use AWS services that do not require users to manage their own infrastructure.

    Mobile Backend

    1. Presentation Tier: A mobile application running on each user’s smartphone.

    2. Logic Tier: API Gateway and Lambda. The logic tier is globally distributed by the Amazon CloudFront distribution created as part of each API Gateway API. A set of Lambda functions can be specific to user/device identity management and authentication, managed by Amazon Cognito, which provides integration with IAM for temporary user access credentials as well as with popular third-party identity providers. Other Lambda functions can define the core business logic for your mobile backend.

    3. Data Tier: The various data storage services can be leveraged as needed; options are given above in data tier.

    Amazon S3 Hosted Website

    1. Presentation Tier: Static website content hosted on S3, distributed by Amazon CloudFront. Hosting static website content on S3 is a cost-effective alternative to hosting it on server-based infrastructure. However, for a website to offer rich features, the static content often must integrate with a dynamic backend.

    2. Logic Tier: API Gateway and Lambda. Static web content hosted in S3 can directly integrate with API Gateway, which can be made CORS-compliant.

    3. Data Tier: The various data storage services can be leveraged based on your requirement.

    Serverless Costing

    At the top of the AWS invoice, we can see the total cost of the AWS services. The bill shown covers 2.1 million API requests and all of the infrastructure required to support them.

    Following is the list of services with their costing.

    Note: You can estimate your own costs using the AWS calculators at the following links:

    1. https://calculator.s3.amazonaws.com/index.html
    2. AWS Pricing Calculator

    Conclusion

    The three-tier architecture pattern encourages the best practice of creating application components that are easy to maintain, develop, decouple, and scale. Which serverless services you use will vary based on your application's requirements as it develops.