Category: Data and AI

  • BigQuery 101: All the Basics You Need to Know

    Google BigQuery is an enterprise data warehouse built using BigTable and Google Cloud Platform. It’s serverless and completely managed. BigQuery works great with all sizes of data, from a 100 row Excel spreadsheet to several Petabytes of data. Most importantly, it can execute a complex query on those data within a few seconds.

    We need to note before we proceed, BigQuery is not a transactional database. It takes around 2 seconds to run a simple query like ‘SELECT * FROM bigquery-public-data.object LIMIT 10’ on a 100 KB table with 500 rows. Hence, it shouldn’t be thought of as OLTP (Online Transaction Processing) database. BigQuery is for Big Data!

    BigQuery supports SQL-like query, which makes it user-friendly and beginner friendly. It’s accessible via its web UI, command-line tool, or client library (written in C#, Go, Java, Node.js, PHP, Python, and Ruby). You can also take advantage of its REST APIs and get our job` done by sending a JSON request.

    Now, let’s dive deeper to understand it better. Suppose you are a data scientist (or a startup which analyzes data) and you need to analyze terabytes of data. If you choose a tool like MySQL, the first step before even thinking about any query is to have an infrastructure in place, that can store this magnitude of data.

    Designing this setup itself will be a difficult task because you have to figure out what will be the RAM size, DCOS or Kubernetes, and other factors. And if you have streaming data coming, you will need to set up and maintain a Kafka cluster. In BigQuery, all you have to do is a bulk upload of your CSV/JSON file, and you are done. BigQuery handles all the backend for you. If you need streaming data ingestion, you can use Fluentd. Another advantage of this is that you can connect Google Analytics with BigQuery seamlessly.

    BigQuery is serverless, highly available, and petabyte scalable service which allows you to execute complex SQL queries quickly. It lets you focus on analysis rather than handling infrastructure. The idea of hardware is completely abstracted and not visible to us, not even as virtual machines.

    Architecture of Google BigQuery

    You don’t need to know too much about the underlying architecture of BigQuery. That’s actually the whole idea of it – you don’t need to worry about architecture and operation.

    However, understanding BigQuery Architecture helps us in controlling costs, optimizing query performance, and optimizing storage. BigQuery is built using the Google Dremel paper.

    Quoting an Abstract from the Google Dremel Paper –

    “Dremel is a scalable, interactive ad-hoc query system for analysis of read-only nested data. By combining multi-level execution trees and columnar data layout, it is capable of running aggregation queries over trillion-row tables in seconds. The system scales to thousands of CPUs and petabytes of data and has thousands of users at Google. In this paper, we describe the architecture and implementation of Dremel and explain how it complements MapReduce-based computing. We present a novel columnar storage representation for nested records and discuss experiments on few-thousand node instances of the system.”

    Dremel was in production at Google since 2006. Google used it for the following tasks –

    • Analysis of crawled web documents.
    • Tracking install data for applications on Android Market.
    • Crash reporting for Google products.
    • OCR results from Google Books.
    • Spam analysis.
    • Debugging of map tiles on Google Maps.
    • Tablet migrations in managed Bigtable instances.
    • Results of tests run on Google’s distributed build system.
    • Disk I/O statistics for hundreds of thousands of disks.
    • Resource monitoring for jobs run in Google’s data centers.
    • Symbols and dependencies in Google’s codebase.

    BigQuery is much more than Dremel. Dremel is just a query execution engine, whereas Bigquery is based on interesting technologies like Borg (predecessor of Kubernetes) and Colossus. Colossus is the successor to the Google File System (GFS) as mentioned in Google Spanner Paper.

    How BigQuery Stores Data?

    BigQuery stores data in a columnar format – Capacitor (which is a successor of ColumnarIO). BigQuery achieves very high compression ratio and scan throughput. Unlike ColumnarIO, now on BigQuery, you can directly operate on compressed data without decompressing it.

    Columnar storage has the following advantages:

    • Traffic minimization – When you submit a query, the required column values on each query are scanned and only those are transferred on query execution. E.g., a query `SELECT title FROM Collection` would access the title column values only.
    • Higher compression ratio – Columnar storage can achieve a compression ratio of 1:10, whereas ordinary row-based storage can compress at roughly 1:3.

    (Image source:  Google Dremel Paper)

    Columnar storage has the disadvantage of not working efficiently when updating existing records. That is why Dremel doesn’t support any update queries.  

    How the Query Gets Executed?

    BigQuery depends on Borg for data processing. Borg simultaneously instantiates hundreds of Dremel jobs across required clusters made up of thousands of machines. In addition to assigning compute capacity for Dremel jobs, Borg handles fault-tolerance as well.

    Now, how do you design/execute a query which can run on thousands of nodes and fetches the result? This challenge was overcome by using the Tree Architecture. This architecture forms a gigantically parallel distributed tree for pushing down a query to the tree and aggregating the results from the leaves at a blazingly fast speed.

    (Image source: Google Dremel Paper)

    BigQuery vs. MapReduce

    The key differences between BigQuery and MapReduce are –

    • Dremel is designed as an interactive data analysis tool for large datasets
    • MapReduce is designed as a programming framework to batch process large datasets

    Moreover, Dremel finishes most queries within seconds or tens of seconds and can even be used by non-programmers, whereas MapReduce takes much longer (sometimes even hours or days) to process a query.

    Following is a comparison on running MapReduce on a row and columnar DB:

    (Image source: Google Dremel Paper)

    Another important thing to note is that BigQuery is meant to analyze structured data (SQL) but in MapReduce, you can write logic for unstructured data as well.

    Comparing BigQuery and Redshift

    In Redshift, you need to allocate different instance types and create your own clusters. The benefit of this is that it lets you tune the compute/storage to meet your needs. However, you have to be aware of (virtualized) hardware limits and scale up/out based on that. Note that you are charged by the hour for each instance you spin up.

    In BigQuery, you just upload the data and query it. It is a truly managed service. You are charged by storage, streaming inserts, and queries.

    There are more similarities in both the data warehouses than the differences.

    A smart user will definitely take advantage of the hybrid cloud (GCE+AWS) and leverage different services offered by both the ecosystems. Check out your quintessential guide to AWS Athena here.

    Getting Started With Google BigQuery

    Following is a quick example to show how you can quickly get started with BigQuery:

    1. There are many public datasets available on bigquery, you are going to play with ‘bigquery-public-data:stackoverflow’ dataset. You can click on the “Add Data” button on the left panel and select datasets.

    2. Next, find a language that has the best community, based on the response time. You can write the following query to do that.

    WITH question_answers_join AS (
      SELECT *
        , GREATEST(1, TIMESTAMP_DIFF(answers.first, creation_date, minute)) minutes_2_answer
      FROM (
        SELECT id, creation_date, title
          , (SELECT AS STRUCT MIN(creation_date) first, COUNT(*) c
             FROM `bigquery-public-data.stackoverflow.posts_answers` 
             WHERE a.id=parent_id
          ) answers
          , SPLIT(tags, '|') tags
        FROM `bigquery-public-data.stackoverflow.posts_questions` a
        WHERE EXTRACT(year FROM creation_date) > 2014
      )
    )
    SELECT COUNT(*) questions, tag
      , ROUND(EXP(AVG(LOG(minutes_2_answer))), 2) mean_geo_minutes
      , APPROX_QUANTILES(minutes_2_answer, 100)[SAFE_OFFSET(50)] median
    FROM question_answers_join, UNNEST(tags) tag
    WHERE tag IN ('javascript', 'python', 'rust', 'java', 'scala', 'ruby', 'go', 'react', 'c', 'c++')
    AND answers.c > 0
    GROUP BY tag
    ORDER BY mean_geo_minutes

    3. Now you can execute the query and get results –

    You can see that C has the best community followed by JavaScript!

    How to do Machine Learning on BigQuery?

    Now that you have a sound understanding of BigQuery. It’s time for some real action.

    As discussed above, you can connect Google Analytics with BigQuery by going to the Google Analytics Admin panel, then enable BigQuery by clicking on PROPERTY column, click All Products, then click Link BigQuery. After that, you need to enter BigQuery ID (or project number) and then BigQuery will be linked to Google Analytics. Note – Right now BigQuery integration is only available to Google Analytics 360.

    Assuming that you already have uploaded your google analytics data, here is how you can create a logistic regression model. Here, you are predicting whether a website visitor will make a transaction or not.

    CREATE MODEL `velotio_tutorial.sample_model`
    OPTIONS(model_type='logistic_reg') AS
    SELECT
      IF(totals.transactions IS NULL, 0, 1) AS label,
      IFNULL(device.operatingSystem, "") AS os,
      device.isMobile AS is_mobile,
      IFNULL(geoNetwork.country, "") AS country,
      IFNULL(totals.pageviews, 0) AS pageviews
    FROM
      `bigquery-public-data.google_analytics_sample.ga_sessions_*`
    WHERE
      _TABLE_SUFFIX BETWEEN '20190401' AND '20180630'

    Create a model named ‘velotio_tutorial.sample_model’. Now set the ‘model_type’ as ‘logistic_reg’ because you want to train a logistic regression model. A logistic regression model splits input data into two classes and gives the probability that the data is in one of the classes. Usually, in “spam or not spam” type of problems, you use logistic regression. Here, the problem is similar – a transaction will be made or not.

    The above query gets the total number of page views, the country from where the session originated, the operating system of visitors device, the total number of e-commerce transactions within the session, etc.

    Now you just press run query to execute the query.

    Conclusion

    BigQuery is a query service that allows us to run SQL-like queries against multiple terabytes of data in a matter of seconds. If you have structured data, BigQuery is the best option to go for. It can help even a non-programmer to get the analytics right!

    Learn how to build an ETL Pipeline for MongoDB & Amazon Redshift using Apache Airflow.

    If you need help with using machine learning in product development for your organization, connect with experts at Velotio!

  • Unit Testing Data at Scale using Deequ and Apache Spark

    Everyone knows the importance of knowledge and how critical it is to progress. In today’s world, data is knowledge. But that’s only when the data is “good” and correctly interpreted. Let’s focus on the “good” part. What do we really mean by “good data”?

    Its definition can change from use case to use case but, in general terms, good data can be defined by its accuracy, legitimacy, reliability, consistency, completeness, and availability.

    Bad data can lead to failures in production systems, unexpected outputs, and wrong inferences, leading to poor business decisions.

    It’s important to have something in place that can tell us about the quality of the data we have, how close it is to our expectations, and whether we can rely on it.

    This is basically the problem we’re trying to solve.

    The Problem and the Potential Solutions

    A manual approach to data quality testing is definitely one of the solutions and can work well.

    We’ll need to write code for computing various statistical measures, running them manually on different columns, maybe draw some plots, and then conduct some spot checks to see if there’s something not right or unexpected. The overall process can get tedious and time-consuming if we need to do it on a daily basis.

    Certain tools can make life easier for us, like:

    In this blog, we’ll be focussing on Amazon Deequ.

    Amazon Deequ

    Amazon Deequ is an open-source tool developed and used at Amazon. It’s built on top of Apache Spark, so it’s great at handling big data. Deequ computes data quality metrics regularly, based on the checks and validations set, and generates relevant reports.

    Deequ provides a lot of interesting features, and we’ll be discussing them in detail. Here’s a look at its main components:

    Source: AWS

    Prerequisites

    Working with Deequ requires having Apache Spark up and running with Deequ as one of the dependencies.

    As of this blog, the latest version of Deequ, 1.1.0, supports Spark 2.2.x to 2.4.x and Spark 3.0.x.

    Sample Dataset

    For learning more about Deequ and its features, we’ll be using an open-source IMDb dataset which has the following schema: 

    root
     |-- tconst: string (nullable = true)
     |-- titleType: string (nullable = true)
     |-- primaryTitle: string (nullable = true)
     |-- originalTitle: string (nullable = true)
     |-- isAdult: integer (nullable = true)
     |-- startYear: string (nullable = true)
     |-- endYear: string (nullable = true)
     |-- runtimeMinutes: string (nullable = true)
     |-- genres: string (nullable = true)
     |-- averageRating: double (nullable = true)
     |-- numVotes: integer (nullable = true)

    Here, tconst is the primary key, and the rest of the columns are pretty much self-explanatory.

    Data Analysis and Validation

    Before we start defining checks on the data, if we want to compute some basic stats on the dataset, Deequ provides us with an easy way to do that. They’re called metrics.

    Deequ provides support for the following metrics:

    ApproxCountDistinct, ApproxQuantile, ApproxQuantiles, Completeness, Compliance, Correlation, CountDistinct, DataType, Distance, Distinctness, Entropy, Histogram, Maximum, MaxLength, Mean, Minimum, MinLength, MutualInformation, PatternMatch, Size, StandardDeviation, Sum, UniqueValueRatio, Uniqueness

    Let’s go ahead and apply some metrics to our dataset.

    val runAnalyzer: AnalyzerContext = { AnalysisRunner
      .onData(data)
      .addAnalyzer(Size())
      .addAnalyzer(Completeness("averageRating"))
      .addAnalyzer(Uniqueness("tconst"))
      .addAnalyzer(Mean("averageRating"))
      .addAnalyzer(StandardDeviation("averageRating"))
      .addAnalyzer(Compliance("top rating", "averageRating >= 7.0"))
      .addAnalyzer(Correlation("numVotes", "averageRating"))
      .addAnalyzer(Distinctness("tconst"))
      .addAnalyzer(Maximum("averageRating"))
      .addAnalyzer(Minimum("averageRating"))
      .run()
    }
    
    val metricsResult = successMetricsAsDataFrame(spark, runAnalyzer)
    metricsResult.show(false)

    We get the following output by running the code above:

    +-----------+----------------------+-----------------+--------------------+
    |entity     |instance              |name             |value               |
    +-----------+----------------------+-----------------+--------------------+
    |Mutlicolumn|numVotes,averageRating|Correlation      |0.013454113877394851|
    |Column     |tconst                |Uniqueness       |1.0                 |
    |Column     |tconst                |Distinctness     |1.0                 |
    |Dataset    |*                     |Size             |7339583.0           |
    |Column     |averageRating         |Completeness     |0.14858528066240276 |
    |Column     |averageRating         |Mean             |6.886130810579155   |
    |Column     |averageRating         |StandardDeviation|1.3982924856469208  |
    |Column     |averageRating         |Maximum          |10.0                |
    |Column     |averageRating         |Minimum          |1.0                 |
    |Column     |top rating            |Compliance       |0.080230443609671   |
    +-----------+----------------------+-----------------+--------------------+

    Let’s try to quickly understand what this tells us.

    • The dataset has 7,339,583 rows.
    • The distinctness and uniqueness of the tconst column is 1.0, which means that all the values in the column are distinct and unique, which should be expected as it’s the primary key column.
    • The averageRating column has a min of 1 and a max of 10 with a mean of 6.88 and a standard deviation of 1.39, which tells us about the variation in the average rating values across the data.
    • The completeness of the averageRating column is 0.148, which tells us that we have an average rating available for around 15% of the dataset’s records.
    • Then, we tried to see if there’s any correlation between the numVotes and averageRating column. This metric calculates the Pearson correlation coefficient, which has a value of 0.01, meaning there’s no correlation between the two columns, which is expected.

    This feature of Deequ can be really helpful if we want to quickly do some basic analysis on a dataset.

    Let’s move on to defining and running tests and checks on the data.

    Data Validation

    For writing tests for our dataset, we use Deequ’s VerificationSuite and add checks on attributes of the dataset.

    Deequ has a big handy list of validators available to use, which are:

    hasSize, isComplete, hasCompleteness, isUnique, isPrimaryKey, hasUniqueness, hasDistinctness, hasUniqueValueRatio, hasNumberOfDistinctValues, hasHistogramValues, hasEntropy, hasMutualInformation, hasApproxQuantile, hasMinLength, hasMaxLength, hasMin, hasMax, hasMean, hasSum, hasStandardDeviation, hasApproxCountDistinct, hasCorrelation, satisfies, hasPattern, containsCreditCardNumber, containsEmail, containsURL, containsSocialSecurityNumber, hasDataType, isNonNegative, isPositive, isLessThan, isLessThanOrEqualTo, isGreaterThan, isGreaterThanOrEqualTo, isContainedIn

    Let’s apply some checks to our dataset.

    val validationResult: VerificationResult = { VerificationSuite()
      .onData(data)
      .addCheck(
        Check(CheckLevel.Error, "Review Check") 
          .hasSize(_ >= 100000) // check if the data has atleast 100k records
          .hasMin("averageRating", _ > 0.0) // min rating should not be less than 0
          .hasMax("averageRating", _ < 9.0) // max rating should not be greater than 9
          .containsURL("titleType") // verify that titleType column has URLs
          .isComplete("primaryTitle") // primaryTitle should never be NULL
          .isNonNegative("numVotes") // should not contain negative values
          .isPrimaryKey("tconst") // verify that tconst is the primary key column
          .hasDataType("isAdult", ConstrainableDataTypes.Integral) 
          //column contains Integer values only, expected as values this col has are 0 or 1
          )
      .run()
    }
    
    val results = checkResultsAsDataFrame(spark, validationResult)
    results.select("constraint","constraint_status","constraint_message").show(false)

    We have added some checks to our dataset, and the details about the check can be seen as comments in the above code.

    We expect all checks to pass for our dataset except the containsURL and hasMax ones.

    That’s because the titleType column doesn’t have URLs, and we know that the max rating is 10.0, but we are checking against 9.0.

    We can see the output below:

    +--------------------------------------------------------------------------------------------+-----------------+-----------------------------------------------------+
    |constraint                                                                                  |constraint_status|constraint_message                                   |
    +--------------------------------------------------------------------------------------------+-----------------+-----------------------------------------------------+
    |SizeConstraint(Size(None))                                                                  |Success          |                                                     |
    |MinimumConstraint(Minimum(averageRating,None))                                              |Success          |                                                     |
    |MaximumConstraint(Maximum(averageRating,None))                                              |Failure          |Value: 10.0 does not meet the constraint requirement!|
    |containsURL(titleType)                                                                      |Failure          |Value: 0.0 does not meet the constraint requirement! |
    |CompletenessConstraint(Completeness(primaryTitle,None))                                     |Success          |                                                     |
    |ComplianceConstraint(Compliance(numVotes is non-negative,COALESCE(numVotes, 0.0) >= 0,None))|Success          |                                                     |
    |UniquenessConstraint(Uniqueness(List(tconst),None))                                         |Success          |                                                     |
    |AnalysisBasedConstraint(DataType(isAdult,None),<function1>,Some(<function1>),None)          |Success          |                                                     |
    +--------------------------------------------------------------------------------------------+-----------------+-----------------------------------------------------+
    view raw

    In order to perform these checks, behind the scenes, Deequ calculated metrics that we saw in the previous section.

    To look at the metrics Deequ computed for the checks we defined, we can use: 

    VerificationResult.successMetricsAsDataFrame(spark,validationResult)
                      .show(truncate=false)

    Automated Constraint Suggestion

    Automated constraint suggestion is a really interesting and useful feature provided by Deequ.

    Adding validation checks on a dataset with hundreds of columns or on a large number of datasets can be challenging. With this feature, Deequ tries to make our task easier. Deequ analyses the data distribution and, based on that, suggests potential useful constraints that can be used as validation checks.

    Let’s see how this works.

    This piece of code can automatically generate constraint suggestions for us:

    val constraintResult = { ConstraintSuggestionRunner()
      .onData(data)
      .addConstraintRules(Rules.DEFAULT)
      .run()
    }
    
    val suggestionsDF = constraintResult.constraintSuggestions.flatMap { 
      case (column, suggestions) => 
        suggestions.map { constraint =>
          (column, constraint.description, constraint.codeForConstraint)
        } 
    }.toSeq.toDS()
    
    suggestionsDF.select("_1","_2").show(false)

    Let’s look at constraint suggestions generated by Deequ:

    +--------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    |runtimeMinutes|'runtimeMinutes' has less than 72% missing values                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       |
    |tconst        |'tconst' is not null                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    |
    |titleType     |'titleType' is not null                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 |
    |titleType     |'titleType' has value range 'tvEpisode', 'short', 'movie', 'video', 'tvSeries', 'tvMovie', 'tvMiniSeries', 'tvSpecial', 'videoGame', 'tvShort'                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          |
    |titleType     |'titleType' has value range 'tvEpisode', 'short', 'movie' for at least 90.0% of values                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  |
    |averageRating |'averageRating' has no negative values                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  |
    |originalTitle |'originalTitle' is not null                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             |
    |startYear     |'startYear' has less than 9% missing values                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             |
    |startYear     |'startYear' has type Integral                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           |
    |startYear     |'startYear' has no negative values                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      |
    |endYear       |'endYear' has type Integral  
    |endYear       |'endYear' has value range '2017', '2018', '2019', '2016', '2015', '2020', '2014', '2013', '2012', '2011', '2010',......|
    |endYear       |'endYear' has value range '' for at least 99.0% of values                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               |
    |endYear       |'endYear' has no negative values                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        |
    |numVotes      |'numVotes' has no negative values                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       |
    |primaryTitle  |'primaryTitle' is not null                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              |
    |isAdult       |'isAdult' is not null                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   |
    |isAdult       |'isAdult' has no negative values                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        |
    |genres        |'genres' has less than 7% missing values                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                |
    +--------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

    We shouldn’t expect the constraint suggestions generated by Deequ to always make sense. They should always be verified before using.

    This is because the algorithm that generates the constraint suggestions just works on the data distribution and isn’t exactly “intelligent.”

    We can see that most of the suggestions generated make sense even though they might be really trivial.

    For the endYear column, one of the suggestions is that endYear should be contained in a list of years, which indeed is true for our dataset. However, it can’t be generalized as every passing year, the value for endYear continues to increase.

    But on the other hand, the suggestion that titleType can take the following values: ‘tvEpisode,’ ‘short,’ ‘movie,’ ‘video,’ ‘tvSeries,’ ‘tvMovie,’ ‘tvMiniSeries,’ ‘tvSpecial,’ ‘videoGame,’ and ‘tvShort’ makes sense and can be generalized, which makes it a great suggestion.

    And this is why we should not blindly use the constraints suggested by Deequ and always cross-check them.

    Something we can do to improve the constraint suggestions is to use the useTrainTestSplitWithTestsetRatio method in ConstraintSuggestionRunner.
    It makes a lot of sense to use this on large datasets.

    How does this work? If we use the config useTrainTestSplitWithTestsetRatio(0.1), Deequ would compute constraint suggestions on 90% of the data and evaluate the suggested constraints on the remaining 10%, which would improve the quality of the suggested constraints.

    Anomaly Detection

    Deequ also supports anomaly detection for data quality metrics.

    The idea behind Deequ’s anomaly detection is that often we have a sense of how much change in certain metrics of our data can be expected. Say we are getting new data every day, and we know that the number of records we get on a daily basis are around 8 to 12k. On a random day, if we get 40k records, we know something went wrong with the data ingestion job or some other job didn’t go right.

    Deequ will regularly store the metrics of our data in a MetricsRepository. Once that’s done, anomaly detection checks can be run. These compare the current values of the metrics to the historical values stored in the MetricsRepository, and that helps Deequ to detect anomalous changes that are a red flag.

    One of Deequ’s anomaly detection strategies is the RateOfChangeStrategy, which limits the maximum change in the metrics by some numerical factor that can be passed as a parameter.

    Deequ supports other strategies that can be found here. And code examples for anomaly detection can be found here.

    Conclusion

    We learned about the main features and capabilities of AWS Lab’s Deequ.

    It might feel a little daunting to people unfamiliar with Scala or Spark, but using Deequ is very easy and straightforward. Someone with a basic understanding of Scala or Spark should be able to work with Deequ’s primary features without any friction.

    For someone who rarely deals with data quality checks, manual test runs might be a good enough option. However, for someone dealing with new datasets frequently, as in multiple times in a day or a week, using a tool like Deequ to perform automated data quality testing makes a lot of sense in terms of time and effort.

    We hope this article helped you get a deep dive into data quality testing and using Deequ for these types of engineering practices.

  • The Ultimate Beginner’s Guide to Jupyter Notebooks

    Jupyter Notebooks offer a great way to write and iterate on your Python code. It is a powerful tool for developing data science projects in an interactive way. Jupyter Notebook allows to showcase the source code and its corresponding output at a single place helping combine narrative text, visualizations and other rich media.The intuitive workflow promotes iterative and rapid development, making notebooks the first choice for data scientists. Creating Jupyter Notebooks is completely free as it falls under Project Jupyter which is completely open source.

    Project Jupyter is the successor to an earlier project IPython Notebook, which was first published as a prototype in 2010. Jupyter Notebook is built on top of iPython, an interactive tool for executing Python code in the terminal using REPL model(Read-Eval-Print-Loop). The iPython kernel executes the python code and communicates with the Jupyter Notebook front-end interface. Jupyter Notebooks also provide additional features like storing your code and output and keep the markdown by extending iPython.

    Although Jupyter Notebooks support using various programming languages, we will focus on Python and its application in this article.

    Getting Started with Jupyter Notebooks!

    Installation

    Prerequisites

    As you would have surmised from the above abstract we need to have Python installed on your machine. Either Python 2.7 or Python 3.+ will do.

    Install Using Anaconda

    The simplest way to get started with Jupyter Notebooks is by installing it using Anaconda. Anaconda installs both Python3 and Jupyter and also includes quite a lot of packages commonly used in the data science and machine learning community. You can follow the latest guidelines from here.

    Install Using Pip

    If, for some reason, you decide not to use Anaconda, then you can install Jupyter manually using Python pip package, just follow the below code:

    pip install jupyter

    Launching First Notebook

    Open your terminal, navigate to the directory where you would like to store you notebook and launch the Jupyter Notebooks. Then type the below command and the program will instantiate a local server at http://localhost:8888/tree.

    jupyter notebook

    A new window with the Jupyter Notebook interface will open in your internet browser. As you might have already noticed Jupyter starts up a local Python server to serve web apps in your browser, where you can access the Dashboard and work with the Jupyter Notebooks. The Jupyter Notebooks are platform independent which makes it easier to collaborate and share with others.

    The list of all files is displayed under the Files tab whereas all the running processes can be viewed by clicking on the Running tab and the third tab, Clusters is extended from IPython parallel, IPython’s parallel computing framework. It helps you to control multiple engines, extended from the IPython kernel.

    Let’s start by making a new notebook. We can easily do this by clicking on the New drop-down list in the top- right corner of the dashboard. You see that you have the option to make a Python 3 notebook as well as regular text file, a folder, and a terminal. Please select the Python 3 notebook option.

    Your Jupyter Notebook will open in a new tab as shown in below image.

    Now each notebook is opened in a new tab so that you can simultaneously work with multiple notebooks. If you go back to the dashboard tab, you will see the new file Untitled.ipynb and you should see some green icon to it’s left which indicates your new notebook is running.

     

    Why a .ipynb file?

    .ipynb is the standard file format for storing Jupyter Notebooks, hence the file name Untitled.ipynb. Let’s begin by first understanding what an .ipynb file is and what it might contain. Each .ipynb file is a text file that describes the content of your notebook in a JSON format. The content of each cell, whether it is text, code or image attachments that have been converted into strings, along with some additional metadata is stored in the .ipynb file. You can also edit the metadata by selecting “Edit > Edit Notebook Metadata” from the menu options in the notebook.

    You can also view the content of your notebook files by selecting “Edit” from the controls on the dashboard, there’s no reason to do so unless you really want to edit the file manually.

    Understanding the Notebook Interface

    Now that you have created a notebook, let’s have a look at the various menu options and functions, which are readily available. Take some time out to scroll through the the list of commands that opens up when you click on the keyboard icon (or press Ctrl + Shift + P).

    There are two prominent terminologies that you should care to learn about: cells and kernels are key both to understanding Jupyter and to what makes it more than just a content writing tool. Fortunately, these concepts are not difficult to understand.

    • A kernel is a program that interprets and executes the user’s code. The Jupyter Notebook App has an inbuilt kernel for Python code, but there are also kernels available for other programming languages.
    • A cell is a container which holds the executable code or normal text 

    Cells

    Cells form the body of a notebook. If you look at the screenshot above for a new notebook (Untitled.ipynb), the text box with the green border is an empty cell. There are 4 types of cells:

    • Code – This is where you type your code and when executed the kernel will display its output below the cell.
    • Markdown – This is where you type your text formatted using Markdown and the output is displayed in place when it is run.
    • Raw NBConvert – It’s a command line tool to convert your notebook into another format (like HTML, PDF etc.)
    • Heading – This is where you add Headings to separate sections and make your notebook look tidy and neat. This has now been merged into the Markdown option itself. Adding a ‘#’ at the beginning ensures that whatever you type after that will be taken as a heading.

    Let’s test out how the cells work with a basic “hello world” example. Type print(‘Hello World!’) in the cell and press Ctrl + Enter or click on the Run button in the toolbar at the top.

    print("Hello World!")

    Hello World!

    When you run the cell, the output will be displayed below, and the label to its left changes from In[ ] to In[1] . Moreover, to signify that the cell is still running, Jupyter changes the label to In[*]

    Additionally, it is important to note that the output of a code cell comes from any of the print statements in the code cell, as well as the value of the last line in the cell, irrespective of it being a variable, function call or some other code snippet.

    Markdown

    Markdown is a lightweight, markup language for formatting plain text. Its syntax has a one-to-one correspondence with HTML tags. As this article has been written in a Jupyter notebook, all of the narrative text and images you can see, are written in Markdown. Let’s go through the basics with the following example.

    # This is a level 1 heading 
    ### This is a level 3 heading
    This is how you write some plain text that would form a paragraph.
    You can emphasize the text by enclosing the text like "**" or "__" to make it bold and enclosing the text in "*" or "_" to make it italic. 
    Paragraphs are separated by an empty line.
    * We can include lists.
      * And also indent them.
    
    1. Getting Numbered lists is
    2. Also easy.
    
    [To include hyperlinks enclose the text with square braces and then add the link url in round braces](http://www.example.com)
    
    Inline code uses single backticks: `foo()`, and code blocks use triple backticks:
    
    ``` 
    foo()
    ```
    
    Or can be indented by 4 spaces: 
    
        foo()
        
    And finally, adding images is easy: ![Online Image](https://www.example.com/image.jpg) or ![Local Image](img/image.jpg) or ![Image Attachment](attachment:image.jpg)

    We have 3 different ways to attach images

    • Link the URL of an image from the web.
    • Use relative path of an image present locally
    • Add an attachment to the notebook by using “Edit>Insert Image” option; This method converts the image into a string and store it inside your notebook

    Note that adding an image as an attachment will make the .ipynb file much larger because it is stored inside the notebook in a string format.

    There are a lot more features available in Markdown. To learn more about markdown, you can refer to the official guide from the creator, John Gruber, on his website.

    Kernels

    Every notebook runs on top of a kernel. Whenever you execute a code cell, the content of the cell is executed within the kernel and any output is returned back to the cell for display. The kernel’s state applies to the document as a whole and not individual cells and is persisted over time.

    For example, if you declare a variable or import some libraries in a cell, they will be accessible in other cells. Now let’s understand this with the help of an example. First we’ll import a Python package and then define a function.

    import os, binascii
    def sum(x,y):
      return x+y

    Once the cell above  is executed, we can reference os, binascii and sum in any other cell.

    rand_hex_string = binascii.b2a_hex(os.urandom(15)) 
    print(rand_hex_string)
    x = 1
    y = 2
    z = sum(x,y)
    print('Sum of %d and %d is %d' % (x, y, z))

    The output should look something like this:

    c84766ca4a3ce52c3602bbf02a
    d1f7 Sum of 1 and 2 is 3

    The execution flow of a notebook is generally from top-to-bottom, but it’s common to go back to make changes. The order of execution is shown to the left of each cell, such as In [2] , will let you know whether any of your cells have stale output. Additionally, there are multiple options in the Kernel menu which often come very handy.

    • Restart: restarts the kernel, thus clearing all the variables etc that were defined.
    • Restart & Clear Output: same as above but will also wipe the output displayed below your code cells.
    • Restart & Run All: same as above but will also run all your cells in order from top-to-bottom.
    • Interrupt: If your kernel is ever stuck on a computation and you wish to stop it, you can choose the Interrupt option.

    Naming Your Notebooks

    It is always a best practice to give a meaningful name to your notebooks. You can rename your notebooks from the notebook app itself by double-clicking on the existing name at the top left corner. You can also use the dashboard or the file browser to rename the notebook file. We’ll head back to the dashboard to rename the file we created earlier, which will have the default notebook file name Untitled.ipynb.

    Now that you are back on the dashboard, you can simply select your notebook and click “Rename” in the dashboard controls

    Jupyter notebook - Rename

    Shutting Down your Notebooks

    We can shutdown a running notebook by selecting “File > Close and Halt” from the notebook menu. However, we can also shutdown the kernel either by selecting the notebook in the dashboard and clicking “Shutdown” or by going to “Kernel > Shutdown” from within the notebook app (see images below).

    Shutdown the kernel from Notebook App:

     

    Shutdown the kernel from Dashboard:

     

     

    Sharing Your Notebooks

    When we talk about sharing a notebook, there are two things that might come to our mind. In most cases, we would want to share the end-result of the work, i.e. sharing non-interactive, pre-rendered version of the notebook, very much similar to this article; however, in some cases we might want to share the code and collaborate with others on notebooks with the aid of version control systems such as Git which is also possible.

    Before You Start Sharing

    The state of the shared notebook including the output of any code cells is maintained when exported to a file. Hence, to ensure that the notebook is share-ready, we should follow below steps before sharing.

    1. Click “Cell > All Output > Clear”
    2. Click “Kernel > Restart & Run All”
    3. After the code cells have finished executing, validate the output. 

    This ensures that your notebooks don’t have a stale state or contain intermediary output.

    Exporting Your Notebooks

    Jupyter has built-in support for exporting to HTML, Markdown and PDF as well as several other formats, which you can find from the menu under “File > Download as” . It is a very convenient way to share the results with others. But if sharing exported files isn’t suitable for you, there are some other popular methods of sharing the notebooks directly on the web.

    • GitHub
    • With home to over 2 million notebooks, GitHub is the most popular place for sharing Jupyter projects with the world. GitHub has integrated support for rendering .ipynb files directly both in repositories and gists on its website.
    • You can just follow the GitHub guides for you to get started on your own.
    • Nbviewer
    • NBViewer is one of the most prominent notebook renderers on the web.
    • It also renders your notebook from GitHub and other such code storage platforms and provide a shareable URL along with it. nbviewer.jupyter.org provides a free rendering service as part of Project Jupyter.

    Data Analysis in a Jupyter Notebook

    Now that we’ve looked at what a Jupyter Notebook is, it’s time to look at how they’re used in practice, which should give you a clearer understanding of why they are so popular. As we walk through the sample analysis, you will be able to see how the flow of a notebook makes the task intuitive to work through ourselves, as well as for others to understand when we share it with them. We also hope to learn some of the more advanced features of Jupyter notebooks along the way. So let’s get started, shall we?

    Analyzing the Revenue and Profit Trends of Fortune 500 US companies from 1955-2013

    So, let’s say you’ve been tasked with finding out how the revenues and profits of the largest companies in the US changed historically over the past 60 years. We shall begin by gathering the data to analyze.

    Gathering the DataSet

    The data set that we will be using to analyze the revenue and profit trends of fortune 500 companies has been sourced from Fortune 500 Archives and Top Foreign Stocks. For your ease we have compiled the data from both the sources and created a CSV for you.

    Importing the Required Dependencies

    Let’s start off with a code cell specifically for imports and initial setup, so that if we need to add or change anything at a later point in time, we can simply edit and re-run the cell without having to change the other cells. We can start by importing Pandas to work with our data, Matplotlib to plot the charts and Seaborn to make our charts prettier.

    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns
    import sys

    Set the design styles for the charts

    sns.set(style="darkgrid")

    Load the Input Data to be Analyzed

    As we plan on using pandas to aid in our analysis, let’s begin by importing our input data set into the most widely used pandas data-structure, DataFrame.

    df = pd.read_csv('../data/fortune500_1955_2013.csv')

    Now that we are done loading our input dataset, let us see how it looks like!

    df.head()

    Looking good. Each row corresponds to a single company per year and all the columns we need are present.

    Exploring the Dataset

    Next, let’s begin by exploring our data set. We will primarily look into the number of records imported and the data types for each of the different columns that were imported.

    As we have 500 data points per year and since the data set has records between 1955 and 2012, the total number of records in the dataset looks good!

    Now, let’s move on to the individual data types for each of the column.

    df.columns = ['year', 'rank', 'company', 'revenue', 'profit']
    len(df)

    df.dtypes

    As we can see from the output of the above command the data types for the columns revenue and profit are being shown as object whereas the expected data type should be float. It indicates that there may be some non-numeric values in the revenue and profit columns.

    So let’s first look at the details of imported values for revenue.

    non_numeric_revenues = df.revenue.str.contains('[^0-9.-]')
    df.loc[non_numeric_revenues].head()

    print("Number of Non-numeric revenue values: ", len(df.loc[non_numeric_revenues]))

    Number of Non-numeric revenue values:	1

    print("List of distinct Non-numeric revenue values: ", set(df.revenue[non_numeric_revenues]))

    List of distinct Non-numeric revenue values:	{'N.A.'}

    As the number of non-numeric revenue values is considerably less compared to the total size of our data set. Hence, it would be easier to just remove those rows.

    df = df.loc[~non_numeric_revenues]
    df.revenue = df.revenue.apply(pd.to_numeric)
    eval(In[6])

    Now that the data type issue for column revenue is resolved, let’s move on to values in column profit.

    non_numeric_profits = df.profit.str.contains('[^0-9.-]')
    df.loc[non_numeric_profits].head()

    print("Number of Non-numeric profit values: ", len(df.loc[non_numeric_profits]))

    Number of Non-numeric profit values:	374

    print("List of distinct Non-numeric profit values: ", set(df.profit[non_numeric_profits]))

    List of distinct Non-numeric profit values:	{'N.A.'}

    As the number of non-numeric profit values is around 1.5% which is a small percentage of our data set, but not completely inconsequential. Let’s take a quick look at the distribution of values and if the rows having N.A. values are uniformly distributed over the years then it would be wise to just remove the rows with missing values.

    bin_sizes, _, _ = plt.hist(df.year[non_numeric_profits], bins=range(1955, 2013))

    As observed from the histogram above, majority of invalid values in single year is fewer than 25, removing these values would account for less than 4% of the data as there are 500 data points per year. Also, other than a surge around 1990, most years have fewer than less than 10 values missing. Let’s assume that this is acceptable for us and move ahead with removing these rows.

    df = df.loc[~non_numeric_profits]
    df.profit = df.profit.apply(pd.to_numeric)

    We should validate if that worked!

    eval(In[6])

    Hurray! Our dataset has been cleaned up.

    Time to Plot the graphs

    Let’s begin with defining a function to plot the graph, set the title and add lables for the x-axis and y-axis.

    # function to plot the graphs for average revenues or profits of the fortune 500 companies against year
    def plot(x, y, ax, title, y_label):
        ax.set_title(title)
        ax.set_ylabel(y_label)
        ax.plot(x, y)
        ax.margins(x=0, y=0)
        
    # function to plot the graphs with superimposed standard deviation    
    def plot_with_std(x, y, stds, ax, title, y_label):
        ax.fill_between(x, y - stds, y + stds, alpha=0.2)
        plot(x, y, ax, title, y_label)

    Let’s plot the average profit by year and average revenue by year using Matplotlib.

    group_by_year = df.loc[:, ['year', 'revenue', 'profit']].groupby('year')
    avgs = group_by_year.mean()
    x = avgs.index
    y = avgs.profit
    
    fig, ax = plt.subplots()
    plot(x, y, ax, 'Increase in mean Fortune 500 company profits from 1955 to 2013', 'Profit (millions)')

    y2 = avgs.revenue
    fig, ax = plt.subplots()
    plot(x, y2, ax, 'Increase in mean Fortune 500 company revenues from 1955 to 2013', 'Revenue (millions)')

    Woah! The charts for profits has got some huge ups and downs. It seems like they correspond to the early 1990s recession, the dot-com bubble in the early 2000s and the Great Recession in 2008.

    On the other hand, the Revenues are constantly growing and are comparatively stable. Also it does help to understand how the average profits recovered so quickly after the staggering drops because of the recession.

    Let’s also take a look at how the average profits and revenues compare to their standard deviations.

    fig, (ax1, ax2) = plt.subplots(ncols=2)
    title = 'Increase in mean and std Fortune 500 company %s from 1955 to 2013'
    stds1 = group_by_year.std().profit.values
    stds2 = group_by_year.std().revenue.values
    plot_with_std(x, y.values, stds1, ax1, title % 'profits', 'Profit (millions)')
    plot_with_std(x, y2.values, stds2, ax2, title % 'revenues', 'Revenue (millions)')
    fig.set_size_inches(14, 4)
    fig.tight_layout()

     

    That’s astonishing, the standard deviations are huge. Some companies are making billions while some others are losing as much, and the risk certainly has increased along with rising profits and revenues over the years. Although we could keep on playing around with our data set and plot plenty more charts to analyze, it is time to bring this article to a close.

    Conclusion

    As part of this article we have seen various features of the Jupyter notebooks, from basics like installation, creating, and running code cells to more advanced features like plotting graphs. The power of Jupyter Notebooks to promote a productive working experience and provide an ease of use is evident from the above example, and I do hope that you feel confident to begin using Jupyter Notebooks in your own work and start exploring more advanced features. You can read more about data analytics using Pandas here.

    If you’d like to further explore and want to look at more examples, Jupyter has put together A Gallery of Interesting Jupyter Notebooks that you may find helpful and the Nbviewer homepage provides a lot of examples for further references. Find the entire code here on Github.

  • Your Quintessential Guide to AWS Athena

    Introduction

    Serverless has become a new trend today and is here to stay for sure! Now when you think of wireless internet, you know that it still has some wires but you don’t need to worry about them as you don’t have to maintain them. Similarly, serverless has servers but you don’t have to keep worrying about handling or maintaining them. All you need to do is focus on your code and you’re good to go.

    It has some more benefits, such as:

    • Zero administration: You can deploy code without provisioning anything beforehand, or managing anything later. There is no concept of a fleet, an instance, or even an operating system.
    • Auto-scaling: It lets your service providers manage the scaling challenges. You don’t need to fire alerts or write scripts to scale up and down. It handles quick bursts of traffic and weekend lulls the same way.
    • Pay-per-use: The function-as-a-service compute and managed services are charged based on usage rather than pre-provisioned capacity. You can have complete resource utilization without paying a cent for idle time. The results? 90% cost-savings over a cloud VM, and the satisfaction of knowing that you never pay for resources you don’t use.

    What is AWS Athena?

    AWS Athena is a similar serverless service. It is more of an interactive query service than a code deployment service.

    Using Athena one can directly query the data stored in S3 buckets and using standard ANSI SQL.

    As mentioned earlier, it works on the principle of serverless, that is, there is no infrastructure to manage, and you pay only for the queries that you run.

    Athena is easy to use. You can simply point to your data in Amazon S3, define the schema, and start querying using standard SQL. Most results are delivered within seconds. With Athena, there’s no need for complex ETL jobs to prepare your data for analysis. This makes it easy for anyone with SQL skills to quickly analyze large-scale datasets.

    It is based on Facebook’s PrestoDB and can be used to query structured and semi-structured data.

    Some Exciting Features of Athena are:

    • Serverless. No ETL – Not having to set up and manage any servers or data warehouses.
    • Only pay for the data that is scanned.
    • You can ensure better performance by compressing, partitioning, and converting your data into columnar formats.
    • Can also handle complex analysis, including large joins, window functions, and arrays.
    • Athena automatically executes queries in parallel.
    • Need to provide a path to the S3 folder and when new files added automatically reflects in the table.
    • Supports –
    • Support CSV, Json, Parquet, ORC, Avro data formats
    • Complex Joins and datatypes
    • View creation
    • Does not Support –
    • User-defined functions and stored procedures
    • Hive or Presto transactions
    • LZO (Snappy is supported)

    Pricing of Athena

    • AWS Athena is priced $5 for each TB of data scanned.
    • Queries are rounded up to the nearest MB, with a 10 MB minimum.
    • Users pay for stored data at regular S3 rates.
    • Amazon advises users to use compressed data files, have data in columnar formats, and routinely delete old results sets to keep charges low. Partitioning data in tables can speed up queries and reduce query bills.

    Athena vs. Redshift Spectrum

    • AWS also has Redshift as data warehouse service, and we can use redshift spectrum to query S3 data, so then why should you use Athena?

    Advantages of Redshift Spectrum:

    • Allows creation of Redshift tables. You’re able to join Redshift tables with Redshift spectrum tables efficiently.

    If you do not need those things then you should consider Athena as well Athena differences from Redshift spectrum:

    • Billing. This is a major difference and depending on your use case you may find one much cheaper than the other Performance.
    • Athena slightly faster. SQL syntax and features.
    • Athena is derived from presto and is a bit different to Redshift which has its roots in Postgres.
    • It’s easy enough to connect to Athena using API, JDBC or ODBC but many more products offer “standard out of the box” connection to Redshift.
    • Athena has GIS functions and lambdas.

    So in nutshell, if you have existing instances of redshift you would probably go for Redshift Spectrum, if not then you can opt for Athena for querying the data. In some cases, you can use both in tandem.

    Example

    Here is a sample query to create a sample database having 3 tables basic_details, contact_details and bill_details, Uploaded csv file to s3:

    Basic_details:

    const outside = {weather: FRIGHTFUL}
    const inside = {fire: DELIGHTFUL}
    const go = places => places.some(p=>p>outside.weather)))
    
    const snow = () => (outside.weather < inside.fire && !go(places)) {
      let it = snow()
    }
    
    let it = snow()
    
    const FRIGHTFUL = 1
    const DELIGHTFUL = 1337

    Bill_details:

    CREATE EXTERNAL TABLE `bil_details`(
      `id` int COMMENT '', 
      `amount_paid` string COMMENT '', 
      `amount_due` string COMMENT '')
    ROW FORMAT DELIMITED 
      FIELDS TERMINATED BY ',' 
    STORED AS INPUTFORMAT 
      'org.apache.hadoop.mapred.TextInputFormat' 
    OUTPUTFORMAT 
      'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
    LOCATION
      's3://athena-blog/bill-details'
    TBLPROPERTIES (
      'has_encrypted_data'='false', 
      'skip.header.line.count'='1')

    Contact_details:

    CREATE EXTERNAL TABLE `contact_details`(
      `id` int COMMENT '', 
      `street` string COMMENT '', 
      `city` string COMMENT '', 
      `state` string COMMENT '', 
      `country` string COMMENT '', 
      `zip` string COMMENT '')
    ROW FORMAT DELIMITED 
      FIELDS TERMINATED BY ',' 
    STORED AS INPUTFORMAT 
      'org.apache.hadoop.mapred.TextInputFormat' 
    OUTPUTFORMAT 
      'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
    LOCATION
      's3://athena-blog/contact-details'
    TBLPROPERTIES (
      'has_encrypted_data'='false', 
      'skip.header.line.count'='1')

    Sample Query for – FirstNames of People from Minnesota with amount_due > $100

    WITH basic AS 
        (SELECT id,
             first_name
        FROM basic_details
        WHERE lower(gender) = 'male' ), bill AS 
        (SELECT id
        FROM bil_details
        WHERE CAST(amount_due AS INTEGER) > 100 ), contact AS 
        (SELECT contact_details.id
        FROM contact_details
        JOIN bill
            ON contact_details.id = bill.id
        WHERE state= 'Minnesota' )
    SELECT basic.first_name
    FROM basic
    JOIN contact
        ON basic.id = contact.id 

    Output:

    Some Other Sample Queries:

    1. Searching for Values in JSON

    WITH dataset AS (
      SELECT * FROM (VALUES
        (JSON '{"name": "Bob Smith", "org": "legal", "projects": ["project1"]}'),
        (JSON '{"name": "Susan Smith", "org": "engineering", "projects": ["project1", "project2", "project3"]}'),
        (JSON '{"name": "Jane Smith", "org": "finance", "projects": ["project1", "project2"]}')
      ) AS t (users)
    )
    SELECT json_extract_scalar(users, '$.name') AS user
    FROM dataset
    WHERE json_array_contains(json_extract(users, '$.projects'), 'project2')

    Output:

    2. Extracting properties

    WITH dataset AS (
      SELECT '{"name": "Susan Smith",
               "org": "engineering",
               "projects": [{"name":"project1", "completed":false},
               {"name":"project2", "completed":true}]}'
        AS blob
    )
    SELECT
      json_extract(blob, '$.name') AS name,
      json_extract(blob, '$.projects') AS projects
    FROM dataset

    Output:

    3. Converting JSON to Athena Data Types

    WITH dataset AS (
      SELECT
        CAST(JSON '"HELLO ATHENA"' AS VARCHAR) AS hello_msg,
        CAST(JSON '12345' AS INTEGER) AS some_int,
        CAST(JSON '{"a":1,"b":2}' AS MAP(VARCHAR, INTEGER)) AS some_map
    )
    SELECT * FROM dataset

    Output:

    Conclusion

    Hence, we can easily say that AWS Athena gives us an efficient way to query our raw data present in different formats in S3 object storage, without spawning a dedicated infrastructure and at minimal cost.

    Need help with setting up AWS Athena for your organization? Connect with the experts at Velotio!

  • Building an Intelligent Recommendation Engine with Collaborative Filtering

    In this post, we will talk about building a collaborative recommendation system. For this, we will utilize patient ratings with a drug and medical condition dataset to generate treatment suggestions.

    Let’s take a practical scenario where multiple medical practitioners have treated patients with different medical conditions with the most suitable drugs available. For every prescribed drug, the patients are diagnosed and then suggested a treatment plan, which is our experiences.

    The purpose of the recommendation system is to understand and find patterns with the information provided by patients during the diagnosis, and then suggest a treatment plan, which most closely matches the pattern identified by the recommendation system. 

    At the end of this article, we are going deeper into how these recommendations work and how we can find one preferred suggestion and the next five closest suggestions for any treatment.

    Definitions

    A recommendation system suggests or predicts a user’s behaviour by observing patterns of their past behaviour compared to others.

    In simple terms, it is a filtering engine that picks more relevant information for specific users by using all the available information. It is often used in ecommerce like Amazon, Flipkart, Youtube, and Netflix and personalized user products like Alexa and Google Home Mini.

    For the medical industry, where suggestions must be most accurate, a recommendation system will also take experiences into account. So, we must use all our experiences, and such applications will use every piece of information for any treatment. 

    Recommendation systems use information like various medical conditions and their effect on each patient. They compare these patterns to every new treatment to find the closest similarity.

    Concepts and Technology

    To design the recommendation system, we need a few concepts, which are listed below.

    1. Concepts: Pattern Recognition, Correlation, Cosine Similarity, Vector norms (L1, L2, L-Infinity)

    2. Language: Python (library: Numpy & Pandas), Scipy, Sklearn

    As far as the prototype development is concerned, we have support of a library (Scipy & Sklearn) that executes all the algorithms for us. All we need is a little Python and to use library functions.

    Different Approaches for Recommendation Systems

    Below I have listed a few filtering approaches and examples:

    • Collaborative filtering: It is based on review or response of users for any entity. Here, the suggestion is based on the highest rated item by most of the users. E.g., movie or mobile suggestions.
    • Content-based filtering: It is based on the pattern of each user’s past activity. Here, the suggestion is based on the most preferred by similar users. E.g., food suggestions.
    • Popularity-based filtering: It is based on a pattern of popularity among all users. E.g., YouTube video suggestions  

    Based on these filtering approaches, there will be different approaches to recommender systems, which are explained below:

    • Multi-criteria recommender systems: Various conditions like age, gender, location, likes, and dislikes are used for categorization and then items are suggested. E.g., suggestion of apparel based on age and gender.
    • Risk-aware recommender systems: There is always uncertainty when users use Internet applications (website or mobile). Recommending any advertisement over the Internet must consider risk and users must be aware of this. E.g., advertisement display suggestion over Internet application. 
    • Mobile recommender systems: These are location-based suggestions that consist of users’ current location or future location and provide suggestions based on that. E.g., mostly preferred in traveling and tourism.
    • Hybrid recommender systems: These are the combination of multiple approaches for recommendations. E.g., suggestion of hotels and restaurants based on user preference and travel information.
    • Collaborative and content recommender systems: These are the combination of collaborative and content-based approaches. E.g., suggestion of the highest-rated movie of users’ preference along with their watch history.

    Practical Example with Implementation

    In this example, we have a sample dataset of drugs prescribed for various medical conditions and ratings given by patients. What we need here is for any medical condition we have to receive a suggestion for the most suitable prescribed drugs for treatment.

    Sample Dataset: 

    Below is the sample of the publicly available medical drug dataset used from the Winter 2018 Kaggle University Club Hackathon.

    drugNameconditionratingcondition_id
    MirtazapineDepression10201
    MesalamineCrohn’s Disease, Maintenance8185
    BactrimUrinary Tract Infection9657
    ContraveWeight Loss9677
    Cyclafem 1 / 35Birth Control9122
    ZyclaraKeratosis4365
    CopperBirth Control6122
    AmitriptylineMigraine Prevention9403
    MethadoneOpiate Withdrawal7460
    LevoraBirth Control2122
    ParoxetineHot Flashes1310
    MiconazoleVaginal Yeast Infection6664
    BelviqWeight Loss1677
    SeroquelSchizoaffective Disorde10575
    AmbienInsomnia2347
    NuvigilNarcolepsy9424
    ChantixSmoking Cessation10597
    Microgestin Fe 1 / 20Acne349
    KlonopinBipolar Disorde6121
    CiprofloxacinUrinary Tract Infection10657
    TrazodoneInsomnia1347
    EnteraGamIrritable Bowel Syndrome9356
    AripiprazoleBipolar Disorde1121
    CyclosporineKeratoconjunctivitis Sicca1364

    Sample Code: 

    We will do this in 5 steps:

    1. Importing required libraries

    2. Reading the drugsComTest_raw.csv file and creating a pivot matrix.

    3. Creating a KNN model using the NearestNeighbors function with distance metric- ‘cosine’ & algorithm- ‘brute’. Possible values for distance metric are ‘cityblock’, ‘euclidean’, ‘l1’, ‘l2’ & ‘manhattan’. Possible values for the algorithm are ‘auto’, ‘ball_tree’, ‘kd_tree’, ‘brute’ & ‘cuml’.

    4. Selecting one medical condition randomly for which we have to suggest 5 drugs for treatment.

    5. Finding the 6 nearest neighbors for the sample, calling the kneighbors function with the trained KNN models created in step 3. The first k-neighbor for the sample medical condition is self with a distance of 0. The next 5 k-neighbors are drugs prescribed for the sample medical condition.

    #!/usr/bin/env python
    # coding: utf-8
    
    # Step 1
    import pandas as pd
    import numpy as np
    
    from scipy.sparse import csr_matrix
    from sklearn.neighbors import NearestNeighbors
    from sklearn.preprocessing import LabelEncoder
    encoder = LabelEncoder()
    
    
    # Step 2
    df = pd.read_csv('drugsComTest_raw.csv').fillna('NA')
    df['condition_id'] = pd.Series(encoder.fit_transform(df['condition'].values), index=df.index)
    df_medical = df.filter(['drugName', 'condition', 'rating', 'condition_id'], axis=1)
    df_medical_ratings_pivot=df_medical.pivot_table(index='drugName',columns='condition_id',values='rating').fillna(0)
    df_medical_ratings_pivot_matrix = csr_matrix(df_medical_ratings_pivot.values)
    
    
    # Step 3
    # distance =  [‘cityblock’, ‘cosine’, ‘euclidean’, ‘l1’, ‘l2’, ‘manhattan’]
    # algorithm = ['auto', 'ball_tree', 'kd_tree', 'brute', 'cuml']
    model_knn = NearestNeighbors(metric = 'cosine', algorithm = 'brute')
    model_knn.fit(df_medical_ratings_pivot_matrix)
    
    
    # Step 4
    sample_index = np.random.choice(df_medical_ratings_pivot.shape[0])
    sample_condition = df_medical_ratings_pivot.iloc[sample_index,:].values.reshape(1, -1)
    
    
    # Step 5
    distances, indices = model_knn.kneighbors(sample_condition, n_neighbors = 6)
    for i in range(0, len(distances.flatten())):
        if i == 0:
            print('Recommendations for {0}:n'.format(df_medical_ratings_pivot.index[sample_index]))
        else:
            recommendation = df_medical_ratings_pivot.index[indices.flatten()[i]]
            distanceFromSample = distances.flatten()[i]
            print('{0}: {1}, with distance of {2}:'.format(i, recommendation, distanceFromSample))

    Explanation:

    This is the collaborative-based recommendation system that uses the patients’ ratings of given drug treatments to find similarities in medical conditions. Here, we are matching the patterns for ratings given to drugs by patients. This system compares all the rating patterns and tries to find similarities (cosine similarity).

    Challenges of Recommendation System

    Any recommendation system requires a decent quantity of quality information to process. Before developing such a system, we must be aware of it. Acknowledging and handling such challenges improve the accuracy of recommendation.

    1. Cold Start: Recommending a new user or a user without any previous behavior is a problem. We can recommend the most popular options to them. E.g., YouTube videos suggestion for newly registered users.

    2. Not Enough Data: Having insufficient data provides recommendations with less certainty. E.g., suggestion of hotels or restaurants will not be accurate if systems are uncertain about users’ locations.

    3. Grey Sheep Problem: This problem occurs when the inconsistent behavior of a user makes it difficult to find a pattern. E.g., multiple users are using the same account, so user activity will be wide, and the system will have difficulty in mapping such patterns. 

    4. Similar items: In these cases, there is not enough data to separate similar items. For these situations, we can recommend all similar items randomly. E.g., apparel suggestions for users with color and sizes. All shirts are similar. 

    5. Shilling Attacks: Intentional negative behavior that leads to bad/unwanted recommendations. While immoral, we cannot deny the possibility of such attacks. E.g., user ratings and reviews over various social media platforms.

    Accuracy and Performance Measures

    Accuracy evaluation is important as we always follow and try to improve algorithms. The most preferred measures for improving algorithms are user studies, online evaluations, and offline evaluations. Our recommendation models must be ready to learn from users’ activity daily. For online evaluations, we have to regularly test our recommendation system.

    If we understand the challenges of the recommendation system, we can prepare such testing datasets to test its accuracy. With these variations of datasets, we can improve our approach of user studies and offline evaluations.

    1. Online Evaluations: In online evaluations, prediction models are updated frequently with the unmonitored data, which leads to the possibility of unexpected accuracy. To verify this, the prediction models are exposed to the unmonitored data with less uncertainty and then the uncertainty of unmonitored data is gradually increased. 

    2. Offline Evaluations: In offline evaluations, the prediction models are trained with a sample dataset that consists of all possible uncertainty with expected outcomes. To verify this, the sample dataset will be gradually updated and prediction models will be verified with predicted and actual outcomes. E.g., creating multiple users with certain activity and expecting genuine suggestions for them.

    Conclusion

    As a part of this article, we have learned about the approaches, challenges, and evaluation methods, and then we created a practical example of the collaboration-based recommendation system. We also explored various types and filtering approaches with real-world scenarios.

    We have also executed sample code with a publicly available medical drug dataset with patient ratings. We can opt for various options for distance matrix and algorithm for the NearestNeighbors calculation. We have also listed various challenges for this system and understood the accuracy evaluation measures and things that affect and improve them.