Tag: machine learning

Machine Learning in Flutter using TensorFlow
Machine learning has become part of day-to-day life. Small tasks like searching for songs on YouTube and getting suggestions on Amazon use ML in the background. This is a well-developed field of technology with immense possibilities. But how can we use it?

This blog is aimed at explaining how easy it is to use machine learning models (which will act as a brain) to build powerful ML-based Flutter applications. We will briefly touch on the following points.

1. Definitions

Let’s jump to the part where most people are confused. A person who is not exposed to the IT industry might think AI, ML, & DL are all the same. So, let’s understand the difference.

Figure 01

1.1. Artificial Intelligence (AI):

AI, i.e. artificial intelligence, is a concept of machines being able to carry out tasks in a smarter way. You all must have used YouTube. In the search bar, you can type the lyrics of any song, even lyrics that are not necessarily the starting part of the song or title of songs, and get almost perfect results every time. This is the work of a very powerful AI.
Artificial intelligence is the ability of a machine to do tasks that are usually done by humans. This ability is special because the task we are talking about requires human intelligence and discernment.

1.2. Machine Learning (ML):

Machine learning is a subset of artificial intelligence. It is based on the idea that we expose machines to new data, which can be a complete or partial row, and let the machine decide the future output. We can also say it is a sub-field of AI that deals with the extraction of patterns from data sets. With each new data set, and by processing its previous results, the machine slowly approaches the expected result. This means that the machine can find rules for optimal behavior to produce new output. It can also adapt itself to changing data, just like humans.

1.3. Deep Learning (DL):

Deep learning is again a smaller subset of machine learning, which is essentially a neural network with multiple layers. These neural networks attempt to simulate the behavior of the human brain, so you can say we are trying to create an artificial human brain. With one layer of a neural network, we can still make approximate predictions, and additional layers can help to optimize and refine for accuracy.

2. Types of ML

Before starting the implementation, we need to know the types of machine learning because it is very important to know which type is more suitable for our expected functionality.

Figure 02

2.1. Supervised Learning

As the name suggests, in supervised learning, the learning happens under supervision. Supervision means the data that is provided to the machine is already classified data i.e., each piece of data has fixed labels, and inputs are already mapped to the output.
Once the machine has learned from this data, it is ready to classify new data.
This learning is useful for tasks like fraud detection, spam filtering, etc.
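
To make the idea of "inputs already mapped to the output" concrete, here is a minimal Python sketch. It assumes a toy data set of two numeric features per message and uses a simple nearest-neighbor rule; the feature values, labels, and the `classify` helper are invented for illustration and are not taken from any real spam filter.

```python
from math import dist

# Toy supervised learning: every training example carries a label.
# The feature values and labels below are invented for illustration.
training_data = [
    ((0.9, 0.1), "spam"),
    ((0.8, 0.3), "spam"),
    ((0.1, 0.7), "not spam"),
    ((0.2, 0.9), "not spam"),
]


def classify(features):
    """Return the label of the closest labeled example (1-nearest neighbor)."""
    _, nearest_label = min(
        training_data, key=lambda example: dist(example[0], features)
    )
    return nearest_label


print(classify((0.85, 0.2)))
print(classify((0.15, 0.8)))
```

The labels do all the work here: without them, the machine would have no way of knowing what the two groups of points mean.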

2.2. Unsupervised Learning

In unsupervised learning, the data given to machines to learn is purely raw, with no tags or labels. Here, the machine is the one that will create new classes by extracting patterns.
This learning can be used for clustering, association, etc.

2.3. Semi-Supervised Learning

Supervised and unsupervised learning each have limitations: the first requires labeled data, and the second works without labels. Semi-supervised learning combines both approaches to overcome these limitations.
In this learning, we feed raw data and categorized data to the machine so it can classify the raw data and, if necessary, create new clusters.

2.4. Reinforcement Learning

For this learning, we feed the feedback on the last output, along with new incoming data, to the machine so it can learn from its mistakes. This feedback-based process continues until the machine reaches the desired output. The feedback is given by humans in the form of punishment or reward. For example, when a search algorithm returns a list of results and users click only on the first one, that behavior acts as feedback. It is like a human child who learns from every available option and grows by correcting its mistakes.

3. TensorFlow

Machine learning is a complex process where we need to perform multiple activities like acquiring data, training models, serving predictions, and refining future results.

To perform such operations, Google developed a framework in November 2015 called TensorFlow. All the above-mentioned processes can become easy if we use the TensorFlow framework.

For this project, we are not going to use the complete TensorFlow framework but a small tool called TensorFlow Lite.

3.1. TensorFlow Lite

TensorFlow Lite allows us to run the machine learning models on devices with limited resources, like limited RAM or memory.

3.2. TensorFlow Lite Features
- Optimized for on-device ML by addressing five key constraints:
  - Latency: because there’s no round-trip to a server
  - Privacy: because no personal data leaves the device
  - Connectivity: because internet connectivity is not required
  - Size: because of a reduced model and binary size
  - Power consumption: because of efficient inference and a lack of network connections
- Support for Android and iOS devices, embedded Linux, and microcontrollers
- Support for Java, Swift, Objective-C, C++, and Python programming languages
- High performance, with hardware acceleration and model optimization
- End-to-end examples for common machine learning tasks such as image classification, object detection, pose estimation, question answering, text classification, etc., on multiple platforms
4. What is Flutter?

Flutter is an open source, cross-platform development framework. With the help of Flutter by using a single code base, we can create applications for Android, iOS, web, as well as desktop. It was created by Google and uses Dart as a development language. The first stable version of Flutter was released in Apr 2018, and since then, there have been many improvements.

5. Building an ML-Flutter Application

We are now going to build a Flutter application through which we can find the state of mind of a person from their facial expressions. The below steps explain the update we need to do for an Android-native application. For an iOS application, please refer to the links provided in the steps.

5.1. TensorFlow Lite – Native setup (Android)
- In android/app/build.gradle, add the following setting in the android block:
```
aaptOptions {
    noCompress 'tflite'
    noCompress 'lite'
}
```
5.2. TensorFlow Lite – Flutter setup (Dart)
- Create an assets folder and place your label file and model file in it. (We will create these files shortly.) In pubspec.yaml add:
```
assets:
   - assets/labels.txt
   - assets/<file_name>.tflite
```
Figure 02
- Run this command (install the TensorFlow Lite package):
```
$ flutter pub add tflite
```
- This adds a line like the following to your package’s pubspec.yaml (and runs an implicit flutter pub get):
```
dependencies:
     tflite: ^0.9.0
```
- Now in your Dart code, you can use:
```
import 'package:tflite/tflite.dart';
```
- Add camera dependencies to your package’s pubspec.yaml (optional):
```
dependencies:
     camera: ^0.10.0+1
```
- Now in your Dart code, you can use:
```
import 'package:camera/camera.dart';
```
- As the camera is a hardware feature, a few updates are needed in the native code for both Android and iOS. To learn more, visit:
https://pub.dev/packages/camera
- Following is the code that will appear under dependencies in pubspec.yaml once the setup is complete.
Figure 03
- Flutter will automatically download the most recent version if you omit the version number of a package.
- Do not forget to add the assets folder in the root directory.
5.3. Generate model (using website)
- Visit the following website
- https://teachablemachine.withgoogle.com/
- Click on Get Started
- Select Image project
- There are three different categories of ML projects available. We’ll choose an image project since we’re going to develop a project that analyzes a person’s facial expression to determine their emotional condition.
- The other two types, audio project and pose project, will be useful for creating projects that involve audio operation and human pose indication, respectively.
- Select Standard Image model
- Once more, there are two distinct groups of image machine learning projects. Since we are creating a project for an Android smartphone, we will select the Standard Image model.
- The other type, the Embedded Image model, is designed for hardware with relatively little memory and computing power.
- Upload images for training the classes
- We will create new classes by clicking on “Add a class.”
- We must upload photographs to these classes as we are developing a project that analyzes a person’s emotional state from their facial expression.
- The more photographs we upload, the more precise our result will be.
- Click on train model and wait till training is over
- Click on Export model
- Select TensorFlow Lite Tab -> Quantized button -> Download my model
5.4. Add files/models to the Flutter project
- labels.txt
This file contains all the class names that you created during model creation.
- *.tflite
This is the model file itself. It is downloaded together with its associated files as a ZIP.
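
To show how the two files work together, here is a small Python sketch of the post-processing step. The app itself is written in Dart, but the idea is language-independent. The sketch assumes that each line of the label file has the form `<index> <class name>` and that a quantized model returns one score per class in the range 0 to 255; the class names, scores, and both helper functions are hypothetical.

```python
def parse_labels(text):
    """Map class index -> class name from lines like '0 Happy'."""
    labels = {}
    for line in text.strip().splitlines():
        index, name = line.split(" ", 1)
        labels[int(index)] = name.strip()
    return labels


def top_label(scores, labels):
    """Return (class name, confidence in 0..1) for the highest score."""
    best = max(range(len(scores)), key=lambda i: scores[i])
    return labels[best], scores[best] / 255


sample_labels = "0 Happy\n1 Sad\n2 Neutral\n"  # invented class names
sample_scores = [12, 230, 13]                  # invented quantized outputs

print(top_label(sample_scores, parse_labels(sample_labels)))
```

The model only produces numbers; the label file is what turns the winning index back into a name we can show to the user.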

5.5. Load & Run ML-Model
- We are importing the model from assets, so this line of code is crucial. This model will serve as the project’s brain.
- Here, we’re configuring the camera using a camera controller and obtaining a live feed (Cameras[0] is the front camera).
6. Conclusion

We can achieve good performance of a Flutter app with an appropriate architecture, as discussed in this blog.
February 8, 2023
BigQuery 101: All the Basics You Need to Know
Google BigQuery is an enterprise data warehouse built using BigTable and Google Cloud Platform. It’s serverless and completely managed. BigQuery works great with all sizes of data, from a 100 row Excel spreadsheet to several Petabytes of data. Most importantly, it can execute a complex query on those data within a few seconds.

Before we proceed, note that BigQuery is not a transactional database. It takes around 2 seconds to run a simple query like ‘SELECT * FROM bigquery-public-data.object LIMIT 10’ on a 100 KB table with 500 rows. Hence, it shouldn’t be thought of as an OLTP (Online Transaction Processing) database. BigQuery is for Big Data!

BigQuery supports SQL-like queries, which makes it user-friendly and beginner-friendly. It’s accessible via its web UI, command-line tool, or client libraries (available for C#, Go, Java, Node.js, PHP, Python, and Ruby). You can also take advantage of its REST APIs and get your job done by sending a JSON request.

Now, let’s dive deeper to understand it better. Suppose you are a data scientist (or a startup which analyzes data) and you need to analyze terabytes of data. If you choose a tool like MySQL, the first step before even thinking about any query is to have an infrastructure in place that can store this magnitude of data.

Designing this setup itself will be a difficult task because you have to figure out what will be the RAM size, DCOS or Kubernetes, and other factors. And if you have streaming data coming, you will need to set up and maintain a Kafka cluster. In BigQuery, all you have to do is a bulk upload of your CSV/JSON file, and you are done. BigQuery handles all the backend for you. If you need streaming data ingestion, you can use Fluentd. Another advantage of this is that you can connect Google Analytics with BigQuery seamlessly.

BigQuery is a serverless, highly available, petabyte-scale service that allows you to execute complex SQL queries quickly. It lets you focus on analysis rather than handling infrastructure. The idea of hardware is completely abstracted and not visible to us, not even as virtual machines.

Architecture of Google BigQuery

You don’t need to know too much about the underlying architecture of BigQuery. That’s actually the whole idea of it – you don’t need to worry about architecture and operation.

However, understanding BigQuery Architecture helps us in controlling costs, optimizing query performance, and optimizing storage. BigQuery is built using the Google Dremel paper.

Quoting an Abstract from the Google Dremel Paper –

“Dremel is a scalable, interactive ad-hoc query system for analysis of read-only nested data. By combining multi-level execution trees and columnar data layout, it is capable of running aggregation queries over trillion-row tables in seconds. The system scales to thousands of CPUs and petabytes of data and has thousands of users at Google. In this paper, we describe the architecture and implementation of Dremel and explain how it complements MapReduce-based computing. We present a novel columnar storage representation for nested records and discuss experiments on few-thousand node instances of the system.”

Dremel has been in production at Google since 2006. Google has used it for the following tasks:
- Analysis of crawled web documents.
- Tracking install data for applications on Android Market.
- Crash reporting for Google products.
- OCR results from Google Books.
- Spam analysis.
- Debugging of map tiles on Google Maps.
- Tablet migrations in managed Bigtable instances.
- Results of tests run on Google’s distributed build system.
- Disk I/O statistics for hundreds of thousands of disks.
- Resource monitoring for jobs run in Google’s data centers.
- Symbols and dependencies in Google’s codebase.
BigQuery is much more than Dremel. Dremel is just a query execution engine, whereas BigQuery is based on interesting technologies like Borg (predecessor of Kubernetes) and Colossus. Colossus is the successor to the Google File System (GFS), as mentioned in the Google Spanner paper.

How BigQuery Stores Data?

BigQuery stores data in a columnar format called Capacitor (the successor of ColumnarIO), which achieves a very high compression ratio and scan throughput. Unlike ColumnarIO, Capacitor lets BigQuery operate directly on compressed data without decompressing it.

Columnar storage has the following advantages:
- Traffic minimization – When you submit a query, only the column values it requires are scanned and transferred during execution. E.g., a query `SELECT title FROM Collection` would access the title column values only.
- Higher compression ratio – Columnar storage can achieve a compression ratio of 1:10, whereas ordinary row-based storage can compress at roughly 1:3.
(Image source: Google Dremel Paper)

Columnar storage has the disadvantage of not working efficiently when updating existing records. That is why Dremel doesn’t support any update queries.
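
The two advantages above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Collection` rows are invented, and run-length encoding is used as one simple example of why a column of similar values compresses well. It is not a description of how Capacitor actually encodes data.

```python
from itertools import groupby

# Invented sample rows for a hypothetical `Collection` table.
rows = [
    {"title": "Dune", "genre": "sci-fi", "year": 1965},
    {"title": "Neuromancer", "genre": "sci-fi", "year": 1984},
    {"title": "Emma", "genre": "romance", "year": 1815},
    {"title": "Persuasion", "genre": "romance", "year": 1817},
]

# Columnar layout: one list per column instead of one record per row.
columns = {name: [row[name] for row in rows] for name in rows[0]}


def select(column_name):
    """SELECT <column> FROM Collection: reads a single column only."""
    return columns[column_name]


def run_length_encode(values):
    """Collapse runs of equal neighbours into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


print(select("title"))
print(run_length_encode(columns["genre"]))
```

The same layout also hints at the disadvantage: updating one record means touching every column list, instead of a single row.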

How the Query Gets Executed?

BigQuery depends on Borg for data processing. Borg simultaneously instantiates hundreds of Dremel jobs across required clusters made up of thousands of machines. In addition to assigning compute capacity for Dremel jobs, Borg handles fault-tolerance as well.

Now, how do you design and execute a query that runs on thousands of nodes and fetches the result? This challenge was overcome by using a tree architecture. It forms a massively parallel distributed tree that pushes a query down to the leaves and aggregates the results on the way back up at very high speed.

(Image source: Google Dremel Paper)
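
The push-down and aggregate pattern can be sketched in Python as follows. This is a simplified, single-process illustration: the tree shape, the data, and the function names are assumptions made for the example, and a real serving tree would run the leaves in parallel on different machines.

```python
def leaf_count(shard, predicate):
    """Leaf server: scan one shard and return a partial COUNT(*)."""
    return sum(1 for value in shard if predicate(value))


def aggregate(node, predicate):
    """Push the query down the tree and sum partial results on the way up.

    A node is either a list of values (a leaf shard) or a tuple of child nodes.
    """
    if isinstance(node, list):
        return leaf_count(node, predicate)
    return sum(aggregate(child, predicate) for child in node)


# Invented data: a root with two intermediate servers, each over two shards.
tree = (
    ([3, 8, 15], [22, 4]),
    ([9, 31], [1, 12, 40]),
)

# Roughly: SELECT COUNT(*) FROM data WHERE value > 10
print(aggregate(tree, lambda value: value > 10))
```

Because each level only combines small partial results, no single node ever needs to see all of the raw data.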

BigQuery vs. MapReduce

The key differences between BigQuery and MapReduce are –
- Dremel is designed as an interactive data analysis tool for large datasets
- MapReduce is designed as a programming framework to batch process large datasets
Moreover, Dremel finishes most queries within seconds or tens of seconds and can even be used by non-programmers, whereas MapReduce takes much longer (sometimes even hours or days) to process a query.

Following is a comparison of running MapReduce on a row-based and a columnar DB:

(Image source: Google Dremel Paper)

Another important thing to note is that BigQuery is meant to analyze structured data (SQL) but in MapReduce, you can write logic for unstructured data as well.

Comparing BigQuery and Redshift

In Redshift, you need to allocate different instance types and create your own clusters. The benefit of this is that it lets you tune the compute/storage to meet your needs. However, you have to be aware of (virtualized) hardware limits and scale up/out based on that. Note that you are charged by the hour for each instance you spin up.

In BigQuery, you just upload the data and query it. It is a truly managed service. You are charged by storage, streaming inserts, and queries.

The two data warehouses have more similarities than differences.

A smart user will definitely take advantage of the hybrid cloud (GCE+AWS) and leverage different services offered by both the ecosystems. Check out your quintessential guide to AWS Athena here.

Getting Started With Google BigQuery

Following is a quick example to show how you can quickly get started with BigQuery:
1. There are many public datasets available on BigQuery. Here you are going to play with the ‘bigquery-public-data:stackoverflow’ dataset. You can click on the “Add Data” button on the left panel and select datasets.
2. Next, find the language that has the best community, based on response time. You can write the following query to do that.
```
WITH question_answers_join AS (
  SELECT *
    , GREATEST(1, TIMESTAMP_DIFF(answers.first, creation_date, minute)) minutes_2_answer
  FROM (
    SELECT id, creation_date, title
      , (SELECT AS STRUCT MIN(creation_date) first, COUNT(*) c
         FROM `bigquery-public-data.stackoverflow.posts_answers` 
         WHERE a.id=parent_id
      ) answers
      , SPLIT(tags, '|') tags
    FROM `bigquery-public-data.stackoverflow.posts_questions` a
    WHERE EXTRACT(year FROM creation_date) > 2014
  )
)
SELECT COUNT(*) questions, tag
  , ROUND(EXP(AVG(LOG(minutes_2_answer))), 2) mean_geo_minutes
  , APPROX_QUANTILES(minutes_2_answer, 100)[SAFE_OFFSET(50)] median
FROM question_answers_join, UNNEST(tags) tag
WHERE tag IN ('javascript', 'python', 'rust', 'java', 'scala', 'ruby', 'go', 'react', 'c', 'c++')
AND answers.c > 0
GROUP BY tag
ORDER BY mean_geo_minutes
```
3. Now you can execute the query and get results –

You can see that C has the best community followed by JavaScript!
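
The query ranks tags by `EXP(AVG(LOG(minutes_2_answer)))`, which is the geometric mean of the response times, and also reports the median. The short Python sketch below uses invented response times to show why these are used instead of a plain average: a single very slow answer barely moves them. The `GREATEST(1, ...)` in the query keeps every value at 1 minute or more, so the logarithm is always defined.

```python
from math import exp, log
from statistics import median


def geometric_mean(values):
    """EXP(AVG(LOG(x))): the same formula the query uses."""
    return exp(sum(log(value) for value in values) / len(values))


minutes_to_answer = [1, 10, 100]  # invented response times in minutes

print(geometric_mean(minutes_to_answer))
print(median(minutes_to_answer))
print(sum(minutes_to_answer) / len(minutes_to_answer))  # arithmetic mean
```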

How to do Machine Learning on BigQuery?

Now that you have a sound understanding of BigQuery, it’s time for some real action.

As discussed above, you can connect Google Analytics with BigQuery. Go to the Google Analytics Admin panel, and in the PROPERTY column click All Products, then click Link BigQuery. After that, enter your BigQuery ID (or project number), and BigQuery will be linked to Google Analytics. Note: right now, BigQuery integration is only available to Google Analytics 360.

Assuming that you have already uploaded your Google Analytics data, here is how you can create a logistic regression model. Here, you are predicting whether a website visitor will make a transaction or not.
```
CREATE MODEL `velotio_tutorial.sample_model`
OPTIONS(model_type='logistic_reg') AS
SELECT
  IF(totals.transactions IS NULL, 0, 1) AS label,
  IFNULL(device.operatingSystem, "") AS os,
  device.isMobile AS is_mobile,
  IFNULL(geoNetwork.country, "") AS country,
  IFNULL(totals.pageviews, 0) AS pageviews
FROM
  `bigquery-public-data.google_analytics_sample.ga_sessions_*`
WHERE
  _TABLE_SUFFIX BETWEEN '20160801' AND '20170630'
```
The query creates a model named ‘velotio_tutorial.sample_model’ and sets ‘model_type’ to ‘logistic_reg’ because you want to train a logistic regression model. A logistic regression model splits input data into two classes and gives the probability that the data is in one of the classes. Logistic regression is commonly used for “spam or not spam” type problems. Here, the problem is similar: a transaction will either be made or not.
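
As a rough illustration of what such a model computes once it is trained, here is a minimal Python sketch. The weights, the bias, and the choice of two features (page views and a mobile flag) are invented for the example; the real values are learned by BigQuery during training and are not shown in this post.

```python
from math import exp


def predict_probability(features, weights, bias):
    """Logistic regression: squash a weighted sum into a probability."""
    score = bias + sum(w * x for w, x in zip(weights, features))
    return 1 / (1 + exp(-score))


# Invented weights for two features: pageviews and is_mobile (0 or 1).
weights, bias = [0.3, -0.5], -4.0

probability = predict_probability([20, 0], weights, bias)
label = 1 if probability >= 0.5 else 0  # 1 = transaction, 0 = no transaction

print(probability, label)
```

The 0.5 cut-off used here is just the usual default for turning a probability into one of the two classes.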

The above query gets the total number of page views, the country from where the session originated, the operating system of the visitor’s device, the total number of e-commerce transactions within the session, etc.

Now just press Run query to execute the query.

Conclusion

BigQuery is a query service that allows us to run SQL-like queries against multiple terabytes of data in a matter of seconds. If you have structured data, BigQuery is the best option to go for. It can help even a non-programmer to get the analytics right!

Learn how to build an ETL Pipeline for MongoDB & Amazon Redshift using Apache Airflow.

If you need help with using machine learning in product development for your organization, connect with experts at Velotio!
December 12, 2022
Explanatory vs. Predictive Models in Machine Learning

My view of Data Analysis is that there is a continuum between explanatory models on one side and predictive models on the other. The decisions you make during the modeling process depend on your goal. Take Customer Churn as an example: you can ask why customers are leaving, or you can ask which customers are leaving. The first question has as its primary goal to explain churn, while the second has as its primary goal to predict churn. These are two fundamentally different questions, and this has implications for the decisions you take along the way. The predictive side of Data Analysis is closely related to terms like Data Mining and Machine Learning.

SPSS & SAS

SPSS and SAS both originate from the explanatory side of Data Analysis. They were developed in an academic environment, where hypothesis testing plays a major role. As a result, they offer significantly fewer methods and techniques than R and Python. Nowadays, SAS and SPSS both have data mining tools (SAS Enterprise Miner and SPSS Modeler); however, these are separate tools and you’ll need extra licenses.

I have spent some time building extensive macros in SAS EG to seamlessly create predictive models that also do a decent job of explaining feature importance. While a Neural Network may do a fair job at making predictions, it is extremely difficult to explain such models, let alone their feature importance. The macros I have built in SAS EG do precisely the job of explaining the features, apart from producing excellent predictions.

OPEN SOURCE TOOLS: R & PYTHON

One of the major advantages of open source tools is that the community continuously improves and increases functionality. R was created by academics, who wanted their algorithms to spread as easily as possible. R has the widest range of algorithms, which makes R strong on the explanatory side and on the predictive side of Data Analysis.

Python is developed with a strong focus on (business) applications, not from an academic or statistical standpoint. This makes Python very powerful when algorithms are directly used in applications. Hence, we see that the statistical capabilities are primarily focused on the predictive side. Python is mostly used in Data Mining or Machine Learning applications where a data analyst doesn’t need to intervene. Python is therefore also strong in analyzing images and videos. Python is also the easiest language to use when using Big Data Frameworks like Spark. With the plethora of packages and ever improving functionality, Python is a very accessible tool for data scientists.

MACHINE LEARNING MODELS

While procedures like Logistic Regression are very good at explaining the features used in a prediction, some others like Neural Networks are not. The latter procedures may be preferred over the former when only prediction accuracy matters and not explaining the models. Interpreting or explaining the model becomes an issue for Neural Networks. You can’t just peek inside a deep neural network to figure out how it works. A network’s reasoning is embedded in the behavior of numerous simulated neurons, arranged into dozens or even hundreds of interconnected layers. In most cases the Product Marketing Officer may be interested in knowing which factors are most important for a specific advertising project: what they can concentrate on to raise response rates, rather than what their response rate or revenue will be in the upcoming year. These questions are better answered by procedures which can be interpreted in an easier way. This is a great article about the technical and ethical consequences of the lack of explanations provided by complex AI models.

Procedures like Decision Trees are very good at explaining and visualizing exactly what the decision points are (features and their metrics). However, they do not produce the best models. Random Forests and Boosting are procedures that use Decision Trees as the basic starting point to build predictive models, and they are among the best methods for building sophisticated prediction models.

Random Forests use fully grown (highly complex) Trees, each trained on a random sample drawn from the training set (a process called Bootstrapping). In addition, each split considers only a proper subset of features from the entire feature set, rather than all of the features. Bootstrapping helps when the amount of training data is small (in many cases there is no way to get more data). The (proper) subsetting of the features has a tremendous effect on de-correlating the Trees grown in the Forest (hence randomizing it), leading to a drop in Test Set error. A fresh subset of features is chosen at each step of splitting, making the method robust. The strategy also stops the strongest feature from appearing each time a split is considered, which would otherwise make all the trees in the forest similar. The final result is obtained by averaging the result over all trees (in case of Regression problems), or by taking a majority class vote (in case of classification problems).
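
The three ingredients described above (bootstrapping, a fresh feature subset per split, and a majority vote) can be sketched in Python. This is not a full Random Forest: no trees are actually grown, and the rows, the feature names, and the helper functions are placeholders chosen for the illustration.

```python
import random


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement (bootstrapping)."""
    return [rng.choice(rows) for _ in rows]


def features_for_split(all_features, subset_size, rng):
    """Pick a fresh random proper subset of features for one split."""
    return rng.sample(all_features, subset_size)


def majority_vote(predictions):
    """Classification: the class predicted by most trees wins."""
    return max(set(predictions), key=predictions.count)


rng = random.Random(0)                         # fixed seed for repeatable runs
rows = list(range(10))                         # stand-ins for training rows
features = ["age", "tenure", "plan", "usage"]  # invented feature names

sample = bootstrap_sample(rows, rng)
candidates = features_for_split(features, 2, rng)

print(sample)
print(candidates)
print(majority_vote(["churn", "stay", "churn"]))
```

Since rows are drawn with replacement, a bootstrap sample usually contains duplicates and leaves some rows out, which is part of what makes the individual trees differ from each other.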

On the other hand, Boosting is a method where a Forest is grown using Trees which are NOT fully grown, or in other words, with Weak Learners. One has to specify the number of trees to be grown, and the initial weights of those trees for taking a majority vote for class selection. The default weight, if not specified, is the average of the number of trees requested. At each iteration, the method fits these weak learners and finds the residuals. Then the weights of the examples that were misclassified are increased so that the following trees can concentrate better on the failed examples. This way, the method proceeds by improving the accuracy of the Boosted Trees, stopping when the improvement is below a threshold. One particular implementation of Boosting, AdaBoost, has very good accuracy compared with other implementations. AdaBoost uses Trees of depth 1, known as Decision Stumps, as the members of the Forest. These are slightly better than random guessing to start with, but over time they learn the pattern and perform extremely well on the test set. This method is more like a feedback control mechanism (where the system learns from the errors). To address overfitting, one can use the hyper-parameter Learning Rate (lambda) by choosing values in the range (0,1]. Very small values of lambda take more time to converge, while larger values may have difficulty converging. The right value can be selected through an iterative process: plot the test error rate against values of lambda and choose the value with the lowest test error.
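
Here is a compact Python sketch of AdaBoost with decision stumps on one-dimensional data. It follows the textbook formulation, in which the weights live on the training examples and each stump gets a vote strength based on its weighted error. The data set and function names are invented, and the learning rate discussed above is left out to keep the example short.

```python
from math import exp, log


def best_stump(xs, ys, weights):
    """Find the threshold and polarity with the lowest weighted error."""
    best = None
    for threshold in sorted(set(xs)):
        for polarity in (1, -1):
            predictions = [polarity if x >= threshold else -polarity for x in xs]
            error = sum(w for w, p, y in zip(weights, predictions, ys) if p != y)
            if best is None or error < best[0]:
                best = (error, threshold, polarity)
    return best


def adaboost(xs, ys, rounds):
    """Train `rounds` decision stumps; labels must be +1 or -1."""
    weights = [1 / len(xs)] * len(xs)  # start with uniform example weights
    ensemble = []
    for _ in range(rounds):
        error, threshold, polarity = best_stump(xs, ys, weights)
        error = min(max(error, 1e-10), 1 - 1e-10)  # avoid log(0)
        alpha = 0.5 * log((1 - error) / error)     # vote strength of this stump
        ensemble.append((alpha, threshold, polarity))
        # Raise the weight of misclassified examples, lower the rest.
        weights = [
            w * exp(-alpha * y * (polarity if x >= threshold else -polarity))
            for w, x, y in zip(weights, xs, ys)
        ]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def predict(ensemble, x):
    """Weighted vote of all stumps."""
    vote = sum(a * (p if x >= t else -p) for a, t, p in ensemble)
    return 1 if vote >= 0 else -1


# Invented 1-D data that no single stump can separate.
xs = [1, 2, 3, 4, 5, 6]
ys = [1, 1, -1, -1, 1, 1]

model = adaboost(xs, ys, rounds=3)
print([predict(model, x) for x in xs])
```

Each stump alone gets at least two of the six points wrong, but the weighted vote of three stumps classifies all of them correctly, which is the feedback effect the paragraph describes.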

In all these methods, as we move from Logistic Regression to Decision Trees to Random Forests and Boosting, the complexity of the models increases, making it almost impossible to EXPLAIN the Boosting model to marketers/product managers. Decision Trees are easy to visualize, and Logistic Regression results can be used to demonstrate the most important factors in a customer acquisition model, so they will be well received by business leaders. On the other hand, the Random Forest and Boosting methods are extremely good predictors, without much scope for explaining. But there is hope: these models have functions for revealing the most important variables, although it is not possible to visualize why.

USING A BALANCED APPROACH

So I use a mixed strategy: use the simpler methods as a step in Exploratory Data Analysis, present the importance of features and the characteristics of the data to the business leaders in phase one, and then use the more complicated models to build the prediction models for deployment, after building competing models. That way, one not only gets to understand what is happening and why, but also gets the best predictive power. In most cases that I have worked on, I have rarely seen a mismatch between the explanation and the predictions using different methods. After all, this is all math and the way of delivery should not change end results. Now that’s a happy ending for all sides of the business!

December 12, 2022

Exploring OpenAI Gym: A Platform for Reinforcement Learning Algorithms

Introduction

According to the OpenAI Gym GitHub repository “OpenAI Gym is a toolkit for developing and comparing reinforcement learning algorithms. This is the gym open-source library, which gives you access to a standardized set of environments.”

OpenAI Gym has an environment-agent arrangement. It simply means Gym gives you access to an “agent” that can perform specific actions in an “environment”. In return, the agent gets an observation and a reward as a consequence of performing a particular action in the environment.

There are four values that are returned by the environment for every “step” taken by the agent.

- Observation (object): an environment-specific object representing your observation of the environment. For example, the board state in a board game.
- Reward (float): the amount of reward/score achieved by the previous action. The scale varies between environments, but the goal is always to increase your total reward/score.
- Done (boolean): whether it’s time to reset the environment again. E.g., you lost your last life in the game.
- Info (dict): diagnostic information useful for debugging. However, official evaluations of your agent are not allowed to use this for learning.
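
The four return values are easiest to understand with a tiny environment written from scratch. The class below is a hypothetical toy and not part of Gym: it only mimics the `reset()` and `step()` interface described above, so the sketch runs without installing anything.

```python
import random


class LineWalkEnv:
    """Hypothetical toy environment (not part of Gym): walk from 0 to 3."""

    def reset(self):
        self.position = 0
        self.steps = 0
        return self.position

    def step(self, action):
        """action: 0 = left, 1 = right.

        Returns (observation, reward, done, info).
        """
        self.position += 1 if action == 1 else -1
        self.steps += 1
        reached_goal = self.position == 3
        reward = 1.0 if reached_goal else 0.0
        done = reached_goal or self.steps >= 20  # goal reached or out of time
        info = {"steps": self.steps}
        return self.position, reward, done, info


env = LineWalkEnv()
observation = env.reset()
done = False
total_reward = 0.0
rng = random.Random(0)

while not done:
    action = rng.randrange(2)  # a random agent, as in the Gym examples
    observation, reward, done, info = env.step(action)
    total_reward += reward

print(observation, total_reward, info)
```

A learning agent would replace the random choice with a policy that uses `observation` and `reward`, while `info` stays reserved for debugging.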

Following are the available Environments in the Gym:

- Classic control and toy text
- Algorithmic
- Atari
- 2D and 3D robots

Here you can find a full list of environments.

Cart-Pole Problem

Here we will try to solve a classic control problem from Reinforcement Learning literature, “The Cart-pole Problem”.

The Cart-pole problem is defined as follows:
“A pole is attached by an un-actuated joint to a cart, which moves along a frictionless track. The system is controlled by applying a force of +1 or -1 to the cart. The pendulum starts upright, and the goal is to prevent it from falling over. A reward of +1 is provided for every timestep that the pole remains upright. The episode ends when the pole is more than 15 degrees from vertical, or the cart moves more than 2.4 units from the center.”
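
The termination rule in this definition can be written down directly. The Python sketch below uses the limits exactly as quoted (15 degrees and 2.4 units). The function names are made up, and counting one reward point per surviving timestep is a simplification for illustration, not Gym's implementation.

```python
ANGLE_LIMIT_DEGREES = 15.0  # from the problem statement quoted above
POSITION_LIMIT = 2.4        # units from the center, same source


def episode_over(pole_angle_degrees, cart_position):
    """True when the pole has tipped too far or the cart left the track."""
    return (
        abs(pole_angle_degrees) > ANGLE_LIMIT_DEGREES
        or abs(cart_position) > POSITION_LIMIT
    )


def episode_return(states):
    """Count +1 for every timestep before the first failing state."""
    total = 0
    for angle, position in states:
        if episode_over(angle, position):
            break
        total += 1
    return total


# Invented trajectory of (pole angle in degrees, cart position) pairs.
trajectory = [(0, 0.0), (5, 1.0), (20, 1.0), (0, 0.0)]
print(episode_return(trajectory))
```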

The following code will quickly let you see what the problem looks like on your computer.

import gym
env = gym.make('CartPole-v0')
env.reset()
for _ in range(1000):
    env.render()
    env.step(env.action_space.sample())

This is what the output will look like:

Coding the neural network

# We first import the necessary libraries and define hyperparameters.
# Note: this code targets the classic Gym API (gym < 0.26), where env.step() returns four values.

import gym
import random
import numpy as np
import tflearn
from tflearn.layers.core import input_data, dropout, fully_connected
from tflearn.layers.estimator import regression
from statistics import median, mean
from collections import Counter

LR = 2.33e-4
env = gym.make("CartPole-v0")
observation = env.reset()
goal_steps = 500
score_requirement = 50
initial_games = 10000

#Now we will define a function to generate training data - 

def initial_population():
    # [OBS, MOVES]
    training_data = []
    # all scores:
    scores = []
    # scores above our threshold:
    accepted_scores = []
    # number of episodes
    for _ in range(initial_games):
        score = 0
        # moves specifically from this episode:
        episode_memory = []
        # previous observation that we saw
        prev_observation = []
        for _ in range(goal_steps):
            # choose random action left or right i.e (0 or 1)
            action = random.randrange(0,2)
            observation, reward, done, info = env.step(action)
            # since that the observation is returned FROM the action
            # we store previous observation and corresponding action
            if len(prev_observation) > 0 :
                episode_memory.append([prev_observation, action])
            prev_observation = observation
            score+=reward
            if done: break

        # reinforcement methodology here.
        # IF our score is higher than our threshold, we save
        # all we're doing is reinforcing the score, we're not trying
        # to influence the machine in any way as to HOW that score is
        # reached.
        if score >= score_requirement:
            accepted_scores.append(score)
            for data in episode_memory:
                # convert to one-hot (this is the output layer for our neural network)
                if data[1] == 1:
                    output = [0,1]
                elif data[1] == 0:
                    output = [1,0]

                # saving our training data
                training_data.append([data[0], output])

        # reset env to play again
        env.reset()
        # save overall scores
        scores.append(score)

    # hand the collected [observation, one-hot action] pairs back to the caller
    return training_data

# Now using tflearn we will define our neural network 

def neural_network_model(input_size):

    network = input_data(shape=[None, input_size, 1], name='input')

    network = fully_connected(network, 128, activation='relu')
    network = dropout(network, 0.8)

    network = fully_connected(network, 256, activation='relu')
    network = dropout(network, 0.8)

    network = fully_connected(network, 512, activation='relu')
    network = dropout(network, 0.8)

    network = fully_connected(network, 256, activation='relu')
    network = dropout(network, 0.8)

    network = fully_connected(network, 128, activation='relu')
    network = dropout(network, 0.8)

    network = fully_connected(network, 2, activation='softmax')
    network = regression(network, optimizer='adam', learning_rate=LR, loss='categorical_crossentropy', name='targets')
    model = tflearn.DNN(network, tensorboard_dir='log')

    return model

#It is time to train the model now -

def train_model(training_data, model=False):

    X = np.array([i[0] for i in training_data]).reshape(-1,len(training_data[0][0]),1)
    y = [i[1] for i in training_data]

    if not model:
        model = neural_network_model(input_size = len(X[0]))

    model.fit({'input': X}, {'targets': y}, n_epoch=5, snapshot_step=500, show_metric=True, run_id='openai_CartPole')
    return model

training_data = initial_population()

model = train_model(training_data)

#Training complete, now we should play the game to see how the output looks like 

scores = []
choices = []
for each_game in range(10):
    score = 0
    game_memory = []
    prev_obs = []
    env.reset()
    for _ in range(goal_steps):
        env.render()

        if len(prev_obs)==0:
            action = random.randrange(0,2)
        else:
            action = np.argmax(model.predict(prev_obs.reshape(-1,len(prev_obs),1))[0])

        choices.append(action)

        new_observation, reward, done, info = env.step(action)
        prev_obs = new_observation
        game_memory.append([new_observation, action])
        score+=reward
        if done: break

    scores.append(score)

print('Average Score:',sum(scores)/len(scores))
print('choice 1:{}  choice 0:{}'.format(float((choices.count(1))/float(len(choices)))*100,float((choices.count(0))/float(len(choices)))*100))
print(score_requirement)

This is what the result will look like:

Conclusion

Though we haven’t used a Reinforcement Learning model in this blog, a normal fully connected neural network gave us a satisfactory accuracy of 60%. We used tflearn, which is a higher-level API on top of TensorFlow for speeding up experimentation. We hope that this blog will give you a head start in using OpenAI Gym.

We are waiting to see exciting implementations using Gym and Reinforcement Learning. Happy Coding!

December 12, 2022

Lessons Learnt While Building an ETL Pipeline for MongoDB & Amazon Redshift Using Apache Airflow
Recently, I was involved in building an ETL (Extract-Transform-Load) pipeline. It involved extracting data from MongoDB collections, performing transformations, and then loading the data into Redshift tables. Many ETL solutions available in the market partly solve this problem, but the key part of an ETL process lies in its ability to transform or process raw data before it is pushed to its destination.

Each ETL pipeline comes with specific business requirements around processing data, which are hard to meet using off-the-shelf ETL solutions. This is why a majority of ETL solutions are custom built, from scratch. In this blog, I am going to talk about what I learned while building a custom ETL solution that moved data from MongoDB to Redshift using Apache Airflow.

Background:

I began by writing a Python-based command line tool which supported different phases of ETL, like extracting data from MongoDB, processing extracted data locally, uploading the processed data to S3, loading data from S3 to Redshift, post-processing and cleanup. I used the PyMongo library to interact with MongoDB and the Boto library for interacting with Redshift and S3.

I kept each operation atomic so that multiple instances of each operation can run independently of each other, which helps to achieve parallelism. One of the major challenges was to achieve parallelism while running the ETL tasks. One option was to develop our own framework based on threads, or to develop a distributed task scheduler using a task queue like Celery combined with a message broker like RabbitMQ. After doing some research I settled on Apache Airflow. Airflow is a Python-based scheduler where you can define DAGs (Directed Acyclic Graphs), which run as per the given schedule and run tasks in parallel in each phase of your ETL. You define a DAG as Python code, and Airflow also enables you to handle the state of your DAG run using environment variables. Features like task retries and failure handling are a plus.

We faced several challenges while getting the above ETL workflow to be near real-time and fault tolerant. We discuss the challenges faced and the solutions below:

Keeping your ETL code changes in sync with Redshift schema

While you are building the ETL tool, you may end up fetching a new field from MongoDB, but at the same time, you have to add that column to the corresponding Redshift table. If you fail to do so the ETL pipeline will start failing. In order to tackle this, I created a database migration tool which would become the first step in my ETL workflow.

The migration tool would:
- keep the migration status in a Redshift table, and
- track all migration scripts in a code directory.

In each ETL run, it would get the most recently run migrations from Redshift and search for any new migration script in the code directory. If it found one, it would run the new migration script, after which the regular ETL tasks would run. This puts the onus on developers to add a migration script whenever they make a change such as adding or removing a field fetched from MongoDB.
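
The core of such a migration step can be sketched in a few lines. This is only an illustration under assumptions: the function names and script file names are hypothetical, the status table is replaced by a plain list, and `execute` stands in for whatever runs a script against Redshift.

```python
def pending_migrations(available, applied):
    """Return migration script names that have not been applied yet, in sorted order."""
    return sorted(set(available) - set(applied))


def run_migrations(available, applied, execute):
    """Run each pending migration once and record it as applied."""
    for name in pending_migrations(available, applied):
        execute(name)         # e.g. run the SQL in the script against Redshift
        applied.append(name)  # e.g. insert a row into the migration status table
    return applied


# Hypothetical file names; a numeric prefix keeps the run order stable.
scripts = ["001_create_users.sql", "002_add_email.sql", "003_drop_legacy.sql"]
already_applied = ["001_create_users.sql"]
executed = []
run_migrations(scripts, already_applied, executed.append)
print(executed)
```

Running it a second time executes nothing, which is the property that makes it safe to place at the start of every ETL run.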

Maintaining data consistency

While extracting data from MongoDB, one needs to ensure all the collections are extracted at a specific point in time else there can be data inconsistency issues. We need to solve this problem at multiple levels:
- While extracting data from MongoDB define parameters like modified date and extract data from different collections with a filter as records less than or equal to that date. This will ensure you fetch point in time data from MongoDB.
- While loading data into Redshift tables, don’t load directly to master table, instead load it to some staging table. Once you are done loading data in staging for all related collections, load it to master from staging within a single transaction. This way data is either updated in all related tables or in none of the tables.
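
The staging-to-master step can be sketched as follows. To keep the example runnable without a cluster it uses SQLite from the Python standard library as a stand-in for Redshift, so the SQL dialect and the connection handling are assumptions; the table names and `promote_staging` are hypothetical. The point it illustrates is only the all-or-nothing transaction.

```python
import sqlite3


def promote_staging(conn, tables):
    """Copy every staging table into its master table inside one transaction."""
    with conn:  # commits on success, rolls back if any statement raises
        for table in tables:
            conn.execute(f"DELETE FROM {table}")
            conn.execute(f"INSERT INTO {table} SELECT * FROM {table}_staging")


conn = sqlite3.connect(":memory:")
for name in ("users", "orders"):
    conn.execute(f"CREATE TABLE {name} (id INTEGER PRIMARY KEY, value TEXT)")
    conn.execute(f"CREATE TABLE {name}_staging (id INTEGER PRIMARY KEY, value TEXT)")
conn.execute("INSERT INTO users_staging VALUES (1, 'alice')")
conn.execute("INSERT INTO orders_staging VALUES (10, 'book')")
conn.commit()

promote_staging(conn, ["users", "orders"])
print(conn.execute("SELECT * FROM users").fetchall())
```

If any statement fails part-way through, the rollback leaves every master table exactly as it was before the run.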

A single bad record can break your ETL

While moving data across the ETL pipeline into Redshift, one needs to take care of field formats. For example, the format of a date field in the incoming data can differ from the one in the Redshift schema design. As another example, the incoming data can exceed the length of the field in the schema. Redshift’s COPY command, which is used to load data from files into Redshift tables, is very sensitive to such mismatches in data types. Even a single incorrectly formatted record will lead to all your data getting rejected, effectively breaking the ETL pipeline.

There are multiple ways in which we can solve this problem. We can either handle it in one of the transform jobs in the pipeline, or put the onus on Redshift to handle these variances. Redshift’s COPY command has many options which can help you solve these problems. Some of the very useful options are:
- ACCEPTANYDATE: Allows any date format, including invalid formats such as 00/00/00 00:00:00, to be loaded without generating an error.
- ACCEPTINVCHARS: Enables loading of data into VARCHAR columns even if the data contains invalid UTF-8 characters.
- TRUNCATECOLUMNS: Truncates data in columns to the appropriate number of characters so that it fits the column specification.
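
As a rough sketch of how these options are attached to a load, the snippet below only builds the COPY statement as a string and does not execute it. The table name, S3 prefix and IAM role are placeholders invented for the example, and `build_copy_statement` is a hypothetical helper.

```python
def build_copy_statement(table, s3_prefix, iam_role, options=()):
    """Return a Redshift COPY statement as a string (nothing is executed here)."""
    lines = [
        f"COPY {table}",
        f"FROM '{s3_prefix}'",
        f"IAM_ROLE '{iam_role}'",
    ]
    lines.extend(options)
    return "\n".join(lines) + ";"


sql = build_copy_statement(
    table="staging.orders",                                            # placeholder
    s3_prefix="s3://example-etl-bucket/orders/part-",                  # placeholder
    iam_role="arn:aws:iam::123456789012:role/example-redshift-copy",   # placeholder
    options=("GZIP", "ACCEPTANYDATE", "ACCEPTINVCHARS", "TRUNCATECOLUMNS"),
)
print(sql)
```

Keeping the options in one tuple makes it easy to review which tolerances the pipeline has opted into, since each of them silently accepts data that would otherwise be rejected.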

Redshift going out of storage

Redshift is based on PostgreSQL, and one common problem is that deleting records from Redshift tables does not actually free up space. So if your ETL process deletes and creates records frequently, you may run out of Redshift storage space. The VACUUM operation is the solution to this problem. Instead of making VACUUM a part of your main ETL flow, define a separate workflow that runs VACUUM on its own schedule. The VACUUM operation reclaims space and re-sorts rows in either a specified table or all tables in the current database. It can be run as FULL, SORT ONLY, DELETE ONLY or REINDEX. More information on VACUUM can be found here.

ETL instance going out of storage

Your ETL will generate a lot of files by extracting data from MongoDB onto your ETL instance. It is very important to delete those files periodically, otherwise you are very likely to run out of storage on your ETL server. If your data from MongoDB is huge, you might end up creating large files on your ETL server. Again, I would recommend defining a separate workflow that runs the cleanup operation on its own schedule.
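
A cleanup task of this kind might look like the sketch below. The one-day retention window, the file names and the function name `delete_old_files` are assumptions made for the illustration; the demo runs against a temporary directory rather than a real ETL work directory.

```python
import os
import tempfile
import time
from pathlib import Path


def delete_old_files(directory, max_age_seconds, now=None):
    """Delete regular files in `directory` whose last modification is older than the limit."""
    now = time.time() if now is None else now
    removed = []
    for path in Path(directory).iterdir():
        if path.is_file() and now - path.stat().st_mtime > max_age_seconds:
            path.unlink()
            removed.append(path.name)
    return sorted(removed)


with tempfile.TemporaryDirectory() as workdir:
    old_file = Path(workdir, "extract_old.json.gz")
    new_file = Path(workdir, "extract_new.json.gz")
    old_file.write_text("old")
    new_file.write_text("new")
    # Pretend the first file was written two days ago.
    two_days_ago = time.time() - 2 * 24 * 3600
    os.utime(old_file, (two_days_ago, two_days_ago))
    removed = delete_old_files(workdir, max_age_seconds=24 * 3600)
    remaining = sorted(p.name for p in Path(workdir).iterdir())

print(removed, remaining)
```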

Making ETL Near Real Time

Processing only the delta rather than doing a full load in each ETL run

ETL would be faster if you keep track of the already processed data and process only the new data. If you do a full load of data in each ETL run, the solution will not scale as your data grows. As a solution to this, we made it mandatory for every collection in our MongoDB to have a created date and a modified date. Our ETL would check the maximum value of the modified date for the given collection in the Redshift table. It would then generate a filter query to fetch only those records from MongoDB whose modified date is greater than that maximum value. It may be difficult for you to make changes in your product, but it’s worth the effort!
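
The filter itself is small. The sketch below builds it as a plain dictionary, combining the delta lower bound with a point-in-time upper bound; the field name `modified` and the helper name are assumptions, while `$gt` and `$lte` are standard MongoDB query operators.

```python
from datetime import datetime, timezone


def build_delta_filter(last_loaded, cutoff):
    """Build a MongoDB filter for records modified after last_loaded and up to cutoff."""
    modified = {"$lte": cutoff}
    if last_loaded is not None:  # None means a first run: load everything up to cutoff
        modified["$gt"] = last_loaded
    return {"modified": modified}


# last_loaded would come from MAX(modified) in the Redshift table.
last_loaded = datetime(2022, 12, 11, tzinfo=timezone.utc)
cutoff = datetime(2022, 12, 12, tzinfo=timezone.utc)
delta_filter = build_delta_filter(last_loaded, cutoff)
print(delta_filter)
```

The resulting dictionary can be passed as the query argument when reading the collection, and using the same cutoff for every collection in a run keeps the extracted data consistent.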

Compressing and splitting files while loading

A good approach is to write files in a compressed format. It saves storage space on the ETL server and also helps when you load data into Redshift. Redshift’s documentation for the COPY command recommends providing compressed files as input. Also, instead of a single huge file, you should split your data into multiple files and give all of them to a single COPY command. This enables Redshift to use its computing resources across the cluster to do the copy in parallel, leading to faster loads.
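
A minimal sketch of writing split, gzip-compressed part files is shown below. The pipe-delimited record layout, the part size and the name `write_gzip_parts` are assumptions for the illustration; in a real pipeline the parts would then be uploaded under a common S3 prefix.

```python
import gzip
import tempfile
from pathlib import Path


def write_gzip_parts(lines, out_dir, prefix, lines_per_part):
    """Write `lines` into numbered gzip part files and return their paths."""
    paths = []
    for start in range(0, len(lines), lines_per_part):
        path = Path(out_dir, f"{prefix}{len(paths):04d}.gz")
        with gzip.open(path, "wt", encoding="utf-8") as handle:
            handle.writelines(line + "\n" for line in lines[start:start + lines_per_part])
        paths.append(path)
    return paths


records = [f"{i}|name-{i}" for i in range(10)]
with tempfile.TemporaryDirectory() as out_dir:
    parts = write_gzip_parts(records, out_dir, "orders_part_", lines_per_part=4)
    part_names = [p.name for p in parts]
    # Read the parts back to confirm nothing was lost in the split.
    restored = []
    for p in parts:
        with gzip.open(p, "rt", encoding="utf-8") as handle:
            restored.extend(line.rstrip("\n") for line in handle)

print(part_names)
```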

Streaming mongo data directly to S3 instead of writing it to ETL server

One of the major overheads in the ETL process is writing data to the ETL server first and then uploading it to S3. To reduce disk I/O, you should not store data on the ETL server. Instead, use MongoDB’s handy stream API. In the MongoDB Node driver, both the collection.find() and the collection.aggregate() functions return cursors, and the cursor’s stream method accepts a transform function as a parameter. All your custom transform logic can go into the transform function. The upload() function of AWS S3’s Node library also accepts readable streams. Use the stream from the MongoDB Node stream method, pipe it into zlib to gzip it, then feed the readable stream into AWS S3’s Node library. Simple! You will see a large improvement in your ETL process from this simple but important change.
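
The paragraph above describes the Node.js driver. The same idea, transform then gzip then hand over without touching disk, can be sketched in Python with generators and the standard zlib module. This is an analogy under assumptions, not the Node API: the records are an in-memory list standing in for a cursor, the upload step is left out, and `gzip_stream` is a hypothetical name.

```python
import gzip
import json
import zlib


def gzip_stream(records, transform):
    """Yield gzip-compressed chunks containing one JSON line per transformed record."""
    # Adding 16 to wbits makes zlib emit a gzip header and trailer.
    compressor = zlib.compressobj(wbits=zlib.MAX_WBITS | 16)
    for record in records:
        line = json.dumps(transform(record)) + "\n"
        chunk = compressor.compress(line.encode("utf-8"))
        if chunk:
            yield chunk
    yield compressor.flush()


records = [{"_id": "a1", "qty": 2}, {"_id": "b2", "qty": 5}]  # stand-in for a cursor


def rename_id(record):
    return {"id": record["_id"], "qty": record["qty"]}


payload = b"".join(gzip_stream(records, rename_id))
decoded = gzip.decompress(payload).decode("utf-8").splitlines()
print(decoded)
```

Because the generator yields chunks as records arrive, an uploader that accepts an iterable or file-like body could consume it without the full extract ever being held on disk.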

Optimizing Redshift Queries

Optimizing Redshift queries helps make the ETL system highly scalable and efficient, and also reduces cost. Let’s look at some of the approaches:

Add a distribution key

Redshift database is clustered, meaning your data is stored across cluster nodes. When you query for certain set of records, Redshift has to search for those records in each node, leading to slow queries. A distribution key is a single metric, which will decide the data distribution of all data records across your tables. If you have a single metric which is available for all your data, you can specify it as distribution key. When loading data into Redshift, all data for a certain value of distribution key will be placed on a single node of Redshift cluster. So when you query for certain records Redshift knows exactly where to search for your data. This is only useful when you are also using the distribution key to query the data.

Source: Slideshare

Generating a numeric primary key for string primary key

In MongoDB, you can have any type of field as your primary key. If your Mongo collections have a non-numeric primary key and you use those same keys in Redshift, your joins will end up being on string keys, which are slower. Instead, generate numeric keys for your string keys and join on them, which will make queries run much faster. Redshift supports marking a column as IDENTITY, which auto-generates a unique numeric value for the column that you can use as your primary key.

Conclusion:

In this blog, I have covered the best practices around building ETL pipelines for Redshift based on my learning. There are many more recommended practices which can be easily found in Redshift and MongoDB documentation.
December 12, 2022
A Step Towards Machine Learning Algorithms: Univariate Linear Regression
These days the concept of Machine Learning is evolving rapidly. The field is so vast and open that everyone has their own view of it. Here I am sharing mine. This blog is about my experience with learning algorithms. In this blog, we will get to know the basic difference between Artificial Intelligence, Machine Learning, and Deep Learning. We will also get to know the foundational Machine Learning algorithm, i.e. Univariate Linear Regression.

Intermediate knowledge of Python and its library (Numpy, Pandas, MatPlotLib) is good to start. For Mathematics, a little knowledge of Algebra, Calculus and Graph Theory will help to understand the trick of the algorithm.

A way to Artificial intelligence, Machine Learning, and Deep Learning

These are the three buzzwords of today’s Internet world, and many see in them the future of programming. Specifically, we can say that this is the place where the science domain meets programming. Here we use scientific concepts and mathematics with a programming language to simulate the decision-making process. Artificial Intelligence is a program or the ability of a machine to make decisions more as humans do. Machine Learning is another program that supports Artificial Intelligence. It helps the machine to observe a pattern and learn from it to make a decision. Here programming helps in observing the patterns, not in making decisions. Machine learning requires more and more information from various sources to observe all of the variables for any given pattern to make more accurate decisions. Deep learning supports machine learning by using multi-layer neural networks to extract the required information from the data.

What is Machine Learning

Definition: Machine Learning provides machines with the ability to learn autonomously based on experiences, observations and analyzing patterns within a given data set without explicitly programming.

This is a two-part process. In the first part, it observes and analyses the patterns of the given data and makes a shrewd guess of a mathematical function that will be very close to the pattern. There are various methods for this. A few of them are linear, non-linear, logistic, etc. Here we calculate the error function using the guessed mathematical function and the given data. In the second part, we minimize the error function. The function that minimizes the error is then used for predicting the pattern.

Here are the general steps to understand the process of Machine Learning:
1. Plot the given dataset on the x-y axes.
2. By looking at the graph, guess a mathematical function that closely fits the pattern.
3. Derive the error function from the given dataset and the guessed mathematical function.
4. Try to minimize the error function using an optimization algorithm.
5. The minimized error function gives us a more accurate mathematical function for the given pattern.

Getting Started with the First Algorithm: Univariate Linear Regression

Linear Regression is a very basic algorithm or we can say the first and foundation algorithm to understand the concept of ML. We will try to understand this with an example of given data of prices of plots for a given area. This example will help us understand it better.
```
movieID	title	userID	rating	timestamp
0	1	Toy story	170	3.0	1162208198000
1	1	Toy story	175	4.0	1133674606000
2	1	Toy story	190	4.5	1057778398000
3	1	Toy story	267	2.5	1084284499000
4	1	Toy story	325	4.0	1134939391000
5	1	Toy story	493	3.5	1217711355000
6	1	Toy story	533	5.0	1050012402000
7	1	Toy story	545	4.0	1162333326000
8	1	Toy story	580	5.0	1162374884000
9	1	Toy story	622	4.0	1215485147000
10	1	Toy story	788	4.0	1188553740000
```
With this data, we can easily determine the price of plots for the given areas. But what if we want the price of a plot with area 5.0 * 10 sq mtr? There is no direct price for this in our given dataset. So how can we get the price of a plot whose area is not given in the dataset? We can do this using Linear Regression.

So at first, we will plot this data into a graph.

The graph below shows the area of plots (10 sq mtr) on the x-axis and their prices (Lakhs INR) on the y-axis.

Definition of Linear Regression

The objective of a linear regression model is to find a relationship between one or more features (independent variables) and a continuous target variable (dependent variable). When there is only one feature, it is called Univariate Linear Regression, and if there are multiple features, it is called Multiple Linear Regression.

Hypothesis function:

Here we will try to find the relation between price and area of plots. As this is an example of univariate, we can see that the price is only dependent on the area of the plot.

By observing this pattern we can have our hypothesis function as below:

f(x) = w * x + b

where w is the weight and b is the bias.

Different values of (w,b) give different lines, but for one particular set of values the line will be closest to this pattern.

When we generalize this function to multiple variables, there will be a set of w values; these constants are also termed model params.

Note: There is a range of mathematical functions that relate to this pattern, and the selection of the function is totally up to us. But we must take care that it neither underfits nor overfits the data, and the function must be continuous so that we can easily differentiate it and find its global minima or maxima.

Error for a point

As our hypothesis function is continuous, for every $X_i$ (area point) there will be one predicted price $Y_i$, while $Y$ is the actual price at that point.

So the error at any point is

$$E_i = Y_i - Y = f(X_i) - Y$$

These errors are also called residuals. These residuals can be positive (if actual points lie below the predicted line) or negative (if actual points lie above the predicted line). Our goal is to minimize this residual for each of the points.
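
As a small illustration, the hypothesis function and the residuals can be computed as below. The three data points and the chosen (w, b) are made up for the example and are not the article's dataset; `predict` and `residuals` are hypothetical helper names.

```python
def predict(x, w, b):
    """Hypothesis function f(x) = w * x + b."""
    return w * x + b


def residuals(xs, ys, w, b):
    """Return the error f(x_i) - y_i for every training point."""
    return [predict(x, w, b) - y for x, y in zip(xs, ys)]


# Made-up points for illustration only.
areas = [1.0, 2.0, 3.0]
prices = [12.0, 19.0, 32.0]
errors = residuals(areas, prices, w=10.0, b=1.0)
print(errors)
```

A negative entry means the actual point lies above the predicted line, and a positive entry means it lies below.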

Note: While observing the patterns, it is possible that a few points are very far from the pattern. For these far points, the residuals will be much larger, so if such points are few in number, we can ignore them on the assumption that they are errors in the dataset. Such points are termed outliers.

Energy Functions

As there are m training points, we can calculate the average energy function as below:

$$E(w,b) = \frac{1}{m}\sum_{i=1}^{m} E_i$$

and our goal is to minimize the energy function:

$$\min_{(w,b)} E(w,b)$$

Little Calculus: For any continuous function, the points where the first derivative is zero are the points of either minima or maxima. If the second derivative is negative, it is the point of maxima and if it is positive, it is the point of minima.

Here we will do the trick: we will convert our energy function into an upward-opening parabola by squaring the error. Squaring also stops positive and negative residuals from cancelling each other out. This ensures that our energy function has only one global minimum (the point of our concern). It simplifies our calculation: the point where the first derivative of the energy function is zero is the point we need, and the value of (w,b) at that point is our required solution.

So our final energy function is

$$E(w,b) = \frac{1}{2m}\sum_{i=1}^{m} E_i^2$$

Dividing by 2 doesn’t affect our result, and it cancels out when we take the derivative, e.g. the first derivative of $x^2$ is $2x$.
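
A quick numeric check shows why the squared form is the one worth minimizing. The residual values below are made up for the illustration, and the function names are hypothetical.

```python
def mean_error(errors):
    """Plain average of the residuals: (1/m) * sum(E_i)."""
    return sum(errors) / len(errors)


def energy(errors):
    """Squared energy function: 1/(2m) * sum(E_i ** 2)."""
    return sum(e * e for e in errors) / (2 * len(errors))


# Made-up residuals: the line misses every point, yet the errors sum to zero.
errors = [-1.0, 2.0, -1.0]
print(mean_error(errors))  # the positive and negative errors cancel
print(energy(errors))      # the squared version still reports the misfit
```

With the plain average the positive and negative residuals cancel and a poor fit can look perfect, while the squared energy stays above zero until every residual is zero.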

Gradient Descent Method

Gradient descent is a generic optimization algorithm. It iteratively adjusts the parameters of the model, step by step, in order to minimize the energy function.

In the above picture, we can see on the right side:
1. w0 and w1 are randomly initialized, and by following gradient descent the point moves towards the global minimum.
2. The number of turns of the black line is the number of iterations, so it should be neither too large nor too small.
3. The distance between the turns is alpha, i.e. the learning parameter.

By solving the equation on the left side, we can get the model params at the global minimum of the energy function.

Points to consider at the time of Gradient Descent calculations:
1. Random initialization: We start the algorithm at a random point, that is, a random (w, b) value. As the algorithm proceeds, it decides in which direction the next trial should be taken. Since the energy function is an upward-opening parabola, moving in the right direction (towards the global minimum) gives a smaller value than the previous point.
2. Number of iterations: The number of iterations should be neither too large nor too small. If it is too small, we will not reach the global minimum, and if it is too large, we will do extra calculations around the global minimum.
3. Alpha as the learning parameter: When alpha is too small, gradient descent is slow because it takes many small steps to reach the global minimum. If alpha is too big, it can overshoot the global minimum; it may then oscillate around the minimum or even diverge instead of converging.

Implementation of Gradient Descent in Python
```
""" Method to read the csv file using Pandas and later use this data for linear regression. """
""" Better run with Python 3+. """

# Library to read csv file effectively
import pandas
import matplotlib.pyplot as plt
import numpy as np

# Method to read the csv file
def load_data(file_name):
	column_names = ['area', 'price']
	# To read columns
	io = pandas.read_csv(file_name,names=column_names, header=None)
	x_val = (io.values[1:, 0])
	y_val = (io.values[1:, 1])
	size_array = len(y_val)
	for i in range(size_array):
		x_val[i] = float(x_val[i])
		y_val[i] = float(y_val[i])
		return x_val, y_val

# Call the method for a specific file
x_raw, y_raw = load_data('area-price.csv')
# np.float was removed from recent NumPy releases; the built-in float is equivalent
x_raw = x_raw.astype(float)
y_raw = y_raw.astype(float)
y = y_raw

# Modeling
w, b = 0.1, 0.1
num_epoch = 100
converge_rate = np.zeros([num_epoch , 1], dtype=float)
learning_rate = 1e-3
for e in range(num_epoch):
	# Calculate the gradient of the loss function with respect to arguments (model parameters) manually.
	y_predicted = w * x_raw + b
	grad_w, grad_b = (y_predicted - y).dot(x_raw), (y_predicted - y).sum()
	# Update parameters.
	w, b = w - learning_rate * grad_w, b - learning_rate * grad_b
	converge_rate[e] = np.mean(np.square(y_predicted-y))

print(w, b)
print(f"predicted function f(x) = x * {w} + {b}" )
calculatedprice = (10 * w) + b
print(f"price of plot with area 10 sqmtr = 10 * {w} + {b} = {calculatedprice}")
```
This is a basic implementation of the Gradient Descent algorithm using NumPy and pandas. It reads the area-price.csv file; the x-axis values are normalized for better readability of the data points on the graph. We have taken (w, b) = (0.1, 0.1) as an arbitrary initialization, 100 as the number of iterations, and 0.001 as the learning rate.

In every iteration, we recalculate w and b and record the mean squared error so that we can watch the rate of convergence.

We can repeat this calculation of (w, b) for different initial values, numbers of iterations, and learning rates (alpha).
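To make that kind of experiment concrete, here is a minimal pure-Python sketch of the same update rule, wrapped in a function so the initial values, iteration count, and learning rate can be varied. The `fit_line` helper and the tiny data set are assumptions made for illustration; they are not part of the original script.

```python
def fit_line(xs, ys, w=0.1, b=0.1, num_epoch=100, learning_rate=1e-3):
    """Fit y = w * x + b by batch gradient descent on the squared error."""
    history = []  # mean squared error per epoch, to inspect convergence
    for _ in range(num_epoch):
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        # Same gradients as in the NumPy version above
        grad_w = sum(e * x for e, x in zip(errors, xs))
        grad_b = sum(errors)
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
        history.append(sum(e * e for e in errors) / len(xs))
    return w, b, history


# Hypothetical data generated from y = 2x + 1
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

for lr in (1e-3, 1e-2):
    w, b, history = fit_line(xs, ys, num_epoch=2000, learning_rate=lr)
    print(f"lr={lr}: w={w:.3f}, b={b:.3f}, final MSE={history[-1]:.6f}")
```

Comparing the printed values for the two learning rates shows how much closer the larger rate gets to the true line within the same number of iterations on this particular data.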

Note: TensorFlow, another Python library, is preferable for such calculations because it has built-in functions for Gradient Descent. For better understanding, however, we have used NumPy and pandas here.

RMSE (Root Mean Square Error)

RMSE: This metric tells us how accurate our calculation of (w, b) is. It is the square root of the mean of the squared differences between the predicted value f and the observed value.
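As a hedged illustration, the standard RMSE definition can be computed in a few lines of Python. The function name `rmse` and the sample numbers are assumptions for this example only.

```python
import math


def rmse(predicted, observed):
    """Root mean square error between two equally long sequences."""
    if len(predicted) != len(observed) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(f - o) ** 2 for f, o in zip(predicted, observed)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical predictions and observations
predicted = [2.0, 4.0, 6.0]
observed = [2.0, 5.0, 8.0]
print(rmse(predicted, observed))
```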

Note: There is no absolute good or bad threshold for RMSE; it has to be judged against the range of the observed values. For observed values ranging from 0 to 1000, an RMSE of 0.7 is small, but if the range is 0 to 1, it is not that small.

Conclusion

As part of this article, we have seen a brief introduction to Machine Learning and the need for it. Then, with the help of a very basic example, we learned about one of the basic learning algorithms, Linear Regression (univariate only), which can be generalized to the multivariate case. We then used the Gradient Descent method to fit the Linear Regression model and walked through the basic flow of Gradient Descent. Finally, there is one example in Python demonstrating Linear Regression via Gradient Descent.
December 12, 2022
Chatbots With Google DialogFlow: Build a Fun Reddit Chatbot in 30 Minutes
Google DialogFlow

If you’ve been keeping up with the current advancements in the world of chat and voice bots, you’ve probably come across Google’s newest acquisition, DialogFlow (formerly api.ai), a platform that provides use-case-specific, engaging voice and text-based conversations powered by AI. While understanding the intricacies of human conversations, where we say one thing but mean the other, is still an art lost on machines, a domain-specific bot is the closest thing we can build.

What is DialogFlow anyway?

Natural language understanding (NLU) has always been the painful part while building a chatbot. How do you make sure your bot is actually understanding what the user says, and parsing their requests correctly? Well, here’s where DialogFlow comes in and fills the gap. It actually replaces the NLU parsing bit so that you can focus on other areas like your business logic!

DialogFlow is simply a tool that allows you to make bots (or assistants or agents) that understand human conversation, string together a meaningful API call with appropriate parameters after parsing the conversation and respond with an adequate reply. You can then deploy this bot to any platform of your choosing – Facebook Messenger, Slack, Google Assistant, Twitter, Skype, etc. Or on your own app or website as well!

The building blocks of DialogFlow

Agent: DialogFlow allows you to make NLU modules, called agents (basically the face of your bot). This agent connects to your backend and provides it with business logic.

Intent: An agent is made up of intents. Intents are simply actions that a user can perform on your agent. An intent maps what a user says to the action that should be taken. Intents are the entry points into a conversation.

In short, a user may request the same thing in many ways, re-structuring their sentences. But in the end, they should all resolve to a single intent.

Examples of intents can be:
“What’s the weather like in Mumbai today?” or “What is the recipe for an omelet?”

You can create as many intents as your business logic requires, and even correlate them using contexts. An intent decides which API to call, with what parameters, and how to respond to a user’s request.

Entity: An agent wouldn’t know what values to extract from a given user’s input. This is where entities come into play. Any information in a sentence, critical to your business logic, will be an entity. This includes stuff like dates, distance, currency, etc. There are system entities, provided by DialogFlow for simple things like numbers and dates. And then there are developer defined entities. For example, “category”, for a bot about Pokemon! We’ll dive into how to make a custom developer entity further in the post.

Context: The final concept before we can get started with coding is “Context”. This is what makes the bot truly conversational. A context-aware bot can remember things and hold a conversation like humans do. Consider the following conversation:

“Hey, are you coming for piano practice tonight?”
“Sorry, I’ve got dinner plans.”
“Okay, what about tomorrow night then?”
“That works!”

Did you notice what just happened? The first question is straightforward to parse: The time is “tonight”, and the event, “piano practice”.

However, the second question, “Okay, what about tomorrow night then?” doesn’t specify anything about the actual event. It’s implied that we’re talking about “piano practice”. This sort of understanding comes naturally to us humans, but bots have to be explicitly programmed so that they understand the context across these sentences.

Making a Reddit Chatbot using DialogFlow

Now that we’re well equipped with the basics, let’s get started! We’re going to make a Reddit bot that tells a joke or an interesting fact from the day’s top threads on specific subreddits. We’ll also sprinkle in some context awareness so that the bot doesn’t feel “rigid”.

NOTE: You would need a billing-enabled account on Google Cloud Platform(GCP) if you want to follow along with this tutorial. It’s free and just needs your credit card details to set up.

Creating an Agent
1. Log in to the DialogFlow dashboard using your Google account. Here’s the link for the lazy.
2. Click on “Create Agent”
3. Enter the details as below, and hit “Create”. You can select any other Google project if it has billing enabled on it as well.
Setting up a “Welcome” Intent

As soon as you create the agent, you see this intents page:

The “Default Fallback” Intent exists in case the user says something unexpected and is outside the scope of your intents. We won’t worry too much about that right now. Go ahead and click on the “Default Welcome Intent”. We can notice a lot of options that we can tweak.
Let’s start with a triggering phrase. Notice the “User Says” section? We want our bot to activate as soon as we say something along the lines of:

Let’s fill that in. After that, scroll down to the “Responses” tab. You can see some generic welcome messages are provided. Get rid of them, and put in something more personalized to our bot, as follows:

Now, this does a couple of things. Firstly, it lets the user know that they’re using our bot. It also guides the user to the next point in the conversation. Here, it is an “or” question.

Hit “Save” and let’s move on.

Creating a Custom Entity

Before we start playing around with Intents, I want to set up a Custom Entity real quick. If you remember, Entities are what we extract from the user’s input to process further. I’m going to call our Entity “content”, as the user’s request will be for a piece of content: either a joke or a fact. Let’s go ahead and create that. Click on the “Entities” tab on the left sidebar and click “Create Entity”.

Fill in the following details:

As you can see, we have 2 values possible for our content: “joke” and “fact”. We also have entered synonyms for each of them, so that if the user says something like “I want to hear something funny”, we know he wants a “joke” content. Click “Save” and let’s proceed to the next section!

Attaching our new Entity to the Intent

Create a new Intent called “say-content”. Add a phrase “Let’s hear a joke” in the “User Says” section, like so:

Right off the bat, we notice a couple of interesting things. Dialogflow parsed this input and associated the entity content to it, with the correct value (here, “joke”). Let’s add a few more inputs:

PS: Make sure all the highlighted words are in the same color and have associated the same entity. Dialogflow’s NLU isn’t perfect and sometimes assigns different Entities. If that’s the case, just remove it, double-click the word and assign the correct Entity yourself!

Let’s add a placeholder text response to see it work. To do that, scroll to the bottom section “Response”, and fill it like so:

The “$content” is a variable having a value extracted from user’s response that we saw above.

Let’s see this in action. On the right side of every page on Dialogflow’s platform, you see a “Try It Now” box. Use that to test your work at any point in time. I’m going to go ahead and type in “Tell a fact” in the box. Notice that the “Tell a fact” phrase wasn’t present in the samples that we gave earlier. Dialogflow keeps training using its NLU modules and can extract data from similarly structured sentences:

A Webhook to process requests

To keep things simple, I’m gonna write a JS app that fulfills the request by querying Reddit’s website and returning the appropriate content. Luckily for us, Reddit doesn’t need authentication to read in JSON format. Here’s the code:
```
'use strict';
const https = require('https');

// HTTP Cloud Function that Dialogflow calls as its fulfillment webhook.
exports.appWebhook = (req, res) => {
  const content = req.body.result.parameters['content'];
  getContent(content).then((output) => {
    res.setHeader('Content-Type', 'application/json');
    res.send(JSON.stringify({ 'speech': output, 'displayText': output }));
  }).catch((error) => {
    // If there is an error let the user know
    const message = String(error.message || error);
    res.setHeader('Content-Type', 'application/json');
    res.send(JSON.stringify({ 'speech': message, 'displayText': message }));
  });
};

// Map the extracted entity value to a subreddit and a display label.
function getSubreddit (content) {
  if (content == "funny" || content == "joke" || content == "laugh") {
    return {sub: "jokes", displayText: "joke"};
  } else {
    return {sub: "todayILearned", displayText: "fact"};
  }
}

function getContent (content) {
  let subReddit = getSubreddit(content);
  return new Promise((resolve, reject) => {
    console.log('API Request: to Reddit');
    https.get(`https://www.reddit.com/r/${subReddit["sub"]}/top.json?sort=top&t=day`, (resp) => {
      let data = '';
      resp.on('data', (chunk) => {
        data += chunk;
      });
      resp.on('end', () => {
        try {
          let threads = JSON.parse(data)["data"]["children"];
          // Pick a random thread from however many were returned.
          let thread = threads[Math.floor(Math.random() * threads.length)]["data"];
          let output = `Here's a ${subReddit["displayText"]}: ${thread["title"]}`;
          if (subReddit['sub'] == "jokes") {
            output += " " + thread["selftext"];
          }
          output += "\nWhat do you want to hear next, a joke or a fact?";
          console.log(output);
          resolve(output);
        } catch (err) {
          // Malformed JSON or an empty listing ends up here.
          reject(err);
        }
      });
    }).on("error", (err) => {
      console.log("Error: " + err.message);
      reject(err);
    });
  });
}
```
Now, before going ahead, follow the steps 1-5 mentioned here religiously.

NOTE: For step 1, select the same Google Project that you created/used, when creating the agent.

Now, to deploy our function using gcloud:
```
$ gcloud beta functions deploy appWebhook --stage-bucket BUCKET_NAME --trigger-http
```
To find the BUCKET_NAME, go to your Google project’s console and click on Cloud Storage under the Resources section.

After you run the command, make note of the httpsTrigger URL mentioned. On the Dialogflow platform, find the “Fulfilment” tab on the sidebar. We need to enable webhooks and paste in the URL, like this:

Hit “Done” at the bottom of the page, and now for the final step. Visit the “say-content” Intent page and perform a couple of steps.

1. Make the “content” parameter mandatory. This will make the bot explicitly ask the user for the parameter if it’s not clear:

2. Notice a new section has been added to the bottom of the screen called “Fulfilment”. Enable the “Use webhook” checkbox:

Click “Save” and that’s it! Time to test this Intent out!

Reddit’s crappy humor aside, this looks neat. Our replies always drive the conversation to places (Intents) that we want it to.

Adding Context to our Bot

Even though this works perfectly fine, there’s one more thing I’d like to add quickly. We want the user to be able to say, “More” or “Give me another one” and the bot to be able to understand what this means. This is done by emitting and absorbing contexts between intents.

First, to emit the context, scroll up on the “say-content” Intent’s page and find the “Contexts” section. We want to output the “context”. Let’s say for a count of 5. The count makes sure the bot remembers what the “content” is in the current conversation for up to 5 back and forths.

Now, we want to create a new Intent that can absorb this type of context and make sense of phrases like “More please”:

Finally, since we want it to work the same way, we’ll make the Action and Fulfilment sections look the same way as the “say-content” Intent does:

And that’s it! Your bot is ready.

Integrations

Dialogflow provides integrations with probably every messaging service in the Silicon Valley, and more. But we’ll use the Web Demo. Go to “Integrations” tab from the sidebar and enable “Web Demo” settings. Your bot should work like this:

And that’s it! Your bot is ready to face a real person! Now, you can easily keep adding more subreddits, like news, sports, bodypainting, dankmemes or whatever your hobbies in life are! Or make it understand a few more parameters. For example, “A joke about Donald Trump”.

Consider that your homework. You can also add a “Bye” intent, and make the bot stop. Our bot currently isn’t so great with goodbyes, sort of like real people.

Debugging and Tips

If you’re facing issues with no replies from the Reddit script, go to your Google Project and check the Errors and Reportings tab to make sure everything’s fine under the hood. If outbound requests are throwing an error, you probably don’t have billing enabled.

Also, one caveat I found is that the entities can take up any value from the synonyms that you’ve provided. This means you HAVE to hardcode them in your business app as well. Which sucks right now, but maybe DialogFlow will provide a cleaner solution in the near future!
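One way to contain that hardcoding, sketched here in Python rather than the webhook’s JavaScript, is to keep every synonym in a single lookup table and resolve the incoming value to its canonical name before any business logic runs. The table contents and the names `CONTENT_SYNONYMS` and `canonical_content` are hypothetical; the synonyms would have to mirror whatever you entered in the Dialogflow console.

```python
# Hypothetical synonym table; it has to mirror the entity definition
# entered in the Dialogflow console.
CONTENT_SYNONYMS = {
    "joke": {"joke", "funny", "laugh"},
    "fact": {"fact", "trivia", "something interesting"},
}


def canonical_content(value, default="fact"):
    """Map a raw entity value (possibly a synonym) to its canonical name."""
    needle = value.strip().lower()
    for canonical, synonyms in CONTENT_SYNONYMS.items():
        if needle in synonyms:
            return canonical
    return default


print(canonical_content("Funny"))
print(canonical_content("trivia"))
```

Falling back to a default for unknown values matches the webhook above, which treats anything that is not a joke synonym as a fact.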
December 12, 2022
Real Time Text Classification Using Kafka and Scikit-learn
Introduction:

Text classification is one of the essential tasks in supervised machine learning (ML). Assigning categories to text, which can be tweets, Facebook posts, web pages, library books, media articles, galleries, etc., has many applications such as spam filtering and sentiment analysis. In this blog, we build a text classification engine to classify topics in an incoming Twitter stream using Apache Kafka and scikit-learn, a Python-based machine learning library.

Let’s dive into the details. Here is a diagram that shows the components and the data flow. The Kafka producer ingests data from Twitter and sends it to the Kafka broker. The Kafka consumer asks the Kafka broker for the tweets. We convert the binary tweet stream from Kafka to human-readable strings and perform predictions using the saved models. We train the models on Twenty Newsgroups, a prebuilt training data set available through scikit-learn. It is a standard data set used for training classification algorithms.

In this blog we will use the following machine learning models:
- Bag-of-Words(BOW) to convert words to vectors : The bag-of-words model is a way of representing text data when modeling text with machine learning algorithms.
- tf-idf(term frequency–inverse document frequency) and Multinomial Naive Bayes algorithm to do the predictions.
We have used the following libraries/tools:
- tweepy – Twitter library for python
- Apache Kafka
- scikit-learn
- pickle – Python Object serialization library
Let’s first understand the following key concepts:
- Word to Vector Methodology (Word2Vec)
- Bag-of-Words
- tf-idf
- Multinomial Naive Bayes classifier
Word2Vec methodology

One of the key ideas in Natural Language Processing(NLP) is how we can efficiently convert words into numeric vectors which can then be given as an input to machine learning models to perform predictions.

Neural networks and most other machine learning models are nothing but mathematical functions, which need numbers or vectors to churn out an output. Tree-based methods are the exception, as they can work on words.

For this we have an approach known as Word2Vec. A very trivial solution to this would be to use “one-hot” method of converting the word into a sparse matrix with only one element of the vector set to 1, the rest being zero.

For example, “the apple a day the good” would have following representation

Here we have transformed the above sentence into a 6×5 matrix, where 5 is the size of the vocabulary because “the” is repeated. But what are we supposed to do when we have a gigantic dictionary to learn from, say more than 100,000 words? Here one-hot encoding fails. One-hot encoding also loses the relationship between words, for example that “Lanka” should come after “Sri”.
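The one-hot idea can be sketched in a few lines of plain Python. This is only an illustration of the encoding for the example sentence; the helper name `one_hot_encode` and the first-seen ordering of the vocabulary are assumptions.

```python
def one_hot_encode(sentence):
    """Return (vocabulary, matrix) with one one-hot row per word."""
    words = sentence.split()
    vocabulary = []
    for word in words:  # keep first-seen order, drop repeats
        if word not in vocabulary:
            vocabulary.append(word)
    index = {word: i for i, word in enumerate(vocabulary)}
    matrix = []
    for word in words:
        row = [0] * len(vocabulary)
        row[index[word]] = 1
        matrix.append(row)
    return vocabulary, matrix


vocabulary, matrix = one_hot_encode("the apple a day the good")
print(vocabulary)
for row in matrix:
    print(row)
```

Each row has a single 1, and both occurrences of “the” share the same row, which is why the matrix has six rows but only five columns.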

Here is where Word2Vec comes in. Our goal is to vectorize the words while maintaining the context. Word2vec can utilize either of two model architectures to produce a distributed representation of words: continuous bag-of-words (CBOW) or continuous skip-gram. In the continuous bag-of-words architecture, the model predicts the current word from a window of surrounding context words. The order of context words does not influence prediction (bag-of-words assumption). In the continuous skip-gram architecture, the model uses the current word to predict the surrounding window of context words.
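To see the difference between the two architectures, it helps to look at the training examples each one is built from. The sketch below only generates those examples from a token list; it does not train a neural network, and the function names and the window size are assumptions for illustration.

```python
def context_window(tokens, position, window):
    """Words within `window` positions of tokens[position], excluding it."""
    start = max(0, position - window)
    end = min(len(tokens), position + window + 1)
    return [tokens[i] for i in range(start, end) if i != position]


def cbow_examples(tokens, window=1):
    """(context words, current word) pairs: context predicts the word."""
    return [(context_window(tokens, i, window), tokens[i])
            for i in range(len(tokens))]


def skipgram_examples(tokens, window=1):
    """(current word, context word) pairs: the word predicts its context."""
    return [(tokens[i], ctx)
            for i in range(len(tokens))
            for ctx in context_window(tokens, i, window)]


tokens = "an apple a day".split()
print(cbow_examples(tokens))
print(skipgram_examples(tokens))
```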

Tf-idf (term frequency–inverse document frequency)

TF-IDF is a statistic that determines how important a word is to a document in a given corpus. Variations of tf-idf are used by search engines, for text summarization, etc. You can read more about tf-idf here.
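As a rough sketch of the idea, here is one common textbook variant, with tf as the relative term count and idf as log(N / df), in pure Python. Libraries such as scikit-learn add smoothing and normalization, so their numbers will differ; the function name and the sample documents are assumptions.

```python
import math
from collections import Counter


def tf_idf_weights(documents):
    """Return one {term: weight} dict per document."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    doc_freq = Counter()  # number of documents containing each term
    for tokens in tokenized:
        doc_freq.update(set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


docs = ["the bike is fast", "the doctor is busy", "the bike shop"]
for doc, scores in zip(docs, tf_idf_weights(docs)):
    print(doc, "->", scores)
```

A word such as “the” that appears in every document gets a weight of zero, while a word unique to one document gets the highest weight there.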

Multinomial Naive Bayes classifier

The Naive Bayes classifier comes from a family of probabilistic classifiers based on Bayes’ theorem. It is used for tasks such as classifying text as spam or not spam, or as sports or politics. We are going to use it to classify the incoming stream of tweets. You can explore it here.
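For intuition, here is a tiny pure-Python sketch of multinomial Naive Bayes over raw word counts with add-one (Laplace) smoothing. It is a teaching toy, not the scikit-learn implementation used later; the class name and the training sentences are made up for the example.

```python
import math
from collections import Counter, defaultdict


class TinyMultinomialNB:
    """Multinomial Naive Bayes over word counts with add-one smoothing."""

    def fit(self, documents, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        self.vocabulary = set()
        for doc, label in zip(documents, labels):
            tokens = doc.lower().split()
            self.word_counts[label].update(tokens)
            self.vocabulary.update(tokens)
        return self

    def predict(self, document):
        # Words never seen in training carry no information and are skipped
        tokens = [t for t in document.lower().split() if t in self.vocabulary]
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            # log P(class) + sum of log P(word | class)
            score = math.log(n_docs / total_docs)
            total_words = sum(self.word_counts[label].values())
            for token in tokens:
                count = self.word_counts[label][token] + 1
                score += math.log(count / (total_words + len(self.vocabulary)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


train_docs = ["cheap bike for sale", "new motorcycle engine",
              "election vote senate", "president vote debate"]
train_labels = ["rec.motorcycles", "rec.motorcycles",
                "talk.politics.misc", "talk.politics.misc"]
clf = TinyMultinomialNB().fit(train_docs, train_labels)
print(clf.predict("vote for the president"))
```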

Let’s see how they fit together.

The data from the “20 newsgroups datasets” is completely in text format. We cannot feed it directly to any model to do mathematical calculations. We have to extract features from the datasets and have to convert them to numbers which a model can ingest and then produce an output.
So, we use Bag-of-Words and tf-idf to extract features from the datasets and then feed them to the multinomial Naive Bayes classifier to get predictions.

1. Train Your Model

We are going to use this dataset. We create another file and import the needed libraries. We are using sklearn for ML and pickle to save the trained model. Now we define the model.
```
from __future__ import division, print_function, absolute_import
from sklearn.datasets import fetch_20newsgroups  # built-in dataset
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
import pickle
from kafka import KafkaConsumer

# Defining model and training it
categories = ["talk.politics.misc", "misc.forsale", "rec.motorcycles",
              "comp.sys.mac.hardware", "sci.med", "talk.religion.misc"]  # http://qwone.com/~jason/20Newsgroups/ for reference

def fetch_train_dataset(categories):
    twenty_train = fetch_20newsgroups(subset='train', categories=categories, shuffle=True, random_state=42)
    return twenty_train

def bag_of_words(categories):
    # Count word occurrences and save the learned vocabulary for prediction time
    count_vect = CountVectorizer()
    X_train_counts = count_vect.fit_transform(fetch_train_dataset(categories).data)
    with open("vocab.pickle", 'wb') as f:
        pickle.dump(count_vect.vocabulary_, f)
    return X_train_counts

def tf_idf(categories):
    # Returns the fitted transformer and the tf-idf weighted training matrix
    tf_transformer = TfidfTransformer()
    return (tf_transformer, tf_transformer.fit_transform(bag_of_words(categories)))

def model(categories):
    clf = MultinomialNB().fit(tf_idf(categories)[1], fetch_train_dataset(categories).target)
    return clf

model = model(categories)
with open("model.pickle", 'wb') as f:
    pickle.dump(model, f)
print("Training Finished!")
# Training Finished Here
```
2. The Kafka Tweet Producer

We have the trained model in place. Now let’s get the real-time stream of Twitter via Kafka. We define the Producer.
```
# import required libraries
from kafka import SimpleProducer, KafkaClient
from tweepy.streaming import StreamListener
from tweepy import OAuthHandler
from tweepy import Stream
from twitter_config import consumer_key, consumer_secret, access_token, access_token_secret
import json
```
Now we will define Kafka settings and will create KafkaPusher Class. This is necessary because we need to send the data coming from tweepy stream to Kafka producer.
```
# Kafka settings
topic = b'twitter-stream'

# setting up Kafka producer
kafka = KafkaClient('localhost:9092')
producer = SimpleProducer(kafka)

class KafkaPusher(StreamListener):

    def on_data(self, data):
        # Forward the text of each incoming tweet to the Kafka topic
        all_data = json.loads(data)
        tweet = all_data["text"]
        producer.send_messages(topic, tweet.encode('utf-8'))
        return True

    def on_error(self, status):
        print(status)

WORDS_TO_TRACK = ["Politics", "Apple", "Google", "Microsoft", "Bikes", "Harley Davidson", "Medicine"]

if __name__ == '__main__':
    l = KafkaPusher()
    auth = OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_token_secret)
    stream = Stream(auth, l)
    while True:
        try:
            stream.filter(languages=["en"], track=WORDS_TO_TRACK)
        except Exception:
            # Ignore stream errors and reconnect
            pass
```
Note – You need to start Kafka server before running this script.

3. Loading your model for predictions

Now we have the trained model from step 1 and a Twitter stream from step 2. Let’s use the model to do actual predictions. The first step is to load the model:
```
#Loading model and vocab
print("Loading pre-trained model")
vocabulary_to_load = pickle.load(open("vocab.pickle", 'rb'))
count_vect = CountVectorizer(vocabulary=vocabulary_to_load)
load_model = pickle.load(open("model.pickle", 'rb'))
count_vect._validate_vocabulary()
tfidf_transformer = tf_idf(categories)[0]
```
Then we start the kafka consumer and begin predictions:
```
#predicting the streaming kafka messages
consumer = KafkaConsumer('twitter-stream',bootstrap_servers=['localhost:9092'])
# Load the category names once instead of on every message
target_names = fetch_train_dataset(categories).target_names
print("Starting ML predictions.")
for message in consumer:
    # Kafka delivers raw bytes; decode them into a readable string first
    tweet = message.value.decode('utf-8')
    X_new_counts = count_vect.transform([tweet])
    X_new_tfidf = tfidf_transformer.transform(X_new_counts)
    predicted = load_model.predict(X_new_tfidf)
    print(tweet + " => " + target_names[predicted[0]])
```
The following are some of the classifications made by our model:
- RT @amazingatheist: Making fun of kids who survived a school shooting just days after the event because you disagree with their politics is… => talk.politics.misc
- sci.med
- RT @DavidKlion: Apropos of that D’Souza tweet; I think in order to make sense of our politics, you need to understand that there are some t… => talk.politics.misc
- RT @BeauWillimon: These students have already cemented a place in history with their activism, and they’re just getting started. No one wil… => talk.politics.misc
- RT @byedavo: Cause we ain’t got no president => talk.politics.misc
- RT @appleinsider: .@Apple reportedly in talks to buy cobalt, key Li-ion battery ingredient, directly from miners … => comp.sys.mac.hardware
Here is the link to the complete git repository

Conclusion:

In this blog, we created a data pipeline that uses the Naive Bayes model to classify streaming Twitter data. The same approach can classify other sources of data such as news articles, blog posts, etc. Do let us know if you have any questions or additional thoughts in the comments section below.

Happy coding!
December 12, 2022
A Quick Guide to Building a Serverless Chatbot With Amazon Lex
Amazon announced “Amazon Lex” in December 2016 and since then we’ve been using it to build bots for our customers. Lex is effectively the technology used by Alexa, Amazon’s voice-activated virtual assistant which lets people control things with voice commands such as playing music, setting alarm, ordering groceries, etc. It provides deep learning-powered natural-language understanding along with automatic speech recognition. Amazon now provides it as a service that allows developers to take advantage of the same features used by Amazon Alexa. So, now there is no need to spend time in setting up and managing the infrastructure for your bots.

Now, developers just need to design conversations according to their requirements in Lex console. The phrases provided by the developer are used to build the natural language model. After publishing the bot, Lex will process the text or voice conversations and execute the code to send responses.

I’ve put together this quick-start tutorial to help you start building Lex chatbots. To understand the terms correctly, let’s consider an e-commerce bot that supports conversations involving the purchase of books.

Lex-Related Terminologies

Bot: It consists of all the components related to a conversation, which includes:
- Intent: An intent represents a goal that the bot’s user wants to achieve. In our case, the goal is to purchase books.
- Utterances: An utterance is a text phrase that invokes intent. If we have more than one intent, we need to provide different utterances for them. Amazon Lex builds a language model based on utterance phrases provided by us, which then invoke the required intent. For our demo example, we need a single intent “OrderBook”. Some sample utterances would be:
  - I want to order some books
  - Can you please order a book for me
- Slots: Each slot is a piece of data that the user must supply in order to fulfill the intent. For instance, purchasing a book requires bookType and bookName as slots for intent “OrderBook” (I am considering these two factors for making the example simpler, otherwise there are so many other factors based on which one will purchase/select a book.).
  Slots are inputs (a string, date, city, location, boolean, number, etc.) that are needed to reach the goal of the intent. Each slot has a name, a slot type, a prompt, and a flag indicating whether it is required. The slot type defines the valid values a user can respond with, and can be either custom defined or one of the Amazon pre-built types.
- Prompt: A prompt is a question that Lex uses to ask the user to supply some correct data (for a slot) that is needed to fulfill an intent e.g. Lex will ask “what type of book you want to buy?” to fill the slot bookType.
- Fulfillment: Fulfillment provides the business logic that is executed after all the required slot values needed to achieve the goal have been collected. Amazon Lex supports the use of Lambda functions for fulfillment of business logic and for validations.
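How these pieces relate can be sketched in plain Python. This is not the Lex API, just a simplified and hypothetical model of the “OrderBook” intent: slots are asked for in order, and fulfillment can start once none is missing. The second prompt text is made up for the example.

```python
# Hypothetical, simplified model of the "OrderBook" intent described above.
ORDER_BOOK_INTENT = {
    "name": "OrderBook",
    "slots": [  # asked in this order
        {"name": "bookType", "prompt": "What type of book do you want to buy?"},
        {"name": "bookName", "prompt": "Which book do you want to order?"},
    ],
}


def next_prompt(intent, filled_slots):
    """Return the prompt for the first unfilled slot, or None when the
    intent has everything it needs and is ready for fulfillment."""
    for slot in intent["slots"]:
        if not filled_slots.get(slot["name"]):
            return slot["prompt"]
    return None


print(next_prompt(ORDER_BOOK_INTENT, {}))
print(next_prompt(ORDER_BOOK_INTENT, {"bookType": "fiction"}))
print(next_prompt(ORDER_BOOK_INTENT, {"bookType": "fiction", "bookName": "Dune"}))
```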
Let’s Implement this Bot!

Now that we are aware of the basic terminology used in Amazon Lex, let’s start building our chat-bot.

Creating Lex Bot:
- Go to the Amazon Lex console, which is available only in the US East (N. Virginia) region, and click on the Create button.
- Create a custom bot by providing following information:
1. Bot Name: PurchaseBook
2. Output voice: None, this is only a test based application
3. Set Session Timeout: 5 min
4. Add Amazon Lex basic role to Bot app: Amazon will create it automatically. Find out more about Lex roles & permissions here.
5. Click on Create button, which will redirect you to the editor page.
Architecting Bot Conversations

Create Slots: We are creating two slots named bookType and bookName. Slot type values can be chosen from 275 pre-built types provided by Amazon or we can create our own customized slot types.

Create a custom slot type for bookType as shown here, and use the predefined type named AMAZON.Book for bookName.

Create Intent: Our bot requires a single custom intent named OrderBook.

Configuring the Intents
- Utterances: Provide some utterances to invoke the intent. An utterance can consist only of Unicode characters, spaces, and valid punctuation marks. Valid punctuation marks are periods for abbreviations, underscores, apostrophes, and hyphens. If there is a slot placeholder in your utterance, ensure that it is in the {slotName} format and has spaces at both ends.
- Slots: Map slots to their types and provide the prompt questions that need to be asked to get a valid value for each slot. Note the sequence: the Lex bot will ask the questions in order of priority.

Confirmation prompt: This is optional. If required you can provide a confirmation message e.g. Are you sure you want to purchase book named {bookName}?, where bookName is a slot placeholder.
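
The {slotName} placeholder convention can be illustrated with a small Python sketch. This is not how Lex implements it internally; the `placeholders` and `render` helpers are hypothetical, and the sample utterance is made up. The sketch only shows how placeholder names can be found in a template and replaced with collected slot values.

```python
import re

# Matches a slot placeholder such as {bookName}
PLACEHOLDER = re.compile(r"\{(\w+)\}")


def placeholders(template):
    """List the slot names referenced in a template string."""
    return PLACEHOLDER.findall(template)


def render(template, slots):
    """Replace each {slotName} with its value; fail loudly if a slot is missing."""
    def substitute(match):
        name = match.group(1)
        if name not in slots:
            raise KeyError("no value for slot '%s'" % name)
        return str(slots[name])

    return PLACEHOLDER.sub(substitute, template)


utterance = "I want to order a {bookType} book"
confirmation = "Are you sure you want to purchase book named {bookName}?"

print(placeholders(utterance))
print(render(confirmation, {"bookName": "twilight"}))
```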

Fulfillment: Now that we have gathered all the necessary data from the chatbot, it can be passed to a Lambda function, or the parameters can be returned to the client application, which then calls a REST endpoint.

Creating Amazon Lambda Functions

Amazon Lex supports Lambda functions as code hooks for the bot. These functions can serve multiple purposes, such as improving the user interaction with the bot by using prior knowledge, validating the input data that the bot received from the user, and fulfilling the intent.
- Go to the AWS Lambda console and choose Create a Lambda function.
- Select Blank Function as the blueprint and click Next.
- To configure your Lambda function, provide its name, its runtime, and the code that needs to be executed when the function is invoked. The code can also be uploaded as a zip file instead of being provided inline. We are using Node.js 4.3 as the runtime.
- Click Next and choose Create Function.
We can configure our bot to invoke these Lambda functions at two places. We need to do this while configuring the intent, as shown below:

where botCodeHook and fulfillment are the names of the Lambda functions we created.

Lambda initialization and validation

The Lambda function provided here, botCodeHook, is invoked on each user input whose intent is understood by Amazon Lex. It validates bookName against a predefined list of books.
```
'use strict';

exports.handler = (event, context, callback) => {
    const sessionAttributes = event.sessionAttributes;
    const slots = event.currentIntent.slots;
    const bookName = slots.bookName;

    // predefined list of available books
    const validBooks = ['harry potter', 'twilight', 'wings of fire'];

    // negative check: if the supplied book is not in the list, tell Lex to
    // ask the user for the bookName slot again
    if (bookName && validBooks.indexOf(bookName.toLowerCase()) === -1) {
        const response = {
            sessionAttributes: sessionAttributes,
            dialogAction: {
                type: "ElicitSlot",
                message: {
                    contentType: "PlainText",
                    content: `We do not have book: ${bookName}. Please provide another book name, e.g. twilight.`
                },
                intentName: event.currentIntent.name,
                slots: slots,
                slotToElicit: "bookName"
            }
        };
        // return here so that the callback is not invoked a second time below
        return callback(null, response);
    }

    // if the book name is valid (or not supplied yet), let Lex choose the next course of action
    const response = {
        sessionAttributes: sessionAttributes,
        dialogAction: {
            type: "Delegate",
            slots: slots
        }
    };
    callback(null, response);
};
```
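AWS Lambda also offers Python runtimes, so the same validation hook could be written in Python. The following is a sketch under the assumption that the event has the same shape as in the Node.js code above (`currentIntent.slots`, `sessionAttributes`); the `sample_event` dictionary is a hand-made test input, not a captured Lex event.

```python
VALID_BOOKS = {"harry potter", "twilight", "wings of fire"}


def lambda_handler(event, context):
    slots = event["currentIntent"]["slots"]
    book_name = slots.get("bookName")

    # Unknown book: ask Lex to elicit the bookName slot again
    if book_name and book_name.lower() not in VALID_BOOKS:
        return {
            "sessionAttributes": event.get("sessionAttributes"),
            "dialogAction": {
                "type": "ElicitSlot",
                "message": {
                    "contentType": "PlainText",
                    "content": "We do not have book: %s. Please provide another book name." % book_name,
                },
                "intentName": event["currentIntent"]["name"],
                "slots": slots,
                "slotToElicit": "bookName",
            },
        }

    # Valid book, or no book supplied yet: let Lex decide the next step
    return {
        "sessionAttributes": event.get("sessionAttributes"),
        "dialogAction": {"type": "Delegate", "slots": slots},
    }


# Hand-made event for a quick local check (hypothetical values)
sample_event = {
    "sessionAttributes": {},
    "currentIntent": {
        "name": "OrderBook",
        "slots": {"bookType": "fiction", "bookName": "Dracula"},
    },
}

print(lambda_handler(sample_event, None)["dialogAction"]["type"])
```

Because a Python handler returns its response instead of invoking a callback, each branch produces exactly one response.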
Fulfillment code hook

This lambda function is invoked after receiving all slot data required to fulfill the intent.
```
'use strict';

exports.handler = (event, context, callback) => {
    // when the intent is fulfilled, tell Lex to close the conversation
    let response = {sessionAttributes: event.sessionAttributes,
      dialogAction: {
        type: "Close",
        fulfillmentState: "Fulfilled",
        message: {
          contentType: "PlainText",
          content: "Thanks for purchasing book."
        }
      }
    }
    callback(null, response);
};
```
Error Handling: We can customize the error messages shown to our bot users. Click on error handling and replace the default values with the required ones. Since the number of retries given is two, we can also provide a different message for each retry.

Your Bot is Now Ready To Chat

Click on Build to build the chat-bot. Congratulations! Your Lex chat-bot is ready to test. We can test it in the overlay which appears in the Amazon Lex console.

Sample conversations:

I hope you have understood the basic terminology of Amazon Lex, along with how to create a simple chat-bot using serverless functions (AWS Lambda). This is a really powerful platform for building mature and intelligent chatbots.
December 12, 2022
Your Complete Guide to Building Stateless Bots Using Rasa Stack
This blog aims at exploring the Rasa Stack to create a stateless chat-bot. We will look into how the recently released Rasa Core, which provides machine learning based dialogue management, helps in maintaining the context of conversations in an efficient way.

If you have developed chatbots, you would know how hopelessly bots fail in maintaining the context once complex use-cases need to be developed. There are some home-grown approaches that people currently use to build stateful bots. The most naive approach is to create the state machines where you create different states and based on some logic take actions. As the number of states increases, more levels of nested logic are required or there is a need to add an extra state to the state machine, with another set of rules for how to get in and out of that state. Both of these approaches lead to fragile code that is harder to maintain and update. Anyone who’s built and debugged a moderately complex bot knows this pain.

After building many chatbots, we have experienced that flowcharts are useful for doing the initial design of a bot and describing a few of the known conversation paths, but we shouldn’t hard-code a bunch of rules since this approach doesn’t scale beyond simple conversations.

Thanks to the Rasa team, there is now a way to go stateless, where scaling is not a problem at all. Let’s build a bot using Rasa Core and learn more about this.

Rasa Core: Getting Rid of State Machines

The main idea behind Rasa Core is that thinking of conversations as a flowchart and implementing them as a state machine doesn’t scale. It’s very hard to reason about all possible conversations explicitly, but it’s very easy to tell, mid-conversation, if a response is right or wrong. For example, let’s consider a term insurance purchase bot, where you have defined different states to take different actions. The diagram below shows an example state machine:

Let’s consider a sample conversation where a user wants to compare two policies listed by the policy_search state.

In the above conversation, the comparison can be handled very easily by adding some logic around the intent compare_policies. But real life is not so easy, as a majority of conversations are edge cases. We need to add rules manually to handle such cases, and after testing we realize that these clash with other rules we wrote earlier.
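
The following Python sketch illustrates why hand-written state machines become fragile. Apart from policy_search and compare_policies, which are mentioned above, all state and intent names are hypothetical and chosen only for illustration.

```python
# Hand-written transition table: (current state, intent) -> next state.
TRANSITIONS = {
    ("start", "greet"): "policy_search",
    ("policy_search", "select_policy"): "collect_details",
    ("policy_search", "compare_policies"): "policy_compare",
    ("policy_compare", "select_policy"): "collect_details",
    ("collect_details", "affirm"): "done",
}


def next_state(state, intent):
    # Any (state, intent) pair nobody anticipated ends up in a fallback state
    return TRANSITIONS.get((state, intent), "fallback")


def run(intents, state="start"):
    for intent in intents:
        state = next_state(state, intent)
    return state


# An anticipated path works as designed
print(run(["greet", "compare_policies", "select_policy", "affirm"]))

# The user changes their mind after selecting a policy: no rule covers it
print(run(["greet", "select_policy", "compare_policies"]))
```

Every edge case like the second conversation needs another entry in the table, and each new entry has to be checked against all the existing ones, which is the maintenance burden described above.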

The Rasa team figured out how machine learning can be used to solve this problem. They released Rasa Core, in which the logic of the bot is based on a probabilistic model trained on real conversations.

Structure of a Rasa Core App

Let’s go through a few terms we need to know to build a Rasa Core app:

1. Interpreter: An interpreter is responsible for parsing messages. It performs Natural Language Understanding and transforms the message into structured output, i.e. intent and entities. In this blog, we use a Rasa NLU model as the interpreter. Rasa NLU is part of the Rasa Stack. The Training section shows in detail how to prepare the training data and create a model.

2. Domain: To define a domain, we create a domain.yml file, which defines the universe of your bot. The following things need to be defined in a domain file:
- Intents: Things we expect the user to say. These relate mostly to Rasa NLU.
- Entities: These represent pieces of information extracted from what the user said. These also relate to Rasa NLU.
- Templates: We define some template strings that our bot can say. The format for naming a template string is utter_<intent>. These are treated as actions that the bot can take.
- Actions: The list of things the bot can do and say. We define two types of actions: those that only utter a message (templates), and customised actions in which some required logic is defined. Customised actions are defined as Python classes and are referenced in the domain file.
- Slots: These are user-defined variables that need to be tracked in a conversation. For example, to buy term insurance we need to keep track of which policy the user selects and the user’s details, so all of these come under slots.
3. Stories: In stories, we define what the bot needs to do at what point in time. Based on these stories, a probabilistic model is generated, which is used to decide which action to take next. There are two ways in which stories can be created, which are explained in the next section.

Let’s put all these pieces together. When a message arrives in a Rasa Core app, the interpreter first transforms it into structured output, i.e. intents and entities. The Tracker is the object that keeps track of the conversation state; it receives the information that a new message has come in. Then, based on the dialogue model generated from the domain and stories, the policy chooses which action to take next. The chosen action is logged by the tracker and the response is sent back to the user.
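
The message flow described above can be sketched in plain Python. This is a toy illustration and does not use the Rasa Core API: the `Tracker` class, the keyword-based `interpret` function, and the `NEXT_ACTION` lookup are all hypothetical stand-ins for the real tracker, the trained NLU model, and the trained dialogue policy.

```python
class Tracker:
    """Keeps the conversation history as an ordered list of (kind, value) events."""

    def __init__(self):
        self.events = []

    def log(self, kind, value):
        self.events.append((kind, value))

    def latest_intent(self):
        intents = [value for kind, value in self.events if kind == "user"]
        return intents[-1] if intents else None


def interpret(text):
    """Toy interpreter: keyword lookup standing in for a trained NLU model."""
    words = set(text.lower().replace("!", " ").replace(".", " ").split())
    if words & {"hi", "hello", "hey"}:
        intent = "greet"
    elif words & {"yes", "sure", "yeah"}:
        intent = "affirm"
    elif words & {"no", "nope"}:
        intent = "deny"
    else:
        intent = "unknown"
    return {"intent": intent, "entities": []}


# Toy policy: a fixed lookup standing in for the trained dialogue model
NEXT_ACTION = {
    "greet": "utter_greet",
    "affirm": "utter_very_much_so",
    "deny": "utter_decline",
}


def handle_message(text, tracker):
    parsed = interpret(text)                # 1. message -> intent and entities
    tracker.log("user", parsed["intent"])   # 2. tracker records the new message
    # 3. the policy picks the next action (here: lookup on the latest intent)
    action = NEXT_ACTION.get(tracker.latest_intent(), "utter_goodbye")
    tracker.log("action", action)           # 4. tracker logs the chosen action
    return action


tracker = Tracker()
print(handle_message("Hello there!", tracker))
print(handle_message("No thanks", tracker))
print(tracker.events)
```

The important simplification is in step 3: this lookup only sees the latest intent, whereas a trained policy can take the whole tracked history into account when it chooses the next action.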

Training and Running A Sample Bot

We will create a simple Facebook chat-bot named Secure Life which assists you in buying term life insurance. To keep the example simple, we have restricted options such as age-group, term insurance amount, etc.

There are two models we need to train in the Rasa Core app:

Rasa NLU Model: This model processes messages and converts them into a structured form of intent and entities. Create the following two files to generate the model:

data.json: Create this training file using the rasa-nlu trainer. Click here to know more about the rasa-nlu trainer.
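
If you prefer to generate the training file from a script, the sketch below builds a minimal training set in Python. It assumes the `rasa_nlu_data` / `common_examples` layout of the Rasa NLU JSON training format; the example sentences are made up, and the `build_training_data` helper is hypothetical.

```python
import json


def build_training_data(examples):
    """Wrap (text, intent) pairs in the assumed Rasa NLU JSON training layout."""
    return {
        "rasa_nlu_data": {
            "common_examples": [
                {"text": text, "intent": intent, "entities": []}
                for text, intent in examples
            ]
        }
    }


# Made-up sample sentences for some of the intents used by this bot
examples = [
    ("hey there", "greet"),
    ("yes, that works for me", "affirm"),
    ("no, not interested", "deny"),
    ("bye for now", "goodbye"),
]

training_data = build_training_data(examples)
print(json.dumps(training_data, indent=2))
```

The printed JSON can be saved as data.json. A real training set needs many more examples per intent than shown here.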

nlu_config.json: This is the configuration file.
```
{
  "pipeline": "spacy_sklearn",
  "path": "./models",
  "project": "nlu",
  "data": "./data/data.json"
}
```
Run the command below to train the rasa-nlu model:
```
$ python -m rasa_nlu.train -c nlu_config.json --fixed_model_name current
```
Dialogue Model: This model is trained on stories we define, based on which the policy will take the action. There are two ways in which stories can be generated:
- Supervised Learning: In this type of learning, we create the stories by hand, writing them directly in a file. Stories are easy to write this way, but for complex use-cases it is difficult to cover all scenarios.
- Reinforcement Learning: The user provides feedback on every decision taken by the policy. This is also known as interactive learning. It helps in including edge cases that are difficult to create by hand. How does it work? Every time the policy chooses an action to take, the user is asked whether the chosen action is correct. If the action is wrong, you can correct it on the fly and store the stories to train the model again.
Since the example is simple, we have used the supervised learning method to generate the dialogue model. Below is the stories.md file.
```
## All yes
* greet
- utter_greet
* affirm
- utter_very_much_so
* affirm
- utter_gender
* gender
- utter_coverage_duration
- action_gender
* affirm
- utter_nicotine
* affirm
- action_nicotine
* age
- action_thanks

## User not interested
* greet
- utter_greet
* deny
- utter_decline

## Coverage duration is not sufficient
* greet
- utter_greet
* affirm
- utter_very_much_so
* affirm
- utter_gender
* gender
- utter_coverage_duration
- action_gender
* deny
- utter_decline
```
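The structure of the stories file can be made explicit with a small parser sketch. It is only an illustration of the format (a `##` line starts a story, a `*` line is a user intent, and a `-` line is a bot action); it is not the loader Rasa Core uses, and `parse_stories` is a hypothetical helper.

```python
def parse_stories(markdown):
    """Parse story markdown into {story name: [(kind, name), ...]}."""
    stories = {}
    current = None
    for raw_line in markdown.splitlines():
        line = raw_line.strip()
        if line.startswith("## "):
            # a new story begins
            current = stories.setdefault(line[3:].strip(), [])
        elif line.startswith("* ") and current is not None:
            # something the user says, expressed as an intent
            current.append(("intent", line[2:].strip()))
        elif line.startswith("- ") and current is not None:
            # something the bot does in response
            current.append(("action", line[2:].strip()))
    return stories


SAMPLE = """\
## User not interested
* greet
- utter_greet
* deny
- utter_decline
"""

for name, steps in parse_stories(SAMPLE).items():
    print(name, steps)
```

Each story is therefore an ordered sequence of user intents and bot actions, which is the material the dialogue model is trained on.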
Run the command below to train the dialogue model:
```
$ python -m rasa_core.train -s <path to stories.md file> -d <path to domain.yml> -o models/dialogue --epochs 300
```
Define a Domain: Create a domain.yml file containing all the required information. Under intents and entities, list all the strings that the bot is supposed to see when the user says something, i.e. the intents and entities you defined in the Rasa NLU training file.
```
intents:
- greet
- goodbye
- affirm
- deny
- age
- gender

slots:
  gender:
    type: text
  nicotine:
    type: text
  agegroup:
    type: text

templates:
  utter_greet:
  - "hey there! welcome to Secure-Life!\nI can help you quickly estimate your rate of coverage.\nWould you like to do that ?"

  utter_very_much_so:
  - "Great! Let's get started.\nWe currently offer term plans of Rs. 1Cr. Does that suit your need?"

  utter_gender:
  - "What gender do you go by ?"

  utter_coverage_duration:
  - "We offer this term plan for a duration of 30Y. Do you think that's enough to cover entire timeframe of your financial obligations ?"

  utter_nicotine:
  - "Do you consume nicotine-containing products?"

  utter_age:
  - "And lastly, how old are you ?"

  utter_thanks:
  - "Thank you for providing all the info. Let me calculate the insurance premium based on your inputs."

  utter_decline:
  - "Sad to see you go. In case you change your plans, you know where to find me :-)"

  utter_goodbye:
  - "goodbye :("

actions:
- utter_greet
- utter_goodbye
- utter_very_much_so
- utter_coverage_duration
- utter_age
- utter_nicotine
- utter_gender
- utter_decline
- utter_thanks
- actions.ActionGender
- actions.ActionNicotine
- actions.ActionThanks
```
Define Actions: Templates defined in domain.yml are also considered actions. A sample customised action is shown below, where we set a slot named gender with a value according to the option selected by the user.
```
from rasa_core.actions.action import Action
from rasa_core.events import SlotSet


class ActionGender(Action):
    def name(self):
        return 'action_gender'

    def run(self, dispatcher, tracker, domain):
        messageObtained = tracker.latest_message.text.lower()

        # check "female" first, because "male" is a substring of "female"
        if "female" in messageObtained:
            return [SlotSet("gender", "female")]
        elif "male" in messageObtained:
            return [SlotSet("gender", "male")]
        else:
            return [SlotSet("gender", "others")]
```
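Substring checks on free text can misfire: "male" is contained in "female", and also in unrelated words such as "malevolent". A slightly more robust option is to match whole words. The sketch below is independent of Rasa Core, and `detect_gender` is a hypothetical helper whose result could be passed to SlotSet in the action above.

```python
import re


def detect_gender(message):
    """Map a free-text reply to 'male', 'female' or 'others' using whole words."""
    tokens = set(re.findall(r"[a-z]+", message.lower()))
    if "female" in tokens:
        return "female"
    if "male" in tokens:
        return "male"
    return "others"


# The substring pitfall that whole-word matching avoids
print("male" in "female")

print(detect_gender("I am Female"))
print(detect_gender("Male"))
print(detect_gender("prefer not to say"))
```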
Running the Bot

Create a Facebook app and get the app credentials. Create a bot.py file as shown below:
```
import logging

from rasa_core import utils
from rasa_core.agent import Agent
from rasa_core.interpreter import RasaNLUInterpreter
from rasa_core.channels import HttpInputChannel
from rasa_core.channels.facebook import FacebookInput

logger = logging.getLogger(__name__)


def run(serve_forever=True):
    # create rasa NLU interpreter
    interpreter = RasaNLUInterpreter("models/nlu/current")
    agent = Agent.load("models/dialogue", interpreter=interpreter)

    input_channel = FacebookInput(
        fb_verify="your_fb_verify_token",  # you need to tell Facebook this token, to confirm your URL
        fb_secret="your_app_secret",  # your app secret
        fb_tokens={"your_page_id": "your_page_token"},  # page ids + tokens you subscribed to
        debug_mode=True  # enable debug mode for underlying fb library
    )

    if serve_forever:
        agent.handle_channel(HttpInputChannel(5004, "/app", input_channel))
    return agent


if __name__ == '__main__':
    utils.configure_colored_logging(loglevel="DEBUG")
    run()
```
Run the file and your bot is ready to test. Sample conversations are provided below:

Summary

You have seen how Rasa Core has made it easier to build bots. Just create a few files and boom! Your bot is ready! Isn’t it exciting? I hope this blog gave you some insight into how Rasa Core works. Start exploring, and let us know if you need any help in building chatbots using Rasa Core.
December 12, 2022