Category: Our Insights

  • Know Everything About Spinnaker & How to Deploy Using Kubernetes Engine

    As marketed, Spinnaker is an open-source, multi-cloud continuous delivery platform that helps you release software changes with high velocity and confidence.

    Open sourced by Netflix and heavily contributed to by Google, it supports all major cloud platforms (AWS, Azure, Google App Engine, OpenStack, etc.), including Kubernetes.

    In this blog I’m going to walk you through all the basic concepts in Spinnaker and help you create a continuous delivery pipeline using Kubernetes Engine, Cloud Source Repositories, Container Builder, Resource Manager, and Spinnaker. After creating a sample application, we will configure these services to automatically build, test, and deploy it. When the application code is modified, the changes trigger the continuous delivery pipeline to automatically rebuild, retest, and redeploy the new version.

    What Does Spinnaker Provide?

    Application management and Application Deployment are its two core features.

    Application Management

    Spinnaker’s application management features can be used to view and manage your cloud resources.

    Modern tech organizations operate collections of services—sometimes referred to as “applications” or “microservices”. A Spinnaker application models this concept.

    Applications, Clusters, and Server Groups are the key concepts Spinnaker uses to describe services. Load balancers and Firewalls describe how services are exposed to users.

    Application

    • An application in Spinnaker is a collection of clusters, which in turn are collections of server groups. The application also includes firewalls and load balancers. An application represents the service which needs to be deployed using Spinnaker, all configuration for that service, and all the infrastructure on which it will run. Normally, a different application is configured for each service, though Spinnaker does not enforce that.

    Cluster

    • Clusters are logical groupings of Server Groups in Spinnaker.
    • Note: Cluster, here, does not map to a Kubernetes cluster. It’s merely a collection of Server Groups, irrespective of any Kubernetes clusters that might be included in your underlying architecture.

    Server Group

    • The base resource, the Server Group, identifies the deployable artifact (VM image, Docker image, source location) and basic configuration settings such as number of instances, autoscaling policies, metadata, etc. This resource is optionally associated with a Load Balancer and a Firewall. When deployed, a Server Group is a collection of instances of the running software (VM instances, Kubernetes pods).

    Load Balancer

    • A Load Balancer is associated with an ingress protocol and port range. Traffic is balanced among the instances present in Server Groups. Optionally, health checks can be enabled for a load balancer, with flexibility to define health criteria and specify the health check endpoint.

    Firewall

    • A Firewall defines network traffic access. It is effectively a set of firewall rules defined by an IP range (CIDR) along with a communication protocol (e.g., TCP) and port range.
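
    To make the CIDR idea concrete, here is a small illustrative sketch in Python (not Spinnaker code) showing how an IP range expressed as a CIDR block matches source addresses; the range and addresses are hypothetical:

```python
import ipaddress

# A firewall rule pairs a CIDR range with a protocol and port range.
# This sketch checks only the CIDR part: does a source IP fall in range?
ALLOWED_RANGE = ipaddress.ip_network("10.0.0.0/24")  # hypothetical rule

def is_allowed(source_ip: str) -> bool:
    """Return True if the source IP falls inside the allowed CIDR range."""
    return ipaddress.ip_address(source_ip) in ALLOWED_RANGE

print(is_allowed("10.0.0.42"))    # inside the /24 block
print(is_allowed("192.168.1.5"))  # outside the block
```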

    Application Deployment

    Pipeline

    • The pipeline is the key deployment management construct in Spinnaker. It consists of a sequence of actions, known as stages. Parameters can be passed from one stage to the next one in the pipeline.
    • You can start a pipeline manually, or configure it to be triggered automatically by an event, such as a Jenkins job completing, a new Docker image being pushed to your Docker registry, a cron-style schedule, or a stage in another pipeline.
    • You can configure the pipeline to emit notifications, by email, SMS or HipChat, to interested parties at various points during pipeline execution (such as on pipeline start/complete/fail).

    Stage

    • A Stage in Spinnaker is an atomic building block for a pipeline, describing an action that the pipeline will perform. You can sequence stages in a Pipeline in any order, though some stage sequences may be more common than others. There are different types of stages in Spinnaker, such as Deploy, Manual Judgment, Resize, Disable, and many more. You can find the full list of stages, and read about implementation details for each provider, in the Spinnaker documentation.
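
    As a conceptual sketch (not the actual Spinnaker implementation), a pipeline can be pictured as an ordered list of stages acting on a shared context, so that parameters flow from one stage to the next; the stage names and fields below are hypothetical:

```python
# Conceptual model only: a pipeline is an ordered sequence of stages,
# and each stage can read and extend a shared context (its parameters).
def deploy_stage(ctx):
    ctx["deployed_image"] = ctx["image"]  # pretend we deployed the image
    return ctx

def resize_stage(ctx):
    ctx["replicas"] = 4  # pretend we resized the server group
    return ctx

pipeline = [deploy_stage, resize_stage]  # stages run in sequence

context = {"image": "gcr.io/my-project/sample-app:v1.0.0"}
for stage in pipeline:
    context = stage(context)

print(context["replicas"])  # 4
```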

    Deployment Strategies

    • Spinnaker supports the common cloud-native deployment strategies, including Red/Black (a.k.a. Blue/Green), Rolling Red/Black, and Canary deployments.

    What is Spinnaker Made Of?

    Spinnaker is composed of a number of independent microservices:

    • Deck is the custom browser-based GUI.
    • Gate is the API gateway. All the API calls from UI (Deck) and other API callers go to Spinnaker through Gate.
    • Orca is the orchestration engine. It handles all ad-hoc operations and pipelines.
    • Clouddriver is responsible for all mutating calls to the cloud providers and for indexing/caching all deployed resources.
    • Front50 is used to persist the metadata of applications, pipelines, projects and notifications.
    • Rosco is the bakery. It helps to create machine images for various cloud vendors (for example GCE images for GCP, AMIs for AWS, Azure VM images). It currently wraps Packer, but will be expanded to support additional mechanisms for producing images.
    • Igor is used to trigger pipelines via continuous integration jobs in systems like Jenkins and Travis CI, and it allows Jenkins/Travis stages to be used in pipelines.
    • Echo is Spinnaker’s eventing bus. It supports sending notifications (e.g. Slack, email, Hipchat, SMS), and acts on incoming webhooks from services like GitHub.
    • Fiat is Spinnaker’s authorization service. It is used to query a user’s access permissions for accounts, applications and service accounts.
    • Kayenta provides automated canary analysis for Spinnaker.
    • Halyard is Spinnaker’s configuration service. Halyard manages the lifecycle of each of the above services. It only interacts with these services during Spinnaker start-up, updates, and rollbacks.

    By default, Spinnaker binds a distinct port for each of the above microservices. In our setup, the UI (Deck) is exposed on port 9000.

    What are We Going to Do?

    • Set up your environment by launching Cloud Shell, creating a Kubernetes Engine cluster, and configuring your identity and user management scheme.
    • Download a sample application, create a Git repository, and upload it to a Cloud Source Repository.
    • Deploy Spinnaker to Kubernetes Engine using Helm.
    • Build a Docker image from the source code.
    • Create triggers to build new Docker images when the application’s source code changes.
    • Configure a Spinnaker pipeline to reliably and continuously deploy your application to Kubernetes Engine.
    • Deploy a code change, triggering the pipeline, and watch it roll out to production.

     Note: This blog post uses various billable GCP components, such as GKE and Container Builder.

    Pipeline Architecture

    To continuously deliver application updates to users, companies need an automated process that reliably builds, tests, and updates their software. Code changes should automatically flow through a pipeline that includes artifact creation, unit testing, functional testing, and production rollout. In some cases, a code update should apply to only a subset of users, so that it is exercised realistically before being pushed to the entire user base. If one of these canary releases proves unsatisfactory, the automated procedure must be able to quickly roll back the software changes.

    With Kubernetes Engine and Spinnaker, we can create a robust continuous delivery flow that helps us to ensure that software is shipped as quickly as it is developed and validated. Although rapid iteration is the end goal, we must first ensure that each application revision passes through a series of automated validations before becoming a candidate for production rollout. When a given change has been vetted through automation, we can also validate the application manually and conduct further pre-release testing.

    After the team decides the application is ready for production, one of the team members can approve it for production deployment.

    Application Delivery Pipeline

    We are going to build the continuous delivery pipeline shown in the following diagram.

    Prerequisites  

    • A fair bit of experience with GCP services like:
    • GKE (Google Kubernetes Engine)
    • Google Compute
    • Google APIs
    • Cloud Source Repository
    • Container Builder
    • Cloud Storage
    • Cloud Load Balancing
    • Knowledge of K8s terminology like Services, Deployments, Pods, etc.
    • Familiarity with kubectl and the Helm package manager

    Before starting, enable the required APIs on GCP.

     Set Up a Kubernetes Cluster  

    1. Go to the Console and scroll the left panel down to Compute->Kubernetes Engine->Kubernetes Clusters.
    2. Click Create Cluster.
    3. Choose a name or leave as the default one.
    4. Under Machine Type, click Customize.
    5. Allocate at least 2 vCPU and 10GB of RAM.
    6. Change the cluster size to 2.
    7. Enable Legacy Authorization while customizing the cluster.
    8. Keep the rest of the defaults and click Create.

    In a minute or two the cluster will be created and ready to go.

    Configure identity and access management

    Create a Cloud Identity and Access Management (Cloud IAM) service account to delegate permissions to Spinnaker, allowing it to store data in Cloud Storage. Spinnaker stores its pipeline data in Cloud Storage to ensure reliability and resiliency. If our Spinnaker deployment unexpectedly fails, we can create an identical deployment in minutes with access to the same pipeline data as the original.

    1. Create the service account:

    $ gcloud iam service-accounts create spinnaker-storage-account  --display-name spinnaker-storage-account

    2.  Store the service account email address and our current project ID in environment variables for use in later commands:

    $ export SA_EMAIL=$(gcloud iam service-accounts list  --filter="displayName:spinnaker-storage-account"  --format='value(email)')
    $ export PROJECT=$(gcloud info --format='value(config.project)')

    3. Bind the storage.admin role to our service account:  

    $ gcloud projects add-iam-policy-binding $PROJECT --role roles/storage.admin --member serviceAccount:$SA_EMAIL

    4. Download the service account key. We will need this key later when installing Spinnaker, and we will also upload it to Kubernetes Engine.

    $ gcloud iam service-accounts keys create spinnaker-sa.json --iam-account $SA_EMAIL

    Deploying Spinnaker using Helm

    In this section, we will deploy Spinnaker onto the Kubernetes cluster using charts and the Kubernetes package manager, Helm. Helm makes deploying Spinnaker easy; deploying and configuring it manually via Halyard can be quite painful.

    Install Helm

    1. Download and install the helm binary:

    $ wget https://storage.googleapis.com/kubernetes-helm/helm-v2.9.0-linux-amd64.tar.gz

    2. Extract the archive and move the binary into your path:

    $ tar zxfv helm-v2.9.0-linux-amd64.tar.gz
    $ sudo chmod +x linux-amd64/helm && sudo mv linux-amd64/helm /usr/bin/helm

    3. Grant Tiller, the server side of Helm, the cluster-admin role in your cluster:

    $ kubectl create clusterrolebinding user-admin-binding  --clusterrole=cluster-admin --user=$(gcloud config get-value account)
    $ kubectl create serviceaccount tiller --namespace kube-system
    $ kubectl create clusterrolebinding tiller-admin-binding  --clusterrole=cluster-admin --serviceaccount=kube-system:tiller

    4. Grant Spinnaker the cluster-admin role so it can deploy resources across all namespaces:

    $ kubectl create clusterrolebinding spinnaker-admin --clusterrole=cluster-admin --serviceaccount=default:default

    5. Initialize Helm to install Tiller in your cluster:

    $ helm init --service-account=tiller --upgrade
    $ helm repo update

    6. Ensure that Helm is properly installed by running the following command. If Helm is correctly installed, v2.9.0 appears for both client and server.

    $ helm version

    Configure Spinnaker

    1. Create a bucket for Spinnaker to store its pipeline configuration:

    $ export PROJECT=$(gcloud info --format='value(config.project)')
    $ export BUCKET=$PROJECT-spinnaker-config
    $ gsutil mb -c regional -l us-central1 gs://$BUCKET

    2. Create the configuration file:

    $ export SA_JSON=$(cat spinnaker-sa.json)
    $ export PROJECT=$(gcloud info --format='value(config.project)')
    $ export BUCKET=$PROJECT-spinnaker-config
    $ cat > spinnaker-config.yaml <<EOF
    # Disable minio as the default
    minio:
      enabled: false

    # Configure your Docker registries here
    accounts:
    - name: gcr
      address: https://gcr.io
      username: _json_key
      password: '$SA_JSON'
      email: 1234@5678.com
    EOF

    Deploy the Spinnaker chart

    1. Use the Helm command-line interface to deploy the chart with the configuration set earlier. This command typically takes five to ten minutes to complete, so we provide a deploy timeout with `--timeout`.
    $ helm install -n cd stable/spinnaker -f spinnaker-config.yaml --timeout 600 --version 0.3.1

    After the command completes, run the following command to set up port forwarding to the Spinnaker UI from Cloud Shell:

    $ export DECK_POD=$(kubectl get pods --namespace default -l  "component=deck" -o jsonpath="{.items[0].metadata.name}")
    $ kubectl port-forward --namespace default $DECK_POD 8080:9000  >> /dev/null &

    The command above forwards the Spinnaker UI to the local machine we are using to run all the commands. We can use any port of our choosing instead of 8080. The UI can now be opened at http://localhost:8080.

    Building the Docker image

    In this section, we will configure Container Builder to detect changes to the application source code, build a Docker image, and push it to Container Registry.

    For this step, we will use a sample app provided by the Google community.

    Create your source code repository

    1. Download the source code:

    $ wget https://gke-spinnaker.storage.googleapis.com/sample-app.tgz

    2. Unpack the source code:

    $ tar xzfv sample-app.tgz

    3. Change directories to source code:

    $ cd sample-app

    4. Set the username and email address for Git commits in this repository. Replace [EMAIL_ADDRESS] with your Git email address and [USERNAME] with your Git username.

    $ git config --global user.email "[EMAIL_ADDRESS]"
    $ git config --global user.name "[USERNAME]"

    5. Make the initial commit to source code repository:

    $ git init
    $ git add .
    $ git commit -m "Initial commit"

    6. Create a repository to host the code:

    $ gcloud source repos create sample-app
    $ git config credential.helper gcloud.sh

    7. Add our newly created repository as remote:

    $ export PROJECT=$(gcloud info --format='value(config.project)')
    $ git remote add origin  https://source.developers.google.com/p/$PROJECT/r/sample-app

    8. Push the code to the new repository’s master branch:

    $ git push origin master

    9. Check that we can see our source code in the console.

    Configuring the build triggers  

    In this section, we configure Google Container Builder to build and push our Docker images every time we push Git tags to our source repository. Container Builder automatically checks out the source code, builds the Docker image from the Dockerfile in the repository, and pushes that image to Container Registry.

    1. In the GCP Console, click Build Triggers in the Container Registry section.
    2. Select Cloud Source Repository and click Continue.
    3. Select your newly created sample-app repository from the list, and click Continue.
    4. Set the following trigger settings:
       • Name: sample-app-tags
       • Trigger type: Tag
       • Tag (regex): v.*
       • Build configuration: cloudbuild.yaml
       • cloudbuild.yaml location: /cloudbuild.yaml
    5. Click Create trigger.

    From now on, whenever we push a Git tag prefixed with the letter “v” to source code repository, Container Builder automatically builds and pushes our application as a Docker image to Container Registry.
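
    To see what the v.* trigger regex actually matches, here is a small Python check (illustrative only; the tag names are made up):

```python
import re

# The build trigger fires for Git tags matching the regex "v.*",
# i.e. any tag that starts with the letter "v".
trigger = re.compile(r"v.*")

tags = ["v1.0.0", "v2-beta", "release-1.0", "1.0.0"]
matching = [t for t in tags if trigger.match(t)]
print(matching)  # ['v1.0.0', 'v2-beta']
```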

    Let’s build and push our first image using the following steps:

    1. Go to source code folder in Cloud Shell.

    2. Create a Git tag:

    $ git tag v1.0.0

    3. Push the tag:  

    $ git push --tags

    4. In Container Registry, click Build History to check that the build has been triggered. If not, verify the trigger was configured properly in the previous section.

    Configuring your deployment pipelines

    Now that our images are building automatically, we need to deploy them to the Kubernetes cluster.

    We deploy to a scaled-down environment for integration testing. After the integration tests pass, we must manually approve the changes to deploy the code to production services.

    Create the application

    1. In the Spinnaker UI, click Actions, then click Create Application.

    2. In the New Application dialog, enter the following fields:

    1. Name: sample
    2. Owner Email: [your email address]

    3. Click Create.

    Create service load balancers

    To avoid having to enter the information manually in the UI, use the Kubernetes command-line interface to create load balancers for the services. Alternatively, we can perform this operation in the Spinnaker UI.

    On the local machine where the code resides, run the following command from the sample-app root directory:

    $ kubectl apply -f k8s/services

    Create the deployment pipeline

    Now we create the continuous delivery pipeline. The pipeline is configured to detect when a Docker image with a tag prefixed with “v” has arrived in your Container Registry.

    1. Create a new pipeline, named, say, “Deploy”.

    2. Go to the Config page for the pipeline that we just created and click Pipeline Actions -> Edit as JSON.

    3. Change to the source code directory and update pipeline-deploy.json, at the path spinnaker/pipeline-deploy.json, for our project:

    $ export PROJECT=$(gcloud info --format='value(config.project)')
    $ sed s/PROJECT/$PROJECT/g spinnaker/pipeline-deploy.json > spinnaker/updated-pipeline-deploy.json

    4. Now, in the JSON editor, paste the entire contents of spinnaker/updated-pipeline-deploy.json.

    5. Click Update Pipeline; the pipeline configuration is now updated.

    6. In the Spinnaker UI, click Pipelines on the top navigation bar.

    7. Click Configure in the Deploy pipeline.

    8. The continuous delivery pipeline configuration appears in the UI:

    Running the pipeline manually

    The configuration we just created contains a trigger to start the pipeline when a new Git tag containing the prefix “v” is pushed. Now we test the pipeline by running it manually.  

    1. Return to the Pipelines page by clicking Pipelines.

    2. Click Start Manual Execution.

    3. Select the v1.0.0 tag from the Tag drop-down list, then click Run.

    4. After the pipeline starts, click Details to see more information about the build’s progress. This section shows the status of the deployment pipeline and its steps. Steps in blue are currently running, green ones have completed successfully, and red ones have failed. Click a stage to see details about it.

    5. After 3 to 5 minutes the integration test phase completes and the pipeline requires manual approval to continue the deployment.

    6. Hover over the yellow “person” icon and click Continue.

    7. Your rollout continues to the production frontend and backend deployments. It completes after a few minutes.

    8. To view the app, click Load Balancers in the top right of the Spinnaker UI.

    9. Scroll down the list of load balancers and click Default, under sample-frontend-prod.  

    10. Scroll down the details pane on the right and copy the application’s IP address by clicking the clipboard button next to the Ingress IP.

    11. Paste the address into the browser to view the production version of the application.

    12. We have now manually triggered the pipeline to build, test, and deploy the application.

    Triggering the pipeline automatically via code changes

    Now let’s test the pipeline end to end by making a code change, pushing a Git tag, and watching the pipeline run in response. By pushing a Git tag that starts with “v”, we trigger Container Builder to build a new Docker image and push it to Container Registry. Spinnaker detects that the new image tag begins with “v” and triggers a pipeline to deploy the image to canaries, run tests, and roll out the same image to all pods in the deployment.

    1. Change the colour of the app from orange to blue: 

    $ sed -i 's/orange/blue/g' cmd/gke-info/common-service.go


    2. Tag your change and push it to the source code repository:

    $ git commit -a -m "Change colour to blue"
    $ git tag v1.0.1
    $ git push --tags


    3. See the new build appear in the Container Builder Build History

    4. Click Pipelines to watch the pipeline start to deploy the image. 

    5. Observe the canary deployments. When the deployment is paused, waiting to roll out to production, start refreshing the tab that contains the application. Nine of the backends are running the previous version of the application, while only one backend is running the canary. We should see the new, blue version of the application appear about every tenth time we refresh.

    6. After testing completes, return to the Spinnaker tab and approve the deployment. 

    7. When the pipeline completes, the application looks like the following screenshot. Note that the colour has changed to blue because of the code change, and that the Version field now reads v1.0.1.

    8. We have now successfully rolled out the application to the entire production environment!

    9. Optionally, we can roll back this change by reverting the previous commit. Rolling back adds a new tag (v1.0.2) and pushes it back through the same pipeline we used to deploy v1.0.1:

    $ git revert v1.0.1
    $ git tag v1.0.2
    $ git push --tags


    Conclusion

    Now that you know how to get Spinnaker up and running in a development environment, you can start using it for your own deployments. In this blog, we covered everything from creating a Kubernetes cluster on GCP to deploying an end-to-end, production-style delivery pipeline. Hope you found it helpful.

    References

    https://cloud.google.com/solutions/continuous-delivery-spinnaker-kubernetes-engine

  • How to Load Unstructured Data into Apache Hive

    In today’s world, a huge amount of data is generated every day. Traditional tools can’t process data that is this large and complex; such huge volumes of complex data are simply called Big Data. By converting this raw data into meaningful insights, organizations can make better decisions about their products. We need dedicated tools to convert raw data into meaningful information or knowledge, and thankfully, such tools exist.

    Hadoop is one of the most popular frameworks used to process and store Big Data. Hive, in turn, is a tool designed to be used alongside Hadoop. In this blog, we are going to discuss the different ways we can load semi-structured and unstructured data into Hive. We will also discuss what Hive is, how it works, and how Hive’s performance differs when working with structured vs. semi-structured vs. unstructured data.

    What is Hive?

    Hive is a data warehousing infrastructure tool developed on top of the Hadoop Distributed File System (HDFS), though Hive can be used on top of any distributed file system. Hive uses Hive Query Language (HQL), which is very similar to Structured Query Language (SQL). If you are familiar with SQL, it is much easier to get started with HQL.

    It is used for querying and analyzing large amounts of data distributed over the Hadoop Distributed File System (HDFS). Hive supports reading, writing, and managing large datasets residing in HDFS. Hive is mostly used for structured data, but in this blog we will see how we can load unstructured data as well.

    Initially, Hive was developed at Facebook(Meta), and later it became an open-source project of Apache Software Foundation.

    How does Hive work?

    Source – AnalyticsVidya

    Hive was created to allow non-programmers familiar with SQL to work with large datasets, using HQL, an interface similar to SQL. Traditional databases are designed for small or medium datasets, not large ones, but Hive uses a distributed file system and batch processing to process large datasets very efficiently.

    Hive transforms HQL queries into one or more Map-Reduce or Tez jobs, and these jobs then run on Hadoop’s scheduler, YARN. Essentially, HQL is an abstraction over Map-Reduce programs. After a job/query executes, the resulting data is stored in HDFS.
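
    To give a feel for what Hive abstracts away, here is a tiny map-reduce-style word count in plain Python (a conceptual sketch, not Hadoop code); a one-line HQL aggregate would compile down to map and reduce steps like these:

```python
from collections import defaultdict

# Map step: emit (word, 1) pairs, like a Hadoop mapper would.
lines = ["spark log info", "spark error", "info info"]
pairs = [(word, 1) for line in lines for word in line.split()]

# Reduce step: sum the counts per key, like a Hadoop reducer would.
counts = defaultdict(int)
for word, n in pairs:
    counts[word] += n

print(counts["info"])   # 3
print(counts["spark"])  # 2
```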

    What is SerDe in Hive?

    SerDe is short for “Serializer and Deserializer” in Hive. It is an important topic for this blog, so you should have a basic understanding of what SerDe is and how it works.

    If not, don’t worry; first we will understand what serialization and deserialization are. Serialization is the process of converting a data object into a byte stream (a binary format) so that the object can be transmitted over a network or written into persistent storage like HDFS.

    Now we can transmit data objects over a network or write them into persistent storage. But how do we make sense of the transmitted data on the receiving end, given that raw binary data is not directly readable? The reverse process, converting a byte stream or binary data back into objects, is called deserialization.

    In Hive, tables are converted into row objects and row objects are written into HDFS using a Built-in Hive Serializer. And these row objects are converted back into tables using a Built-in Hive Deserializer.
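
    As a simple analogy in Python (not Hive’s internal code), serializing a row object to bytes and deserializing it back looks like this:

```python
import json

# Serialization: convert a row object into a byte stream that could be
# written to persistent storage such as HDFS or sent over a network.
row = {"roll_number": 101, "name": "Asha"}  # hypothetical row
serialized = json.dumps(row).encode("utf-8")  # object -> bytes

# Deserialization: convert the byte stream back into a row object.
restored = json.loads(serialized.decode("utf-8"))
print(restored == row)  # True
```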

    Built-in SerDes:

    • Avro (Hive 0.9.1 and later)

    • ORC (Hive 0.11 and later)

    • RegEx

    • Thrift

    • Parquet (Hive 0.13 and later)

    • CSV (Hive 0.14 and later)

    • JsonSerDe (Hive 0.12 and later in hcatalog-core)

    Now, suppose we have data in a format that Hive’s built-in SerDes can’t process. In such a scenario, we can write our own custom SerDe, in which we define how a row of data is converted to and from a table record. But writing a custom SerDe is a complicated process.

    There is another way: we can use RegexSerDe, which serializes and deserializes data using a regular expression. RegexSerDe extracts capturing groups as columns. In a regular expression, capturing groups are a way to treat multiple characters as a single unit; groups are created by placing parentheses around them. For example, the regular expression “(velotio)” creates a single group containing the characters “v,” “e,” “l,” “o,” “t,” “i,” and “o.”
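
    The same capturing-group idea can be demonstrated in Python (RegexSerDe itself is configured in HQL, but the group-to-column mapping works the same way); the sample line and pattern here are illustrative:

```python
import re

# Each capturing group in the pattern becomes one extracted "column",
# which is how RegexSerDe maps groups to Hive table columns.
line = "17/06/09 20:10:41 INFO Remoting: Starting remoting"
pattern = re.compile(r"^(\S+ \S+) (\w+) ([^:]+): (.*)$")

timestamp, level, component, message = pattern.match(line).groups()
print(level)      # INFO
print(component)  # Remoting
```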

    This is just an overview of SerDe in Hive; you can dive deeper into it elsewhere. The following image shows how Hive reads and writes records.

    Source: Dummies.com

    Types of data:

    Big data can be classified in three ways:

    Structured data:

    The data that can be organized into a well-defined structure is called Structured Data. Structured data can be easily stored, read, or transferred in the same defined structure. The best example of structured data is the table stored in Relational Databases. Tables have columns and rows that define a well-organized and fixed structure to data. Another example of structured data is an Excel file. An Excel file also has rows and columns that define a proper structure to data.

    Source: O’Reilly

    Semi-structured data:

    The data that cannot be organized into a fixed structure like a table, but can be represented with properties such as tags, metadata, or other markers that separate data fields, is called semi-structured data. Examples of semi-structured data are JSON and XML files. JSON files contain key-value pairs, where the key is a tag and the value is the actual data to be stored.

    Source: Software Testing Help

    Unstructured data:

    The data that cannot be organized into any structure is called unstructured data. Social media messages fall under this category, as they cannot be organized into a fixed structure like a table, or even with tags or markers that separate data fields. More examples of unstructured data include text files and multimedia content like images and videos.

    Source: Fluxicon

    Performance impact of working with structured vs. semi-structured vs. unstructured data

    Storage:

    Structured data is typically stored in an RDBMS and has the highest organization level of the three. Semi-structured data has no rigid schema but carries properties or tags; it has a lower organization level than structured data but a higher one than unstructured data. Unstructured data has no schema at all, so it has the lowest organization level.

    Data manipulation:

    Data manipulation includes updating and deleting data. Consider an example where we want to update the name of a student using his roll number. Data manipulation in structured data is easy to perform: because we have a defined structure, we can manipulate specific records very easily, such as updating the student’s name by his roll number. In unstructured data, there is no schema available, so manipulating it is much harder than manipulating structured data.

    Searching of data:

    Searching for particular data in structured data is easy compared to searching in unstructured data. In unstructured data, we need to go through every line and every word, so searching becomes complex. Searching semi-structured data is also relatively easy, as we just need to specify the key to get the data.

    Scaling of data:

    Scaling structured data is very hard. It can be scaled vertically by adding RAM or CPUs to an existing machine, but scaling it horizontally is difficult. However, scaling semi-structured and unstructured data is easy.

    Data sets we are using:

    1) Video_Games_5.json

    This dataset contains product reviews and metadata from Amazon, including 142.8 million reviews spanning May 1996 – July 2014. Reviews (ratings, text, helpfulness votes), item metadata (descriptions, category information, price, brand, and image features), and links (also viewed/also bought graphs) are all included in this dataset. The dataset represents data in a semi-structured format.

    The following image shows one record of the entire dataset.

    {
      "reviewerID": "A2HD75EMZR8QLN",
      "asin": "0700099867",
      "reviewerName": "123",
      "helpful": [
        8,
        12
      ],
      "reviewText": "Installing the game was a struggle (because of games for windows live bugs).Some championship races and cars can only be \"unlocked\" by buying them as an addon to the game. I paid nearly 30 dollars when the game was new. I don't like the idea that I have to keep paying to keep playing.I noticed no improvement in the physics or graphics compared to Dirt 2.I tossed it in the garbage and vowed never to buy another codemasters game. I'm really tired of arcade style rally/racing games anyway.I'll continue to get my fix from Richard Burns Rally, and you should to. :)http://www.amazon.com/Richard-Burns-Rally-PC/dp/B000C97156/ref=sr_1_1?ie=UTF8&qid;=1341886844&sr;=8-1&keywords;=richard+burns+rallyThank you for reading my review! If you enjoyed it, be sure to rate it as helpful.",
      "overall": 1,
      "summary": "Pay to unlock content? I don't think so.",
      "unixReviewTime": 1341792000,
      "reviewTime": "07 9, 2012"}
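    This record also illustrates the earlier point about searching semi-structured data: once the JSON is parsed, every field is reachable by key. A minimal Python sketch, using a trimmed copy of the record above:

```python
import json

# A trimmed copy of the review record shown above.
record = json.loads("""
{
  "reviewerID": "A2HD75EMZR8QLN",
  "asin": "0700099867",
  "helpful": [8, 12],
  "overall": 1,
  "summary": "Pay to unlock content? I don't think so."
}
""")

# With semi-structured data, we just specify the key to get the value,
# instead of scanning every line as we would with unstructured text.
print(record["reviewerID"])   # A2HD75EMZR8QLN
print(record["overall"])      # 1
print(record["helpful"])      # [8, 12] -> 8 of 12 voters found the review helpful
```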

    2) sparkLog.txt

    Apache Spark (https://spark.apache.org) is a unified analytics engine for big data processing, with built-in modules for streaming, SQL, machine learning, and graph processing. Currently, Spark has been widely deployed in the industry.

    The dataset was collected by aggregating logs from the Spark system in a lab at CUHK, which comprises a total of 32 machines. The logs are aggregated at the machine level and are provided as-is, without further modification or labeling; they cover both normal and abnormal application runs.

    The dataset represents data in an unstructured manner.

    17/06/09 20:10:40 INFO executor.CoarseGrainedExecutorBackend: Registered signal handlers for [TERM, HUP, INT]
    17/06/09 20:10:40 INFO spark.SecurityManager: Changing view acls to: yarn,curi
    17/06/09 20:10:40 INFO spark.SecurityManager: Changing modify acls to: yarn,curi
    17/06/09 20:10:40 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(yarn, curi); users with modify permissions: Set(yarn, curi)
    17/06/09 20:10:41 INFO spark.SecurityManager: Changing view acls to: yarn,curi
    17/06/09 20:10:41 INFO spark.SecurityManager: Changing modify acls to: yarn,curi
    17/06/09 20:10:41 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(yarn, curi); users with modify permissions: Set(yarn, curi)
    17/06/09 20:10:41 INFO slf4j.Slf4jLogger: Slf4jLogger started
    17/06/09 20:10:41 INFO Remoting: Starting remoting
    17/06/09 20:10:41 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkExecutorActorSystem@mesos-slave-07:55904]

    Now, we will look into different ways of loading semi-structured and unstructured data into Hive.

    How to load semi-structured data into Hive?

    1) Using Spark

    If you are familiar with Spark, loading semi-structured data into Spark is very easy. Spark can read JSON and XML files and convert them into Spark DataFrames. In Spark, a DataFrame is a distributed collection of data organized into rows and columns, logically similar to a table in a relational database.

    Now, we have our semi-structured data in an organized way. We can now write this organized DataFrame into Hive as a table from Spark.

    Below is the code to read the JSON file and write it as a table in Hive.

    from pyspark.sql import SparkSession
    
    ## creating sparkSession to get entrypoint to spark application
    sparkSession = SparkSession\
     .builder\
     .appName('Write_table_to_hive')\
     .enableHiveSupport()\
     .getOrCreate()
    
    ## reading data from dataset "Video_Games_5.json"
    GamesReviewDataFrame = sparkSession.read.format("json") \
              .option("path", "/home/velotio/Downloads/UnstructuredData/Video_Games_5.json")\
              .load()
    
    ## we can modify data the way we want to represent in table here 
    GamesReviewDataFrame.show()
    
    ## writing dataframe "GamesReviewDataFrame" as a table in HIVE.
    GamesReviewDataFrame.write.saveAsTable("GameReviewTable")
    
    sparkSession.stop()

    Output for above code:

    As you can see in the output, a few records of the DataFrame are displayed in an organized table.

    2) Using built-in SerDe, JSON SerDe

    Hive provides a few built-in SerDes that we can use to load data. In our case, the semi-structured dataset is the JSON file Video_Games_5.json, so we will use a JsonSerDe, which reads data in JSON format, to load it into Hive.

    We will need to add JsonSerDe.jar to Hive.

    You can download JsonSerDe here.

    1) Copy dataset Video_Games_5.json from the local file system to the docker container.

    To load data into the Hive table, we need to copy the dataset Video_Games_5.json into HDFS. As we are running HDFS and Hive in the docker container, we will need to copy this dataset from the Local File System to the docker container.

    docker cp /home/velotio/Downloads/UnstructuredData/Video_Games_5.json 0fde53f41006:/aniket

    2) Copy dataset Video_Games_5.json from a docker container to the HDFS file system.

    ls
    ls /aniket
    hdfs dfs -ls /
    hdfs dfs -ls /aniket
    hdfs dfs -put /aniket/Video_Games_5.json /aniket
    hdfs dfs -ls /aniket

    3) Copy json-serde.jar from the local file system to the docker container

    To use JsonSerDe, add the json-serde.jar file to Hive so that Hive can use it.

    We store this json-serde.jar file to HDFS storage where our dataset is also present. As Hive is running on top of HDFS, we can access the HDFS path from Hive. But to store json-serde.jar on HDFS, the file needs to be present in the docker container. For that, we copy json-serde.jar to the docker container first.

    docker cp /home/velotio/Downloads/json-serde-1.3.7.3.jar 0fde53f41006:/aniket

    4) Copy json-serde.jar from a docker container to the HDFS file system.

    ls /aniket
    hdfs dfs -ls /aniket
    
    hdfs dfs -put /aniket/json-serde-1.3.7.3.jar /aniket
    hdfs dfs -ls /aniket

    5) Add json-serde.jar file to Hive

    ADD JAR hdfs:///aniket/json-serde-1.3.7.3.jar;

    6) Create Hive table GameReviews

    To load data into Hive and define the structure of our data, we must create a table in Hive before loading the data. The table holds the data in an organized manner.

    While creating the table, we are specifying “row format serde,” which tells Hive to use the provided SerDe for reading and writing Hive data.

    create table GameReviews(
    reviewerID string,
    asin string,
    reviewerName string,
    helpful array<int>,
    reviewText string,
    overall int,
    summary string,
    unixReviewTime int,
    reviewTime string
    )
    row format serde 'org.openx.data.jsonserde.JsonSerDe'
    stored as textfile;

    7) Load data from the Video_Games_5.json dataset into the table.

    We are loading data from Video_Games_5.json into the Hive table. With the help of SerDe provided while creating a table, Hive will parse this data and load it into the table.

    load data inpath '/aniket/Video_Games_5.json' into table GameReviews;

    8) Check data from the table.

    Just cross-check if the data is loaded properly into the table.

    select reviewerID,asin,reviewerName,overall from GameReviews limit 10;

    How to load unstructured data into Hive?

    1. Using Regex SerDe

    For unstructured data, none of the built-in SerDes will work, with the exception of RegexSerDe. To load unstructured data into Hive, we can use RegexSerDe. First, we need to figure out which parts of the unstructured data are useful. Once we know that, we can extract the data by pattern matching with regular expressions. Using regular expressions, we will load the unstructured sparkLog.txt dataset into Hive.

    In our case, we are going to use the following regular expression:

    “([0-9]{2}/[0-9]{2}/[0-9]{2}) ([0-9]{2}:[0-9]{2}:[0-9]{2}) [a-zA-Z]* ([a-zA-Z0-9.]*): (.*)$”

    “([0-9]{2}/[0-9]{2}/[0-9]{2})”: First group in Regular Expression matches date values.

    “([0-9]{2}:[0-9]{2}:[0-9]{2})”: Second Group in Regular Expression matches timestamp values.

    “[a-zA-Z]*”: This pattern matches any string with multiple occurrences of char a to z and A to Z; this pattern will be ignored in the Hive table as we are not collecting this pattern as a group.

    “([a-zA-Z0-9.]*):”: The third group matches multiple occurrences of the characters a to z, A to Z, 0 to 9, and “.”, followed by a literal colon.

    “(.*)$”: Fourth and last group matches with all characters in the remaining string.
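    Before handing this pattern to RegexSerDe, it helps to verify the four capture groups against a real line from sparkLog.txt. A quick Python check, using the first log entry shown earlier:

```python
import re

# The same pattern we will pass to RegexSerDe as "input.regex".
pattern = r"([0-9]{2}/[0-9]{2}/[0-9]{2}) ([0-9]{2}:[0-9]{2}:[0-9]{2}) [a-zA-Z]* ([a-zA-Z0-9.]*): (.*)$"

line = ("17/06/09 20:10:40 INFO executor.CoarseGrainedExecutorBackend: "
        "Registered signal handlers for [TERM, HUP, INT]")

date, time, component, action = re.match(pattern, line).groups()
print(date)       # 17/06/09
print(time)       # 20:10:40
print(component)  # executor.CoarseGrainedExecutorBackend
print(action)     # Registered signal handlers for [TERM, HUP, INT]
```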

    1) Copy dataset sparkLog.txt from local file system to docker container.

    To load data into the Hive table, we need the dataset sparkLog.txt in HDFS. As we are running HDFS and Hive in the docker container, we will need to copy this dataset from the local file system to the docker container first.

    docker cp /home/velotio/Downloads/UnstructuredData/sparkLog.txt 6d94029a1f34:/aniket

    2) Copy dataset sparkLog.txt from a docker container to HDFS file system.

    ls
    ls /aniket
    hdfs dfs -ls /
    hdfs dfs -ls /aniket
    hdfs dfs -put /aniket/sparkLog.txt /aniket
    hdfs dfs -ls /aniket

    3) Create a Hive table sparkLog.

    To load data into Hive and define the structure to our data, we must create a table in Hive before loading the data. The table holds the data in an organized manner.

    While creating the table, we are specifying “row format SerDe,” which tells Hive to use the provided SerDe for reading and writing Hive data. For RegexSerDe, we must specify serdeproperties: “input.regex” and “output.format.string.”

    create table sparkLog
    (
        datedata string
        ,time string
        ,component string
        ,action string
    )
    row format serde 'org.apache.hadoop.hive.serde2.RegexSerDe'
    with serdeproperties
    (
        "input.regex" = "([0-9]{2}/[0-9]{2}/[0-9]{2}) ([0-9]{2}:[0-9]{2}:[0-9]{2}) [a-zA-Z]* ([a-zA-Z0-9.]*): (.*)$",
        "output.format.string" = "%1$s %2$s %3$s %4$s"
    )
    STORED AS TEXTFILE;

    4) Load data from the sparkLog.txt dataset into the table.

    We are loading data from sparkLog.txt into the Hive table. With the help of the SerDe provided while creating the table, Hive will parse this data and will load it into the table.

    load data inpath '/aniket/sparkLog.txt' into table sparkLog;

    5) Check the data from the table.

    Cross-check if the data is loaded properly into the table.

    select * from sparkLog;

    2) Using HQL functions

    We have already seen how to use RegexSerDe to load unstructured data into Hive. But what if you are not familiar with regular expressions, or can't write a regular expression complex enough to match the patterns in a string? There is another way to load unstructured data into Hive: using HQL's built-in string functions.

    What we need to do is create a dummy table and load the unstructured data into it as-is, in a single column named “line.” The first record of the line column will contain the first line of the dataset, the second record will contain the second line, and so on until the entire dataset is loaded into the dummy table.

    Now, by applying HQL functions to the dummy table, we can write specific data to specific columns of the main table using an “insert into” statement. You should be able to extract the data you want with these functions.

    1) Copy dataset sparkLog.txt from local file system to docker container

    To load data into the Hive table, we need the dataset sparkLog.txt in HDFS. As we are running HDFS and Hive in the docker container, we will need to copy this dataset from the local file system to the docker container.

    docker cp /home/velotio/Downloads/UnstructuredData/sparkLog.txt 6d94029a1f34:/aniket

    2) Copy the dataset sparkLog.txt from a docker container to HDFS file system.

    ls
    ls /aniket
    hdfs dfs -ls /
    hdfs dfs -ls /aniket
    hdfs dfs -put /aniket/sparkLog.txt /aniket
    hdfs dfs -ls /aniket

    3) Create Hive table log

    We are creating a dummy Hive table named log. We are specifying “row format delimited lines terminated by '\n',” which tells Hive to use the default field delimiter and '\n' as the line delimiter.

    create table if not exists log
    (
        line string
    )
    row format delimited
    lines terminated by '\n'
    STORED AS TEXTFILE;

    4) Load the data from the sparkLog.txt dataset into the table log.

    We are loading data from sparkLog.txt into Hive table log.

    load data inpath '/aniket/sparkLog.txt' into table log;

    5) Create Hive table sparkLog.

    We are creating a Hive table sparkLog to keep our organized data. This organized data will be extracted from a dummy Hive table log.

    create table sparkLog
    (
        datedata string
        ,time string
        ,component string
        ,action string
    )
    row format delimited
    lines terminated by '\n'
    STORED AS TEXTFILE;

    6) Parse the data from the log table using HQL string functions and insert the records into the sparkLog table.

    We are using HQL string functions to extract the specific data and insert it into our sparkLog table using an insert into statement.

    insert into sparkLog select
    split(line, ' ')[0] as datedata,
    split(line, ' ')[1] as timedata,
    split(split(line, ': ')[0],' ')[3] as component,
    split(line, ': ')[1] as action
    from log;
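    The split() calls above can be mirrored with plain Python string methods (an illustrative stand-in for the HQL functions) to see what each expression pulls out of a raw log line:

```python
# Mirrors the HQL expressions in the insert statement above.
line = ("17/06/09 20:10:40 INFO executor.CoarseGrainedExecutorBackend: "
        "Registered signal handlers for [TERM, HUP, INT]")

datedata  = line.split(' ')[0]                 # split(line, ' ')[0]
timedata  = line.split(' ')[1]                 # split(line, ' ')[1]
component = line.split(': ')[0].split(' ')[3]  # split(split(line, ': ')[0], ' ')[3]
action    = line.split(': ')[1]                # split(line, ': ')[1]

print(datedata, timedata)  # 17/06/09 20:10:40
print(component)           # executor.CoarseGrainedExecutorBackend
print(action)              # Registered signal handlers for [TERM, HUP, INT]
```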

    7) Check data from the table.

    Crosscheck if the data is loaded properly into the table.

    select * from sparkLog;

    Summary

    After going through this blog, you should be more familiar with Hive, its architecture, and how to use different serializers and deserializers. You can now load not only structured data but also semi-structured and unstructured data into Hive. If you are interested in learning more about Apache Hive, the resources below are a good place to start.

    1. Hive Tutorial
    2. LanguageManual
    3. Hive Wiki Pages

  • Making Your Terminal More Productive With Z-Shell (ZSH)

    When working with servers or command-line-based applications, we spend most of our time on the command line. A good-looking and productive terminal is better in many aspects than a GUI (Graphical User Interface) environment since the command line takes less time for most use cases. Today, we’ll look at some of the features that make a terminal cool and productive.

    You can use the following steps on Ubuntu 20.04. If you are using a different operating system, your commands will likely differ. If you’re using Windows, you can choose between Cygwin, WSL, and Git Bash.

    Prerequisites

    Let’s upgrade the system and install some basic tools needed.

    sudo apt update && sudo apt upgrade
    sudo apt install build-essential curl wget git

    Z-Shell (ZSH)

    Zsh is an extended Bourne shell with many improvements, including some features of Bash and other shells.

    Let’s install Z-Shell:

    sudo apt install zsh

    Make it our default shell for our terminal:

    chsh -s $(which zsh)

    Now restart the system and open the terminal again to be welcomed by ZSH. Unlike other shells such as Bash, ZSH requires some initial configuration, so it asks a few configuration questions the first time it starts and saves the answers in a file called .zshrc in the home directory (/home/user, where user is the current system user).

    For now, we’ll skip the manual work and get a head start with the default configuration. Press 2, and ZSH will populate the .zshrc file with some default options. We can change these later.  

    The initial configuration setup can be run again as shown in the below image:

    Oh-My-ZSH

    Oh-My-ZSH is a community-driven, open-source framework for managing your ZSH configuration. It comes with many plugins and helpers and can be installed with a single command, as shown below.

    Installation

    sh -c "$(wget https://raw.github.com/ohmyzsh/ohmyzsh/master/tools/install.sh -O -)"

    It takes a backup of your existing .zshrc in a file called .zshrc.pre-oh-my-zsh, so whenever you uninstall Oh-My-ZSH, the backup is restored automatically.

    Font

    A good terminal needs good fonts. We'll use the Terminess Nerd Font to make our terminal look awesome; it can be downloaded here. Once downloaded, extract the fonts and move them to ~/.local/share/fonts to make them available for the current user, or to /usr/share/fonts to make them available for all users.

    unzip Terminess.zip
    mv *.ttf ~/.local/share/fonts

    Once the font is installed, it will look like:

    Among all the things Oh-My-ZSH provides, two are community favorites: plugins and themes.

    Theme

    My go-to ZSH theme is powerlevel10k because it’s flexible, provides everything out of the box, and is easy to install with one command as shown below:

    git clone --depth=1 https://github.com/romkatv/powerlevel10k.git ${ZSH_CUSTOM:-$HOME/.oh-my-zsh/custom}/themes/powerlevel10k

    To use this theme, set ZSH_THEME="powerlevel10k/powerlevel10k" in .zshrc.

    Close the terminal and start it again. Powerlevel10k will welcome you with the initial setup, go through the setup with the options you want. You can run this setup again by executing the below command:

    p10k configure

    Tools and plugins we can’t live without

    Plugins are enabled through the plugins array in the .zshrc file. For every plugin you want to use from the list below, add it to that array.

    ZSH-Syntax-Highlighting

    This enables the highlighting of commands as you type and helps you catch syntax errors before you execute them:

    As you can see, “ls” is in green but “lss” is in red.

    Execute the below command to install it:

    git clone https://github.com/zsh-users/zsh-syntax-highlighting.git ${ZSH_CUSTOM:-~/.oh-my-zsh/custom}/plugins/zsh-syntax-highlighting

    ZSH Autosuggestions

    This suggests commands as you type based on your history:

    The below command is how you can install it by cloning the git repo:

    git clone https://github.com/zsh-users/zsh-autosuggestions ${ZSH_CUSTOM:-~/.oh-my-zsh/custom}/plugins/zsh-autosuggestions

    ZSH Completions

    For some extra Zsh completion scripts, execute the below command:

    git clone https://github.com/zsh-users/zsh-completions ${ZSH_CUSTOM:=~/.oh-my-zsh/custom}/plugins/zsh-completions 

    autojump

    It’s a faster way of navigating the file system; it works by maintaining a database of directories you visit the most. More details can be found here.

    sudo apt install autojump 

    You can also use the plugin Z as an alternative if you’re not able to install autojump or for any other reason.

    Internal Plugins

    Some plugins come installed with oh-my-zsh, and they can be included directly in .zshrc file without any installation.

    copyfile

    It copies the content of a file to the clipboard.

    copyfile test.txt

    copypath

    It copies the absolute path of the current directory to the clipboard.

    copybuffer

    This plugin copies the command that is currently typed in the command prompt to the clipboard. It works with the keyboard shortcut CTRL + o.

    sudo

    Sometimes, we forget to prefix a command with sudo, but that can be done in just a second with this plugin. When you hit the ESC key twice, it will prefix the command you’ve typed in the terminal with sudo.

    web-search

    This adds some aliases for searching with Google, Wikipedia, etc. For example, if you want to web-search with Google, you can execute the below command:

    google oh my zsh

    Doing so will open this search in Google:

    More details can be found here.

    Remember, you have to add each of these plugins to the .zshrc file as well. In the end, this is how the plugins array in the .zshrc file should look:

    plugins=(
            zsh-autosuggestions
            zsh-syntax-highlighting
            zsh-completions
            autojump
            copyfile
            copypath
            copybuffer
            history
            dirhistory
            sudo
            web-search
            git
    ) 

    You can add more plugins, like docker, heroku, kubectl, npm, jsontools, etc., if you’re a developer. There are plugins for system admins as well or for anything else you need. You can explore them here.

    Enhancd

    Enhancd is a next-gen way to navigate the file system from the CLI. It works with a fuzzy finder, so we'll install fzf for this purpose.

    sudo apt install fzf

    Enhancd can be installed with zplug plugin manager for Zsh, so first we’ll install zplug with the below command:

    curl -sL --proto-redir -all,https https://raw.githubusercontent.com/zplug/installer/master/installer.zsh | zsh

    Append the following to .zshrc:

    source ~/.zplug/init.zsh
    zplug load

    Now close your terminal, open it again, and use zplug to install enhancd:

    zplug "b4b4r07/enhancd", use:init.sh

    Aliases

    As a developer, I need to execute git commands many times a day, and typing each one out in full every time is cumbersome, so we can use aliases for them. Aliases need to be added to .zshrc; here's how we can add them.

    alias gs='git status'
    alias ga='git add .'
    alias gf='git fetch'
    alias gr='git rebase'
    alias gp='git push'
    alias gd='git diff'
    alias gc='git commit'
    alias gh='git checkout'
    alias gst='git stash'
    alias gl='git log --oneline --graph'

    You can add these anywhere in the .zshrc file.

    Colorls

    Another tool that makes you say wow is Colorls. This tool colorizes the output of the ls command. This is how it looks once you install it:

    It runs on Ruby; below is how you can install both Ruby and Colorls:

    sudo apt install ruby ruby-dev ruby-colorize
    sudo gem install colorls

    Now, restart your terminal and execute the command colorls to see the magic!

    Bonus: we can add some aliases as well if we want the same output as Colorls when we execute ls. Note that we're adding another alias, cl, to keep plain ls available as well.

    alias cl='\ls'
    alias ls='colorls'
    alias la='colorls -a'
    alias ll='colorls -l'
    alias lla='colorls -la'

    These are the tools and plugins I can’t live without now, Let me know if I’ve missed anything.

    Automation

    Do you want to repeat this whole process if, say, you buy a new laptop and want the same setup?

    If your answer is no, you can automate all of it, and that's why I've created Project Automator. This project does a lot more than just set up a terminal: it targets Arch Linux as of now, but you can take the parts you need and make them work with almost any *nix system you like.

    Explaining how it works is beyond the scope of this article, so I’ll have to leave you guys here to explore it on your own.

    Conclusion

    We need to perform many tasks on our systems, and using a GUI (Graphical User Interface) tool can consume a lot of your time, especially if you repeat the same task on a daily basis, like converting a media stream or setting up tools on a system.

    Using a command-line tool can save you a lot of time, and you can automate repetitive tasks with scripting. It can be a great addition to your arsenal.

  • The Impact of Buy Now, Pay Later Model on Consumer Behavior and the Payments Engineering Behind It

    Buy Now, Pay Later (BNPL) has moved from a niche fintech innovation to a mainstream payment method, reshaping how people shop, spend, and manage credit today. Monthly BNPL spending increased almost 21% from $201.60 in June 2024 to $243.90 in June 2025, according to Empower Personal Dashboard™ data.

    With BNPL growing, both customer expectations and spending patterns are evolving, and behind it all, payments engineering has become central. It is powering real-time credit checks, seamless checkout integrations, secure installment processing, and scalable infrastructures that ensure the Buy Now Pay Later model delivers on its promise of convenience, flexibility, and trust.

    Buy Now Pay Later model: Shifting Consumer Expectations

    The Buy Now Pay Later model is redefining what consumers demand in payments:

    • Instant Approvals: Shoppers want credit decisions in seconds, not days.
    • Transparency: Clear installment schedules and upfront costs with no hidden fees.
    • Flexibility: Options ranging from Pay-in-4 to longer repayment plans.
    • Integration: BNPL woven seamlessly into eCommerce checkouts, mobile apps, and even in-store POS systems.

    These expectations underscore why payments engineering must focus on both experience and trust.

    Payments Engineering: The Backbone of the Buy Now Pay Later Model

    For the Buy Now Pay Later model to scale and remain trustworthy, payments engineering is essential. Core elements include:

    1. Real-Time Risk Assessment
      • AI-driven credit models approve or decline BNPL transactions instantly.
      • Regulatory pressure is increasing around affordability checks.
    2. Seamless Checkout Integration
      • APIs and SDKs embed the Buy Now Pay Later model directly into digital and in-store journeys.
      • UX design ensures clarity and transparency.
    3. Transaction Orchestration
      • Splitting purchases into multiple payments requires precise ledgering, routing, and reconciliation at scale.
    4. Fraud Prevention & Compliance
      • BNPL engineering integrates identity checks, AML measures, and PCI DSS compliance.
    5. Scalable Infrastructure
      • Cloud-native platforms ensure resilience and handle seasonal spikes in transaction volumes.

    Without payments engineering, the Buy Now Pay Later model could not deliver its promise of flexibility and security.
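    To make the orchestration point concrete, here is a toy Python sketch (not any provider's actual logic) of splitting a purchase into installments while keeping the ledger cent-accurate:

```python
def split_installments(amount_cents: int, parts: int = 4) -> list[int]:
    """Split a purchase into `parts` installments that sum exactly to the
    original amount; any leftover cents go onto the earliest payments."""
    base, remainder = divmod(amount_cents, parts)
    return [base + (1 if i < remainder else 0) for i in range(parts)]

# A $100.00 purchase under Pay-in-4 splits cleanly:
print(split_installments(10_000))  # [2500, 2500, 2500, 2500]
# A $99.99 purchase must still reconcile to the cent:
print(split_installments(9_999))   # [2500, 2500, 2500, 2499]
```

    However the remainder is allocated, the invariant that matters for reconciliation is that the installments always sum back to the original charge.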

    Real-World Insights from BNPL Research

    • Checkout framing effect
      “Imagine you’re buying a $100 dress. If you see “Pay now: $100”, that’s a big number. But if the checkout shows “Pay in 4: $25 per month”, you feel the cost is more manageable—and you’re more likely to click purchase.”
    • Comparing BNPL vs. credit cards
      “Someone accustomed to paying with credit cards might see the full card bill at once, which can trigger cost awareness or even comparison-shopping. But with BNPL, because each payment is smaller and delayed, there’s less friction. BNPL users spend more under these conditions than credit‐card users do.”
    • Behavioral/psychological angle
      Use a scenario:
      “Jane wants to buy a $400 laptop. She hesitates because that’s a large hit all at once. But if the option is “4 payments of $100 with no interest,” she feels it’s more feasible, and goes ahead. The installment breakdown makes the cost feel smaller in present terms.”
      This illustrates the psychological mechanisms the study uncovers.

    Risks and Regulatory Shifts

    BNPL’s rapid adoption also comes with notable challenges:

    • Credit Reporting: Repayment histories are increasingly reported to credit bureaus, making defaults more visible and impactful.
    • Overextension: A growing number of users rely on BNPL for cash flow rather than convenience, leading to rising late payments.
    • Global Regulations: From the EU’s Consumer Credit Directive to UK affordability reforms, mandatory checks for transparency and responsible lending are reshaping the BNPL landscape.

    These shifts mean providers can no longer treat compliance and risk management as afterthoughts. This is where payments engineering takes center stage. Engineering-led approaches allow businesses to:

    • Automate credit checks, affordability assessments, and regulatory reporting
    • Design secure, scalable BNPL platforms that can adapt to global compliance requirements
    • Use AI and advanced analytics to flag high-risk behavior before it escalates
    • Ensure seamless, low-friction customer experiences while embedding compliance into the transaction flow

    To navigate these risks and regulatory shifts, providers must move beyond reactive fixes and embrace proactive, engineering-led strategies. Success depends on translating compliance requirements into technical architecture, system design, and embedded controls that scale with the business.

    At R Systems, we enable organizations to strengthen their BNPL platforms with cloud-native architectures, API-first integrations, AI-driven fraud and risk models, and compliance-by-design frameworks. In today's market, BNPL is no longer a competitive edge; it's a baseline expectation. With our payments engineering expertise, businesses can not only stay compliant but also lead with secure, reliable, and future-ready BNPL solutions. Talk to our Experts Now.

  • Every Millisecond Matters: How AI is Rewriting the Rules of Real-Time Transactions.

    Artificial Intelligence (AI) is reshaping the future of banking and payments. It has moved from a supporting technology to a core driver of growth and innovation. The global AI in banking and payments market is projected to reach $190.33 billion by 2030, reflecting its rapid adoption and transformative potential.

    Recent studies highlight that 86% of financial firms consider AI important to their operations, with the technology expected to unlock $340 billion in annual productivity gains. Adoption is not just theoretical: 70% of financial institutions reported AI-driven revenue growth in 2024, underscoring its tangible impact on the industry.

    This transformation is especially evident in the space of real-time transactions, where speed, security, and customer experience are non-negotiable. As real-time payments become the norm across global financial systems, the role of AI in transactions has expanded from fraud detection to personalized experiences, smarter risk scoring, and automated decision-making. By enabling instant analysis and adaptive responses, AI ensures that financial institutions can handle the demands of today’s fast-paced payment ecosystem, where every second counts, and trust is just as critical as efficiency.

    Why AI for Real-Time Transactions

    The rise of real-time payments is changing how money moves worldwide. Whether it's peer-to-peer transfers, e-commerce checkouts, cross-border remittances, or securities trading, transactions now happen in milliseconds. This speed brings significant challenges: online fraud, heightened regulatory scrutiny, compliance hurdles, and constant pressure to maintain security without disrupting the customer experience. Traditional systems often struggle to balance these demands, making AI in transactions an essential enabler of safe, efficient, and scalable payments.

    Key Roles of AI in Real-Time Payments

    1. Fraud Detection and Prevention

    AI models analyze behavioral data, device fingerprints, and transaction history in real time. Unlike static systems, they learn continuously to detect new fraud tactics, flagging suspicious activity instantly while allowing legitimate payments to proceed without friction.

    2. Smarter Risk Scoring

    Every transaction can be assigned a dynamic risk score by AI. High-risk transactions are flagged for verification, while low-risk ones move through seamlessly. This approach reduces false positives, improves approval rates, and strengthens customer trust.
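    As a purely illustrative sketch (not a production model), the routing described above can be thought of as thresholding a model-assigned score; the function name and thresholds below are arbitrary placeholders:

```python
def route_transaction(risk_score: float,
                      verify_threshold: float = 0.7,
                      decline_threshold: float = 0.9) -> str:
    """Route a payment by its risk score: low-risk transactions move
    through seamlessly, high-risk ones are flagged for verification,
    and extreme scores are declined outright."""
    if risk_score >= decline_threshold:
        return "decline"
    if risk_score >= verify_threshold:
        return "verify"
    return "approve"

print(route_transaction(0.12))  # approve
print(route_transaction(0.75))  # verify
print(route_transaction(0.95))  # decline
```

    In a real system the score itself would come from a continuously retrained model; the threshold routing is what reduces false positives while keeping friction low for legitimate payments.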

    3. Personalized Customer Journeys

    AI in transactions extends beyond security into personalization. Payment platforms can recommend tailored offers, loyalty rewards, or financing options at the point of payment, enhancing both customer satisfaction and business revenue.

    4. Intelligent Automation and Compliance

    AI-powered systems streamline KYC (Know Your Customer) and AML (Anti-Money Laundering) checks, automating tasks that once caused delays. Automated dispute resolution and instant decision-making further improve operational efficiency.

    5. Performance and Scalability

    During spikes such as holiday sales or IPO launches, AI optimizes transaction routing and system performance. Predictive models forecast demand, helping payment providers ensure uptime and reliability.

    Outlook: AI as the Backbone of Real-Time Payments

    Looking ahead, the role of AI will only grow stronger as real-time payments become the norm worldwide. A few key trends in AI for transactions are already taking shape, pointing toward a faster, smarter, and more secure payment ecosystem:

    • Explainable AI (XAI): Making AI’s decision-making transparent to regulators and customers.
    • Quantum-Resistant Security: Preparing payments infrastructure for next-gen threats.
    • Autonomous Financial Agents: AI-powered assistants conducting transactions on behalf of individuals or businesses.
    • Cross-Border Real-Time Payments: AI bridging regulatory and compliance gaps between global markets.

    Concluding

    The rise of real-time payments is transforming customer expectations, where speed and trust go hand in hand. AI in transactions is the force making this possible by detecting fraud, ensuring compliance, and keeping payments seamless and secure.

    At R Systems, we are shaping the future of real-time payments with our expertise in AI, data, and cloud engineering. By combining powerful tools and proven frameworks, we enable financial institutions to modernize faster, stay resilient, and deliver intelligent transaction experiences that inspire customer confidence today and tomorrow. Talk to our Experts Now.

  • 8X More Flexible Assessments: Modernizing K-12 Evaluation with Scalable Architecture

    • Modern Architecture Upgrade – Rebuilt the client’s flagship Instant Grading platform with a modern foundation, enhancing reliability, uptime, and adaptability to evolving classroom needs.
    • Flexibility & Efficiency – Expanded assessment options from 9 to 75 per question, accelerated development cycles, and simplified onboarding for educators and developers alike.
    • Strategic Outcomes – Delivered 8X more assessment flexibility, ensured smoother scaling to millions of students, and positioned the client as a global leader in next-generation K–12 evaluations.
  • Securing Card Transactions: The Role of Card Management Systems in Fraud Detection and Prevention

    Card fraud continues to evolve, keeping financial institutions and consumers on high alert. According to the latest predictions from the Nilson Report, global fraud losses in card payments are expected to reach $403.88 billion over the next decade. As card payment volumes surge worldwide, criminals are growing increasingly sophisticated, with schemes ranging from bulk purchases of stolen card data to complex account takeovers and social engineering.

    This isn’t a temporary spike—it’s a permanent shift in the threat landscape. Financial institutions must act with urgency or risk mounting losses and eroding customer trust. That’s where the Card Management System (CMS) comes in. More than just card issuance, a modern CMS serves as the command center for digital payment security, providing real-time authorization controls, tokenization, and integration with fraud detection systems.

    Key Card Management System Modules

    • Product & BIN management (create/configure card products)
    • Authorization rules & real-time limits (velocity, MCC, geography)
    • Tokenization & wallet provisioning connectors (device tokens, network tokens)
    • Fraud orchestration & rules engine (integration with fraud scoring services)
    • Lifecycle management (issuing, reissue, suspend, close)
    • Reporting, reconciliation & regulatory controls (PCI, AML/KYC hooks)

    How Card Management System capabilities map to fraud prevention

    1. Real-time authorization controls and dynamic rules

      A CMS enforces transaction-level rules in milliseconds, blocking suspicious activity before it results in losses. For instance, it can decline transactions attempted from two different countries within minutes of each other, or challenge an unusually high purchase with additional authentication.
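      The two examples above, impossible travel and velocity limits, reduce to stateful checks at authorization time. The thresholds and method names below are illustrative, not any specific CMS API:

      ```python
      from datetime import datetime, timedelta

      class AuthRules:
          """Sketch of CMS transaction-level rules: an impossible-travel
          (geography) check and a per-hour velocity limit."""
          def __init__(self, velocity_limit=5):
              self.last_seen = {}       # card -> (country, timestamp)
              self.recent = {}          # card -> approved txn timestamps
              self.velocity_limit = velocity_limit

          def authorize(self, card, country, ts):
              prev = self.last_seen.get(card)
              if prev and prev[0] != country and ts - prev[1] < timedelta(minutes=30):
                  return "DECLINE"      # two countries within minutes
              window = [t for t in self.recent.get(card, []) if ts - t < timedelta(hours=1)]
              if len(window) >= self.velocity_limit:
                  return "DECLINE"      # hourly velocity limit exceeded
              window.append(ts)
              self.recent[card] = window
              self.last_seen[card] = (country, ts)
              return "APPROVE"
      ```

      In production these rules run inside the authorization path itself, so the state lookups must complete within the network’s latency budget.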

      2. Tokenization & EMV payment tokens

      Tokenization ensures card numbers are never directly exposed in digital transactions. Instead, tokens tied to devices, merchants, or specific transactions reduce the usability of stolen data. EMV tokenization has become a global standard and is now a critical CMS capability.
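      A toy token vault shows the core contract: downstream systems only ever see the token, and the PAN is recoverable only inside the vault’s trust boundary. Real vaults are HSM-backed, PCI-scoped services; this is only a sketch:

      ```python
      import secrets

      class TokenVault:
          """Illustrative token vault: swaps a PAN for a random token of
          the same length, keeping the last four digits for display."""
          def __init__(self):
              self._pan_to_token = {}
              self._token_to_pan = {}

          def tokenize(self, pan):
              if pan in self._pan_to_token:
                  return self._pan_to_token[pan]
              token = "".join(secrets.choice("0123456789")
                              for _ in range(len(pan) - 4)) + pan[-4:]
              self._pan_to_token[pan] = token
              self._token_to_pan[token] = pan
              return token

          def detokenize(self, token):
              # only callable at trusted boundaries (e.g., network submission)
              return self._token_to_pan[token]
      ```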

      3. Strong Customer Authentication (SCA) & 3-D Secure

      Modern CMS platforms integrate SCA and 3-D Secure protocols, ensuring that high-risk transactions undergo step-up authentication (e.g., biometrics, OTP). Data from the European Banking Authority (EBA) confirms that SCA-protected transactions show significantly lower fraud rates compared to those without SCA.

      4. AI-Driven Fraud Detection

      Modern CMS platforms integrate with ML-driven fraud engines (in-house or third-party) that use machine learning and behavioral analytics to score transactions in real time. This reduces false positives while increasing fraud detection rates, balancing security with user experience.

      5. Issuer controls exposed to cardholders

      Two-way controls exposed to customers via mobile apps (instant lock/unlock, merchant category blocks, spend limits, geofencing, virtual card creation) are effective first-line defenses. They shrink the window of exposure for stolen card data and strengthen user trust; these capabilities are commonly implemented as CMS APIs.

      6. Customer Empowerment

      Banks are increasingly exposing card control features to customers, such as instant lock/unlock, category-specific spending limits, and geo-blocking, via mobile banking apps. These CMS-driven features let cardholders actively defend against fraud.
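      In code terms, the cardholder controls described above reduce to a small amount of per-card state consulted during authorization. This is a hedged sketch with invented method names, not any real CMS API:

      ```python
      class CardControls:
          """Per-card state for cardholder-facing controls: instant
          lock/unlock and merchant-category (MCC) blocks."""
          def __init__(self):
              self.locked = False
              self.blocked_mccs = set()

          def lock(self):
              self.locked = True

          def unlock(self):
              self.locked = False

          def block_mcc(self, mcc):
              self.blocked_mccs.add(mcc)

          def allows(self, mcc):
              """Called in the authorization path for each transaction."""
              return not self.locked and mcc not in self.blocked_mccs
      ```

      The value of these controls is that the state changes take effect on the very next authorization, which is what makes the instant lock a genuine first-line defense.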

      Typical Card Management System architecture patterns that improve security

      • Separation of duties: Distinct services for token vault, auth/risk decisioning, and card lifecycle reduce blast radius.
      • Event-driven authorization pipeline: Use a fast, streaming pipeline to inject real-time risk signals into the CMS before authorization responses are returned.
      • Secure, auditable key & credential management: Store keys in HSMs; use role-based access and rotate keys per policy to meet PCI and regulatory expectations.  
      • Token first, minimal PAN storage: Design systems so PANs are exchanged only at trusted boundaries and replaced with tokens in the CMS database.
      • Multi-factor flows & step-up authentication: Integrate SCA / 3-D Secure / device attestation so the CMS can require extra proof for risky transactions.

      Best Practices for Financial Institutions

      1. Adopt a token-first approach: Store PANs only in secure vaults, use tokens everywhere else.
      2. Integrate ML fraud engines: Blend rule-based controls with real-time analytics.
      3. Enable customer controls: Empower users with simple security features in mobile apps.
      4. Ensure regulatory compliance: Stay aligned with PCI DSS v4.0 and regional mandates like PSD2.
      5. Regularly update rule sets: Fraud evolves quickly; static rules soon become ineffective.

      Conclusion

      Card fraud is no longer a background risk; it is a frontline battle in digital banking. Financial institutions that fail to act decisively will not only suffer financial losses but also lose customer trust, which is far harder to rebuild.

      A Card Management System is no longer just about issuing and managing cards; it is the nerve center of digital payment security. With real-time authorization controls, tokenization, integration with AI-driven fraud engines, and customer-facing controls, a modern CMS equips financial institutions to stay ahead of fraudsters.

      At R Systems, we help banks, Fintechs, and payment providers modernize their payment ecosystems with next-generation Card Management Systems. Our expertise spans:

      • Global gateway integrations
      • GenAI-driven onboarding accelerators for faster time-to-market
      • PCI-compliant mobile and web SDKs for secure checkout
      • Optimized payment routing and higher transaction success rates
      • AI-led fraud detection and orchestration to minimize risk
      • Actionable analytics unlocking additional revenue from payments data

      With proven payments engineering capabilities, R Systems enables institutions to strengthen digital payment security, reduce fraud exposure, and deliver trusted customer experiences at scale. Talk to our Experts Now.

    1. OptimaAI Suite

      Our OptimaAI Suite flyer showcases how R Systems helps enterprises harness GenAI across the entire software lifecycle with:

      • AI-assisted software delivery copilots for coding, reviews, testing, and deployment
      • GenAI-powered modernization for legacy systems, accelerating transformation
      • Secure, governed frameworks with responsible AI guardrails and compliance checks
      • Intelligent interfaces, chatbots, copilots, voice agents, and search to boost user productivity
      • Domain-specific LLMs, pipelines, and accelerators tailored to industry needs

      With this flyer, you will:

      • See how organizations achieved 18% faster development and 16% efficiency gains in modernization
      • Discover proven OptimaAI Suite implementations that reduce costs, enhance quality, and speed innovation
      • Learn how to scale AI adoption responsibly across engineering, operations, and customer experience
    2. Beyond Cost Control: 3 Ways FinOps Powers Growth and Agility

      When most executives hear the term FinOps, they think about cost control. They imagine a team combing through invoices, cutting unused resources, and negotiating discounts. That is part of the story, but not the whole picture. In reality, FinOps is not just about saving money, it is about enabling growth, innovation, and agility in a cloud-driven world.

      Cloud has given organizations unprecedented flexibility to scale infrastructure and deploy new features. But that same flexibility often leads to overspending, waste, and inefficiency. A recent study suggests that up to 30% of cloud spend is wasted, often because of idle resources, lack of visibility, or poor alignment between finance and engineering. For business leaders, this isn’t just a budget concern. Every dollar wasted represents engineering time lost, product releases delayed, and innovation deferred.

      That’s where FinOps comes in.

      At its core, FinOps (short for Cloud Financial Operations) is about bringing finance, technology, and business together to make smarter decisions. It aligns spending with business impact, provides the visibility leaders need to prioritize, and frees up capital that can be reinvested in research, new capabilities, and market expansion. In other words: FinOps transforms cloud from a cost center into a growth engine.

      Why Cost Alone is the Wrong Lens

      Organizations often approach FinOps with a narrow goal: reduce the cloud bill. While cutting unnecessary spend is important, it is only the starting point. If FinOps stops there, companies miss its real value.

      Cloud waste isn’t just a financial inefficiency. It limits engineering capacity by tying up budgets in unused services. Teams hesitate to experiment with new tools because they lack clarity on budget trade-offs. Finance departments, worried about ballooning costs, become blockers instead of enablers.

      By reframing FinOps from cost-cutting to growth-enabling, leaders unlock new opportunities. Strategic savings are not about trimming fat for its own sake; rather, they are about reallocating resources to what matters most: innovation, customer experience, and market differentiation.

      How FinOps Turns Cloud Savings into Business Growth:

      1. Visibility that Powers Better Decisions

      FinOps provides transparency into cloud usage across teams, applications, and business units. This isn’t just about dashboards; it’s about understanding the link between cloud spend and business outcomes. When leaders can see which workloads drive revenue, which experiments pay off, and which services drain resources without returns, they can prioritize effectively.

      This visibility ensures that every dollar spent is an investment, not just an expense.
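      Tag-based cost allocation is the usual first step toward that visibility: roll spend up by team and surface whatever is untagged, since untagged spend is spend nobody owns. The line-item fields below are illustrative, not any specific cloud billing schema:

      ```python
      def spend_by_team(line_items):
          """Roll up tagged cloud line items per team and total the
          untagged spend, the first blocker to cost visibility."""
          totals, untagged = {}, 0.0
          for item in line_items:
              team = item.get("team")
              if team:
                  totals[team] = totals.get(team, 0.0) + item["cost"]
              else:
                  untagged += item["cost"]
          return totals, untagged
      ```

      Once every dollar has an owner, the next step is mapping each team’s spend to the revenue or outcomes it supports, which is where unit economics begin.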

      2. Aligning Finance and Engineering

      In traditional IT, finance and engineering often operate at odds. Finance wants predictability, engineering wants speed. FinOps bridges the gap by creating a shared language of value. With the right governance, engineering teams gain freedom to innovate while finance gains confidence in the ROI.

      The result: finance shifts from being a gatekeeper to a trusted business partner.

      3. Reinvesting in Innovation

      Perhaps the most overlooked benefit of FinOps is the capacity it creates. Strategic cost optimization frees up capital that can be redirected into R&D, new product lines, and scaling operations. In competitive industries, this reinvestment can be the difference between leading and lagging.

      A Case in Point: Growth Through FinOps

      At R Systems, we recently worked with a leading healthcare supply chain provider that faced mounting cloud costs. The client was concerned not only about overspending, but also about delayed innovation. Their teams struggled to balance cost control with the need to modernize their supply chain systems.

      Through our Cloud Cost Governance framework, we implemented a FinOps strategy that combined cost visibility, workload optimization, and cross-team accountability. Within a year, the client cut annual cloud costs by 20%.

      But here is the real story: the savings weren’t simply pocketed. They were reinvested into innovation projects that modernized logistics operations and improved service delivery for healthcare providers nationwide. What began as a cost exercise became a growth initiative.

      This is the essence of FinOps. It is not just about efficiency; rather, it is about fueling transformation.

      Read the full story here: Driving Supply Chain Efficiency with Cloud Cost Governance – R Systems

      The R Systems Advantage in Cloud FinOps

      FinOps is not a one-time project. It is a continuous discipline that requires the right mix of process, culture, and technology. At R Systems, we bring this holistic view to every client engagement.

      • FinOps Cloud Cost Management: We help enterprises gain real-time visibility into spend and align costs with business outcomes.
      • FinOps Cost Optimization: Our frameworks reduce waste while ensuring teams have the resources they need to innovate.
      • FinOps as a Service: We deliver ongoing governance and automation, so FinOps practices evolve with the business.
      • Cloud Financial Management Expertise: With decades of experience in cloud engineering and enterprise IT, we design programs that balance growth with governance.

      Our approach is rooted in collaboration. We don’t just analyze numbers; we empower cross-functional teams to make informed, agile decisions. By embedding FinOps into daily operations, organizations unlock both cost savings and growth potential.

      For more on our approach, visit our Cloud FinOps page.

      Looking Ahead: FinOps as the New Normal

      The pace of digital transformation will only accelerate. Cloud adoption is no longer about “if” but “how fast” and “how smart.” In this context, FinOps will become a standard operating model for high-performing organizations.

      The companies that thrive will be those that treat FinOps not as a defensive measure, but as an offensive strategy. They will use FinOps to fund innovation, empower engineers, and turn finance into a growth partner.

      As Gurpreet Singh aptly wrote, FinOps is not about cutting costs, but about making the right costs. And as DNX Solutions reminds us, it is about moving beyond traditional cost management to create value.

      At R Systems, we believe the future of FinOps lies in this growth-oriented mindset. The organizations we work with are not just trimming expenses—they are building the capacity to innovate faster, scale smarter, and compete stronger.

      What to do next?

      If your organization views FinOps purely as a cost-cutting exercise, it’s time to rethink. The real opportunity is to harness FinOps as a growth enabler. By combining visibility, alignment, and reinvestment, you can transform your cloud strategy from reactive control to proactive innovation.

      R Systems can help you get there. Our Cloud FinOps services are designed to unlock both savings and scale, so you can invest confidently in the future.

      The question is not whether you need FinOps.

      The question is whether you will use it to cut costs, or to fuel growth.

      The choice is yours. Let’s build the future of cloud together.

      Start the journey — talk to our Cloud FinOps experts today.