Category: Engineering blogs

  • Linux Internals of Kubernetes Networking

    Introduction

    This blog is a hands-on guide designed to help you understand Kubernetes networking concepts by following along. We’ll use K3s, a lightweight Kubernetes distribution, to explore how networking works within a cluster.

    System Requirements

    Before getting started, ensure your system meets the following requirements:

    • A Linux-based system (Ubuntu, CentOS, or equivalent).
    • At least 2 CPU cores and 4 GB of RAM.
    • Basic familiarity with Linux commands.

    Installing K3s

    To follow along with this guide, we first need to install K3s—a lightweight Kubernetes distribution designed for ease of use and optimized for resource-constrained environments.

    Install K3s

    You can install K3s by running the following command in your terminal:

    curl -sfL https://get.k3s.io | sh -

    This script will:

    1. Download and install the K3s server.
    2. Set up the necessary dependencies.
    3. Start the K3s service automatically after installation.

    Verify K3s Installation

    After installation, you can check the status of the K3s service to make sure everything is running correctly:

    systemctl status k3s

    If everything is correct, you should see that the K3s service is active and running.

    Set Up kubectl

    K3s comes bundled with its own kubectl binary. To use it, you can either:

    Use the K3s binary directly:

    k3s kubectl get pods -A

    Or set up the kubectl config file by exporting the Kubeconfig path:

    export KUBECONFIG="/etc/rancher/k3s/k3s.yaml"
    sudo chown -R $USER $KUBECONFIG
    kubectl get pods -A

    Understanding Kubernetes Networking

    In Kubernetes, networking plays a crucial role in ensuring seamless communication between pods, services, and external resources. In this section, we will dive into the network configuration and explore how pods communicate with one another.

    Viewing Pods and Their IP Addresses

    To check the IP addresses assigned to the pods, use the following kubectl command:

    kubectl get pods -A -o wide

    This will show you a list of all the pods across all namespaces, including their corresponding IP addresses. Each pod is assigned a unique IP address within the cluster.

    You’ll notice that the IP addresses are assigned by Kubernetes and belong to the range specified by the network plugin (such as Flannel or Calico). K3s uses Flannel as its default CNI, with a cluster CIDR of 10.42.0.0/16 from which each node is allocated a /24 subnet (10.42.0.0/24 on the first node). These IPs allow communication within the cluster.
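Since the /16 falls on an octet boundary, checking whether an address is a pod IP reduces to a simple prefix match. Here is an illustrative shell helper (not part of the original setup; the function name is ours):

```shell
# in_pod_cidr IP — prints "inside" if IP falls in 10.42.0.0/16
# (the K3s Flannel default), "outside" otherwise. This works because
# the /16 prefix aligns exactly with the first two octets.
in_pod_cidr() {
  case "$1" in
    10.42.*) echo "inside" ;;
    *)       echo "outside" ;;
  esac
}

in_pod_cidr 10.42.0.9      # a pod IP
in_pod_cidr 192.168.2.224  # the host IP
```

For CIDRs that do not align with octet boundaries, you would need real bitmask arithmetic instead of a prefix match.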

    Observing Network Configuration Changes

    Upon starting K3s, it sets up several network interfaces and configurations on the host machine. These configurations are key to how the Kubernetes networking operates. Let’s examine the changes using the IP utility.

    Show All Network Interfaces

    Run the following command to list all network interfaces:

    ip link show

    This will show all the network interfaces.

    • lo, enp0s3, and enp0s9 are the network interfaces that belong to the host.
    • flannel.1 is created by the Flannel CNI for communication between pods on different nodes.
    • cni0 is created by the bridge CNI plugin for communication between pods on the same node.
    • vethXXXXXXXX@ifY interfaces are created by the bridge CNI plugin; each connects a pod to the cni0 bridge.
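Interface names vary from system to system, so a small filter helps when the list is long. Here is a sketch that keeps only the CNI-created interfaces, run against sample `ip -o link show` output (the sample lines are illustrative):

```shell
# Keep only CNI-created interfaces (cni0, flannel.1, veth*) from
# `ip -o link show` output; the @ifN suffix is stripped for readability.
filter_cni() {
  awk -F': ' '{ sub(/@.*/, "", $2); print $2 }' | grep -E '^(cni0|flannel\.1|veth)'
}

printf '%s\n' \
  '1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536' \
  '2: enp0s3: <BROADCAST,MULTICAST,UP> mtu 1500' \
  '4: flannel.1: <BROADCAST,MULTICAST,UP> mtu 1450' \
  '5: cni0: <BROADCAST,MULTICAST,UP> mtu 1450' \
  '7: veth82ebd960@if2: <BROADCAST,MULTICAST,UP> mtu 1450' | filter_cni
```

On a live node, pipe `ip -o link show` into the same filter instead of the sample lines.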

    Show IP Addresses

    To display the IP addresses assigned to the interfaces:

    ip -c -o addr show

    You should see the IP addresses of all the network interfaces. Among the K3s-related interfaces, only cni0 and flannel.1 have IP addresses; the vethXXXXXXXX interfaces have only MAC addresses, which will be explained in a later section of this blog.

    Pod-to-Pod Communication and Bridge Networks

    The diagram illustrates how container networking works within a Kubernetes (K3s) node, showing the key components that enable pods to communicate with each other and the outside world. Let’s break down this networking architecture:

    At the top level, we have the host interface (enp0s9) with IP 192.168.2.224, which is the node’s physical network interface connected to the external network. This is the node’s gateway to the outside world.

    The cni0 bridge (IP: 10.42.0.1/24) acts like a virtual switch inside the node and serves as the internal network hub for all pods running on it; traffic between the pods and the external network is routed by the host between cni0 and enp0s9.

    Each pod runs in its own network namespace, with its own separate network stack, including its own network interfaces and routing tables. The eth0 interface inside each pod, as shown in the diagram above, carries the pod’s IP address. This eth0 is one end of a virtual ethernet (veth) pair; the other end lives in the host’s network and connects the pod’s eth0 to the cni0 bridge.

    Exploring Network Namespaces in Detail

    Kubernetes uses network namespaces to isolate networking for each pod, ensuring that pods have separate networking environments and do not interfere with each other. 

    A network namespace is a Linux kernel feature that provides network isolation for a group of processes. Each namespace has its own network interfaces, IP addresses, routing tables, and firewall rules. Kubernetes uses this feature to ensure that each pod has its own isolated network environment.

    In Kubernetes:

    • Each pod has its own network namespace.
    • Each container within a pod shares the same network namespace.
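Namespace identity is visible directly in /proc: every process’s network namespace appears as a symlink whose target encodes an inode number, and processes in the same namespace see the same inode. A quick check you can run on any Linux host:

```shell
# Both commands resolve to the same net:[<inode>] target, because the
# readlink child process inherits (and therefore shares) the shell's
# network namespace — just as containers in one pod share the pod's.
readlink /proc/self/ns/net
readlink /proc/$$/ns/net
```

Containers in the same pod would likewise show identical inodes, while containers from different pods show different ones.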

    Inspecting Network Namespaces

    To inspect the network namespaces, follow these steps:

    Note: If you installed K3s as per this blog, it uses the containerd runtime by default. The commands to get the container PID will differ if you run K3s with Docker or another container runtime.

    Identify the container runtime and get the list of running containers:

    sudo crictl ps

    Get the container ID from the output and use it to get the process ID:

    sudo crictl inspect <container-id> | grep pid

    Check the network namespace associated with the container:

    sudo ls -l /proc/<container-pid>/ns/net

    You can use nsenter to enter the network namespace for further exploration.

    Executing Into Network Namespaces

    To explore the network settings of a pod’s namespace, you can use the nsenter command.

    sudo nsenter --net=/proc/<container-pid>/ns/net
    ip addr show

    Script to exec into network namespace

    You can use the following script to get the container process ID and exec into the pod network namespace directly.

    POD_ID=$(sudo crictl pods --name <pod_name> -q)
    CONTAINER_ID=$(sudo crictl ps --pod $POD_ID -q)
    sudo nsenter -t $(sudo crictl inspect $CONTAINER_ID | jq -r .info.pid) -n ip addr show

    Veth Interfaces and Their Connection to Bridge

    Inside the pod’s network namespace, you should see the pod’s interfaces (lo and eth0) and the IP address assigned to the pod (10.42.0.8 in this example). Looking closely, we see eth0@if13, which means eth0 is paired with interface 13 (the corresponding index may differ on your system). The eth0 interface inside the pod is a virtual ethernet (veth) interface, and veths are always created in interconnected pairs: one end is the pod’s eth0, while the other is interface 13. But where does interface 13 exist? It lives in the host network, connecting the pod’s network to the host network via the bridge (cni0) in this case.

    ip -o link show | grep '^13:'

    Here you see veth82ebd960@if2, which denotes that this veth is paired with interface number 2 in the pod’s network namespace. You can verify that the veth is connected to the cni0 bridge as follows; each pod’s veth is attached to the bridge, which enables communication between pods on the same node.

    brctl show

    Demonstrating Pod-to-Pod Communication

    Deploy Two Pods

    Deploy two busybox pods to test communication:

    kubectl run pod1 --image=busybox --restart=Never -- sleep infinity
    kubectl run pod2 --image=busybox --restart=Never -- sleep infinity

    Get the IP Addresses of the Pods

    kubectl get pods pod1 pod2 -o wide

    Pod1 IP : 10.42.0.9

    Pod2 IP : 10.42.0.10

    Ping Between Pods and Observe the Traffic Between Two Pods

    Before we ping from Pod1 to Pod2, we will use tcpdump to watch cni0 and the veth pairs of Pod1 and Pod2 that are connected to it.

    Open three terminals and set up the tcpdump listeners: 

    # Terminal 1 – Watch traffic on cni0 bridge

    sudo tcpdump -i cni0 icmp

    # Terminal 2 – Watch traffic on veth3a94f27 (Pod1’s veth pair)

    sudo tcpdump -i veth3a94f27 icmp

    # Terminal 3 – Watch traffic on veth18eb7d52 (Pod2’s veth pair)

    sudo tcpdump -i veth18eb7d52 icmp

    Exec into Pod1 and ping Pod2:

    kubectl exec -it pod1 -- ping -c 4 <pod2-IP>

    Watch the results on veth3a94f27 (Pod1’s veth pair):

    Watch the results on cni0:

    Watch the results on veth18eb7d52 (Pod2’s veth pair):

    Observing the timestamps for each request and reply on different interfaces, we get the flow of request/reply, as shown in the diagram below.

    Deeper Dive into the Journey of Network Packets from One Pod to Another

    We have already seen the flow of request/reply between two pods via veth interfaces connected to each other in a bridge network. In this section, we will discuss the internal details of how a network packet reaches from one pod to another.

    Packet Leaving Pod1’s Network

    Inside Pod1’s network namespace, the packet originates from eth0 (Pod1’s internal interface) and is sent out via its virtual ethernet interface pair in the host network. The destination address of the packet is 10.42.0.10, which lies within the CIDR range 10.42.0.0 – 10.42.0.255, hence it matches the second route.

    The packet exits Pod1’s namespace and enters the host namespace via the connected veth pair that exists in the host network. The packet arrives at bridge cni0 since it is the master of all the veth pairs that exist in the host network.

    Once the packet reaches cni0, it gets forwarded to the correct veth pair connected to Pod2.

    Packet Forwarding from cni0 to Pod2’s Network

    When the packet reaches cni0, its job is to forward the packet to Pod2. The cni0 bridge acts as a Layer 2 switch here, simply forwarding the packet to the destination veth. The bridge maintains a forwarding database and dynamically learns the mapping between a destination MAC address and its corresponding veth device.

    You can view forwarding database information with the following command:

    bridge fdb show

    In this screenshot, I have limited the forwarding database output to just the MAC address of Pod2’s eth0:

    1. First column: MAC address of Pod2’s eth0.
    2. dev vethX: the network interface this MAC address is reachable through.
    3. master cni0: indicates this entry belongs to the cni0 bridge.
    4. Flags that may appear:
      • permanent: static entry, manually added or system-generated.
      • self: the MAC address belongs to the bridge interface itself.
      • No flag: the entry was dynamically learned.
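The same lookup the bridge performs can be mimicked on the fdb output with awk. Here the dump is inlined as sample data (the MAC addresses are made up; the veth names echo the earlier examples); on a live node you would pipe `bridge fdb show` instead:

```shell
# Given an fdb dump, print the port (veth) a destination MAC is
# reachable through — the mapping the bridge consults when forwarding.
fdb_dump='33:33:00:00:00:01 dev cni0 self permanent
aa:bb:cc:dd:ee:01 dev veth3a94f27 master cni0
aa:bb:cc:dd:ee:02 dev veth18eb7d52 master cni0'

printf '%s\n' "$fdb_dump" | awk '$1 == "aa:bb:cc:dd:ee:02" { print $3 }'
```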

    Dynamic MAC Learning Process

    When Pod1 generates a packet carrying an ICMP request, it is packed into a Layer 2 frame whose source MAC is the MAC address of Pod1’s eth0 interface. To obtain the destination MAC address, eth0 broadcasts an ARP request to all the network interfaces; the ARP request contains the destination interface’s IP address.

    This ARP request is received by all interfaces connected to the bridge, but only Pod2’s eth0 interface responds with its MAC address. The destination MAC address is then added to the frame, and the packet is sent to the cni0 bridge.

    When this frame reaches the cni0 bridge, the bridge opens the frame and saves the source MAC against the source interface (the veth pair of Pod1’s eth0 in the host network) in the forwarding table.

    Now the bridge has to forward the frame to the interface behind which the destination lies (i.e., the veth pair of Pod2 in the host network). If the forwarding table has an entry for Pod2’s veth, the bridge forwards the frame there; otherwise, it floods the frame to all the veths connected to the bridge, hence reaching Pod2.

    When Pod2 sends the reply to Pod1, the reverse path is followed. The frame leaves Pod2’s eth0 and reaches cni0 via the veth pair of Pod2’s eth0 in the host network. The bridge records the source MAC address (in this case, Pod2’s eth0) and the device it is reachable through in the forwarding database, then forwards the reply to Pod1, completing the request and response cycle.
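The learn-then-forward behaviour described above can be sketched as a toy model in shell (this illustrates the logic only, it is not real bridge code; the MACs and port names are invented):

```shell
# Toy model of a bridge's MAC learning. FDB holds "MAC port" lines;
# learn() records the source MAC of an incoming frame, and forward()
# prints the learned port for a destination MAC, or "flood" if unknown.
FDB=""

learn() {
  FDB="$FDB
$1 $2"
}

forward() {
  port=$(printf '%s\n' "$FDB" | awk -v mac="$1" '$1 == mac { print $2 }')
  echo "${port:-flood}"
}

learn   aa:bb:cc:dd:ee:01 veth-pod1   # Pod1's request teaches its MAC
forward aa:bb:cc:dd:ee:02             # Pod2 unknown yet -> "flood"
learn   aa:bb:cc:dd:ee:02 veth-pod2   # Pod2's reply teaches its MAC
forward aa:bb:cc:dd:ee:02             # now forwarded to veth-pod2 directly
```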

    Summary and Key Takeaways

    In this guide, we explored the foundational elements of Linux that play a crucial role in Kubernetes networking using K3s. Here are the key takeaways:

    • Network Namespaces ensure pod isolation.
    • Veth Interfaces connect pods to the host network and enable inter-pod communication.
    • Bridge Networks facilitate pod-to-pod communication on the same node.

    I hope you gained a deeper understanding of how Linux internals are used in Kubernetes network design and how they play a key role in pod-to-pod communication within the same node.

  • Taming the OpenStack Beast – A Fun & Easy Guide!

    So, you’ve heard about OpenStack, but it sounds like a mythical beast only cloud wizards can tame? Fear not! No magic spells or enchanted scrolls are needed—we’re breaking it down in a simple, engaging, and fun way.

    Ever felt like managing cloud infrastructure is like trying to tame a wild beast? OpenStack might seem intimidating at first, but with the right approach, it’s more like training a dragon—challenging but totally worth it!

    By the end of this guide, you’ll not only understand OpenStack but also be able to deploy it like a pro using Kolla-Ansible. Let’s dive in! 🚀

    🤔 What Is OpenStack?

    Imagine you’re running an online store. Instead of buying an entire warehouse upfront, you rent shelf space, scaling up or down based on demand. That’s exactly how OpenStack works for computing!

    OpenStack is an open-source cloud platform that lets companies build, manage, and scale their own cloud infrastructure—without relying on expensive proprietary solutions.

    Think of it as LEGO blocks for cloud computing—but instead of plastic bricks, you’re assembling compute, storage, and networking components to create a flexible and powerful cloud. 🧱🚀

    🤷‍♀️ Why Should You Care?

    OpenStack isn’t just another cloud platform—it’s powerful, flexible, and built for the future. Here’s why you should care:

    It’s Free & Open-Source – No hefty licensing fees, no vendor lock-in—just pure, community-driven innovation. Whether you’re a student, a startup, or an enterprise, OpenStack gives you the freedom to build your own cloud, your way.

    Trusted by Industry Giants – If OpenStack is good enough for NASA, PayPal, and CERN (yes, the guys running the Large Hadron Collider), it’s definitely worth your time! These tech powerhouses use OpenStack to manage mission-critical workloads, proving its reliability at scale.

    Super Scalable – Whether you’re running a tiny home lab or a massive enterprise deployment, OpenStack grows with you. Start with a few nodes and scale to thousands as your needs evolve—without breaking a sweat.

    Perfect for Hands-On Learning – Want real-world cloud experience? OpenStack is a playground for learning cloud infrastructure, automation, and networking. Setting up your own OpenStack lab is like a DevOps gym—you’ll gain hands-on skills that are highly valued in the industry.

    ️🏗️ OpenStack Architecture in Simple Terms – The Avengers of Cloud Computing

    OpenStack is a modular system. Think of it as assembling an Avengers team, where each component has a unique superpower, working together to form a powerful cloud infrastructure. Let’s break down the team:

    🦾 Nova (Iron Man) – The Compute Powerhouse

    Just like Iron Man powers up in his suit, Nova is the core component that spins up and manages virtual machines (VMs) in OpenStack. It ensures your cloud has enough compute power and efficiently allocates resources to different workloads.

    • Acts as the brain of OpenStack, managing instances on physical servers.
    • Works with different hypervisors like KVM, Xen, and VMware to create VMs.
    • Supports auto-scaling, so your applications never run out of power.

    ️🕸️ Neutron (Spider-Man) – The Web of Connectivity

    Neutron is like Spider-Man, ensuring all instances are connected via a complex web of virtual networking. It enables smooth communication between your cloud instances and the outside world.

    • Provides network automation, floating IPs, and load balancing.
    • Supports custom network configurations like VLANs, VXLAN, and GRE tunnels.
    • Just like Spidey’s web shooters, it’s flexible, allowing integration with SDN controllers like Open vSwitch and OVN.

    💪 Cinder (Hulk) – The Strength Behind Storage

    Cinder is OpenStack’s block storage service, acting like the Hulk’s immense strength, giving persistent storage to VMs. When VMs need extra storage, Cinder delivers!

    • Allows you to create, attach, and manage persistent block storage.
    • Works with backend storage solutions like Ceph, NetApp, and LVM.
    • If a VM is deleted, the data remains safe—just like Hulk’s memory, despite all the smashing.

    📸 Glance (Black Widow) – The Memory Keeper

    Glance is OpenStack’s image service, storing and managing operating system images, much like how Black Widow remembers every mission.

    • Acts as a repository for VM images, including Ubuntu, CentOS, and custom OS images.
    • Enables fast booting of instances by storing pre-configured templates.
    • Works with storage backends like Swift, Ceph, or NFS.

    🔑 Keystone (Nick Fury) – The Security Gatekeeper

    Keystone is the authentication and identity service, much like Nick Fury, who ensures that only authorized people (or superheroes) get access to SHIELD.

    • Handles user authentication and role-based access control (RBAC).
    • Supports multiple authentication methods, including LDAP, OAuth, and SAML.
    • Ensures that users and services only access what they are permitted to see.

    🧙‍♂️ Horizon (Doctor Strange) – The All-Seeing Dashboard

    Horizon provides a web-based UI for OpenStack, just like Doctor Strange’s ability to see multiple dimensions.

    • Gives a graphical interface to manage instances, networks, and storage.
    • Allows admins to control the entire OpenStack environment visually.
    • Supports multi-user access with dashboards customized for different roles.

    🚀 Additional Avengers (Other OpenStack Services)

    • Swift (Thor’s Mjolnir) – Object storage, durable and resilient like Thor’s hammer.
    • Heat (Wanda Maximoff) – Automates cloud resources like magic.
    • Ironic (Vision) – Bare metal provisioning, a bridge between hardware and cloud.

    Each of these heroes (services) communicates through APIs, working together to make OpenStack a powerful cloud platform.

    ️🛠️ How This Helps in Installation

    Understanding these services will make it easier to set up OpenStack. During installation, configure each component based on your needs:

    • If you need VMs, you focus on Nova, Glance, and Cinder.
    • If networking is key, properly configure Neutron.
    • Secure access? Keystone is your best friend.

    Now that you know the Avengers of OpenStack, you’re ready to start your cloud journey. Let’s get our hands dirty with some real-world OpenStack deployment using Kolla-Ansible.

    ️🛠️ Hands-on: Deploying OpenStack with Kolla-Ansible

    So, you’ve learned the Avengers squad of OpenStack—now it’s time to assemble your own OpenStack cluster! 💪

    🔍 Pre-requisites: What You Need Before We Begin

    Before we start, let’s make sure you have everything in place:

    🖥️ Hardware Requirements (Minimum for a Test Setup)

    • 1 Control Node + 1 Compute Node (or more for better scaling).
    • At least 8GB RAM, 4 vCPUs, 100GB Disk per node (More = Better).
    • Ubuntu 22.04 LTS (Recommended) or CentOS 9 Stream.
    • Internet Access (for downloading dependencies).

    🔧 Software & Tools Needed

    Python 3.10+ – Because Python runs the world.

    Ansible 8-9 (ansible-core 2.15-2.16) – Automating OpenStack deployment.

    Docker & Docker Compose – Because we’re running OpenStack in containers!

    Kolla-Ansible – The magic tool for OpenStack deployment.

    Step-by-Step: Setting Up OpenStack with Kolla-Ansible

    1️⃣ Set Up Your Environment

    First, update your system and install dependencies:

    sudo apt update && sudo apt upgrade -y
    sudo apt-get install python3-dev libffi-dev gcc libssl-dev python3-selinux python3-setuptools python3-venv -y

    python3 -m venv kolla-venv
    echo "source ~/kolla-venv/bin/activate" >> ~/.bashrc
    source ~/kolla-venv/bin/activate

    Install Ansible & Docker:

    sudo apt install python3-pip -y
    pip install -U pip
    pip install 'ansible-core>=2.15,<2.17' ansible

    2️⃣ Install Kolla-Ansible

    pip install git+https://opendev.org/openstack/kolla-ansible@master

    3️⃣ Prepare Configuration Files

    Copy default configurations to /etc/kolla:

    sudo mkdir -p /etc/kolla
    sudo chown $USER:$USER /etc/kolla
    cp -r /usr/local/share/kolla-ansible/etc/kolla/* /etc/kolla/

    Generate passwords for OpenStack services:

    kolla-genpwd

    Before deploying OpenStack, let’s configure some essential settings in globals.yml. This file defines how OpenStack services are installed and interact with your infrastructure.

    Run the following command to edit the file:

    nano /etc/kolla/globals.yml

    Here are a few key parameters you must configure:

    kolla_base_distro – Defines the OS used for deployment (e.g., ubuntu or centos).

    kolla_internal_vip_address – Set this to a free IP in your network. It acts as the virtual IP for OpenStack services. Example: 192.168.1.100.

    network_interface – Set this to your main network interface (e.g., eth0). Kolla-Ansible will use this interface for internal communication. (Check using ip -br a)

    enable_horizon – Set to yes to enable the OpenStack web dashboard (Horizon).

    Once configured, save and exit the file. These settings ensure that OpenStack is properly installed in your environment.
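Put together, the settings above might look like this in /etc/kolla/globals.yml (the VIP and interface name are placeholders for your environment):

```yaml
kolla_base_distro: "ubuntu"
kolla_internal_vip_address: "192.168.1.100"
network_interface: "eth0"
enable_horizon: "yes"
```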

    4️⃣ Bootstrap the Nodes (Prepare Servers for Deployment)

    kolla-ansible -i /etc/kolla/inventory/all-in-one bootstrap-servers

    5️⃣ Deploy OpenStack! (The Moment of Truth)

    kolla-ansible -i /etc/kolla/inventory/all-in-one deploy

    This step takes some time (~30 minutes), so grab some ☕ and let OpenStack build itself.

    6️⃣ Access Horizon (Web Dashboard)

    Once deployment is complete, check the OpenStack dashboard:

    kolla-ansible post-deploy

    Now, find your login details:

    cat /etc/kolla/admin-openrc.sh

    Source the credentials and log in:

    source /etc/kolla/admin-openrc.sh
    openstack service list

    Open your browser and try accessing: http://<your-server-ip>/dashboard/ or https://<your-server-ip>/dashboard/

    Use the username and the password from admin-openrc.sh.
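If you just want the two values Horizon asks for, they live as environment variables in that file. A sketch using a sample file (the password shown is a placeholder; the real one is generated by kolla-genpwd):

```shell
# Stand-in for /etc/kolla/admin-openrc.sh with placeholder values.
cat > /tmp/sample-openrc.sh <<'EOF'
export OS_USERNAME=admin
export OS_PASSWORD=s3cretpass
export OS_AUTH_URL=http://192.168.1.100:5000
EOF

# Pull out just the username and password.
sed -n -e 's/^export OS_USERNAME=//p' \
       -e 's/^export OS_PASSWORD=//p' /tmp/sample-openrc.sh
```

Run the same sed against the real /etc/kolla/admin-openrc.sh on your deployment host.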

    Troubleshooting Common Issues

    Deploying OpenStack isn’t always smooth sailing. Here are some common issues and how to fix them:

    Kolla-Ansible Fails at Bootstrap

    Solution: Run `kolla-ansible -i /etc/kolla/inventory/all-in-one prechecks` to check for missing dependencies before deployment.

    Containers Keep Restarting or Failing

    Solution: Run docker ps -a | grep Exit to check failed containers. Then inspect logs with:

    docker ps --format 'table {{.ID}}\t{{.Names}}\t{{.Status}}'
    docker logs $(docker ps -q --filter "status=exited")
    journalctl -u docker.service --no-pager | tail -n 50

    Horizon Dashboard Not Accessible

    Solution: Ensure enable_horizon: yes is set in globals.yml and restart services with:

    kolla-ansible -i /etc/kolla/inventory/all-in-one reconfigure

    Missing OpenStack CLI Commands

    Solution: Source the OpenStack credentials file before using the CLI:

    source /etc/kolla/admin-openrc.sh

    By tackling these common issues, you’ll have a much smoother OpenStack deployment experience.

    🎉 Congratulations, You Now Have Your Own Cloud!

    Now that your OpenStack deployment is up and running, you can start launching instances, creating networks, and exploring the endless possibilities.

    What’s Next?

    ✅ Launch your first VM using OpenStack CLI or Horizon!

    ✅ Set up floating IPs and networks to make instances accessible.

    ✅ Experiment with Cinder storage and Neutron networking.

    ✅ Explore Heat for automation and Swift for object storage.

    Final Thoughts

    Deploying OpenStack manually can be a nightmare, but Kolla-Ansible makes it much easier. You’ve now got your own containerized OpenStack cloud running in no time.

  • Mastering TV App Development: Building Seamless Experiences with EnactJS and WebOS

    As the world of smart TVs evolves, delivering immersive and seamless viewing experiences is more crucial than ever. At Velotio Technologies, we take pride in our proven expertise in crafting high-quality TV applications that redefine user engagement. Over the years, we have built multiple TV apps across diverse platforms, and our mastery of cutting-edge JavaScript frameworks, like EnactJS, has consistently set us apart.

    Our experience extends to WebOS Open Source Edition (OSE), a versatile and innovative platform for smart device development. WebOS OSE’s seamless integration with EnactJS allows us to deliver native-quality apps optimized for smart TVs that offer advanced features like D-pad navigation, real-time communication with system APIs, and modular UI components.

    This blog delves into how we harness the power of WebOS OSE and EnactJS to build scalable, performant TV apps. Learn how Velotio’s expertise in JavaScript frameworks and WebOS technologies drive innovation, creating seamless, future-ready solutions for smart TVs and beyond.

    This blog begins by showcasing the unique features and capabilities of WebOS OSE and EnactJS. We then dive into the technical details of my development journey — building a TV app with a web-based UI that communicates with proprietary C++ modules. From designing the app’s architecture to overcoming platform-specific challenges, this guide is a practical resource for developers venturing into WebOS app development.

    What Makes WebOS OSE and EnactJS Stand Out?

    • Native-quality apps with web technologies: Develop lightweight, responsive apps using familiar HTML, CSS, and JavaScript.
    • Optimized for TV and beyond: EnactJS offers seamless D-pad navigation and localization for Smart TVs, along with modularity for diverse platforms like automotive and IoT.
    • Real-time integration with system APIs: Use Luna Bus to enable bidirectional communication between the UI and native services.
    • Scalability and customization: Component-based architecture allows easy scaling and adaptation of designs for different use cases.
    • Open source innovation: WebOS OSE provides an open, adaptable platform for developing cutting-edge applications.

    What Does This Guide Cover?

    The rest of this blog details my development experience, offering insights into the architecture, tools, and strategies for building TV apps:

    • R&D and Designing the Architecture
    • Choosing EnactJS for UI Development
    • Customizing UI Components for Flexibility
    • Navigation Strategy for TV Apps
    • Handling Emulation and Simulation Gaps
    • Setting Up the Development Machine for the Simulator
    • Setting Up the Development Machine for the Emulator
    • Real-Time Updates (Subscription) with Luna Bus Integration
    • Packaging, Deployment, and App Updates

    R&D and Designing the Architecture

    The app had to connect a web-based interface (HTML, CSS, JS) to proprietary C++ services interacting with system-level processes. This setup is uncommon for WebOS OSE apps, posing two core challenges:

    1. Limited documentation: Resources for WebOS app development were scarce.
    2. WebAssembly infeasibility: Converting the C++ module to WebAssembly would restrict access to system-level processes.

    Solution: An Intermediate C++ Service capable of interacting with both the UI and other C++ modules

    To bridge these gaps, I implemented an intermediate C++ service to:

    • Communicate between the UI and the proprietary C++ service.
    • Use Luna Bus APIs to send and receive messages.

    This approach not only solved the integration challenges but also laid a scalable foundation for future app functionality.

    Architecture

    The WebApp architecture employs MVVM (Model-View-ViewModel), Component-Based Architecture (CBA), and Atomic Design principles to achieve modularity, reusability, and maintainability.

    App Architecture Highlights:

    • WebApp frontend: Web-based UI using EnactJS.
    • External native service: Intermediate C++ service (w/ Client SDK) interacting with the UI via Luna Bus.
    Block Diagram of the App Architecture

    ‍Choosing EnactJS for UI Development

    With the integration architecture in place, I focused on UI development. The D-pad compatibility required for smart TVs narrowed the choice of frameworks to EnactJS, a React-based framework optimized for WebOS apps.

    Why EnactJS?

    • Built-in TV compatibility: Supports remote navigation out-of-the-box.
    • React-based syntax: Familiar for front-end developers.

    Customizing UI Components for Flexibility

    EnactJS’s default components had restrictive customization options and lacked the flexibility for the desired app design.

    Solution: A Custom Design Library

    I reverse-engineered EnactJS’s building blocks (e.g., Buttons, Toggles, Popovers) and created my own atomic components aligned with the app’s design.

    This approach helped in two key ways:

    1. Scalability: The design system allowed me to build complex screens using predefined components quickly.
    2. Flexibility: Complete control over styling and functionality.

    Navigation Strategy for TV Apps

    In the absence of any recommended navigation tool for WebOS, I employed a straightforward navigation model using condition-based routing:

    1. High-level flow selection: Determining the current process (e.g., Home, Settings).
    2. Step navigation: Tracking the user’s current step within the selected flow.

    This condition-based routing minimized complexity and avoided adding unnecessary tools like react-router.

    Handling Emulation and Simulation Gaps

    The WebOS OSE simulator was straightforward to use and compatible with Mac and Linux. However, testing the native C++ services needed a Linux-based emulator.

    The Problem: Slow Build Times Cause Slow Development

    Building and deploying code on the emulator had long cycles, drastically slowing development.

    Solution: Mock Services

    To mitigate this, I built a JavaScript-based mock service to replicate the native C++ functionality:

    • On Mac, I used the mock service for rapid UI iterations on the Simulator.
    • On Linux, I swapped the mock service with the real native service for final testing on the Emulator.

    This separation of development and testing environments streamlined the process, saving hours during the UI and flow development.

    Setting Up the Development Machine for the Simulator

    To set up your machine for WebApp development with a simulator, ensure you install the VSCode extensions — webOS Studio, Git, Python3, NVM, and Node.js.

    Install WebOS OSE CLI (ares) and configure the TV profile using ares-config. Then, clone the repository, install the dependencies, and run the WebApp in watch mode with npm run watch.

    Install the “webOS Studio” extension in VSCode and set up the WebOS TV 24 Simulator via the Package Manager or manually. Finally, deploy and test the app on the simulator using the extension and inspect logs directly from the virtual remote interface.

    Note: Ensure the profile is set to TV because the simulator works only with the TV profile.

    ares-config --profile tv

    Setting Up the Development Machine for the Emulator

    To set up your development machine for WebApp and Native Service development with an emulator, ensure you have a Linux machine and WebOS OSE CLI.

    Install essential tools like Git, GCC, Make, CMake, Python3, NVM, and VirtualBox.

    Build the WebOS Native Development Kit (NDK) using the build-webos repository, which may take 8–10 hours.

    Configure the emulator in VirtualBox and add it as a target device using ares-setup-device. Clone the repositories, build the WebApp and Native Service, package them into an IPK, install it on the emulator using ares-install, and launch the app with ares-launch.

    Setting Up the Target Device So the ares Command Can Identify the Emulator

    This step is required before you can install the IPK to the emulator.

    Note: To find the IP address of the WebOS Emulator, go to Settings -> Network -> Wired Connection.

    ares-setup-device --add target -i "host=192.168.1.1" -i "port=22" -i "username=root" -i "default=true"

    Real-Time Updates (Subscription) with Luna Bus Integration

    One feature required real-time updates from the C++ module to the UI. While the Luna Bus API provided a means to establish a subscription, I encountered challenges with:

    • Lifecycle Management: Re-subscriptions would fail due to improper cleanup.

    Solution: Custom Subscription Management

    I designed a custom logic layer for stable subscription management, ensuring seamless, real-time updates without interruptions.
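    A minimal sketch of such a layer; the Luna Bus subscribe/cancel functions are injected rather than taken from any real API, and the URI and signatures are assumptions:

    ```typescript
    // Sketch of a subscription-management layer that guarantees cleanup.
    type Cancel = () => void;
    type Subscribe = (uri: string, onData: (msg: unknown) => void) => Cancel;

    class SubscriptionManager {
      private active = new Map<string, Cancel>();

      constructor(private subscribe: Subscribe) {}

      // Always tear down any existing subscription for the URI before
      // re-subscribing; re-subscription failures came from skipping cleanup.
      open(uri: string, onData: (msg: unknown) => void): void {
        this.close(uri);
        this.active.set(uri, this.subscribe(uri, onData));
      }

      close(uri: string): void {
        const cancel = this.active.get(uri);
        if (cancel) {
          cancel();
          this.active.delete(uri);
        }
      }

      // Cleanup hook for app shutdown or screen unmount.
      closeAll(): void {
        for (const uri of Array.from(this.active.keys())) this.close(uri);
      }
    }
    ```

    Centralizing the cancel handles in one map means a screen can re-subscribe freely without leaking the previous subscription.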

    Packaging, Deployment, and App Updates

    Packaging

    Pack a dist of the Enact app, make the native service, and then use the ares-package command to build an IPK containing both the dist and the native service builds.

    npm run pack
    
    cd com.example.app.controller
    mkdir BUILD
    cd BUILD
    source /usr/local/webos-sdk-x86_64/environment-setup-core2-64-webos-linux
    cmake ..
    make
    
    ares-package -n app/dist webos/com.example.app.controller/pkg_x86_64

    Deployment

    The external native service will need to be packaged with the UI code to get an IPK, which can then be installed on the WebOS platform manually.

    ares-install com.example.app_1.0.0_all.ipk -d target
    ares-launch com.example.app -d target

    App Updates

    App updates need to be delivered as Firmware-Over-the-Air (FOTA) updates, which are based on libostree.

    WebOS OSE 2.0.0+ supports Firmware-Over-the-Air (FOTA) using libostree, a “git-like” system for managing Linux filesystem upgrades. It enables atomic version upgrades without reflashing by storing sysroots and tracking filesystem changes efficiently. The setup involves preparing a remote repository on a build machine, configuring webos-local.conf, and building a webos-image. Devices upgrade via commands that fetch and deploy rootfs revisions. Writable filesystem support (hotfix mode) allows temporary or persistent changes. Rollback requires manually reconfiguring boot deployment settings. FOTA is supported only on physical devices like the Raspberry Pi 4, not emulators, and it simplifies platform updates while conserving disk space.

    Key Learnings and Recommendations

    1. Mock Early, Test Real: Use mock services for UI development and switch to real services only during final integration.
    2. Build for Reusability: Custom components and a modular architecture saved time during iteration.
    3. Plan for Roadblocks: Niche platforms like WebOS require self-reliance and patience due to limited community support.

    Conclusion: Mastering WebOS Development — A Journey of Innovation

    Building a WebOS TV app was a rewarding challenge. With WebOS OSE and EnactJS, developers can create native-quality apps using familiar web technologies. WebOS OSE stands out for its high performance, seamless integration, and robust localization support, making it ideal for TV app development and beyond (automotive, IoT, and robotics). Pairing it with EnactJS, a React-based framework, simplifies the process with D-pad compatibility and optimized navigation for TV experiences.

    This project showed just how powerful WebOS and EnactJS can be in building apps that bridge web-based UIs and C++ backend services. Leveraging tools like Luna Bus for real-time updates, creating a custom design system, and extending EnactJS’s flexibility allowed for a smooth and scalable development process.

    The biggest takeaway is that developing for niche platforms like WebOS requires persistence, creativity, and the right approach. When you face roadblocks and there’s limited help available, try to come up with your own creative solutions, and persist! Keep iterating, learning, and embracing the journey, and you’ll be able to unlock exciting possibilities.

  • Centralized Governance of Data Lake, Data Fabric with adopted Data Mesh Setup

    This article explains data governance from the perspective of its connection with Data Mesh, Data Fabric, and Data Lakehouse architectures.

    Organizations across industries have multiple functional units, and data governance is needed to oversee the data assets and data flows connected to these business units, their security, and the processes governing the data products relevant to the business use cases.

    Let’s take a deep dive into data governance as the first step.  

    Data Governance

    The role of data governance also includes data democratization: it tracks data lineage, oversees data quality, and ensures compliance with regional regulations.

    Microsoft Purview differentiates itself with the 150+ compliance regulations covered under its Compliance Manager portal.

    Data governance utilizes artificial intelligence to boost data quality based on data profiling results and historical data quality experience.

    Master Data Management (MDM) helps store the organization’s common master data set across domains, with features such as data de-duplication and maintaining relationships across entities to give a 360-degree view. Having a unique dataset and role-based access control adds governance and supports business insights.

    Data governance helps in creating a data marketplace for the controlled exchange of golden-quality data products between data sources and consumers; AWS DataZone specializes in data marketplace capabilities.

    Reference data, along with master data management, helps with data standardization, which is relevant when data is exchanged between the organization, its subsidiaries, and partners at the industry level on the data marketplace platform.

    Remember that data governance is only feasible with collaboration between technical and business users.

    Technical users collect the data assets from the data sources, review the metadata and data quality, and enrich data quality by building data quality rules as applicable before storing the data.

    On the other hand, business users guide the building of the business glossary for each data asset down to column level, define the Critical Data Elements (CDEs), specify the sensitive data fields that should be masked or excluded before data is shared with consumers, and cooperate on data quality enrichment requests.

    Best practice is to follow a bottom-up approach between the business and technical users. Even after the data governance framework has been set up, governance tasks continue, which implies that business stakeholders should be well trained in the framework.

    Process automation is another stepping stone in data governance. For example, a workflow can be defined that notifies the data custodians about the data quality enrichment steps to be taken; once the data quality is revised, the workflow forwards the data set back to the marketplace to be consumed by data consumers.

    Data discovery is another automation step, in which a workflow scans the data sources for metadata details on a defined schedule and loads incremental data into the inventory, triggering subsequent tasks in the defined data flow.

    The data governance approach may change depending on the Data Mesh, Data Fabric, or Data Lakehouse architecture. Let’s dig deeper into this next.

    Data Mesh vs Data Fabric vs Data Lake Architectures

    In every organization’s data flow, there are multiple data sources that store data in different formats and mediums. Once connected to these data sources, the integration layer extracts, loads, and transforms (ELT) the data, saves it to a storage medium, and the data is consumed downstream. These data sources and consumers can be internal or external to the organization, depending on the extensibility and the use case involved in the business scenario.

    This lifecycle becomes heavy with the large piles of data sets in an organization. The complexity increases when data quality is poor, app connectors are not available, data integration is not smooth, and datasets are not discoverable.

    Rather than piling all the data sets into a single warehouse, organizations segregate the data products, apps, ELT, storage, and related processes across business units, an arrangement we term a Data Mesh architecture.

    Data Mesh at the domain level leads to decentralized data management, clear data accountability, and smooth data pipelines, and helps to discard any data silos that aren’t being used across domains.

    Most data pipelines flow within a particular domain’s data set, but some pipelines also go across domains. Data Fabric joins the data sets and pipelines across domains into an integrated architecture.

    Data virtualization and data orchestration techniques help to reduce the segregation of the technical landscape, but overall they impact performance and increase complexity.

    There is another setup approach that companies are interested in as part of digital transformation: migrating datasets from segregated storage mediums across different dimensions to a centralized Data Lakehouse.

    Data sets are loaded into a single Data Lakehouse, preferably in a Medallion architecture, starting with the Bronze layer holding the raw data.

    Next, after cleansing and transformation, the data is segregated on the same storage medium but across individual domains, building up the Silver layer.

    Finally, for analytics purposes, the Gold layer is prepared with a compatible dimensions-and-facts data model.
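    The three layers can be sketched as a toy in-memory pipeline; the record shapes and the sales example are purely illustrative and not tied to any specific platform:

    ```typescript
    // Toy sketch of the Bronze -> Silver -> Gold flow.
    interface RawSale { domain: string; amount: string }   // Bronze: raw, untyped
    interface CleanSale { domain: string; amount: number } // Silver: cleansed, per domain
    type GoldFact = Map<string, number>;                   // Gold: aggregated facts

    // Silver: cleanse and segregate per domain, dropping bad rows.
    function toSilver(bronze: RawSale[]): Map<string, CleanSale[]> {
      const byDomain = new Map<string, CleanSale[]>();
      for (const row of bronze) {
        const amount = Number(row.amount);
        if (Number.isNaN(amount)) continue; // cleansing step
        const bucket = byDomain.get(row.domain) ?? [];
        bucket.push({ domain: row.domain, amount });
        byDomain.set(row.domain, bucket);
      }
      return byDomain;
    }

    // Gold: a simple fact table of totals per domain for analytics.
    function toGold(silver: Map<string, CleanSale[]>): GoldFact {
      const facts: GoldFact = new Map();
      for (const [domain, rows] of silver) {
        facts.set(domain, rows.reduce((sum, r) => sum + r.amount, 0));
      }
      return facts;
    }
    ```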

    This centralized storage is like a Data Mesh adopted on a Data Lakehouse setup.

    Various clouds, Microsoft Fabric, and Databricks provide capabilities for this.

    Data Governance options

    Just as the implementation architecture can be centralized or decentralized, data governance follows the same pattern.

    Federated governance aligns with Data Mesh, while centralized governance fits the Data Fabric and Data Lakehouse architectures.

    Federated governance is justified in a complex legacy setup: a large organization with multiple branches across domains, each with its own domain-level local governance officers.

    These local governance officers track the data pipelines and govern access to the individual storage mediums, integration layers, and apps involved, so that whenever there is any change in a data set, the data catalog tool can collect the metadata of those changes.

    A centralized governance committee with data custodians handles the other two scenarios: the Data Fabric and Data Lakehouse setups.

    Take the example of a data fabric where data is spread across different storage mediums, say Databricks for machine learning, Snowflake for visualization reports, databases/files as data sources, and cloud services for data processing. In such a scenario, end-to-end centralized data governance is feasible via data virtualization and data orchestration services.

    Similar central-level governance applies where the complete implementation is on a single platform, say the AWS cloud platform.

    AWS Glue Data Catalog can be used for tracking the technical data assets, and AWS DataZone for data exchange between data sources and data consumers after tagging the business glossary to the technical assets.

    Azure with Microsoft Purview, Microsoft Fabric with Purview, Snowflake with Horizon, Databricks with Unity Catalog, AWS with Glue Data Catalog and DataZone: these and other platforms provide the scalability needed to store big data sets, build the Medallion architecture, and easily perform centralized data governance.

    Conclusion

    Overall, data governance is a framework that works hand in hand with Data Mesh, Data Fabric, Data Lakehouse, data quality, integration with data sources, consumers and apps, data storage, MDM, data modeling, data catalogs, security, process automation, and AI.

    Along with these technologies, data governance requires the support of business stakeholders, stewards, data analysts, data custodians, data operations engineers, and the Chief Data Officer; these profiles build up the data governance committee.

    Deciding between the Data Mesh, Data Fabric, and Data Lakehouse approaches depends on the organization’s current setup, the business units involved, the data distribution across the business units, and the business use cases.

    The current industry trend is to migrate distributed datasets and processes to a centralized Lakehouse as the preferred approach, with workspaces for the individual domains also supporting an adopted Data Mesh.

    This gives centralized data governance the upper hand: the capability to track data pipelines across domains, synchronize data across domains, trace at the column level from source to consumer via data lineage, apply role-based access control on domain-level data sets, and quickly and easily search for datasets, all on a single platform.

  • Protecting Your Mobile App: Effective Methods to Combat Unauthorized Access

    Introduction: The Digital World’s Hidden Dangers

    Imagine you’re running a popular mobile app that offers rewards to users. Sounds exciting, right? But what if a few clever users find a way to cheat the system for more rewards? This is exactly the challenge many app developers face today.

    In this blog, we’ll describe a real-world story of how we fought back against digital tricksters and protected our app from fraud. It’s like a digital detective story, but instead of solving crimes, we’re stopping online cheaters.

    Understanding How Fraudsters Try to Trick the System

    The Sneaky World of Device Tricks

    Let’s break down how users may try to outsmart mobile apps:

    One way is through device ID manipulation. What is this? Think of a device ID like a unique fingerprint for your phone. Normally, each phone has its own special ID that helps apps recognize it. But some users have found ways to change this ID, kind of like wearing a disguise.

    Real-world example: Imagine you’re at a carnival with a ticket that lets you ride each ride once. A fraudster might try to change their appearance to get multiple rides. In the digital world, changing a device ID is similar—it lets users create multiple accounts and get more rewards than they should.

    How Do People Create Fake Accounts?

    Users have become super creative in making multiple accounts:

    • Using special apps that create virtual phone environments
    • Playing with email addresses
    • Using temporary email services

    A simple analogy: It’s like someone trying to enter a party multiple times by wearing different costumes and using slightly different names. The goal? To get more free snacks or entry benefits.

    The Detective Work: How to Catch These Digital Tricksters

    Tracking User Behavior

    Modern tracking tools are like having a super-smart security camera that doesn’t just record but actually understands what’s happening. Here are some powerful tools you can explore:

    LogRocket: Your App’s Instant Replay Detective

    LogRocket records and replays user sessions, capturing every interaction, error, and performance hiccup. It’s like having a video camera inside your app, helping developers understand exactly what users experience in real time.

    Quick snapshot:

    • Captures user interactions
    • Tracks performance issues
    • Provides detailed session replays
    • Helps identify and fix bugs instantly

    Mixpanel: The User Behavior Analyst

    Mixpanel is a smart analytics platform that breaks down user behavior, tracking how people use your app, where they drop off, and what features they love most. It’s like having a digital detective who understands your users’ journey.

    Key capabilities:

    • Tracks user actions
    • Creates behavior segments
    • Measures conversion rates
    • Provides actionable insights

    What They Do:

    • Notice unusual account creation patterns
    • Detect suspicious activities
    • Prevent potential fraud before it happens

    Email Validation: The First Line of Defense

    How it works:

    • Recognize similar email addresses
    • Prevent creating multiple accounts with slightly different emails
    • Block tricks like:
      • a.bhi629@gmail.com
      • abhi.629@gmail.com

    Real-life comparison: It’s like a smart mailroom that knows “John Smith” and “J. Smith” are the same person, preventing duplicate mail deliveries.
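    One way to catch such variants is to canonicalize Gmail-style addresses so dot variants and plus aliases collapse to a single account key. A sketch (the domain rules shown are simplified):

    ```typescript
    // Canonicalize an email so Gmail dot variants and "+alias" tricks
    // map to the same key for duplicate-account detection.
    function canonicalEmail(email: string): string {
      const [local, domain] = email.trim().toLowerCase().split("@");
      if (domain === "gmail.com" || domain === "googlemail.com") {
        const withoutAlias = local.split("+")[0];            // drop "+promo" aliases
        const withoutDots = withoutAlias.replace(/\./g, ""); // Gmail ignores dots
        return `${withoutDots}@gmail.com`;
      }
      return `${local}@${domain}`;
    }
    ```

    Comparing canonical forms instead of raw addresses is what lets the system see a.bhi629@gmail.com and abhi.629@gmail.com as the same person.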

    Advanced Protection Strategies

    Device ID Tracking

    Key Functions:

    • Store unique device information
    • Check if a device has already claimed rewards
    • Prevent repeat bonus claims

    Simple explanation: Imagine a bouncer at a club who remembers everyone who’s already entered and stops them from sneaking in again.
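    A minimal sketch of this idea; in a real system the claimed set would live in a server-side database rather than in memory:

    ```typescript
    // Remember device IDs that have already claimed a reward and reject repeats.
    class RewardLedger {
      private claimed = new Set<string>();

      // Returns true if the claim is granted, false if this device already claimed.
      tryClaim(deviceId: string): boolean {
        if (this.claimed.has(deviceId)) return false;
        this.claimed.add(deviceId);
        return true;
      }
    }
    ```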

    Stopping Fake Device Environments

    Some users try to create fake device environments using apps like:

    • Parallel Space
    • Multiple account creators
    • Game cloners

    Protection method: The app identifies and blocks these applications, just like a security system that recognizes fake ID cards.
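    One common implementation is a package-name blocklist checked against the installed-app list. A sketch; the package identifiers below are examples that would need verifying against the actual apps:

    ```typescript
    // Known cloning/virtual-environment apps, identified by package name.
    // These identifiers are illustrative placeholders.
    const BLOCKED_PACKAGES = new Set([
      "com.example.parallelspace", // hypothetical Parallel Space identifier
      "com.example.appcloner",     // hypothetical cloner app
    ]);

    // Returns true if any installed package is on the blocklist.
    function hasBlockedApp(installedPackages: string[]): boolean {
      return installedPackages.some(pkg => BLOCKED_PACKAGES.has(pkg));
    }
    ```

    On Android the installed-package list would come from the platform's package manager; the check itself is just a set lookup.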

    Root Device Detection

    What is a Rooted Device? It’s like a phone that’s been modified to give users complete control, bypassing normal security restrictions.

    Detection techniques:

    • Check for special root access files
    • Verify device storage
    • Run specific detection commands

    Analogy: It’s similar to checking if a car has been illegally modified to bypass speed limits.
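    The first technique, checking for root access files, can be sketched as a lookup over well-known su binary locations. On Android this logic would live in Kotlin/Java; the file-existence check is injected here to keep the sketch self-contained:

    ```typescript
    // Common locations of the su binary on rooted devices.
    const SU_PATHS = [
      "/system/bin/su",
      "/system/xbin/su",
      "/sbin/su",
    ];

    // Returns true if any known su path exists, suggesting a rooted device.
    function looksRooted(fileExists: (path: string) => boolean): boolean {
      return SU_PATHS.some(p => fileExists(p));
    }
    ```

    Real root checkers combine several signals (files, build tags, detection commands), since any single check is easy to spoof.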

    Extra Security Layers

    Android Version Requirements

    Upgrading to newer Android versions provides additional security:

    • Better detection of modified devices
    • Stronger app protection
    • More restricted file access

    Simple explanation: It’s like upgrading your home’s security system to a more advanced model that can detect intruders more effectively.

    Additional Protection Methods

    • Data encryption
    • Secure internet communication
    • Location verification
    • Encrypted local storage

    Think of these as multiple locks on your digital front door, each providing an extra layer of protection.

    Real-World Implementation Challenges

    Why is This Important?

    Every time a fraudster successfully tricks the system:

    • The app loses money
    • Genuine users get frustrated
    • Trust in the platform decreases

    Business impact: Imagine running a loyalty program where some people find ways to get 10 times more rewards than others. Not fair, right?

    Practical Tips for App Developers

    • Always stay updated with the latest security trends
    • Regularly audit your app’s security
    • Use multiple protection layers
    • Be proactive, not reactive
    • Learn from each attempted fraud

    Common Misconceptions About App Security

    Myth: “My small app doesn’t need advanced security.” Reality: Every app, regardless of size, can be a target.

    Myth: “Security is a one-time setup.” Reality: Security is an ongoing process of learning and adapting.

    Learning from Real Experiences

    These examples come from actual developers at Velotio Technologies, who faced these challenges head-on. Their approach wasn’t about creating an unbreakable system but about making fraud increasingly difficult and expensive.

    The Human Side of Technology

    Behind every security feature is a human story:

    • Developers protecting user experiences
    • Companies maintaining trust
    • Users expecting fair treatment

    Looking to the Future

    Technology will continue evolving, and so, too, will fraud techniques. The key is to:

    • Stay curious
    • Keep learning
    • Never assume you know everything

    Final Thoughts: Your App, Your Responsibility

    Protecting your mobile app isn’t just about implementing complex technical solutions; it’s about a holistic approach that encompasses understanding user behavior, creating fair experiences, and building trust. Here’s a deeper look into these critical aspects:

    Understanding User Behavior:

    Understanding how users interact with your app is crucial. By analyzing user behavior, you can identify patterns that may indicate fraudulent activity. For instance, if a user suddenly starts claiming rewards at an unusually high rate, it could signal potential abuse.
    Utilize analytics tools to gather data on user interactions. This data can help you refine your app’s design and functionality, ensuring it meets genuine user needs while also being resilient against misuse.

    Creating Fair Experiences:

    Clearly communicate your app’s rewards, account creation, and user behavior policies. Transparency helps users understand the rules and reduces the likelihood of attempts to game the system.
    Consider implementing a user agreement that outlines acceptable behavior and the consequences of fraudulent actions.

    Building Trust:

    Maintain open lines of communication with your users. Regular updates about security measures, app improvements, and user feedback can help build trust and loyalty.
    Use newsletters, social media, and in-app notifications to keep users informed about changes and enhancements.
    Provide responsive customer support to address user concerns promptly. If users feel heard and valued, they are less likely to engage in fraudulent behavior.

    Implement a robust support system that allows users to report suspicious activities easily and receive timely assistance.

    Remember: Every small protection measure counts.

    Call to Action

    Are you an app developer? Start reviewing your app’s security today. Don’t wait for a fraud incident to take action.

    Want to learn more?

    • Follow security blogs
    • Attend tech conferences
    • Connect with security experts
    • Never stop learning
  • Data Engineering: Beyond Big Data

    When a data project comes to mind, the end goal is to enhance the data. It’s about building systems to curate the data in a way that can help the business.

    At the dawn of their data engineering journey, people tend to familiarize themselves with the terms “extraction,” “transformation,” and “loading.” These terms, along with traditional data engineering, spark the image that data engineering is about the processing and movement of large amounts of data. And why not! We’ve witnessed a tremendous evolution in these technologies, from storing information in simple spreadsheets to managing massive data warehouses and data lakes, supported by advanced infrastructure capable of ingesting and processing huge data volumes.

    However, this doesn’t limit data engineering to ETL; rather, it opens so many opportunities to introduce new technologies and concepts that can and are needed to support big data processing. The expectations from a modern data system extend well beyond mere data movement. There’s a strong emphasis on privacy, especially with the vast amounts of sensitive data that need protection. Speed is crucial, particularly in real-world scenarios like satellite data processing, financial trading, and data processing in healthcare, where eliminating latency is key.

    With technologies like AI and machine learning driving analysis on massive datasets, data volumes will inevitably continue to grow. We’ve seen this trend before, just as we once spoke of megabytes and now regularly discuss gigabytes. In the future, we’ll likely talk about terabytes and petabytes with the same familiarity.

    These growing expectations have made data engineering a sphere with numerous supporting components, and in this article, we’ll delve into some of those components.

    • Data governance
    • Metadata management
    • Data observability
    • Data quality
    • Orchestration
    • Visualization

    Data Governance

    With huge amounts of confidential business and user data moving around, handling it safely is a very delicate process. We must ensure trust in data processes, and the data itself cannot be compromised. It is essential for a business onboarding users to show that their data is in safe hands. These days, when a business needs sensitive information from you, you’ll be bound to ask questions such as:

    • What if my data is compromised?
    • Are we putting it to the right use?
    • Who’s in control of this data? Are the right personnel using it?
    • Is it compliant with the rules and regulations for data practices?

    So, to answer these questions satisfactorily, data governance comes into the picture. The basic idea of data governance is that it’s a set of rules, policies, principles, or processes to maintain data integrity. It’s about how we can supervise our data and keep it safe. Think of data governance as a protective blanket that takes care of all the security risks, creates a habitable environment for data, and builds trust in data processing.

    Data governance is a powerful tool in the data engineering arsenal. These rules and principles are consistently applied throughout all data processing activities. Wherever data flows, data governance ensures that data adheres to these established protocols. By adding a sense of trust to the activities involving data, you gain the freedom to focus on your data solution without worrying about any external or internal risks. This helps in reaching the ultimate goal: to foster a culture that prioritizes and emphasizes data responsibility.

    Understanding the extensive application of data governance in data engineering clearly illustrates its significance and where it needs to be implemented in real-world scenarios. In numerous entities, such as government organizations or large corporations, data sensitivity is a top priority. Misuse of this data can have widespread negative impacts. To ensure that it doesn’t happen, we can use tools to ensure oversight and compliance. Let’s briefly explore one of those tools.

    Microsoft Purview

    Microsoft Purview comes with a range of solutions to protect your data. Let’s look at some of its offerings.

    • Insider risk management
      • Microsoft Purview takes care of data security risks from people inside your organization by identifying high-risk individuals.
      • It helps you classify data breaches into different sections and take appropriate action to prevent them.
    • Data loss prevention
      • It makes applying data loss prevention policies straightforward.
      • It secures data by restricting important and sensitive data from being deleted and blocks unusual activities, like sharing sensitive data outside your organization.
    • Compliance adherence
      • Microsoft Purview can help you make sure that your data processes are compliant with data regulatory bodies and organizational standards.
    • Information protection
      • It provides granular control over data, allowing you to define strict accessibility rules.
      • When you need to manage what data can be shared with specific individuals, this control restricts the data visible to others.
    • Know your sensitive data
      • It simplifies the process of understanding and learning about your data.
      • MS Purview features ML-based classifiers that label and categorize your sensitive data, helping you identify its specific category.

    Metadata Management

    Another essential aspect of big data movement is metadata management. 

    Metadata, simply put, is data about data. This component of data engineering makes a base for huge improvements in data systems.

    You might have come across a headline about Instagram’s like counts a while back; it also reappeared recently.

    This story is from about a decade ago, and it tells us about metadata’s longevity and how it became a base for greater things.

    At the time, Instagram showed the number of likes by running a count function on the database and storing it in a cache. This method was fine because the number wouldn’t change frequently, so the request would hit the cache and get the result. Even if the number changed, the request would query the data, and because the number was small, it wouldn’t scan a lot of rows, saving the data system from being overloaded.

    However, when a celebrity posted something, it’d receive so many likes that the count would be enormous and change so frequently that looking into the cache became just an extra step.

    The request would trigger a query that would repeatedly scan many rows in the database, overloading the system and causing frequent crashes.

    To deal with this, Instagram came up with the idea of denormalizing the tables and storing the number of likes for each post. So, the request would result in a query where the database needs to look at only one cell to get the number of likes. To handle the issue of frequent changes in the number of likes, Instagram began updating the value at small intervals. This story tells how Instagram solved this problem with a simple tweak of using metadata. 
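    The tweak can be sketched as a tiny in-memory version of the denormalized counter with interval flushing; the actual Instagram implementation sits on a database, so the storage layer is elided here:

    ```typescript
    // Instead of COUNT(*) over a likes table, keep a per-post counter that
    // absorbs bursts of likes and is folded into storage at intervals.
    class LikeCounter {
      private pending = new Map<string, number>(); // unflushed increments
      private stored = new Map<string, number>();  // the denormalized column

      like(postId: string): void {
        this.pending.set(postId, (this.pending.get(postId) ?? 0) + 1);
      }

      // Called on a timer: fold pending increments into the stored count,
      // so a read touches a single cell instead of scanning many rows.
      flush(): void {
        for (const [postId, n] of this.pending) {
          this.stored.set(postId, (this.stored.get(postId) ?? 0) + n);
        }
        this.pending.clear();
      }

      likeCount(postId: string): number {
        return this.stored.get(postId) ?? 0; // single-cell read
      }
    }
    ```

    The trade-off is visible in the sketch: reads become cheap and burst-proof, at the cost of counts that lag slightly behind reality between flushes.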

    Metadata in data engineering has evolved to solve even more significant problems by adding a layer on top of the data flow that works as an interface to communicate with data. Metadata management has become a foundation of multiple data features such as:

    • Data lineage: Stakeholders are interested in the results we get from data processes. Sometimes, in order to check the authenticity of data and get answers to questions like where the data originated from, we need to track back to the data source. Data lineage is a property that makes use of metadata to help with this scenario. Many data products like Atlan and data warehouses like Snowflake extensively use metadata for their services.
    • Schema information: With a clear understanding of your data’s structure, including column details and data types, we can efficiently troubleshoot and resolve data modeling challenges.
    • Data contracts: Metadata helps honor data contracts by keeping a common data profile, which maintains a common data structure across all data usages.
    • Stats: Managing metadata can help us easily access data statistics while also giving us quick answers to questions like what the total count of a table is, how many distinct records there are, how much space it takes, and many more.
    • Access control: Metadata management also includes having information about data accessibility. As we saw with MS Purview’s features, we can associate a table with vital information and restrict the visibility of a table, or even a single column, to the right people.
    • Audit: Keeping track of information, like who accessed the data, who modified it, or who deleted it, is another important feature that a product with multiple users can benefit from.
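To make the list above concrete, here is a minimal sketch of a metadata record that covers several of these use cases in one place. The class and field names are illustrative assumptions, not any particular catalog's schema:

```python
from dataclasses import dataclass, field

@dataclass
class TableMetadata:
    """Toy metadata record: schema, lineage, stats, access control, audit."""
    name: str
    schema: dict[str, str]                                      # column -> data type
    upstream_sources: list[str] = field(default_factory=list)   # lineage
    row_count: int = 0                                          # stats
    allowed_roles: set[str] = field(default_factory=set)        # access control
    audit_log: list[str] = field(default_factory=list)          # audit trail

    def can_read(self, role: str) -> bool:
        # Every access check is recorded, giving us the audit use case for free.
        self.audit_log.append(f"read-check:{role}")
        return role in self.allowed_roles
```

A real catalog (Atlan, Glue, Purview) persists and federates records like this across thousands of datasets, but the shape of the information is much the same.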

    There are many other use cases of metadata that enhance data engineering. It’s positively impacting the current landscape and shaping the future trajectory of data engineering. A very good example is a data catalog. Data catalogs focus on enriching datasets with information about data. Table formats, such as Iceberg and Delta, use catalogs to provide integration with multiple data sources, handle schema evolution, etc. Popular cloud services like AWS Glue also use metadata for features like data discovery. Tech giants like Snowflake and Databricks rely heavily on metadata for features like faster querying, time travel, and many more. 

    With the introduction of AI in the data domain, metadata management has a huge effect on the future trajectory of data engineering. Services such as Cortex and Fabric have integrated AI systems that use metadata for easy questioning and answering. When AI gets to know the context of data, the application of metadata becomes limitless.

    Data Observability

    We know how important metadata can be, and while it’s important to know your data, it’s as important to know about the processes working on it. That’s where observability enters the discussion. It is another crucial aspect of data engineering and a component we can’t miss from our data project. 

    Data observability is about setting up systems that can give us visibility over different services that are working on the data. Whether it’s ingestion, processing, or load operations, having visibility into data movement is essential. This not only ensures that these services remain reliable and fully operational, but it also keeps us informed about the ongoing processes. The ultimate goal is to proactively manage and optimize these operations, ensuring efficiency and smooth performance. We need to achieve this goal because it’s very likely that whenever we create a data system, multiple issues, as well as errors and bugs, will start popping out of nowhere.

    So, how do we keep an eye on these services to see whether they are performing as expected? The answer to that is setting up monitoring and alerting systems.

    Monitoring

    Monitoring is the continuous tracking and measurement of key metrics and indicators that tell us about the system’s performance. Many cloud services offer comprehensive performance metrics, presented through interactive visuals. These tools provide valuable insights, such as throughput, which measures the volume of data processed per second, and latency, which indicates how long it takes to process the data. They track errors and error rates, detailing the types and how frequently they happen.

    To lay the base for monitoring, there are tools like Prometheus and Datadog, which provide these monitoring features, indicating the performance of data systems and their underlying infrastructure. We also have Graylog, which offers multiple features for monitoring a system’s logs in real time.
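Before reaching for those tools, it helps to see what the indicators themselves look like. This is a deliberately simplified in-memory tracker (names are illustrative); in practice you would export such numbers to a system like Prometheus rather than hold them in a Python object:

```python
import statistics
from typing import Optional

class PipelineMetrics:
    """Toy monitor for the indicators discussed above: throughput proxy
    (processed count), latency, and error counts by type."""

    def __init__(self):
        self.processed = 0
        self.errors = {}           # error type -> count
        self.latencies_ms = []

    def record(self, latency_ms: float, error: Optional[str] = None) -> None:
        self.processed += 1
        self.latencies_ms.append(latency_ms)
        if error:
            self.errors[error] = self.errors.get(error, 0) + 1

    def error_rate(self) -> float:
        return sum(self.errors.values()) / self.processed if self.processed else 0.0

    def p50_latency_ms(self) -> float:
        return statistics.median(self.latencies_ms) if self.latencies_ms else 0.0
```

Dashboards in Datadog or Grafana are, at heart, visualizations over exactly these kinds of counters and distributions.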

    Now that we have systems that give us visibility into the performance of processes, we need a setup that notifies us when anything goes sideways.

    Alerting

    Setting up alerting systems allows us to receive notifications directly within the applications we use regularly, eliminating the need for someone to constantly monitor metrics on a UI or watch graphs all day, which would be a waste of time and resources. This is why alerting systems are designed to trigger notifications based on predefined thresholds, such as throughput dropping below a certain level, latency exceeding a specific duration, or the occurrence of specific errors. These alerts can be sent to channels like email or Slack, ensuring that users are immediately aware of any unusual conditions in their data processes.
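The threshold logic described above reduces to a simple comparison of current metrics against predefined limits. Here is a hedged sketch (metric and threshold names are assumptions); a real system would push the resulting messages to email or a Slack webhook rather than return them:

```python
def evaluate_alerts(metrics: dict, thresholds: dict) -> list[str]:
    """Compare current metrics against predefined thresholds and return
    alert messages for any condition that is out of bounds."""
    alerts = []
    if metrics.get("throughput", float("inf")) < thresholds.get("min_throughput", 0):
        alerts.append(f"throughput below {thresholds['min_throughput']}")
    if metrics.get("latency_ms", 0) > thresholds.get("max_latency_ms", float("inf")):
        alerts.append(f"latency above {thresholds['max_latency_ms']} ms")
    if metrics.get("error_count", 0) > 0:
        alerts.append(f"{metrics['error_count']} errors observed")
    return alerts
```

Running this periodically against the monitoring system's latest readings is the essence of what alert managers automate, plus routing, deduplication, and escalation.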

    Implementing observability will significantly impact data systems. By setting up monitoring and alerting, we can quickly identify issues as they arise and gain context about the nature of the errors. This insight allows us to pinpoint the source of problems, effectively debug and rectify them, and ultimately reduce downtime and service disruptions, saving valuable time and resources.

    Data Quality

    Knowing the data and its processes is undoubtedly important, but all this knowledge is futile if the data itself is of poor quality. That’s where the other essential component of data engineering, data quality, comes into play because data processing is one thing; preparing the data for processing is another.

    In a data project involving multiple sources and formats, various discrepancies are likely to arise. These can include missing values, where essential data points are absent; outdated data, which no longer reflects current information; poorly formatted data that doesn’t conform to expected standards; incorrect data types that lead to processing errors; and duplicate rows that skew results and analyses. Addressing these issues will ensure the accuracy and reliability of the data used in the project.
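The discrepancies listed above can all be detected mechanically. The sketch below is a hand-rolled stand-in for tools like Great Expectations or Deequ (function and parameter names are illustrative): it flags missing values, wrong types, and duplicate rows in a batch of records.

```python
def quality_report(rows, required, types):
    """Return a list of data-quality issues found in `rows`:
    missing required values, type mismatches, and duplicate rows."""
    issues = []
    seen = set()
    for i, row in enumerate(rows):
        for col in required:
            if row.get(col) in (None, ""):
                issues.append(f"row {i}: missing {col}")
        for col, typ in types.items():
            if col in row and row[col] is not None and not isinstance(row[col], typ):
                issues.append(f"row {i}: {col} is not {typ.__name__}")
        key = tuple(sorted(row.items()))   # canonical form for duplicate detection
        if key in seen:
            issues.append(f"row {i}: duplicate")
        seen.add(key)
    return issues
```

Dedicated tools add what this sketch lacks: declarative expectation suites, profiling, and reports that plug into the pipeline itself.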

    Data quality involves enhancing data with key attributes. For instance, accuracy measures how closely the data reflects reality, validity ensures that the data accurately represents what we aim to measure, and completeness guarantees that no critical data is missing. Additionally, attributes like timeliness ensure the data is up to date. Ultimately, data quality is about embedding attributes that build trust in the data. For a deeper dive into this, check out Rita’s blog on Data QA: The Need of the Hour.

    Data quality plays a crucial role in elevating other processes in data engineering. In a data engineering project, there are often multiple entry points for data processing, with data being refined at different stages to achieve a better state each time. Assessing data at the source of each processing stage and addressing issues early on is vital. This approach ensures that data standards are maintained throughout the data flow. As a result, by making data consistent at every step, we gain improved control over the entire data lifecycle. 

    Data tools like Great Expectations and data unit test libraries such as Deequ play a crucial role in safeguarding data pipelines by implementing data quality checks and validations. To gain more context on this, you might want to read Unit Testing Data at Scale using Deequ and Apache Spark by Nishant. These tools ensure that data meets predefined standards, allowing for early detection of issues and maintaining the integrity of data as it moves through the pipeline.

    Orchestration

    With so many processes in place, it’s essential to ensure everything happens at the right time and in the right way. Relying on someone to manually trigger processes at scheduled times every day is an inefficient use of resources. For that individual, performing the same repetitive tasks can quickly become monotonous. Beyond that, manual execution increases the risk of missing schedules or running tasks out of order, disrupting the entire workflow.

    This is where orchestration comes to the rescue, automating tedious, repetitive tasks and ensuring precision in the timing of data flows. Data pipelines can be complex, involving many interconnected components that must work together seamlessly. Orchestration ensures that each component follows a defined set of rules, dictating when to start, what to do, and how to contribute to the overall process of handling data, thus maintaining smooth and efficient operations.

    This automation helps reduce errors that could occur with manual execution, ensuring that data processes remain consistent by streamlining repetitive tasks. With a number of different orchestration tools and services in place, we can now monitor and manage everything from a single platform. Tools like Airflow, an open-source orchestrator, Prefect, which offers a user-friendly drag-and-drop interface, and cloud services such as Azure Data Factory, Google Cloud Composer, and AWS Step Functions, enhance our visibility and control over the entire process lifecycle, making data management more efficient and reliable. Don’t miss Shreyash’s excellent blog on Mage: Your New Go-To Tool for Data Orchestration.

    Orchestration is built on a foundation of multiple concepts and technologies that make it robust and fail-safe. These underlying principles ensure that orchestration not only automates processes but also maintains reliability and resilience, even in complex and demanding data environments.

    • Workflow definition: This defines how tasks in the pipeline are organized and executed. It lays out the sequence of tasks—telling it what needs to be finished before other tasks can start—and takes care of other conditions for pipeline execution. Think of it like a roadmap that guides the flow of tasks.
    • Task scheduling: This determines when and how tasks are executed. Tasks might run at specific times, in response to events, or based on the completion of other tasks. It’s like scheduling appointments for tasks to ensure they happen at the right time and with the right resources.
    • Dependency management: Since tasks often rely on each other, with the concepts of dependency management, we can ensure that tasks run in the correct order. It ensures that each process starts only when its prerequisites are met, like waiting for a green light before proceeding.
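The three concepts above can be demonstrated with a toy workflow runner. This is a deliberately minimal sketch, not how Airflow or Prefect are implemented (it omits scheduling triggers, retries, and cycle detection), but it shows dependency management in action: each task runs only after its prerequisites finish.

```python
def run_workflow(tasks: dict, deps: dict) -> list:
    """Execute `tasks` (name -> callable) so that every task in `deps`
    runs only after all of its prerequisites have completed."""
    done, order = set(), []

    def run(name):
        if name in done:
            return
        for prereq in deps.get(name, []):
            run(prereq)            # wait for the green light
        tasks[name]()
        done.add(name)
        order.append(name)

    for name in tasks:
        run(name)
    return order
```

Declaring `{"transform": ["extract"], "load": ["transform"]}` is the workflow definition; the recursion is the dependency management; a real orchestrator adds the task scheduling layer on top (cron triggers, events, sensors).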

    With these concepts, orchestration tools provide powerful features for workflow design and management, enabling the definition of complex, multi-step processes. They support parallel, sequential, and conditional execution of tasks, allowing for flexibility in how workflows are executed. Not just that, they also offer event-driven and real-time orchestration, enabling systems to respond to dynamic changes and triggers as they occur. These tools also include robust error handling and exception management, ensuring that workflows are resilient and fault-tolerant.

    Visualization

    The true value lies not just in collecting vast amounts of data but in interpreting it in ways that generate real business value. This makes visualization a vital component of data engineering: it provides a clear, accurate representation of data that decision-makers can easily understand and use. Presenting data the right way enables businesses to extract intelligence from it, which is what makes data engineering worth the investment, guides strategic decisions, optimizes operations, and powers innovation.

    Visualizations allow us to see patterns, trends, and anomalies that might not be apparent in raw data. Whether it’s spotting a sudden drop in sales, detecting anomalies in customer behavior, or forecasting future performance, data visualization can provide the clear context needed to make well-informed decisions. When numbers and graphs are presented effectively, it feels as though we are directly communicating with the data, and this language of communication bridges the gap between technical experts and business leaders.

    Visualization Within ETL Processes

    Visualization isn’t just a final output. It can also be a valuable tool within the data engineering process itself. Intermediate visualization during the ETL workflow can be a game-changer. In collaborative teams, as we go through the transformation process, visualizing it at various stages helps ensure the accuracy and relevance of the result. We can understand the datasets better, identify issues or anomalies between different stages, and make more informed decisions about the transformations needed.

    Technologies like Fabric and Mage enable seamless integration of visualizations into ETL pipelines. These tools empower team members at all levels to actively engage with data, ask insightful questions, and contribute to the decision-making process. Visualizing datasets at key points provides the flexibility to verify that data is being processed correctly, develop accurate analytical formulas, and ensure that the final outputs are meaningful.

    Depending on the industry and domain, there are various visualization tools suited to different use cases. For example, 

    • For real-time insights, which are crucial in industries like healthcare, financial trading, and air travel, tools such as Tableau and Striim are invaluable. These tools allow for immediate visualization of live data, enabling quick and informed decision-making.
    • For broad data source integrations and dynamic dashboard querying, often demanded in the technology sector, tools like Power BI, Metabase, and Grafana are highly effective. These platforms support a wide range of data sources and offer flexible, interactive dashboards that facilitate deep analysis and exploration of data.

    It’s Limitless

    We are seeing many advancements in this domain, which are helping businesses, data science, AI and ML, and many other sectors because the potential of data is huge. If a business knows how to use data, it can be a major factor in its success. And for that reason, we have constantly seen the rise of different components in data engineering. All with one goal: to make data useful.

    Recently, we’ve witnessed the introduction of numerous technologies poised to revolutionize the data engineering domain. Concepts like data mesh are enhancing data discovery, improving data ownership, and streamlining data workflows. AI-driven data engineering is rapidly advancing, with expectations to automate key processes such as data cleansing, pipeline optimization, and data validation. We’re already seeing how cloud data services have evolved to embrace AI and machine learning, ensuring seamless integration with data initiatives. The rise of real-time data processing brings new use cases and advancements, while practices like DataOps foster better collaboration among teams. Take a closer look at the modern data stack in Shivam’s detailed article, Modern Data Stack: The What, Why, and How?

    These developments are accompanied by a wide array of technologies designed to support infrastructure, analytics, AI, and machine learning, alongside enterprise tools that lay the foundation for this ongoing evolution. All these elements collectively set the stage for a broader discussion on data engineering and what lies beyond big data. Big data, supported by these satellite activities, aims to extract maximum value from data, unlocking its full potential.

    References:

    1. Velotio – Data Engineering Blogs
    2. Firstmark
    3. MS Purview Data Security
    4. Tech Target – Article on data quality
    5. Splunk – Data Observability: The Complete Introduction
    6. Instagram crash story – WIRED

  • React Native: Session Replay with Microsoft Clarity

    Microsoft recently launched session replay support for iOS on both Native iOS and React Native applications. We decided to see how it performs compared to competitors like LogRocket and UXCam.

    This blog discusses what session replay is, how it works, and its benefits for debugging applications and understanding user behavior. We will also quickly integrate Microsoft Clarity in React Native applications and compare its performance with competitors like LogRocket and UXCam.

    Below, we will explore the key features of session replay, the steps to integrate Microsoft Clarity into your React Native application, and benchmark its performance against other popular tools.

    Key Features of Session Replay

    Session replay provides a visual playback of user interactions on your application. This allows developers to observe how users navigate the app, identify any issues they encounter, and understand user behavior patterns. Here are some of the standout features:

    • User Interaction Tracking: Record clicks, scrolls, and navigation paths for a comprehensive view of user activities.
    • Error Monitoring: Capture and analyze errors in real time to quickly diagnose and fix issues.
    • Heatmaps: Visualize areas of high interaction to understand which parts of the app are most engaging.
    • Anonymized Data: Ensure user privacy by anonymizing sensitive information during session recording.

    Integrating Microsoft Clarity with React Native

    Integrating Microsoft Clarity into your React Native application is a straightforward process. Follow these steps to get started:

    1. Sign Up for Microsoft Clarity:

    a. Visit the Microsoft Clarity website and sign up for a free account.

    b. Create a new project and obtain your Clarity tracking code.

    2. Install the Clarity SDK:

    Use npm or yarn to install the Clarity SDK in your React Native project:

    npm install @microsoft/react-native-clarity
    yarn add @microsoft/react-native-clarity

    3. Initialize Clarity in Your App:

    Import and initialize Clarity in your main application file (e.g., App.js):

    import { initialize } from '@microsoft/react-native-clarity';
    initialize('YOUR_CLARITY_PROJECT_ID');

    4. Verify Integration:

    a. Run your application and navigate through various screens to ensure Clarity is capturing session data correctly.

    b. Log into your Clarity dashboard to see the recorded sessions and analytics.

    Benchmarking Against Competitors

    To evaluate the performance of Microsoft Clarity, we’ll compare it against two popular session replay tools, LogRocket and UXCam, assessing them based on the following criteria:

    • Ease of Integration: How simple is integrating the tool into a React Native application?
    • Feature Set: What features does each tool offer for session replay and user behavior analysis?
    • Performance Impact: How does the tool impact the app’s performance and user experience?
    • Cost: What are the pricing models and how do they compare?

    Detailed Comparison

    Ease of Integration

    • Microsoft Clarity: The integration process is straightforward and well-documented, making it easy for developers to get started.
    • LogRocket: LogRocket also offers a simple integration process with comprehensive documentation and support.
    • UXCam: UXCam provides detailed guides and support for integration, but it may require additional configuration steps compared to Clarity and LogRocket.

    Feature Set

    • Microsoft Clarity: Offers robust session replay, heatmaps, and error monitoring. However, it may lack some advanced features found in premium tools.
    • LogRocket: Provides a rich set of features, including session replay, performance monitoring, Network request logs, and integration with other tools like Redux and GraphQL.
    • UXCam: Focuses on mobile app analytics with features like session replay, screen flow analysis, and retention tracking.

    Performance Impact

    • Microsoft Clarity: Minimal impact on app performance, making it a suitable choice for most applications.
    • LogRocket: Slightly heavier than Clarity but offers more advanced features. Performance impact is manageable with proper configuration.
    • UXCam: Designed for mobile apps with performance optimization in mind. The impact is generally low but can vary based on app complexity.

    Cost

    • Microsoft Clarity: Free to use, making it an excellent option for startups and small teams.
    • LogRocket: Offers tiered pricing plans, with a free tier for basic usage and paid plans for advanced features.
    • UXCam: Provides a range of pricing options, including a free tier. Paid plans offer more advanced features and higher data limits.

    Final Verdict

    After evaluating the key aspects of session replay tools, Microsoft Clarity stands out as a strong contender, especially for teams looking for a cost-effective solution with essential features. LogRocket and UXCam offer more advanced capabilities, which may be beneficial for larger teams or more complex applications.

    Ultimately, the right tool will depend on your specific needs and budget. For basic session replay and user behavior insights, Microsoft Clarity is a fantastic choice. If you require more comprehensive analytics and integrations, LogRocket or UXCam may be worth the investment.

    Sample App

    I have also created a basic sample app to demonstrate how to set up Microsoft Clarity for React Native apps.

    Please check it out here: https://github.com/rakesho-vel/ms-rn-clarity-sample-app

    This sample video shows how Microsoft Clarity records and lets you review user sessions on its dashboard.

    References

    1. https://clarity.microsoft.com/blog/clarity-sdk-release/
    2. https://web.swipeinsight.app/posts/microsoft-clarity-finally-launches-ios-sdk-8312

  • Exploring WidgetKit: Enhancing iOS Experience with Widgets

    Introduction:

    In the fast-paced world of mobile technology, iOS widgets stand out as dynamic tools that enhance user engagement and convenience. With iOS 14’s introduction of widgets, Apple has empowered developers to create versatile, interactive components that provide valuable information and functionality right from the Home screen.

    In this blog, we’ll delve into the world of iOS widgets and explore how to build them with WidgetKit to create exceptional user experiences.

    Understanding WidgetKit:

    WidgetKit is a framework provided by Apple that simplifies creating and managing widgets for iOS, iPadOS, and macOS. It offers a set of APIs and tools that enable developers to easily design, develop, and deploy widgets. WidgetKit handles various aspects of widget development, including data management, layout rendering, and update scheduling, allowing developers to focus on creating compelling widget experiences.

    Key Components of WidgetKit:

    • Widget Extension: A widget extension is a separate target within an iOS app project responsible for defining and managing the widget’s behavior, appearance, and data.
    • Widget Configuration: The widget configuration determines the appearance and behavior of the widget displayed on the Home screen. It includes attributes such as the widget’s name, description, supported sizes, and placeholder content.
    • Timeline Provider: The timeline provider supplies the widget with dynamic content based on predefined schedules or user interactions.
    • Widget Views: Widget views are SwiftUI views used to define the layout and presentation of the widget’s content.

    Understanding iOS Widgets:

    Widgets offer a convenient way to present timely and relevant information from your app or provide quick access to app features directly on the device’s Home screen. Introduced in iOS 14, widgets come in various sizes and can showcase a wide range of content, including weather forecasts, calendar events, news headlines, and app-specific data.

    Benefits of iOS Widgets:

    • Enhanced Accessibility: Widgets enable users to access important information and perform tasks without navigating away from the Home screen, saving time and effort.
    • Increased Engagement: By displaying dynamic content and interactive elements, widgets encourage users to interact with apps more frequently, leading to higher engagement rates.
    • Personalization: Users can customize their Home screen by adding, resizing, and rearranging widgets to suit their preferences and priorities.
    • Improved Productivity: Widgets provide at-a-glance updates on calendar events, reminders, and to-do lists, helping users stay organized and productive throughout the day.

    Widget Sizes

    Widget sizes refer to the dimensions and layouts available for widgets on different platforms and devices. In the context of iOS, iPadOS, and macOS, widgets come in various sizes, each offering a distinct layout and content display. 

    These sizes are designed to accommodate different amounts of information and fit various screen sizes, ensuring a consistent user experience across devices. 

    Here are the common widget sizes available:

    • Small: This size is compact, displaying essential information in a concise format. Small widgets are ideal for providing quick updates or notifications without taking up much space on the screen.
    • Medium: Medium-sized widgets offer slightly more space for content display compared to small widgets. They can accommodate additional information or more detailed visualizations while remaining relatively compact.
    • Large: Large widgets provide ample space for displaying extensive content or detailed visuals. They offer a comprehensive view of information and may include interactive elements for enhanced functionality.
    • Extra Large: This size is available primarily on iPadOS and macOS, offering the most significant amount of space for content display. Extra-large widgets are suitable for showcasing extensive data or intricate visualizations, maximizing visibility and usability on larger screens.

    These widget sizes cater to different user preferences and use cases, allowing developers to choose the most appropriate size based on the content and functionality of their widgets. By offering a range of sizes, developers can ensure their widgets deliver a tailored experience that meets the diverse needs of users across various devices and platforms.

    Best Practices for Widget Design and Development:

    Building on the existing best practices, let’s introduce additional tips:

    • Accessibility Considerations: Ensure that widgets are accessible to all users, including those with disabilities, by implementing features such as VoiceOver support and high contrast modes.
    • Localization Support: Localize widget content and interface elements to cater to users from diverse linguistic and cultural backgrounds, enhancing the app’s global reach and appeal.
    • Data Privacy and Security: Safeguard users’ personal information and sensitive data by implementing robust security measures and adhering to privacy best practices outlined in Apple’s guidelines.
    • Integration with App Clips: Explore opportunities to integrate widgets with App Clips, which are lightweight app experiences that allow users to access specific features or content without installing the full app.

    Creating a Month-Wise Holiday Widget

    In this example, we will create a widget that displays the holidays of a month, allowing users to quickly see the month’s holidays at a glance right on their home screen.

    Initial Setup

    • Open Xcode: Launch Xcode on your Mac. 
    • Create a New Project: Select “Create a new Xcode project” from the welcome screen or go to File > New > Project from the menu bar. 
    • Choose a Template: In the template chooser window, select the “App” template under the iOS tab. Make sure to select SwiftUI as the User Interface and click “Next.” 
    • Configure Your Project: Enter the name of your project, choose the organization identifier (usually your reverse domain name), set the interface to SwiftUI, select Swift as the language, and click “Next.”
    • Xcode will generate a default SwiftUI view for your app.
    • Add a Widget Extension: In Xcode, navigate to the File menu and select New > Target. In the template chooser window, select the “Widget Extension” template under the iOS tab and click “Next.”
    • Configure the Widget Extension: Enter a name for your widget extension as “Monthly Holiday” and choose the parent app for the extension (your main project). Click “Finish.” 
    • Select “Activate” when the Activate scheme pops up.
    • Set Up the Widget Extension: Xcode will generate the necessary files for your widget extension, including a view file (e.g., WidgetView.swift) and a provider file (e.g., WidgetProvider.swift).

    Developing the Month-Wise Holidays Widget

    • Implementing Provider Struct and TimelineProvider Protocol:

    The TimelineProvider protocol provides the data that a widget displays over time. By conforming to this protocol, you define how and when the data for your widget should be updated.

    struct Provider: TimelineProvider {
        // Provides a placeholder entry while the widget is loading.
        func placeholder(in context: Context) -> DayEntry {
            DayEntry(date: Date(), configuration: ConfigurationIntent())
        }
    
        // Provides a snapshot of the widget's current state.
        func getSnapshot(in context: Context, completion: @escaping (DayEntry) -> ()) {
            let entry = DayEntry(date: Date(), configuration: ConfigurationIntent())
            completion(entry)
        }
    
        // Provides a timeline of entries for the widget.
        func getTimeline(in context: Context, completion: @escaping (Timeline<DayEntry>) -> ()) {
            var entries: [DayEntry] = []
            
            // Generate a timeline of seven entries, one day apart, starting from the current date.
            let currentDate = Date()
            for dayOffset in 0 ..< 7 {
                let entryDate = Calendar.current.date(byAdding: .day, value: dayOffset, to: currentDate)!
                let startOfDate = Calendar.current.startOfDay(for: entryDate)
                let entry = DayEntry(date: startOfDate, configuration: ConfigurationIntent())
                entries.append(entry)
            }
            
            // Build the timeline once all entries are collected and hand it to WidgetKit.
            let timeline = Timeline(entries: entries, policy: .atEnd)
            completion(timeline)
        }
    }

    • Define a struct named DayEntry that conforms to the TimelineEntry protocol.

    TimelineEntry is used in conjunction with TimelineProvider to manage and provide the data that the widget displays over time. By creating multiple timeline entries, you can control what your widget displays at different times throughout the day.

    struct DayEntry: TimelineEntry {
        let date: Date
        let configuration: ConfigurationIntent
    }

    • Define a SwiftUI view named MonthlyHolidayWidgetEntryView to display each entry in the widget. 
    struct MonthlyHolidayWidgetEntryView: View {
        var entry: DayEntry
        var config: MonthConfig
        
        // Custom initializer to configure the view based on the entry's date
        init(entry: DayEntry) {
            self.entry = entry
            self.config = MonthConfig.determineConfig(from: entry.date)
        }
    
        var body: some View {
            ZStack {
                // Background shape with gradient color based on the month configuration
                ContainerRelativeShape()
                    .fill(config.backgroundColor.gradient)
                
                VStack {
                    Spacer()
                    // Display the date associated with the month
                    HStack(spacing: 4) {
                        Text(config.dateText)
                            .foregroundColor(config.dayTextColor)
                            .font(.system(size: 25, weight: .heavy))
                    }
                    Spacer()
                    // Display the name of the month
                    Text(config.month)
                        .font(.system(size: 38, weight: .heavy))
                        .foregroundColor(config.dayTextColor)
                    Spacer()
                }
                .padding()
            }
        }
    }

    • Define a widget named MonthlyHolidayWidget using SwiftUI and WidgetKit.
    struct MonthlyHolidayWidget: Widget {
        let kind: String = "MonthlyHolidaysWidget"
    
        var body: some WidgetConfiguration {
            StaticConfiguration(kind: kind, provider: Provider()) { entry in
                MonthlyHolidayWidgetEntryView(entry: entry)
            }
            .configurationDisplayName("Monthly style widget") // Display name for the widget in the widget gallery
            .description("The date of the widget changes based on holidays of month.") // Description of the widget's functionality
            .supportedFamilies([.systemLarge]) // Specify the widget size supported (large in this case)
        }
    }

    • Define a PreviewProvider struct named MonthlyHolidayWidget_Previews.
    struct MonthlyHolidayWidget_Previews: PreviewProvider {
        static var previews: some View {
            // Provide a preview of the MonthlyHolidayWidgetEntryView for the widget gallery
            MonthlyHolidayWidgetEntryView(entry: DayEntry(date: dateToDisplay(month: 12, day: 22), configuration: ConfigurationIntent()))
                .previewContext(WidgetPreviewContext(family: .systemLarge))
        }
        
        // Helper function to create a date for the given month and day in the year 2024
        static func dateToDisplay(month: Int, day: Int) -> Date {
            let components = DateComponents(calendar: Calendar.current,
                                            year: 2024,
                                            month: month,
                                            day: day)
            return Calendar.current.date(from: components)!
        }
    }

    • Define an extension on the Date struct, adding computed properties to format dates in a specific way.
    extension Date {
        // Computed property to get the weekday in a wide format (e.g., "Monday")
        var weekDayDisplayFormat: String {
            self.formatted(.dateTime.weekday(.wide))
        }
        
        // Computed property to get the day of the month (e.g., "22")
        var dayDisplayFormat: String {
            formatted(.dateTime.day())
        }
    }

• Define a `MonthConfig` struct that encapsulates configuration data for displaying month-specific attributes such as background color, date text, weekday text color, day text color, and month name based on a given date.

    struct MonthConfig {
        let backgroundColor: Color      // Background color for the month display
        let dateText: String            // Text describing specific dates or holidays in the month
        let weekdayTextColor: Color    // Text color for weekdays
        let dayTextColor: Color        // Text color for days of the month
        let month: String              // Name of the month
        
        /// Determines and returns the configuration (MonthConfig) based on the given date.
        ///
        /// - Parameter date: The date used to determine the month configuration.
        /// - Returns: A MonthConfig object corresponding to the month of the given date.
        static func determineConfig(from date: Date) -> MonthConfig {
            let monthInt = Calendar.current.component(.month, from: date)
            
            switch monthInt {
            case 1: // January
                return MonthConfig(backgroundColor: .gray,
                                   dateText: "1 and 26",
                                   weekdayTextColor: .black.opacity(0.6),
                                   dayTextColor: .white.opacity(0.8),
                                   month: "Jan")
            case 2: // February
                return MonthConfig(backgroundColor: .palePink,
                                   dateText: "No Holiday",
                                   weekdayTextColor: .pink.opacity(0.5),
                                   dayTextColor: .white.opacity(0.8),
                                   month: "Feb")
            case 3: // March
                return MonthConfig(backgroundColor: .paleGreen,
                                   dateText: "25",
                                   weekdayTextColor: .black.opacity(0.7),
                                   dayTextColor: .white.opacity(0.8),
                                   month: "March")
            case 4: // April
                return MonthConfig(backgroundColor: .paleBlue,
                                   dateText: "No Holiday",
                                   weekdayTextColor: .black.opacity(0.5),
                                   dayTextColor: .white.opacity(0.8),
                                   month: "April")
            case 5: // May
                return MonthConfig(backgroundColor: .paleYellow,
                                   dateText: "1",
                                   weekdayTextColor: .black.opacity(0.5),
                                   dayTextColor: .white.opacity(0.7),
                                   month: "May")
            case 6: // June
                return MonthConfig(backgroundColor: .skyBlue,
                                   dateText: "No Holiday",
                                   weekdayTextColor: .black.opacity(0.5),
                                   dayTextColor: .white.opacity(0.7),
                                   month: "June")
            case 7: // July
                return MonthConfig(backgroundColor: .blue,
                                   dateText: "No Holiday",
                                   weekdayTextColor: .black.opacity(0.5),
                                   dayTextColor: .white.opacity(0.8),
                                   month: "July")
            case 8: // August
                return MonthConfig(backgroundColor: .paleOrange,
                                   dateText: "15",
                                   weekdayTextColor: .black.opacity(0.5),
                                   dayTextColor: .white.opacity(0.8),
                                   month: "August")
            case 9: // September
                return MonthConfig(backgroundColor: .paleRed,
                                   dateText: "No Holiday",
                                   weekdayTextColor: .black.opacity(0.5),
                                   dayTextColor: .paleYellow.opacity(0.9),
                                   month: "Sep")
            case 10: // October
                return MonthConfig(backgroundColor: .black,
                                   dateText: "2",
                                   weekdayTextColor: .white.opacity(0.6),
                                   dayTextColor: .orange.opacity(0.8),
                                   month: "Oct")
            case 11: // November
                return MonthConfig(backgroundColor: .paleBrown,
                                   dateText: "31",
                                   weekdayTextColor: .black.opacity(0.6),
                                   dayTextColor: .white.opacity(0.6),
                                   month: "Nov")
            case 12: // December
                return MonthConfig(backgroundColor: .paleRed,
                                   dateText: "25",
                                   weekdayTextColor: .white.opacity(0.6),
                                   dayTextColor: .darkGreen.opacity(0.8),
                                   month: "Dec")
            default:
                // Default case for unexpected month values (shouldn't typically happen)
                return MonthConfig(backgroundColor: .gray,
                                   dateText: " ",
                                   weekdayTextColor: .black.opacity(0.6),
                                   dayTextColor: .white.opacity(0.8),
                                   month: "None")
            }
        }
    }

    • Call MonthlyHolidayWidget and MonthlyWidgetLiveActivity inside “MonthlyWidgetBundle.”
    import WidgetKit
    import SwiftUI
    
    @main
    struct MonthlyWidgetBundle: WidgetBundle {
        var body: some Widget {
            MonthlyHolidayWidget()
            MonthlyWidgetLiveActivity()
        }
    }

• Now, finally, add our created widget to a device:
      • Tap on the blank area of the screen and hold it for 2 seconds.
      • Then click on the plus(+) button at the top left corner.
      • Then, enter the widget name in the search widgets search bar.
      • Finally, select the widget name, “Monthly Holiday” in our case, to add it to the screen.
• The visual effect of the widget will be as follows:

    Conclusion:

    iOS widgets represent a powerful tool for developers to enhance user experiences, drive engagement, and promote app adoption. By understanding the various types of widgets, implementing best practices for design and development, and exploring innovative use cases, developers can leverage their full potential to create compelling and impactful experiences for iOS users worldwide. As Apple continues to evolve the platform and introduce new features, widgets will remain a vital component of the iOS ecosystem, offering endless possibilities for innovation and creativity.

  • Iceberg: Features and Hands-on (Part 2)

In the previous blog, we discussed Apache Iceberg’s basic concepts, its setup process, and how to load data. In this post, we delve into some of Iceberg’s advanced features, including upsert functionality, schema evolution, time travel, and partitioning.

    Upsert Functionality

    One of Iceberg’s key features is its support for upserts. Upsert, which stands for update and insert, allows you to efficiently manage changes to your data. With Iceberg, you can perform these operations seamlessly, ensuring that your data remains accurate and up-to-date without the need for complex and time-consuming processes.

    Schema Evolution

    Schema evolution is another of its powerful features. Over time, the schema of your data may need to change due to new requirements or updates. Iceberg handles schema changes gracefully, allowing you to add, remove, or modify columns without having to rewrite your entire dataset. This flexibility ensures that your data architecture can evolve in tandem with your business needs.

    Time Travel

    Iceberg also provides time travel capabilities, enabling you to query historical data as it existed at any given point in time. This feature is particularly useful for debugging, auditing, and compliance purposes. By leveraging snapshots, you can easily access previous states of your data and perform analyses on how it has changed over time.
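As a sketch of the syntax (using the demo.db.data_sample table created later in this post; the snapshot id below is a placeholder, not a real value), Spark SQL can read an Iceberg table as of a timestamp or a specific snapshot:

```sql
-- List the table's snapshot history from the Iceberg metadata table
SELECT snapshot_id, committed_at FROM demo.db.data_sample.snapshots;

-- Query the table as it existed at a point in time
SELECT * FROM demo.db.data_sample TIMESTAMP AS OF '2024-05-01 00:00:00';

-- Query a specific snapshot by id (placeholder value)
SELECT * FROM demo.db.data_sample VERSION AS OF 1234567890;
```

Each write to an Iceberg table produces a new snapshot, so the snapshots metadata table is the natural starting point for any time-travel query.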

Set up Iceberg on the local machine using the local catalog option or Hive

    You can also configure Iceberg in your Spark session like this:

import pyspark

spark = pyspark.sql.SparkSession.builder \
    .config('spark.jars.packages', 'org.apache.iceberg:iceberg-spark-runtime-3.2_2.12:1.1.0') \
    .config('spark.sql.extensions', 'org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions') \
    .config('spark.sql.catalog.spark_catalog', 'org.apache.iceberg.spark.SparkSessionCatalog') \
    .config('spark.sql.catalog.spark_catalog.type', 'hive') \
    .config('spark.sql.catalog.local', 'org.apache.iceberg.spark.SparkCatalog') \
    .config('spark.sql.catalog.local.type', 'hadoop') \
    .config('spark.sql.catalog.local.warehouse', './Data-Engineering/warehouse') \
    .getOrCreate()

A few configurations must be passed while setting up Iceberg: the Iceberg Spark runtime package, the SQL extensions, and the catalog definitions. Note that spark.sql.catalog.spark_catalog must itself point to Iceberg's SparkSessionCatalog for the hive type setting to take effect.

    Create Tables in Iceberg and Insert Data

CREATE TABLE demo.db.data_sample (
    index string,
    organization_id string,
    name string,
    website string,
    country string,
    description string,
    founded string,
    industry string,
    num_of_employees string
) USING iceberg

    df = spark.read.option("header", "true").csv("../data/input-data/organizations-100.csv")
    
    df.writeTo("demo.db.data_sample").append()

We can either create the sample table using Spark SQL or write the data directly by specifying the database and table names, which creates the Iceberg table for us.

You can see the data we have inserted. Apart from appending, you can use the overwrite method, just as with Delta Lake tables. You can also see an example of how to read data from an Iceberg table.

    Handling Upserts

This Iceberg feature is similar to Delta Lake's: you can update records in an existing Iceberg table without rewriting the complete dataset. It is also used to handle CDC operations. We can take input from any incoming CSV and merge it into the existing table without duplication, so the table always holds a single record per primary key. This is how Iceberg maintains ACID properties.

    Incoming Data 

    input_data = spark.read.option("header", "true").csv("../data/input-data/organizations-11111.csv")
    # Creating the temp view of that dataframe to merge
    input_data.createOrReplaceTempView("input_data")
    spark.sql("select * from input_data").show()

    We will merge this data into our existing Iceberg Table using Spark SQL.

    MERGE INTO demo.db.data_sample t
    USING (SELECT * FROM input_data) s
    ON t.organization_id = s.organization_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
    
    select * from demo.db.data_sample

    Here, we can see the data once the merge operation has taken place.

    Schema Evolution

    Iceberg supports the following schema evolution changes:

• Add – Add a new column to the Iceberg table
• Drop – Remove an existing column from the table
• Rename – Change the name of an existing column
• Update – Change the data type of a column or the table’s partition columns
• Reorder – Change the order of columns in the Iceberg table

After updating the schema, there is no need to overwrite or rewrite the data. For example, if your table has four columns, all populated with data, and you add two more, you do not need to rewrite the existing data; it remains accessible under the new six-column schema. This feature was lacking in Delta Lake but is present here. These are some guarantees of Iceberg schema evolution:

    1. If we add any columns, they won’t impact the existing columns.
    2. If we delete or drop any columns, they won’t impact other columns.
    3. Updating a column or field does not change values in any other column.

    Iceberg uses unique IDs to track each column added to a table.

    Let’s run some queries to update the schema, or let’s try to delete some columns.

    %%sql
    
    ALTER TABLE demo.db.data_sample
    ADD COLUMN fare_per_distance_unit float AFTER num_of_employees;

    After adding another column, if we try to access the data again from the table, we can do so without seeing any kind of error. This is also how Iceberg solves schema-related problems.
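For illustration, other schema changes use similarly small, one-statement DDL. The column names below are taken from the table above (employee_count is a hypothetical new name), and no data files are rewritten in either case:

```sql
-- Rename an existing column
ALTER TABLE demo.db.data_sample RENAME COLUMN num_of_employees TO employee_count;

-- Drop a column; existing data files are not rewritten
ALTER TABLE demo.db.data_sample DROP COLUMN fare_per_distance_unit;
```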

    Partition Evolution and Sort Order Evolution

    Iceberg came up with this option, which was missing in Delta Lake. When you evolve a partition spec, the old data written with an earlier spec remains unchanged. New data is written using the new spec in a new layout. Metadata for each of the partition versions is kept separately. Because of this, when you start writing queries, you get split planning. This is where each partition layout plans files separately using the filter it derives for that specific partition layout.

    Similar to partition spec, Iceberg sort order can also be updated in an existing table. When you evolve a sort order, the old data written with an earlier order remains unchanged.

    %%sql
    
ALTER TABLE demo.db.data_sample ADD PARTITION FIELD founded;
DESCRIBE TABLE demo.db.data_sample;
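The sort order can be evolved with a similar statement. As a sketch, ordering writes by the founded column used above:

```sql
-- Evolve the table's write sort order; previously written files keep their old order
ALTER TABLE demo.db.data_sample WRITE ORDERED BY founded;
```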

Copy-on-Write (COW) and Merge-on-Read (MOR)

Iceberg supports both COW and MOR while loading data into an Iceberg table. We can configure this either by altering the table or by setting table properties when creating it.

    Copy-On-Write (COW) – Best for tables with frequent reads, infrequent writes/updates, or large batch updates:

When your workload reads frequently but writes and updates less often, configure this property on the Iceberg table. In COW, when we update or delete rows, a new data file is written as a new version, and the latest version holds the updated data. Because the data is rewritten on every update or deletion, writes are slower and can become a bottleneck when large updates occur; as the name specifies, a copy of the data is created on write.

Reads, on the other hand, are ideal: since nothing has to be reconciled at read time, the data can be read quickly.

    Merge-On-Read (MOR) – Best for tables with frequent writes/updates:

This is just the opposite of COW: we do not rewrite the data on the update or deletion of rows. Instead, Iceberg writes a change log containing the updated records, and this log is merged with the original data files when the table is read, producing the current state of the data.
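The write mode can be chosen per operation through table properties. As a sketch (these properties apply to Iceberg format-version-2 tables):

```sql
-- Switch deletes, updates, and merges to merge-on-read
ALTER TABLE demo.db.data_sample SET TBLPROPERTIES (
    'write.delete.mode' = 'merge-on-read',
    'write.update.mode' = 'merge-on-read',
    'write.merge.mode'  = 'merge-on-read'
);
```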

Supported query engines and integrations:

    Conclusion

After performing this research, we learned about Iceberg’s features and its compatibility with various metastores and integrations. We got a basic idea of configuring Iceberg on different cloud platforms as well as locally, and hands-on experience with upserts, schema evolution, and partition evolution.

  • Data QA: The Need of the Hour

    Have you ever encountered vague or misleading data analytics reports? Are you struggling to provide accurate data values to your end users? Have you ever experienced being misdirected by a geographical map application, leading you to the wrong destination? Imagine Amazon customers expressing dissatisfaction due to receiving the wrong product at their doorstep.

    These issues stem from the use of incorrect or vague data by application/service providers. The need of the hour is to address these challenges by enhancing data quality processes and implementing robust data quality solutions. Through effective data management and validation, organizations can unlock valuable insights and make informed decisions.

    “Harnessing the potential of clean data is like painting a masterpiece with accurate brushstrokes.”

    Introduction

    Data quality assurance (QA) is the systematic approach organizations use to ensure they have reliable, correct, consistent, and relevant data. It involves various methods, approaches, and tools to maintain good data quality from commencement to termination.

    What is Data Quality?

    Data quality refers to the overall utility of a dataset and its ability to be easily processed and analyzed for other uses. It is an integral part of data governance that ensures your organization’s data is fit for purpose. 

    How can I measure Data Quality?


    What is the critical importance of Data Quality?

    Remember, good data is super important! So, invest in good data—it’s the secret sauce for business success!

    What are the Data Quality Challenges?

    1. Data quality issues on production:

    Production-specific data quality issues are primarily caused by unexpected changes in the data and infrastructure failures.

    A. Source and third-party data changes:

    External data sources, like websites or companies, may introduce errors or inconsistencies, making it challenging to use the data accurately. These issues can lead to system errors or missing values, which might go unnoticed without proper monitoring.

    Example:

    • File formats change without warning:

    Imagine we’re using an API to get data in CSV format, and we’ve made a pipeline that handles it well.

    import csv
    
    def process_csv_data(csv_file):
        with open(csv_file, 'r') as file:
            csv_reader = csv.DictReader(file)
            for row in csv_reader:
                print(row)
    
    csv_file = 'data.csv'
    process_csv_data(csv_file)

    The data source switched to using the JSON format, breaking our pipeline. This inconsistency can cause errors or missing data if our system can’t adapt. Monitoring and adjustments will ensure the accuracy of data analysis or applications.
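One defensive pattern is to sniff the payload before parsing. The sketch below is our own illustration (the function name and fallback order are assumptions, not from any particular library): it accepts either a JSON array of objects or CSV text with a header row.

```python
import csv
import io
import json

def parse_records(payload: str) -> list[dict]:
    """Parse an API payload that may arrive as JSON or CSV."""
    try:
        # Newer responses: a JSON array of objects
        data = json.loads(payload)
        if isinstance(data, list):
            return data
    except json.JSONDecodeError:
        pass
    # Older responses: CSV with a header row
    return list(csv.DictReader(io.StringIO(payload)))
```

A format switch then degrades into a handled branch instead of a broken pipeline, and an unexpected payload can be logged and quarantined rather than silently dropped.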

    • Malformed data values and schema changes:

Suppose we’re handling inventory data for an e-commerce site, where the starting schema for the inventory dataset includes fields such as quantity and a last_updated_at timestamp.

Now, imagine that the inventory file’s schema changes suddenly: the “quantity” column has been renamed to “qty,” and the last_updated_at format switches to an epoch timestamp.

    This change might not be communicated in advance, leaving our data pipeline unprepared to handle the new field and time format.
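A thin normalization layer can absorb such changes. The sketch below is hypothetical (the alias table and function name are ours); the field names qty, quantity, and last_updated_at come from the example above:

```python
from datetime import datetime, timezone

# Map renamed columns back to the names the pipeline expects (assumed aliases)
COLUMN_ALIASES = {"qty": "quantity"}

def normalize_inventory_row(row: dict) -> dict:
    """Rename aliased columns and convert epoch timestamps to ISO-8601 strings."""
    out = {COLUMN_ALIASES.get(key, key): value for key, value in row.items()}
    ts = out.get("last_updated_at")
    if isinstance(ts, int):  # epoch seconds arrived instead of an ISO string
        out["last_updated_at"] = datetime.fromtimestamp(ts, tz=timezone.utc).isoformat()
    return out
```

Centralizing the mapping in one place means the next unannounced rename is a one-line change rather than a pipeline outage.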

    B. Infrastructure failures:

    Reliable software is crucial for processing large data volumes, but even the best tools can encounter issues. Infrastructure failures, like glitches or overloads, can disrupt data processing regardless of the software used.

    Solution: 

    Data observability tools such as Monte Carlo, BigEye, and Great Expectations help detect these issues by monitoring for changes in data quality and infrastructure performance. These tools are essential for identifying and alerting the root causes of data problems, ensuring data reliability in production environments.

    2. Data quality issues during development:

    Development-specific data quality issues are primarily caused by untested code changes.

    A. Incorrect parsing of data:

    Data transformation bugs can occur due to mistakes in code or parsing, leading to data type mismatches or schema inaccuracies.

    Example:

    Imagine we’re converting a date string (“YYYY-MM-DD”) to a Unix epoch timestamp using Python. But misunderstanding the strptime() function’s format specifier leads to unexpected outcomes.

from datetime import datetime

timestamp_str = "2024-05-10"  # the incoming data actually uses the %Y-%d-%m layout

# Incorrect: assumes ISO order, so '05' is read as the month and '10' as the day
format_date = "%Y-%m-%d"
timestamp_dt = datetime.strptime(timestamp_str, format_date)

epoch_seconds = int(timestamp_dt.timestamp())

    This error makes strptime() interpret “2024” as the year, “05” as the month (instead of the day), and “10” as the day (instead of the month), leading to inaccurate data in the timestamp_dt variable.
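The fix is to make the format string mirror the source layout rather than the ISO order:

```python
from datetime import datetime, timezone

timestamp_str = "2024-05-10"  # source layout: %Y-%d-%m (year-day-month)

# Correct: '%d' in the day position, '%m' in the month position
timestamp_dt = datetime.strptime(timestamp_str, "%Y-%d-%m")

# Pin the timezone explicitly before converting to an epoch value
epoch_seconds = int(timestamp_dt.replace(tzinfo=timezone.utc).timestamp())
```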

    B. Misapplied or misunderstood requirements:

    Even with the right code, data quality problems can still occur if requirements are misunderstood, resulting in logic errors and data quality issues.

    Example:
Imagine we’re assigned to validate product prices in a dataset, ensuring they are above $10 and no more than $100.

    product_prices = [10, 5, 25, 50, 75, 110]
    valid_prices = []
    
    for price in product_prices:
        if price >= 10 and price <= 100:
            valid_prices.append(price)
    
    print("Valid prices:", valid_prices)

The requirement states prices must be strictly greater than $10 and at most $100. But a misinterpretation leads the code to check whether prices are >= $10 and <= $100, which wrongly makes $10 valid, causing a data quality problem.
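Under the reading that $10 itself is out of range, the fix is a strict lower bound:

```python
product_prices = [10, 5, 25, 50, 75, 110]

# Strict lower bound: $10 itself is out of range, $100 is still allowed
valid_prices = [price for price in product_prices if 10 < price <= 100]
```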

    C. Unaccounted downstream dependencies:

    Despite careful planning and logic, data quality incidents can occur due to overlooked dependencies. Understanding data lineage and communicating effectively across all users is crucial to preventing such incidents.

    Example:

    Suppose we’re working on a database schema migration project for an e-commerce system. In the process, we rename the order_date column to purchase_date in the orders table. Despite careful planning and testing, a data quality issue arises due to an overlooked downstream dependency. The marketing team’s reporting dashboard relies on a SQL query referencing the order_date column, now renamed purchase_date, resulting in inaccurate reporting and potentially misinformed business decisions.

    Here’s an example SQL query that represents the overlooked downstream dependency:

    -- SQL query used by the marketing team's reporting dashboard
    SELECT 
        DATE_TRUNC('month', order_date) AS month,
        SUM(total_amount) AS total_sales
    FROM 
        orders
    GROUP BY 
        DATE_TRUNC('month', order_date)

    This SQL query relies on the order_date column to calculate monthly sales metrics. After the schema migration, this column no longer exists, causing query failure and inaccurate reporting.
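One hedged mitigation (the view name here is hypothetical) is a backward-compatible view that exposes the renamed column under its old name while downstream consumers migrate:

```sql
-- Temporary shim so existing dashboards keep working after the migration
CREATE VIEW orders_legacy AS
SELECT
    purchase_date AS order_date,
    total_amount
FROM orders;
```

The deeper fix remains mapping data lineage so renames like this surface every dependent query before the migration ships.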

    Solutions:

    Data Quality tools like Great Expectations and Deequ proactively catch data quality issues by testing changes introduced from data-processing code, preventing issues from reaching production.

    a. Testing assertions: Assertions validate data against expectations, ensuring data integrity. While useful, they require careful maintenance and should be selectively applied.

    Example:
Suppose we have an “orders” table in our dbt project and need to ensure the “total_amount” column contains only numeric values; we can write a dbt test to validate this data quality rule.

    version: 2
    
    models:
      - name: orders
        columns:
          - name: total_amount
            tests:
              - data_type: numeric

    In this dbt test code:

    • We specify the dbt schema version (version: 2), a model named “orders,” and its “total_amount” column.
    • Within the “total_amount” column definition, we add a test named “data_type” with the value “numeric,” ensuring the column contains only numeric data. (Note that data_type is not one of dbt Core’s built-in generic tests; this assumes a custom or package-provided generic test, such as the type checks in dbt-expectations.)
    • Running the dbt test command will execute this test, checking whether the “total_amount” column adheres to the numeric data type. Any failure indicates a data quality issue.

b. Comparing staging and production data: Data Diff is a CLI tool that compares datasets within or across databases, highlighting changes in data much as git diff highlights changes in source code, which aids in detecting data quality issues early in the development process.

    Here’s a data-diff example between staging and production databases for the payment_table.

data-diff \
  staging_db_connection \
  staging_payment_table \
  production_db_connection \
  production_payment_table \
  -k primary_key \
  -c "payment_amount, payment_type, payment_currency" \
  -w filter_condition(optional)

    Source: https://docs.datafold.com/data_diff/what_is_data_diff

    What are some best practices for maintaining high-quality data?

    1. Establish Data Standards: Define clear data standards and guidelines for data collection, storage, and usage to ensure consistency and accuracy across the organization.
    2. Data Validation: Implement validation checks to ensure data conforms to predefined rules and standards, identifying and correcting errors early in the data lifecycle.
    3. Regular Data Cleansing: Schedule regular data cleansing activities to identify and correct inaccuracies, inconsistencies, and duplicates in the data, ensuring its reliability and integrity over time.
    4. Data Governance: Establish data governance policies and procedures to manage data assets effectively, including roles and responsibilities, data ownership, access controls, and compliance with regulations.
    5. Metadata Management: Maintain comprehensive metadata to document data lineage, definitions, and usage, providing transparency and context for data consumers and stakeholders.
    6. Data Security: Implement robust data security measures to protect sensitive information from unauthorized access, ensuring data confidentiality, integrity, and availability.
    7. Data Quality Monitoring: Continuously monitor data quality metrics and KPIs to track performance, detect anomalies, and identify areas for improvement, enabling proactive data quality management.
    8. Data Training and Awareness: Provide data training and awareness programs for employees to enhance their understanding of data quality principles, practices, and tools, fostering a data-driven culture within the organization.
    9. Collaboration and Communication: Encourage collaboration and communication among stakeholders, data stewards, and IT teams to address data quality issues effectively and promote accountability and ownership of data quality initiatives.
    10. Continuous Improvement: Establish a culture of continuous improvement by regularly reviewing and refining data quality processes, tools, and strategies based on feedback, lessons learned, and evolving business needs.

    Can you recommend any tools for improving data quality?

1. AWS Deequ: AWS Deequ is an open-source data quality library built on top of Apache Spark. It provides tools for defining data quality rules and validating large-scale datasets in Spark-based data processing pipelines.
2. Great Expectations: Great Expectations is an open-source data validation framework; GX Cloud is its fully managed SaaS offering, which simplifies deployment, scaling, and collaboration and lets you focus on data validation.
3. Soda: Soda allows data engineers to test data quality early and often in pipelines to catch data quality issues before they have a downstream impact.
4. Datafold: Datafold is a cloud-based data quality platform that automates and simplifies the process of monitoring and validating data pipelines. It offers features such as automated data comparison, anomaly detection, and integration with popular data processing tools like dbt.

    Considerations for Selecting a Data QA Tool:

    Selecting a data QA (Quality Assurance) tool hinges on your specific needs and requirements. Consider factors such as: 

1. Scalability and Performance: Ensure the tool can handle current and future data volumes efficiently, with real-time processing capabilities.

Example: Great Expectations helps validate data in a big data environment by providing a scalable and customizable way to define and monitor data quality across different sources.

2. Data Profiling and Cleansing Capabilities: Look for comprehensive data profiling and cleansing features to detect anomalies and improve data quality.

Example: AWS Glue DataBrew offers data profiling, data lineage mapping, and the automation of data cleaning and normalization tasks.

3. Data Monitoring Features: Choose tools with continuous monitoring capabilities, allowing you to track metrics and establish data lineage.

    Example: Datafold’s monitoring feature allows data engineers to write SQL commands to find anomalies and create automated alerts.

4. Seamless Integration with Existing Systems: Select a tool compatible with your existing systems to minimize disruption and facilitate seamless integration.

    Example: dbt offers seamless integration with existing data infrastructure, including data warehouses and BI tools. It allows users to define data transformation pipelines using SQL, making it compatible with a wide range of data systems.

5. User-Friendly Interface: Prioritize tools with intuitive interfaces for quick adoption and minimal training requirements.

    Example: Soda SQL is an open-source tool with a simple command line interface (CLI) and Python library to test your data through metric collection.

6. Flexibility and Customization Options: Seek tools that offer flexibility to adapt to changing data requirements and allow customization of rules and workflows.

    Example: dbt offers flexibility and customization options for defining data transformation workflows. 

7. Vendor Support and Community: Evaluate vendors based on their support reputation and active user communities for shared knowledge and resources.

    Example: AWS Deequ is supported by Amazon Web Services (AWS) and has an active community of users. It provides comprehensive documentation, tutorials, and forums for users to seek assistance and share knowledge about data quality best practices.

8. Pricing and Licensing Options: Consider pricing models that align with your budget and expected data usage, such as subscription-based or volume-based pricing.

Example: Great Expectations offers flexible pricing and licensing options, including both an open-source (freely available) edition and enterprise editions (subscription-based).

    Ultimately, the right tool should effectively address your data quality challenges and seamlessly fit into your data infrastructure and workflows.

    Conclusion: The Vital Role of Data Quality

    In conclusion, data quality is paramount in today’s digital age. It underpins informed decisions, strategic formulation, and business success. Without it, organizations risk flawed judgments, inefficiencies, and competitiveness loss. Recognizing its vital role empowers businesses to drive innovation, enhance customer experiences, and achieve sustainable growth. Investing in robust data management, embracing technology, and fostering data integrity are essential. Prioritizing data quality is key to seizing new opportunities and staying ahead in the data-driven landscape.

    References:

    https://docs.getdbt.com/docs/build/data-tests

    https://aws.amazon.com/blogs/big-data/test-data-quality-at-scale-with-deequ

    https://www.soda.io/resources/introducing-soda-sql