Category: Data and AI

  • From Specs to Self-Healing Systems – GenAI’s Full-Stack Impact on the SDLC

    From Insight to Action: What This POV Delivers

    • A strategic lens on GenAI’s end-to-end impact across the SDLC, from intelligent requirements capture to self-healing production systems.
    • Clarity on how traditional engineering roles are evolving and what new skills and responsibilities are emerging in a GenAI-first environment.
    • A technical understanding of GenAI-driven architecture, code generation, and testing—including real-world patterns, tools, and model behaviors.
    • Insights into building model-aware, feedback-driven engineering pipelines that adapt and evolve continuously post-deployment.
    • A forward-looking view of how to modernize your tech stack with PromptOps, policy-as-code, and AI-powered governance built into every layer.

  • Beyond One-Size-Fits-All: Inside the Era of AI-Personalized Learning

    In this POV, you’ll discover how AI is redefining education through personalized, adaptive learning experiences. Learn how intelligent systems like OptimaAI AptivLearn are reshaping engagement for every stakeholder.

    1. The Shift to Intelligent Learning

    • Traditional digitization isn’t enough—education needs real-time adaptability.
    • AI transforms static platforms into responsive, personalized learning environments.
    • The global EdTech market is moving towards immersive, emotionally aware ecosystems.

    2. Impact Across Personas

    • Educators: Gain real-time insights, reduce admin workload, and dynamically adjust instruction.
    • Learners: Experience adaptive paths, voice-enabled support, and gamified engagement.
    • Administrators & Parents: Access predictive dashboards, behavioral insights, and 24/7 visibility.

    3. The OptimaAI AptivLearn Advantage

    • Delivers a unified, AI-powered ecosystem tailored to each stakeholder.
    • Enables hyper-personalized content, real-time feedback, and intelligent nudging.
    • Seamlessly integrates with existing LMS, SIS, and analytics tools to future-proof learning.

  • Agentic Readiness Audit

    R Systems’ Agentic Readiness Audit (ARA) helps you move from fragmented automation to a unified, agentic enterprise.

    • Assess your current AI maturity
    • Identify agent-ready use cases
    • Get a step-by-step deployment blueprint
    • Align leadership on ROI and governance

    Why Choose ARA?

    • Prepares you for 40%+ agent-driven workflows
    • Maps automation opportunities across functions
    • Delivers build vs. buy clarity
    • Accelerates time to value

    What You Get

    • Readiness Heatmap – High-impact agent zones
    • Role Mapping – Voice, analysis & reporting use cases
    • Gap Analysis – Fixes before agents scale
    • Future-State Scenarios – Visualize and plan agent-led operations

    Your Benefits

    • Readiness Score – Track where you stand
    • Opportunity Map – Know where to act
    • Deployment Plan – Phased rollout, governance, integrations
    • Executive Guidance – CXO-level strategy & ROI tools
    • Benchmark Dashboards – Compare readiness vs. peers

  • Unlock the Secrets to FMCG Success with Power BI

    Struggling with manual tracking and delayed insights? Learn how our tailored solutions helped a global beverage brand:

    • Boost User Adoption: Intuitive interface increased self-service BI usage by 50%.
    • Seamless Integration: Real-time insights from CRM, ERP, and social media analytics.
    • Cost Savings: Reduced licensing costs by 40%.
    • Faster Decision Making: Interactive dashboards for quicker insights and deeper analysis.
    • Scalability: Flexible cloud architecture to meet growing reporting needs.

    Ready to Transform Your Business? Fill out the form below to access the full case study and learn how Power BI can enhance your market performance.

  • Centralized Governance of Data Lake, Data Fabric with adopted Data Mesh Setup

    This article looks at data governance from the perspective of how it connects with the Data Mesh, Data Fabric, and Data Lakehouse architectures.

    Organizations across industries have multiple functional units, and data governance is needed to oversee the data assets and data flows connected to these business units, their security, and the processes governing the data products relevant to the business use cases.

    Let’s take a deep dive into data governance as the first step.  

    Data Governance

    The role of data governance also includes data democratization: it tracks data lineage, oversees data quality, and keeps data compliant with regional regulations.

    Microsoft Purview has a differentiator in the 150+ compliance-level regulations covered under its Compliance Manager portal.

    Data governance utilizes artificial intelligence to boost the quality level, based on data profiling results and the quality history of past data sets.

    Master Data Management (MDM) helps store the organization’s common master data set across domains, with features such as data de-duplication and maintaining relationships across entities to give a 360-degree view. Having a unique dataset together with role-based access control adds a further layer of governance and supports business insights.

    Data governance helps in creating a Data Marketplace for the controlled exchange of golden-quality data products between data sources and consumers; AWS DataZone, a SaaS offering, specializes in Data Marketplace capabilities.

    A reference data set, along with master data management, enables data standardization, which is relevant to industry-level data exchange between the organization, its subsidiaries, and partners on the Data Marketplace platform.

    Remember that data governance is only feasible with close correspondence between the technical and the business users.

    Technical users have the role of collecting the data assets from the data sources, reviewing the metadata and the data quality, and enriching data quality by building the applicable data quality rules before the data is stored.

    On the other hand, business users have the role of guiding the business glossary for each data asset down to the column level, defining the Critical Data Elements (CDE), specifying the sensitive data fields that should be masked or excluded before data is shared with consumers, and cooperating with data quality enrichment requests.

    The best practice is to follow a bottom-up approach between the business and the technical users. Even after the data governance framework has been set up, governance tasks continue on an ongoing basis, which implies that the business stakeholders should be well trained on the framework.

    Process automation is another stepping stone in data governance. To give an example, a workflow needs to be defined that notifies the data custodians about the quality enrichment steps to be taken for a data set; once the data quality has been revised, the workflow forwards the data set back to the marketplace to be consumed by the data consumers (see the sketch below).
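    As a small, hypothetical Python sketch of such a workflow (the threshold, dataset name, and notification mechanism are illustrative assumptions, not part of any specific governance tool):

    from dataclasses import dataclass

    @dataclass
    class Dataset:
        name: str
        quality_score: float  # e.g., from a profiling run, on a 0.0-1.0 scale
        state: str = "pending_enrichment"

    def notify_custodian(dataset):
        # Placeholder: in practice this could be an email, a ticket, or a chat message.
        print(f"Custodian notified: enrich quality for '{dataset.name}'")

    def publish_to_marketplace(dataset):
        dataset.state = "published"
        print(f"'{dataset.name}' forwarded to the data marketplace")

    def run_quality_workflow(dataset, threshold=0.9):
        # Poor profiling score: route to the custodians for enrichment.
        # Revised score at or above the threshold: publish for consumers.
        if dataset.quality_score < threshold:
            notify_custodian(dataset)
        else:
            publish_to_marketplace(dataset)

    run_quality_workflow(Dataset(name="customer_master", quality_score=0.75))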

    Data discovery is another automation step, in which a workflow scans the data sources for metadata details on a defined schedule and loads the incremental data into the inventory, triggering tasks in the defined data flow ahead.

    The data governance approach may change depending on the Data Mesh, Data Fabric, or Data Lakehouse architecture. Let’s dig deeper into this next.

    Data Mesh vs. Data Fabric vs. Data Lakehouse Architectures

    Talking about the data flow: every organization has multiple data sources that store data in different formats and mediums. Once connected to these data sources, the integration layer extracts, loads, and transforms (ELT) the data, saves it in a storage medium, and the data gets consumed from there. These data sources and consumers can be internal or external to the organization, depending on the extensibility and the use case involved in the business scenario.

    This lifecycle becomes heavy with the large piles of data sets in an organization. The complexity increases when data quality is poor, app connectors are not available, data integration is not smooth, and datasets are not discoverable.

    Rather than piling all the data sets into a single warehouse, organizations segregate the data products, apps, ELT, storage, and related processes across business units, which we term the Data Mesh architecture.

    Data Mesh at the domain level leads to decentralized data management, clear data accountability, and smooth data pipelines, and it helps discard data silos that aren’t being used across domains.

    Most data pipelines flow within a particular domain’s data set, but some pipelines also cross domains. Data Fabric joins the data sets and pipelines across the domains in an integrated architecture.

    Data virtualization and data orchestration techniques help reduce the segregation of the technical landscape, but overall they impact performance and increase complexity.

    There is another setup approach that companies are interested in as part of digital transformation: migrating datasets from storage mediums segregated along different dimensions to a centralized Data Lakehouse.

    Data sets are loaded into a single Data Lakehouse, preferably in a Medallion architecture, starting with the Bronze layer holding the raw data.

    Next, after cleansing and transformation, the data is segregated on the same storage medium but across individual domains, building up the Silver layer.

    Finally, for analytics purposes, the Gold layer is prepared, holding a compatible dimensions-and-facts data model.
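    As a minimal PySpark sketch of these three layers (the paths, columns, and cleansing rules are illustrative assumptions):

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("medallion-sketch").getOrCreate()

    # Bronze: land the raw data as-is.
    bronze = spark.read.option("header", "true").csv("/lake/bronze/sales_raw")

    # Silver: cleanse, de-duplicate, and type the data per domain.
    silver = (bronze
              .dropDuplicates()
              .filter(F.col("amount").isNotNull())
              .withColumn("amount", F.col("amount").cast("double")))
    silver.write.mode("overwrite").parquet("/lake/silver/sales")

    # Gold: aggregate into an analytics-ready dimensions-facts shape.
    gold = (silver.groupBy("region", "product")
            .agg(F.sum("amount").alias("total_sales")))
    gold.write.mode("overwrite").parquet("/lake/gold/sales_by_region_product")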

    This centralized storage is effectively Data Mesh adopted on a Data Lakehouse setup.

    Different clouds, Microsoft Fabric, and Databricks provide capabilities for this.

    Data Governance options

    Just as the implementation architecture can be centralized or decentralized, data governance follows the same pattern.

    Federated governance aligns with Data Mesh, while centralized governance fits the Data Fabric and Data Lakehouse architectures.

    Federated governance is justified in a complex legacy setup: a large organization with multiple branches across domains, each with its own domain-level local governance officers.

    These local governance officers track the data pipelines and govern access to the individual storage mediums, integration layers, and apps involved, so that whenever there is any change in a data set, the data catalog tool is able to collect the metadata of those changes.

    A centralized governance committee with data custodians handles the other two scenarios, the Data Fabric and Data Lakehouse setups.

    Take the example of a Data Fabric where data is spread across different storage mediums: say, Databricks for machine learning, Snowflake for visualization reports, databases and files as data sources, and cloud services for data processing. In such a scenario, end-to-end centralized data governance is feasible via data virtualization and data orchestration services.

    Similar central-level governance applies where the complete implementation setup is on a single platform, say the AWS cloud platform.

    AWS Glue Data Catalog can be used for tracking the technical data assets, and AWS DataZone for data exchange between the data sources and data consumers after tagging the business glossary to the technical assets (a minimal sketch follows).
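    As a small, hedged example, the technical assets registered in the Glue Data Catalog can be listed with boto3; the database and region names below are illustrative:

    import boto3

    glue = boto3.client("glue", region_name="us-east-1")

    # Walk the catalog and print each table with its storage location.
    paginator = glue.get_paginator("get_tables")
    for page in paginator.paginate(DatabaseName="sales_domain"):
        for table in page["TableList"]:
            location = table.get("StorageDescriptor", {}).get("Location")
            print(table["Name"], location)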

    Azure with Microsoft Purview, Microsoft Fabric with Purview, Snowflake with Horizon, Databricks with Unity Catalog, and AWS with Glue Data Catalog and DataZone: these and other platforms provide the scalability needed to store big data sets, build up the Medallion architecture, and easily perform centralized data governance.

    Conclusion

    Overall, data governance is a framework that works hand in hand with Data Mesh, Data Fabric, Data Lakehouse, data quality, integration with data sources, consumers, and apps, data storage, MDM, data modeling, the data catalog, security, process automation, and AI.

    Along with these technologies, data governance requires the support of business stakeholders, stewards, data analysts, data custodians, data operations engineers, and the Chief Data Officer; these profiles build up the Data Governance Committee.

    Deciding between the Data Mesh, Data Fabric, and Data Lakehouse approaches depends on the organization’s current setup, the business units involved, the data distribution across those units, and the business use cases.

    The current industry trend is to migrate distributed datasets and processes to a centralized Lakehouse as the preferred approach, with workspaces for the individual domains also supporting an adopted Data Mesh.

    This gives the upper hand to centralized data governance, providing the capability to track data pipelines across domains, synchronize data across domains, trace columns from source to consumer via data lineage, apply role-based access control on domain-level data sets, and search datasets quickly and easily on a single platform.

  • Data Engineering: Beyond Big Data

    When a data project comes to mind, the end goal is to enhance the data. It’s about building systems to curate the data in a way that can help the business.

    At the dawn of their data engineering journey, people tend to familiarize themselves with the terms “extract,” “transform,” and “load.” These terms, along with traditional data engineering, spark the image that data engineering is about the processing and movement of large amounts of data. And why not! We’ve witnessed a tremendous evolution in these technologies, from storing information in simple spreadsheets to managing massive data warehouses and data lakes, supported by advanced infrastructure capable of ingesting and processing huge data volumes.

    However, this doesn’t limit data engineering to ETL; rather, it opens up many opportunities to introduce new technologies and concepts that are needed to support big data processing. The expectations from a modern data system extend well beyond mere data movement. There’s a strong emphasis on privacy, especially with the vast amounts of sensitive data that need protection. Speed is crucial, particularly in real-world scenarios like satellite data processing, financial trading, and data processing in healthcare, where minimizing latency is key.

    With technologies like AI and machine learning driving analysis on massive datasets, data volumes will inevitably continue to grow. We’ve seen this trend before, just as we once spoke of megabytes and now regularly discuss gigabytes. In the future, we’ll likely talk about terabytes and petabytes with the same familiarity.

    These growing expectations have made data engineering a sphere with numerous supporting components, and in this article, we’ll delve into some of those components.

    • Data governance
    • Metadata management
    • Data observability
    • Data quality
    • Orchestration
    • Visualization

    Data Governance

    With huge amounts of confidential business and user data moving around, handling it safely is a very delicate process. We must ensure trust in data processes, and the data itself cannot be compromised. It is essential for a business onboarding users to show that their data is in safe hands. These days, when a business asks you for sensitive information, you’re bound to ask questions such as:

    • What if my data is compromised?
    • Are we putting it to the right use?
    • Who’s in control of this data? Are the right personnel using it?
    • Is it compliant with the rules and regulations for data practices?

    So, to answer these questions satisfactorily, data governance comes into the picture. The basic idea of data governance is that it’s a set of rules, policies, principles, or processes to maintain data integrity. It’s about how we can supervise our data and keep it safe. Think of data governance as a protective blanket that takes care of all the security risks, creates a habitable environment for data, and builds trust in data processing.

    Data governance is a very strong piece of equipment in the data engineering arsenal. These rules and principles are consistently applied throughout all data processing activities. Wherever data flows, data governance ensures that data adheres to these established protocols. By adding a sense of trust to the activities involving data, you gain the freedom to focus on your data solution without worrying about any external or internal risks. This helps in reaching the ultimate goal—to foster a culture that prioritizes and emphasizes data responsibility.

    Understanding the extensive application of data governance in data engineering clearly illustrates its significance and where it needs to be implemented in real-world scenarios. In numerous entities, such as government organizations or large corporations, data sensitivity is a top priority. Misuse of this data can have widespread negative impacts. To ensure that it doesn’t happen, we can use tools to ensure oversight and compliance. Let’s briefly explore one of those tools.

    Microsoft Purview

    Microsoft Purview comes with a range of solutions to protect your data. Let’s look at some of its offerings.

    • Insider risk management
      • Microsoft Purview takes care of data security risks from people inside your organization by identifying high-risk individuals.
      • It helps you classify data breaches into different sections and take appropriate action to prevent them.
    • Data loss prevention
      • It makes applying data loss prevention policies straightforward.
      • It secures data by restricting important and sensitive data from being deleted and blocks unusual activities, like sharing sensitive data outside your organization.
    • Compliance adherence
      • Microsoft Purview can help you make sure that your data processes are compliant with data regulatory bodies and organizational standards.
    • Information protection
      • It provides granular control over data, allowing you to define strict accessibility rules.
      • When you need to manage what data can be shared with specific individuals, this control restricts the data visible to others.
    • Know your sensitive data
      • It simplifies the process of understanding and learning about your data.
      • MS Purview features ML-based classifiers that label and categorize your sensitive data, helping you identify its specific category.

    Metadata Management

    Another essential aspect of big data movement is metadata management. 

    Metadata, simply put, is data about data. This component of data engineering forms the basis for huge improvements in data systems.

    You might have come across the Instagram crash story a while back (see the WIRED reference at the end of this article); it resurfaced recently.

    The story is from about a decade ago, and it shows metadata’s longevity and how it became a base for greater things.

    At the time, Instagram showed the number of likes by running a count function on the database and storing it in a cache. This method was fine because the number wouldn’t change frequently, so the request would hit the cache and get the result. Even if the number changed, the request would query the data, and because the number was small, it wouldn’t scan a lot of rows, saving the data system from being overloaded.

    However, when a celebrity posted something, it’d receive so many likes that the count would be enormous and change so frequently that looking into the cache became just an extra step.

    The request would trigger a query that would repeatedly scan many rows in the database, overloading the system and causing frequent crashes.

    To deal with this, Instagram came up with the idea of denormalizing the tables and storing the number of likes for each post. So, the request would result in a query where the database needs to look at only one cell to get the number of likes. To handle the issue of frequent changes in the number of likes, Instagram began updating the value at small intervals. This story tells how Instagram solved this problem with a simple tweak of using metadata. 
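    To illustrate the tweak (a minimal sketch, not Instagram’s actual implementation), here is a SQLite example in Python contrasting the count-on-read approach with a denormalized counter refreshed at intervals:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE likes (post_id INTEGER, user_id INTEGER);
        CREATE TABLE post_stats (post_id INTEGER PRIMARY KEY, like_count INTEGER);
    """)

    def like(post_id, user_id):
        conn.execute("INSERT INTO likes VALUES (?, ?)", (post_id, user_id))

    def count_likes_scan(post_id):
        # Expensive path: scan and count every like row on each read.
        return conn.execute("SELECT COUNT(*) FROM likes WHERE post_id = ?",
                            (post_id,)).fetchone()[0]

    def refresh_stats(post_id):
        # Denormalized path: refresh a single stored counter at intervals.
        conn.execute("INSERT OR REPLACE INTO post_stats VALUES (?, ?)",
                     (post_id, count_likes_scan(post_id)))

    def count_likes_fast(post_id):
        # Reads now touch exactly one cell instead of scanning rows.
        row = conn.execute("SELECT like_count FROM post_stats WHERE post_id = ?",
                           (post_id,)).fetchone()
        return row[0] if row else 0

    like(1, 42)
    like(1, 43)
    refresh_stats(1)            # run periodically, not on every like
    print(count_likes_fast(1))  # 2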

    Metadata in data engineering has evolved to solve even more significant problems by adding a layer on top of the data flow that works as an interface to communicate with data. Metadata management has become a foundation of multiple data features such as:

    • Data lineage: Stakeholders are interested in the results we get from data processes. Sometimes, in order to check the authenticity of data and get answers to questions like where the data originated from, we need to track back to the data source. Data lineage is a property that makes use of metadata to help with this scenario. Many data products like Atlan and data warehouses like Snowflake extensively use metadata for their services.
    • Schema information: With a clear understanding of your data’s structure, including column details and data types, we can efficiently troubleshoot and resolve data modeling challenges.
    • Data contracts: Metadata helps honor data contracts by keeping a common data profile, which maintains a common data structure across all data usages.
    • Stats: Managing metadata can help us easily access data statistics while also giving us quick answers to questions like what the total count of a table is, how many distinct records there are, how much space it takes, and many more.
    • Access control: Metadata management also includes having information about data accessibility. As we encountered it in the MS Purview features, we can associate a table with vital information and restrict the visibility of a table or even a column to the right people.
    • Audit: Keeping track of information, like who accessed the data, who modified it, or who deleted it, is another important feature that a product with multiple users can benefit from.

    There are many other use cases of metadata that enhance data engineering. It’s positively impacting the current landscape and shaping the future trajectory of data engineering. A very good example is a data catalog. Data catalogs focus on enriching datasets with information about data. Table formats, such as Iceberg and Delta, use catalogs to provide integration with multiple data sources, handle schema evolution, etc. Popular cloud services like AWS Glue also use metadata for features like data discovery. Tech giants like Snowflake and Databricks rely heavily on metadata for features like faster querying, time travel, and many more. 

    With the introduction of AI in the data domain, metadata management has a huge effect on the future trajectory of data engineering. Services such as Cortex and Fabric have integrated AI systems that use metadata for easy questioning and answering. When AI gets to know the context of data, the application of metadata becomes limitless.

    Data Observability

    We know how important metadata can be, and while it’s important to know your data, it’s as important to know about the processes working on it. That’s where observability enters the discussion. It is another crucial aspect of data engineering and a component we can’t miss from our data project. 

    Data observability is about setting up systems that can give us visibility over different services that are working on the data. Whether it’s ingestion, processing, or load operations, having visibility into data movement is essential. This not only ensures that these services remain reliable and fully operational, but it also keeps us informed about the ongoing processes. The ultimate goal is to proactively manage and optimize these operations, ensuring efficiency and smooth performance. We need to achieve this goal because it’s very likely that whenever we create a data system, multiple issues, as well as errors and bugs, will start popping out of nowhere.

    So, how do we keep an eye on these services to see whether they are performing as expected? The answer to that is setting up monitoring and alerting systems.

    Monitoring

    Monitoring is the continuous tracking and measurement of key metrics and indicators that tell us about the system’s performance. Many cloud services offer comprehensive performance metrics, presented through interactive visuals. These tools provide valuable insights, such as throughput, which measures the volume of data processed per second, and latency, which indicates how long it takes to process the data. They track errors and error rates, detailing the types and how frequently they happen.

    To lay the base for monitoring, there are tools like Prometheus and Datadog, which provide these monitoring features, indicating the performance of data systems and their underlying infrastructure. We also have Graylog, which gives us multiple features to monitor a system’s logs in real time.

    Now that we have the system that gives us visibility into the performance of processes, we need a setup that can tell us about them if anything goes sideways, a setup that can notify us. 

    Alerting

    Setting up alerting systems allows us to receive notifications directly within the applications we use regularly, eliminating the need for someone to constantly monitor metrics on a UI or watch graphs all day, which would be a waste of time and resources. This is why alerting systems are designed to trigger notifications based on predefined thresholds, such as throughput dropping below a certain level, latency exceeding a specific duration, or the occurrence of specific errors. These alerts can be sent to channels like email or Slack, ensuring that users are immediately aware of any unusual conditions in their data processes.
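    As a minimal sketch of threshold-based alerting (the thresholds and webhook URL are illustrative; Slack-style incoming webhooks accept a simple JSON payload):

    import json
    import urllib.request

    # Illustrative thresholds; real ones live in monitoring configuration.
    THRESHOLDS = {"throughput_rps": 100.0, "latency_ms": 500.0}

    def check_and_alert(metrics, webhook_url):
        alerts = []
        if metrics["throughput_rps"] < THRESHOLDS["throughput_rps"]:
            alerts.append(f"Throughput dropped to {metrics['throughput_rps']} rps")
        if metrics["latency_ms"] > THRESHOLDS["latency_ms"]:
            alerts.append(f"Latency rose to {metrics['latency_ms']} ms")
        for message in alerts:
            request = urllib.request.Request(
                webhook_url,
                data=json.dumps({"text": message}).encode(),
                headers={"Content-Type": "application/json"},
            )
            urllib.request.urlopen(request)

    # Example (requires a real webhook URL):
    # check_and_alert({"throughput_rps": 42.0, "latency_ms": 800.0}, WEBHOOK_URL)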

    Implementing observability will significantly impact data systems. By setting up monitoring and alerting, we can quickly identify issues as they arise and gain context about the nature of the errors. This insight allows us to pinpoint the source of problems, effectively debug and rectify them, and ultimately reduce downtime and service disruptions, saving valuable time and resources.

    Data Quality

    Knowing the data and its processes is undoubtedly important, but all this knowledge is futile if the data itself is of poor quality. That’s where the other essential component of data engineering, data quality, comes into play because data processing is one thing; preparing the data for processing is another.

    In a data project involving multiple sources and formats, various discrepancies are likely to arise. These can include missing values, where essential data points are absent; outdated data, which no longer reflects current information; poorly formatted data that doesn’t conform to expected standards; incorrect data types that lead to processing errors; and duplicate rows that skew results and analyses. Addressing these issues will ensure the accuracy and reliability of the data used in the project.

    Data quality involves enhancing data with key attributes. For instance, accuracy measures how closely the data reflects reality, validity ensures that the data accurately represents what we aim to measure, and completeness guarantees that no critical data is missing. Additionally, attributes like timeliness ensure the data is up to date. Ultimately, data quality is about embedding attributes that build trust in the data. For a deeper dive into this, check out Rita’s blog on Data QA: The Need of the Hour.

    Data quality plays a crucial role in elevating other processes in data engineering. In a data engineering project, there are often multiple entry points for data processing, with data being refined at different stages to achieve a better state each time. Assessing data at the source of each processing stage and addressing issues early on is vital. This approach ensures that data standards are maintained throughout the data flow. As a result, by making data consistent at every step, we gain improved control over the entire data lifecycle. 

    Data tools like Great Expectations and data unit test libraries such as Deequ play a crucial role in safeguarding data pipelines by implementing data quality checks and validations. To gain more context on this, you might want to read Unit Testing Data at Scale using Deequ and Apache Spark by Nishant. These tools ensure that data meets predefined standards, allowing for early detection of issues and maintaining the integrity of data as it moves through the pipeline.
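    As a library-agnostic sketch of the kinds of assertions these tools automate (the column names and rules are illustrative), here is a small pandas example:

    import pandas as pd

    df = pd.DataFrame({
        "order_id": [1, 2, 2, 4],
        "amount": [99.5, None, 10.0, 42.0],
    })

    def validate(df):
        # Assertion-style checks: fail fast instead of letting bad rows flow on.
        problems = []
        if df["order_id"].duplicated().any():
            problems.append("duplicate order_id values")
        if df["amount"].isna().any():
            problems.append("missing values in amount")
        if (df["amount"].dropna() < 0).any():
            problems.append("negative amounts")
        return problems

    issues = validate(df)
    if issues:
        raise ValueError(f"Data quality checks failed: {issues}")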

    Orchestration

    With so many processes in place, it’s essential to ensure everything happens at the right time and in the right way. Relying on someone to manually trigger processes at scheduled times every day is an inefficient use of resources. For that individual, performing the same repetitive tasks can quickly become monotonous. Beyond that, manual execution increases the risk of missing schedules or running tasks out of order, disrupting the entire workflow.

    This is where orchestration comes to the rescue, automating tedious, repetitive tasks and ensuring precision in the timing of data flows. Data pipelines can be complex, involving many interconnected components that must work together seamlessly. Orchestration ensures that each component follows a defined set of rules, dictating when to start, what to do, and how to contribute to the overall process of handling data, thus maintaining smooth and efficient operations.

    This automation helps reduce errors that could occur with manual execution, ensuring that data processes remain consistent by streamlining repetitive tasks. With a number of different orchestration tools and services in place, we can now monitor and manage everything from a single platform. Tools like Airflow, an open-source orchestrator, Prefect, which offers a user-friendly drag-and-drop interface, and cloud services such as Azure Data Factory, Google Cloud Composer, and AWS Step Functions, enhance our visibility and control over the entire process lifecycle, making data management more efficient and reliable. Don’t miss Shreyash’s excellent blog on Mage: Your New Go-To Tool for Data Orchestration.

    Orchestration is built on a foundation of multiple concepts and technologies that make it robust and fail-safe. These underlying principles ensure that orchestration not only automates processes but also maintains reliability and resilience, even in complex and demanding data environments.

    • Workflow definition: This defines how tasks in the pipeline are organized and executed. It lays out the sequence of tasks—telling it what needs to be finished before other tasks can start—and takes care of other conditions for pipeline execution. Think of it like a roadmap that guides the flow of tasks.
    • Task scheduling: This determines when and how tasks are executed. Tasks might run at specific times, in response to events, or based on the completion of other tasks. It’s like scheduling appointments for tasks to ensure they happen at the right time and with the right resources.
    • Dependency management: Since tasks often rely on each other, with the concepts of dependency management, we can ensure that tasks run in the correct order. It ensures that each process starts only when its prerequisites are met, like waiting for a green light before proceeding.

    With these concepts, orchestration tools provide powerful features for workflow design and management, enabling the definition of complex, multi-step processes. They support parallel, sequential, and conditional execution of tasks, allowing for flexibility in how workflows are executed. Not just that, they also offer event-driven and real-time orchestration, enabling systems to respond to dynamic changes and triggers as they occur. These tools also include robust error handling and exception management, ensuring that workflows are resilient and fault-tolerant.
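    These building blocks are easiest to see in code. Below is a minimal sketch in the style of an Airflow 2.x DAG; the DAG id, schedule, and task bodies are illustrative assumptions rather than a prescription:

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def extract():
        print("pull data from the source")

    def transform():
        print("clean and reshape the data")

    def load():
        print("write the result to the warehouse")

    # A daily pipeline: the >> operator encodes dependencies, so each task
    # starts only after its prerequisite has finished.
    with DAG(dag_id="daily_sales_pipeline",
             start_date=datetime(2024, 1, 1),
             schedule="@daily",
             catchup=False) as dag:
        extract_task = PythonOperator(task_id="extract", python_callable=extract)
        transform_task = PythonOperator(task_id="transform", python_callable=transform)
        load_task = PythonOperator(task_id="load", python_callable=load)

        extract_task >> transform_task >> load_task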

    Visualization

    The true value lies not just in collecting vast amounts of data but in interpreting it in ways that generate real business value. This makes visualization a vital component: it provides a clear and accurate representation of data that can be easily understood and utilized by decision-makers. Presenting data in the right way enables businesses to derive intelligence from it, which makes data engineering worth the investment; this is what guides strategic decisions, optimizes operations, and powers innovation.

    Visualizations allow us to see patterns, trends, and anomalies that might not be apparent in raw data. Whether it’s spotting a sudden drop in sales, detecting anomalies in customer behavior, or forecasting future performance, data visualization can provide the clear context needed to make well-informed decisions. When numbers and graphs are presented effectively, it feels as though we are directly communicating with the data, and this language of communication bridges the gap between technical experts and business leaders.

    Visualization Within ETL Processes

    Visualization isn’t just a final output. It can also be a valuable tool within the data engineering process itself. Intermediate visualization during the ETL workflow can be a game-changer. In collaborative teams, as we go through the transformation process, visualizing it at various stages helps ensure the accuracy and relevance of the result. We can understand the datasets better, identify issues or anomalies between different stages, and make more informed decisions about the transformations needed.

    Technologies like Fabric and Mage enable seamless integration of visualizations into ETL pipelines. These tools empower team members at all levels to actively engage with data, ask insightful questions, and contribute to the decision-making process. Visualizing datasets at key points provides the flexibility to verify that data is being processed correctly, develop accurate analytical formulas, and ensure that the final outputs are meaningful.

    Depending on the industry and domain, there are various visualization tools suited to different use cases. For example, 

    • For real-time insights, which are crucial in industries like healthcare, financial trading, and air travel, tools such as Tableau and Striim are invaluable. These tools allow for immediate visualization of live data, enabling quick and informed decision-making.
    • For broad data source integrations and dynamic dashboard querying, often demanded in the technology sector, tools like Power BI, Metabase, and Grafana are highly effective. These platforms support a wide range of data sources and offer flexible, interactive dashboards that facilitate deep analysis and exploration of data.

    It’s Limitless

    We are seeing many advancements in this domain, which are helping businesses, data science, AI and ML, and many other sectors because the potential of data is huge. If a business knows how to use data, it can be a major factor in its success. And for that reason, we have constantly seen the rise of different components in data engineering. All with one goal: to make data useful.

    Recently, we’ve witnessed the introduction of numerous technologies poised to revolutionize the data engineering domain. Concepts like data mesh are enhancing data discovery, improving data ownership, and streamlining data workflows. AI-driven data engineering is rapidly advancing, with expectations to automate key processes such as data cleansing, pipeline optimization, and data validation. We’re already seeing how cloud data services have evolved to embrace AI and machine learning, ensuring seamless integration with data initiatives. The rise of real-time data processing brings new use cases and advancements, while practices like DataOps foster better collaboration among teams. Take a closer look at the modern data stack in Shivam’s detailed article, Modern Data Stack: The What, Why, and How?

    These developments are accompanied by a wide array of technologies designed to support infrastructure, analytics, AI, and machine learning, alongside enterprise tools that lay the foundation for this ongoing evolution. All these elements collectively set the stage for a broader discussion on data engineering and what lies beyond big data. Big data, supported by these satellite activities, aims to extract maximum value from data, unlocking its full potential.

    References:

    1. Velotio – Data Engineering Blogs
    2. Firstmark
    3. MS Purview Data Security
    4. Tech Target – Article on data quality
    5. Splunk – Data Observability: The Complete Introduction
    6. Instagram crash story – WIRED

  • Iceberg: Features and Hands-on (Part 2)

    In the previous blog, we discussed Apache Iceberg’s basic concepts, the setup process, and how to load data. We will now delve into some of Iceberg’s advanced features, including upsert functionality, schema evolution, time travel, and partitioning.

    Upsert Functionality

    One of Iceberg’s key features is its support for upserts. Upsert, which stands for update and insert, allows you to efficiently manage changes to your data. With Iceberg, you can perform these operations seamlessly, ensuring that your data remains accurate and up-to-date without the need for complex and time-consuming processes.

    Schema Evolution

    Schema evolution is another of its powerful features. Over time, the schema of your data may need to change due to new requirements or updates. Iceberg handles schema changes gracefully, allowing you to add, remove, or modify columns without having to rewrite your entire dataset. This flexibility ensures that your data architecture can evolve in tandem with your business needs.

    Time Travel

    Iceberg also provides time travel capabilities, enabling you to query historical data as it existed at any given point in time. This feature is particularly useful for debugging, auditing, and compliance purposes. By leveraging snapshots, you can easily access previous states of your data and perform analyses on how it has changed over time.
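    As a hedged illustration, assuming a Spark session configured for Iceberg (see the setup section below) and Spark 3.3+ SQL syntax, time travel can look like this; the table name, timestamp, and snapshot id are illustrative:

    # List the snapshots Iceberg has recorded for the table.
    spark.sql("SELECT snapshot_id, committed_at FROM demo.db.data_sample.snapshots").show()

    # Query the table as it existed at a past point in time...
    spark.sql("SELECT * FROM demo.db.data_sample TIMESTAMP AS OF '2024-01-01 00:00:00'").show()

    # ...or as of a specific snapshot id taken from the snapshots table.
    spark.sql("SELECT * FROM demo.db.data_sample VERSION AS OF 1234567890").show()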

    Set Up Iceberg on the Local Machine Using the Local Catalog Option or Hive

    You can also configure Iceberg in your Spark session like this:

    import pyspark

    spark = pyspark.sql.SparkSession.builder \
        .config('spark.jars.packages', 'org.apache.iceberg:iceberg-spark-runtime-3.2_2.12:1.1.0') \
        .config('spark.sql.extensions', 'org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions') \
        .config('spark.sql.catalog.spark_catalog', 'org.apache.iceberg.spark.SparkSessionCatalog') \
        .config('spark.sql.catalog.spark_catalog.type', 'hive') \
        .config('spark.sql.catalog.local', 'org.apache.iceberg.spark.SparkCatalog') \
        .config('spark.sql.catalog.local.type', 'hadoop') \
        .config('spark.sql.catalog.local.warehouse', './Data-Engineering/warehouse') \
        .getOrCreate()

    A few configurations like these must be passed while setting up Iceberg.

    Create Tables in Iceberg and Insert Data

    CREATE TABLE demo.db.data_sample (
        index string, organization_id string, name string, website string,
        country string, description string, founded string, industry string,
        num_of_employees string
    ) USING iceberg

    df = spark.read.option("header", "true").csv("../data/input-data/organizations-100.csv")
    
    df.writeTo("demo.db.data_sample").append()

    We can either create the sample table using Spark SQL or write the data directly by specifying the database and table name, which will create the Iceberg table for us.

    You can see the data we have inserted. Apart from appending, you can use the overwrite method, as with Delta Lake tables. You can also see an example of how to read the data from an Iceberg table.

    Handling Upserts

    This Iceberg feature is similar to Delta Lake’s. You can update records in existing Iceberg tables without impacting the complete data set. This is also used to handle CDC operations. We can take input from any incoming CSV and merge the data into the existing table without any duplication; the table will always have a single record for each primary key. This is how Iceberg maintains its ACID properties.

    Incoming Data 

    input_data = spark.read.option("header", "true").csv("../data/input-data/organizations-11111.csv")
    # Creating the temp view of that dataframe to merge
    input_data.createOrReplaceTempView("input_data")
    spark.sql("select * from input_data").show()

    We will merge this data into our existing Iceberg Table using Spark SQL.

    MERGE INTO demo.db.data_sample t
    USING (SELECT * FROM input_data) s
    ON t.organization_id = s.organization_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *;
    
    SELECT * FROM demo.db.data_sample;

    Here, we can see the data once the merge operation has taken place.

    Schema Evolution

    Iceberg supports the following schema evolution changes:

    • Add – Add a new column to the Iceberg table
    • Drop – Remove existing columns from the table
    • Rename – Change the name of columns in the existing table
    • Update – Change the data type or partition columns of the Iceberg table
    • Reorder – Change the order of columns in the Iceberg table

    After updating the schema, there is no need to overwrite or rewrite the data. Say your table previously had four columns, all holding data; if you add two more columns, you don’t need to rewrite the data now that you have six columns, and you can still easily access it. This feature was lacking in Delta Lake but is present here. These are some characteristics of Iceberg schema evolution:

    1. If we add any columns, they won’t impact the existing columns.
    2. If we delete or drop any columns, they won’t impact other columns.
    3. Updating a column or field does not change values in any other column.

    Iceberg uses unique IDs to track each column added to a table.

    Let’s run some queries to update the schema, or let’s try to delete some columns.

    %%sql
    
    ALTER TABLE demo.db.data_sample
    ADD COLUMN fare_per_distance_unit float AFTER num_of_employees;

    After adding another column, if we try to access the data again from the table, we can do so without seeing any kind of error. This is also how Iceberg solves schema-related problems.

    Partition Evolution and Sort Order Evolution

    Iceberg came up with this option, which was missing in Delta Lake. When you evolve a partition spec, the old data written with an earlier spec remains unchanged, and new data is written using the new spec in a new layout. Metadata for each partition version is kept separately. Because of this, when you write queries, you get split planning: each partition layout plans its files separately, using the filter it derives for that specific layout.

    Similar to partition spec, Iceberg sort order can also be updated in an existing table. When you evolve a sort order, the old data written with an earlier order remains unchanged.

    %%sql
    
    ALTER TABLE demo.db.data_sample ADD PARTITION FIELD founded;
    DESCRIBE TABLE demo.db.data_sample;

    Copy-on-Write (COW) and Merge-on-Read (MOR)

    Iceberg supports both COW and MOR when loading data into an Iceberg table. We can configure this either by altering the table or while creating it, as shown below.
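    As a minimal sketch, assuming the Spark session configured earlier, the write mode can be set through Iceberg’s table properties; the merge-on-read values below are illustrative (copy-on-write works the same way), and the same properties can be supplied in the CREATE TABLE statement:

    spark.sql("""
        ALTER TABLE demo.db.data_sample SET TBLPROPERTIES (
            'write.delete.mode' = 'merge-on-read',
            'write.update.mode' = 'merge-on-read',
            'write.merge.mode'  = 'merge-on-read'
        )
    """)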

    Copy-On-Write (COW) – Best for tables with frequent reads, infrequent writes/updates, or large batch updates:

    When your requirement is to read frequently but write and update less often, you can configure this property on an Iceberg table. In COW, when we update or delete any rows in the table, a new data file with another version is created, and the latest version holds the latest updated data. The data is rewritten whenever updates or deletions occur, which makes writes slower and can become a bottleneck with large updates. As the name specifies, it creates another copy of the data on write.

    Reads, however, are ideal: since nothing has to be merged at read time, we can read the data faster.

    Merge-On-Read (MOR) – Best for tables with frequent writes/updates:

    This is just the opposite of COW: we do not rewrite the data files on the update or deletion of rows. Instead, Iceberg records the changes in separate change/delete files, and these are merged with the original data files when the table is read to produce the updated state.

    Query engines and integrations supported: Iceberg integrates with a broad range of engines, including Spark, Flink, Trino, Hive, and Presto.

    Conclusion

    Through this exploration, we learned about Iceberg’s features and its compatibility with various metastores for integration. We got a basic idea of configuring Iceberg both locally and on different cloud platforms, as well as the basics of upserts, schema evolution, and partition evolution.

  • Data QA: The Need of the Hour

    Have you ever encountered vague or misleading data analytics reports? Are you struggling to provide accurate data values to your end users? Have you ever experienced being misdirected by a geographical map application, leading you to the wrong destination? Imagine Amazon customers expressing dissatisfaction due to receiving the wrong product at their doorstep.

    These issues stem from the use of incorrect or vague data by application/service providers. The need of the hour is to address these challenges by enhancing data quality processes and implementing robust data quality solutions. Through effective data management and validation, organizations can unlock valuable insights and make informed decisions.

    “Harnessing the potential of clean data is like painting a masterpiece with accurate brushstrokes.”

    Introduction

    Data quality assurance (QA) is the systematic approach organizations use to ensure they have reliable, correct, consistent, and relevant data. It involves various methods, approaches, and tools to maintain good data quality from commencement to termination.

    What is Data Quality?

    Data quality refers to the overall utility of a dataset and its ability to be easily processed and analyzed for other uses. It is an integral part of data governance that ensures your organization’s data is fit for purpose. 

    How can I measure Data Quality?

    Data quality is typically measured along dimensions such as accuracy, completeness, consistency, validity, timeliness, and uniqueness.

    What is the critical importance of Data Quality?

    Remember, good data is super important! So, invest in good data—it’s the secret sauce for business success!

    What are the Data Quality Challenges?

    1. Data quality issues in production:

    Production-specific data quality issues are primarily caused by unexpected changes in the data and infrastructure failures.

    A. Source and third-party data changes:

    External data sources, like websites or companies, may introduce errors or inconsistencies, making it challenging to use the data accurately. These issues can lead to system errors or missing values, which might go unnoticed without proper monitoring.

    Example:

    • File formats change without warning:

    Imagine we’re using an API to get data in CSV format, and we’ve made a pipeline that handles it well.

    import csv
    
    def process_csv_data(csv_file):
        # Read the CSV and handle each record as a dict keyed by column name.
        with open(csv_file, 'r') as file:
            csv_reader = csv.DictReader(file)
            for row in csv_reader:
                print(row)
    
    csv_file = 'data.csv'
    process_csv_data(csv_file)

    The data source switched to using the JSON format, breaking our pipeline. This inconsistency can cause errors or missing data if our system can’t adapt. Monitoring and adjustments will ensure the accuracy of data analysis or applications.

    • Malformed data values and schema changes:

    Suppose we’re handling inventory data for an e-commerce site. The starting schema for the inventory dataset might have fields like quantity (an integer) and last_updated_at (an ISO-format timestamp).

    Now, imagine that the inventory file’s schema changed suddenly. A “quantity” column has been renamed to “qty,” and the last_updated_at timestamp format switches to epoch timestamp.

    This change might not be communicated in advance, leaving our data pipeline unprepared to handle the new field name and time format (a defensive sketch follows).
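    One hedged way to defend against such drift is a small normalization step before processing; the alias map and epoch handling below are illustrative assumptions:

    import pandas as pd

    # Known renames, mapped back to the schema the pipeline expects.
    COLUMN_ALIASES = {"qty": "quantity"}

    def normalize_inventory(df):
        df = df.rename(columns=COLUMN_ALIASES)
        if pd.api.types.is_numeric_dtype(df["last_updated_at"]):
            # Epoch seconds arrived instead of the expected ISO timestamp.
            df["last_updated_at"] = pd.to_datetime(df["last_updated_at"], unit="s")
        else:
            df["last_updated_at"] = pd.to_datetime(df["last_updated_at"])
        return df

    raw = pd.DataFrame({"qty": [3], "last_updated_at": [1715337600]})
    print(normalize_inventory(raw))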

    B. Infrastructure failures:

    Reliable software is crucial for processing large data volumes, but even the best tools can encounter issues. Infrastructure failures, like glitches or overloads, can disrupt data processing regardless of the software used.

    Solution: 

    Data observability tools such as Monte Carlo, BigEye, and Great Expectations help detect these issues by monitoring for changes in data quality and infrastructure performance. These tools are essential for identifying and alerting on the root causes of data problems, ensuring data reliability in production environments.

    2. Data quality issues during development:

    Development-specific data quality issues are primarily caused by untested code changes.

    A. Incorrect parsing of data:

    Data transformation bugs can occur due to mistakes in code or parsing, leading to data type mismatches or schema inaccuracies.

    Example:

    Imagine we’re converting a date string to a Unix epoch timestamp using Python, but the incoming data is in “%Y-%d-%m” (year-day-month) order, and a misunderstanding of the strptime() format specifiers leads to unexpected outcomes.

    from datetime import datetime
    
    timestamp_str = "2024-05-10"  # incoming data is year-day-month: %Y-%d-%m
    
    # Incorrect: parses the string as year-month-day instead of year-day-month
    format_date = "%Y-%m-%d" 
    timestamp_dt = datetime.strptime(timestamp_str, format_date)
    
    epoch_seconds = int(timestamp_dt.timestamp())

    This error makes strptime() interpret “2024” as the year, “05” as the month (instead of the day), and “10” as the day (instead of the month), leading to inaccurate data in the timestamp_dt variable.

    B. Misapplied or misunderstood requirements:

    Even with the right code, data quality problems can still occur if requirements are misunderstood, resulting in logic errors and data quality issues.

    Example:
    Imagine we’re assigned to validate product prices in a dataset, ensuring they fall strictly between $10 and $100.

    product_prices = [10, 5, 25, 50, 75, 110]
    valid_prices = []
    
    for price in product_prices:
        if price >= 10 and price <= 100:
            valid_prices.append(price)
    
    print("Valid prices:", valid_prices)

    The requirement intends prices strictly between $10 and $100, both bounds exclusive. But a misinterpretation leads the code to check whether prices are >= $10 and <= $100. This makes the boundary value $10 pass validation, causing a data quality problem (a corrected version follows).
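    Under the exclusive-bounds reading of the requirement, the corrected check would be:

    # Exclusive bounds: the boundary values $10 and $100 are not valid prices.
    valid_prices = [price for price in product_prices if 10 < price < 100]
    print("Valid prices:", valid_prices)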

    C. Unaccounted downstream dependencies:

    Despite careful planning and logic, data quality incidents can occur due to overlooked dependencies. Understanding data lineage and communicating effectively across all users is crucial to preventing such incidents.

    Example:

    Suppose we’re working on a database schema migration project for an e-commerce system. In the process, we rename the order_date column to purchase_date in the orders table. Despite careful planning and testing, a data quality issue arises due to an overlooked downstream dependency. The marketing team’s reporting dashboard relies on a SQL query referencing the order_date column, now renamed purchase_date, resulting in inaccurate reporting and potentially misinformed business decisions.

    Here’s an example SQL query that represents the overlooked downstream dependency:

    -- SQL query used by the marketing team's reporting dashboard
    SELECT 
        DATE_TRUNC('month', order_date) AS month,
        SUM(total_amount) AS total_sales
    FROM 
        orders
    GROUP BY 
        DATE_TRUNC('month', order_date)

    This SQL query relies on the order_date column to calculate monthly sales metrics. After the schema migration, this column no longer exists, causing query failure and inaccurate reporting.

    Solutions:

    Data Quality tools like Great Expectations and Deequ proactively catch data quality issues by testing changes introduced from data-processing code, preventing issues from reaching production.

    a. Testing assertions: Assertions validate data against expectations, ensuring data integrity. While useful, they require careful maintenance and should be selectively applied.

    Example:
    Suppose we have an “orders” table in our dbt project and need to ensure the “total_amount” column contains only numeric values; we can write a dbt test to validate this data quality rule.

    version: 2
    
    models:
      - name: orders
        columns:
          - name: total_amount
            tests:
              - data_type: numeric

    In this dbt test code:

    • We specify the dbt version (version: 2), model named “orders,” and “total_amount” column.
    • Within the “total_amount” column definition, we add a test named “data_type” with the value “numeric,” ensuring the column contains only numeric data.
    • Running the dbt test command will execute this test, checking if the “total_amount” column adheres to the numeric data type. Any failure indicates a data quality issue.

    b. Comparing staging and production data: Data Diff is a CLI tool that compares datasets within or across databases, highlighting changes in data similar to how git diff highlights changes in source code, aiding in detecting data quality issues early in the development process.

    Here’s a data-diff example between staging and production databases for the payment_table.

    data-diff \
      staging_db_connection \
      staging_payment_table \
      production_db_connection \
      production_payment_table \
      -k primary_key \
      -c "payment_amount, payment_type, payment_currency" \
      -w filter_condition  # optional

    Source: https://docs.datafold.com/data_diff/what_is_data_diff

    What are some best practices for maintaining high-quality data?

    1. Establish Data Standards: Define clear data standards and guidelines for data collection, storage, and usage to ensure consistency and accuracy across the organization.
    2. Data Validation: Implement validation checks to ensure data conforms to predefined rules and standards, identifying and correcting errors early in the data lifecycle.
    3. Regular Data Cleansing: Schedule regular data cleansing activities to identify and correct inaccuracies, inconsistencies, and duplicates in the data, ensuring its reliability and integrity over time.
    4. Data Governance: Establish data governance policies and procedures to manage data assets effectively, including roles and responsibilities, data ownership, access controls, and compliance with regulations.
    5. Metadata Management: Maintain comprehensive metadata to document data lineage, definitions, and usage, providing transparency and context for data consumers and stakeholders.
    6. Data Security: Implement robust data security measures to protect sensitive information from unauthorized access, ensuring data confidentiality, integrity, and availability.
    7. Data Quality Monitoring: Continuously monitor data quality metrics and KPIs to track performance, detect anomalies, and identify areas for improvement, enabling proactive data quality management.
    8. Data Training and Awareness: Provide data training and awareness programs for employees to enhance their understanding of data quality principles, practices, and tools, fostering a data-driven culture within the organization.
    9. Collaboration and Communication: Encourage collaboration and communication among stakeholders, data stewards, and IT teams to address data quality issues effectively and promote accountability and ownership of data quality initiatives.
    10. Continuous Improvement: Establish a culture of continuous improvement by regularly reviewing and refining data quality processes, tools, and strategies based on feedback, lessons learned, and evolving business needs.

    Can you recommend any tools for improving data quality?

    1. AWS Deequ: AWS Deequ is an open-source data quality library built on top of Apache Spark. It provides tools for defining data quality rules and validating large-scale datasets in Spark-based data processing pipelines.
    2. Great Expectations: GX Cloud is a fully managed SaaS solution that simplifies deployment, scaling, and collaboration and lets you focus on data validation.
    3. Soda: Soda allows data engineers to test data quality early and often in pipelines to catch data quality issues before they have a downstream impact.
    4. Datafold: Datafold is a cloud-based data quality platform that automates and simplifies the process of monitoring and validating data pipelines. It offers features such as automated data comparison, anomaly detection, and integration with popular data processing tools like dbt.

    Considerations for Selecting a Data QA Tool:

    Selecting a data QA (Quality Assurance) tool hinges on your specific needs and requirements. Consider factors such as: 

    1. Scalability and Performance: Ensure the tool can handle current and future data volumes efficiently, with real-time processing capabilities.

    Example: Great Expectations helps validate data in big data environments by providing a scalable and customizable way to define and monitor data quality across different sources.

    2. Data Profiling and Cleansing Capabilities: Look for comprehensive data profiling and cleansing features to detect anomalies and improve data quality.

    Example: AWS Glue DataBrew offers data profiling, cleaning, and normalization, along with data lineage mapping and the automation of recurring cleaning and normalization tasks.

    3. Data Monitoring Features: Choose tools with continuous monitoring capabilities, allowing you to track metrics and establish data lineage.

    Example: Datafold’s monitoring feature allows data engineers to write SQL commands to find anomalies and create automated alerts.

    4. Seamless Integration with Existing Systems: Select a tool compatible with your existing systems to minimize disruption and facilitate seamless integration.

    Example: dbt offers seamless integration with existing data infrastructure, including data warehouses and BI tools. It allows users to define data transformation pipelines using SQL, making it compatible with a wide range of data systems.

    5. User-Friendly Interface: Prioritize tools with intuitive interfaces for quick adoption and minimal training requirements.

    Example: Soda SQL is an open-source tool with a simple command line interface (CLI) and Python library to test your data through metric collection.

    6. Flexibility and Customization Options: Seek tools that offer flexibility to adapt to changing data requirements and allow customization of rules and workflows.

    Example: dbt offers flexibility and customization options for defining data transformation workflows. 

    7. Vendor Support and Community: Evaluate vendors based on their support reputation and active user communities for shared knowledge and resources.

    Example: AWS Deequ is supported by Amazon Web Services (AWS) and has an active community of users. It provides comprehensive documentation, tutorials, and forums for users to seek assistance and share knowledge about data quality best practices.

    8. Pricing and Licensing Options: Consider pricing models that align with your budget and expected data usage, such as subscription-based or volume-based pricing.

    Example: Great Expectations offers flexible pricing and licensing options, including both an open-source (freely available) edition and an enterprise edition (subscription-based).

    Ultimately, the right tool should effectively address your data quality challenges and seamlessly fit into your data infrastructure and workflows.

    Conclusion: The Vital Role of Data Quality

    In conclusion, data quality is paramount in today’s digital age. It underpins informed decisions, strategic formulation, and business success. Without it, organizations risk flawed judgments, inefficiencies, and a loss of competitiveness. Recognizing its vital role empowers businesses to drive innovation, enhance customer experiences, and achieve sustainable growth. Investing in robust data management, embracing technology, and fostering data integrity are essential. Prioritizing data quality is key to seizing new opportunities and staying ahead in the data-driven landscape.

    References:

    https://docs.getdbt.com/docs/build/data-tests

    https://aws.amazon.com/blogs/big-data/test-data-quality-at-scale-with-deequ

    https://www.soda.io/resources/introducing-soda-sql

  • Iceberg – Introduction and Setup (Part – 1)

    As discussed in our previous Delta Lake blog, several open table formats are already in use, each with its own strengths and benefits. Iceberg is one of them, and it is the subject of this blog.

    What is Apache Iceberg?

    Apache Iceberg is an open-source table format used to handle large amounts of data stored locally or on various cloud storage platforms. Netflix originally developed Iceberg to solve its big data problems, then donated it to the Apache Software Foundation, where it became open source in 2018. Iceberg now has a large number of contributors all over the world on GitHub and is one of the most widely used table formats.

    Iceberg solves the key problems teams once faced when using the Hive table format with data stored on cloud storage such as S3.

    Iceberg tables offer features and capabilities similar to SQL tables. Because the format is open, multiple engines such as Spark can operate on the same tables to perform transformations, and Iceberg provides full ACID guarantees. This blog is a quick introduction to Iceberg, covering its features and initial setup.

    Why Go with Iceberg?

    The main reason to use Iceberg is that it performs better when loading data from S3 or when metadata lives on a cloud storage medium. Hive tracks data at the folder level, which forces expensive directory listings on object stores and can degrade performance; Iceberg instead tracks data at the file level, which is why we choose it. Each Iceberg table is a combination of four kinds of files, sketched after the list below: the snapshot metadata file, the manifest list, manifest files, and data files.

    1. Snapshot Metadata File: This file holds the table’s metadata, such as the schema, partition spec, and the location of the manifest list.
    2. Manifest List: This file records each manifest file along with its path and summary metadata. At this level, Iceberg decides which manifest files to read and which to skip.
    3. Manifest File: This file contains the paths to the actual data files, along with per-file metadata such as partition values and column statistics.
    4. Data File: These are the actual Parquet, ORC, or Avro files that hold the table’s data.
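
    For orientation, here is a simplified sketch of how an Iceberg table typically lays these files out on object storage; the bucket, table, and file names are hypothetical placeholders, and exact naming varies by Iceberg version.

    s3://your-bucket/warehouse/db/payment_table/
    ├── metadata/
    │   ├── v1.metadata.json              # snapshot metadata file
    │   ├── snap-<snapshot-id>.avro       # manifest list
    │   └── <uuid>-m0.avro                # manifest file
    └── data/
        └── date=20240101/
            └── 00000-0-<uuid>.parquet    # data file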

    Features of Iceberg:

    Some Iceberg features include:

    • Schema Evolution: Iceberg allows you to evolve your schema without having to rewrite your data. This means you can easily add, drop, or rename columns, providing flexibility to adapt to changing data requirements without impacting existing queries.
    • Partition Evolution: Iceberg supports partition evolution, enabling you to modify the partitioning scheme as your data and query patterns evolve. This feature helps maintain query performance and optimize data layout over time.
    • Time Travel: Iceberg’s time travel feature allows you to query historical versions of your data. This is particularly useful for debugging, auditing, and recreating analyses based on past data states.
    • Multiple Query Engine Support: Iceberg supports multiple query engines, including Trino, Presto, Hive, and Amazon Athena. This interoperability ensures that you can read and write data across different tools seamlessly, facilitating a more versatile and integrated data ecosystem.
    • AWS Support: Iceberg is well-integrated with AWS services, making it easy to use with Amazon S3 for storage and other AWS analytics services. This integration helps leverage the scalability and reliability of AWS infrastructure for your data lake.
    • ACID Compliance: Iceberg ensures ACID (Atomicity, Consistency, Isolation, Durability) transactions, providing reliable data consistency and integrity. This makes it suitable for complex data operations and concurrent workloads, ensuring data reliability and accuracy.
    • Hidden Partitioning: Iceberg’s hidden partitioning abstracts the complexity of managing partitions from the user, automatically handling partition management to improve query performance without manual intervention.
    • Snapshot Isolation: Iceberg supports snapshot isolation, enabling concurrent read and write operations without conflicts. This isolation ensures that users can work with consistent views of the data, even as it is being updated.
    • Support for Large Tables: Designed for high scalability, Iceberg can efficiently handle petabyte-scale tables, making it ideal for large datasets typical in big data environments.
    • Compatibility with Modern Data Lakes: Iceberg’s design is tailored for modern data lake architectures, supporting efficient data organization, metadata management, and performance optimization, aligning well with contemporary data management practices.

    These features make Iceberg a powerful and flexible table format for managing data lakes, ensuring efficient data processing, robust performance, and seamless integration with various tools and platforms. By leveraging Iceberg, organizations can achieve greater data agility, reliability, and efficiency, enhancing their data analytics capabilities and driving better business outcomes.
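
    To make the schema evolution and time travel features above concrete, here is a minimal sketch using Spark SQL. It assumes the demo catalog configured later in this post; the table name demo.db.payments, the added/renamed columns, and the snapshot ID are hypothetical.

    # Schema evolution: add or rename columns without rewriting existing data.
    spark.sql("ALTER TABLE demo.db.payments ADD COLUMN payment_channel STRING")
    spark.sql("ALTER TABLE demo.db.payments RENAME COLUMN pincode TO postal_code")

    # List snapshots to find a point in time to query.
    spark.sql("SELECT snapshot_id, committed_at FROM demo.db.payments.snapshots").show()

    # Time travel: read the table as of a snapshot ID or a timestamp.
    spark.sql("SELECT * FROM demo.db.payments VERSION AS OF 4358109269898128072").show()
    spark.sql("SELECT * FROM demo.db.payments TIMESTAMP AS OF '2024-01-01 00:00:00'").show()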

    Prerequisites:

    • PySpark: Ensure that you have PySpark installed and properly configured. PySpark provides the Python API for Spark, enabling you to harness the power of distributed computing with Spark using Python.
    • Python: Make sure you have Python installed on your system. Python is essential for writing and running your PySpark scripts. It’s recommended to use a virtual environment to manage your dependencies effectively.
    • Iceberg-Spark JAR: Download the appropriate Iceberg-Spark JAR file that corresponds to your Spark version. This JAR file is necessary to integrate Iceberg with Spark, allowing you to utilize Iceberg’s advanced table format capabilities within your Spark jobs.
    • Jars to Configure Cloud Storage: Obtain and configure the necessary JAR files for your specific cloud storage provider. For example, if you are using Amazon S3, you will need the hadoop-aws JAR and its dependencies. For Google Cloud Storage, you need the gcs-connector JAR. These JARs enable Spark to read from and write to cloud storage systems.
    • Spark and Hadoop Configuration: Ensure your Spark and Hadoop configurations are correctly set up to integrate with your cloud storage. This might include setting the appropriate access keys, secret keys, and endpoint configurations in your spark-defaults.conf and core-site.xml.
    • Iceberg Configuration: Configure Iceberg settings specific to your environment. This might include catalog configurations (e.g., Hive, Hadoop, AWS Glue) and other Iceberg properties that optimize performance and compatibility.
    • Development Environment: Set up a development environment with an IDE or text editor that supports Python and Spark development, such as PyCharm, IntelliJ IDEA with the Python plugin, Visual Studio Code, or Jupyter Notebooks.
    • Data Source Access: Ensure you have access to the data sources you will be working with, whether they are in cloud storage, relational databases, or other data repositories. Proper permissions and network configurations are necessary for seamless data integration.
    • Basic Understanding of Data Lakes: A foundational understanding of data lake concepts and architectures will help you utilize Iceberg effectively. Knowledge of how data lakes differ from traditional data warehouses, and of their benefits, will also be helpful.
    • Version Control System: Use a version control system like Git to manage your codebase. This helps in tracking changes, collaborating with team members, and maintaining code quality.
    • Documentation and Resources: Familiarize yourself with Iceberg documentation and other relevant resources. This will help you troubleshoot issues, understand best practices, and leverage advanced features effectively.

    You can download the runtime JAR from here, according to the Spark version installed on your machine or cluster; the setup mirrors the Delta Lake setup. You can either download the JAR files to your machine or cluster and pass them in the spark-submit command, or fetch them while initializing the Spark session by listing them in the Spark config as JAR packages with the appropriate versions.

    To use cloud storage, we pass these JARs and use an S3 bucket for reading and writing Iceberg tables. Here is a basic example of creating a Spark session:

    import pyspark

    # Credentials redacted; prefer instance roles or environment variables in practice.
    AWS_ACCESS_KEY_ID = "XXXXXXXXXXXXXX"
    AWS_SECRET_ACCESS_KEY = "XXXXXXXXXXXXXXXXXXXXXXXXXXXX"

    spark_jars_packages = "com.amazonaws:aws-java-sdk:1.12.246,org.apache.hadoop:hadoop-aws:3.2.2,org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:1.1.0"

    spark = pyspark.sql.SparkSession.builder \
       .config("spark.jars.packages", spark_jars_packages) \
       .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions") \
       .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog") \
       .config("spark.sql.catalog.demo.warehouse", "s3a://abhishek-test-01012023/iceberg-sample-data/") \
       .config("spark.sql.catalog.demo.type", "hadoop") \
       .config("spark.hadoop.fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider") \
       .config("spark.driver.memory", "20g") \
       .config("spark.memory.offHeap.enabled", "true") \
       .config("spark.memory.offHeap.size", "8g") \
       .getOrCreate()

    spark.sparkContext._jsc.hadoopConfiguration().set("fs.s3a.access.key", AWS_ACCESS_KEY_ID)
    spark.sparkContext._jsc.hadoopConfiguration().set("fs.s3a.secret.key", AWS_SECRET_ACCESS_KEY)

    Iceberg Setup Using Docker

    You can set AWS credentials, as well as database- or stream-related configuration, inside the docker-compose file.

    version: "3"
    
    services:
      spark-iceberg:
        image: tabulario/spark-iceberg
        container_name: spark-iceberg
        build: spark/
        depends_on:
          - rest
          - minio
        volumes:
          - ./warehouse:/home/iceberg/warehouse
          - ./notebooks:/home/iceberg/notebooks/notebooks
          - ./data:/home/iceberg/data
        environment:
          - AWS_ACCESS_KEY_ID=admin
          - AWS_SECRET_ACCESS_KEY=password
          - AWS_REGION=us-east-1
        ports:
          - 8888:8888
          - 8080:8080
        links:
          - rest:rest
          - minio:minio
      rest:
        image: tabulario/iceberg-rest:0.1.0
        ports:
          - 8181:8181
        environment:
          - AWS_ACCESS_KEY_ID=admin
          - AWS_SECRET_ACCESS_KEY=password
          - AWS_REGION=us-east-1
          - CATALOG_WAREHOUSE=s3a://warehouse/wh/
          - CATALOG_IO__IMPL=org.apache.iceberg.aws.s3.S3FileIO
          - CATALOG_S3_ENDPOINT=http://minio:9000
      minio:
        image: minio/minio
        container_name: minio
        environment:
          - MINIO_ROOT_USER=admin
          - MINIO_ROOT_PASSWORD=password
        ports:
          - 9001:9001
          - 9000:9000
        command: ["server", "/data", "--console-address", ":9001"]
      mc:
        depends_on:
          - minio
        image: minio/mc
        container_name: mc
        environment:
          - AWS_ACCESS_KEY_ID=admin
          - AWS_SECRET_ACCESS_KEY=password
          - AWS_REGION=us-east-1
        entrypoint: >
          /bin/sh -c "
          until (/usr/bin/mc config host add minio http://minio:9000 admin password) do echo '...waiting...' && sleep 1; done;
          /usr/bin/mc rm -r --force minio/warehouse;
          /usr/bin/mc mb minio/warehouse;
          /usr/bin/mc policy set public minio/warehouse;
          exit 0;
          " 

    Save this file as docker-compose.yaml and run the command: docker compose up. You can then log into your container using this command:

    docker exec -it <container-id> bash

    You can mount the sample data directory in the container or copy it from your local machine. To copy data into the container’s directory, use the docker cp command:

    docker cp input-data <Container ID>:/home/iceberg/data 

    Set Up S3 as a Warehouse in Iceberg: Read Data from S3 and Write Iceberg Tables Back to S3 Using an EC2 Instance

    We generated 90 GB of data using a Spark job and stored it in the S3 bucket, reusing the same Spark session configuration shown above.


    Step 1

    We read the data in Spark and create an Iceberg table from it, storing the Iceberg table in the same S3 bucket.

    Some Iceberg functionality won’t work if the appropriate Iceberg JAR for your Spark version is not installed. The Iceberg version must be compatible with the Spark version you are using; otherwise, some features, such as partitioning, will throw a NoSuchMethodError. This must be handled carefully while setting up on either EC2 or EMR.

    Create an Iceberg table on S3 and write data into that table. The sample data was generated with a Spark job for the Delta tables blog; we reuse the same data, with the schema shown below.

    Step 2

    We created the Iceberg table at a location in the S3 bucket and wrote the data, partitioned by the date column, back to the same bucket.

    spark.sql(""" CREATE TABLE IF NOT EXISTS demo.db.iceberg_data_2(id INT, first_name String,
    last_name String, address String, pincocde INT, net_income INT, source_of_income String,
    state String, email_id String, description String, population INT, population_1 String,
    population_2 String, population_3 String, population_4 String, population_5 String, population_6 String,
    population_7 String, date INT)
    USING iceberg
    TBLPROPERTIES ('format'='parquet', 'format-version' = '2')
    PARTITIONED BY (`date`)
    location 's3a://abhishek-test-01012023/iceberg_v2/db/iceberg_data_2'""")
    
    # Read the data that need to be written
    # Reading the data from delta tables in spark Dataframe
    
    df = spark.read.parquet("s3a://abhishek-test-01012023/delta-lake-sample-data/")
    
    logging.info("Starting writing the data")
    
    df.sortWithinPartitions("date").writeTo("demo.db.iceberg_data").partitionedBy("date").createOrReplace()
    
    logging.info("Writing has been finished")
    
    logging.info("Query the data from iceberg using spark SQL")
    
    spark.sql("describe table demo.db.iceberg_data").show()
    spark.sql("Select * from demo.db.iceberg_data limit 10").show()

    This is how we can use Iceberg over S3. There is another option: we can also create Iceberg tables in the AWS Glue catalog. With Delta Lake, most tables created in the Glue catalog are external tables queried through Athena only after generating manifest files.

    Step 3

    We print the Iceberg table’s data along with the table description.

    Using Iceberg, we can create the table directly in the Glue catalog using Athena, and it supports all read and write operations on the available data. These are the configurations needed in Spark when using the Glue catalog:

    {
        "conf": {
            "spark.sql.catalog.glue_catalog1": "org.apache.iceberg.spark.SparkCatalog",
            "spark.sql.catalog.glue_catalog1.warehouse": "s3://YOUR-BUCKET-NAME/iceberg/glue_catalog1/tables/",
            "spark.sql.catalog.glue_catalog1.catalog-impl": "org.apache.iceberg.aws.glue.GlueCatalog",
            "spark.sql.catalog.glue_catalog1.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
            "spark.sql.catalog.glue_catalog1.lock-impl": "org.apache.iceberg.aws.glue.DynamoLockManager",
            "spark.sql.catalog.glue_catalog1.lock.table": "myGlueLockTable",
            "spark.sql.extensions": "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions"
        }
    }

    Now we can easily create the Iceberg table using Spark or Athena, and it will be accessible from both engines. We can perform upserts, too, as sketched below.
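
    As a brief illustration of upserts, here is a minimal MERGE INTO sketch in Spark SQL; the staging_updates source and updates_df DataFrame are assumptions, standing in for whatever holds your new and changed rows.

    # Register a DataFrame of new/changed rows as a temp view (hypothetical source).
    updates_df.createOrReplaceTempView("staging_updates")

    # Upsert into the Iceberg table created above.
    spark.sql("""
        MERGE INTO demo.db.iceberg_data_2 AS target
        USING staging_updates AS source
        ON target.id = source.id
        WHEN MATCHED THEN UPDATE SET *
        WHEN NOT MATCHED THEN INSERT *
    """)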

    Conclusion

    We’ve learned the basics of the Iceberg table format, its features, and the reasons for choosing Iceberg. We discussed how Iceberg provides significant advantages such as schema evolution, partition evolution, hidden partitioning, and ACID compliance, making it a robust choice for managing large-scale data. We also delved into the fundamental setup required to implement this table format, including configuration and integration with data processing engines like Apache Spark and query engines like Presto and Trino. By leveraging Iceberg, organizations can ensure efficient data management and analytics, facilitating better performance and scalability. With this knowledge, you are well-equipped to start using Iceberg for your data lake needs, ensuring a more organized, scalable, and efficient data infrastructure.

  • Policy Insights: Chatbots and RAG in Health Insurance Navigation

    Introduction

    Understanding health insurance policies can often be complicated, leaving individuals to tackle lengthy and difficult documents. The complexity introduced by these policies’ language not only adds to the confusion but also leaves policyholders uncertain about the actual extent of their coverage, the best plan for their needs, and how to seek answers to their specific policy-related questions. In response to these ongoing challenges and to facilitate better access to information, a fresh perspective is being explored—an innovative approach to revolutionize how individuals engage with their health insurance policies.

    Challenges in Health Insurance Communication

    Health insurance queries are inherently complex, often involving nuanced details that require precision. Traditional chatbots, lacking the finesse of generative AI (GenAI), struggle to handle the intricacies of healthcare-related questions. The envisioned health insurance chatbot powered by GenAI overcomes these limitations, offering a sophisticated understanding of queries and delivering responses that align with the complexities of the healthcare sphere.

    Retrieval-Augmented Generation

    Retrieval-augmented generation, or RAG, is an architectural approach that can improve the efficacy of large language model (LLM) applications by leveraging custom data. It works by retrieving data or documents relevant to a question or task and providing them as context for the LLM. RAG has shown success in supporting chatbots and Q&A systems that need to maintain up-to-date information or access domain-specific knowledge.

    To learn more about this topic, see the following resources for technical insights and additional information.

    1. https://www.oracle.com/in/artificial-intelligence/generative-ai/retrieval-augmented-generation-rag/

    2. https://research.ibm.com/blog/retrieval-augmented-generation-RAG

    The Dual Phases of RAG: Retrieval and Content Generation

    Retrieval-augmented generation (RAG) smoothly combines two essential steps, carefully blending retrieval and content generation. Initially, algorithms diligently explore external knowledge bases to find relevant data that matches user queries. This gathered information then becomes the foundation for the next phase—content generation. In this step, the large language model uses both the enhanced prompt and its internal training data to create responses that are not only accurate but also contextually appropriate.

    Advantages of Deploying RAG in AI Chatbots

    Scalability is a key advantage of RAG over traditional models. Instead of relying on a monolithic model attempting to memorize vast amounts of information, RAG models can easily scale by updating or expanding the external database. This flexibility enables them to manage and incorporate a broader range of data efficiently.

    Memory efficiency is another strength of RAG in comparison to models like GPT. While traditional models have limitations on the volume of data they can store and recall, RAG efficiently utilizes external databases. This approach allows RAG to fetch fresh, updated, or detailed information as needed, surpassing the memory constraints of conventional models.

    Moreover, RAG offers flexibility in its knowledge sources. By modifying or enlarging the external knowledge base, a RAG model can be adapted to specific domains without the need for retraining the underlying generative model. This adaptability ensures that RAG remains a versatile and efficient solution for various applications.

    The application flow is as follows. In the development of our health insurance chatbot, we follow a comprehensive training process. Initially, essential PDF documents are loaded to familiarize our model with the intricacies of health insurance. These documents undergo tokenization, breaking them into smaller units for in-depth analysis. Each of these units, referred to as tokens, is then transformed into numerical vectors through a process known as vectorization. These numerical representations are efficiently stored in ChromaDB for quick retrieval.

    When a user poses a question, the chatbot embeds the query and retrieves the most similar vectors from ChromaDB. Employing a large language model (LLM), the chatbot then crafts a nuanced response grounded in the retrieved context. This method ensures a smooth and efficient conversational experience. Armed with a wealth of health insurance information, the chatbot delivers precise and contextually relevant responses to user inquiries, establishing itself as a valuable resource for navigating the complexities of health insurance queries.

    Role of Vector Embedding

    Traditional search engines mainly focus on finding specific words in your search. For example, if you search “best smartphone,” it looks for pages with exactly those words. On the other hand, semantic search is like a more understanding search engine. It tries to figure out what you really mean by considering the context of your words.

    Imagine you are planning a vacation and want to find a suitable destination, and you input the query “warm places to visit in winter.” In a traditional search, the engine would look for exact matches of these words on web pages. Results might include pages with those specific terms, but the relevance might vary.

    Text, audio, and video can be embedded:

    An embedding is a vector (list) of floating point numbers. The distance between two vectors measures their relatedness. Small distances suggest high relatedness, and large distances suggest low relatedness.

    For example:

    bat: [0.6, -0.3, 0.8, …]

    ball: [0.4, -0.2, 0.7, …]

    wicket: [-0.5, 0.6, -0.2, …]

    In this cricket-themed example, each word (bat, ball, wicket) is represented as a vector in a multi-dimensional space, capturing the semantic relationships between cricket-related terms.

    For a deeper understanding, you may explore additional insights in the following articles:

    1. https://www.datastax.com/guides/what-is-a-vector-embedding

    2. https://www.pinecone.io/learn/vector-embeddings/

    3. https://weaviate.io/blog/vector-embeddings-explained/

    A specialized type of database known as a vector database is essential for storing these numerical representations. In a vector database, data is stored as mathematical vectors, providing a unique way to store and retrieve information. This specialized database greatly facilitates machine learning models in retaining and recalling previous inputs, enabling powerful applications in search, recommendations, and text generation.

    Vector retrieval in a database involves finding the nearest neighbors or most similar vectors to a given query vector. These are metrics for finding similar vectors:

    1. The Euclidean distance metric considers both magnitudes and direction, providing a comprehensive measure for assessing the spatial separation between vectors.

    2. Cosine similarity focuses solely on the direction of vectors, offering insights into their alignment within the vector space.

    3. The dot product similarity metric takes into account both magnitude and direction, offering a versatile approach for evaluating the relationships between vectors.
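
    Below is a minimal sketch computing all three metrics with NumPy. The 3-dimensional vectors are hypothetical stand-ins for real embeddings, which typically have hundreds or thousands of dimensions.

    import numpy as np

    # Hypothetical toy embeddings for the cricket example above.
    bat = np.array([0.6, -0.3, 0.8])
    ball = np.array([0.4, -0.2, 0.7])
    wicket = np.array([-0.5, 0.6, -0.2])

    def euclidean(a, b):
        return np.linalg.norm(a - b)   # smaller = more similar

    def cosine(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))  # closer to 1 = more similar

    def dot(a, b):
        return a @ b                   # larger = more similar

    print("bat vs ball:  ", euclidean(bat, ball), cosine(bat, ball), dot(bat, ball))
    print("bat vs wicket:", euclidean(bat, wicket), cosine(bat, wicket), dot(bat, wicket))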

    ChromaDB, Pinecone, and Milvus are a few examples of vector databases.

    For our application, we will be using LangChain, OpenAI embedding and LLM, and ChromaDB.

    1. We need to install the Python packages required for this application.
    !pip install -U langchain openai chromadb langchainhub pypdf tiktoken

    A. LangChain is a tool that helps you build intelligent applications using language models. It allows you to develop chatbots, personal assistants, and applications that can summarize, analyze, or respond to questions about documents or data. It’s useful for tasks like coding assistance, working with APIs, and other activities that gain an advantage from AI technology.

    B. OpenAI is a renowned artificial intelligence research lab. Installing the OpenAI package provides access to OpenAI’s language models, including powerful ones like GPT-3. This library is crucial if you plan to integrate OpenAI’s language models into your applications.

    C. As mentioned earlier, ChromaDB is a vector database package designed to handle vector data efficiently, making it suitable for applications that involve similarity searches, clustering, or other operations on vectors.

    D. LangChainHub is a handy tool to make your language tasks easier. It begins with helpful prompts and will soon include even more features like chains and agents.

    E. pypdf is a library for working with PDF files in Python. It allows reading and manipulating PDF documents, making it useful for tasks such as extracting text or merging PDF files.

    F. Tiktoken is a Python library designed for counting the number of tokens in a text string without making an API call. This can be particularly useful for managing token limits when working with language models or APIs that have usage constraints.
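
    As a quick illustration of the last package, here is a minimal sketch of counting tokens with tiktoken before sending text to the model; the model name and sample query are assumptions.

    import tiktoken

    # Pick the tokenizer that matches the target model.
    encoding = tiktoken.encoding_for_model("gpt-4")

    text = "Does my policy cover hospitalization for pre-existing conditions?"
    num_tokens = len(encoding.encode(text))
    print(f"{num_tokens} tokens")  # helps stay within the model's context limit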

    2. Importing Libraries
    from langchain.chat_models import ChatOpenAI 
    from langchain.embeddings import OpenAIEmbeddings 
    from langchain.vectorstores import Chroma 
    from langchain.prompts import ChatPromptTemplate 
    from langchain.prompts import PromptTemplate
    from langchain.schema import StrOutputParser 
    from langchain.schema.runnable import RunnablePassthrough

    3. Initializing the OpenAI LLM
    llm = ChatOpenAI(
        api_key=OPENAI_API_KEY,  # assumes OPENAI_API_KEY holds your API key
        model_name="gpt-4",
        temperature=0.1,
    )

    This code initializes a language model (LLM) using OpenAI’s GPT-4 model, which has an 8192-token context window. The temperature parameter influences the randomness of the generated text: a higher temperature results in more creative responses, while a lower temperature leads to more focused and deterministic answers.

    4. We load a PDF containing the chatbot’s reference material and divide it into chunks of text that can be fed to the model.
    from langchain.document_loaders import PyPDFLoader

    loader = PyPDFLoader("HealthInsureBot_GenerativeAI_TrainingGuide.pdf")
    docs = loader.load_and_split()

    5. We load these chunks of text into the vector database ChromaDB using OpenAI embeddings; the store is later used for retrieval.
    vectorstore = Chroma.from_documents(
        documents=docs,
        embedding=OpenAIEmbeddings(api_key=OPENAI_API_KEY),
    )

    6. We create a retriever object that returns the top 3 most similar vector matches for a query.
    retriever = vectorstore.as_retriever(
        search_type="similarity",
        search_kwargs={"k": 3},
    )

    7. Craft a prompt to pass to the LLM for obtaining specific information: a well-structured question or instruction that clearly outlines the desired details. The RAG chain starts with the retriever and formatted documents, progresses through the custom prompt template, invokes the LLM, and concludes with a string output parser (StrOutputParser()) to handle the resulting response.

    def format_docs(docs):
        return "\n\n".join(doc.page_content for doc in docs)

    template = """Use the following pieces of context as a virtual health insurance agent to answer the question and provide a relevance score out of 10 for each response. If you don't know the answer, just say that you don't know; don't try to make up an answer. {context} Question: {question} Helpful Answer:"""
    rag_prompt_custom = PromptTemplate.from_template(template)

    rag_chain = (
        {"context": retriever | format_docs, "question": RunnablePassthrough()}
        | rag_prompt_custom
        | llm
        | StrOutputParser()
    )

    8. Create a function to get a response from the chatbot.
    def chatbot_response(user_query):
        return rag_chain.invoke(user_query)

    We can use Streamlit to build the app’s user interface, calling this function from the Streamlit application to get the AI response.

    import openai
    import streamlit as st
    from health_insurance_bot import chatbot_response

    st.title("Health Insurance Chatbot")

    if "messages" not in st.session_state:
        st.session_state["messages"] = [
            {"role": "assistant", "content": "How can I help you?"}
        ]

    for msg in st.session_state.messages:
        st.chat_message(msg["role"]).write(msg["content"])

    if prompt := st.chat_input():
        openai.api_key = st.secrets["openai_api_key"]
        st.session_state.messages.append({"role": "user", "content": prompt})
        st.chat_message(name="user").write(prompt)
        response = chatbot_response(prompt)
        st.session_state.messages.append({"role": "assistant", "content": response})
        st.chat_message(name="assistant").write(response)


    Conclusion

    In our exploration of developing health insurance chatbots, we’ve dived into the innovative world of retrieval-augmented generation (RAG), where advanced technologies are seamlessly combined to reshape user interactions. The adoption of RAG has proven to be a game-changer, significantly enhancing the chatbot’s abilities to understand, retrieve, and generate contextually relevant responses. However, it’s worth mentioning a couple of limitations, including challenges in accurately calculating premium quotes and occasional inaccuracies in semantic searches.