Tag: Big Data

  • Data Engineering: Beyond Big Data

    When a data project comes to mind, the end goal is to enhance the data: building systems that curate it in a way that helps the business.

    At the dawn of their data engineering journey, people tend to familiarize themselves with the terms “extraction,” “transformation,” and “loading.” These terms, along with traditional data engineering, spark the image that data engineering is about the processing and movement of large amounts of data. And why not! We’ve witnessed a tremendous evolution in these technologies, from storing information in simple spreadsheets to managing massive data warehouses and data lakes, supported by advanced infrastructure capable of ingesting and processing huge data volumes.

    However, this doesn’t limit data engineering to ETL; rather, it opens up many opportunities to introduce new technologies and concepts that are needed to support big data processing. The expectations from a modern data system extend well beyond mere data movement. There’s a strong emphasis on privacy, especially with the vast amounts of sensitive data that need protection. Speed is crucial, particularly in real-world scenarios like satellite data processing, financial trading, and data processing in healthcare, where minimizing latency is key.

    With technologies like AI and machine learning driving analysis on massive datasets, data volumes will inevitably continue to grow. We’ve seen this trend before, just as we once spoke of megabytes and now regularly discuss gigabytes. In the future, we’ll likely talk about terabytes and petabytes with the same familiarity.

    These growing expectations have made data engineering a sphere with numerous supporting components, and in this article, we’ll delve into some of those components.

    • Data governance
    • Metadata management
    • Data observability
    • Data quality
    • Orchestration
    • Visualization

    Data Governance

    With huge amounts of confidential business and user data moving around, handling it safely is a delicate process. We must ensure trust in data processes, and the data itself cannot be compromised. It is essential for a business onboarding users to show that their data is in safe hands. Today, when a business needs sensitive information from you, you’re bound to ask questions such as:

    • What if my data is compromised?
    • Is it being put to the right use?
    • Who’s in control of this data? Are the right personnel using it?
    • Is it compliant with the rules and regulations for data practices?

    So, to answer these questions satisfactorily, data governance comes into the picture. The basic idea of data governance is that it’s a set of rules, policies, principles, or processes to maintain data integrity. It’s about how we can supervise our data and keep it safe. Think of data governance as a protective blanket that takes care of all the security risks, creates a habitable environment for data, and builds trust in data processing.

    Data governance is a powerful tool in the data engineering arsenal. These rules and principles are consistently applied throughout all data processing activities. Wherever data flows, data governance ensures that it adheres to the established protocols. By adding a sense of trust to the activities involving data, you gain the freedom to focus on your data solution without worrying about external or internal risks. This helps in reaching the ultimate goal: to foster a culture that prioritizes and emphasizes data responsibility.

    Understanding how extensively data governance applies across data engineering clearly illustrates its significance and where it needs to be implemented in real-world scenarios. In numerous entities, such as government organizations or large corporations, data sensitivity is a top priority. Misuse of this data can have widespread negative impacts. To prevent this, we can use tools that ensure oversight and compliance. Let’s briefly explore one of those tools.

    Microsoft Purview

    Microsoft Purview comes with a range of solutions to protect your data. Let’s look at some of its offerings.

    • Insider risk management
      • Microsoft Purview addresses data security risks from people inside your organization by identifying high-risk individuals.
      • It helps you classify data breaches into different sections and take appropriate action to prevent them.
    • Data loss prevention
      • It makes applying data loss prevention policies straightforward.
      • It secures data by restricting important and sensitive data from being deleted and blocks unusual activities, like sharing sensitive data outside your organization.
    • Compliance adherence
      • Microsoft Purview can help you make sure that your data processes are compliant with data regulatory bodies and organizational standards.
    • Information protection
      • It provides granular control over data, allowing you to define strict accessibility rules.
      • When you need to manage what data can be shared with specific individuals, these controls restrict what is visible to others.
    • Know your sensitive data
      • It simplifies the process of understanding and learning about your data.
      • MS Purview features ML-based classifiers that label and categorize your sensitive data, helping you identify its specific category.

    Metadata Management

    Another essential aspect of big data movement is metadata management. 

    Metadata, simply put, is data about data. This component of data engineering forms the basis for major improvements in data systems.

    You might have come across the WIRED headline about Instagram crashing whenever a celebrity posted, a story that also reappeared recently.

    This story is from about a decade ago, and it tells us about metadata’s longevity and how it became a base for greater things.

    At the time, Instagram showed the number of likes by running a count function on the database and storing it in a cache. This method was fine because the number wouldn’t change frequently, so the request would hit the cache and get the result. Even if the number changed, the request would query the data, and because the number was small, it wouldn’t scan a lot of rows, saving the data system from being overloaded.

    However, when a celebrity posted something, it’d receive so many likes that the count would be enormous and change so frequently that looking into the cache became just an extra step.

    The request would trigger a query that would repeatedly scan many rows in the database, overloading the system and causing frequent crashes.

    To deal with this, Instagram came up with the idea of denormalizing the tables and storing the number of likes for each post. So, the request would result in a query where the database needs to look at only one cell to get the number of likes. To handle the issue of frequent changes in the number of likes, Instagram began updating the value at small intervals. This story tells how Instagram solved this problem with a simple tweak of using metadata. 
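    The same tweak can be sketched in a few lines. This is a toy illustration using SQLite, not Instagram’s actual stack; the table and column names are hypothetical.

```python
import sqlite3

# Toy illustration of the denormalization described above (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE likes (post_id INTEGER, user_id INTEGER);
    CREATE TABLE posts (post_id INTEGER PRIMARY KEY, like_count INTEGER DEFAULT 0);
    INSERT INTO posts (post_id) VALUES (1);
""")
conn.executemany("INSERT INTO likes VALUES (1, ?)", [(u,) for u in range(5000)])

# Naive read: every request scans the likes table.
naive = conn.execute("SELECT COUNT(*) FROM likes WHERE post_id = 1").fetchone()[0]

# Denormalized read: a periodic job refreshes a single stored counter...
conn.execute("""
    UPDATE posts SET like_count =
        (SELECT COUNT(*) FROM likes WHERE post_id = posts.post_id)
""")
# ...so each request reads one cell instead of scanning thousands of rows.
fast = conn.execute("SELECT like_count FROM posts WHERE post_id = 1").fetchone()[0]
print(naive, fast)  # 5000 5000
```

    In production the refresh would run on a short interval, trading a slightly stale count for a database that no longer buckles under celebrity posts.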

    Metadata in data engineering has evolved to solve even more significant problems by adding a layer on top of the data flow that works as an interface to communicate with data. Metadata management has become a foundation of multiple data features such as:

    • Data lineage: Stakeholders are interested in the results we get from data processes. Sometimes, in order to check the authenticity of data and get answers to questions like where the data originated from, we need to track back to the data source. Data lineage is a property that makes use of metadata to help with this scenario. Many data products like Atlan and data warehouses like Snowflake extensively use metadata for their services.
    • Schema information: With a clear understanding of your data’s structure, including column details and data types, we can efficiently troubleshoot and resolve data modeling challenges.
    • Data contracts: Metadata helps honor data contracts by keeping a common data profile, which maintains a common data structure across all data usages.
    • Stats: Managing metadata can help us easily access data statistics while also giving us quick answers to questions like what the total count of a table is, how many distinct records there are, how much space it takes, and many more.
    • Access control: Metadata management also includes having information about data accessibility. As we encountered it in the MS Purview features, we can associate a table with vital information and restrict the visibility of a table or even a column to the right people.
    • Audit: Keeping track of information, like who accessed the data, who modified it, or who deleted it, is another important feature that a product with multiple users can benefit from.
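    To make these features concrete, here is a toy sketch of the metadata a catalog might keep for a single table: schema, stats, column-level access control, and an audit trail. The field names and roles are illustrative, not any particular product’s model.

```python
from datetime import datetime, timezone

catalog = {
    "orders": {
        "schema": {"order_id": "INT", "amount": "DECIMAL(10,2)", "email": "STRING"},
        "stats": {"row_count": 1_204_311, "size_mb": 850},
        "acl": {"email": {"privacy_team"}},  # column -> roles allowed to see it
        "audit": [],                         # access log: (role, table, timestamp)
    }
}

def visible_columns(table, role):
    """Return the columns `role` may see, recording the access for auditing."""
    meta = catalog[table]
    cols = [c for c in meta["schema"]
            if role in meta["acl"].get(c, {role})]  # unrestricted columns pass
    meta["audit"].append((role, table, datetime.now(timezone.utc).isoformat()))
    return cols

print(visible_columns("orders", "analyst"))       # ['order_id', 'amount']
print(visible_columns("orders", "privacy_team"))  # includes 'email' as well
```

    The same record answers stats questions without touching the data itself, and the audit list captures who looked at what, when.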

    There are many other use cases of metadata that enhance data engineering. It’s positively impacting the current landscape and shaping the future trajectory of data engineering. A very good example is a data catalog. Data catalogs focus on enriching datasets with information about data. Table formats, such as Iceberg and Delta, use catalogs to provide integration with multiple data sources, handle schema evolution, etc. Popular cloud services like AWS Glue also use metadata for features like data discovery. Tech giants like Snowflake and Databricks rely heavily on metadata for features like faster querying, time travel, and many more. 

    With the introduction of AI in the data domain, metadata management has a huge effect on the future trajectory of data engineering. Services such as Cortex and Fabric have integrated AI systems that use metadata for easy questioning and answering. When AI gets to know the context of data, the application of metadata becomes limitless.

    Data Observability

    We know how important metadata can be, and while it’s important to know your data, it’s as important to know about the processes working on it. That’s where observability enters the discussion. It is another crucial aspect of data engineering and a component we can’t miss from our data project. 

    Data observability is about setting up systems that can give us visibility over different services that are working on the data. Whether it’s ingestion, processing, or load operations, having visibility into data movement is essential. This not only ensures that these services remain reliable and fully operational, but it also keeps us informed about the ongoing processes. The ultimate goal is to proactively manage and optimize these operations, ensuring efficiency and smooth performance. We need to achieve this goal because, whenever we create a data system, it’s very likely that multiple issues, errors, and bugs will start popping up out of nowhere.

    So, how do we keep an eye on these services to see whether they are performing as expected? The answer to that is setting up monitoring and alerting systems.

    Monitoring

    Monitoring is the continuous tracking and measurement of key metrics and indicators that tell us about the system’s performance. Many cloud services offer comprehensive performance metrics, presented through interactive visuals. These tools provide valuable insights, such as throughput, which measures the volume of data processed per second, and latency, which indicates how long it takes to process the data. They track errors and error rates, detailing the types and how frequently they happen.

    To lay the base for monitoring, there are tools like Prometheus and Datadog, which provide these monitoring features, indicating the performance of data systems and their infrastructure. We also have Graylog, which gives us multiple features to monitor a system’s logs in real time.
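    A framework-free sketch of the metrics such tools track. In a real setup these numbers would be exported to a system like Prometheus rather than aggregated in memory; the class and field names here are made up for illustration.

```python
import statistics

class PipelineMetrics:
    """Aggregate the basic signals a monitoring system watches (illustrative)."""
    def __init__(self):
        self.records = 0      # total records processed (throughput)
        self.latencies = []   # per-batch processing time in seconds
        self.errors = 0       # failed batches

    def observe(self, batch_size, latency_s, failed=False):
        self.records += batch_size
        self.latencies.append(latency_s)
        self.errors += int(failed)

    def snapshot(self):
        return {
            "throughput_records": self.records,
            "p50_latency_s": statistics.median(self.latencies),
            "error_rate": self.errors / len(self.latencies),
        }

m = PipelineMetrics()
m.observe(1000, 0.12)
m.observe(1000, 0.34, failed=True)
m.observe(1000, 0.18)
print(m.snapshot())  # 3000 records, median latency 0.18s, 1 failure in 3 batches
```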

    Now that we have a system that gives us visibility into the performance of processes, we need a setup that can notify us if anything goes sideways.

    Alerting

    Setting up alerting systems allows us to receive notifications directly within the applications we use regularly, eliminating the need for someone to constantly monitor metrics on a UI or watch graphs all day, which would be a waste of time and resources. This is why alerting systems are designed to trigger notifications based on predefined thresholds, such as throughput dropping below a certain level, latency exceeding a specific duration, or the occurrence of specific errors. These alerts can be sent to channels like email or Slack, ensuring that users are immediately aware of any unusual conditions in their data processes.
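    The core of such a system is small: compare current metrics against predefined thresholds and emit a message for each breach. This sketch returns the alerts as a list; in practice they would be pushed to email or a Slack webhook. The rule values and metric names are hypothetical.

```python
# Hypothetical thresholds; real values depend on the pipeline's SLOs.
RULES = [
    ("throughput", lambda v: v < 500,  "throughput below 500 records/s"),
    ("latency_s",  lambda v: v > 2.0,  "latency above 2s"),
    ("error_rate", lambda v: v > 0.01, "error rate above 1%"),
]

def evaluate(metrics):
    """Return a message for every rule whose threshold is breached."""
    return [msg for name, breached, msg in RULES if breached(metrics[name])]

alerts = evaluate({"throughput": 420, "latency_s": 0.8, "error_rate": 0.02})
print(alerts)  # the throughput and error-rate rules fire
```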

    Implementing observability will significantly impact data systems. By setting up monitoring and alerting, we can quickly identify issues as they arise and gain context about the nature of the errors. This insight allows us to pinpoint the source of problems, effectively debug and rectify them, and ultimately reduce downtime and service disruptions, saving valuable time and resources.

    Data Quality

    Knowing the data and its processes is undoubtedly important, but all this knowledge is futile if the data itself is of poor quality. That’s where the other essential component of data engineering, data quality, comes into play because data processing is one thing; preparing the data for processing is another.

    In a data project involving multiple sources and formats, various discrepancies are likely to arise. These can include missing values, where essential data points are absent; outdated data, which no longer reflects current information; poorly formatted data that doesn’t conform to expected standards; incorrect data types that lead to processing errors; and duplicate rows that skew results and analyses. Addressing these issues will ensure the accuracy and reliability of the data used in the project.
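    A minimal sketch of checks for exactly those discrepancies, run over plain Python records; tools such as Great Expectations and Deequ formalize this idea as declarative expectations over datasets. The sample rows and rules are made up.

```python
rows = [
    {"id": 1, "email": "a@example.com", "age": 34},
    {"id": 2, "email": None,            "age": 28},    # missing value
    {"id": 3, "email": "c@example.com", "age": "n/a"}, # wrong type
    {"id": 1, "email": "a@example.com", "age": 34},    # duplicate row
]

def quality_report(rows):
    """Flag missing values, wrong types, and duplicate rows by index."""
    seen, issues = set(), []
    for i, r in enumerate(rows):
        if any(v is None for v in r.values()):
            issues.append((i, "missing value"))
        if not isinstance(r["age"], int):
            issues.append((i, "bad type: age"))
        key = tuple(sorted(r.items(), key=lambda kv: kv[0]))
        if key in seen:
            issues.append((i, "duplicate row"))
        seen.add(key)
    return issues

print(quality_report(rows))  # one issue per bad row, tagged with its index
```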

    Data quality involves enhancing data with key attributes. For instance, accuracy measures how closely the data reflects reality, validity ensures that the data accurately represents what we aim to measure, and completeness guarantees that no critical data is missing. Additionally, attributes like timeliness ensure the data is up to date. Ultimately, data quality is about embedding attributes that build trust in the data. For a deeper dive into this, check out Rita’s blog on Data QA: The Need of the Hour.

    Data quality plays a crucial role in elevating other processes in data engineering. In a data engineering project, there are often multiple entry points for data processing, with data being refined at different stages to achieve a better state each time. Assessing data at the source of each processing stage and addressing issues early on is vital. This approach ensures that data standards are maintained throughout the data flow. As a result, by making data consistent at every step, we gain improved control over the entire data lifecycle. 

    Data tools like Great Expectations and data unit test libraries such as Deequ play a crucial role in safeguarding data pipelines by implementing data quality checks and validations. To gain more context on this, you might want to read Unit Testing Data at Scale using Deequ and Apache Spark by Nishant. These tools ensure that data meets predefined standards, allowing for early detection of issues and maintaining the integrity of data as it moves through the pipeline.

    Orchestration

    With so many processes in place, it’s essential to ensure everything happens at the right time and in the right way. Relying on someone to manually trigger processes at scheduled times every day is an inefficient use of resources. For that individual, performing the same repetitive tasks can quickly become monotonous. Beyond that, manual execution increases the risk of missing schedules or running tasks out of order, disrupting the entire workflow.

    This is where orchestration comes to the rescue, automating tedious, repetitive tasks and ensuring precision in the timing of data flows. Data pipelines can be complex, involving many interconnected components that must work together seamlessly. Orchestration ensures that each component follows a defined set of rules, dictating when to start, what to do, and how to contribute to the overall process of handling data, thus maintaining smooth and efficient operations.

    This automation helps reduce errors that could occur with manual execution, ensuring that data processes remain consistent by streamlining repetitive tasks. With a number of different orchestration tools and services in place, we can now monitor and manage everything from a single platform. Tools like Airflow, an open-source orchestrator, Prefect, which offers a user-friendly drag-and-drop interface, and cloud services such as Azure Data Factory, Google Cloud Composer, and AWS Step Functions, enhance our visibility and control over the entire process lifecycle, making data management more efficient and reliable. Don’t miss Shreyash’s excellent blog on Mage: Your New Go-To Tool for Data Orchestration.

    Orchestration is built on a foundation of multiple concepts and technologies that make it robust and fail-safe. These underlying principles ensure that orchestration not only automates processes but also maintains reliability and resilience, even in complex and demanding data environments.

    • Workflow definition: This defines how tasks in the pipeline are organized and executed. It lays out the sequence of tasks—telling it what needs to be finished before other tasks can start—and takes care of other conditions for pipeline execution. Think of it like a roadmap that guides the flow of tasks.
    • Task scheduling: This determines when and how tasks are executed. Tasks might run at specific times, in response to events, or based on the completion of other tasks. It’s like scheduling appointments for tasks to ensure they happen at the right time and with the right resources.
    • Dependency management: Since tasks often rely on each other, with the concepts of dependency management, we can ensure that tasks run in the correct order. It ensures that each process starts only when its prerequisites are met, like waiting for a green light before proceeding.
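    These concepts can be sketched with the standard library’s graphlib (Python 3.9+): declare each task’s dependencies and let a topological sort produce a valid execution order, the same idea DAG-based orchestrators like Airflow apply at scale. The task names are illustrative.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks it depends on (the workflow definition).
deps = {
    "extract": set(),
    "validate": {"extract"},
    "transform": {"validate"},
    "load": {"transform"},
    "report": {"load"},
}

# Dependency management: a topological sort guarantees every task runs
# only after its prerequisites have finished.
order = list(TopologicalSorter(deps).static_order())
print(order)  # ['extract', 'validate', 'transform', 'load', 'report']
```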

    With these concepts, orchestration tools provide powerful features for workflow design and management, enabling the definition of complex, multi-step processes. They support parallel, sequential, and conditional execution of tasks, allowing for flexibility in how workflows are executed. Not just that, they also offer event-driven and real-time orchestration, enabling systems to respond to dynamic changes and triggers as they occur. These tools also include robust error handling and exception management, ensuring that workflows are resilient and fault-tolerant.

    Visualization

    The true value lies not just in collecting vast amounts of data but in interpreting it in ways that generate real business value. This makes data visualization a vital component, providing a clear and accurate representation of data that decision-makers can easily understand and use. Presenting data in the right way enables businesses to extract intelligence from it, which makes data engineering worth the investment; this is what guides strategic decisions, optimizes operations, and empowers innovation.

    Visualizations allow us to see patterns, trends, and anomalies that might not be apparent in raw data. Whether it’s spotting a sudden drop in sales, detecting anomalies in customer behavior, or forecasting future performance, data visualization can provide the clear context needed to make well-informed decisions. When numbers and graphs are presented effectively, it feels as though we are directly communicating with the data, and this language of communication bridges the gap between technical experts and business leaders.

    Visualization Within ETL Processes

    Visualization isn’t just a final output. It can also be a valuable tool within the data engineering process itself. Intermediate visualization during the ETL workflow can be a game-changer. In collaborative teams, as we go through the transformation process, visualizing it at various stages helps ensure the accuracy and relevance of the result. We can understand the datasets better, identify issues or anomalies between different stages, and make more informed decisions about the transformations needed.

    Technologies like Fabric and Mage enable seamless integration of visualizations into ETL pipelines. These tools empower team members at all levels to actively engage with data, ask insightful questions, and contribute to the decision-making process. Visualizing datasets at key points provides the flexibility to verify that data is being processed correctly, develop accurate analytical formulas, and ensure that the final outputs are meaningful.

    Depending on the industry and domain, there are various visualization tools suited to different use cases. For example, 

    • For real-time insights, which are crucial in industries like healthcare, financial trading, and air travel, tools such as Tableau and Striim are invaluable. These tools allow for immediate visualization of live data, enabling quick and informed decision-making.
    • For broad data source integrations and dynamic dashboard querying, often demanded in the technology sector, tools like Power BI, Metabase, and Grafana are highly effective. These platforms support a wide range of data sources and offer flexible, interactive dashboards that facilitate deep analysis and exploration of data.

    It’s Limitless

    We are seeing many advancements in this domain, which are helping businesses, data science, AI and ML, and many other sectors because the potential of data is huge. If a business knows how to use data, it can be a major factor in its success. And for that reason, we have constantly seen the rise of different components in data engineering. All with one goal: to make data useful.

    Recently, we’ve witnessed the introduction of numerous technologies poised to revolutionize the data engineering domain. Concepts like data mesh are enhancing data discovery, improving data ownership, and streamlining data workflows. AI-driven data engineering is rapidly advancing, with expectations to automate key processes such as data cleansing, pipeline optimization, and data validation. We’re already seeing how cloud data services have evolved to embrace AI and machine learning, ensuring seamless integration with data initiatives. The rise of real-time data processing brings new use cases and advancements, while practices like DataOps foster better collaboration among teams. Take a closer look at the modern data stack in Shivam’s detailed article, Modern Data Stack: The What, Why, and How?

    These developments are accompanied by a wide array of technologies designed to support infrastructure, analytics, AI, and machine learning, alongside enterprise tools that lay the foundation for this ongoing evolution. All these elements collectively set the stage for a broader discussion on data engineering and what lies beyond big data. Big data, supported by these satellite activities, aims to extract maximum value from data, unlocking its full potential.

    References:

    1. Velotio – Data Engineering Blogs
    2. Firstmark
    3. MS Purview Data Security
    4. Tech Target – Article on data quality
    5. Splunk – Data Observability: The Complete Introduction
    6. Instagram crash story – WIRED

  • How Healthcare Payers Can Leverage Speech Analytics to Generate Value

    Speech Analytics:

    The world has entered an unprecedented age of information and technology, in which developing a robust patient experience roadmap has become indispensable. Payers are being incentivized to develop industry-leading skills and strategies that keep pace with changing patient needs and expectations. To establish a strong footprint in the market, healthcare organizations must record and monitor patient interactions across the full breadth of channels. Equally, organizations need to ensure strict adherence to privacy laws to curb fraudulent attempts and operate efficiently. Right now, the focus should be on whether plan enrollees are getting meaningful and swift access to the services they are seeking.

    Key learnings from the whitepaper:
    • Speech Analytics Solution vs. Traditional Call Drivers Evaluating Method
    • Speech Analytics: A Booming Technology
    • How Organizations are Leveraging this Opportunity to Maximize Value
    • A Connotation to ‘WHY’ Makes a Big Difference
    • How R Systems’ Anagram Cuts the Mustard
  • Unlocking the Potential of Knowledge Graphs: Exploring Graph Databases

    There is a growing demand for data-driven insights to help businesses make better decisions and stay competitive. To meet this need, organizations are turning to knowledge graphs as a way to access and analyze complex data sets. In this blog post, I will discuss what knowledge graphs are, what graph databases are, how they differ from hierarchical databases, the benefits of graphical representation of data, and more. Lastly, we’ll discuss some of the challenges of graph databases and how they can be overcome.

    What Is a Knowledge Graph?

    A knowledge graph is a visual representation of data or knowledge. In order to make the relationships between various types of facts and data easy to see and understand, facts and data are organized into a graph structure. A knowledge graph typically consists of nodes, which stand in for entities like people or objects, and edges, which stand in for the relationships among these entities.

    Each node in a knowledge graph has characteristics and attributes that describe it. For instance, the node of a person might contain properties like name, age, and occupation. Edges between nodes reveal information about their connections. This makes knowledge graphs a powerful tool for representing and understanding data.
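    That structure is simple enough to sketch in a few lines: nodes carry descriptive properties, and edges carry a labeled relationship. The entities here are made up for illustration.

```python
# Nodes carry descriptive properties; edges carry a relationship label.
nodes = {
    "alice": {"type": "Person", "age": 34, "occupation": "engineer"},
    "acme":  {"type": "Company", "industry": "manufacturing"},
}
edges = [("alice", "works_at", "acme")]

def relations_of(node):
    """All outgoing (relationship, target) pairs for a node."""
    return [(rel, dst) for src, rel, dst in edges if src == node]

print(nodes["alice"]["occupation"])  # engineer
print(relations_of("alice"))         # [('works_at', 'acme')]
```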

    Benefits of a Knowledge Graph

    There are a number of benefits to using knowledge graphs. 

    • Knowledge graphs (KGs) provide a visual representation of data that can be easily understood. This makes it easier to quickly identify patterns and correlations.
    • Additionally, knowledge graphs make it simple to locate linked data by allowing us to quickly access a particular node and obtain all of its related information.
    • These graphs are highly scalable, meaning they can support huge volumes of data. This makes them ideal for applications such as artificial intelligence (AI) and machine learning (ML).
    • Finally, knowledge graphs can be used to connect various types of data, including text, images, and videos, in addition to plain text. This makes them a great tool for data mining and analysis.

    What Are Graph Databases?

    Graph databases are used to store and manage data in the form of a graph. Unlike traditional databases, they offer a more flexible representation of data using nodes, edges, and properties. Graph databases are designed to support queries that require traversing relationships between different types of data.

    Graph databases are well-suited for applications that require complex data relationships, such as AI and ML. They are also more efficient than traditional databases in queries that involve intricate data relationships, as they can quickly process data without having to make multiple queries.

    Source: TechCrunch

    Comparing Graph Databases to Hierarchical Databases

    It is important to understand the differences between graph databases and hierarchical databases. But first, what is a hierarchical database? Hierarchical databases are structured in a tree-like form, with each record in the database linked to one or more other records. This structure makes hierarchical databases ideal for storing data that is organized hierarchically, such as an organizational chart. However, they are less efficient at handling complex data relationships. To illustrate, suppose we have an organization with a CEO at the top, followed by several vice presidents, who are in turn responsible for several managers, who are responsible for teams of employees.

     

    In a hierarchical database, this structure would be represented as a tree, with the CEO at the root and each level of the organization represented by a different level of the tree. For example:

    CEO
    ├── Vice President A
    │   └── Manager A1
    │       └── Employee A1.1
    └── Vice President B
        └── Manager B1
            ├── Employee B1.1
            └── Employee B1.2

    In a graph database, this same structure would be represented as a graph, with each node representing an entity (e.g., a person), and each edge representing a relationship (e.g., reporting to). For example:

    (Vice President A) --reports_to--> (CEO)

    (Vice President B) --reports_to--> (CEO)

    (Vice President A) --manages--> (Manager A1)

    (Vice President B) --manages--> (Manager B1)

    (Manager A1) --manages--> (Employee A1.1)

    (Manager B1) --manages--> (Employee B1.1)

    (Manager B1) --manages--> (Employee B1.2)

     

    As you can see, in a graph database, the relationships between entities are explicit and can be easily queried and traversed. In a hierarchical database, the relationships are implicit and become more difficult to work with as the hierarchy grows more complex. This flexibility to easily store and query relationship-rich data is why graph databases are better suited for complex data relationships.
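    The explicitness of those edges is easy to demonstrate. Below is a sketch of the same org chart as a plain edge list, with a recursive traversal answering “who sits below a given manager?”, the kind of query graph databases optimize.

```python
# The org chart above as an edge list: (source, relationship, target).
edges = [
    ("Vice President A", "reports_to", "CEO"),
    ("Vice President B", "reports_to", "CEO"),
    ("Vice President A", "manages", "Manager A1"),
    ("Vice President B", "manages", "Manager B1"),
    ("Manager A1", "manages", "Employee A1.1"),
    ("Manager B1", "manages", "Employee B1.1"),
    ("Manager B1", "manages", "Employee B1.2"),
]

def subordinates(boss):
    """Everyone below `boss`, found by recursively following `manages` edges."""
    direct = [dst for src, rel, dst in edges if src == boss and rel == "manages"]
    result = []
    for d in direct:
        result.append(d)
        result.extend(subordinates(d))
    return result

print(subordinates("Vice President B"))
# ['Manager B1', 'Employee B1.1', 'Employee B1.2']
```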

    Creating a Knowledge Graph from Scratch

    We will now walk through creating a knowledge graph using the example below. We’ll use a simple XML file that contains information about some movies, and an XSLT stylesheet to transform the XML data into RDF format, along with some Python libraries to help us in the overall process.

    Let’s consider an XML file having movie information:

    <movies>
      <movie id="tt0083658">
        <title>Blade Runner</title>
        <year>1982</year>
        <director rid="12341">Ridley Scott</director>
        <genre>Action</genre>
      </movie>
    
      <movie id="tt0087469">
        <title>Top Gun</title>
        <year>1986</year>
        <director rid="65217">Tony Scott</director>
        <genre>Thriller</genre>
      </movie>
    </movies>

    As discussed, to convert this data into a knowledge graph, we will be using an XSL file. Now, a question may arise: what is an XSL file? XSL files are stylesheet documents that are used to transform XML data. To explore more on XSL, visit here, but don’t worry, as we will be starting from scratch.

    Moving ahead, we also need to know that to convert any data into graph data, we need to use an ontology; there are many ontologies available, like the OWL ontology or the EBUCore ontology. But what is an ontology? In the context of knowledge graphs, an ontology is a formal specification of the relationships and constraints that exist within a specific domain. It provides a vocabulary and a set of rules for representing and sharing knowledge, allowing machines to reason about the data they are working with. EBUCore is an ontology developed by the European Broadcasting Union (EBU) to provide a standardized metadata model for the broadcasting industry (OTT platforms, media companies, etc.). Further references on EBUCore can be found here.

    We will be using the below XSL for transforming the above XML with movie info.

    <?xml version="1.0"?>
    <xsl:stylesheet version="1.0"
                    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
                    xmlns:ebucore="http://www.ebu.ch/metadata/ontologies/ebucore/ebucore#">
        <xsl:template match="movies">
            <rdf:RDF>
                <xsl:apply-templates select="movie"/>
            </rdf:RDF>
        </xsl:template>
    
        <xsl:template match="movie">
            <ebucore:Feature>
                <xsl:apply-templates select="title | year | director | genre"/>
            </ebucore:Feature>
        </xsl:template>
    
        <xsl:template match="title">
            <ebucore:title>
                <xsl:value-of select="."/>
            </ebucore:title>
        </xsl:template>
    
        <xsl:template match="year">
            <ebucore:dateBroadcast>
                <xsl:value-of select="."/>
            </ebucore:dateBroadcast>
        </xsl:template>
    
        <xsl:template match="director">
            <ebucore:hasParticipatingAgent>
                <ebucore:Agent>
                    <ebucore:hasRole>Director</ebucore:hasRole>
                    <ebucore:agentName>
                        <xsl:value-of select="."/>
                    </ebucore:agentName>
                </ebucore:Agent>
            </ebucore:hasParticipatingAgent>
        </xsl:template>
    
        <xsl:template match="genre">
            <ebucore:hasGenre>
                <xsl:value-of select="."/>
            </ebucore:hasGenre>
        </xsl:template>
    </xsl:stylesheet>

    To start with the XSL: the first line, <?xml version="1.0"?>, declares the XML version of the document. The second line opens the stylesheet, stating the XSLT version we will be using and declaring the xsl, rdf, and ebucore namespaces. These namespaces are required because we will be using elements from those vocabularies, and they avoid name conflicts in our XML document. A template's match attribute defines which element in the XML it applies to; since <movies> is the root element of our XML, we begin with xsl:template match="movies". 

    Inside that template, we open an rdf:RDF tag to start our knowledge graph. This element will contain all the movie details, so we call xsl:apply-templates on "movie", as our XML has multiple <movie> elements nested inside the <movies> tag. A second template, matching movie elements, wraps each movie in an <ebucore:Feature> element ("Feature" being the EBUCore term for a movie) and applies the remaining templates to its children. Those templates map details like title, year, director, and genre from the XML onto their corresponding EBUCore properties: ebucore:title, ebucore:dateBroadcast, ebucore:hasParticipatingAgent, and ebucore:hasGenre respectively. 

    Now that we have the XSL ready, we need to apply it to our XML and get RDF data out of it, using the Python code below:

    import lxml.etree as ET
    import xml.dom.minidom as xm
    
    movie_data = """
                <movies>
                    <movie id="tt0083658">
                        <title>Blade Runner</title>
                        <year>1982</year>
                        <director rid="12341">Ridley Scott</director>
                        <genre>Action</genre>
                    </movie>
                        <movie id="tt0087469">
                        <title>Top Gun</title>
                        <year>1986</year>
                        <director rid="65217">Tony Hank</director>
                        <genre>Thriller</genre>
                    </movie>
                </movies>
                """
    
    xslt_file = "transform_movies.xsl"
    xslt_root = ET.parse(xslt_file)
    transform = ET.XSLT(xslt_root)
    movie_root = ET.fromstring(movie_data)
    result_tree = transform(movie_root)
    result_string = ET.tostring(result_tree)
    
    # Converting bytes to string and pretty formatting
    dom_transformed = xm.parseString(result_string)
    pretty_xml_as_string = dom_transformed.toprettyxml()
    
    # Saving the output
    with open('Path_to_downloads/output_movie.xml', "w") as f:
        f.write(pretty_xml_as_string)
    # Print Output
    print(pretty_xml_as_string)

    The above code will generate the following output:

    <?xml version="1.0" ?>
    <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:ebucore="http://www.ebu.ch/metadata/ontologies/ebucore/ebucore#">
    
    <ebucore:Feature>
       <ebucore:title>Blade Runner</ebucore:title>
       <ebucore:dateBroadcast>1982</ebucore:dateBroadcast>
       <ebucore:hasParticipatingAgent>
           <ebucore:Agent>
               <ebucore:hasRole>Director</ebucore:hasRole>
               <ebucore:agentName>Ridley Scott</ebucore:agentName>
           </ebucore:Agent>
       </ebucore:hasParticipatingAgent>
       <ebucore:hasGenre>Action</ebucore:hasGenre>
    </ebucore:Feature>
    
    <ebucore:Feature>
       <ebucore:title>Top Gun</ebucore:title>
       <ebucore:dateBroadcast>1986</ebucore:dateBroadcast>
       <ebucore:hasParticipatingAgent>
           <ebucore:Agent>
               <ebucore:hasRole>Director</ebucore:hasRole>
               <ebucore:agentName>Tony Hank</ebucore:agentName>
           </ebucore:Agent>
       </ebucore:hasParticipatingAgent>
       <ebucore:hasGenre>Thriller</ebucore:hasGenre>
    </ebucore:Feature>
    
    </rdf:RDF>

    This output is RDF/XML, which we will now load into a graph and visualize using the following code:

    Note: Install the following library before proceeding.

    pip install graphviz rdflib

    from rdflib import Graph, Namespace
    from rdflib.tools.rdf2dot import rdf2dot
    from graphviz import render
    
    # Creating an empty graph and parsing the RDF/XML string into it
    graph = Graph()
    graph.parse(data=result_string, format="xml")
    # Saving the graph as Turtle (Terse RDF Triple Language) in movies.ttl
    graph.serialize(destination="Downloads/movies.ttl", format="ttl")
    
    # Steps to visualize the generated graph
    # Define a namespace for the RDF data
    ns = Namespace("http://example.com/movies#")
    graph.bind("ex", ns)
    
    # Serialize the RDF data to a DOT file
    with open("Downloads/movies.dot", "w") as dot_file:
        rdf2dot(graph, dot_file)
    
    # Render the DOT file to a PNG image
    render("dot", "png", "Downloads/movies.dot")

    Finally, the above code will yield a movies.dot.png file in the Downloads folder, which will look something like this:

    This clearly represents the entities as nodes and their relationships as edges, with all the information laid out in a well-formatted way.

    Examples of Knowledge Graphs

    Now that we know how to create a knowledge graph, let's explore some of the big players that use such graphs in their operations.

    Google Knowledge Graph: This is one of the most well-known examples of a knowledge graph. Google uses it to enhance its search results with additional information about entities, such as people, places, and things. For example, if you search for “Barack Obama,” the Knowledge Graph will display a panel with information about his birthdate, family members, education, career, and more. All this information is stored in the form of nodes and edges, making it easier for the Google search engine to retrieve information related to any topic.

    DBpedia: This is a community-driven project that extracts structured data from Wikipedia and makes it available as a linked data resource. It is primarily used for graph analysis and executing SPARQL queries. It contains information on millions of entities, such as people, places, and things, and their relationships with one another. DBpedia can be used to power applications like question-answering systems, recommendation engines, and more. One of the key advantages of DBpedia is that it is an open and community-driven project, which means that anyone can contribute to it and use it for their own applications. This has led to a wide variety of applications built on top of DBpedia, from academic research to commercial products.

    Having seen these examples, one should know that such systems typically rely on SPARQL, the standard query language for RDF, to retrieve data from their huge corpora of graph data (DBpedia, for instance, exposes a public SPARQL endpoint). So, let's write one such query to retrieve data from the knowledge graph we created for the movie data: a query fetching every movie's genre along with its title.

    from rdflib.plugins.sparql import prepareQuery
    
    # Define the SPARQL query
    query = prepareQuery('''
        PREFIX ebucore: <http://www.ebu.ch/metadata/ontologies/ebucore/ebucore#>
        SELECT ?genre ?title
        WHERE {
          ?movie a ebucore:Feature ;
                 ebucore:hasGenre ?genre ;
                 ebucore:title ?title .
        }
    ''')
    
    
    # Execute the query and print the results
    results = graph.query(query)
    for row in results:
        genre, title = row
        print(f"Movie Genre: {genre}, Movie Title: {title}")

    Challenges with Graph Databases:

    Data Complexity: One of the primary challenges with graph databases and knowledge graphs is data complexity. As the size and complexity of the data increase, it can become challenging to manage and query the data efficiently.

    Data Integration: Graph databases and knowledge graphs often need to integrate data from different sources, which can be challenging due to differences in data format, schema, and structure.

    Query Performance: Knowledge graphs are often used for complex queries, which can be slow to execute, especially for large datasets.

    Knowledge Representation: Representing knowledge in a graph database or knowledge graph can be challenging due to the diversity of concepts and relationships that need to be modeled accurately. One should have experience with ontologies, relationships, and business use cases to curate an accurate representation.

    Bonus: How to Overcome These Challenges:

    • Use efficient indexing and query optimization techniques to handle data complexity and improve query performance.
    • Use data integration tools and techniques to standardize data formats and structures to improve data integration.
    • Use distributed computing and partitioning techniques to scale the database horizontally.
    • Use caching and precomputing techniques to speed up queries.
    • Use ontology modeling and semantic reasoning techniques to accurately represent knowledge and relationships in the graph database or knowledge graph.
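    As a small illustration of the caching and precomputing point, repeated traversals over a graph can be memoized so each node's result is computed only once. A sketch over a plain adjacency dict (the graph data here is illustrative):

    ```python
    from functools import lru_cache

    # A small directed graph as an adjacency dict (illustrative data).
    GRAPH = {
        "a": ("b", "c"),
        "b": ("d",),
        "c": ("d",),
        "d": (),
    }

    @lru_cache(maxsize=None)
    def reachable(node):
        """Set of nodes reachable from `node`; cached after first computation."""
        out = set()
        for nxt in GRAPH[node]:
            out.add(nxt)
            out |= reachable(nxt)
        return frozenset(out)

    print(sorted(reachable("a")))
    ```

    Here `reachable("d")` is computed once and served from the cache on the second request; a production graph database applies the same idea at larger scale with query-result caches and precomputed indexes.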

    Conclusion

    In conclusion, graph databases and knowledge graphs are powerful tools that offer several advantages over traditional relational databases. They enable flexible modeling of complex data and relationships, which can be difficult to achieve using a traditional tabular structure. Moreover, they enhance query performance for complex queries and enable new use cases such as recommendation engines, fraud detection, and knowledge management.

    Despite the aforementioned challenges, graph databases and knowledge graphs are gaining popularity in various industries, ranging from finance to healthcare, and are expected to continue playing a significant role in the future of data management and analysis.

  • A Beginner’s Guide to Edge Computing

    In a world of data centers with wings and wheels, there is an opportunity to offload some work from centralized cloud computing by moving less compute-intensive tasks to other components of the architecture. In this blog, we will explore the upcoming frontier of the web: Edge Computing.

    What is the “Edge”?

    The ‘Edge’ refers to computing infrastructure located closer to the source of data. It is a distributed framework where data is processed as close to the originating data source as possible. This infrastructure requires effective use of resources that may not be continuously connected to a network, such as laptops, smartphones, tablets, and sensors. Edge Computing covers a wide range of technologies, including wireless sensor networks; cooperative distributed peer-to-peer ad-hoc networking and processing (also classifiable as local cloud/fog computing); mobile edge computing; distributed data storage and retrieval; autonomic self-healing networks; remote cloud services; augmented reality; and more.

    Cloud Computing is expected to go through a phase of decentralization, and Edge Computing embodies this shift by bringing compute, storage, and networking closer to the consumer.

    But Why?

    Legit question! Why do we even need Edge Computing? What are the advantages of having this new infrastructure?

    Imagine a self-driving car sending a continuous live stream to central servers. Now the car has to make a crucial decision, and the consequences can be disastrous if it waits for the central servers to process the data and respond. Although algorithms like YOLOv2 have sped up object detection, the latency lies in the part of the system where the car has to send terabytes to the central server, receive the response, and only then act. Hence, we need basic processing, like deciding when to stop or decelerate, to be done in the car itself.

    The goal of Edge Computing is to minimize latency by bringing public cloud capabilities to the edge. This can be achieved in two forms: a custom software stack emulating the cloud services running on existing hardware, or the public cloud seamlessly extended to multiple point-of-presence (PoP) locations.

    Following are some promising reasons to use Edge Computing:

    1. Privacy: Avoid sending all raw data to be stored and processed on cloud servers.
    2. Real-time responsiveness: Sometimes the reaction time can be a critical factor.
    3. Reliability: The system is capable of working even when disconnected from cloud servers, removing a single point of failure.

    To understand the points mentioned above, let's take the example of a device that responds to a hot keyword, such as Jarvis from Iron Man. Imagine if your personal Jarvis sent all of your private conversations to a remote server for analysis. Instead, it is intelligent enough to respond locally when called upon, while remaining real-time and reliable.

    Intel CEO Brian Krzanich said at an event that autonomous cars will generate 40 terabytes of data for every eight hours of driving. With that flood of data, transmission time goes up substantially. For self-driving cars, real-time or near-instant decisions are an essential need, and this is where edge computing infrastructure comes to the rescue: these cars need to decide in a split second whether to stop or not, else the consequences can be disastrous.

    Another example is drones or quadcopters. Say we are using them to identify people or deliver relief packages; the machines should then be intelligent enough to make basic decisions locally, such as changing their path to avoid obstacles.

    Forms of Edge Computing

    Device Edge:

    In this model, Edge Computing capabilities are taken to customers in their existing environments, for example via AWS Greengrass and Microsoft Azure IoT Edge.

    Cloud Edge:

    This model of Edge Computing is essentially an extension of the public cloud. Content Delivery Networks are classic examples of this topology, in which static content is cached and delivered through geographically spread edge locations.

    Vapor IO is an emerging player in this category, attempting to build infrastructure for the cloud edge. Vapor IO has various products, like the Vapor Chamber. These are self-monitored: they have sensors embedded in them through which they are continuously monitored and evaluated by Vapor's software, the Vapor Edge Controller (VEC). They have also built OpenDCRE, which we will see later in this blog.

    The fundamental difference between the device edge and the cloud edge lies in the deployment and pricing models, and each is suited to different use cases. Sometimes, it may be an advantage to deploy both models.

    Edges around you

    Edge Computing examples can be increasingly found around us:

    1. Smart street lights
    2. Automated Industrial Machines
    3. Mobile devices
    4. Smart Homes
    5. Automated Vehicles (cars, drones etc)

    Data transmission is expensive. By bringing compute closer to the origin of data, latency is reduced and end users get a better experience. Some of the evolving use cases of Edge Computing are Augmented Reality (AR), Virtual Reality (VR), and the Internet of Things. For example, the rush people got while playing an augmented-reality-based Pokémon game wouldn't have been possible if “real-timeliness” were not present in the game; it was made possible because the smartphone itself was doing the AR, not the central servers. Even Machine Learning (ML) can benefit greatly from Edge Computing: all the heavy-duty training of ML algorithms can be done in the cloud, and the trained model can be deployed on the edge for near-real-time or even real-time predictions. In today's data-driven world, edge computing is becoming a necessary component.

    There is a lot of confusion between Edge Computing and IoT. Stated simply, Edge Computing is, in a way, the intelligent Internet of Things (IoT); it complements traditional IoT. In the traditional model of IoT, all the devices (sensors, mobiles, laptops, etc.) are connected to a central server. Now imagine giving your lamp the command to switch off: for such a simple task, data needs to be transmitted to the cloud and analyzed there before the lamp receives the command to switch off. Edge Computing brings the computation closer to your home, so that either the fog layer between the lamp and the cloud servers is smart enough to process the data, or the lamp itself is.

    If we look at the image below, it shows a standard IoT implementation where everything is centralized, while the Edge Computing philosophy talks about decentralizing the architecture.

    The Fog  

    Sandwiched between the edge layer and the cloud layer is the Fog Layer, which bridges the other two.

    The difference between fog and edge computing is described in this article:

    • Fog Computing: pushes intelligence down to the local area network level of the architecture, processing data in a fog node or IoT gateway.
    • Edge Computing: pushes the intelligence, processing power, and communication capabilities of an edge gateway or appliance directly into devices like programmable automation controllers (PACs).

    How do we manage Edge Computing?

    Device Relationship Management (DRM) refers to managing and monitoring interconnected components over the internet. AWS offers IoT Core and Greengrass, Nebbiolo Technologies has developed the Fog Node and Fog OS, and Vapor IO has OpenDCRE, with which one can control and monitor data centers.

    The following image (source: AWS) shows how to manage ML on the edge using AWS infrastructure.

    AWS Greengrass makes it possible for users to use Lambda functions to build IoT devices and application logic. Specifically, AWS Greengrass provides cloud-based management of applications that can be deployed for local execution. Locally deployed Lambda functions are triggered by local events, messages from the cloud, or other sources.
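    As a rough sketch of that programming model, a handler follows Lambda's `handler(event, context)` shape and reacts to a local event. The `publish` function below is only a stand-in for the Greengrass SDK's IoT messaging client, so this sketch runs without any AWS dependency:

    ```python
    import json

    def publish(topic, payload):
        # Stand-in for the real SDK publish call; here we just print the message.
        print(f"[{topic}] {payload}")

    def function_handler(event, context):
        """Triggered locally, e.g. by a sensor reading delivered as `event`."""
        if event.get("temperature_c", 0) > 30:
            publish("alerts/overheat", json.dumps(event))
            return {"action": "alert"}
        return {"action": "ignore"}

    print(function_handler({"temperature_c": 35}, None))
    ```

    The decision (alert vs. ignore) is made on the device; only the events worth acting on ever leave it.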

    This GitHub repo demonstrates a traffic light example using two Greengrass devices, a light controller, and a traffic light.

    Conclusion

    We believe that next-gen computing will be influenced a lot by Edge Computing and will continue to explore new use-cases that will be made possible by the Edge.


  • Lessons Learnt While Building an ETL Pipeline for MongoDB & Amazon Redshift Using Apache Airflow

    Recently, I was involved in building an ETL (Extract-Transform-Load) pipeline. It involved extracting data from MongoDB collections, performing transformations, and then loading the data into Redshift tables. Many ETL solutions on the market more or less solve this problem, but the key part of an ETL process lies in its ability to transform or process raw data before it is pushed to its destination.

    Each ETL pipeline comes with specific business requirements around processing data that are hard to achieve using off-the-shelf ETL solutions. This is why the majority of ETL solutions are custom built, from scratch. In this blog, I am going to talk about my learnings from building a custom ETL solution that moves data from MongoDB to Redshift using Apache Airflow.

    Background:

    I began by writing a Python-based command line tool which supported different phases of ETL: extracting data from MongoDB, processing the extracted data locally, uploading the processed data to S3, loading data from S3 into Redshift, and post-processing and cleanup. I used the PyMongo library to interact with MongoDB and the Boto library to interact with Redshift and S3.

    I kept each operation atomic so that multiple instances of each operation could run independently of one another, which helps achieve parallelism. One of the major challenges was achieving parallelism while running the ETL tasks. One option was to develop our own framework based on threads, or to build a distributed task scheduler using a task queue like Celery backed by a message broker like RabbitMQ. After doing some research, I settled on Apache Airflow. Airflow is a Python-based scheduler where you define DAGs (Directed Acyclic Graphs) that run on a given schedule and execute tasks in parallel in each phase of your ETL. You define a DAG as Python code, and Airflow also lets you manage the state of your DAG runs using environment variables. Features like task retries on failure are a plus.

    We faced several challenges while getting the above ETL workflow to be near real-time and fault tolerant. We discuss the challenges faced and the solutions below:

    Keeping your ETL code changes in sync with Redshift schema

    While you are building the ETL tool, you may end up fetching a new field from MongoDB; at the same time, you have to add that column to the corresponding Redshift table. If you fail to do so, the ETL pipeline will start failing. To tackle this, I created a database migration tool which became the first step in my ETL workflow.

    The migration tool would:

    • keep the migration status in a Redshift table, and
    • track all migration scripts in a code directory.

    In each ETL run, it gets the most recently run migrations from Redshift and searches for any new migration script in the code directory. If one is found, it runs the newly found migration script, after which the regular ETL tasks run. This puts the onus on developers to add a migration script whenever they make a change, such as adding or removing a field fetched from MongoDB.
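    The core of that check can be sketched in a few lines (file names and layout are illustrative; the real tool reads the applied set from a Redshift status table rather than a Python set):

    ```python
    import os
    import tempfile

    def pending_migrations(script_dir, applied):
        """Scripts on disk not yet recorded as applied, in lexicographic order."""
        available = sorted(f for f in os.listdir(script_dir) if f.endswith(".sql"))
        return [f for f in available if f not in applied]

    # Illustrative run: two scripts exist on disk, one is already applied.
    with tempfile.TemporaryDirectory() as d:
        for name in ("001_init.sql", "002_add_email_col.sql"):
            open(os.path.join(d, name), "w").close()
        applied = {"001_init.sql"}  # would come from the Redshift status table
        todo = pending_migrations(d, applied)
    print(todo)
    ```

    Sorting by file name keeps migrations running in the order they were written.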

    Maintaining data consistency

    While extracting data from MongoDB, one needs to ensure all the collections are extracted at a specific point in time, else there can be data inconsistency issues. We need to solve this problem at multiple levels:

    • While extracting data from MongoDB, pick a cutoff such as the modified date, and extract data from the different collections with a filter selecting records whose modified date is less than or equal to that cutoff. This ensures you fetch point-in-time data from MongoDB.
    • While loading data into Redshift, don't load directly into the master table; instead, load into a staging table. Once you are done loading the staging tables for all related collections, load master from staging within a single transaction. This way, data is either updated in all related tables or in none of them.
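    The staging-then-promote step can be sketched with sqlite3 standing in for Redshift (table names are illustrative). Because all the statements run in one transaction, a failure leaves the master table untouched:

    ```python
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE movies_staging (id TEXT, title TEXT);
        CREATE TABLE movies (id TEXT, title TEXT);
    """)
    conn.execute("INSERT INTO movies_staging VALUES ('tt0083658', 'Blade Runner')")

    # Promote staging to master atomically: all statements commit or none do.
    with conn:  # the connection as a context manager wraps one transaction
        conn.execute("DELETE FROM movies")
        conn.execute("INSERT INTO movies SELECT * FROM movies_staging")
        conn.execute("DELETE FROM movies_staging")

    print(conn.execute("SELECT COUNT(*) FROM movies").fetchone()[0])
    ```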

    A single bad record can break your ETL

    While moving data across the ETL pipeline into Redshift, one needs to take care of field formats. For example, the date field in the incoming data can be formatted differently than in the Redshift schema. Another example: the incoming data can exceed the length of a field in the schema. Redshift's COPY command, which is used to load data from files into Redshift tables, is very vulnerable to such variations in data. Even a single incorrectly formatted record can lead to all your data getting rejected, effectively breaking the ETL pipeline.

    There are multiple ways to solve this problem: either handle it in one of the transform jobs in the pipeline, or put the onus on Redshift to handle these variances. Redshift's COPY command has many options that can help solve these problems. Some of the most useful options are:

    • ACCEPTANYDATE: Allows any date format, including invalid formats such as 00/00/00 00:00:00, to be loaded without generating an error.
    • ACCEPTINVCHARS: Enables loading of data into VARCHAR columns even if the data contains invalid UTF-8 characters.
    • TRUNCATECOLUMNS: Truncates data in columns to the appropriate number of characters so that it fits the column specification.
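    Put together, a COPY statement using these options might look like the sketch below (the table, bucket, and IAM role names are placeholders, not real resources):

    ```python
    def build_copy(table, s3_prefix, iam_role,
                   options=("GZIP", "ACCEPTANYDATE", "ACCEPTINVCHARS",
                            "TRUNCATECOLUMNS")):
        """Assemble a Redshift COPY statement with tolerant load options."""
        return (f"COPY {table} FROM '{s3_prefix}' "
                f"IAM_ROLE '{iam_role}' " + " ".join(options) + ";")

    print(build_copy("public.movies", "s3://my-bucket/movies/part_",
                     "arn:aws:iam::123456789012:role/RedshiftCopyRole"))
    ```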

    Redshift going out of storage

    Redshift is based on PostgreSQL, and one common problem is that deleting records from Redshift tables does not actually free up space. So if your ETL process deletes and creates new records frequently, you may run out of Redshift storage space. Redshift's VACUUM operation is the solution to this problem. Instead of making VACUUM part of your main ETL flow, define a separate workflow on its own schedule to run it. VACUUM reclaims space and re-sorts rows in either a specified table or all tables in the current database, and it can run as FULL, SORT ONLY, DELETE ONLY, or REINDEX. More information on VACUUM can be found here.

    ETL instance going out of storage

    Your ETL will generate a lot of files by extracting data from MongoDB onto your ETL instance. It is very important to delete those files periodically; otherwise, you are very likely to run out of storage on your ETL server. If your MongoDB data is huge, you might end up creating large files on the ETL server. Again, I would recommend defining a separate workflow on its own schedule to run the cleanup.

    Making ETL Near Real Time

    Processing only the delta rather than doing a full load in each ETL run

    ETL is faster if you keep track of already-processed data and process only the new data. If you do a full load in each ETL run, the solution will not scale as your data grows. As a solution, we made it mandatory for every collection in our MongoDB to have a created and a modified date. Our ETL checks the maximum value of the modified date for a given collection in the Redshift table, then generates a filter query to fetch only those records from MongoDB whose modified date is greater than that maximum. It may be difficult to make such changes in your product, but it's worth the effort!
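    Building that filter amounts to very little code; a sketch (the `modified_date` field name follows the convention described above, and `max_modified` would come from a `SELECT MAX(...)` against the Redshift table):

    ```python
    from datetime import datetime, timezone

    def delta_filter(max_modified):
        """PyMongo-style query fetching only records changed since the last run."""
        return {"modified_date": {"$gt": max_modified}}

    # Illustrative high-water mark, as read back from Redshift.
    max_modified = datetime(2024, 1, 1, tzinfo=timezone.utc)
    print(delta_filter(max_modified))
    # with PyMongo this would be: collection.find(delta_filter(max_modified))
    ```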

    Compressing and splitting files while loading

    A good approach is to write files in a compressed format. This saves storage space on the ETL server and also helps when you load data into Redshift, whose COPY command recommends compressed files as input. Also, instead of a single huge file, split your files into parts and give all the parts to a single COPY command. This enables Redshift to use its computing resources across the cluster to copy in parallel, leading to faster loads.
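    The split-and-compress step can be sketched with the standard library alone (row data and part size are illustrative); each resulting blob would be uploaded under a common S3 prefix so that one COPY picks up all the parts:

    ```python
    import gzip
    import io

    def split_and_gzip(rows, rows_per_part):
        """Yield gzip-compressed blobs, each holding at most rows_per_part lines."""
        for i in range(0, len(rows), rows_per_part):
            buf = io.BytesIO()
            with gzip.GzipFile(fileobj=buf, mode="wb") as gz:
                gz.write("\n".join(rows[i:i + rows_per_part]).encode())
            yield buf.getvalue()

    parts = list(split_and_gzip([f"row{n}" for n in range(10)], rows_per_part=4))
    print(len(parts))  # 10 rows in parts of 4 -> 3 parts
    ```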

    Streaming mongo data directly to S3 instead of writing it to ETL server

    One of the major overheads in the ETL process is writing data first to the ETL server and then uploading it to S3. To reduce disk I/O, don't store the data on the ETL server at all; use MongoDB's handy stream API instead. In the MongoDB Node driver, both collection.find() and collection.aggregate() return cursors, and the stream method accepts a transform function as a parameter, where all your custom transform logic can go. AWS S3's Node library's upload() function also accepts readable streams. So take the stream from the MongoDB stream method, pipe it into zlib to gzip it, then feed the readable stream into the S3 library. Simple! This small but important change will bring a large improvement to your ETL process.
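    The same idea carries over to a Python ETL: iterate the PyMongo cursor, gzip documents on the fly into a file-like buffer, and hand that buffer to boto3's `upload_fileobj` instead of ever writing to local disk. A stand-alone sketch of the compression stage (the cursor is faked with a list, and the S3 call is left as a comment since it needs real credentials):

    ```python
    import gzip
    import io
    import json

    def cursor_to_gzip_stream(cursor):
        """Gzip JSON documents from an iterable 'cursor' into a file-like buffer."""
        buf = io.BytesIO()
        with gzip.GzipFile(fileobj=buf, mode="wb") as gz:
            for doc in cursor:
                gz.write((json.dumps(doc) + "\n").encode())
        buf.seek(0)
        return buf

    fake_cursor = [{"_id": 1, "title": "Blade Runner"}, {"_id": 2, "title": "Top Gun"}]
    stream = cursor_to_gzip_stream(fake_cursor)
    # boto3: s3.upload_fileobj(stream, "my-bucket", "movies/part_000.json.gz")
    print(len(gzip.decompress(stream.read()).splitlines()))  # 2 documents
    ```

    This sketch buffers the whole batch in memory; for very large collections you would chunk the cursor and upload part by part.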

    Optimizing Redshift Queries

    Optimizing Redshift queries helps make the ETL system highly scalable and efficient, and also reduces cost. Let's look at some of the approaches:

    Add a distribution key

    The Redshift database is clustered, meaning your data is stored across cluster nodes. When you query for a certain set of records, Redshift has to search for those records in each node, leading to slow queries. A distribution key is a single column which decides how all data records are distributed across your tables' nodes. If you have a single column that is available for all your data, you can specify it as the distribution key. When loading data into Redshift, all data for a given value of the distribution key is placed on a single node of the cluster, so when you query for certain records, Redshift knows exactly where to search. Note that this only helps when you also use the distribution key in your queries.

    Source: Slideshare

     

    Generating a numeric primary key for string primary key

    In MongoDB, you can have any type of field as your primary key. If your Mongo collections have a non-numeric primary key and you use those same keys in Redshift, your joins end up being on string keys, which are slower. Instead, generate numeric keys for your string keys and join on those, which makes queries run much faster. Redshift supports marking a column with the IDENTITY attribute, which auto-generates unique numeric values that you can use as your primary key.
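    In Redshift DDL, the two ideas above look roughly like the sketch below (table and column names are illustrative): `IDENTITY(1, 1)` auto-generates the numeric surrogate key, and distributing on it co-locates joins.

    ```python
    # Illustrative Redshift DDL kept as a string; not executable without a cluster.
    ddl = """
    CREATE TABLE movies (
        movie_sk  BIGINT IDENTITY(1, 1),   -- auto-generated numeric surrogate key
        mongo_id  VARCHAR(64),             -- original string _id from MongoDB
        title     VARCHAR(256)
    )
    DISTSTYLE KEY
    DISTKEY (movie_sk);
    """
    print(ddl)
    ```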

    Conclusion:

    In this blog, I have covered best practices around building ETL pipelines for Redshift based on my learnings. There are many more recommended practices, which can easily be found in the Redshift and MongoDB documentation.