Data Engineering: Beyond Big Data
When a data project comes to mind, the end goal is to enhance the data. It’s about building systems to curate the data in a way that can help the business.

At the dawn of their data engineering journey, people tend to familiarize themselves with the terms “extract,” “transform,” and “load.” These terms, along with traditional data engineering, create the impression that data engineering is only about processing and moving large amounts of data. And why not? We’ve witnessed a tremendous evolution in these technologies, from storing information in simple spreadsheets to managing massive data warehouses and data lakes, supported by advanced infrastructure capable of ingesting and processing huge data volumes.

However, data engineering isn’t limited to ETL; rather, this evolution opens up many opportunities to introduce new technologies and concepts that can support big data processing and are increasingly needed to do so. The expectations from a modern data system extend well beyond mere data movement. There’s a strong emphasis on privacy, especially with the vast amounts of sensitive data that need protection. Speed is crucial, particularly in real-world scenarios like satellite data processing, financial trading, and data processing in healthcare, where minimizing latency is key.

With technologies like AI and machine learning driving analysis on massive datasets, data volumes will inevitably continue to grow. We’ve seen this trend before, just as we once spoke of megabytes and now regularly discuss gigabytes. In the future, we’ll likely talk about terabytes and petabytes with the same familiarity.

These growing expectations have made data engineering a sphere with numerous supporting components, and in this article, we’ll delve into some of those components.
- Data governance
- Metadata management
- Data observability
- Data quality
- Orchestration
- Visualization

Data Governance

With huge amounts of confidential business and user data moving around, handling it safely is a delicate process. We must ensure trust in data processes, and the data itself cannot be compromised. It is essential for a business onboarding users to show that their data is in safe hands. Today, when a business asks you for sensitive information, you’re bound to ask questions such as:
- What if my data is compromised?
- Is it being put to the right use?
- Who’s in control of this data? Are the right personnel using it?
- Is it compliant with the rules and regulations for data practices?
Data governance comes into the picture to answer these questions satisfactorily. At its core, it is a set of rules, policies, principles, and processes for maintaining data integrity. It’s about how we supervise our data and keep it safe. Think of data governance as a protective blanket that covers security risks, creates a healthy environment for data, and builds trust in data processing.

Data governance is a powerful tool in the data engineering arsenal. Its rules and principles are applied consistently throughout all data processing activities: wherever data flows, data governance ensures that it adheres to the established protocols. By adding a sense of trust to the activities involving data, you gain the freedom to focus on your data solution without worrying about external or internal risks. This helps in reaching the ultimate goal: fostering a culture that prioritizes data responsibility.

Looking at how widely data governance applies in data engineering shows its significance and where it needs to be implemented in real-world scenarios. In many entities, such as government organizations or large corporations, data sensitivity is a top priority, and misuse of this data can have widespread negative impacts. To prevent that, we can use tools that provide oversight and support compliance. Let’s briefly explore one of those tools.

Microsoft Purview

Microsoft Purview comes with a range of solutions to protect your data. Let’s look at some of its offerings.
- Insider risk management
  - Microsoft Purview takes care of data security risks from people inside your organization by identifying high-risk individuals.
  - It helps you classify data breaches into different sections and take appropriate action to prevent them.
- Data loss prevention
  - It makes applying data loss prevention policies straightforward.
  - It secures data by restricting important and sensitive data from being deleted and blocks unusual activities, like sharing sensitive data outside your organization.
- Compliance adherence
  - Microsoft Purview can help you make sure that your data processes are compliant with data regulatory bodies and organizational standards.
- Information protection
  - It provides granular control over data, allowing you to define strict accessibility rules.
  - When you need to manage what data can be shared with specific individuals, this control restricts the data visible to others.
- Know your sensitive data
  - It simplifies the process of understanding and learning about your data.
  - MS Purview features ML-based classifiers that label and categorize your sensitive data, helping you identify its specific category.

Metadata Management

Another essential aspect of big data movement is metadata management.

Metadata, simply put, is data about data. This component of data engineering forms the basis for huge improvements in data systems.

You might have come across this headline a while back, which also reappeared recently.

This story is from about a decade ago, and it tells us about metadata’s longevity and how it became a base for greater things.

At the time, Instagram showed the number of likes by running a count function on the database and storing it in a cache. This method was fine because the number wouldn’t change frequently, so the request would hit the cache and get the result. Even if the number changed, the request would query the data, and because the number was small, it wouldn’t scan a lot of rows, saving the data system from being overloaded.

However, when a celebrity posted something, it’d receive so many likes that the count would be enormous and change so frequently that looking into the cache became just an extra step.

The request would trigger a query that would repeatedly scan many rows in the database, overloading the system and causing frequent crashes.

To deal with this, Instagram came up with the idea of denormalizing the tables and storing the number of likes for each post. So, the request would result in a query where the database needs to look at only one cell to get the number of likes. To handle the issue of frequent changes in the number of likes, Instagram began updating the value at small intervals. This story tells how Instagram solved this problem with a simple tweak of using metadata.
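The difference between the two approaches can be sketched with Python’s built-in sqlite3 module. This is only an illustration: the table and column names (`posts`, `likes`, `like_count`) are hypothetical rather than Instagram’s actual schema, and for simplicity the counter is updated on every like instead of at intervals.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE posts (post_id INTEGER PRIMARY KEY, like_count INTEGER NOT NULL DEFAULT 0);
    CREATE TABLE likes (post_id INTEGER, user_id INTEGER);
""")
conn.execute("INSERT INTO posts (post_id) VALUES (1)")


def add_like(post_id, user_id):
    # Record the like and bump the stored counter in one transaction.
    with conn:
        conn.execute("INSERT INTO likes VALUES (?, ?)", (post_id, user_id))
        conn.execute(
            "UPDATE posts SET like_count = like_count + 1 WHERE post_id = ?",
            (post_id,),
        )


def count_by_scan(post_id):
    # Original approach: count every like row that belongs to the post.
    query = "SELECT COUNT(*) FROM likes WHERE post_id = ?"
    return conn.execute(query, (post_id,)).fetchone()[0]


def count_by_lookup(post_id):
    # Denormalized approach: read a single stored value.
    query = "SELECT like_count FROM posts WHERE post_id = ?"
    return conn.execute(query, (post_id,)).fetchone()[0]


for user_id in range(1000):
    add_like(1, user_id)
```

Both functions return the same number, but `count_by_lookup` reads one cell no matter how many likes the post has received.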

Metadata in data engineering has evolved to solve even more significant problems by adding a layer on top of the data flow that works as an interface to communicate with data. Metadata management has become a foundation of multiple data features such as:
- Data lineage: Stakeholders are interested in the results we get from data processes. Sometimes, in order to check the authenticity of data and get answers to questions like where the data originated from, we need to track back to the data source. Data lineage is a property that makes use of metadata to help with this scenario. Many data products like Atlan and data warehouses like Snowflake extensively use metadata for their services.
- Schema information: With a clear understanding of your data’s structure, including column details and data types, we can efficiently troubleshoot and resolve data modeling challenges.
- Data contracts: Metadata helps honor data contracts by keeping a common data profile, which maintains a common data structure across all data usages.
- Stats: Managing metadata can help us easily access data statistics while also giving us quick answers to questions like what the total count of a table is, how many distinct records there are, how much space it takes, and many more.
- Access control: Metadata management also includes having information about data accessibility. As we encountered it in the MS Purview features, we can associate a table with vital information and restrict the visibility of a table or even a column to the right people.
- Audit: Keeping track of information, like who accessed the data, who modified it, or who deleted it, is another important feature that a product with multiple users can benefit from.
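To make the schema information and stats items above concrete, here is a minimal sketch that collects both for a table using Python’s built-in sqlite3 module. The `orders` table and the `table_metadata` helper are hypothetical names chosen for illustration; real catalogs store far richer metadata.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "asha", 20.0), (2, "ben", 35.5), (3, "asha", 12.0)],
)


def table_metadata(table):
    # Identifiers are interpolated directly, so only pass trusted table names.
    # Schema information: column names and declared types from SQLite's catalog.
    columns = {row[1]: row[2] for row in conn.execute(f"PRAGMA table_info({table})")}
    # Stats: total row count and number of distinct values per column.
    row_count = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    distinct = {
        col: conn.execute(f"SELECT COUNT(DISTINCT {col}) FROM {table}").fetchone()[0]
        for col in columns
    }
    return {"columns": columns, "row_count": row_count, "distinct": distinct}


meta = table_metadata("orders")
```

Once a summary like `meta` is stored alongside the table, questions such as “how many rows?” or “what type is this column?” can be answered without touching the data itself.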
There are many other use cases of metadata that enhance data engineering. It’s positively impacting the current landscape and shaping the future trajectory of data engineering. A very good example is a data catalog. Data catalogs focus on enriching datasets with information about data. Table formats, such as Iceberg and Delta, use catalogs to provide integration with multiple data sources, handle schema evolution, etc. Popular cloud services like AWS Glue also use metadata for features like data discovery. Tech giants like Snowflake and Databricks rely heavily on metadata for features like faster querying, time travel, and many more.

With the introduction of AI in the data domain, metadata management has a huge effect on the future trajectory of data engineering. Services such as Cortex and Fabric have integrated AI systems that use metadata for easy questioning and answering. When AI gets to know the context of data, the application of metadata becomes limitless.

Data Observability

We know how important metadata can be, and while it’s important to know your data, it’s just as important to know about the processes working on it. That’s where observability enters the discussion. It is another crucial aspect of data engineering and a component we can’t leave out of our data project.

Data observability is about setting up systems that can give us visibility over different services that are working on the data. Whether it’s ingestion, processing, or load operations, having visibility into data movement is essential. This not only ensures that these services remain reliable and fully operational, but it also keeps us informed about the ongoing processes. The ultimate goal is to proactively manage and optimize these operations, ensuring efficiency and smooth performance. We need to achieve this goal because it’s very likely that whenever we create a data system, multiple issues, as well as errors and bugs, will start popping out of nowhere.

So, how do we keep an eye on these services to see whether they are performing as expected? The answer to that is setting up monitoring and alerting systems.

Monitoring

Monitoring is the continuous tracking and measurement of key metrics and indicators that tell us about the system’s performance. Many cloud services offer comprehensive performance metrics, presented through interactive visuals. These tools provide valuable insights, such as throughput, which measures the volume of data processed per second, and latency, which indicates how long it takes to process the data. They also track errors and error rates, detailing the types of errors and how frequently they happen.
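As a rough illustration of how these three metrics relate to raw run data, the sketch below derives them from a list of pipeline runs. The `BatchRun` structure and `summarize` function are hypothetical; monitoring tools compute such figures from their own telemetry.

```python
from dataclasses import dataclass


@dataclass
class BatchRun:
    records: int    # records processed in this run
    seconds: float  # wall-clock processing time
    failed: bool    # whether the run ended in an error


def summarize(runs):
    total_records = sum(run.records for run in runs)
    total_seconds = sum(run.seconds for run in runs)
    return {
        # Throughput: volume of data processed per second.
        "throughput_per_s": total_records / total_seconds,
        # Latency: average time one run takes to complete.
        "avg_latency_s": total_seconds / len(runs),
        # Error rate: share of runs that failed.
        "error_rate": sum(run.failed for run in runs) / len(runs),
    }


runs = [BatchRun(1000, 2.0, False), BatchRun(500, 1.0, False), BatchRun(0, 1.0, True)]
metrics = summarize(runs)
```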

To lay the foundation for monitoring, tools like Prometheus and Datadog provide these features, indicating the performance of data systems and their underlying infrastructure. There is also Graylog, which offers multiple features for monitoring a system’s logs in real time.

Now that we have a system that gives us visibility into the performance of processes, we need a setup that can notify us if anything goes sideways.

Alerting

Setting up alerting systems allows us to receive notifications directly within the applications we use regularly, eliminating the need for someone to constantly monitor metrics on a UI or watch graphs all day, which would be a waste of time and resources. This is why alerting systems are designed to trigger notifications based on predefined thresholds, such as throughput dropping below a certain level, latency exceeding a specific duration, or the occurrence of specific errors. These alerts can be sent to channels like email or Slack, ensuring that users are immediately aware of any unusual conditions in their data processes.
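The threshold logic described above can be sketched in a few lines. The metric names, the rule format, and the `check_thresholds` function are assumptions made for this example, and delivering the messages to email or Slack is deliberately left out.

```python
def check_thresholds(metrics, rules):
    """Return one alert message per rule that the metrics violate."""
    alerts = []
    for name, (kind, limit) in rules.items():
        value = metrics[name]
        if kind == "min" and value < limit:
            alerts.append(f"{name} dropped to {value} (minimum {limit})")
        elif kind == "max" and value > limit:
            alerts.append(f"{name} rose to {value} (maximum {limit})")
    return alerts


# Hypothetical rules: alert on low throughput or high latency.
rules = {"throughput_per_s": ("min", 100), "latency_s": ("max", 5)}
alerts = check_thresholds({"throughput_per_s": 80, "latency_s": 2}, rules)
```

Here only the throughput rule fires, so `alerts` holds a single message that a notifier could forward to the team’s channel.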

Implementing observability will significantly impact data systems. By setting up monitoring and alerting, we can quickly identify issues as they arise and gain context about the nature of the errors. This insight allows us to pinpoint the source of problems, effectively debug and rectify them, and ultimately reduce downtime and service disruptions, saving valuable time and resources.

Data Quality

Knowing the data and its processes is undoubtedly important, but all this knowledge is futile if the data itself is of poor quality. That’s where another essential component of data engineering, data quality, comes into play: processing data is one thing; preparing it for processing is another.

In a data project involving multiple sources and formats, various discrepancies are likely to arise. These can include missing values, where essential data points are absent; outdated data, which no longer reflects current information; poorly formatted data that doesn’t conform to expected standards; incorrect data types that lead to processing errors; and duplicate rows that skew results and analyses. Addressing these issues will ensure the accuracy and reliability of the data used in the project.
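A minimal sketch of how some of these discrepancies could be counted in plain Python follows. The row layout (`id`, `amount`, `updated`) and the `quality_report` function are hypothetical; dedicated libraries offer far more thorough checks.

```python
from datetime import date


def quality_report(rows, required, max_age_days, today):
    report = {"missing": 0, "bad_type": 0, "stale": 0, "duplicate": 0}
    seen = set()
    for row in rows:
        # Missing values: a required field is absent or None.
        if any(row.get(field) is None for field in required):
            report["missing"] += 1
        # Incorrect data types: amount is present but not numeric.
        amount = row.get("amount")
        if amount is not None and not isinstance(amount, (int, float)):
            report["bad_type"] += 1
        # Outdated data: last update is older than the allowed age.
        updated = row.get("updated")
        if updated is not None and (today - updated).days > max_age_days:
            report["stale"] += 1
        # Duplicate rows: the exact same field values were already seen.
        fingerprint = tuple(sorted(row.items(), key=lambda item: item[0]))
        if fingerprint in seen:
            report["duplicate"] += 1
        seen.add(fingerprint)
    return report


rows = [
    {"id": 1, "amount": 10.0, "updated": date(2024, 8, 1)},
    {"id": 2, "amount": None, "updated": date(2024, 8, 1)},
    {"id": 3, "amount": "12", "updated": date(2024, 8, 1)},
    {"id": 4, "amount": 5.0, "updated": date(2023, 1, 1)},
    {"id": 1, "amount": 10.0, "updated": date(2024, 8, 1)},
]
report = quality_report(
    rows, required=["id", "amount"], max_age_days=90, today=date(2024, 8, 30)
)
```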

Data quality involves enhancing data with key attributes. For instance, accuracy measures how closely the data reflects reality, validity ensures that the data accurately represents what we aim to measure, and completeness guarantees that no critical data is missing. Additionally, attributes like timeliness ensure the data is up to date. Ultimately, data quality is about embedding attributes that build trust in the data. For a deeper dive into this, check out Rita’s blog on Data QA: The Need of the Hour.

Data quality plays a crucial role in elevating other processes in data engineering. In a data engineering project, there are often multiple entry points for data processing, with data being refined at different stages to achieve a better state each time. Assessing data at the source of each processing stage and addressing issues early on is vital. This approach ensures that data standards are maintained throughout the data flow. As a result, by making data consistent at every step, we gain improved control over the entire data lifecycle.

Data tools like Great Expectations and data unit test libraries such as Deequ play a crucial role in safeguarding data pipelines by implementing data quality checks and validations. To gain more context on this, you might want to read Unit Testing Data at Scale using Deequ and Apache Spark by Nishant. These tools ensure that data meets predefined standards, allowing for early detection of issues and maintaining the integrity of data as it moves through the pipeline.

Orchestration

With so many processes in place, it’s essential to ensure everything happens at the right time and in the right way. Relying on someone to manually trigger processes at scheduled times every day is an inefficient use of resources. For that individual, performing the same repetitive tasks can quickly become monotonous. Beyond that, manual execution increases the risk of missing schedules or running tasks out of order, disrupting the entire workflow.

This is where orchestration comes to the rescue, automating tedious, repetitive tasks and ensuring precision in the timing of data flows. Data pipelines can be complex, involving many interconnected components that must work together seamlessly. Orchestration ensures that each component follows a defined set of rules, dictating when to start, what to do, and how to contribute to the overall process of handling data, thus maintaining smooth and efficient operations.

This automation helps reduce errors that could occur with manual execution, ensuring that data processes remain consistent by streamlining repetitive tasks. With a number of different orchestration tools and services available, we can now monitor and manage everything from a single platform. Tools like Airflow, an open-source orchestrator; Prefect, a Python-based orchestrator with a user-friendly UI; and cloud services such as Azure Data Factory, Google Cloud Composer, and AWS Step Functions enhance our visibility and control over the entire process lifecycle, making data management more efficient and reliable. Don’t miss Shreyash’s excellent blog on Mage: Your New Go-To Tool for Data Orchestration.

Orchestration is built on a foundation of multiple concepts and technologies that make it robust and fail-safe. These underlying principles ensure that orchestration not only automates processes but also maintains reliability and resilience, even in complex and demanding data environments.
- Workflow definition: This defines how tasks in the pipeline are organized and executed. It lays out the sequence of tasks—telling it what needs to be finished before other tasks can start—and takes care of other conditions for pipeline execution. Think of it like a roadmap that guides the flow of tasks.
- Task scheduling: This determines when and how tasks are executed. Tasks might run at specific times, in response to events, or based on the completion of other tasks. It’s like scheduling appointments for tasks to ensure they happen at the right time and with the right resources.
- Dependency management: Since tasks often rely on each other, with the concepts of dependency management, we can ensure that tasks run in the correct order. It ensures that each process starts only when its prerequisites are met, like waiting for a green light before proceeding.
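Workflow definition and dependency management can be sketched with `graphlib.TopologicalSorter` from Python’s standard library (available since Python 3.9). The task names and the `run_pipeline` helper are hypothetical; real orchestrators add scheduling, retries, and parallelism on top of the same idea.

```python
from graphlib import TopologicalSorter

# Workflow definition: each task maps to the tasks that must finish first.
dependencies = {
    "extract": set(),
    "clean": {"extract"},
    "enrich": {"extract"},
    "load": {"clean", "enrich"},
}


def run_pipeline(dependencies, run_task):
    order = []
    # static_order() yields every task only after all of its prerequisites.
    for task in TopologicalSorter(dependencies).static_order():
        run_task(task)
        order.append(task)
    return order


order = run_pipeline(dependencies, run_task=lambda task: None)
```

In this sketch `extract` always runs first and `load` always runs last, while `clean` and `enrich` may run in either order because neither depends on the other.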
With these concepts, orchestration tools provide powerful features for workflow design and management, enabling the definition of complex, multi-step processes. They support parallel, sequential, and conditional execution of tasks, allowing for flexibility in how workflows are executed. Not just that, they also offer event-driven and real-time orchestration, enabling systems to respond to dynamic changes and triggers as they occur. These tools also include robust error handling and exception management, ensuring that workflows are resilient and fault-tolerant.
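One common form of the error handling mentioned above is retrying a failed task with an increasing delay. The sketch below assumes a hypothetical `run_with_retries` helper and a deliberately flaky task; it is not the API of any particular orchestration tool.

```python
import time


def run_with_retries(task, max_attempts=3, base_delay=0.0):
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except Exception:
            # Give up once the final attempt has failed.
            if attempt == max_attempts:
                raise
            # Exponential backoff: wait longer after each failure.
            time.sleep(base_delay * 2 ** (attempt - 1))


calls = {"count": 0}


def flaky_task():
    # Hypothetical task that fails twice before succeeding.
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("transient failure")
    return "done"


result = run_with_retries(flaky_task)
```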

Visualization

The true value lies not just in collecting vast amounts of data but in interpreting it in ways that generate real business value. This makes data visualization a vital component: it provides a clear and accurate representation of data that decision-makers can easily understand and use. Presenting data in the right way enables businesses to derive intelligence from it, which makes data engineering worth the investment. That intelligence is what guides strategic decisions, optimizes operations, and powers innovation.

Visualizations allow us to see patterns, trends, and anomalies that might not be apparent in raw data. Whether it’s spotting a sudden drop in sales, detecting anomalies in customer behavior, or forecasting future performance, data visualization can provide the clear context needed to make well-informed decisions. When numbers and graphs are presented effectively, it feels as though we are directly communicating with the data, and this language of communication bridges the gap between technical experts and business leaders.

Visualization Within ETL Processes

Visualization isn’t just a final output. It can also be a valuable tool within the data engineering process itself. Intermediate visualization during the ETL workflow can be a game-changer. In collaborative teams, as we go through the transformation process, visualizing it at various stages helps ensure the accuracy and relevance of the result. We can understand the datasets better, identify issues or anomalies between different stages, and make more informed decisions about the transformations needed.

Technologies like Fabric and Mage enable seamless integration of visualizations into ETL pipelines. These tools empower team members at all levels to actively engage with data, ask insightful questions, and contribute to the decision-making process. Visualizing datasets at key points provides the flexibility to verify that data is being processed correctly, develop accurate analytical formulas, and ensure that the final outputs are meaningful.

Depending on the industry and domain, there are various visualization tools suited to different use cases. For example:
- For real-time insights, which are crucial in industries like healthcare, financial trading, and air travel, tools such as Tableau and Striim are invaluable. These tools allow for immediate visualization of live data, enabling quick and informed decision-making.
- For broad data source integrations and dynamic dashboard querying, often demanded in the technology sector, tools like Power BI, Metabase, and Grafana are highly effective. These platforms support a wide range of data sources and offer flexible, interactive dashboards that facilitate deep analysis and exploration of data.

It’s Limitless

We are seeing many advancements in this domain, which are helping businesses, data science, AI and ML, and many other sectors, because the potential of data is huge. If a business knows how to use data, that can be a major factor in its success. For that reason, we have seen a steady rise of different components in data engineering, all with one goal: to make data useful.

Recently, we’ve witnessed the introduction of numerous technologies poised to revolutionize the data engineering domain. Concepts like data mesh are enhancing data discovery, improving data ownership, and streamlining data workflows. AI-driven data engineering is rapidly advancing, with expectations to automate key processes such as data cleansing, pipeline optimization, and data validation. We’re already seeing how cloud data services have evolved to embrace AI and machine learning, ensuring seamless integration with data initiatives. The rise of real-time data processing brings new use cases and advancements, while practices like DataOps foster better collaboration among teams. Take a closer look at the modern data stack in Shivam’s detailed article, Modern Data Stack: The What, Why, and How?

These developments are accompanied by a wide array of technologies designed to support infrastructure, analytics, AI, and machine learning, alongside enterprise tools that lay the foundation for this ongoing evolution. All these elements collectively set the stage for a broader discussion on data engineering and what lies beyond big data. Big data, supported by these satellite activities, aims to extract maximum value from data, unlocking its full potential.

August 30, 2024
Modern Data Stack: The What, Why and How?
This post will provide you with a comprehensive overview of the modern data stack (MDS), including its benefits, how its components differ from its predecessors’, and what its future holds.

“Modern” has the connotation of being up-to-date, of being better. This is true for MDS, but how exactly is MDS better than what was before?

What was the data stack like?…

A couple of decades back, the MapReduce breakthrough made it possible to efficiently process large amounts of data in parallel on multiple machines.

It provided the backbone of a standard pipeline that looked like:

It was common to see HDFS used for storage, Spark for computing, and Hive to perform SQL queries on top.

To run this, we had people handling the deployment and maintenance of Hadoop on their own.

This core attribute of the setup eventually became a pain point and made it complex and inefficient in the long run.

Being on-prem while facing ever-heavier loads meant scalability became a huge concern.

Hence, unlike today, the process was much more manual. Adding more RAM, increasing storage, and rolling out updates manually reduced productivity.

Moreover,
- The pipeline wasn’t modular; components were tightly coupled, causing failures whenever a team tried to shift to something new.
- Teams committed to specific vendors and found themselves locked in, by design, for years.
- Setup was complex, and the infrastructure was not resilient. Random surges in data crashed the systems. (This randomness in demand has only increased since the early decades of the internet, due to social media-triggered virality.)
- Self-service was non-existent. If you wanted to do anything with your data, you needed data engineers.
- Observability was a myth. Your pipeline is failing, but you’re unaware, and then you don’t know why, where, how…Your customers become your testers, knowing more about your system’s issues.
- Data protection laws weren’t as formalized, and organizations in particular lacked internal policies.

These issues made the traditional setup inefficient in solving modern problems, and as we all know…

For an upgraded modern setup, we needed something that is scalable, has a gentler learning curve, and is feasible for both a seed-stage startup and a Fortune 500 company.

Standing on the shoulders of tech innovations from the 2000s, data engineers started building a blueprint for MDS tooling with three core attributes:

Cloud Native (or the ocean)

Arguably the definitive change of the MDS era, the cloud removes the hassle of on-prem infrastructure and enables horizontal or vertical auto-scaling, which has become a technical requirement in an era of virality and traffic spikes.

Modularity

The M in MDS could stand for modular.

You can integrate any MDS tool into your existing stack, like LEGO blocks.

You can test out multiple tools, whether they’re open source or managed, choose the best fit, and iteratively build out your data infrastructure.

This mindset helps instill a habit of avoiding vendor lock-in by continuously upgrading your architecture with relative ease.

By moving away from the ancient, one-size-fits-all model, MDS recognizes the uniqueness of each company’s budget, domain, data types, and maturity—and provides the correct solution for a given use case.

Ease of Use

MDS tools are easier to set up. You can start playing with these tools within a day.

Importantly, the ease of use is not limited to technical engineers.

Owing to the rise of self-serve and no-code tools like Tableau, data is finally democratized for all kinds of consumers. SQL remains crucial, but for basic metric calculations, PMs, Sales, Marketing, etc., can use simple drag and drop in the UI (sometimes even simpler than Excel pivot tables).

MDS also enables one to experiment with different architectural frameworks for their use case. For example, ELT vs. ETL (explained under Data Transformation).

But, one might think such improvements mean MDS is the v1.1 of Data Stack, a tech upgrade that ultimately uses data to solve similar problems.

Fortunately, that’s far from the case.

MDS enables data to solve more human problems across the org—problems that employees have long been facing but could never systematically solve for, helping generate much more value from the data.

Beyond these, employees want transparency and visibility into how any metric was calculated and which data source in Snowflake was used to build which specific Tableau dashboard.

Critically, with compliance finally being focused on, orgs need solutions for giving the right people the right access at the right time.

Lastly, as opposed to previous eras, these days even startups have varied infrastructure components holding data. If you’re a PM tasked with bringing insights, how do you know where to start? What data assets does the organization have?

Besides these problem statements being tackled, MDS builds a culture of upskilling employees in various data concepts.

Data security, governance, and data lineage are important irrespective of department or persona in the organization.

From designers to support executives, the need for a data-driven culture is a given.

You’re probably bored of hearing how good the MDS is and want to deconstruct it into its components.

Let’s dive in.

SOURCES

In our modern era, every product is inevitably becoming a tech product.

From a smart bulb to an orbiting satellite, each generates data in its own unique flavor of frequency of generation, data format, data size, etc.

Social media, microservices, IoT devices, smart devices, DBs, CRMs, ERPs, flat files, and a lot more…

INGESTION

Once data has been created, how does one “ingest,” or take in, that data for actual usage (the whole point of investing)?

Roughly, there are three categories to help describe the ingestion solutions:

Generic tools allow us to connect various data sources with data storages.

E.g.: we can connect Google Ads or Salesforce to dump data into BigQuery or S3.

These generic tools highlight the modularity and the low-code or no-code aspect of MDS.

Things are as easy as drag and drop, and one doesn’t need to be fluent in scripting.

Then we have programmable tools, where we get more control over how we ingest data through code.

For example, we can write Apache Airflow DAGs in Python to load data from S3 and dump it to Redshift.

Intermediary – these tools cater to a specific use case or are coupled with the source itself.

E.g.: Snowpipe, a part of Snowflake itself, allows us to load data from files as soon as they’re available at the source.

DATA STORAGE

Where do you ingest data into?

Here, we’ve expanded from HDFS and SQL DBs to a wider variety of formats (NoSQL, document DBs).

Depending on the use case and the way you interact with data, you can choose from a data warehouse (DW), database (DB), data lake (DL), object store, etc.

You might need a standard relational DB for transactions in finance, or you might be collecting logs. You might be experimenting with your product at an early stage and be fine with NoSQL without worrying about prescribing schemas.

One key feature to note is that most of these are cloud-based. So, no more worrying about scalability, and we pay only for what we use.

PS: Do stick around till the end for new concepts of Lake House and reverse ETL (already prevalent in the industry).

DATA TRANSFORMATION

The stored raw data must be cleaned and restructured into the shape we deem best for actual usage. This slicing and dicing is different for every kind of data.

For example, we have tools for the E-T-L way, which can be categorized into SaaS and Frameworks, e.g., Fivetran and Spark respectively.

Interestingly, the cloud era has given storage enough computational capability that, sometimes, we don’t even need an external system for transformation.

With this rise of E-LT, we leverage the processing capabilities of cloud data warehouses or lake houses. Using tools like dbt, we write templated SQL queries to transform our data in the warehouse or lake house itself.
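The E-LT pattern can be sketched as follows. This is an analogy, not dbt itself: dbt templates SQL with Jinja, whereas this example uses `string.Template` from Python’s standard library as a stand-in, and an in-memory SQLite database plays the role of the warehouse. The table names are hypothetical.

```python
import sqlite3
from string import Template

warehouse = sqlite3.connect(":memory:")

# "E" and "L": raw data is loaded into the warehouse as-is.
warehouse.execute("CREATE TABLE raw_orders (order_id INTEGER, status TEXT, amount REAL)")
warehouse.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?)",
    [(1, "paid", 20.0), (2, "cancelled", 15.0), (3, "paid", 30.0)],
)

# "T": a templated SQL model, rendered and then executed inside the warehouse.
model = Template(
    "CREATE TABLE $target AS "
    "SELECT status, SUM(amount) AS revenue FROM $source GROUP BY status"
)
warehouse.execute(model.substitute(target="orders_by_status", source="raw_orders"))

revenue = dict(warehouse.execute("SELECT status, revenue FROM orders_by_status"))
```

The transformation runs entirely on the warehouse’s own engine; the Python side only renders the template and submits the query.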

This is enabling analysts to do the heavy lifting on traditional DE problems.

We also see stream processing where we work with applications where “micro” data is processed in real time (analyzed as soon as it’s produced, as opposed to large batches).
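To illustrate the contrast with batch processing, the sketch below updates a running average after every single event rather than waiting for a full batch. The `sensor_readings` generator is a hypothetical stand-in for an unbounded source such as a message queue.

```python
def running_average(events):
    """Yield the updated average after every event, without waiting for a batch."""
    count, total = 0, 0.0
    for value in events:
        count += 1
        total += value
        yield total / count


def sensor_readings():
    # Stand-in for an unbounded stream; a real source would never finish.
    yield from [10.0, 20.0, 30.0]


averages = list(running_average(sensor_readings()))
```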

DATA VISUALIZATION

The ability to visually learn from data has only improved in the MDS era with advanced design, methodology, and integration.

With Embedded analytics, one can integrate analytical capabilities and data visualizations into the software application itself.

External analytics tools, on the other hand, are used to build visualizations on top of your processed data. You choose your source, create a chart, and let it run.

DATA SCIENCE, MACHINE LEARNING, MLOps

Source: https://medium.com/vertexventures/thinking-data-the-modern-data-stack-d7d59e81e8c6

In the last decade, we have moved beyond ad-hoc insight generation in Jupyter notebooks to production-ready, real-time ML workflows, like recommendation systems and price predictions. Any startup can and does integrate ML into its products.

Most cloud service providers offer machine learning models and automated model building as a service.

MDS concepts like data observability are used to build tools for ML practitioners, whether it’s feature stores (a feature store is a central repository that provides entity values as of a certain time) or model monitoring (checking data drift, tracking model performance, and improving model accuracy).

This is extremely important, as statisticians can focus on the business problem rather than the infrastructure.
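The “entity values as of a certain time” part of a feature store is easier to see in code. The sketch below is a toy in-memory version, not the API of any real feature store product. The class FeatureStore, its methods, and the feature name orders_30d are assumptions made for illustration.

```python
from bisect import bisect_right

class FeatureStore:
    """Toy feature store: keeps the full history of each feature value."""

    def __init__(self):
        # (entity, feature) -> list of (timestamp, value), kept sorted by timestamp
        self._history = {}

    def write(self, entity, feature, timestamp, value):
        rows = self._history.setdefault((entity, feature), [])
        rows.append((timestamp, value))
        rows.sort(key=lambda row: row[0])

    def read_as_of(self, entity, feature, timestamp):
        """Return the latest value written at or before `timestamp`, else None."""
        rows = self._history.get((entity, feature), [])
        times = [t for t, _ in rows]
        idx = bisect_right(times, timestamp)
        return rows[idx - 1][1] if idx else None

store = FeatureStore()
store.write("user_1", "orders_30d", timestamp=100, value=2)
store.write("user_1", "orders_30d", timestamp=200, value=5)

print(store.read_as_of("user_1", "orders_30d", 150))
```

Reading “as of” a past timestamp is what lets training data be rebuilt without leaking values that were only known later.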

This is an ever-expanding field where concepts such as MLOps (DevOps for ML pipelines: optimizing workflows, efficient transformations) and synthetic media (using AI to generate content itself) arrive and quickly become mainstream.

ChatGPT is the current buzz, but by the time you’re reading this, I’m sure there’s going to be an updated one; such is the pace of development.

DATA ORCHESTRATION

With a higher number of modularized tools and source systems comes added complexity.

More steps, processes, connections, settings, and synchronization are required.

Data orchestration in the MDS needs to be cron on steroids.

Orchestration tools coordinate this wide variety of products so that the right data arrives for the right purposes, based on complex dependency logic.
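What separates an orchestrator from plain cron is that it knows which steps depend on which. A minimal sketch of that idea, using only the standard library (graphlib, Python 3.9+), is shown below. The step names in PIPELINE and the helper run_pipeline are hypothetical, and real orchestrators add scheduling, retries, and monitoring on top.

```python
from graphlib import TopologicalSorter

# Each step maps to the set of steps it depends on (names are made up).
PIPELINE = {
    "ingest": set(),
    "transform": {"ingest"},
    "quality_checks": {"transform"},
    "dashboard": {"quality_checks"},
    "reverse_etl": {"quality_checks"},
}

def run_pipeline(dependencies, tasks):
    """Run every task after all of its upstream dependencies have run."""
    executed = []
    for name in TopologicalSorter(dependencies).static_order():
        tasks[name]()
        executed.append(name)
    return executed

log = []
# Each task just records its own name; real tasks would call other tools.
tasks = {name: (lambda n=name: log.append(n)) for name in PIPELINE}
order = run_pipeline(PIPELINE, tasks)
print(order)
```

The relative order of independent steps (here dashboard and reverse_etl) is not guaranteed, only that dependencies run first.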

DATA OBSERVABILITY

Data observability is the ability to monitor and understand the state and behavior of data as it flows through an organization’s systems.

In a traditional data stack, organizations often rely on reactive approaches to data management, only addressing issues as they arise. In contrast, data observability in an MDS involves adopting a proactive mindset, where organizations actively monitor and understand the state of their data pipelines to identify potential issues before they become critical.

Monitoring – a dashboard that provides an operational view of your pipeline or system

Alerting – both for expected events and anomalies

Tracking – ability to set and track specific events

Analysis – automated issue detection that adapts to your pipeline and data health

Logging – a record of an event in a standardized format for faster resolution

SLA Tracking – measurement of data quality against predefined standards (cost, performance, reliability)

Data Lineage – graph representation of data assets showing upstream/downstream steps.
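Two of the capabilities listed above, monitoring and alerting, can be sketched with a simple volume and freshness check. The function check_table and its thresholds are assumptions for illustration. Real observability tools typically learn such thresholds from historical behavior rather than hard-coding them.

```python
from datetime import datetime, timedelta

def check_table(row_count, last_loaded_at, now, min_rows, max_staleness):
    """Return a list of alert messages; an empty list means the table looks healthy."""
    alerts = []
    if row_count < min_rows:
        alerts.append(f"volume: got {row_count} rows, expected at least {min_rows}")
    staleness = now - last_loaded_at
    if staleness > max_staleness:
        alerts.append(f"freshness: last load was {staleness} ago")
    return alerts

# Fixed timestamps keep the example deterministic.
now = datetime(2023, 1, 4, 12, 0)
alerts = check_table(
    row_count=40,
    last_loaded_at=datetime(2023, 1, 4, 6, 0),
    now=now,
    min_rows=100,
    max_staleness=timedelta(hours=2),
)
for message in alerts:
    print("ALERT", message)
```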

DATA GOVERNANCE & SECURITY

Data security is a critical consideration for organizations of all sizes and industries and needs to be prioritized to protect sensitive information, ensure compliance, and preserve business continuity.

The introduction of stricter data protection regulations, such as the General Data Protection Regulation (GDPR) and the California Consumer Privacy Act (CCPA), created a huge need in the market for MDS tools that help organizations govern and secure their data efficiently and painlessly.

DATA CATALOG

Now that we have all the components of the MDS, from ingestion to BI, we have so many sources, dashboards, reports, views, and other metadata that we need a Google-like search engine just to navigate our components.

This is where a data catalog helps; it allows people to stitch together the metadata (data about your data: the number of rows in a table, the column names, types, etc.) across sources.

This is necessary for people to efficiently discover, understand, trust, and collaborate on data assets.

We don’t want product managers (PMs) and go-to-market (GTM) teams looking at different dashboards for the same adoption data.
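As a rough sketch of what stitching metadata across sources means, the example below collects table names, column names and types, and row counts from two stand-in sources (in-memory SQLite databases) into one searchable list. The source names crm and billing, the tables, and the helpers collect_metadata and search are all made up for illustration.

```python
import sqlite3

def collect_metadata(conn, source_name):
    """Gather table, column, and row-count metadata from one SQLite source."""
    entries = []
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    for (table,) in tables:
        # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
        columns = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
        row_count = conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]
        entries.append({
            "source": source_name,
            "table": table,
            "columns": [(col[1], col[2]) for col in columns],
            "rows": row_count,
        })
    return entries

def search(catalog, keyword):
    """Find every table whose name or column names mention the keyword."""
    keyword = keyword.lower()
    return [
        entry for entry in catalog
        if keyword in entry["table"].lower()
        or any(keyword in name.lower() for name, _ in entry["columns"])
    ]

crm = sqlite3.connect(":memory:")
crm.execute("CREATE TABLE customers (customer_id INTEGER, email TEXT)")
crm.execute("INSERT INTO customers VALUES (1, 'a@example.com')")

billing = sqlite3.connect(":memory:")
billing.execute("CREATE TABLE invoices (invoice_id INTEGER, customer_id INTEGER, total REAL)")

catalog = collect_metadata(crm, "crm") + collect_metadata(billing, "billing")
hits = search(catalog, "customer_id")
print([(hit["source"], hit["table"]) for hit in hits])
```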

Previously, the sole purpose of the original data pipeline was to aggregate and upload events to Hadoop/Hive for batch processing. Chukwa collected events and wrote them to S3 in Hadoop sequence file format. In those days, end-to-end latency was up to 10 minutes. That was sufficient for batch jobs, which usually scan data at daily or hourly frequency.

With the emergence of Kafka and Elasticsearch over the last decade, there has been a growing demand for real-time analytics at Netflix. By real-time, we mean sub-minute latency. Instead of starting from scratch, Netflix was able to iteratively grow its MDS as market requirements changed.

Source: https://blog.transform.co/data-talks/the-metric-layer-why-you-need-it-examples-and-how-it-fits-into-your-modern-data-stack/

This is a snapshot of the MDS that a data-mature company like Netflix had some years back, where, instead of a few all-in-one tools, each data category was handled by a specialized tool.

FUTURE COMPONENTS OF MDS?

DATA MESH

Source: https://martinfowler.com/articles/data-monolith-to-mesh.html

The top picture shows how teams currently operate, where no matter the feature or product on the Y axis, the data pipeline’s journey remains the same as it moves along the X axis. But in an ideal world of data mesh, those who know the data should own its journey.

As decentralization is the name of the game, data mesh is MDS’s response to this demand for an architecture shift where domain owners use self-service infrastructure to shape how their data is consumed.

DATA LAKEHOUSE

Source: https://www.altexsoft.com/blog/data-lakehouse/

We have talked about data warehouses and data lakes being used for data storage.

Initially, when we only needed structured data, data warehouses were used. Later, with big data, we started getting all kinds of data, structured and unstructured.

So, we started using Data Lakes, where we just dumped everything.

The lakehouse tries to combine the best of both worlds by adding an intelligent metadata layer on top of the data lake. This layer basically classifies and categorizes data such that it can be interpreted in a structured manner.

Also, all the data in the lakehouse is open, meaning that it can be used by all kinds of tools. Lakehouses are generally built on open data formats like Parquet for exactly this reason.

End users can simply run their SQL queries as if they were querying a data warehouse (DWH).

REVERSE ETL

Suppose you’re a salesperson using Salesforce and want to know if a lead you just got is warm or cold (warm indicating a higher chance of conversion).

The attributes of your lead, like salary and age, are fetched from your OLTP system into a DWH and analyzed, and then the flag “warm” is sent back to the Salesforce UI, ready to be used in live operations.
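The round trip described above can be sketched as follows. This is only an illustration: sqlite3 stands in for the DWH, a plain dict stands in for the CRM (no real Salesforce API is called), and the scoring rule with its salary and age thresholds is invented purely for the example.

```python
import sqlite3

# Stand-in DWH holding lead attributes copied from the operational system.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE leads (lead_id TEXT, salary INTEGER, age INTEGER)")
warehouse.executemany(
    "INSERT INTO leads VALUES (?, ?, ?)",
    [("L-1", 120000, 41), ("L-2", 30000, 23)],
)

# Stand-in CRM: a dict of records keyed by lead id (hypothetical, not Salesforce).
crm_records = {"L-1": {}, "L-2": {}}

def score_leads(conn, min_salary, min_age):
    """Analyze leads in the warehouse; the rule here is a made-up example."""
    query = (
        "SELECT lead_id, "
        "CASE WHEN salary >= ? AND age >= ? THEN 'warm' ELSE 'cold' END "
        "FROM leads"
    )
    return dict(conn.execute(query, (min_salary, min_age)).fetchall())

def sync_to_crm(scores, crm):
    """The reverse ETL step: push analyzed flags back into the operational tool."""
    for lead_id, flag in scores.items():
        if lead_id in crm:
            crm[lead_id]["lead_temperature"] = flag

scores = score_leads(warehouse, min_salary=80000, min_age=25)
sync_to_crm(scores, crm_records)
print(crm_records)
```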

METRICS LAYER

The metrics layer will be all about consistency, accessibility, and trust in the calculations of metrics.

Earlier, metrics lived in Excel files versioned as v1, v1.1, and so on, with logic scattered around.

Currently, in the modern data stack world, each team’s calculation is isolated in the tool they are used to. For example, BI teams would store metrics in Tableau dashboards, while data engineers would use code.

A metrics layer would exist to give every other tool in the data stack access to the same metric definitions.

For example, the dbt metrics layer helps define these in the warehouse, where they are accessible to both BI and engineers. Similarly, Looker, Mode, and others have their own approaches to it.
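The core idea, one shared definition per metric that every consumer queries, can be sketched like this. It does not reflect the actual syntax of dbt, Looker, or Mode. The METRICS registry, the query_metric helper, and the events table are assumptions for illustration.

```python
import sqlite3

# One central place where each metric is defined exactly once.
METRICS = {
    "active_users": "SELECT COUNT(DISTINCT user_id) FROM events WHERE event = 'login'",
    "revenue": "SELECT COALESCE(SUM(amount), 0) FROM events WHERE event = 'purchase'",
}

def query_metric(conn, name):
    """Every consumer (BI tool, pipeline, notebook) goes through this function."""
    return conn.execute(METRICS[name]).fetchone()[0]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id INTEGER, event TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [
        (1, "login", None),
        (1, "login", None),
        (2, "login", None),
        (2, "purchase", 30.0),
        (3, "purchase", 12.5),
    ],
)

# A dashboard and a pipeline asking for the same metric get the same answer.
dashboard_value = query_metric(conn, "active_users")
pipeline_value = query_metric(conn, "active_users")
print(dashboard_value, pipeline_value, query_metric(conn, "revenue"))
```

Because the logic lives in one place, changing how active_users is calculated changes it for every consumer at once.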

In summary, this blog post discussed the modern data stack and its advantages over older approaches. We examined the components of the modern data stack, including data sources, ingestion, transformation, and more, and how they work together to create an efficient and effective system for data management and analysis. We also highlighted the benefits of the modern data stack, including increased efficiency, scalability, and flexibility.

As technology continues to advance, the modern data stack will evolve and incorporate new components and capabilities.
January 4, 2023
