- Infrastructure Costs cut by 30-34% monthly, optimizing resource utilization and generating substantial savings.
- Customer Onboarding Time reduced from 50 to 4 days, significantly accelerating the client’s ability to onboard new customers.
- Site Provisioning Time for existing customers reduced from weeks to a few hours, streamlining operations and improving customer satisfaction.
- Downtime affecting customers was reduced to under 30 minutes, with critical issues resolved within 1 hour and most proactively addressed before customer notification.
Transforming Infrastructure at Scale with Azure Cloud
Data Engineering: Beyond Big Data
When a data project comes to mind, the end goal is to enhance the data. It’s about building systems to curate the data in a way that can help the business.
At the dawn of their data engineering journey, people tend to familiarize themselves with the terms “extract,” “transform,” and “load.” These terms, along with traditional data engineering, spark the image that data engineering is about the processing and movement of large amounts of data. And why not? We’ve witnessed a tremendous evolution in these technologies, from storing information in simple spreadsheets to managing massive data warehouses and data lakes, supported by advanced infrastructure capable of ingesting and processing huge data volumes.
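The extract-transform-load loop can be sketched in a few lines. This is a minimal, illustrative example using only Python's standard library; the field names and cleaning rules are hypothetical, not a reference implementation:

```python
import csv
import io

def extract(source: str) -> list[dict]:
    """Extract: parse raw CSV text into records."""
    return list(csv.DictReader(io.StringIO(source)))

def transform(records: list[dict]) -> list[dict]:
    """Transform: normalize names and drop rows missing an email."""
    return [
        {"name": r["name"].strip().title(), "email": r["email"].lower()}
        for r in records
        if r.get("email")
    ]

def load(records: list[dict], warehouse: list[dict]) -> None:
    """Load: append cleaned records to the target store."""
    warehouse.extend(records)

raw = "name,email\n alice ,ALICE@EXAMPLE.COM\nbob,\n"
warehouse: list[dict] = []
load(transform(extract(raw)), warehouse)
print(warehouse)  # one cleaned record for Alice; bob's row is rejected
```

Real pipelines swap the in-memory list for a warehouse or lake, but the shape of the three stages stays the same.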
However, this doesn’t limit data engineering to ETL; rather, it opens so many opportunities to introduce new technologies and concepts that can and are needed to support big data processing. The expectations from a modern data system extend well beyond mere data movement. There’s a strong emphasis on privacy, especially with the vast amounts of sensitive data that need protection. Speed is crucial, particularly in real-world scenarios like satellite data processing, financial trading, and data processing in healthcare, where eliminating latency is key.
With technologies like AI and machine learning driving analysis on massive datasets, data volumes will inevitably continue to grow. We’ve seen this trend before, just as we once spoke of megabytes and now regularly discuss gigabytes. In the future, we’ll likely talk about terabytes and petabytes with the same familiarity.
These growing expectations have made data engineering a sphere with numerous supporting components, and in this article, we’ll delve into some of those components.
- Data governance
- Metadata management
- Data observability
- Data quality
- Orchestration
- Visualization
Data Governance
With huge amounts of confidential business and user data moving around, handling it safely is a delicate process. We must ensure trust in data processes, and the data itself cannot be compromised. It is essential for a business onboarding users to show that their data is in safe hands. Today, when a business asks you for sensitive information, you are bound to ask questions such as:
- What if my data is compromised?
- Are we putting it to the right use?
- Who’s in control of this data? Are the right personnel using it?
- Is it compliant with the rules and regulations for data practices?
So, to answer these questions satisfactorily, data governance comes into the picture. The basic idea of data governance is that it’s a set of rules, policies, principles, or processes to maintain data integrity. It’s about how we can supervise our data and keep it safe. Think of data governance as a protective blanket that takes care of all the security risks, creates a habitable environment for data, and builds trust in data processing.
Data governance is a powerful tool in the data engineering arsenal. Its rules and principles are consistently applied throughout all data processing activities. Wherever data flows, data governance ensures that data adheres to these established protocols. By adding a sense of trust to the activities involving data, you gain the freedom to focus on your data solution without worrying about any external or internal risks. This helps in reaching the ultimate goal—to foster a culture that prioritizes and emphasizes data responsibility.
Understanding the extensive application of data governance in data engineering clearly illustrates its significance and where it needs to be implemented in real-world scenarios. In numerous entities, such as government organizations or large corporations, data sensitivity is a top priority. Misuse of this data can have widespread negative impacts. To ensure that it doesn’t happen, we can use tools to ensure oversight and compliance. Let’s briefly explore one of those tools.
Microsoft Purview
Microsoft Purview comes with a range of solutions to protect your data. Let’s look at some of its offerings.
- Insider risk management
- Microsoft Purview takes care of data security risks from people inside your organization by identifying high-risk individuals.
- It helps you classify data breaches into different sections and take appropriate action to prevent them.
- Data loss prevention
- It makes applying data loss prevention policies straightforward.
- It secures data by restricting important and sensitive data from being deleted and blocks unusual activities, like sharing sensitive data outside your organization.
- Compliance adherence
- Microsoft Purview can help you make sure that your data processes are compliant with data regulatory bodies and organizational standards.
- Information protection
- It provides granular control over data, allowing you to define strict accessibility rules.
- When you need to manage what data can be shared with specific individuals, this control restricts the data visible to others.
- Know your sensitive data
- It simplifies the process of understanding and learning about your data.
- MS Purview features ML-based classifiers that label and categorize your sensitive data, helping you identify its specific category.
Metadata Management
Another essential aspect of big data movement is metadata management.
Metadata, simply put, is data about data. This component of data engineering makes a base for huge improvements in data systems.
You might have come across a headline a while back, one that reappeared recently, about Instagram’s like counts. The story is from about a decade ago, and it tells us about metadata’s longevity and how it became a base for greater things.
At the time, Instagram showed the number of likes by running a count function on the database and storing it in a cache. This method was fine because the number wouldn’t change frequently, so the request would hit the cache and get the result. Even if the number changed, the request would query the data, and because the number was small, it wouldn’t scan a lot of rows, saving the data system from being overloaded.
However, when a celebrity posted something, it’d receive so many likes that the count would be enormous and change so frequently that looking into the cache became just an extra step.
The request would trigger a query that would repeatedly scan many rows in the database, overloading the system and causing frequent crashes.
To deal with this, Instagram came up with the idea of denormalizing the tables and storing the number of likes for each post. So, the request would result in a query where the database needs to look at only one cell to get the number of likes. To handle the issue of frequent changes in the number of likes, Instagram began updating the value at small intervals. This story tells how Instagram solved this problem with a simple tweak of using metadata.
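The fix above amounts to keeping a denormalized counter that is refreshed on an interval instead of recomputed on every read. A simplified illustration (not Instagram's actual code; the interval logic is reduced to a count of pending writes, where a real system would flush on a timer):

```python
class PostLikes:
    """Keep a denormalized like_count instead of COUNT(*)-ing a likes table."""

    def __init__(self, flush_every: int = 100):
        self.likes = []            # the normalized "likes" table
        self.like_count = 0        # denormalized value stored with the post
        self._pending = 0
        self._flush_every = flush_every

    def add_like(self, user_id: int) -> None:
        self.likes.append(user_id)
        self._pending += 1
        if self._pending >= self._flush_every:
            # periodic refresh: one cheap write instead of per-request scans
            self.like_count = len(self.likes)
            self._pending = 0

    def display_count(self) -> int:
        # readers touch a single value; they never scan the likes table
        return self.like_count

post = PostLikes(flush_every=100)
for user in range(250):
    post.add_like(user)
print(post.display_count())  # 200: refreshed at 100 and 200, 50 likes still pending
```

Readers see a count that may lag slightly behind reality, which is exactly the trade-off the story describes: slight staleness in exchange for constant-time reads.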
Metadata in data engineering has evolved to solve even more significant problems by adding a layer on top of the data flow that works as an interface to communicate with data. Metadata management has become a foundation of multiple data features such as:
- Data lineage: Stakeholders are interested in the results we get from data processes. Sometimes, in order to check the authenticity of data and get answers to questions like where the data originated from, we need to track back to the data source. Data lineage is a property that makes use of metadata to help with this scenario. Many data products like Atlan and data warehouses like Snowflake extensively use metadata for their services.
- Schema information: With a clear understanding of your data’s structure, including column details and data types, we can efficiently troubleshoot and resolve data modeling challenges.
- Data contracts: Metadata helps honor data contracts by maintaining a common data profile, which enforces a consistent data structure across all data usages.
- Stats: Managing metadata can help us easily access data statistics while also giving us quick answers to questions like what the total count of a table is, how many distinct records there are, how much space it takes, and many more.
- Access control: Metadata management also includes information about data accessibility. As we saw with the MS Purview features, we can associate a table with vital information and restrict the visibility of a table, or even a column, to the right people.
- Audit: Keeping track of information, like who accessed the data, who modified it, or who deleted it, is another important feature that a product with multiple users can benefit from.
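Several of the features above, such as schema information, stats, and audit, boil down to keeping a small record alongside each table. A toy metadata registry, purely illustrative (real systems like a Hive metastore or AWS Glue catalog are far richer):

```python
import datetime

class MetadataRegistry:
    """Toy registry tracking schema, row counts, and an access audit trail."""

    def __init__(self):
        self.tables = {}

    def register(self, table: str, schema: dict, row_count: int = 0):
        self.tables[table] = {"schema": schema, "row_count": row_count, "audit": []}

    def record_access(self, table: str, user: str, action: str):
        self.tables[table]["audit"].append(
            (user, action, datetime.datetime.now(datetime.timezone.utc))
        )

    def stats(self, table: str) -> dict:
        # answered from metadata alone -- no table scan needed
        return {"row_count": self.tables[table]["row_count"]}

registry = MetadataRegistry()
registry.register("orders", {"order_id": "string", "amount": "double"}, row_count=1_000_000)
registry.record_access("orders", "analyst_1", "read")

print(registry.stats("orders"))                 # {'row_count': 1000000}
print(len(registry.tables["orders"]["audit"]))  # 1
```

The point is that questions like "how many rows?" or "who read this table?" are answered from the registry, never by touching the data itself.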
There are many other use cases of metadata that enhance data engineering. It’s positively impacting the current landscape and shaping the future trajectory of data engineering. A very good example is a data catalog. Data catalogs focus on enriching datasets with information about data. Table formats, such as Iceberg and Delta, use catalogs to provide integration with multiple data sources, handle schema evolution, etc. Popular cloud services like AWS Glue also use metadata for features like data discovery. Tech giants like Snowflake and Databricks rely heavily on metadata for features like faster querying, time travel, and many more.
With the introduction of AI in the data domain, metadata management has a huge effect on the future trajectory of data engineering. Services such as Cortex and Fabric have integrated AI systems that use metadata for easy questioning and answering. When AI gets to know the context of data, the application of metadata becomes limitless.
Data Observability
We know how important metadata can be, and while it’s important to know your data, it’s as important to know about the processes working on it. That’s where observability enters the discussion. It is another crucial aspect of data engineering and a component we can’t miss from our data project.
Data observability is about setting up systems that can give us visibility over different services that are working on the data. Whether it’s ingestion, processing, or load operations, having visibility into data movement is essential. This not only ensures that these services remain reliable and fully operational, but it also keeps us informed about the ongoing processes. The ultimate goal is to proactively manage and optimize these operations, ensuring efficiency and smooth performance. We need to achieve this goal because it’s very likely that whenever we create a data system, multiple issues, as well as errors and bugs, will start popping out of nowhere.
So, how do we keep an eye on these services to see whether they are performing as expected? The answer to that is setting up monitoring and alerting systems.
Monitoring
Monitoring is the continuous tracking and measurement of key metrics and indicators that tell us about the system’s performance. Many cloud services offer comprehensive performance metrics, presented through interactive visuals. These tools provide valuable insights, such as throughput, which measures the volume of data processed per second, and latency, which indicates how long it takes to process the data. They also track errors and error rates, detailing the types of errors and how frequently they happen.
To lay the base for monitoring, there are tools like Prometheus and Datadog, which provide these monitoring features, indicating the performance of data systems and their underlying infrastructure. We also have Graylog, which offers multiple features for monitoring a system’s logs in real time.
Now that we have a system that gives us visibility into the performance of processes, we need a setup that can notify us whenever anything goes sideways.
Alerting
Setting up alerting systems allows us to receive notifications directly within the applications we use regularly, eliminating the need for someone to constantly monitor metrics on a UI or watch graphs all day, which would be a waste of time and resources. This is why alerting systems are designed to trigger notifications based on predefined thresholds, such as throughput dropping below a certain level, latency exceeding a specific duration, or the occurrence of specific errors. These alerts can be sent to channels like email or Slack, ensuring that users are immediately aware of any unusual conditions in their data processes.
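The threshold idea can be sketched as a simple rule check over a metrics snapshot. The metric names and thresholds below are hypothetical; in practice this logic lives in something like a Prometheus alert rule or a Datadog monitor, with notifications routed to email or Slack:

```python
def check_alerts(metrics: dict, rules: dict) -> list[str]:
    """Return alert messages for any metric breaching its rule."""
    alerts = []
    for name, (op, threshold) in rules.items():
        value = metrics.get(name)
        if value is None:
            continue  # metric not reported in this snapshot
        breached = value < threshold if op == "below" else value > threshold
        if breached:
            alerts.append(f"ALERT: {name}={value} ({op} {threshold})")
    return alerts

rules = {
    "throughput_rps": ("below", 500),   # records/sec should stay above 500
    "latency_ms":     ("above", 200),   # processing latency should stay under 200 ms
    "error_rate":     ("above", 0.01),  # more than 1% errors is abnormal
}
snapshot = {"throughput_rps": 320, "latency_ms": 180, "error_rate": 0.03}

for alert in check_alerts(snapshot, rules):
    print(alert)  # fires for throughput_rps and error_rate, not latency_ms
```

Each fired alert would then be dispatched to a notification channel rather than printed.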
Implementing observability will significantly impact data systems. By setting up monitoring and alerting, we can quickly identify issues as they arise and gain context about the nature of the errors. This insight allows us to pinpoint the source of problems, effectively debug and rectify them, and ultimately reduce downtime and service disruptions, saving valuable time and resources.
Data Quality
Knowing the data and its processes is undoubtedly important, but all this knowledge is futile if the data itself is of poor quality. That’s where the other essential component of data engineering, data quality, comes into play because data processing is one thing; preparing the data for processing is another.
In a data project involving multiple sources and formats, various discrepancies are likely to arise. These can include missing values, where essential data points are absent; outdated data, which no longer reflects current information; poorly formatted data that doesn’t conform to expected standards; incorrect data types that lead to processing errors; and duplicate rows that skew results and analyses. Addressing these issues will ensure the accuracy and reliability of the data used in the project.
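The discrepancies listed above map directly onto simple programmatic checks. A minimal sketch (field names and rules are hypothetical; libraries like Great Expectations formalize this pattern):

```python
def quality_report(rows: list[dict], required: list[str]) -> dict:
    """Flag missing values, duplicate rows, and bad types in a batch of records."""
    missing = sum(1 for r in rows for f in required if not r.get(f))
    seen, duplicates = set(), 0
    for r in rows:
        key = tuple(sorted(r.items()))
        duplicates += key in seen   # exact-duplicate record
        seen.add(key)
    # assumed rule for this sketch: "age" must be a whole number
    bad_types = sum(1 for r in rows if not str(r.get("age", "")).isdigit())
    return {"missing": missing, "duplicates": duplicates, "bad_types": bad_types}

rows = [
    {"id": "1", "age": "34"},
    {"id": "1", "age": "34"},      # exact duplicate
    {"id": "2", "age": "thirty"},  # wrong type
    {"id": "", "age": "28"},       # missing id
]
print(quality_report(rows, required=["id", "age"]))
# {'missing': 1, 'duplicates': 1, 'bad_types': 1}
```

Running such a report at each pipeline stage, as the next paragraphs describe, is what keeps standards consistent through the data flow.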
Data quality involves enhancing data with key attributes. For instance, accuracy measures how closely the data reflects reality, validity ensures that the data accurately represents what we aim to measure, and completeness guarantees that no critical data is missing. Additionally, attributes like timeliness ensure the data is up to date. Ultimately, data quality is about embedding attributes that build trust in the data. For a deeper dive into this, check out Rita’s blog on Data QA: The Need of the Hour.
Data quality plays a crucial role in elevating other processes in data engineering. In a data engineering project, there are often multiple entry points for data processing, with data being refined at different stages to achieve a better state each time. Assessing data at the source of each processing stage and addressing issues early on is vital. This approach ensures that data standards are maintained throughout the data flow. As a result, by making data consistent at every step, we gain improved control over the entire data lifecycle.
Data tools like Great Expectations and data unit test libraries such as Deequ play a crucial role in safeguarding data pipelines by implementing data quality checks and validations. To gain more context on this, you might want to read Unit Testing Data at Scale using Deequ and Apache Spark by Nishant. These tools ensure that data meets predefined standards, allowing for early detection of issues and maintaining the integrity of data as it moves through the pipeline.
Orchestration
With so many processes in place, it’s essential to ensure everything happens at the right time and in the right way. Relying on someone to manually trigger processes at scheduled times every day is an inefficient use of resources. For that individual, performing the same repetitive tasks can quickly become monotonous. Beyond that, manual execution increases the risk of missing schedules or running tasks out of order, disrupting the entire workflow.
This is where orchestration comes to the rescue, automating tedious, repetitive tasks and ensuring precision in the timing of data flows. Data pipelines can be complex, involving many interconnected components that must work together seamlessly. Orchestration ensures that each component follows a defined set of rules, dictating when to start, what to do, and how to contribute to the overall process of handling data, thus maintaining smooth and efficient operations.
This automation helps reduce errors that could occur with manual execution, ensuring that data processes remain consistent by streamlining repetitive tasks. With a number of different orchestration tools and services in place, we can now monitor and manage everything from a single platform. Tools like Airflow, an open-source orchestrator, Prefect, which offers a user-friendly drag-and-drop interface, and cloud services such as Azure Data Factory, Google Cloud Composer, and AWS Step Functions, enhance our visibility and control over the entire process lifecycle, making data management more efficient and reliable. Don’t miss Shreyash’s excellent blog on Mage: Your New Go-To Tool for Data Orchestration.
Orchestration is built on a foundation of multiple concepts and technologies that make it robust and fail-safe. These underlying principles ensure that orchestration not only automates processes but also maintains reliability and resilience, even in complex and demanding data environments.
- Workflow definition: This defines how tasks in the pipeline are organized and executed. It lays out the sequence of tasks—telling it what needs to be finished before other tasks can start—and takes care of other conditions for pipeline execution. Think of it like a roadmap that guides the flow of tasks.
- Task scheduling: This determines when and how tasks are executed. Tasks might run at specific times, in response to events, or based on the completion of other tasks. It’s like scheduling appointments for tasks to ensure they happen at the right time and with the right resources.
- Dependency management: Since tasks often rely on each other, with the concepts of dependency management, we can ensure that tasks run in the correct order. It ensures that each process starts only when its prerequisites are met, like waiting for a green light before proceeding.
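Dependency management is, at its core, a topological ordering problem: each task runs only after its prerequisites finish. A minimal sketch of how an orchestrator might derive an execution order, using Python's standard library (the task names are hypothetical; tools like Airflow express the same idea as a DAG of operators):

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks it depends on.
pipeline = {
    "extract":   set(),
    "validate":  {"extract"},
    "transform": {"validate"},
    "load":      {"transform"},
    "report":    {"load"},
}

order = list(TopologicalSorter(pipeline).static_order())
print(order)  # ['extract', 'validate', 'transform', 'load', 'report']
```

With a branching graph, `TopologicalSorter` also exposes which tasks are ready at each step, which is how an orchestrator runs independent tasks in parallel.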
With these concepts, orchestration tools provide powerful features for workflow design and management, enabling the definition of complex, multi-step processes. They support parallel, sequential, and conditional execution of tasks, allowing for flexibility in how workflows are executed. Not just that, they also offer event-driven and real-time orchestration, enabling systems to respond to dynamic changes and triggers as they occur. These tools also include robust error handling and exception management, ensuring that workflows are resilient and fault-tolerant.
Visualization
The true value lies not just in collecting vast amounts of data but in interpreting it in ways that generate real business value. This makes data visualization a vital component: it provides a clear and accurate representation of data that decision-makers can easily understand and use. Presenting data in the right way enables businesses to derive intelligence from it, which is what makes data engineering worth the investment; it guides strategic decisions, optimizes operations, and powers innovation.
Visualizations allow us to see patterns, trends, and anomalies that might not be apparent in raw data. Whether it’s spotting a sudden drop in sales, detecting anomalies in customer behavior, or forecasting future performance, data visualization can provide the clear context needed to make well-informed decisions. When numbers and graphs are presented effectively, it feels as though we are directly communicating with the data, and this language of communication bridges the gap between technical experts and business leaders.
Visualization Within ETL Processes
Visualization isn’t just a final output. It can also be a valuable tool within the data engineering process itself. Intermediate visualization during the ETL workflow can be a game-changer. In collaborative teams, as we go through the transformation process, visualizing it at various stages helps ensure the accuracy and relevance of the result. We can understand the datasets better, identify issues or anomalies between different stages, and make more informed decisions about the transformations needed.
Technologies like Fabric and Mage enable seamless integration of visualizations into ETL pipelines. These tools empower team members at all levels to actively engage with data, ask insightful questions, and contribute to the decision-making process. Visualizing datasets at key points provides the flexibility to verify that data is being processed correctly, develop accurate analytical formulas, and ensure that the final outputs are meaningful.
Depending on the industry and domain, there are various visualization tools suited to different use cases. For example,
- For real-time insights, which are crucial in industries like healthcare, financial trading, and air travel, tools such as Tableau and Striim are invaluable. These tools allow for immediate visualization of live data, enabling quick and informed decision-making.
- For broad data source integrations and dynamic dashboard querying, often demanded in the technology sector, tools like Power BI, Metabase, and Grafana are highly effective. These platforms support a wide range of data sources and offer flexible, interactive dashboards that facilitate deep analysis and exploration of data.
It’s Limitless
We are seeing many advancements in this domain, which are helping businesses, data science, AI and ML, and many other sectors because the potential of data is huge. If a business knows how to use data, it can be a major factor in its success. And for that reason, we have constantly seen the rise of different components in data engineering. All with one goal: to make data useful.
Recently, we’ve witnessed the introduction of numerous technologies poised to revolutionize the data engineering domain. Concepts like data mesh are enhancing data discovery, improving data ownership, and streamlining data workflows. AI-driven data engineering is rapidly advancing, with expectations to automate key processes such as data cleansing, pipeline optimization, and data validation. We’re already seeing how cloud data services have evolved to embrace AI and machine learning, ensuring seamless integration with data initiatives. The rise of real-time data processing brings new use cases and advancements, while practices like DataOps foster better collaboration among teams. Take a closer look at the modern data stack in Shivam’s detailed article, Modern Data Stack: The What, Why, and How?
These developments are accompanied by a wide array of technologies designed to support infrastructure, analytics, AI, and machine learning, alongside enterprise tools that lay the foundation for this ongoing evolution. All these elements collectively set the stage for a broader discussion on data engineering and what lies beyond big data. Big data, supported by these satellite activities, aims to extract maximum value from data, unlocking its full potential.
React Native: Session Replay with Microsoft Clarity
Microsoft recently launched session replay support for iOS, covering both native iOS and React Native applications. We decided to see how it performs compared to competitors like LogRocket and UXCam.
This blog discusses what session replay is, how it works, and its benefits for debugging applications and understanding user behavior. Below, we will explore the key features of session replay, walk through integrating Microsoft Clarity into a React Native application, and benchmark its performance against other popular tools.
Key Features of Session Replay
Session replay provides a visual playback of user interactions on your application. This allows developers to observe how users navigate the app, identify any issues they encounter, and understand user behavior patterns. Here are some of the standout features:
- User Interaction Tracking: Record clicks, scrolls, and navigation paths for a comprehensive view of user activities.
- Error Monitoring: Capture and analyze errors in real time to quickly diagnose and fix issues.
- Heatmaps: Visualize areas of high interaction to understand which parts of the app are most engaging.
- Anonymized Data: Ensure user privacy by anonymizing sensitive information during session recording.
Integrating Microsoft Clarity with React Native
Integrating Microsoft Clarity into your React Native application is a straightforward process. Follow these steps to get started:
- Sign Up for Microsoft Clarity:
a. Visit the Microsoft Clarity website and sign up for a free account.
b. Create a new project and obtain your Clarity tracking code.
- Install the Clarity SDK:
Use npm or yarn to install the Clarity SDK in your React Native project:
```shell
npm install clarity@latest
# or
yarn add clarity@latest
```
- Initialize Clarity in Your App:
Import and initialize Clarity in your main application file (e.g., App.js):
```javascript
import Clarity from 'clarity';

Clarity.initialize('YOUR_CLARITY_TRACKING_CODE');
```
- Verify Integration:
a. Run your application and navigate through various screens to ensure Clarity is capturing session data correctly.
b. Log into your Clarity dashboard to see the recorded sessions and analytics.
Benchmarking Against Competitors
To evaluate the performance of Microsoft Clarity, we’ll compare it against two popular session replay tools, LogRocket and UXCam, assessing them based on the following criteria:
- Ease of Integration: How simple is integrating the tool into a React Native application?
- Feature Set: What features does each tool offer for session replay and user behavior analysis?
- Performance Impact: How does the tool impact the app’s performance and user experience?
- Cost: What are the pricing models and how do they compare?
Detailed Comparison
Ease of Integration
- Microsoft Clarity: The integration process is straightforward and well-documented, making it easy for developers to get started.
- LogRocket: LogRocket also offers a simple integration process with comprehensive documentation and support.
- UXCam: UXCam provides detailed guides and support for integration, but it may require additional configuration steps compared to Clarity and LogRocket.
Feature Set
- Microsoft Clarity: Offers robust session replay, heatmaps, and error monitoring. However, it may lack some advanced features found in premium tools.
- LogRocket: Provides a rich set of features, including session replay, performance monitoring, network request logs, and integration with other tools like Redux and GraphQL.
- UXCam: Focuses on mobile app analytics with features like session replay, screen flow analysis, and retention tracking.
Performance Impact
- Microsoft Clarity: Minimal impact on app performance, making it a suitable choice for most applications.
- LogRocket: Slightly heavier than Clarity but offers more advanced features. Performance impact is manageable with proper configuration.
- UXCam: Designed for mobile apps with performance optimization in mind. The impact is generally low but can vary based on app complexity.
Cost
- Microsoft Clarity: Free to use, making it an excellent option for startups and small teams.
- LogRocket: Offers tiered pricing plans, with a free tier for basic usage and paid plans for advanced features.
- UXCam: Provides a range of pricing options, including a free tier. Paid plans offer more advanced features and higher data limits.
Final Verdict
After evaluating the key aspects of session replay tools, Microsoft Clarity stands out as a strong contender, especially for teams looking for a cost-effective solution with essential features. LogRocket and UXCam offer more advanced capabilities, which may be beneficial for larger teams or more complex applications.
Ultimately, the right tool will depend on your specific needs and budget. For basic session replay and user behavior insights, Microsoft Clarity is a fantastic choice. If you require more comprehensive analytics and integrations, LogRocket or UXCam may be worth the investment.
Sample App
I have also created a basic sample app to demonstrate how to set up Microsoft Clarity for React Native apps.
Please check it out here: https://github.com/rakesho-vel/ms-rn-clarity-sample-app
This sample video shows how Microsoft Clarity records and lets you review user sessions on its dashboard.
Iceberg: Features and Hands-on (Part 2)
In the previous blog, we discussed Apache Iceberg’s basic concepts, the setup process, and how to load data. Now, we will delve into some of Iceberg’s advanced features, including upsert functionality, schema evolution, time travel, and partitioning.
Upsert Functionality
One of Iceberg’s key features is its support for upserts. Upsert, which stands for update and insert, allows you to efficiently manage changes to your data. With Iceberg, you can perform these operations seamlessly, ensuring that your data remains accurate and up-to-date without the need for complex and time-consuming processes.
Schema Evolution
Schema evolution is another of its powerful features. Over time, the schema of your data may need to change due to new requirements or updates. Iceberg handles schema changes gracefully, allowing you to add, remove, or modify columns without having to rewrite your entire dataset. This flexibility ensures that your data architecture can evolve in tandem with your business needs.
Time Travel
Iceberg also provides time travel capabilities, enabling you to query historical data as it existed at any given point in time. This feature is particularly useful for debugging, auditing, and compliance purposes. By leveraging snapshots, you can easily access previous states of your data and perform analyses on how it has changed over time.
Set Up Iceberg on the Local Machine Using the Local Catalog Option or Hive
You can also configure Iceberg in your Spark session like this:
```python
import pyspark

spark = (
    pyspark.sql.SparkSession.builder
    .config('spark.jars.packages', 'org.apache.iceberg:iceberg-spark-runtime-3.2_2.12:1.1.0')
    .config('spark.sql.extensions', 'org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions')
    .config('spark.sql.catalog.spark_catalog.type', 'hive')
    .config('spark.sql.catalog.local', 'org.apache.iceberg.spark.SparkCatalog')
    .config('spark.sql.catalog.local.type', 'hadoop')
    .config('spark.sql.catalog.local.warehouse', './Data-Engineering/warehouse')
    .getOrCreate()
)
```
Some configurations must be passed while setting up Iceberg.
Create Tables in Iceberg and Insert Data
```sql
CREATE TABLE demo.db.data_sample (
    index string,
    organization_id string,
    name string,
    website string,
    country string,
    description string,
    founded string,
    industry string,
    num_of_employees string
) USING iceberg
```
```python
df = spark.read.option("header", "true").csv("../data/input-data/organizations-100.csv")
df.writeTo("demo.db.data_sample").append()
```
We can either create the sample table using Spark SQL or directly write the data by mentioning the DB name and table name, which will create the Iceberg table for us.

You can see the data we have inserted. Apart from appending, you can use the overwrite method as well, just as with Delta Lake tables. You can also see an example of how to read the data from an Iceberg table.
Handling Upserts
This Iceberg feature is similar to Delta Lake’s. You can update records in existing Iceberg tables without rewriting the complete dataset, which is also useful for handling CDC operations. We can take input from any incoming CSV and merge the data into the existing table without any duplication, so the table always holds a single record for each primary key. This is how Iceberg maintains ACID properties.
Incoming Data
```python
input_data = spark.read.option("header", "true").csv("../data/input-data/organizations-11111.csv")
# Creating the temp view of that dataframe to merge
input_data.createOrReplaceTempView("input_data")
spark.sql("select * from input_data").show()
```
We will merge this data into our existing Iceberg Table using Spark SQL.
```sql
MERGE INTO demo.db.data_sample t
USING (SELECT * FROM input_data) s
ON t.organization_id = s.organization_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;

SELECT * FROM demo.db.data_sample;
```

Here, we can see the data once the merge operation has taken place.

Schema Evolution
Iceberg supports the following schema evolution changes:
- Add – Add a new column to the Iceberg table
- Drop – Remove a column from the existing table
- Rename – Change the name of a column in the existing table
- Update – Change the data type or partition columns of the Iceberg table
- Reorder – Change the order of columns in the Iceberg table
After updating the schema, there is no need to overwrite or rewrite the data. For example, if your table previously had four columns, all containing data, and you add two more, you don't need to rewrite the existing data; you can still access it easily with six columns. This feature was lacking in Delta Lake but is present here. Some characteristics of Iceberg schema evolution:
- If we add any columns, they won’t impact the existing columns.
- If we delete or drop any columns, they won’t impact other columns.
- Updating a column or field does not change values in any other column.
Iceberg uses unique IDs to track each column added to a table.
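A toy sketch of why ID-based tracking makes these changes safe (hypothetical names, not Iceberg's actual internals): data files reference column IDs rather than names, so a rename only touches the schema metadata.

```python
# Hypothetical illustration: columns are tracked by unique IDs, so data
# files reference IDs rather than names. Renaming a column updates the
# schema only; the data files never need rewriting.
schema = {1: "organization_id", 2: "num_of_employees"}
data_file_columns = [1, 2]            # data files store column IDs

schema[2] = "employee_count"          # rename: only schema metadata changes...
resolved = [schema[c] for c in data_file_columns]  # ...reads resolve the new names
```

Because the IDs never change, adds, drops, and renames cannot corrupt how existing data files are interpreted.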
Let’s run some queries to update the schema or try to drop some columns.
```sql
%%sql
ALTER TABLE demo.db.data_sample
ADD COLUMN fare_per_distance_unit float AFTER num_of_employees;
```

After adding another column, if we try to access the data again from the table, we can do so without seeing any kind of error. This is also how Iceberg solves schema-related problems.
Partition Evolution and Sort Order Evolution
Iceberg came up with this option, which was missing in Delta Lake. When you evolve a partition spec, the old data written with an earlier spec remains unchanged. New data is written using the new spec in a new layout. Metadata for each of the partition versions is kept separately. Because of this, when you start writing queries, you get split planning. This is where each partition layout plans files separately using the filter it derives for that specific partition layout.
Similar to partition spec, Iceberg sort order can also be updated in an existing table. When you evolve a sort order, the old data written with an earlier order remains unchanged.
```sql
%%sql
ALTER TABLE demo.db.data_sample ADD PARTITION FIELD founded;

DESCRIBE TABLE demo.db.data_sample;
```
Copy-on-Write (COW) and Merge-on-Read (MOR)
Iceberg supports both COW and MOR while loading data into an Iceberg table. We can set this configuration either by altering the table or while creating the Iceberg table.
Copy-On-Write (COW) – Best for tables with frequent reads, infrequent writes/updates, or large batch updates:
When your workload reads frequently but writes and updates less often, you can configure this property on an Iceberg table. In COW, when we update or delete any rows, a new data file with another version is created, and the latest version holds the updated data. Because the data is rewritten whenever updates or deletions occur, writes are slower and can become a bottleneck for large updates. As the name specifies, it creates another copy of the data on write.
Reads, however, are ideal: since nothing needs to be merged at query time, we can read the data faster.

Merge-On-Read (MOR) – Best for tables with frequent writes/updates:
This is just the opposite of COW: we do not rewrite the data on the update or deletion of rows. Instead, a change log with the updated records is created, and it is merged with the original data file at read time to produce the new state with updated records.
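Conceptually (a plain-Python sketch, not Iceberg's actual file formats), merge-on-read leaves the base data untouched and applies the change log at query time:

```python
# Conceptual merge-on-read: writes only append to a change log; reads
# merge the log with the base data to produce the current state.
base = {1: "alice", 2: "bob", 3: "carol"}
change_log = [("update", 2, "bobby"), ("delete", 1, None)]

def read_merged(base, log):
    view = dict(base)                 # the base data is never rewritten
    for op, key, value in log:
        if op == "update":
            view[key] = value
        elif op == "delete":
            view.pop(key, None)
    return view

current = read_merged(base, change_log)
```

Writes stay cheap (append a log entry), while reads pay the merging cost, which is exactly the trade-off reversed from copy-on-write.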


Supported query engines and integrations:

Conclusion
After performing this research, we learned about Iceberg's features and its compatibility with various metastores for integration. We covered the basics of configuring Iceberg locally as well as on different cloud platforms, and explored upserts, schema evolution, and partition evolution.
-
Data QA: The Need of the Hour
Have you ever encountered vague or misleading data analytics reports? Are you struggling to provide accurate data values to your end users? Have you ever experienced being misdirected by a geographical map application, leading you to the wrong destination? Imagine Amazon customers expressing dissatisfaction due to receiving the wrong product at their doorstep.
These issues stem from the use of incorrect or vague data by application/service providers. The need of the hour is to address these challenges by enhancing data quality processes and implementing robust data quality solutions. Through effective data management and validation, organizations can unlock valuable insights and make informed decisions.
“Harnessing the potential of clean data is like painting a masterpiece with accurate brushstrokes.”
Introduction
Data quality assurance (QA) is the systematic approach organizations use to ensure they have reliable, correct, consistent, and relevant data. It involves various methods, approaches, and tools to maintain good data quality from commencement to termination.
What is Data Quality?
Data quality refers to the overall utility of a dataset and its ability to be easily processed and analyzed for other uses. It is an integral part of data governance that ensures your organization’s data is fit for purpose.
How can I measure Data Quality?

What is the critical importance of Data Quality?

Remember, good data is super important! So, invest in good data—it’s the secret sauce for business success!
What are the Data Quality Challenges?
1. Data quality issues on production:
Production-specific data quality issues are primarily caused by unexpected changes in the data and infrastructure failures.
A. Source and third-party data changes:
External data sources, like websites or companies, may introduce errors or inconsistencies, making it challenging to use the data accurately. These issues can lead to system errors or missing values, which might go unnoticed without proper monitoring.
Example:
- File formats change without warning:
Imagine we’re using an API to get data in CSV format, and we’ve made a pipeline that handles it well.
```python
import csv

def process_csv_data(csv_file):
    with open(csv_file, 'r') as file:
        csv_reader = csv.DictReader(file)
        for row in csv_reader:
            print(row)

csv_file = 'data.csv'
process_csv_data(csv_file)
```

Then the data source switches to the JSON format, breaking our pipeline. This inconsistency can cause errors or missing data if our system can’t adapt. Monitoring and adjustments will ensure the accuracy of data analysis or applications.
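One defensive option (a sketch, with hypothetical file paths) is to detect the format before parsing instead of assuming CSV:

```python
import csv
import json

def load_records(path):
    """Load records from a file that may be CSV or JSON, detecting the format."""
    with open(path, "r") as f:
        text = f.read()
    if text.lstrip().startswith(("[", "{")):        # looks like JSON
        data = json.loads(text)
        return data if isinstance(data, list) else [data]
    return list(csv.DictReader(text.splitlines()))  # fall back to CSV
```

This doesn't remove the need for monitoring, but it keeps the pipeline running when the upstream format flips between the two known variants.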
- Malformed data values and schema changes:
Suppose we’re handling inventory data for an e-commerce site. The starting schema for our inventory dataset might have fields like:

Now, imagine that the inventory file’s schema changed suddenly. A “quantity” column has been renamed to “qty,” and the last_updated_at timestamp format switches to epoch timestamp.

This change might not be communicated in advance, leaving our data pipeline unprepared to handle the new field and time format.
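A small normalization layer (a sketch, using the hypothetical field names from the example above) can absorb this kind of drift by mapping known aliases and timestamp formats back to the expected schema:

```python
from datetime import datetime, timezone

# Known upstream renames (hypothetical mapping for this example)
FIELD_ALIASES = {"qty": "quantity"}

def normalize_inventory(record):
    """Map aliased field names and epoch timestamps back to the expected schema."""
    out = {FIELD_ALIASES.get(key, key): value for key, value in record.items()}
    ts = out.get("last_updated_at")
    if isinstance(ts, (int, float)):  # epoch seconds -> ISO-8601 string
        out["last_updated_at"] = datetime.fromtimestamp(ts, tz=timezone.utc).isoformat()
    return out
```

This only covers changes you know about; a monitoring tool is still needed to surface the renames you haven't seen yet.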
B. Infrastructure failures:
Reliable software is crucial for processing large data volumes, but even the best tools can encounter issues. Infrastructure failures, like glitches or overloads, can disrupt data processing regardless of the software used.
Solution:
Data observability tools such as Monte Carlo, BigEye, and Great Expectations help detect these issues by monitoring for changes in data quality and infrastructure performance. These tools are essential for identifying the root causes of data problems and alerting on them, ensuring data reliability in production environments.
2. Data quality issues during development:
Development-specific data quality issues are primarily caused by untested code changes.
A. Incorrect parsing of data:
Data transformation bugs can occur due to mistakes in code or parsing, leading to data type mismatches or schema inaccuracies.
Example:
Imagine we’re converting a date string (“YYYY-MM-DD”) to a Unix epoch timestamp using Python. But misunderstanding the strptime() function’s format specifier leads to unexpected outcomes.
```python
from datetime import datetime

timestamp_str = "2024-05-10"  # incoming data is actually in "%Y-%d-%m" (year-day-month) format
# Incorrectly using '%m' in the middle position (the incoming format has '%d' there)
format_date = "%Y-%m-%d"
timestamp_dt = datetime.strptime(timestamp_str, format_date)
epoch_seconds = int(timestamp_dt.timestamp())
```

This error makes strptime() interpret “2024” as the year, “05” as the month (instead of the day), and “10” as the day (instead of the month), leading to inaccurate data in the timestamp_dt variable.
B. Misapplied or misunderstood requirements:
Even with the right code, data quality problems can still occur if requirements are misunderstood, resulting in logic errors and data quality issues.
Example:
Imagine we’re assigned to validate product prices in a dataset, ensuring they fall strictly between $10 and $100.

```python
product_prices = [10, 5, 25, 50, 75, 110]
valid_prices = []
for price in product_prices:
    if price >= 10 and price <= 100:
        valid_prices.append(price)
print("Valid prices:", valid_prices)
```

The requirement states prices should fall strictly between $10 and $100, but a misinterpretation leads the code to check whether prices are >= $10 and <= $100. This makes the boundary value $10 valid, causing a data quality problem.
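Under the exclusive reading of the requirement, the corrected check uses strict comparisons so the boundary values are rejected:

```python
product_prices = [10, 5, 25, 50, 75, 110]

# Requirement (exclusive reading): prices must be strictly between $10 and $100.
valid_prices = [p for p in product_prices if 10 < p < 100]
# 10 is now correctly rejected, along with 5 and 110
```

The real fix, of course, is confirming with the requirement's author whether the bounds are inclusive before writing the check.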
C. Unaccounted downstream dependencies:
Despite careful planning and logic, data quality incidents can occur due to overlooked dependencies. Understanding data lineage and communicating effectively across all users is crucial to preventing such incidents.
Example:
Suppose we’re working on a database schema migration project for an e-commerce system. In the process, we rename the order_date column to purchase_date in the orders table. Despite careful planning and testing, a data quality issue arises due to an overlooked downstream dependency. The marketing team’s reporting dashboard relies on a SQL query referencing the order_date column, now renamed purchase_date, resulting in inaccurate reporting and potentially misinformed business decisions.
Here’s an example SQL query that represents the overlooked downstream dependency:
```sql
-- SQL query used by the marketing team's reporting dashboard
SELECT DATE_TRUNC('month', order_date) AS month,
       SUM(total_amount) AS total_sales
FROM orders
GROUP BY DATE_TRUNC('month', order_date)
```

This SQL query relies on the order_date column to calculate monthly sales metrics. After the schema migration, this column no longer exists, causing query failure and inaccurate reporting.
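A lightweight safeguard (a sketch, with hypothetical query names) is to scan registered queries for references to a column before renaming it:

```python
# Hypothetical pre-migration check: scan saved dashboard queries for
# references to a column we are about to rename.
saved_queries = {
    "marketing_monthly_sales": (
        "SELECT DATE_TRUNC('month', order_date) AS month, SUM(total_amount) "
        "FROM orders GROUP BY 1"
    ),
    "finance_totals": "SELECT SUM(total_amount) FROM orders",
}

def find_dependents(queries, column_name):
    """Return the names of queries that reference the given column."""
    return sorted(name for name, sql in queries.items() if column_name in sql)

dependents = find_dependents(saved_queries, "order_date")
```

A plain substring match is crude (it can't tell columns from aliases), but even this catches the marketing dashboard before the migration ships; dedicated lineage tools do the same job more precisely.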
Solutions:
Data Quality tools like Great Expectations and Deequ proactively catch data quality issues by testing changes introduced from data-processing code, preventing issues from reaching production.
a. Testing assertions: Assertions validate data against expectations, ensuring data integrity. While useful, they require careful maintenance and should be selectively applied.
Example:
Suppose we have an “orders” table in our dbt project and need to ensure the “total_amount” column contains only numeric values; we can write a dbt test to validate this data quality rule.

```yaml
version: 2
models:
  - name: orders
    columns:
      - name: total_amount
        tests:
          - data_type: numeric
```

In this dbt test code:
- We specify the dbt version (version: 2), model named “orders,” and “total_amount” column.
- Within the “total_amount” column definition, we add a test named “data_type” with the value “numeric,” ensuring the column contains only numeric data.
- Running the dbt test command will execute this test, checking if the “total_amount” column adheres to the numeric data type. Any failure indicates a data quality issue.
b. Comparing staging and production data: Data Diff is a CLI tool that compares datasets within or across databases, highlighting changes in data similar to how git diff highlights changes in source code, aiding in detecting data quality issues early in the development process.
Here’s a data-diff example between staging and production databases for the payment_table.
```shell
data-diff staging_db_connection staging_payment_table \
          production_db_connection production_payment_table \
          -k primary_key \
          -c "payment_amount, payment_type, payment_currency" \
          -w filter_condition   # optional
```
Source: https://docs.datafold.com/data_diff/what_is_data_diff
What are some best practices for maintaining high-quality data?
- Establish Data Standards: Define clear data standards and guidelines for data collection, storage, and usage to ensure consistency and accuracy across the organization.
- Data Validation: Implement validation checks to ensure data conforms to predefined rules and standards, identifying and correcting errors early in the data lifecycle.
- Regular Data Cleansing: Schedule regular data cleansing activities to identify and correct inaccuracies, inconsistencies, and duplicates in the data, ensuring its reliability and integrity over time.
- Data Governance: Establish data governance policies and procedures to manage data assets effectively, including roles and responsibilities, data ownership, access controls, and compliance with regulations.
- Metadata Management: Maintain comprehensive metadata to document data lineage, definitions, and usage, providing transparency and context for data consumers and stakeholders.
- Data Security: Implement robust data security measures to protect sensitive information from unauthorized access, ensuring data confidentiality, integrity, and availability.
- Data Quality Monitoring: Continuously monitor data quality metrics and KPIs to track performance, detect anomalies, and identify areas for improvement, enabling proactive data quality management.
- Data Training and Awareness: Provide data training and awareness programs for employees to enhance their understanding of data quality principles, practices, and tools, fostering a data-driven culture within the organization.
- Collaboration and Communication: Encourage collaboration and communication among stakeholders, data stewards, and IT teams to address data quality issues effectively and promote accountability and ownership of data quality initiatives.
- Continuous Improvement: Establish a culture of continuous improvement by regularly reviewing and refining data quality processes, tools, and strategies based on feedback, lessons learned, and evolving business needs.
Can you recommend any tools for improving data quality?
- AWS Deequ: AWS Deequ is an open-source data quality library built on top of Apache Spark. It provides tools for defining data quality rules and validating large-scale datasets in Spark-based data processing pipelines.

- Great Expectations: Great Expectations is an open-source data validation framework; its managed offering, GX Cloud, is a fully managed SaaS solution that simplifies deployment, scaling, and collaboration and lets you focus on data validation.

- Soda: Soda allows data engineers to test data quality early and often in pipelines to catch data quality issues before they have a downstream impact.

- Datafold: Datafold is a cloud-based data quality platform that automates and simplifies the process of monitoring and validating data pipelines. It offers features such as automated data comparison, anomaly detection, and integration with popular data processing tools like dbt.
Considerations for Selecting a Data QA Tool:
Selecting a data QA (Quality Assurance) tool hinges on your specific needs and requirements. Consider factors such as:
1. Scalability and Performance: Ensure the tool can handle current and future data volumes efficiently, with real-time processing capabilities.
Example: Great Expectations helps validate data in a big data environment by providing a scalable and customizable way to define and monitor data quality across different sources.
2. Data Profiling and Cleansing Capabilities: Look for comprehensive data profiling and cleansing features to detect anomalies and improve data quality.
Example: AWS Glue DataBrew offers profiling, cleaning and normalizing, data lineage mapping, and automation of data cleaning and normalization tasks.
3. Data Monitoring Features: Choose tools with continuous monitoring capabilities, allowing you to track metrics and establish data lineage.
Example: Datafold’s monitoring feature allows data engineers to write SQL commands to find anomalies and create automated alerts.
4. Seamless Integration with Existing Systems: Select a tool compatible with your existing systems to minimize disruption and facilitate seamless integration.
Example: dbt offers seamless integration with existing data infrastructure, including data warehouses and BI tools. It allows users to define data transformation pipelines using SQL, making it compatible with a wide range of data systems.
5. User-Friendly Interface: Prioritize tools with intuitive interfaces for quick adoption and minimal training requirements.
Example: Soda SQL is an open-source tool with a simple command-line interface (CLI) and Python library to test your data through metric collection.
6. Flexibility and Customization Options: Seek tools that offer flexibility to adapt to changing data requirements and allow customization of rules and workflows.
Example: dbt offers flexibility and customization options for defining data transformation workflows.
7. Vendor Support and Community: Evaluate vendors based on their support reputation and active user communities for shared knowledge and resources.
Example: AWS Deequ is supported by Amazon Web Services (AWS) and has an active community of users. It provides comprehensive documentation, tutorials, and forums for users to seek assistance and share knowledge about data quality best practices.
8. Pricing and Licensing Options: Consider pricing models that align with your budget and expected data usage, such as subscription-based or volume-based pricing.
Example: Great Expectations offers flexible pricing and licensing options, including both an open-source (freely available) edition and an enterprise edition (subscription-based).
Ultimately, the right tool should effectively address your data quality challenges and seamlessly fit into your data infrastructure and workflows.
Conclusion: The Vital Role of Data Quality
In conclusion, data quality is paramount in today’s digital age. It underpins informed decisions, strategic formulation, and business success. Without it, organizations risk flawed judgments, inefficiencies, and competitiveness loss. Recognizing its vital role empowers businesses to drive innovation, enhance customer experiences, and achieve sustainable growth. Investing in robust data management, embracing technology, and fostering data integrity are essential. Prioritizing data quality is key to seizing new opportunities and staying ahead in the data-driven landscape.
References:
https://docs.getdbt.com/docs/build/data-tests
https://aws.amazon.com/blogs/big-data/test-data-quality-at-scale-with-deequ
-
Iceberg – Introduction and Setup (Part – 1)
As we already discussed in our previous Delta Lake blog, there are already table formats in use, ones with very high specifications and their own benefits. Iceberg is one of them. So, in this blog, we will discuss Iceberg.
What is Apache Iceberg?
Apache Iceberg is an open-source table format used to handle large amounts of data stored locally or on various cloud storage platforms. Netflix originally developed Iceberg to solve its big data problems, then donated it to the Apache Software Foundation, and it became open source in 2018. Iceberg now has a large number of contributors worldwide on GitHub and is one of the most widely used table formats.
Iceberg mainly solves all the key problems we once faced when using the Hive table format to deal with data stored on various cloud storage like S3.
Iceberg offers features and capabilities similar to SQL tables. Because it is open source, multiple engines like Spark can operate on it to perform transformations. It also supports full ACID properties. This post is a quick introduction to Iceberg, covering its features and initial setup.
Why go with Iceberg
The main reason to use Iceberg is that it performs better when we need to load data from S3 or when metadata lives on a cloud storage medium. Unlike Hive, which tracks data at the folder level (an approach that can hurt performance), Iceberg tracks data at the file level; that’s why we choose Iceberg. Here is the folder hierarchy that Iceberg uses while saving data into its tables. Each Iceberg table is a combination of four kinds of files: the snapshot metadata file, the manifest list, manifest files, and data files.

- Snapshot Metadata File: This file holds the metadata information about the table, such as the schema, partitions, and manifest list.
- Manifest List: This list records each manifest file along with the path and metadata information. At this point, Iceberg decides which manifest files to ignore and which to read.
- Manifest File: This file contains the paths to real data files, which hold the real data along with the metadata.
- Data File: The actual data files that hold the real data, stored in formats such as Parquet, ORC, or Avro.
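Put together, an Iceberg table on disk looks roughly like this (an illustrative layout; exact file names vary by Iceberg version and catalog):

```
warehouse/db/table/
├── metadata/
│   ├── v2.metadata.json       # snapshot metadata file
│   ├── snap-...-1.avro        # manifest list (one per snapshot)
│   └── ...-m0.avro            # manifest file(s)
└── data/
    └── date=2023-01-01/
        └── 00000-0-....parquet  # data files
```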
Features of Iceberg:
Some Iceberg features include:
- Schema Evolution: Iceberg allows you to evolve your schema without having to rewrite your data. This means you can easily add, drop, or rename columns, providing flexibility to adapt to changing data requirements without impacting existing queries.
- Partition Evolution: Iceberg supports partition evolution, enabling you to modify the partitioning scheme as your data and query patterns evolve. This feature helps maintain query performance and optimize data layout over time.
- Time Travel: Iceberg’s time travel feature allows you to query historical versions of your data. This is particularly useful for debugging, auditing, and recreating analyses based on past data states.
- Multiple Query Engine Support: Iceberg supports multiple query engines, including Trino, Presto, Hive, and Amazon Athena. This interoperability ensures that you can read and write data across different tools seamlessly, facilitating a more versatile and integrated data ecosystem.
- AWS Support: Iceberg is well-integrated with AWS services, making it easy to use with Amazon S3 for storage and other AWS analytics services. This integration helps leverage the scalability and reliability of AWS infrastructure for your data lake.
- ACID Compliance: Iceberg ensures ACID (Atomicity, Consistency, Isolation, Durability) transactions, providing reliable data consistency and integrity. This makes it suitable for complex data operations and concurrent workloads, ensuring data reliability and accuracy.
- Hidden Partitioning: Iceberg’s hidden partitioning abstracts the complexity of managing partitions from the user, automatically handling partition management to improve query performance without manual intervention.
- Snapshot Isolation: Iceberg supports snapshot isolation, enabling concurrent read and write operations without conflicts. This isolation ensures that users can work with consistent views of the data, even as it is being updated.
- Support for Large Tables: Designed for high scalability, Iceberg can efficiently handle petabyte-scale tables, making it ideal for large datasets typical in big data environments.
- Compatibility with Modern Data Lakes: Iceberg’s design is tailored for modern data lake architectures, supporting efficient data organization, metadata management, and performance optimization, aligning well with contemporary data management practices.
These features make Iceberg a powerful and flexible table format for managing data lakes, ensuring efficient data processing, robust performance, and seamless integration with various tools and platforms. By leveraging Iceberg, organizations can achieve greater data agility, reliability, and efficiency, enhancing their data analytics capabilities and driving better business outcomes.
Prerequisite:
- PySpark: Ensure that you have PySpark installed and properly configured. PySpark provides the Python API for Spark, enabling you to harness the power of distributed computing with Spark using Python.
- Python: Make sure you have Python installed on your system. Python is essential for writing and running your PySpark scripts. It’s recommended to use a virtual environment to manage your dependencies effectively.
- Iceberg-Spark JAR: Download the appropriate Iceberg-Spark JAR file that corresponds to your Spark version. This JAR file is necessary to integrate Iceberg with Spark, allowing you to utilize Iceberg’s advanced table format capabilities within your Spark jobs.
- Jars to Configure Cloud Storage: Obtain and configure the necessary JAR files for your specific cloud storage provider. For example, if you are using Amazon S3, you will need the hadoop-aws JAR and its dependencies. For Google Cloud Storage, you need the gcs-connector JAR. These JARs enable Spark to read from and write to cloud storage systems.
- Spark and Hadoop Configuration: Ensure your Spark and Hadoop configurations are correctly set up to integrate with your cloud storage. This might include setting the appropriate access keys, secret keys, and endpoint configurations in your spark-defaults.conf and core-site.xml.
- Iceberg Configuration: Configure Iceberg settings specific to your environment. This might include catalog configurations (e.g., Hive, Hadoop, AWS Glue) and other Iceberg properties that optimize performance and compatibility.
- Development Environment: Set up a development environment with an IDE or text editor that supports Python and Spark development, such as IntelliJ IDEA with the PyCharm plugin, Visual Studio Code, or Jupyter Notebooks.
- Data Source Access: Ensure you have access to the data sources you will be working with, whether they are in cloud storage, relational databases, or other data repositories. Proper permissions and network configurations are necessary for seamless data integration.
- Basic Understanding of Data Lakes: A foundational understanding of data lake concepts and architectures will help effectively utilize Iceberg. Knowledge of how data lakes differ from traditional data warehouses and their benefits will also be helpful.
- Version Control System: Use a version control system like Git to manage your codebase. This helps in tracking changes, collaborating with team members, and maintaining code quality.
- Documentation and Resources: Familiarize yourself with Iceberg documentation and other relevant resources. This will help you troubleshoot issues, understand best practices, and leverage advanced features effectively.
You can download the runtime JAR from here, according to the Spark version installed on your machine or cluster. The setup is the same as for Delta Lake. You can either download these JAR files to your machine or cluster and supply them in a spark-submit command, or download them while initializing the Spark session by passing them in the Spark config as JAR packages with the appropriate version.
To use cloud storage, we are using these JARs with the S3 bucket for reading and writing Iceberg tables. Here is a basic example of a Spark session:

```python
import pyspark

AWS_ACCESS_KEY_ID = "XXXXXXXXXXXXXX"
AWS_SECRET_ACCESS_KEY = "XXXXXXXXXXXXXXXXXXXXXXXXXXXXpiwvahk7e"
spark_jars_packages = (
    "com.amazonaws:aws-java-sdk:1.12.246,"
    "org.apache.hadoop:hadoop-aws:3.2.2,"
    "org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:1.1.0"
)

spark = (
    pyspark.sql.SparkSession.builder
    .config("spark.jars.packages", spark_jars_packages)
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.demo.warehouse", "s3a://abhishek-test-01012023/iceberg-sample-data/")
    .config("spark.sql.catalog.demo.type", "hadoop")
    .config("spark.hadoop.fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider")
    .config("spark.driver.memory", "20g")
    .config("spark.memory.offHeap.enabled", "true")
    .config("spark.memory.offHeap.size", "8g")
    .getOrCreate()
)
spark.sparkContext._jsc.hadoopConfiguration().set("fs.s3a.access.key", AWS_ACCESS_KEY_ID)
spark.sparkContext._jsc.hadoopConfiguration().set("fs.s3a.secret.key", AWS_SECRET_ACCESS_KEY)
```

Iceberg Setup Using Docker
You can set and configure AWS creds, as well as some database-related or stream-related configs inside the docker-compose file.
```yaml
version: "3"
services:
  spark-iceberg:
    image: tabulario/spark-iceberg
    container_name: spark-iceberg
    build: spark/
    depends_on:
      - rest
      - minio
    volumes:
      - ./warehouse:/home/iceberg/warehouse
      - ./notebooks:/home/iceberg/notebooks/notebooks
      - ./data:/home/iceberg/data
    environment:
      - AWS_ACCESS_KEY_ID=admin
      - AWS_SECRET_ACCESS_KEY=password
      - AWS_REGION=us-east-1
    ports:
      - 8888:8888
      - 8080:8080
    links:
      - rest:rest
      - minio:minio
  rest:
    image: tabulario/iceberg-rest:0.1.0
    ports:
      - 8181:8181
    environment:
      - AWS_ACCESS_KEY_ID=admin
      - AWS_SECRET_ACCESS_KEY=password
      - AWS_REGION=us-east-1
      - CATALOG_WAREHOUSE=s3a://warehouse/wh/
      - CATALOG_IO__IMPL=org.apache.iceberg.aws.s3.S3FileIO
      - CATALOG_S3_ENDPOINT=http://minio:9000
  minio:
    image: minio/minio
    container_name: minio
    environment:
      - MINIO_ROOT_USER=admin
      - MINIO_ROOT_PASSWORD=password
    ports:
      - 9001:9001
      - 9000:9000
    command: ["server", "/data", "--console-address", ":9001"]
  mc:
    depends_on:
      - minio
    image: minio/mc
    container_name: mc
    environment:
      - AWS_ACCESS_KEY_ID=admin
      - AWS_SECRET_ACCESS_KEY=password
      - AWS_REGION=us-east-1
    entrypoint: >
      /bin/sh -c "
      until (/usr/bin/mc config host add minio http://minio:9000 admin password) do echo '...waiting...' && sleep 1; done;
      /usr/bin/mc rm -r --force minio/warehouse;
      /usr/bin/mc mb minio/warehouse;
      /usr/bin/mc policy set public minio/warehouse;
      exit 0;
      "
```

Save this file as docker-compose.yaml and run the command docker compose up. Now, you can log into your container by using this command:
```shell
docker exec -it <container-id> bash
```

You can mount the sample data directory in a container or copy it from your local machine to the container. To copy the data inside the Docker directory, we can use the cp command.
```shell
docker cp input-data <container-id>:/home/iceberg/data
```

Set Up S3 as a Warehouse in Iceberg, Read Data from S3, and Write Iceberg Tables Back to S3 Using an EC2 Instance
We have generated 90 GB of data using a Spark job and stored it in the S3 bucket.
We reuse the same Spark session configuration shown earlier, pointing the demo catalog warehouse at the S3 bucket.
Step 1
We read the data in Spark and create an Iceberg table out of it, storing the Iceberg tables directly in the S3 bucket.
Some Iceberg functionality won’t work if we haven’t installed the appropriate JAR file for the Iceberg version. The Iceberg version must be compatible with the Spark version you are using; otherwise, some features, such as partitioning, will fail with a NoSuchMethodError. This must be handled carefully when setting things up, whether on EC2 or EMR.
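One way to avoid such mismatches is to derive the runtime JAR coordinate from the versions you are actually running, since the artifact name encodes both the Spark and Scala versions. This is a minimal sketch; the version numbers below simply mirror the session configuration used in this article:

```python
# Build the Iceberg Spark runtime Maven coordinate from the Spark/Scala versions in
# use, so the JAR always matches the cluster (mismatches surface as NoSuchMethodError).
def iceberg_runtime_coordinate(spark_minor: str, scala_version: str, iceberg_version: str) -> str:
    return f"org.apache.iceberg:iceberg-spark-runtime-{spark_minor}_{scala_version}:{iceberg_version}"

# Matches the spark.jars.packages entry used in this article:
coord = iceberg_runtime_coordinate("3.3", "2.12", "1.1.0")
```

You can then pass the resulting string straight into `spark.jars.packages`, keeping a single source of truth for the versions.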
Create an Iceberg table on S3 and write data into it. The sample data was generated using a Spark job for Delta tables; we reuse the same data, with the schema shown below.
Step 2
We created the Iceberg table at an S3 location and wrote the data, partitioned by the date column, back to the same bucket.
spark.sql("""
CREATE TABLE IF NOT EXISTS demo.db.iceberg_data_2 (
    id INT, first_name STRING, last_name STRING, address STRING,
    pincode INT, net_income INT, source_of_income STRING, state STRING,
    email_id STRING, description STRING, population INT,
    population_1 STRING, population_2 STRING, population_3 STRING,
    population_4 STRING, population_5 STRING, population_6 STRING,
    population_7 STRING, date INT)
USING iceberg
TBLPROPERTIES ('format'='parquet', 'format-version'='2')
PARTITIONED BY (`date`)
LOCATION 's3a://abhishek-test-01012023/iceberg_v2/db/iceberg_data_2'
""")

# Read the data that needs to be written
# (reading the Delta-table sample data into a Spark DataFrame)
df = spark.read.parquet("s3a://abhishek-test-01012023/delta-lake-sample-data/")

logging.info("Starting writing the data")
df.sortWithinPartitions("date").writeTo("demo.db.iceberg_data_2").partitionedBy("date").createOrReplace()
logging.info("Writing has been finished")

logging.info("Query the data from iceberg using spark SQL")
spark.sql("DESCRIBE TABLE demo.db.iceberg_data_2").show()
spark.sql("SELECT * FROM demo.db.iceberg_data_2 LIMIT 10").show()

This is how we can use Iceberg over S3. There is another option: we can also create Iceberg tables in the AWS Glue catalog. Most tables created in the Glue catalog using Athena are external tables that we query after generating manifest files, much like Delta Lake.
Step 3
We print the Iceberg table’s data along with the table descriptions.


Using Iceberg, we can create the table directly in the Glue catalog using Athena, and it supports all read and write operations on the available data. The following configuration needs to be applied in Spark when using the Glue catalog.
{
  "conf": {
    "spark.sql.catalog.glue_catalog1": "org.apache.iceberg.spark.SparkCatalog",
    "spark.sql.catalog.glue_catalog1.warehouse": "s3://YOUR-BUCKET-NAME/iceberg/glue_catalog1/tables/",
    "spark.sql.catalog.glue_catalog1.catalog-impl": "org.apache.iceberg.aws.glue.GlueCatalog",
    "spark.sql.catalog.glue_catalog1.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
    "spark.sql.catalog.glue_catalog1.lock-impl": "org.apache.iceberg.aws.glue.DynamoLockManager",
    "spark.sql.catalog.glue_catalog1.lock.table": "myGlueLockTable",
    "spark.sql.extensions": "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions"
  }
}

Now, we can easily create the Iceberg table using Spark or Athena, and it will be accessible from either engine. We can perform upserts, too.
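Such an upsert can be expressed with Iceberg's MERGE INTO support in Spark SQL. The statement below is a minimal sketch: the `updates` staging view and the match condition on `id` are illustrative assumptions, and in a live session you would pass the string to `spark.sql(...)`:

```python
# Sketch of an Iceberg upsert via Spark SQL MERGE INTO.
# "updates" is a hypothetical staging view holding new or changed rows.
merge_sql = """
MERGE INTO demo.db.iceberg_data_2 AS target
USING updates AS source
ON target.id = source.id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *
"""

# In a live Spark session with the Iceberg extensions enabled, you would run:
#     spark.sql(merge_sql)
```

Matched rows are updated in place and unmatched rows are inserted, which is exactly the row-level behaviour Iceberg's v2 format and ACID guarantees make possible on a data lake.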
Conclusion
We’ve learned the basics of the Iceberg table format, its features, and the reasons for choosing Iceberg. We discussed how Iceberg provides significant advantages such as schema evolution, partition evolution, hidden partitioning, and ACID compliance, making it a robust choice for managing large-scale data. We also delved into the fundamental setup required to implement this table format, including configuration and integration with data processing engines like Apache Spark and query engines like Presto and Trino. By leveraging Iceberg, organizations can ensure efficient data management and analytics, facilitating better performance and scalability. With this knowledge, you are well-equipped to start using Iceberg for your data lake needs, ensuring a more organized, scalable, and efficient data infrastructure.
-
Exploring the Marvels of Webpack: Ep 1 – React Project Without CRA
Hands up if you’ve ever built a React project with Create-React-App (CRA)—and that’s all of us, isn’t it? Now, how about we pull back the curtain and see what’s actually going on behind the scenes? Buckle up, it’s time to understand what CRA really is and explore the wild, untamed world of creating a React project without it. Sounds exciting, huh?
What is CRA?
CRA—Create React App (https://create-react-app.dev/)—is a command-line utility provided by Facebook for creating React apps with a preconfigured setup. CRA provides an abstraction layer over the nitty-gritty details of configuring tools like Babel and Webpack, allowing us to focus on writing code. In short, it comes with everything preconfigured, so developers don’t need to worry about anything but code.
That’s all well and good, but why do we need to learn about manual configuration? At some point in your career, you’ll likely have to adjust webpack configurations. And if that’s not a convincing reason, how about satisfying your curiosity? 🙂
Let’s begin our journey.
Webpack
As per the official docs (https://webpack.js.org/concepts/):
“At its core, webpack is a static module bundler for modern JavaScript applications.”
But what does that actually mean? Let’s break it down:
- static: refers to the static assets (HTML, CSS, JS, images) of our application.
- module: refers to a piece of code in one of our files. In a large application, it’s usually not possible to write everything in a single file, so we have multiple modules working together.
- bundler: the tool (webpack, in our case) that bundles up everything we have used in our project and converts it into native, browser-understandable JS, CSS, and HTML (static assets).
Source: https://webpack.js.org/

So, in essence, webpack takes our application’s static assets (like JavaScript modules, CSS files, and more) and bundles them together, resolving dependencies and optimizing the final output.
Webpack is preconfigured in our Create-React-App (CRA), and for most use cases, we don’t need to adjust it. You’ll find that many tutorials begin a React project with CRA. However, to truly understand webpack and its functionalities, we need to configure it ourselves. In this guide, we’ll attempt to do just that.
Let’s break this whole process into multiple steps:
Step 1: Let us name our new project
Create a new project folder and navigate into it:
mkdir react-webpack-way
cd react-webpack-way

Step 2: Initialize npm
Run the following command to initialize a new npm project. Answer the prompts or press Enter to accept the default values.
npm init      # if you are patient enough to answer the prompts :)
# Or
npm init -y

This will generate a package.json for us.
Step 3: Install React and ReactDOM
Install React and ReactDOM as dependencies:
npm install react react-dom

Step 4: Create project structure
You can create any folder structure that you are used to. But for the sake of simplicity, let’s stick to the following structure:
|- src
   |- index.js
|- public
   |- index.html

Step 5: Set up React components
Let’s populate our index.js:
// src/index.js
import React from 'react';
import { createRoot } from 'react-dom/client';

const App = () => {
  return <h1>Hello, React with Webpack!</h1>;
};

// React 18+ entry point (ReactDOM.render is deprecated since React 18)
createRoot(document.getElementById('root')).render(<App />);

Step 6: Let’s deal with the HTML file
Add the following content to index.html:
<!-- public/index.html -->
<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="utf-8" />
    <title>React with Webpack</title>
  </head>
  <body>
    <div id="root"></div> <!-- Do not miss this one -->
  </body>
</html>

Step 7: Install Webpack and Babel
Install Webpack, Babel, and html-webpack-plugin as development dependencies:
npm install --save-dev webpack webpack-cli webpack-dev-server @babel/core @babel/preset-react @babel/preset-env babel-loader html-webpack-plugin

Or
If this looks verbose to you, you can do these in steps:
npm install --save-dev webpack webpack-cli webpack-dev-server   # webpack
npm install --save-dev @babel/core @babel/preset-react @babel/preset-env babel-loader   # babel
npm install --save-dev html-webpack-plugin   # html-webpack-plugin

Why Babel? Read more: https://babeljs.io/docs/
In a nutshell, some of the reasons we use Babel are:
JavaScript ECMAScript Compatibility:
- Babel allows developers to use the latest ECMAScript (ES) features in their code, even if the browser or Node.js environment doesn’t yet support them. This is achieved through transpiling, where Babel converts modern JavaScript code (ES6 and beyond) into a version that is compatible with a wider range of browsers and environments.

JSX Transformation:
- JSX (JavaScript XML) is a syntax extension for JavaScript used with React. Babel is required to transform JSX syntax into plain JavaScript, as browsers do not understand JSX directly. This transformation is necessary for React components to be properly rendered in the browser.

Module System Transformation:
- Babel helps transform the module system used in JavaScript. It can convert code written using the ES6 module syntax (import and export) into the CommonJS or AMD syntax that browsers and older environments understand.

Polyfilling:
- Babel can include polyfills for features not present in the target environment. This ensures your application can use newer language features or APIs even if they are not supported natively.

Browser Compatibility:
- Different browsers have varying levels of support for JavaScript features. Babel helps address these compatibility issues by allowing developers to write code using the latest features and then automatically transforming it to a version that works across different browsers.
Why html-webpack-plugin? Read more: https://webpack.js.org/plugins/html-webpack-plugin/
The html-webpack-plugin is a popular webpack plugin that simplifies the process of creating an HTML file to serve your bundled JavaScript files. It automatically injects the bundled script(s) into the HTML file, saving you from having to manually update the script tags every time your bundle changes. To put it in perspective, if you don’t have this plugin, you won’t see your React index file injected into the HTML file.
Step 8: Configure Babel
Create a .babelrc file in the project root and add the following configuration:
// .babelrc
{
  "presets": ["@babel/preset-react", "@babel/preset-env"]
}

Step 9: Configure Webpack
Create a webpack.config.js file in the project root:
// webpack.config.js
const path = require('path');
const HtmlWebpackPlugin = require('html-webpack-plugin');

module.exports = {
  entry: './src/index.js',
  output: {
    path: path.resolve(__dirname, 'dist'),
    filename: 'bundle.js',
  },
  module: {
    rules: [
      {
        test: /\.(js|jsx)$/,
        exclude: /node_modules/,
        use: 'babel-loader',
      },
    ],
  },
  plugins: [
    new HtmlWebpackPlugin({
      template: 'public/index.html',
    }),
  ],
  devServer: {
    static: path.resolve(__dirname, 'public'),
    port: 3000,
  },
};

Step 10: Update package.json scripts
Update the “scripts” section in your package.json file:
"scripts": {
  "start": "webpack serve --mode development --open",
  "build": "webpack --mode production"
}

Note: Do not replace the contents of package.json here. Just update the scripts section.
Step 11: This is where our hard work pays off
Now you can run your React project using the following command:
npm start

Visit http://localhost:3000 in your browser, and you should see your React app up and running.
This is it. This is a very basic version of our CRA.
There’s more
Stick around if you want to understand what we exactly did in the webpack.config.js.
At this point, our webpack config looks like this:
// webpack.config.js
const path = require('path');
const HtmlWebpackPlugin = require('html-webpack-plugin');

module.exports = {
  entry: './src/index.js',
  output: {
    path: path.resolve(__dirname, 'dist'),
    filename: 'bundle.js',
  },
  module: {
    rules: [
      {
        test: /\.(js|jsx)$/,
        exclude: /node_modules/,
        use: 'babel-loader',
      },
    ],
  },
  plugins: [
    new HtmlWebpackPlugin({
      template: 'public/index.html',
    }),
  ],
  devServer: {
    static: path.resolve(__dirname, 'public'),
    port: 3000,
  },
};

Let’s go through each section of the provided webpack.config.js file and explain what each keyword means:
const path = require('path');
- This line imports the Node.js path module, which provides utilities for working with file and directory paths. It ensures that file paths in our webpack configuration are specified correctly and consistently across different operating systems.

const HtmlWebpackPlugin = require('html-webpack-plugin');
- This line imports the HtmlWebpackPlugin module. This webpack plugin simplifies the process of creating an HTML file to include the bundled JavaScript files. It’s a convenient way of automatically generating an HTML file that includes the correct script tags for our React application.

module.exports = { ... };
- This line exports a JavaScript object, which contains the configuration for webpack. It specifies how webpack should bundle and process your code.

entry: './src/index.js',
- This configuration tells webpack the entry point of your application, which is the main JavaScript file where the bundling process begins. In this case, it’s ./src/index.js.

output: { path: path.resolve(__dirname, 'dist'), filename: 'bundle.js', },
- This configuration specifies where the bundled JavaScript file should go: path is the directory, and filename is the name of the output file. In this case, it will be placed in the dist directory with the name bundle.js.

module: { rules: [ ... ], },
- This section defines rules for how webpack should process different types of files. In this case, it specifies a rule for JavaScript and JSX files (those ending with .js or .jsx). The babel-loader is used to transpile these files using Babel, excluding files in the node_modules directory.

plugins: [ new HtmlWebpackPlugin({ template: 'public/index.html', }), ],
- This section includes an array of webpack plugins. In particular, it adds the HtmlWebpackPlugin, configured to use the public/index.html file as a template. This plugin will automatically generate an HTML file with the correct script tags for the bundled JavaScript.

devServer: { static: path.resolve(__dirname, 'public'), port: 3000, },
- This configuration is for the webpack development server. It specifies the base directory for serving static files (public in this case) and the port number (3000) on which the development server will run. The development server provides features like hot-reloading during development.
And there you have it! We’ve just scratched the surface of the wild world of webpack. But don’t worry, this is just the opening act. Grab your gear, because in the upcoming articles, we’re going to plunge into the deep end, exploring the advanced terrains of webpack. Stay tuned!
-
Strategies for Cost Optimization Across Amazon EKS Clusters
Fast-growing tech companies rely heavily on Amazon EKS clusters to host a variety of microservices and applications. The pairing of Amazon EKS for managing the Kubernetes Control Plane and Amazon EC2 for flexible Kubernetes nodes creates an optimal environment for running containerized workloads.
With the increasing scale of operations, optimizing costs across multiple EKS clusters has become a critical priority. This blog will demonstrate how we can leverage various tools and strategies to analyze, optimize, and manage EKS costs effectively while maintaining performance and reliability.
Cost Analysis:
Cost analysis is the necessary first step in any cost-optimization effort. Data plays the central role here, so gather it carefully and trust it. The total cost of operating an EKS cluster encompasses several components. The EKS Control Plane (or master node) incurs a fixed cost of $0.20 per hour, offering straightforward pricing.
Meanwhile, EC2 instances, serving as the cluster’s nodes, introduce various cost factors, such as block storage and data transfer, which can vary significantly based on workload characteristics. For this discussion, we’ll focus primarily on two aspects of EC2 cost: instance hours and instance pricing. Let’s look at how to do the cost analysis on your EKS cluster.
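The fixed part is easy to compute from the $0.20/hour rate quoted above; the 730-hour month used below is the usual AWS approximation for an average month:

```python
# Monthly EKS Control Plane cost: fixed hourly rate times hours in an average month.
CONTROL_PLANE_RATE = 0.20   # USD per hour, the rate quoted above
HOURS_PER_MONTH = 730       # AWS's usual average-month approximation

monthly_control_plane_cost = CONTROL_PLANE_RATE * HOURS_PER_MONTH
# ≈ $146 per cluster per month, regardless of workload
```

The per-cluster figure is small, but with many clusters it adds up; everything else in the bill scales with the data plane, which is why the rest of this analysis focuses on EC2.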
- Tool Selection: We can begin our cost analysis journey by selecting Kubecost, a powerful tool specifically designed for Kubernetes cost analysis. Kubecost provides granular insights into resource utilization and costs across our EKS clusters.
- Deployment and Usage: Deploying Kubecost is straightforward; we can integrate it with our Kubernetes clusters following the provided documentation. Kubecost’s intuitive dashboard allows us to visualize resource usage, cost breakdowns, and cost allocation by namespace, pod, or label. Once deployed, you can open the Kubecost overview page in your browser by port-forwarding the Kubecost Kubernetes service. It might take 5-10 minutes for Kubecost to gather metrics. You can then see your Amazon EKS spend, including cumulative cluster costs, associated Kubernetes asset costs, and monthly aggregated spend.

- Cluster Level Cost Analysis: For multi-cluster cost analysis and cluster-level scoping, consider adopting an AWS tagging strategy and tagging your EKS clusters; the AWS tagging documentation covers this in detail. You can then view your cost analysis in AWS Cost Explorer, which provides additional insights into AWS usage and spending trends. By analyzing cost and usage data at a granular level, we can identify areas for further optimization and cost reduction.
- Multi-Cluster Cost Analysis using Kubecost and Prometheus: The Kubecost deployment ships with a bundled Prometheus server that receives its cost-analysis metrics. For multiple EKS clusters, we can instead point Kubecost at a remote Prometheus server, either AWS Managed Prometheus or a self-managed one. To collect cost-analysis metrics from multiple clusters, we run Kubecost with an additional SigV4 proxy pod that sends each cluster’s metrics to the common Prometheus server. You can follow the AWS documentation for Multi-Cluster Cost Analysis using Kubecost and Prometheus.
Cost Optimization Strategies:
Based on the cost analysis, the next step is to plan your cost optimization strategies. As explained in the previous section, the Control Plane has a fixed cost and straightforward pricing model. So, we will focus mainly on optimizing the data nodes and optimizing the application configuration. Let’s look at the following strategies when optimizing the cost of the EKS cluster and supporting AWS services:
- Right Sizing: On the cost optimization pillar of the AWS Well-Architected Framework, we find a section on Cost-Effective Resources, which describes Right Sizing as:
“… using the lowest cost resource that still meets the technical specifications of a specific workload.”
- Application Right Sizing: Right-sizing is the strategy to optimize pod resources by allocating the appropriate CPU and memory resources to pods. Care must be taken to try to set requests that align as close as possible to the actual utilization of these resources. If the value is too low, then the containers may experience throttling of the resources and impact the performance. However, if the value is too high, then there is waste, since those unused resources remain reserved for that single container. When actual utilization is lower than the requested value, the difference is called slack cost. A tool like kube-resource-report is valuable for visualizing the slack cost and right-sizing the requests for the containers in a pod. Installation instructions demonstrate how to install via an included helm chart.
helm upgrade --install kube-resource-report chart/kube-resource-report
You can also consider tools like VPA recommender with Goldilocks to get an insight into your pod resource consumption and utilization.
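The slack-cost idea above can be sketched in a few lines: the gap between what a container requests and what it actually uses is slack, and pricing that gap gives the waste. The unit price and utilization figures below are made-up illustrations, not measurements:

```python
# Slack cost sketch: resources requested but not used are still reserved (and paid for).
def slack_cost(requested: float, used: float, unit_price: float) -> float:
    """Cost of the requested-but-unused portion of a resource per billing unit."""
    return max(requested - used, 0.0) * unit_price

# Hypothetical pod: requests 1.0 vCPU but actually uses 0.25 vCPU,
# at an illustrative price of $0.03 per vCPU-hour.
hourly_slack = slack_cost(requested=1.0, used=0.25, unit_price=0.03)
monthly_slack = hourly_slack * 730  # 730 hours in an average month
```

Multiplied across hundreds of pods, this is exactly the number a tool like kube-resource-report surfaces, and shrinking it is what "right-sizing the requests" means in practice.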
- Compute Right Sizing: Application right sizing and Kubecost analysis are prerequisites for right-sizing EKS compute. Here are several strategies for compute right sizing:
- Mixed Instance Auto Scaling group: Employ a mixed instance policy to create a diversified pool of instances within your auto scaling group. This mix can include both spot and on-demand instances. However, it’s advisable not to mix instances of different sizes within the same Node group.
- Node Groups, Taints, and Tolerations: Utilize separate Node Groups with varying instance sizes for different application requirements. For example, use distinct node groups for GPU-intensive and CPU-intensive applications. Use taints and tolerations to ensure applications are deployed on the appropriate node group.
- Graviton Instances: Explore the adoption of Graviton Instances, which offer up to 40% better price performance compared to traditional instances. Consider migrating to Graviton Instances to optimize costs and enhance application performance.
- Purchase Options: Another part of the cost optimization pillar of the AWS Well-Architected Framework that we can apply comes from the Purchasing Options section, which says:
“Spot Instances allow you to use spare compute capacity at a significantly lower cost than On-Demand EC2 instances (up to 90%).”
Understanding purchase options for Amazon EC2 is crucial for cost optimization. The Amazon EKS data plane consists of worker nodes or serverless compute resources responsible for running Kubernetes application workloads. These nodes can utilize different capacity types and purchase options, including On-Demand, Spot Instances, Savings Plans, and Reserved Instances.
On-Demand and Spot capacity offer flexibility without spending commitments. On-Demand instances are billed based on runtime and guarantee availability at On-Demand rates, while Spot instances offer discounted rates but are preemptible. Both options are suitable for temporary or bursty workloads, with Spot instances being particularly cost-effective for applications tolerant of compute availability fluctuations.
Reserved Instances involve upfront spending commitments over one or three years for discounted rates. Once a steady-state resource consumption profile is established, Reserved Instances or Savings Plans become effective. Savings Plans, introduced as a more flexible alternative to Reserved Instances, allow for commitments based on a “US Dollar spend amount,” irrespective of provisioned resources. There are two types: Compute Savings Plans, offering flexibility across instance types, Fargate, and Lambda charges, and EC2 Instance Savings Plans, providing deeper discounts but restricting compute choice to an instance family.
Tailoring your approach to your workload can significantly impact cost optimization within your EKS cluster. For non-production environments, leveraging Spot Instances exclusively can yield substantial savings. Meanwhile, implementing Mixed-Instances Auto Scaling Groups for production workloads allows for dynamic scaling and cost optimization. Additionally, for steady workloads, investing in a Savings Plan for EC2 instances can provide long-term cost benefits. By strategically planning and optimizing your EC2 instances, you can achieve a notable reduction in your overall EKS compute costs, potentially reaching savings of approximately 60-70%.
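To see where a 60-70% figure can come from, consider a back-of-the-envelope blend. The on-demand rate and the 90% Spot discount below are illustrative assumptions (the discount is the "up to" ceiling quoted earlier), not quoted AWS prices:

```python
# Back-of-the-envelope blended EC2 cost for a mixed purchasing strategy.
# All rates and discounts here are illustrative assumptions.
ON_DEMAND_RATE = 0.10  # USD/hour for a hypothetical instance type

def blended_rate(spot_share: float, spot_discount: float = 0.90) -> float:
    """Average hourly rate when spot_share of capacity runs on Spot instances."""
    spot_rate = ON_DEMAND_RATE * (1 - spot_discount)
    return spot_share * spot_rate + (1 - spot_share) * ON_DEMAND_RATE

# A 70% Spot / 30% On-Demand mix:
rate = blended_rate(0.70)
savings = 1 - rate / ON_DEMAND_RATE  # fraction saved vs. running everything On-Demand
# savings works out to 63% under these assumptions, inside the 60-70% band
```

Real savings depend on the actual Spot discount achieved per instance family and on how much of the fleet can tolerate interruption, so treat this as a planning estimate rather than a guarantee.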
- Auto Scaling: The cost optimization pillar of the AWS Well-Architected Framework includes a section on Matching Supply and Demand, which recommends the following:
“… this (matching supply and demand) accomplished using Auto Scaling, which helps you to scale your EC2 instances and Spot Fleet capacity up or down automatically according to conditions you define.”
- Cluster Autoscaling: Therefore, a prerequisite to cost optimization on a Kubernetes cluster is to ensure you have Cluster Autoscaler running. This tool performs two critical functions in the cluster. First, it will monitor the cluster for pods that are unable to run due to insufficient resources. Whenever this occurs, the Cluster Autoscaler will update the Amazon EC2 Auto Scaling group to increase the desired count, resulting in additional nodes in the cluster. Additionally, the Cluster Autoscaler will detect nodes that have been underutilized and reschedule pods onto other nodes. Cluster Autoscaler will then decrease the desired count for the Auto Scaling group to scale in the number of nodes.
The Amazon EKS User Guide has a great section on the configuration of the Cluster Autoscaler. There are a couple of things to pay attention to when configuring the Cluster Autoscaler:
IAM Roles for Service Account – Cluster Autoscaler will require access to update the desired capacity in the Auto Scaling group. The recommended approach is to create a new IAM role with the required policies and a trust policy that restricts access to the service account used by Cluster Autoscaler. The role name must then be provided as an annotation on the service account:
apiVersion: v1
kind: ServiceAccount
metadata:
  name: cluster-autoscaler
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::000000000000:role/my_role_name

Auto-Discovery Setup
Set up your Cluster Autoscaler for auto-discovery by enabling the --node-group-auto-discovery flag as an argument. Also, make sure to tag your EKS nodes’ Auto Scaling groups with the following tags:
k8s.io/cluster-autoscaler/enabled,
k8s.io/cluster-autoscaler/<cluster-name>
Auto Scaling Group per AZ – When Cluster Autoscaler scales out, it simply increases the desired count for the Auto Scaling group, leaving the responsibility for launching new EC2 instances to the AWS Auto Scaling service. If an Auto Scaling group is configured for multiple availability zones, then the new instance may be provisioned in any of those availability zones.

For deployments that use persistent volumes, you will need to provision a separate Auto Scaling group for each availability zone. This way, when Cluster Autoscaler detects the need to scale out in response to a given pod, it can target the correct availability zone based on the persistent volume claims that already exist there.
When using multiple Auto Scaling groups, be sure to include the following argument in the pod specification for Cluster Autoscaler:
--balance-similar-node-groups=true
- Pod Autoscaling: Now that Cluster Autoscaler is running in the cluster, you can be confident that the instance hours will align closely with the demand from pods within the cluster. Next up is to use Horizontal Pod Autoscaler (HPA) to scale out or in the number of pods for a deployment based on specific metrics for the pods to optimize pod hours and further optimize our instance hours.
The HPA controller is included with Kubernetes, so all that is required to configure HPA is to ensure that the Kubernetes metrics server is deployed in your cluster and then to define HPA resources for your deployments. For example, the following HPA resource is configured to monitor the CPU utilization of a deployment named nginx-ingress-controller. HPA will then scale the number of pods out or in, between 1 and 5, to target an average CPU utilization of 80% across all the pods:
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: nginx-ingress-controller
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: nginx-ingress-controller
  minReplicas: 1
  maxReplicas: 5
  targetCPUUtilizationPercentage: 80

The combination of Cluster Autoscaler and Horizontal Pod Autoscaler is an effective way to keep EC2 instance hours tied as closely as possible to the actual utilization of the workloads running in the cluster.
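The scaling decision HPA makes from those numbers follows the standard Kubernetes formula, desiredReplicas = ceil(currentReplicas x currentMetric / targetMetric), clamped to the min/max bounds. A minimal sketch using the bounds and target from the manifest above:

```python
import math

# Standard HPA scaling rule: desired = ceil(current * metric/target), clamped to [min, max].
def hpa_desired_replicas(current: int, current_util: float, target_util: float,
                         min_replicas: int = 1, max_replicas: int = 5) -> int:
    desired = math.ceil(current * current_util / target_util)
    return max(min_replicas, min(max_replicas, desired))

# 3 pods averaging 120% CPU against the 80% target -> scale out:
hpa_desired_replicas(3, 120, 80)   # ceil(4.5) = 5, within maxReplicas
# 3 pods averaging 40% CPU -> scale in:
hpa_desired_replicas(3, 40, 80)    # ceil(1.5) = 2
```

The real controller adds tolerances and stabilization windows on top of this rule, but the core arithmetic is what drives pod hours, and through Cluster Autoscaler, instance hours.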

- Down Scaling: In addition to demand-based automatic scaling, the Matching Supply and Demand section of the AWS Well-Architected Framework cost optimization pillar includes a section, which recommends the following:
“Systems can be scheduled to scale out or in at defined times, such as the start of business hours, thus ensuring that resources are available when users arrive.”
There are many deployments that only need to be available during business hours. A tool named kube-downscaler can be deployed to the cluster to scale in and out the deployments based on time of day.
Some example use cases of kube-downscaler are:
- Deploy the downscaler to a test (non-prod) cluster with a default uptime or downtime time range to scale down all deployments during the night and weekend.
- Deploy the downscaler to a production cluster without any default uptime/downtime setting and scale down specific deployments by setting the downscaler/uptime (or downscaler/downtime) annotation. This might be useful for internal tooling front ends, which are only needed during work time.
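For the second case, the schedule is set per workload via an annotation on the Deployment. A hedged sketch of what that can look like (the deployment name and time window are placeholders; check the kube-downscaler README for the exact annotation format your version supports):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: internal-tooling-frontend   # placeholder name
  annotations:
    # Keep this deployment up only during work hours; it is scaled down otherwise.
    downscaler/uptime: Mon-Fri 08:00-19:00 Europe/Berlin
```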
- AWS Fargate with EKS: You can run Kubernetes without managing clusters of K8s servers with AWS Fargate, a serverless compute service.
AWS Fargate pricing is based on usage (pay-per-use), with no upfront charges. There is, however, a one-minute minimum charge, and all charges are rounded up to the nearest second. You will also be charged for any additional services you use, such as CloudWatch and data transfer. Fargate can also reduce your management costs by reducing the number of DevOps professionals and tools you need to run Kubernetes on Amazon EKS.
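The per-second rounding and one-minute minimum described above can be sketched as follows; the per-vCPU and per-GB rates are placeholders, not current Fargate prices:

```python
import math

# Fargate-style billing sketch: per-second billing with a one-minute minimum.
# Rates below are illustrative placeholders, not current AWS prices.
VCPU_RATE_PER_HOUR = 0.04
GB_RATE_PER_HOUR = 0.004

def fargate_task_cost(vcpus: float, memory_gb: float, runtime_seconds: float) -> float:
    billable = max(math.ceil(runtime_seconds), 60)  # round up to whole seconds, 1-minute floor
    hourly = vcpus * VCPU_RATE_PER_HOUR + memory_gb * GB_RATE_PER_HOUR
    return hourly * billable / 3600

# A 30-second task is still billed for 60 seconds:
cost = fargate_task_cost(vcpus=0.5, memory_gb=1.0, runtime_seconds=30)
```

For short, bursty tasks the one-minute floor dominates, which is worth modelling before moving very chatty workloads to Fargate.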
Conclusion:
Effectively managing costs across multiple Amazon EKS clusters is essential for optimizing operations. By utilizing tools like Kubecost and AWS Cost Explorer, coupled with strategies such as right-sizing, mixed instance policies, and Spot Instances, organizations can streamline cost analysis and optimize resource allocation. Additionally, implementing auto-scaling mechanisms like Cluster Autoscaler ensures dynamic resource scaling based on demand, further optimizing costs. Leveraging AWS Fargate with EKS can eliminate the need to manage Kubernetes clusters, reducing management costs. Overall, by combining these strategies, organizations can achieve significant cost savings while maintaining performance and reliability in their containerized environments.
-
Cloud Data Migration: What You Need to Know
Highlights
- What is Cloud Data Migration?
- What are the Benefits of Cloud Migration?
- Cloud Data Migration Challenges – How to Avoid Them?
Cloud use and migration are undeniably increasing. According to a recent MarketsandMarkets analysis, cloud usage is projected to expand at a compound annual growth rate (CAGR) of 16.3% from 2021 to 2026.
Similarly, Gartner predicts that by 2025, 95% of new digital workloads will be deployed on cloud platforms, up from 30% in 2021. Cloud technology is vital in helping businesses reopen, rethink, and navigate volatility. The increased use of the cloud stems from its advantages over traditional on-premises hosting: it offers businesses a smooth end-to-end digital transformation and helps them succeed in a competitive world.
What is Cloud Data Migration?
Cloud data migration entails moving databases, IT resources, digital assets, and applications either partially or wholly to the cloud. Cloud migration also involves moving from one cloud service to another.
As businesses seek to bid farewell to antiquated and slow legacy infrastructures, such as aging servers and potentially unreliable legacy appliances, they are turning towards the cloud to unlock their full potential.
No one can deny that cloud migration helps businesses improve performance and efficiency. However, cloud data migration is not easy and requires expert assistance, since it involves careful analysis, planning, and execution to ensure the cloud solution is compatible with your business requirements.
What are the Benefits of Cloud Migration?
Recently, companies have started migrating their apps, IT infrastructure, and data to the cloud to build more flexible digital workplaces in response to the shifting business landscape. Cloud migration has a massive impact on a business's success. Companies that have already begun cloud migration are accelerating their digital transformation journey and putting themselves at the forefront of technological innovation.
Cloud data migration is projected to be a key driving force for enterprises in the following years. As a result, businesses that embrace cloud-based solutions proactively position themselves for long-term success and development. Some of the significant benefits of cloud data migration include:
- High Scalability
A cloud data migration strategy provides businesses with high scalability, allowing them to efficiently manage fluctuations in demand and quickly scale their operations up or down to meet changing needs.
- Cost Savings
With cloud data migration, businesses can save money by minimizing the requirement for physical infrastructure, lowering maintenance and upgrade expenses, and removing the need for on-premises staff to manage the infrastructure. This can also free up resources that can be used to promote development and innovation in other areas of the organization.
- Increased Flexibility
Thanks to cloud data migration, businesses can now access their apps and data from any location, at any time, and on any device with an internet connection. This increased flexibility promotes worker productivity and collaboration and makes remote work possible.
- Improved Security
Moving your data to the cloud is an excellent idea for businesses that want to increase their data security. By migrating your data to a reliable cloud environment, you can take advantage of the security features given by cloud service providers like AWS, such as encryption, access restrictions, and automatic backups.
- Better Performance
Businesses benefit from cloud data migration by gaining access to the latest technologies, optimized for performance and reliability. By enabling quicker and more effective operations, this can boost customer satisfaction and loyalty while driving revenue growth.
- Business Modernization
Having a smart cloud data migration strategy is vital to a company's modernization, since it allows businesses to harness advanced technology and remain competitive in the digital marketplace. Cloud migration helps organizations better meet their consumers' demands by offering more advanced and innovative products and services.
- Disaster Recovery
Cloud data migration also strengthens disaster recovery. Cloud providers generally maintain disaster recovery and business continuity plans that help organizations recover rapidly from unforeseen events such as natural disasters or cyber-attacks. This reduces downtime and data loss, both of which can be costly and damaging to a company.
Cloud Data Migration Challenges – How to Avoid Them?
Migrating apps and data to the cloud is advantageous for businesses, but without specialized knowledge and expertise you might encounter many challenges during the process. An experienced cloud migration professional can help companies navigate the migration and avoid typical mistakes that might result in data loss, system outages, corruption, delays, or security breaches.
Cloud migration specialists can assist you in developing and implementing a complete cloud data migration strategy that meets technical, operational, and security demands. They also assist organizations in selecting the best cloud service provider and platform for their specific needs and objectives. Furthermore, when the migration process is complete, expert service providers give continuous assistance and support to customers, helping them optimize their cloud infrastructure and maximize the value of their investment.
R Systems is a trusted cloud data migration service provider with experienced cloud migration professionals who ensure smooth and efficient AWS data migration with minimal disruption to your business operations. As an AWS Advanced Tier Services Partner, we specialize in delivering top-notch AWS data migration services on time and within budget while helping businesses achieve their cloud migration goals more quickly and effectively.