Tag: AWS

  • From Connected to Intelligent: The Evolution of Smart Homes

    Overview:

    From futuristic speculation to everyday reality, smart homes can go well beyond connected devices to become intelligent, collaborative, reactive, and adaptable environments. This can be achieved by using Multi-agent AI Systems (MAS) to unify IoT devices and lay a solid foundation for innovation and for more seamless, secure living.

    This remarkable growth of smart homes brings both opportunities and challenges. In this whitepaper, we’ll explore both, moving from the general – market overview and predictions – to the specific – blueprint architecture and use cases – using AWS Harmony.

    Here’s a breakdown of the whitepaper:

    • The Smart Homes market landscape: what is the current state and changes to expect
    • Multi-Agent AI Systems (MAS): how they work and why they’re transforming Smart Homes
    • The technology behind MAS: capabilities, practical applications and benefits
    • Smart Homes on AWS Harmony: blueprint of Agentic AI as the foundation for next-gen experiences
    • Use case for sustainable living: a hybrid Edge + Cloud IoT high-level architecture to implement for energy saving

  • Iceberg: Features and Hands-on (Part 2)

    In the previous blog, we discussed Apache Iceberg’s basic concepts, its setup process, and how to load data. In this post, we will delve into some of Iceberg’s advanced features, including upsert functionality, schema evolution, time travel, and partitioning.

    Upsert Functionality

    One of Iceberg’s key features is its support for upserts. Upsert, which stands for update and insert, allows you to efficiently manage changes to your data. With Iceberg, you can perform these operations seamlessly, ensuring that your data remains accurate and up-to-date without the need for complex and time-consuming processes.

    Schema Evolution

    Schema evolution is another of its powerful features. Over time, the schema of your data may need to change due to new requirements or updates. Iceberg handles schema changes gracefully, allowing you to add, remove, or modify columns without having to rewrite your entire dataset. This flexibility ensures that your data architecture can evolve in tandem with your business needs.

    Time Travel

    Iceberg also provides time travel capabilities, enabling you to query historical data as it existed at any given point in time. This feature is particularly useful for debugging, auditing, and compliance purposes. By leveraging snapshots, you can easily access previous states of your data and perform analyses on how it has changed over time.
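    For example, in Spark SQL (with the Iceberg extensions enabled; the `VERSION AS OF` / `TIMESTAMP AS OF` syntax requires a recent Spark release, and the snapshot ID below is a hypothetical value), time travel looks like this:

    ```sql
    -- List the snapshots recorded for the table
    SELECT snapshot_id, committed_at FROM demo.db.data_sample.snapshots;

    -- Query the table as of a specific snapshot or point in time
    SELECT * FROM demo.db.data_sample VERSION AS OF 10963874102873;
    SELECT * FROM demo.db.data_sample TIMESTAMP AS OF '2023-01-01 00:00:00';
    ```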

    Set up Iceberg on the local machine using the local catalog option or Hive

    You can also configure Iceberg in your Spark session like this:

    import pyspark

    # Note the trailing backslashes: each .config() call continues the builder chain
    spark = pyspark.sql.SparkSession.builder \
        .config('spark.jars.packages', 'org.apache.iceberg:iceberg-spark-runtime-3.2_2.12:1.1.0') \
        .config('spark.sql.extensions', 'org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions') \
        .config('spark.sql.catalog.spark_catalog', 'org.apache.iceberg.spark.SparkSessionCatalog') \
        .config('spark.sql.catalog.spark_catalog.type', 'hive') \
        .config('spark.sql.catalog.local', 'org.apache.iceberg.spark.SparkCatalog') \
        .config('spark.sql.catalog.local.type', 'hadoop') \
        .config('spark.sql.catalog.local.warehouse', './Data-Engineering/warehouse') \
        .getOrCreate()

    A few such configurations must be passed while setting up Iceberg: the runtime JAR package, the SQL extensions, and the catalog settings.

    Create Tables in Iceberg and Insert Data

    CREATE TABLE demo.db.data_sample (
        index string, organization_id string, name string, website string,
        country string, description string, founded string, industry string,
        num_of_employees string
    ) USING iceberg

    df = spark.read.option("header", "true").csv("../data/input-data/organizations-100.csv")
    
    df.writeTo("demo.db.data_sample").append()

    We can either create the sample table using Spark SQL or directly write the data by mentioning the DB name and table name, which will create the Iceberg table for us.

    After the write, you can query the table to see the data we have inserted. Apart from appending, you can use the overwrite method, just as with Delta Lake tables.
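    Reading the data back from an Iceberg table works like any other Spark table; a minimal sketch, using the table created above:

    ```sql
    -- Read back the Iceberg table with Spark SQL
    SELECT * FROM demo.db.data_sample LIMIT 10;
    ```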

    Handling Upserts

    This Iceberg feature is similar to Delta Lake’s merge. You can update records in an existing Iceberg table without rewriting the complete dataset, which also makes it useful for handling CDC operations. We can take input from any incoming CSV file and merge it into the existing table without duplication, so the table always holds a single record per primary key. Iceberg performs the merge as an ACID transaction.

    Incoming Data 

    input_data = spark.read.option("header", "true").csv("../data/input-data/organizations-11111.csv")
    # Creating the temp view of that dataframe to merge
    input_data.createOrReplaceTempView("input_data")
    spark.sql("select * from input_data").show()

    We will merge this data into our existing Iceberg Table using Spark SQL.

    MERGE INTO demo.db.data_sample t
    USING (SELECT * FROM input_data) s
    ON t.organization_id = s.organization_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
    
    select * from demo.db.data_sample

    Here, we can see the data once the merge operation has taken place.

    Schema Evolution

    Iceberg supports the following schema evolution changes:

    • Add – Add a new column to the Iceberg table
    • Drop – Remove a column from an existing table
    • Rename – Change the name of a column in an existing table
    • Update – Change the data type or partition columns of the Iceberg table
    • Reorder – Change the order of columns in the Iceberg table

    After updating the schema, there is no need to overwrite or rewrite the data. For example, if your table has four columns, all holding data, and you add two more, you do not need to rewrite the existing data for the new six-column schema; you can still access it as before. This feature was lacking in Delta Lake but is present here. These are some characteristics of Iceberg schema evolution:

    1. If we add any columns, they won’t impact the existing columns.
    2. If we delete or drop any columns, they won’t impact other columns.
    3. Updating a column or field does not change values in any other column.

    Iceberg uses unique IDs to track each column added to a table.

    Let’s run some queries to update the schema, such as adding or deleting columns.

    %%sql
    
    ALTER TABLE demo.db.data_sample
    ADD COLUMN fare_per_distance_unit float AFTER num_of_employees;

    After adding another column, if we try to access the data again from the table, we can do so without seeing any kind of error. This is also how Iceberg solves schema-related problems.
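    The other schema changes from the list above follow the same pattern; a sketch against the same table (the column names are taken from the earlier CREATE TABLE and ADD COLUMN examples):

    ```sql
    -- Rename a column; Iceberg tracks columns by ID, so data files are untouched
    ALTER TABLE demo.db.data_sample RENAME COLUMN num_of_employees TO employee_count;

    -- Drop a column; existing data files are not rewritten
    ALTER TABLE demo.db.data_sample DROP COLUMN fare_per_distance_unit;
    ```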

    Partition Evolution and Sort Order Evolution

    Iceberg offers partition evolution, an option that was missing in Delta Lake. When you evolve a partition spec, the old data written with an earlier spec remains unchanged, while new data is written in a new layout using the new spec. Metadata for each partition version is kept separately. Because of this, query planning uses split planning: each partition layout plans its files separately, using the filter derived for that specific layout.

    Similar to partition spec, Iceberg sort order can also be updated in an existing table. When you evolve a sort order, the old data written with an earlier order remains unchanged.

    %%sql
    
    ALTER TABLE demo.db.data_sample ADD PARTITION FIELD founded;
    DESCRIBE TABLE demo.db.data_sample;
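    Sort order evolution uses a similar ALTER statement; a sketch, assuming the `founded` column is a sensible sort key:

    ```sql
    -- Change the table's write sort order; previously written data keeps its old order
    ALTER TABLE demo.db.data_sample WRITE ORDERED BY founded;
    ```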

    Copy-on-Write (COW) and Merge-on-Read (MOR)

    Iceberg supports both COW and MOR while loading data into an Iceberg table. We can configure this either while creating the Iceberg table or later by altering it.

    Copy-On-Write (COW) – Best for tables with frequent reads, infrequent writes/updates, or large batch updates:

    When your requirement is to read frequently but write and update less often, you can configure this mode on an Iceberg table. In COW, when we update or delete any rows, a new version of the affected data files is written, and the latest snapshot points to the updated data. Because data files are rewritten whenever updates or deletions occur, large updates can be slow and become a bottleneck. As the name specifies, it creates another copy of the data on write.

    Reads, on the other hand, are ideal: since nothing needs to be merged at query time, the data can be read faster.

    Merge-On-Read (MOR) – Best for tables with frequent writes/updates:

    This is just the opposite of COW: data files are not rewritten on the update or deletion of rows. Instead, Iceberg writes change (delete) files recording the updated records, and these are merged with the original data files at read time to produce the updated view. Writes are therefore faster, at the cost of some read overhead.
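    These modes are controlled through table properties; a hedged sketch (property names per Iceberg’s write-mode configuration, which requires format version 2):

    ```sql
    -- Switch an existing table's delete/update/merge behavior to merge-on-read
    ALTER TABLE demo.db.data_sample SET TBLPROPERTIES (
        'write.delete.mode' = 'merge-on-read',
        'write.update.mode' = 'merge-on-read',
        'write.merge.mode'  = 'merge-on-read'
    );
    ```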

    Supported query engines and integrations: Iceberg tables can be read and written by engines such as Spark, Trino, Presto, Hive, and Amazon Athena.

    Conclusion

    After performing this research, we learned about Iceberg’s features and its compatibility with various metastores for integrations. We got a basic idea of configuring Iceberg on different cloud platforms as well as locally, and covered the basics of upserts, schema evolution, and partition evolution.

  • Iceberg – Introduction and Setup (Part – 1)

    As discussed in our previous Delta Lake blog, several table formats are already in use, each highly capable and with its own benefits. Iceberg is one of them, and in this blog we will discuss it.

    What is Apache Iceberg?

    Iceberg is an open-source Apache table format used to handle large amounts of data stored locally or on various cloud storage platforms. Netflix developed Iceberg to solve its big data problems, then donated it to Apache, and it became open source in 2018. Iceberg now has a large number of contributors all over the world on GitHub and is one of the most widely used table formats.

    Iceberg mainly solves all the key problems we once faced when using the Hive table format to deal with data stored on various cloud storage like S3.

    Iceberg offers features and capabilities similar to SQL tables. Because it is open source, multiple engines like Spark can operate on it to perform transformations, and it provides full ACID properties. This is a quick introduction to Iceberg, covering its features and initial setup.

    Why go with Iceberg

    The main reason to use Iceberg is that it performs better when loading data from S3, or when metadata lives on a cloud storage medium. Hive tracks data at the folder level, which can decrease performance; Iceberg instead tracks data at the file level, which is why we choose it. Here is the hierarchy Iceberg uses while saving data into its tables. Each Iceberg table is a combination of four file types: the snapshot metadata file, manifest list, manifest files, and data files.

    1. Snapshot Metadata File:  This file holds the metadata information about the table, such as the schema, partitions, and manifest list.
    2. Manifest List:  This list records each manifest file along with the path and metadata information. At this point, Iceberg decides which manifest files to ignore and which to read.
    3. Manifest File: This file contains the paths to real data files, which hold the real data along with the metadata.
    4. Data File: The actual Parquet, ORC, or Avro files holding the real data.
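    You can inspect these layers directly through Iceberg’s metadata tables; a sketch, assuming a catalog and table named like the examples later in this post:

    ```sql
    -- Snapshot history of the table
    SELECT * FROM demo.db.data_sample.snapshots;

    -- Manifest files behind the current snapshot
    SELECT * FROM demo.db.data_sample.manifests;

    -- Individual data files with per-file statistics
    SELECT * FROM demo.db.data_sample.files;
    ```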

    Features of Iceberg:

    Some Iceberg features include:

    • Schema Evolution: Iceberg allows you to evolve your schema without having to rewrite your data. This means you can easily add, drop, or rename columns, providing flexibility to adapt to changing data requirements without impacting existing queries.
    • Partition Evolution: Iceberg supports partition evolution, enabling you to modify the partitioning scheme as your data and query patterns evolve. This feature helps maintain query performance and optimize data layout over time.
    • Time Travel: Iceberg’s time travel feature allows you to query historical versions of your data. This is particularly useful for debugging, auditing, and recreating analyses based on past data states.
    • Multiple Query Engine Support: Iceberg supports multiple query engines, including Trino, Presto, Hive, and Amazon Athena. This interoperability ensures that you can read and write data across different tools seamlessly, facilitating a more versatile and integrated data ecosystem.
    • AWS Support: Iceberg is well-integrated with AWS services, making it easy to use with Amazon S3 for storage and other AWS analytics services. This integration helps leverage the scalability and reliability of AWS infrastructure for your data lake.
    • ACID Compliance: Iceberg ensures ACID (Atomicity, Consistency, Isolation, Durability) transactions, providing reliable data consistency and integrity. This makes it suitable for complex data operations and concurrent workloads, ensuring data reliability and accuracy.
    • Hidden Partitioning: Iceberg’s hidden partitioning abstracts the complexity of managing partitions from the user, automatically handling partition management to improve query performance without manual intervention.
    • Snapshot Isolation: Iceberg supports snapshot isolation, enabling concurrent read and write operations without conflicts. This isolation ensures that users can work with consistent views of the data, even as it is being updated.
    • Support for Large Tables: Designed for high scalability, Iceberg can efficiently handle petabyte-scale tables, making it ideal for large datasets typical in big data environments.
    • Compatibility with Modern Data Lakes: Iceberg’s design is tailored for modern data lake architectures, supporting efficient data organization, metadata management, and performance optimization, aligning well with contemporary data management practices.

    These features make Iceberg a powerful and flexible table format for managing data lakes, ensuring efficient data processing, robust performance, and seamless integration with various tools and platforms. By leveraging Iceberg, organizations can achieve greater data agility, reliability, and efficiency, enhancing their data analytics capabilities and driving better business outcomes.
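    As an illustration of hidden partitioning, Iceberg can partition by a transform of a column, so queries on the source column prune partitions automatically without a separate partition column. A sketch (the `demo.db.events` table and its columns are hypothetical):

    ```sql
    -- Partition by derived values; filters on ts/id prune partitions automatically
    CREATE TABLE demo.db.events (
        id BIGINT,
        ts TIMESTAMP,
        payload STRING
    ) USING iceberg
    PARTITIONED BY (days(ts), bucket(16, id));
    ```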

    Prerequisite:

    • PySpark: Ensure that you have PySpark installed and properly configured. PySpark provides the Python API for Spark, enabling you to harness the power of distributed computing with Spark using Python.
    • Python: Make sure you have Python installed on your system. Python is essential for writing and running your PySpark scripts. It’s recommended to use a virtual environment to manage your dependencies effectively.
    • Iceberg-Spark JAR: Download the appropriate Iceberg-Spark JAR file that corresponds to your Spark version. This JAR file is necessary to integrate Iceberg with Spark, allowing you to utilize Iceberg’s advanced table format capabilities within your Spark jobs.
    • Jars to Configure Cloud Storage: Obtain and configure the necessary JAR files for your specific cloud storage provider. For example, if you are using Amazon S3, you will need the hadoop-aws JAR and its dependencies. For Google Cloud Storage, you need the gcs-connector JAR. These JARs enable Spark to read from and write to cloud storage systems.
    • Spark and Hadoop Configuration: Ensure your Spark and Hadoop configurations are correctly set up to integrate with your cloud storage. This might include setting the appropriate access keys, secret keys, and endpoint configurations in your spark-defaults.conf and core-site.xml.
    • Iceberg Configuration: Configure Iceberg settings specific to your environment. This might include catalog configurations (e.g., Hive, Hadoop, AWS Glue) and other Iceberg properties that optimize performance and compatibility.
    • Development Environment: Set up a development environment with an IDE or text editor that supports Python and Spark development, such as IntelliJ IDEA with the PyCharm plugin, Visual Studio Code, or Jupyter Notebooks.
    • Data Source Access: Ensure you have access to the data sources you will be working with, whether they are in cloud storage, relational databases, or other data repositories. Proper permissions and network configurations are necessary for seamless data integration.
    • Basic Understanding of Data Lakes: A foundational understanding of data lake concepts and architectures will help effectively utilize Iceberg. Knowledge of how data lakes differ from traditional data warehouses and their benefits will also be helpful.
    • Version Control System: Use a version control system like Git to manage your codebase. This helps in tracking changes, collaborating with team members, and maintaining code quality.
    • Documentation and Resources: Familiarize yourself with Iceberg documentation and other relevant resources. This will help you troubleshoot issues, understand best practices, and leverage advanced features effectively.

    You can download the runtime JAR according to the Spark version installed on your machine or cluster; the setup is the same as for Delta Lake. You can either download these JAR files to your machine or cluster and pass them to the spark-submit command, or fetch them while initializing the Spark session by passing them in the Spark config as JAR packages with the appropriate version.

    To use cloud storage, we use these JARs with an S3 bucket for reading and writing Iceberg tables. Here is a basic example of a Spark session:

    import pyspark

    AWS_ACCESS_KEY_ID = "XXXXXXXXXXXXXX"
    AWS_SECRET_ACCESS_KEY = "XXXXXXXXXXXXXX"  # never publish real credentials
    
    spark_jars_packages = "com.amazonaws:aws-java-sdk:1.12.246,org.apache.hadoop:hadoop-aws:3.2.2,org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:1.1.0"
    
    # Trailing backslashes continue the builder chain
    spark = pyspark.sql.SparkSession.builder \
       .config("spark.jars.packages", spark_jars_packages) \
       .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions") \
       .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog") \
       .config("spark.sql.catalog.demo.warehouse", "s3a://abhishek-test-01012023/iceberg-sample-data/") \
       .config("spark.sql.catalog.demo.type", "hadoop") \
       .config("spark.hadoop.fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider") \
       .config("spark.driver.memory", "20g") \
       .config("spark.memory.offHeap.enabled", "true") \
       .config("spark.memory.offHeap.size", "8g") \
       .getOrCreate()
    
    spark.sparkContext._jsc.hadoopConfiguration().set("fs.s3a.access.key", AWS_ACCESS_KEY_ID)
    spark.sparkContext._jsc.hadoopConfiguration().set("fs.s3a.secret.key", AWS_SECRET_ACCESS_KEY)

    Iceberg Setup Using Docker

    You can set and configure AWS credentials, as well as database- or stream-related configs, inside the docker-compose file.

    version: "3"
    
    services:
      spark-iceberg:
        image: tabulario/spark-iceberg
        container_name: spark-iceberg
        build: spark/
        depends_on:
          - rest
          - minio
        volumes:
          - ./warehouse:/home/iceberg/warehouse
          - ./notebooks:/home/iceberg/notebooks/notebooks
          - ./data:/home/iceberg/data
        environment:
          - AWS_ACCESS_KEY_ID=admin
          - AWS_SECRET_ACCESS_KEY=password
          - AWS_REGION=us-east-1
        ports:
          - 8888:8888
          - 8080:8080
        links:
          - rest:rest
          - minio:minio
      rest:
        image: tabulario/iceberg-rest:0.1.0
        ports:
          - 8181:8181
        environment:
          - AWS_ACCESS_KEY_ID=admin
          - AWS_SECRET_ACCESS_KEY=password
          - AWS_REGION=us-east-1
          - CATALOG_WAREHOUSE=s3a://warehouse/wh/
          - CATALOG_IO__IMPL=org.apache.iceberg.aws.s3.S3FileIO
          - CATALOG_S3_ENDPOINT=http://minio:9000
      minio:
        image: minio/minio
        container_name: minio
        environment:
          - MINIO_ROOT_USER=admin
          - MINIO_ROOT_PASSWORD=password
        ports:
          - 9001:9001
          - 9000:9000
        command: ["server", "/data", "--console-address", ":9001"]
      mc:
        depends_on:
          - minio
        image: minio/mc
        container_name: mc
        environment:
          - AWS_ACCESS_KEY_ID=admin
          - AWS_SECRET_ACCESS_KEY=password
          - AWS_REGION=us-east-1
        entrypoint: >
          /bin/sh -c "
          until (/usr/bin/mc config host add minio http://minio:9000 admin password) do echo '...waiting...' && sleep 1; done;
          /usr/bin/mc rm -r --force minio/warehouse;
          /usr/bin/mc mb minio/warehouse;
          /usr/bin/mc policy set public minio/warehouse;
          exit 0;
          " 

    Save this file as docker-compose.yaml and run the command: docker compose up. Now, you can log into your container using this command:

    docker exec -it <container-id> bash

    You can mount the sample data directory in a container or copy it from your local machine to the container. To copy data into the Docker directory, we can use the docker cp command.

    docker cp input-data <Container ID>:/home/iceberg/data 

    Setup S3 As a Warehouse in Iceberg, Read Data from the S3, and Write Iceberg Tables in the S3 Again Using an EC2 Instance  

    We have generated 90 GB of data here using a Spark job and stored it in the S3 bucket.

    import pyspark

    AWS_ACCESS_KEY_ID = "XXXXXXXXXXX"
    AWS_SECRET_ACCESS_KEY = "XXXXXXXXXXX+XXXXXXXXXXX"
    
    spark_jars_packages = "com.amazonaws:aws-java-sdk:1.12.246,org.apache.hadoop:hadoop-aws:3.2.2,org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:1.1.0"
    
    spark = pyspark.sql.SparkSession.builder \
       .config("spark.jars.packages", spark_jars_packages) \
       .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions") \
       .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog") \
       .config("spark.sql.catalog.demo.warehouse", "s3a://abhishek-test-01012023/iceberg-sample-data/") \
       .config("spark.sql.catalog.demo.type", "hadoop") \
       .config("spark.hadoop.fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider") \
       .config("spark.driver.memory", "20g") \
       .config("spark.memory.offHeap.enabled", "true") \
       .config("spark.memory.offHeap.size", "8g") \
       .getOrCreate()
    
    spark.sparkContext._jsc.hadoopConfiguration().set("fs.s3a.access.key", AWS_ACCESS_KEY_ID)
    spark.sparkContext._jsc.hadoopConfiguration().set("fs.s3a.secret.key", AWS_SECRET_ACCESS_KEY)

    Step 1

    We read the data in Spark and create an Iceberg table from it, storing the Iceberg table in the same S3 bucket.

    Some Iceberg functionality won’t work if the appropriate Iceberg JAR has not been installed or used. The Iceberg version must be compatible with the Spark version you are using; otherwise, some features, such as partitioning, will throw a NoSuchMethodError. This must be taken care of carefully while setting this up, whether on EC2 or EMR.

    Create an Iceberg table on S3 and write data into that table. The sample data was generated using a Spark job for the Delta tables; we reuse the same data, with the schema shown below.

    Step 2

    We created Iceberg tables at the S3 bucket location and wrote the data, with partition columns, into the same S3 bucket.

    import logging

    spark.sql("""CREATE TABLE IF NOT EXISTS demo.db.iceberg_data_2 (id INT, first_name String,
    last_name String, address String, pincode INT, net_income INT, source_of_income String,
    state String, email_id String, description String, population INT, population_1 String,
    population_2 String, population_3 String, population_4 String, population_5 String, population_6 String,
    population_7 String, date INT)
    USING iceberg
    TBLPROPERTIES ('format'='parquet', 'format-version' = '2')
    PARTITIONED BY (`date`)
    LOCATION 's3a://abhishek-test-01012023/iceberg_v2/db/iceberg_data_2'""")
    
    # Read the data that needs to be written:
    # load the Delta table data into a Spark DataFrame
    df = spark.read.parquet("s3a://abhishek-test-01012023/delta-lake-sample-data/")
    
    logging.info("Starting writing the data")
    
    df.sortWithinPartitions("date").writeTo("demo.db.iceberg_data_2").partitionedBy("date").createOrReplace()
    
    logging.info("Writing has been finished")
    
    logging.info("Query the data from iceberg using spark SQL")
    
    spark.sql("DESCRIBE TABLE demo.db.iceberg_data_2").show()
    spark.sql("SELECT * FROM demo.db.iceberg_data_2 LIMIT 10").show()

    This is how we can use Iceberg over S3. There is another option: we can also create Iceberg tables in the AWS Glue catalog. Most tables created in the Glue catalog using Athena are external tables, used after generating the manifest files, as with Delta Lake.

    Step 3

    We print the Iceberg table’s data along with the table descriptions. 

    Using Iceberg, we can directly create the table in the Glue catalog using Athena, and it supports all read and write operations on the available data. These are the configurations that need to be used in Spark when using the Glue catalog:

    {
        "conf": {
            "spark.sql.catalog.glue_catalog1": "org.apache.iceberg.spark.SparkCatalog",
            "spark.sql.catalog.glue_catalog1.warehouse": "s3://YOUR-BUCKET-NAME/iceberg/glue_catalog1/tables/",
            "spark.sql.catalog.glue_catalog1.catalog-impl": "org.apache.iceberg.aws.glue.GlueCatalog",
            "spark.sql.catalog.glue_catalog1.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
            "spark.sql.catalog.glue_catalog1.lock-impl": "org.apache.iceberg.aws.glue.DynamoLockManager",
            "spark.sql.catalog.glue_catalog1.lock.table": "myGlueLockTable",
            "spark.sql.extensions": "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions"
        }
    }

    Now, we can easily create the Iceberg table using Spark or Athena, and it will be accessible much like a Delta table. We can perform upserts, too.

    Conclusion

    We’ve learned the basics of the Iceberg table format, its features, and the reasons for choosing Iceberg. We discussed how Iceberg provides significant advantages such as schema evolution, partition evolution, hidden partitioning, and ACID compliance, making it a robust choice for managing large-scale data. We also delved into the fundamental setup required to implement this table format, including configuration and integration with data processing engines like Apache Spark and query engines like Presto and Trino. By leveraging Iceberg, organizations can ensure efficient data management and analytics, facilitating better performance and scalability. With this knowledge, you are well-equipped to start using Iceberg for your data lake needs, ensuring a more organized, scalable, and efficient data infrastructure.

  • Strategies for Cost Optimization Across Amazon EKS Clusters

    Fast-growing tech companies rely heavily on Amazon EKS clusters to host a variety of microservices and applications. The pairing of Amazon EKS for managing the Kubernetes Control Plane and Amazon EC2 for flexible Kubernetes nodes creates an optimal environment for running containerized workloads. 

    With the increasing scale of operations, optimizing costs across multiple EKS clusters has become a critical priority. This blog will demonstrate how we can leverage various tools and strategies to analyze, optimize, and manage EKS costs effectively while maintaining performance and reliability. 

    Cost Analysis:

    Cost analysis is an absolutely necessary first step before working on cost optimization: data plays an important role here, so trust your data. The total cost of operating an EKS cluster encompasses several components. The EKS Control Plane (or Master Node) incurs a fixed cost of $0.20 per hour, offering straightforward pricing.

    Meanwhile, EC2 instances, serving as the cluster’s nodes, introduce various cost factors, such as block storage and data transfer, which can vary significantly based on workload characteristics. For this discussion, we’ll focus primarily on two aspects of EC2 cost: instance hours and instance pricing. Let’s look at how to do the cost analysis on your EKS cluster.

    • Tool Selection: We can begin our cost analysis journey by selecting Kubecost, a powerful tool specifically designed for Kubernetes cost analysis. Kubecost provides granular insights into resource utilization and costs across our EKS clusters.
    • Deployment and Usage: Deploying Kubecost is straightforward. We can integrate it with our Kubernetes clusters following the provided documentation. Kubecost’s intuitive dashboard allows us to visualize resource usage, cost breakdowns, and cost allocation by namespace, pod, or label. Once deployed, you can see the Kubecost overview page in your browser by port-forwarding the Kubecost k8s service. It might take 5-10 minutes for Kubecost to gather metrics. You can see your Amazon EKS spend, including cumulative cluster costs, associated Kubernetes asset costs, and monthly aggregated spend.
    • Cluster Level Cost Analysis: For multi-cluster cost analysis and cluster-level scoping, consider using an AWS tagging strategy and tag your EKS clusters; the AWS tagging documentation covers this in detail. You can then view your cost analysis in AWS Cost Explorer, which provides additional insights into your AWS usage and spending trends. By analyzing cost and usage data at a granular level, we can identify areas for further optimization and cost reduction.
    • Multi-Cluster Cost Analysis using Kubecost and Prometheus: The Kubecost deployment comes with a Prometheus server to which it sends cost analysis metrics. For multiple EKS clusters, we can enable a remote Prometheus server, either AWS-managed or self-managed. To get cost analysis metrics from multiple clusters, we need to run Kubecost with an additional SigV4 pod that sends individual and combined cluster metrics to a common Prometheus server. You can follow the AWS documentation for Multi-Cluster Cost Analysis using Kubecost and Prometheus.

    Cost Optimization Strategies:

    Based on the cost analysis, the next step is to plan your cost optimization strategies. As explained in the previous section, the Control Plane has a fixed cost and straightforward pricing model. So, we will focus mainly on optimizing the data nodes and optimizing the application configuration. Let’s look at the following strategies when optimizing the cost of the EKS cluster and supporting AWS services:

    • Right Sizing: On the cost optimization pillar of the AWS Well-Architected Framework, we find a section on Cost-Effective Resources, which describes Right Sizing as:

    “… using the lowest cost resource that still meets the technical specifications of a specific workload.”

    • Application Right Sizing: Right-sizing is the strategy to optimize pod resources by allocating the appropriate CPU and memory resources to pods. Care must be taken to try to set requests that align as close as possible to the actual utilization of these resources. If the value is too low, then the containers may experience throttling of the resources and impact the performance. However, if the value is too high, then there is waste, since those unused resources remain reserved for that single container. When actual utilization is lower than the requested value, the difference is called slack cost. A tool like kube-resource-report is valuable for visualizing the slack cost and right-sizing the requests for the containers in a pod. Installation instructions demonstrate how to install via an included helm chart.

      helm upgrade --install kube-resource-report chart/kube-resource-report


      You can also consider tools like the VPA recommender with Goldilocks to gain insight into your pod resource consumption and utilization.
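    For illustration, a right-sized workload might look like the following Deployment, where the request values are assumed to come from observed utilization (for example, via kube-resource-report); the workload name, image, and numbers here are hypothetical:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server                    # hypothetical workload
spec:
  replicas: 2
  selector:
    matchLabels:
      app: api-server
  template:
    metadata:
      labels:
        app: api-server
    spec:
      containers:
        - name: api-server
          image: example/api-server:1.0   # placeholder image
          resources:
            requests:                 # set close to observed utilization
              cpu: 250m
              memory: 256Mi
            limits:                   # headroom for bursts without large slack
              cpu: 500m
              memory: 512Mi
```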


    • Compute Right Sizing: Application right-sizing and Kubecost analysis are prerequisites for right-sizing EKS compute. Here are several strategies for compute right sizing:
      • Mixed Instance Auto Scaling group: Employ a mixed instance policy to create a diversified pool of instances within your auto scaling group. This mix can include both spot and on-demand instances. However, it’s advisable not to mix instances of different sizes within the same Node group.
      • Node Groups, Taints, and Tolerations: Utilize separate Node Groups with varying instance sizes for different application requirements. For example, use distinct node groups for GPU-intensive and CPU-intensive applications. Use taints and tolerations to ensure applications are deployed on the appropriate node group.
      • Graviton Instances: Explore the adoption of Graviton Instances, which offer up to 40% better price performance compared to traditional instances. Consider migrating to Graviton Instances to optimize costs and enhance application performance.
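    The node-group strategies above can be sketched in an eksctl cluster config; the cluster name, instance types, and taint values below are illustrative assumptions, not prescriptions:

```yaml
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: demo-cluster                  # hypothetical cluster name
  region: us-east-1
managedNodeGroups:
  - name: cpu-workers                 # Graviton Spot nodes; one size per group
    instanceTypes: ["m7g.xlarge"]
    spot: true
  - name: gpu-workers                 # dedicated group for GPU workloads
    instanceTypes: ["g5.xlarge"]
    taints:
      - key: workload                 # only pods with a matching toleration land here
        value: gpu
        effect: NoSchedule
```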
    • Purchase Options: Another part of the cost optimization pillar of the AWS Well-Architected Framework that we can apply comes from the Purchasing Options section, which says:

    “Spot Instances allow you to use spare compute capacity at a significantly lower cost than On-Demand EC2 instances (up to 90%).”

    Understanding purchase options for Amazon EC2 is crucial for cost optimization. The Amazon EKS data plane consists of worker nodes or serverless compute resources responsible for running Kubernetes application workloads. These nodes can utilize different capacity types and purchase options, including On-Demand, Spot Instances, Savings Plans, and Reserved Instances.

    On-Demand and Spot capacity offer flexibility without spending commitments. On-Demand instances are billed based on runtime and guarantee availability at On-Demand rates, while Spot instances offer discounted rates but are preemptible. Both options are suitable for temporary or bursty workloads, with Spot instances being particularly cost-effective for applications tolerant of compute availability fluctuations. 

    Reserved Instances involve upfront spending commitments over one or three years for discounted rates. Once a steady-state resource consumption profile is established, Reserved Instances or Savings Plans become effective. Savings Plans, introduced as a more flexible alternative to Reserved Instances, allow for commitments based on a “US Dollar spend amount,” irrespective of provisioned resources. There are two types: Compute Savings Plans, offering flexibility across instance types, Fargate, and Lambda charges, and EC2 Instance Savings Plans, providing deeper discounts but restricting compute choice to an instance family.

    Tailoring your approach to your workload can significantly impact cost optimization within your EKS cluster. For non-production environments, leveraging Spot Instances exclusively can yield substantial savings. Meanwhile, implementing Mixed-Instances Auto Scaling Groups for production workloads allows for dynamic scaling and cost optimization. Additionally, for steady workloads, investing in a Savings Plan for EC2 instances can provide long-term cost benefits. By strategically planning and optimizing your EC2 instances, you can achieve a notable reduction in your overall EKS compute costs, potentially reaching savings of approximately 60-70%.
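    As a back-of-the-envelope illustration of mixed-instance savings, the following sketch computes the blended hourly rate for a group that is 60% Spot and 40% On-Demand. The On-Demand rate and Spot discount are assumed values; real rates vary by instance type, Region, and over time:

```shell
# Hypothetical example: blended cost of a mixed-instance Auto Scaling group.
# Assumes a $0.096/hr On-Demand rate and a 70% Spot discount.
ONDEMAND_RATE=0.096
SPOT_DISCOUNT=0.70
ONDEMAND_SHARE=0.40   # 40% of capacity On-Demand
SPOT_SHARE=0.60       # 60% of capacity Spot

BLENDED=$(awk -v od=$ONDEMAND_RATE -v d=$SPOT_DISCOUNT -v os=$ONDEMAND_SHARE -v ss=$SPOT_SHARE \
  'BEGIN { printf "%.4f", od*os + od*(1-d)*ss }')
SAVINGS=$(awk -v od=$ONDEMAND_RATE -v b=$BLENDED 'BEGIN { printf "%.0f", (1-b/od)*100 }')
echo "Blended rate: \$$BLENDED/hr (~${SAVINGS}% below On-Demand)"
# prints: Blended rate: $0.0557/hr (~42% below On-Demand)
```

    Layering a Savings Plan on the On-Demand share of a steady workload pushes the effective discount higher still.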

    The cost optimization pillar also addresses matching supply and demand:

    “… this (matching supply and demand) is accomplished using Auto Scaling, which helps you to scale your EC2 instances and Spot Fleet capacity up or down automatically according to conditions you define.”

    • Cluster Autoscaling: Therefore, a prerequisite to cost optimization on a Kubernetes cluster is to ensure you have Cluster Autoscaler running. This tool performs two critical functions in the cluster. First, it will monitor the cluster for pods that are unable to run due to insufficient resources. Whenever this occurs, the Cluster Autoscaler will update the Amazon EC2 Auto Scaling group to increase the desired count, resulting in additional nodes in the cluster. Additionally, the Cluster Autoscaler will detect nodes that have been underutilized and reschedule pods onto other nodes. Cluster Autoscaler will then decrease the desired count for the Auto Scaling group to scale in the number of nodes.

    The Amazon EKS User Guide has a great section on the configuration of the Cluster Autoscaler. There are a couple of things to pay attention to when configuring the Cluster Autoscaler:

    IAM Roles for Service Accounts (IRSA) – Cluster Autoscaler requires access to update the desired capacity in the Auto Scaling group. The recommended approach is to create a new IAM role with the required policies and a trust policy that restricts access to the service account used by Cluster Autoscaler. The role ARN must then be provided as an annotation on the service account:

    apiVersion: v1
    kind: ServiceAccount
    metadata:
      name: cluster-autoscaler
      annotations:
        eks.amazonaws.com/role-arn: arn:aws:iam::000000000000:role/my_role_name

    Auto-Discovery Setup

    Set up Cluster Autoscaler for auto-discovery by enabling the --node-group-auto-discovery flag as an argument. Also, make sure to tag your EKS nodes’ Auto Scaling groups with the following tags:

    k8s.io/cluster-autoscaler/enabled,
    k8s.io/cluster-autoscaler/<cluster-name>


    Auto Scaling Group per AZ – When Cluster Autoscaler scales out, it simply increases the desired count for the Auto Scaling group, leaving the responsibility for launching new EC2 instances to the AWS Auto Scaling service. If an Auto Scaling group is configured for multiple availability zones, then the new instance may be provisioned in any of those availability zones.

    For deployments that use persistent volumes, you will need to provision a separate Auto Scaling group for each availability zone. This way, when Cluster Autoscaler detects the need to scale out in response to a given pod, it can target the correct availability zone for the scale-out based on persistent volume claims that already exist in a given availability zone.

    When using multiple Auto Scaling groups, be sure to include the following argument in the pod specification for Cluster Autoscaler:

    --balance-similar-node-groups=true
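    Put together, the relevant Cluster Autoscaler container arguments look roughly like the following fragment (not a full manifest; the image tag is an example and `<cluster-name>` is a placeholder for your cluster):

```yaml
spec:
  containers:
    - name: cluster-autoscaler
      image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.28.2   # example tag; match your K8s version
      command:
        - ./cluster-autoscaler
        - --balance-similar-node-groups=true
        # discover per-AZ Auto Scaling groups by the tags described above
        - --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/<cluster-name>
```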

    • Pod Autoscaling: Now that Cluster Autoscaler is running in the cluster, you can be confident that the instance hours will align closely with the demand from pods within the cluster. Next up is to use Horizontal Pod Autoscaler (HPA) to scale out or in the number of pods for a deployment based on specific metrics for the pods to optimize pod hours and further optimize our instance hours.

    The HPA controller is included with Kubernetes, so all that is required to configure HPA is to ensure that the Kubernetes metrics server is deployed in your cluster and then to define HPA resources for your deployments. For example, the following HPA resource is configured to monitor the CPU utilization for a deployment named nginx-ingress-controller. HPA will then scale out or in the number of pods between 1 and 5 to target an average CPU utilization of 80% across all the pods:

    apiVersion: autoscaling/v1
    kind: HorizontalPodAutoscaler
    metadata:
      name: nginx-ingress-controller
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: nginx-ingress-controller
      minReplicas: 1
      maxReplicas: 5
      targetCPUUtilizationPercentage: 80

    The combination of Cluster Autoscaler and Horizontal Pod Autoscaler is an effective way to keep EC2 instance hours tied as close as possible to the actual utilization of the workloads running in the cluster.

    “Systems can be scheduled to scale out or in at defined times, such as the start of business hours, thus ensuring that resources are available when users arrive.”

    There are many deployments that only need to be available during business hours. A tool named kube-downscaler can be deployed to the cluster to scale deployments in and out based on the time of day.

    Some example use cases of kube-downscaler are:

    • Deploy the downscaler to a test (non-prod) cluster with a default uptime or downtime time range to scale down all deployments during the night and weekend.
    • Deploy the downscaler to a production cluster without any default uptime/downtime setting and scale down specific deployments by setting the downscaler/uptime (or downscaler/downtime) annotation. This might be useful for internal tooling front ends, which are only needed during work time.
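    For the second use case, the annotation goes on the workload itself. A sketch, using kube-downscaler's documented `downscaler/uptime` annotation (the deployment name and schedule here are hypothetical):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: internal-dashboard            # hypothetical internal tooling front end
  annotations:
    # keep this deployment up only during work hours; scaled to zero otherwise
    downscaler/uptime: Mon-Fri 08:00-18:00 Europe/Berlin
```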
    • AWS Fargate with EKS: AWS Fargate is a serverless compute engine that lets you run Kubernetes pods without provisioning or managing EC2 worker nodes.

    AWS Fargate pricing is based on usage (pay-per-use), with no upfront charges. There is, however, a one-minute minimum charge, and all charges are rounded up to the nearest second. You are also billed for any additional services you use, such as CloudWatch utilization charges and data transfer fees. Fargate can also reduce your management costs by reducing the number of DevOps professionals and tools you need to run Kubernetes on Amazon EKS.
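    A Fargate profile can be declared in an eksctl cluster config, which selects the pods that run on Fargate by namespace (and optionally labels); the cluster, profile, and namespace names below are illustrative:

```yaml
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: demo-cluster                  # hypothetical cluster name
  region: us-east-1
fargateProfiles:
  - name: fp-batch
    selectors:
      - namespace: batch-jobs         # pods in this namespace run on Fargate
```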

    Conclusion:

    Effectively managing costs across multiple Amazon EKS clusters is essential for optimizing operations. By utilizing tools like Kubecost and AWS Cost Explorer, coupled with strategies such as right-sizing, mixed instance policies, and Spot Instances, organizations can streamline cost analysis and optimize resource allocation. Additionally, implementing auto-scaling mechanisms like Cluster Autoscaler ensures dynamic resource scaling based on demand, further optimizing costs. Leveraging AWS Fargate with EKS can eliminate the need to manage Kubernetes clusters, reducing management costs. Overall, by combining these strategies, organizations can achieve significant cost savings while maintaining performance and reliability in their containerized environments.