Introduction
Serverless has become a major trend, and it is here to stay. Think of wireless internet: there are still wires somewhere, but you don’t need to worry about them because you don’t maintain them. Similarly, serverless still runs on servers, but you don’t have to worry about handling or maintaining them. All you need to do is focus on your code and you’re good to go.
It has some more benefits, such as:
- Zero administration: You can deploy code without provisioning anything beforehand, or managing anything later. There is no concept of a fleet, an instance, or even an operating system.
- Auto-scaling: It lets your service providers manage the scaling challenges. You don’t need to fire alerts or write scripts to scale up and down. It handles quick bursts of traffic and weekend lulls the same way.
- Pay-per-use: Function-as-a-service compute and managed services are billed on actual usage rather than pre-provisioned capacity. You get full resource utilization without paying a cent for idle time. The result? Often substantial cost savings over an always-on cloud VM, and the satisfaction of knowing that you never pay for resources you don’t use.
What is AWS Athena?
AWS Athena is one such serverless service, though it is an interactive query service rather than a code deployment service.
Using Athena, you can directly query data stored in S3 buckets with standard ANSI SQL.
As mentioned earlier, it works on the serverless principle: there is no infrastructure to manage, and you pay only for the queries that you run.
Athena is easy to use. You can simply point to your data in Amazon S3, define the schema, and start querying using standard SQL. Most results are delivered within seconds. With Athena, there’s no need for complex ETL jobs to prepare your data for analysis. This makes it easy for anyone with SQL skills to quickly analyze large-scale datasets.
It is based on Presto, the distributed SQL engine open-sourced by Facebook, and can be used to query both structured and semi-structured data.
Some Exciting Features of Athena:
- Serverless, no ETL – there are no servers or data warehouses to set up and manage.
- Only pay for the data that is scanned.
- You can ensure better performance by compressing, partitioning, and converting your data into columnar formats.
- Can also handle complex analysis, including large joins, window functions, and arrays.
- Athena automatically executes queries in parallel.
- You only need to provide a path to an S3 folder; when new files are added there, they are automatically reflected in the table.
- Supports –
- CSV, JSON, Parquet, ORC, and Avro data formats
- Complex Joins and datatypes
- View creation
- Does not Support –
- User-defined functions and stored procedures
- Hive or Presto transactions
- LZO compression (Snappy is supported)
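The performance tips above (compression, partitioning, columnar formats) can all be applied at once with a CREATE TABLE AS SELECT (CTAS) statement. The following is only a sketch; the table names and S3 path are illustrative, not from a real setup:

```sql
-- Convert a CSV-backed table into compressed, partitioned Parquet.
-- Table names and the S3 location below are hypothetical examples.
CREATE TABLE bill_details_parquet
WITH (
  format = 'PARQUET',                       -- columnar format
  parquet_compression = 'SNAPPY',           -- compression
  external_location = 's3://athena-blog/bill-details-parquet/',
  partitioned_by = ARRAY['year']            -- partition column must come last in the SELECT
) AS
SELECT id, amount_paid, amount_due, year
FROM bill_details_csv;
```

Queries against the Parquet copy then scan only the columns and partitions they need, which improves both speed and cost.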
Pricing of Athena
- AWS Athena is priced at $5 per TB of data scanned.
- Queries are rounded up to the nearest megabyte, with a 10 MB minimum per query.
- Users pay for stored data at regular S3 rates.
- Amazon advises users to compress data files, store data in columnar formats, and routinely delete old result sets to keep charges low. Partitioning data in tables can speed up queries and reduce query bills.
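To put the numbers in perspective: at $5 per TB, a query that scans 1 GB costs roughly $0.005, and a query that touches less than 10 MB is still billed at the 10 MB minimum. Partition pruning is the easiest way to shrink the scanned volume. In a sketch like the following (the table and partition column are hypothetical), Athena reads only the files under the matching partition instead of the whole table:

```sql
-- If bill_details is partitioned by year, this query scans only
-- the files under the year=2019 partition, not the full dataset.
SELECT id, amount_due
FROM bill_details
WHERE year = '2019';
```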
Athena vs. Redshift Spectrum
- AWS also offers Redshift as a data warehouse service, and Redshift Spectrum can query S3 data as well. So why should you use Athena?
Advantages of Redshift Spectrum:
- Allows creation of Redshift tables, and you can join Redshift tables with Redshift Spectrum tables efficiently.
If you do not need those capabilities, you should consider Athena as well. Here is how Athena differs from Redshift Spectrum:
- Billing: This is a major difference, and depending on your use case you may find one much cheaper than the other.
- Performance: Athena is often slightly faster.
- SQL syntax and features: Athena is derived from Presto, which differs a bit from Redshift with its roots in Postgres.
- Connectivity: It’s easy enough to connect to Athena using its API, JDBC, or ODBC, but many more products offer standard out-of-the-box connections to Redshift.
- Athena has GIS functions and lambdas.
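As a quick illustration of that last point, Athena’s Presto lineage brings lambda expressions for array functions as well as a set of geospatial functions. The snippet below is a sketch only; exact function availability can depend on the Athena engine version:

```sql
-- Lambda expression: keep only the array elements greater than 100.
SELECT filter(ARRAY[50, 150, 300], x -> x > 100);

-- Geospatial check: is the point (5, 5) inside a 10x10 square?
SELECT ST_Contains(
  ST_Polygon('POLYGON ((0 0, 0 10, 10 10, 10 0, 0 0))'),
  ST_Point(5, 5)
);
```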
So, in a nutshell: if you have existing Redshift clusters, you would probably go for Redshift Spectrum; if not, you can opt for Athena for querying your data. In some cases, you can use both in tandem.
Example
Here are sample DDL statements to create a database with three tables, basic_details, contact_details, and bill_details, over CSV files uploaded to S3:
Basic_details:
CREATE EXTERNAL TABLE `basic_details`(
`id` int COMMENT '',
`first_name` string COMMENT '',
`gender` string COMMENT '')
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://athena-blog/basic-details'
TBLPROPERTIES (
'has_encrypted_data'='false',
'skip.header.line.count'='1')
Bill_details:
CREATE EXTERNAL TABLE `bil_details`(
`id` int COMMENT '',
`amount_paid` string COMMENT '',
`amount_due` string COMMENT '')
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://athena-blog/bill-details'
TBLPROPERTIES (
'has_encrypted_data'='false',
'skip.header.line.count'='1')
Contact_details:
CREATE EXTERNAL TABLE `contact_details`(
`id` int COMMENT '',
`street` string COMMENT '',
`city` string COMMENT '',
`state` string COMMENT '',
`country` string COMMENT '',
`zip` string COMMENT '')
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://athena-blog/contact-details'
TBLPROPERTIES (
'has_encrypted_data'='false',
'skip.header.line.count'='1')
Sample query – first names of men from Minnesota with amount_due > $100:
WITH basic AS
(SELECT id,
first_name
FROM basic_details
WHERE lower(gender) = 'male' ), bill AS
(SELECT id
FROM bil_details
WHERE CAST(amount_due AS INTEGER) > 100 ), contact AS
(SELECT contact_details.id
FROM contact_details
JOIN bill
ON contact_details.id = bill.id
WHERE state= 'Minnesota' )
SELECT basic.first_name
FROM basic
JOIN contact
ON basic.id = contact.id
Output:

Some Other Sample Queries:
1. Searching for Values in JSON
WITH dataset AS (
SELECT * FROM (VALUES
(JSON '{"name": "Bob Smith", "org": "legal", "projects": ["project1"]}'),
(JSON '{"name": "Susan Smith", "org": "engineering", "projects": ["project1", "project2", "project3"]}'),
(JSON '{"name": "Jane Smith", "org": "finance", "projects": ["project1", "project2"]}')
) AS t (users)
)
SELECT json_extract_scalar(users, '$.name') AS user
FROM dataset
WHERE json_array_contains(json_extract(users, '$.projects'), 'project2')
Output:

2. Extracting properties
WITH dataset AS (
SELECT '{"name": "Susan Smith",
"org": "engineering",
"projects": [{"name":"project1", "completed":false},
{"name":"project2", "completed":true}]}'
AS blob
)
SELECT
json_extract(blob, '$.name') AS name,
json_extract(blob, '$.projects') AS projects
FROM dataset
Output:

3. Converting JSON to Athena Data Types
WITH dataset AS (
SELECT
CAST(JSON '"HELLO ATHENA"' AS VARCHAR) AS hello_msg,
CAST(JSON '12345' AS INTEGER) AS some_int,
CAST(JSON '{"a":1,"b":2}' AS MAP(VARCHAR, INTEGER)) AS some_map
)
SELECT * FROM dataset
Output:

Conclusion
Hence, we can safely say that AWS Athena gives us an efficient way to query raw data stored in different formats in S3 object storage, without provisioning dedicated infrastructure and at minimal cost.
Need help with setting up AWS Athena for your organization? Connect with the experts at Velotio!