Introduction
Serverless has become a major trend, and it is here to stay. Think of wireless internet: there are still wires somewhere, but you don’t need to worry about them because you don’t maintain them. Similarly, serverless still runs on servers, but you don’t have to worry about handling or maintaining them. All you need to do is focus on your code and you’re good to go.
It has some more benefits, such as:
- Zero administration: You can deploy code without provisioning anything beforehand, or managing anything later. There is no concept of a fleet, an instance, or even an operating system.
- Auto-scaling: It lets your service providers manage the scaling challenges. You don’t need to fire alerts or write scripts to scale up and down. It handles quick bursts of traffic and weekend lulls the same way.
- Pay-per-use: Function-as-a-service compute and managed services are billed on actual usage rather than pre-provisioned capacity. You get full resource utilization without paying a cent for idle time. The result? Often substantial cost savings over an always-on cloud VM, and the satisfaction of knowing that you never pay for resources you don’t use.
What is AWS Athena?
AWS Athena is one such serverless service, though it is an interactive query service rather than a code deployment service.
Using Athena, you can directly query data stored in S3 buckets with standard ANSI SQL.
As mentioned earlier, it works on the serverless principle: there is no infrastructure to manage, and you pay only for the queries that you run.
Athena is easy to use. You can simply point to your data in Amazon S3, define the schema, and start querying using standard SQL. Most results are delivered within seconds. With Athena, there’s no need for complex ETL jobs to prepare your data for analysis. This makes it easy for anyone with SQL skills to quickly analyze large-scale datasets.
It is based on Presto, the distributed SQL engine open-sourced by Facebook, and can be used to query both structured and semi-structured data.
Some Exciting Features of Athena:
- Serverless, no ETL – there are no servers or data warehouses to set up and manage.
- Only pay for the data that is scanned.
- You can ensure better performance by compressing, partitioning, and converting your data into columnar formats.
- Can also handle complex analysis, including large joins, window functions, and arrays.
- Athena automatically executes queries in parallel.
- You only need to provide a path to an S3 folder; when new files are added there, they are automatically reflected in the table.
- Supports –
- CSV, JSON, Parquet, ORC, and Avro data formats
- Complex Joins and datatypes
- View creation
- Does not Support –
- User-defined functions and stored procedures
- Hive or Presto transactions
- LZO compression (Snappy is supported)
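The performance tips above (compression, partitioning, columnar formats) can all be applied at once with a CREATE TABLE AS SELECT (CTAS) statement. The following is only a sketch; the table names and S3 path are illustrative, not from a real setup:

```sql
-- Convert a CSV-backed table into compressed, partitioned Parquet.
-- Table names and the S3 location below are hypothetical examples.
CREATE TABLE bill_details_parquet
WITH (
  format = 'PARQUET',                       -- columnar format
  parquet_compression = 'SNAPPY',           -- compression
  external_location = 's3://athena-blog/bill-details-parquet/',
  partitioned_by = ARRAY['year']            -- partition column must come last in the SELECT
) AS
SELECT id, amount_paid, amount_due, year
FROM bill_details_csv;
```

Queries against the Parquet copy then scan only the columns and partitions they need, which improves both speed and cost.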
Pricing of Athena
- AWS Athena is priced at $5 per TB of data scanned.
- Queries are rounded up to the nearest megabyte, with a 10 MB minimum per query.
- Users pay for stored data at regular S3 rates.
- Amazon advises users to compress data files, store data in columnar formats, and routinely delete old result sets to keep charges low. Partitioning data in tables can speed up queries and reduce query bills.
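To put the numbers in perspective: at $5 per TB, a query that scans 1 GB costs roughly $0.005, and a query that touches less than 10 MB is still billed at the 10 MB minimum. Partition pruning is the easiest way to shrink the scanned volume. In a sketch like the following (the table and partition column are hypothetical), Athena reads only the files under the matching partition instead of the whole table:

```sql
-- If bill_details is partitioned by year, this query scans only
-- the files under the year=2019 partition, not the full dataset.
SELECT id, amount_due
FROM bill_details
WHERE year = '2019';
```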
Athena vs. Redshift Spectrum
- AWS also offers Redshift as a data warehouse service, and Redshift Spectrum can query S3 data as well. So why should you use Athena?
Advantages of Redshift Spectrum:
- Allows creation of Redshift tables, and you can join Redshift tables with Redshift Spectrum tables efficiently.
If you do not need those capabilities, you should consider Athena as well. Here is how Athena differs from Redshift Spectrum:
- Billing: This is a major difference, and depending on your use case you may find one much cheaper than the other.
- Performance: Athena is often slightly faster.
- SQL syntax and features: Athena is derived from Presto, which differs a bit from Redshift with its roots in Postgres.
- Connectivity: It’s easy enough to connect to Athena using its API, JDBC, or ODBC, but many more products offer standard out-of-the-box connections to Redshift.
- Athena has GIS functions and lambdas.
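As a quick illustration of that last point, Athena’s Presto lineage brings lambda expressions for array functions as well as a set of geospatial functions. The snippet below is a sketch only; exact function availability can depend on the Athena engine version:

```sql
-- Lambda expression: keep only the array elements greater than 100.
SELECT filter(ARRAY[50, 150, 300], x -> x > 100);

-- Geospatial check: is the point (5, 5) inside a 10x10 square?
SELECT ST_Contains(
  ST_Polygon('POLYGON ((0 0, 0 10, 10 10, 10 0, 0 0))'),
  ST_Point(5, 5)
);
```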
So, in a nutshell: if you have existing Redshift clusters, you would probably go for Redshift Spectrum; if not, you can opt for Athena for querying your data. In some cases, you can use both in tandem.
Example
Here are sample DDL statements to create a database with three tables, basic_details, contact_details, and bill_details, over CSV files uploaded to S3:
Basic_details:
CREATE EXTERNAL TABLE `basic_details`(
`id` int COMMENT '',
`first_name` string COMMENT '',
`gender` string COMMENT '')
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://athena-blog/basic-details'
TBLPROPERTIES (
'has_encrypted_data'='false',
'skip.header.line.count'='1')
Bill_details:
CREATE EXTERNAL TABLE `bil_details`(
`id` int COMMENT '',
`amount_paid` string COMMENT '',
`amount_due` string COMMENT '')
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://athena-blog/bill-details'
TBLPROPERTIES (
'has_encrypted_data'='false',
'skip.header.line.count'='1')
Contact_details:
CREATE EXTERNAL TABLE `contact_details`(
`id` int COMMENT '',
`street` string COMMENT '',
`city` string COMMENT '',
`state` string COMMENT '',
`country` string COMMENT '',
`zip` string COMMENT '')
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://athena-blog/contact-details'
TBLPROPERTIES (
'has_encrypted_data'='false',
'skip.header.line.count'='1')
Sample query – first names of men from Minnesota with amount_due > $100:
WITH basic AS
(SELECT id,
first_name
FROM basic_details
WHERE lower(gender) = 'male' ), bill AS
(SELECT id
FROM bil_details
WHERE CAST(amount_due AS INTEGER) > 100 ), contact AS
(SELECT contact_details.id
FROM contact_details
JOIN bill
ON contact_details.id = bill.id
WHERE state= 'Minnesota' )
SELECT basic.first_name
FROM basic
JOIN contact
ON basic.id = contact.id
Output:

Some Other Sample Queries:
1. Searching for Values in JSON
WITH dataset AS (
SELECT * FROM (VALUES
(JSON '{"name": "Bob Smith", "org": "legal", "projects": ["project1"]}'),
(JSON '{"name": "Susan Smith", "org": "engineering", "projects": ["project1", "project2", "project3"]}'),
(JSON '{"name": "Jane Smith", "org": "finance", "projects": ["project1", "project2"]}')
) AS t (users)
)
SELECT json_extract_scalar(users, '$.name') AS user
FROM dataset
WHERE json_array_contains(json_extract(users, '$.projects'), 'project2')
Output:

2. Extracting properties
WITH dataset AS (
SELECT '{"name": "Susan Smith",
"org": "engineering",
"projects": [{"name":"project1", "completed":false},
{"name":"project2", "completed":true}]}'
AS blob
)
SELECT
json_extract(blob, '$.name') AS name,
json_extract(blob, '$.projects') AS projects
FROM dataset
Output:

3. Converting JSON to Athena Data Types
WITH dataset AS (
SELECT
CAST(JSON '"HELLO ATHENA"' AS VARCHAR) AS hello_msg,
CAST(JSON '12345' AS INTEGER) AS some_int,
CAST(JSON '{"a":1,"b":2}' AS MAP(VARCHAR, INTEGER)) AS some_map
)
SELECT * FROM dataset
Output:

Conclusion
Hence, we can safely say that AWS Athena gives us an efficient way to query raw data stored in different formats in S3 object storage, without provisioning dedicated infrastructure and at minimal cost.
Need help with setting up AWS Athena for your organization? Connect with the experts at Velotio!