
Modern Data Stack: The What, Why and How?
This post will provide you with a comprehensive overview of the modern data stack (MDS), including its benefits, how its components differ from its predecessors’, and what its future holds.

“Modern” has the connotation of being up-to-date, of being better. This is true for MDS, but how exactly is MDS better than what was before?

What was the data stack like?…

About two decades ago, the MapReduce breakthrough made it possible to efficiently process large amounts of data in parallel on multiple machines.

It provided the backbone of the standard pipeline of that era: HDFS for storage, Spark for computing, and Hive for SQL queries on top.

To run this, we had people handling the deployment and maintenance of Hadoop on their own.

This core attribute of the setup eventually became a pain point and made it complex and inefficient in the long run.

Being on-prem while facing ever-heavier loads meant scalability became a huge concern.

Unlike today, the process was also much more manual. Adding more RAM, increasing storage, and rolling out updates by hand reduced productivity.

Moreover,
- The pipeline wasn’t modular; components were tightly coupled, so swapping one component for something new often caused failures.
- Teams committed to specific vendors and found themselves locked in, by design, for years.
- Setup was complex, and the infrastructure was not resilient. Random surges in data crashed the systems. (This randomness in demand has only increased since the early days of the internet, due to social media-triggered virality.)
- Self-service was non-existent. If you wanted to do anything with your data, you needed data engineers.
- Observability was a myth. Your pipeline is failing, but you’re unaware, and then you don’t know why, where, or how. Your customers become your testers and know more about your system’s issues than you do.
- Data protection laws weren’t as formalized, and organizations lacked internal policies.

These issues made the traditional setup inefficient at solving modern problems.

For an upgraded modern setup, we needed something that is scalable, has a smaller learning curve, and is feasible for both a seed-stage startup and a Fortune 500 company.

Standing on the shoulders of tech innovations from the 2000s, data engineers started building a blueprint for MDS tooling with three core attributes:

Cloud Native (or the ocean)

Arguably the definitive change of the MDS era, the cloud removes the hassle of on-prem infrastructure and makes horizontal or vertical auto-scaling routine, which matters in an era where virality and traffic spikes are technical requirements.

Modularity

The M in MDS could stand for modular.

You can integrate any MDS tool into your existing stack, like LEGO blocks.

You can test out multiple tools, whether they’re open source or managed, choose the best fit, and iteratively build out your data infrastructure.

This mindset helps instill a habit of avoiding vendor lock-in by continuously upgrading your architecture with relative ease.

By moving away from the ancient, one-size-fits-all model, MDS recognizes the uniqueness of each company’s budget, domain, data types, and maturity—and provides the correct solution for a given use case.

Ease of Use

MDS tools are easier to set up. You can start playing with these tools within a day.

Importantly, the ease of use is not limited to technical engineers.

Owing to the rise of self-serve and no-code tools like Tableau, data is finally democratized for all kinds of consumers. SQL remains crucial, but for basic metric calculations, PMs, Sales, Marketing, etc., can use simple drag and drop in the UI (sometimes even simpler than Excel pivot tables).

MDS also enables one to experiment with different architectural frameworks for their use case, for example, ELT vs. ETL (explained under Data Transformation).

But, one might think such improvements mean MDS is the v1.1 of Data Stack, a tech upgrade that ultimately uses data to solve similar problems.

Fortunately, that’s far from the case.

MDS enables data to solve more human problems across the org—problems that employees have long been facing but could never systematically solve for, helping generate much more value from the data.

Beyond these, employees want transparency and visibility into how any metric was calculated and which data source in Snowflake was used to build which specific Tableau dashboard.

Critically, with compliance finally being focused on, orgs need solutions for giving the right people the right access at the right time.

Lastly, as opposed to previous eras, even startups now have varied data infrastructure components. If you’re a PM tasked with bringing insights, how do you know where to start? What data assets does the organization have?

Besides these problem statements being tackled, MDS builds a culture of upskilling employees in various data concepts.

Data security, governance, and data lineage are important irrespective of department or persona in the organization.

From designers to support executives, the need for a data-driven culture is a given.

You’re probably bored of hearing how good the MDS is and want to deconstruct it into its components.

Let’s dive in.

SOURCES

In our modern era, every product is inevitably becoming a tech product.

From a smart bulb to an orbiting satellite, each generates data with its own frequency of generation, data format, data size, etc.

Sources include social media, microservices, IoT devices, smart devices, DBs, CRMs, ERPs, flat files, and a lot more.

INGESTION

Once data is created, how does one “ingest,” or take in, that data for actual usage (the whole point of investing)?

Roughly, there are three categories to help describe the ingestion solutions:

Generic – these tools allow us to connect various data sources to data stores.

E.g., we can connect Google Ads or Salesforce to dump data into BigQuery or S3.

These generic tools highlight the modularity and the low-code or no-code aspect of MDS.

Things are as easy as drag and drop, and one doesn’t need to be fluent in scripting.

Programmable – these tools give us more control over how we ingest data through code.

For example, we can write Apache Airflow DAGs in Python to load data from S3 and dump it to Redshift.

Intermediary – these tools cater to a specific use case or are coupled with the storage system itself.

E.g., Snowpipe, a part of Snowflake itself, allows us to load data from files as soon as they’re available in a stage.

DATA STORAGE

Where do you ingest data into?

Here, we’ve expanded from HDFS & SQL DBs to a wider variety of stores (NoSQL, document DBs).

Depending on the use case and the way you interact with data, you can choose from a data warehouse (DW), database (DB), data lake (DL), object store, etc.

You might need a standard relational DB for transactions in finance, or you might be collecting logs. You might be experimenting with your product at an early stage and be fine with NoSQL without worrying about prescribing schemas.

One key feature to note is that most are cloud-based. So, there is far less worrying about scalability, and we pay only for what we use.

PS: Do stick around till the end for new concepts of Lake House and reverse ETL (already prevalent in the industry).

DATA TRANSFORMATION

The stored raw data must be cleaned and restructured into the shape we deem best for actual usage. This slicing and dicing is different for every kind of data.

For example, we have tools for the E-T-L way, which can be categorized into SaaS and Frameworks, e.g., Fivetran and Spark respectively.

Interestingly, the cloud era has given storage enough computational capability that we sometimes don’t even need an external system for transformation.

With this rise of E-LT, we leverage the processing capabilities of cloud data warehouses or lakehouses. Using tools like dbt, we write templated SQL queries to transform our data in the warehouse or lakehouse itself.

This enables analysts to do the heavy lifting on what were traditionally data engineering problems.
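To make the E-LT idea concrete, here is a minimal sketch in Python. It assumes an in-memory SQLite database as a stand-in for a cloud warehouse and uses plain `str.format` templating rather than dbt’s actual Jinja-based models; the table names, columns, and sample rows are hypothetical.

```python
import sqlite3

# Hypothetical raw events, loaded as-is: the "L" happens before the "T".
RAW_ORDERS = [
    ("o-1", "alice", "2023-01-01", "19.99"),
    ("o-2", "bob", "2023-01-01", "5.00"),
    ("o-3", "alice", "2023-01-02", "10.01"),
]

# Templated SQL in the spirit of a transformation model (not dbt syntax).
MODEL_TEMPLATE = """
CREATE TABLE {target} AS
SELECT customer,
       ROUND(SUM(CAST(amount AS REAL)), 2) AS total_spent,
       COUNT(*) AS order_count
FROM {source}
GROUP BY customer
"""


def load_raw(conn):
    """Load step: store the raw strings without cleaning them first."""
    conn.execute(
        "CREATE TABLE raw_orders "
        "(order_id TEXT, customer TEXT, order_date TEXT, amount TEXT)"
    )
    conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?, ?)", RAW_ORDERS)


def transform(conn, source, target):
    """Transform step: runs inside the 'warehouse' as SQL, not in Python."""
    conn.execute(MODEL_TEMPLATE.format(source=source, target=target))


def run_elt():
    conn = sqlite3.connect(":memory:")
    load_raw(conn)
    transform(conn, "raw_orders", "customer_spend")
    return conn.execute(
        "SELECT customer, total_spent, order_count "
        "FROM customer_spend ORDER BY customer"
    ).fetchall()


if __name__ == "__main__":
    for row in run_elt():
        print(row)
```

The point of the sketch is the ordering: raw data lands first, and the cleaning and aggregation happen afterwards using the storage engine’s own compute.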

We also see stream processing, used by applications where “micro” data is processed in real time (analyzed as soon as it’s produced, as opposed to in large batches).

DATA VISUALIZATION

The ability to visually learn from data has only improved in the MDS era with advanced design, methodology, and integration.

With Embedded analytics, one can integrate analytical capabilities and data visualizations into the software application itself.

External analytics tools, on the other hand, are used to build dashboards on top of your processed data. You choose your source, create a chart, and let it run.

DATA SCIENCE, MACHINE LEARNING, MLOps

Source: https://medium.com/vertexventures/thinking-data-the-modern-data-stack-d7d59e81e8c6

In the last decade, we have moved beyond ad-hoc insight generation in Jupyter notebooks to production-ready, real-time ML workflows, like recommendation systems and price predictions. Any startup can and does integrate ML into its products.

Most cloud service providers offer machine learning models and automated model building as a service.

MDS concepts like data observability are used to build tools for ML practitioners, whether it’s feature stores (a feature store is a central repository that provides entity values as of a certain time) or model monitoring (checking data drift, tracking model performance, and improving model accuracy).

This is extremely important, as statisticians can focus on the business problem rather than the infrastructure.

This is an ever-expanding field where concepts such as MLOps (DevOps for ML pipelines: optimizing workflows, efficient transformations) and synthetic media (using AI to generate content itself) arrive and quickly become mainstream.

ChatGPT is the current buzz, but by the time you’re reading this, I’m sure there’s going to be an updated one—such is the pace of development.

DATA ORCHESTRATION

With a higher number of modularized tools and source systems comes added complexity.

More steps, processes, connections, settings, and synchronization are required.

Data orchestration in MDS needs to be Cron on steroids.

Using a wide variety of products, MDS tools help bring the right data for the right purposes based on complex logic.
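What separates an orchestrator from plain Cron is that it understands dependencies between tasks. The sketch below illustrates only that one idea, using Python’s standard-library `graphlib`; the task names are hypothetical, and real orchestrators add scheduling, retries, and monitoring on top.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
PIPELINE = {
    "ingest_ads": set(),
    "ingest_crm": set(),
    "transform_leads": {"ingest_ads", "ingest_crm"},
    "refresh_dashboard": {"transform_leads"},
}


def run_pipeline(dag, run_task):
    """Run each task only after all of its dependencies; return the order used.

    TopologicalSorter raises graphlib.CycleError if the graph has a cycle.
    """
    order = []
    for task in TopologicalSorter(dag).static_order():
        run_task(task)
        order.append(task)
    return order


if __name__ == "__main__":
    run_pipeline(PIPELINE, lambda task: print(f"running {task}"))
```

With Cron alone, the same guarantee would have to be approximated by spacing out start times and hoping upstream jobs finish in time.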

DATA OBSERVABILITY

Data observability is the ability to monitor and understand the state and behavior of data as it flows through an organization’s systems.

In a traditional data stack, organizations often rely on reactive approaches to data management, only addressing issues as they arise. In contrast, data observability in an MDS involves adopting a proactive mindset, where organizations actively monitor and understand the state of their data pipelines to identify potential issues before they become critical.

- Monitoring – a dashboard that provides an operational view of your pipeline or system
- Alerting – both for expected events and anomalies
- Tracking – ability to set and track specific events
- Analysis – automated issue detection that adapts to your pipeline and data health
- Logging – a record of an event in a standardized format for faster resolution
- SLA Tracking – measuring data quality against predefined standards (cost, performance, reliability)
- Data Lineage – graph representation of data assets showing upstream/downstream steps
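As an illustration of the data lineage item, the following sketch models lineage as a directed graph and finds everything downstream of a failing asset. The asset names and the dictionary layout are assumptions made for the example, not the format of any particular observability tool.

```python
from collections import deque

# Hypothetical lineage edges: asset -> assets built directly from it.
LINEAGE = {
    "raw.orders": ["staging.orders"],
    "staging.orders": ["marts.revenue", "marts.customer_spend"],
    "marts.revenue": ["dashboard.revenue_weekly"],
    "marts.customer_spend": [],
    "dashboard.revenue_weekly": [],
}


def downstream(lineage, asset):
    """Breadth-first walk returning every asset affected by `asset`."""
    seen = {asset}
    queue = deque([asset])
    affected = []
    while queue:
        for child in lineage.get(queue.popleft(), []):
            if child not in seen:
                seen.add(child)
                affected.append(child)
                queue.append(child)
    return affected


if __name__ == "__main__":
    # Which assets should be flagged if staging.orders is stale?
    print(downstream(LINEAGE, "staging.orders"))
```

An alerting component could use the same traversal to notify the owners of affected dashboards before their consumers notice.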

DATA GOVERNANCE & SECURITY

Data security is a critical consideration for organizations of all sizes and industries and needs to be prioritized to protect sensitive information, ensure compliance, and preserve business continuity.

The introduction of stricter data protection regulations, such as the General Data Protection Regulation (GDPR) and CCPA, introduced a huge need in the market for MDS tools, which efficiently and painlessly help organizations govern and secure their data.

DATA CATALOG

Now that we have all the components of MDS, from ingestion to BI, we have so many sources, as well as things like dashboards, reports, views, other metadata, etc., that we need a Google-like search engine just to navigate our components.

This is where a data catalog helps; it allows people to stitch together the metadata (data about your data: the number of rows in your table, the column names, types, etc.) across sources.

This is necessary to help efficiently discover, understand, trust, and collaborate on data assets.

We don’t want PMs and GTM teams to look at different dashboards for the same adoption data.
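A minimal sketch of the metadata-harvesting part of a catalog follows. It assumes SQLite as the only source and collects exactly the metadata mentioned above (row counts, column names, and types); the `collect_metadata` and `search` helpers and the sample tables are hypothetical.

```python
import sqlite3


def demo_connection():
    """Build a small in-memory database with hypothetical tables."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER, email TEXT)")
    conn.execute("CREATE TABLE orders (order_id INTEGER, user_id INTEGER, amount REAL)")
    conn.executemany(
        "INSERT INTO users VALUES (?, ?)",
        [(1, "a@example.com"), (2, "b@example.com")],
    )
    return conn


def collect_metadata(conn):
    """Return {table: {"rows": n, "columns": [(name, type), ...]}}."""
    catalog = {}
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    for (table,) in tables:
        # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk).
        columns = [
            (row[1], row[2])
            for row in conn.execute(f'PRAGMA table_info("{table}")')
        ]
        (row_count,) = conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()
        catalog[table] = {"rows": row_count, "columns": columns}
    return catalog


def search(catalog, term):
    """Find tables whose name or column names contain the search term."""
    term = term.lower()
    return [
        table
        for table, meta in catalog.items()
        if term in table.lower()
        or any(term in name.lower() for name, _ in meta["columns"])
    ]


if __name__ == "__main__":
    catalog = collect_metadata(demo_connection())
    print(catalog)
    print(search(catalog, "user"))
```

A real catalog would run collectors like this against every source and add ownership, descriptions, and lineage on top.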


Take Netflix as an example. Previously, the sole purpose of its original data pipeline was to aggregate and upload events to Hadoop/Hive for batch processing. Chukwa collected events and wrote them to S3 in Hadoop sequence file format. In those days, end-to-end latency was up to 10 minutes. That was sufficient for batch jobs, which usually scan data at daily or hourly frequency.

With the emergence of Kafka and Elasticsearch over the last decade, there has been a growing demand for real-time analytics at Netflix. By real-time, we mean sub-minute latency. Instead of starting from scratch, Netflix was able to iteratively grow its MDS as per changes in market requirements.

Source: https://blog.transform.co/data-talks/the-metric-layer-why-you-need-it-examples-and-how-it-fits-into-your-modern-data-stack/

This is a snapshot of the MDS stack a data-mature company like Netflix had some years back, where instead of a few all-in-one tools, each data category was solved by a specialized tool.

FUTURE COMPONENTS OF MDS?

DATA MESH

Source: https://martinfowler.com/articles/data-monolith-to-mesh.html

The top picture shows how teams currently operate, where no matter the feature or product on the Y axis, the data pipeline’s journey remains the same moving along the X. But in an ideal world of data mesh, those who know the data should own its journey.

As decentralization is the name of the game, data mesh is MDS’s response to this demand for an architecture shift where domain owners use self-service infrastructure to shape how their data is consumed.

DATA LAKEHOUSE

Source: https://www.altexsoft.com/blog/data-lakehouse/

We have talked about data warehouses and data lakes being used for data storage.

Initially, when we only needed structured data, data warehouses were used. Later, with big data, we started getting all kinds of data, structured and unstructured.

So, we started using Data Lakes, where we just dumped everything.

The lakehouse tries to combine the best of both worlds by adding an intelligent metadata layer on top of the data lake. This layer basically classifies and categorizes data such that it can be interpreted in a structured manner.

Also, all the data in the lakehouse is open, meaning that it can be utilized by all kinds of tools. Lakehouses are generally built on top of open data formats like Parquet so that any tool can access the data easily.

End users can simply run their SQL queries as if they’re querying a DWH.

REVERSE ETL

Suppose you’re a salesperson using Salesforce and want to know if a lead you just got is warm or cold (warm indicating a higher chance of conversion).

The attributes of your lead, like salary and age, are fetched from your OLTP system into a DWH and analyzed, and then the flag “warm” is sent back to the Salesforce UI, ready to be used in live operations.
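The round trip can be sketched as follows. This is an illustration only: SQLite stands in for the DWH, a plain dictionary stands in for the CRM, and the scoring rule, field names, and sample leads are made up for the example.

```python
import sqlite3


def demo_warehouse():
    """Hypothetical warehouse table holding lead attributes."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE leads (lead_id TEXT, salary REAL, age INTEGER)")
    conn.executemany(
        "INSERT INTO leads VALUES (?, ?, ?)",
        [("L1", 90000, 35), ("L2", 30000, 40)],
    )
    return conn


def score_leads(conn, salary_threshold):
    """Analysis step inside the warehouse: flag each lead as warm or cold.

    The rule below is an invented placeholder, not a real scoring model.
    """
    query = """
        SELECT lead_id,
               CASE WHEN salary >= ? AND age BETWEEN 25 AND 55
                    THEN 'warm' ELSE 'cold' END
        FROM leads
    """
    return dict(conn.execute(query, (salary_threshold,)).fetchall())


def sync_to_crm(crm_records, flags):
    """Reverse ETL step: write the flags back onto operational records."""
    for lead_id, flag in flags.items():
        if lead_id in crm_records:
            crm_records[lead_id]["lead_temperature"] = flag
    return crm_records


if __name__ == "__main__":
    crm = {"L1": {"name": "Ada"}, "L2": {"name": "Bob"}}
    print(sync_to_crm(crm, score_leads(demo_warehouse(), 50000)))
```

In practice the sync step would call the operational tool’s API, but the direction of flow, from warehouse back to the tool people work in, is the defining feature.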

METRICS LAYER

The Metric layer will be all about consistency, accessibility, and trust in the calculations of metrics.

Earlier, for metrics, you had v1 and v1.1 Excel files with logic scattered around.

Currently, in the modern data stack world, each team’s calculation is isolated in the tool they are used to. For example, BI would store metrics in Tableau dashboards while DEs would use code.

A metric layer would exist to ensure global access of the metrics to every other tool in the data stack.

For example, the dbt metrics layer helps define these in the warehouse, which makes them accessible to both BI and engineers. Similarly, Looker, Mode, and others have their own approaches to it.
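The core idea, one shared definition per metric that every consumer compiles to the same SQL, can be sketched in a few lines. This is not how dbt, Looker, or Mode implement it; the `METRICS` registry, its field names, and the `events` table are assumptions for illustration.

```python
import sqlite3

# Single, shared definition of each metric (hypothetical names and SQL).
METRICS = {
    "active_users": {"table": "events", "expression": "COUNT(DISTINCT user_id)"},
    "revenue": {"table": "events", "expression": "SUM(amount)"},
}


def compile_metric(name, group_by=None):
    """Turn a metric definition into SQL; every tool gets the same logic."""
    metric = METRICS[name]
    select = f"{metric['expression']} AS {name}"
    if group_by:
        return (
            f"SELECT {group_by}, {select} FROM {metric['table']} "
            f"GROUP BY {group_by} ORDER BY {group_by}"
        )
    return f"SELECT {select} FROM {metric['table']}"


def query_metric(conn, name, group_by=None):
    return conn.execute(compile_metric(name, group_by)).fetchall()


def demo_connection():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (user_id TEXT, country TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO events VALUES (?, ?, ?)",
        [("u1", "IN", 10.0), ("u2", "IN", 5.0), ("u1", "US", 2.5)],
    )
    return conn


if __name__ == "__main__":
    conn = demo_connection()
    print(query_metric(conn, "active_users"))
    print(query_metric(conn, "revenue", group_by="country"))
```

Because the BI dashboard and the engineer’s script both go through `compile_metric`, a change to the definition of `revenue` reaches every consumer at once.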

In summary, this blog post discussed the modern data stack and its advantages over older approaches. We examined the components of the modern data stack, including data sources, ingestion, transformation, and more, and how they work together to create an efficient and effective system for data management and analysis. We also highlighted the benefits of the modern data stack, including increased efficiency, scalability, and flexibility.

As technology continues to advance, the modern data stack will evolve and incorporate new components and capabilities.
January 4, 2023
Best Practices for Kafka Security
Overview

We will cover the security concepts of Kafka and walk through the implementation of encryption, authentication, and authorization for the Kafka cluster.

This article will explain how to configure SASL_SSL security for your Kafka cluster and how to protect the data in transit. SASL stands for Simple Authentication and Security Layer; SASL_SSL is a communication type in which clients authenticate with mechanisms like PLAIN, SCRAM, etc., while the server presents SSL certificates to establish secure communication. We will use the SCRAM authentication mechanism here, which lets the client and server authenticate each other. We’ll also discuss authorization and ACLs, which are important for securing your cluster.

Prerequisites

A running Kafka cluster and a basic understanding of security components.

Need for Kafka Security

The primary reason is to prevent unlawful activity aimed at misuse, modification, disruption, or disclosure of data. To understand what a secure Kafka cluster involves, we need to know three terms:
- Authentication – The security method servers use to verify the identity of a client, i.e., that the client is who it claims to be.
- Authorization – Applied after authentication, it determines which resources and operations an identified client may access. Basically, it gives each client limited access that is sufficient for its needs.
- Encryption – The process of transforming data so that it is unreadable without a decryption key. Encryption ensures that a third party that intercepts the data cannot read it.
Here is the quick start guide by Apache Kafka, so check it out if you still need to set up Kafka.

https://kafka.apache.org/quickstart

We’ll not cover the theoretical aspects here, but you can find a ton of sources on how these three components work internally. For now, we’ll focus on the implementation part and how Kafka revolves around security.

This image illustrates SSL communication between the Kafka client and server.

We are going to implement the steps in the below order:
- Create a Certificate Authority
- Create a Truststore & Keystore
Certificate Authority – It is a trusted entity that issues SSL certificates. As such, a CA is an independent entity that acts as a trusted third party, issuing certificates for use by others. A certificate authority validates the credentials of a person or organization that requests a certificate before issuing one.

Truststore – A truststore contains certificates from other parties with which you want to communicate or certificate authorities that you trust to identify other parties. In simple words, it is the list of CAs whose signed certificates you are willing to accept.

KeyStore – A KeyStore contains private keys and certificates with their corresponding public keys. Keystores can have one or more CA certificates depending upon what’s needed.

For the Kafka server, we need a server certificate, and this is where the KeyStore comes into the picture, since it stores the server certificate. The server certificate should be signed by the Certificate Authority (CA). A certificate signing request (CSR) is generated from the KeyStore, and in response, the CA sends back a signed certificate to be imported into the KeyStore.

We will create our own certificate authority for demonstration purposes. If you don’t want to create a private certificate authority, there are many certificate providers you can go with, like IdenTrust and GoDaddy. Since we are creating one, we need to tell our Kafka client to trust our private certificate authority using the Trust Store.
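For readers more familiar with Python than with Java key stores, the two roles can be sketched with the standard-library `ssl` module. This is only an analogy: Python works with PEM files rather than JKS files, Kafka’s Java clients do not use this code, and the file-path parameters are hypothetical.

```python
import ssl


def build_client_context(ca_file=None, cert_file=None, key_file=None):
    """Build a TLS client context; all arguments are hypothetical PEM paths."""
    # PROTOCOL_TLS_CLIENT verifies the server certificate and hostname by default.
    context = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    if ca_file:
        # Trust store role: which CAs this client accepts as signers.
        context.load_verify_locations(cafile=ca_file)
    if cert_file:
        # Key store role: this party's own certificate and private key.
        context.load_cert_chain(certfile=cert_file, keyfile=key_file)
    return context


if __name__ == "__main__":
    context = build_client_context()
    print(context.verify_mode, context.check_hostname, context.minimum_version)
```

A client that trusts a private CA would pass that CA’s certificate as `ca_file`, which is the same job the trust store does for the Kafka client.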

This block diagram shows you how all the components communicate with each other and their role to generate the final certificate.

So, let’s create our Certificate Authority. Run the below command in your terminal:

openssl req -new -x509 -keyout <private_key_name> -out <public_certificate_name> -days <number_of_days>

The -x509 flag makes openssl output a self-signed certificate rather than a certificate request. The command will ask for a passphrase; keep it safe for future use. After successfully executing the command, we should have two files named private_key_name and public_certificate_name.

Now, let’s create a KeyStore and trust store for brokers; we need both because brokers also interact internally with each other. Let’s understand with the help of an example: Broker A wants to connect with Broker B, so Broker A acts as a client and Broker B as a server. We are using the SASL_SSL protocol, so A needs SASL credentials, and B needs a certificate for authentication. The reverse is also possible where Broker B wants to connect with Broker A, so we need both a KeyStore and a trust store for authentication.

Now let’s create a trust store. Execute the below command in the terminal, and it should ask for the password. Save the password for future use:

keytool -keystore <truststore_name.jks> -alias <alias name of the entry to process> -import -file <public_certificate_name>

Here, we are using the .jks extension for the file, which stands for Java KeyStore. You can also use Public-Key Cryptography Standards #12 (pkcs12) instead of .jks, but that’s totally up to you. public_certificate_name is the same certificate we generated while creating the CA.

For the KeyStore configuration, run the below command and store the password:

keytool -genkey -keystore <keystore_name.jks> -validity <number_of_days> -storepass <store_password> -alias <alias_name> -keyalg <key algorithm name> -ext SAN=DNS:localhost

This action creates the KeyStore file in the current working directory. The question “First and Last Name” requires you to enter a fully qualified domain name because some certificate authorities, such as VeriSign, expect this property to be a fully qualified domain name. Not all CAs require a fully qualified domain name, but I recommend using a fully qualified domain name for portability. All other information should be valid. If the information cannot be verified, a certificate authority such as VeriSign will not sign the CSR generated for that record. I’m using localhost for the domain name here, as seen in the above command itself.

The KeyStore now has an entry with alias_name. It contains the private key and the information needed for generating a CSR. Now let’s create a certificate signing request (CSR), which will be used to get a signed certificate from the Certificate Authority.

Execute the below command in your terminal:

keytool -keystore <keystore_name.jks> -alias <alias_name> -certreq -file <file_name.csr>

So, we have generated a certificate signing request using the KeyStore (the KeyStore name and alias name should be the same as before). The command should ask for the KeyStore password, so enter the same one used while creating the KeyStore.

Now, execute the below command. It will ask for the password, so enter the CA password, and now we have a signed certificate:

openssl x509 -req -CA <public_certificate_name> -CAkey <private_key_name> -in <csr file> -out <signed_file_name> -CAcreateserial

Finally, we need to add both the CA’s public certificate and the signed certificate to the KeyStore. Run the below command to add the CA certificate:

keytool -keystore <keystore_name.jks> -alias <public_certificate_name> -import -file <public_certificate_name>

Now, let’s run the below command; it will add the signed certificate to the KeyStore.

keytool -keystore <keystore_name.jks> -alias <alias_name> -import -file <signed_file_name>

As of now, we have generated all the security files for the broker. For internal broker communication, we are using SASL_SSL (see security.inter.broker.protocol in server.properties). Now we need to create a broker username and password using the SCRAM method.

Run the below command:

kafka-configs.sh --zookeeper <host:port> --entity-type users --entity-name <username> --alter --add-config 'SCRAM-SHA-512=[password=<password>]'

NOTE: Credentials for inter-broker communication must be created before Kafka brokers are started.
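With SCRAM, the server does not need to keep the plain password; it keeps a salt, an iteration count, and two derived keys. The sketch below derives those values as described in RFC 5802, using SHA-512. It is an illustration of the mechanism rather than Kafka’s own code: the function name and the returned field names are chosen for readability, and the SASLprep normalization of the password is skipped.

```python
import base64
import hashlib
import hmac
import os


def scram_sha512_credential(password, salt=None, iterations=4096):
    """Derive the values a server stores for SCRAM-SHA-512 (RFC 5802).

    RFC 5802 recommends an iteration count of at least 4096.
    """
    salt = salt or os.urandom(16)
    # SaltedPassword := Hi(password, salt, i), which is PBKDF2 with HMAC.
    salted = hashlib.pbkdf2_hmac("sha512", password.encode("utf-8"), salt, iterations)
    client_key = hmac.new(salted, b"Client Key", hashlib.sha512).digest()
    stored_key = hashlib.sha512(client_key).digest()
    server_key = hmac.new(salted, b"Server Key", hashlib.sha512).digest()
    return {
        "salt": base64.b64encode(salt).decode("ascii"),
        "stored_key": base64.b64encode(stored_key).decode("ascii"),
        "server_key": base64.b64encode(server_key).decode("ascii"),
        "iterations": iterations,
    }


if __name__ == "__main__":
    # Example password only; never hard-code real credentials.
    print(scram_sha512_credential("example-password"))
```

During login, the client proves knowledge of the password against `stored_key`, and the server proves its own legitimacy with `server_key`, which is what makes the authentication mutual.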

Now, we need to configure the Kafka broker property file, so update the file as given below:
```
listeners=SASL_SSL://localhost:9092
advertised.listeners=SASL_SSL://localhost:9092
ssl.truststore.location={path/to/truststore_name.jks}
ssl.truststore.password={truststore_password}
ssl.keystore.location={/path/to/keystore_name.jks}
ssl.keystore.password={keystore_password}
security.inter.broker.protocol=SASL_SSL
ssl.client.auth=none
ssl.protocol=TLSv1.2
sasl.enabled.mechanisms=SCRAM-SHA-512
sasl.mechanism.inter.broker.protocol=SCRAM-SHA-512
listener.name.sasl_ssl.scram-sha-512.sasl.jaas.config=org.apache.kafka.common.security.scram.ScramLoginModule required username={username} password={password};
super.users=User:{username}
```
NOTE: If you are using an external JAAS config file, then remove the ScramLoginModule line and set this environment variable before starting the broker: export KAFKA_OPTS=-Djava.security.auth.login.config={path/to/broker.conf}

Now, if we run Kafka, the broker should be running on port 9092 without any failure. If you have multiple brokers in the cluster, the same config file can be replicated among them, but the port should be different for each broker running on the same host.

Producers and consumers need a username and a password to access the broker, so let’s create their credentials and update respective configurations.

Create a producer user by executing the below command in your terminal; we will then update producer.properties inside the config directory.

bin/kafka-configs.sh --zookeeper <host:port> --entity-type users --entity-name <producer_name> --alter --add-config 'SCRAM-SHA-512=[password=<password>]'

We need a trust store file for our clients (producer and consumer), but as we already know how to create a trust store, this is a small task for you. It is suggested that producers and consumers have separate trust stores because when we move Kafka to production, there could be multiple producers and consumers on different machines. With the trust store in place, update producer.properties as given below:
```
security.protocol=SASL_SSL
ssl.protocol=TLSv1.2
ssl.truststore.location={path/to/client.truststore.jks}
ssl.truststore.password={password}
sasl.mechanism=SCRAM-SHA-512
sasl.jaas.config=org.apache.kafka.common.security.scram.ScramLoginModule required username={producer_name} password={password};
```
Similarly, the below command creates a consumer user; after running it, update consumer.properties inside the config directory as given below:

bin/kafka-configs.sh --zookeeper <host:port> --entity-type users --entity-name <consumer_name> --alter --add-config 'SCRAM-SHA-512=[password=<password>]'
```
security.protocol=SASL_SSL
ssl.protocol=TLSv1.2
ssl.truststore.location={path/to/client.truststore.jks}
ssl.truststore.password={password}
sasl.mechanism=SCRAM-SHA-512
sasl.jaas.config=org.apache.kafka.common.security.scram.ScramLoginModule required username={consumer_name} password={password};
```
As of now, we have implemented encryption and authentication for Kafka brokers. To verify that our producer and consumer are working properly with SCRAM credentials, run the console producer and consumer on some topics.

Authorization is not implemented yet. Kafka uses access control lists (ACLs) to specify which users can perform which actions on specific resources or groups of resources. Each ACL has a principal, a permission type, an operation, a resource type, and a name.

An authorizer is a server plugin used by Kafka to authorize actions. Specifically, the authorizer controls whether operations should be authorized based on the principal and resource being accessed. The default authorizer is AclAuthorizer, provided by Kafka; Confluent also provides the Confluent Server Authorizer, which is a separate implementation.

Format of ACLs – Principal P is [Allowed/Denied] Operation O from Host H on any Resource R matching ResourcePattern RP
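To show how rules of this shape are evaluated, here is a simplified sketch. It follows Kafka’s documented behavior that super users bypass ACLs, that a matching deny rule takes precedence over allow rules, and that access is denied when no rule matches. It deliberately leaves out prefixed resource patterns and implied operations, and the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Acl:
    principal: str       # e.g. "User:producer_name"
    permission: str      # "ALLOW" or "DENY"
    operation: str       # e.g. "READ", "WRITE", or "ALL"
    resource_type: str   # e.g. "TOPIC" or "GROUP"
    resource_name: str   # literal name, or "*" for any
    host: str = "*"


def is_authorized(acls, super_users, principal, operation,
                  resource_type, resource_name, host):
    """Simplified check: literal resource names and wildcards only."""
    if principal in super_users:
        return True
    matching = [
        acl for acl in acls
        if acl.principal in (principal, "User:*")
        and acl.operation in (operation, "ALL")
        and acl.resource_type == resource_type
        and acl.resource_name in (resource_name, "*")
        and acl.host in (host, "*")
    ]
    if any(acl.permission == "DENY" for acl in matching):
        return False  # a deny rule wins over any allow rule
    return any(acl.permission == "ALLOW" for acl in matching)


if __name__ == "__main__":
    acls = [Acl("User:producer_name", "ALLOW", "WRITE", "TOPIC", "topic_name")]
    print(is_authorized(acls, set(), "User:producer_name",
                        "WRITE", "TOPIC", "topic_name", "10.0.0.5"))
```

The commands that follow create exactly this kind of rule: an allow entry for one principal, one operation, and one resource.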

Execute the below command to create an ACL with writing permission for the producer:

bin/kafka-acls.sh --authorizer-properties zookeeper.connect=<host:port> --add --allow-principal User:<producer_name> --operation WRITE --topic <topic_name>

The above command should create an ACL allowing the write operation for producer_name on topic_name.

Now, execute the below command to create an ACL with reading permission for the consumer:

bin/kafka-acls.sh --authorizer-properties zookeeper.connect=<host:port> --add --allow-principal User:<consumer_name> --operation READ --topic <topic_name>

The consumer also needs permission on its consumer group, so the below command grants this consumer READ access on the given consumer group ID.

bin/kafka-acls.sh --authorizer-properties zookeeper.connect=<host:port> --add --allow-principal User:<consumer_name> --operation READ --group <consumer_group_name>

Now, we need to add some configuration in two files: the broker’s server.properties and consumer.properties.
```
# Authorizer class
authorizer.class.name=kafka.security.authorizer.AclAuthorizer
```
The above line indicates that the AclAuthorizer class is used for authorization.
```
# consumer group id
group.id=<consumer_group_name>
```
The consumer group ID is mandatory here: if we do not specify a group, the consumer will not be able to read data from topics, so group.id must be provided to start a consumer.

Let’s test the producer and consumer one by one. Run the console producer, then run the console consumer in another terminal; both should run without errors.

console-producer

console-consumer

Voila!! Your Kafka is secured.

Summary

In a nutshell, we have implemented security in our Kafka cluster using the SASL_SSL mechanism and learned how to create ACLs and give different permissions to different users.

Apache Kafka is the wild west without security. By default, there is no encryption, authentication, or access control list. Any client can communicate with the Kafka broker using the PLAINTEXT port. Access using this port should be restricted to trusted clients only. You can use network segmentation and/or authentication ACLs to restrict access to trusted IP addresses in these cases. If none of these are used, the cluster is wide open and available to anyone. A basic knowledge of Kafka authentication, authorization, encryption, and audit trails is required to safely move a system into production.
December 28, 2022

Discover the Benefits of Android Clean Architecture

All architectures have one common goal: to manage the complexity of our application. We may not need to worry about it on a smaller project, but it becomes a lifesaver on larger ones. The purpose of Clean Architecture is to minimize code complexity by preventing implementation complexity.

We must first understand a few things to implement the Clean Architecture in an Android project.

Entities: Encapsulate enterprise-wide critical business rules. An entity can be an object with methods or a set of data structures and functions.
Use cases: Orchestrate the flow of data to and from the entities.
Controllers, gateways, presenters: A set of adapters that convert data from the format used by the use cases and entities to the format most convenient for passing the data to the upper level (typically the UI).
UI, external interfaces, DB, web, devices: The outermost layer of the architecture, generally composed of frameworks such as database and web frameworks.

Here is one rule of thumb we need to follow. First, look at the direction of the arrows in the diagram. Entities do not depend on use cases, use cases do not depend on controllers, and so on. An inner layer must never depend on an outer layer; the dependencies between the layers must always point inwards.
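
To make the dependency rule concrete, here is a minimal sketch in plain Kotlin. All names in it (Hero, HeroRepository, ListHeroNames, InMemoryHeroRepository) are hypothetical and chosen only for illustration. The inner layer declares the interface it needs, and the outer layer implements it, so the dependency points inwards.

```kotlin
// Inner (domain) layer: plain Kotlin, no knowledge of networking, storage, or Android.
data class Hero(val id: Long, val name: String)

// The abstraction is owned by the inner layer.
interface HeroRepository {
    fun heroes(): List<Hero>
}

// A use case depends only on the abstraction, never on a concrete data source.
class ListHeroNames(private val repository: HeroRepository) {
    operator fun invoke(): List<String> = repository.heroes().map { it.name }.sorted()
}

// Outer (data) layer: depends on the inner layer by implementing its interface.
class InMemoryHeroRepository : HeroRepository {
    override fun heroes(): List<Hero> = listOf(Hero(2, "Thor"), Hero(1, "Hulk"))
}

fun main() {
    val listHeroNames = ListHeroNames(InMemoryHeroRepository())
    println(listHeroNames()) // prints [Hulk, Thor]
}
```

Swapping InMemoryHeroRepository for a network-backed or database-backed implementation would not require any change to the use case.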

Advantages of Clean Architecture:

Strict architecture—hard to make mistakes
Business logic is encapsulated, easy to use, and easy to test
Enforcement of dependencies through encapsulation
Allows for parallel development
Highly scalable
Easy to understand and maintain
Testing is facilitated

Let’s understand this through a small case study of an Android project, which gives more practical knowledge than theory alone.

A pragmatic approach

An Android project typically needs to separate the concerns between the UI, the business logic, and the data model. Taking “the theory” into account, we decided to split the project into three modules:

Domain Layer: contains the definitions of the business logic of the app, the data models, the abstract definition of repositories, and the definition of the use cases.

Data Layer: This layer provides the abstract definition of all the data sources. Any application can reuse this without modifications. It contains repositories and data sources implementations, the database definition and its DAOs, the network APIs definitions, some mappers to convert network API models to database models, and vice versa.

Presentation Layer: This is the layer that mainly interacts with the UI. It’s Android-specific and contains fragments, view models, adapters, activities, composables, and so on. It also includes a service locator to manage dependencies.

Marvel’s comic characters App

To elaborate on all the above concepts related to Clean Architecture, we are creating an app that lists Marvel’s comic characters using Marvel’s developer API. The app shows a list of Marvel characters, and clicking on each character will show details of that character. Users can also bookmark their favorite characters. It seems like nothing complicated, right?

Before proceeding further into the sample, it’s good to have an idea of the following frameworks because the example is wholly based on them.

Jetpack Compose – Android’s recommended modern toolkit for building native UI.
Retrofit 2 – A type-safe HTTP client for Android for Network calls.
ViewModel – A class responsible for preparing and managing the data for an activity or a fragment.
Kotlin – Kotlin is a cross-platform, statically typed, general-purpose programming language with type inference.

To get the list of characters, we use Marvel’s developer API, which returns the list of Marvel characters.

http://gateway.marvel.com/v1/public/characters

The domain layer

In the domain layer, we define the data model, the use cases, and the abstract definition of the character repository. The API returns a list of characters, with some info like name, description, and image links.

data class CharacterEntity(
    val id: Long,
    val name: String,
    val description: String,
    val imageUrl: String,
    val bookmarkStatus: Boolean
)

interface MarvelDataRepository {
    suspend fun getCharacters(dataSource: DataSource): Flow<List<CharacterEntity>>
    suspend fun getCharacter(characterId: Long): Flow<CharacterEntity>
    suspend fun toggleCharacterBookmarkStatus(characterId: Long): Boolean
    suspend fun getComics(dataSource: DataSource, characterId: Long): Flow<List<ComicsEntity>>
}

class GetCharactersUseCase(
    private val marvelDataRepository: MarvelDataRepository,
    private val ioDispatcher: CoroutineDispatcher = Dispatchers.IO
) {
    operator fun invoke(forceRefresh: Boolean = false): Flow<List<CharacterEntity>> {
        return flow {
            emitAll(
                marvelDataRepository.getCharacters(
                    if (forceRefresh) {
                        DataSource.Network
                    } else {
                        DataSource.Cache
                    }
                )
            )
        }
            .flowOn(ioDispatcher)
    }
}

The data layer

As we said before, the data layer must implement the abstract definition of the domain layer, so we need to put the repository’s concrete implementation in this layer. To do so, we can define two data sources, a “local” data source to provide persistence and a “remote” data source to fetch the data from the API.

class MarvelDataRepositoryImpl(
    private val marvelRemoteService: MarvelRemoteService,
    private val charactersDao: CharactersDao,
    private val comicsDao: ComicsDao,
    private val ioDispatcher: CoroutineDispatcher = Dispatchers.IO
) : MarvelDataRepository {

    override suspend fun getCharacters(dataSource: DataSource): Flow<List<CharacterEntity>> =
        flow {
            emitAll(
                when (dataSource) {
                    is DataSource.Cache -> getCharactersCache().map { list ->
                        if (list.isEmpty()) {
                            getCharactersNetwork()
                        } else {
                            list.toDomain()
                        }
                    }
                        .flowOn(ioDispatcher)

                    is DataSource.Network -> flowOf(getCharactersNetwork())
                        .flowOn(ioDispatcher)
                }
            )
        }

    private suspend fun getCharactersNetwork(): List<CharacterEntity> =
        marvelRemoteService.getCharacters().body()?.data?.results?.let { remoteData ->
            if (remoteData.isNotEmpty()) {
                charactersDao.upsert(remoteData.toCache())
            }
            remoteData.toDomain()
        } ?: emptyList()

    private fun getCharactersCache(): Flow<List<CharacterCache>> =
        charactersDao.getCharacters()

    override suspend fun getCharacter(characterId: Long): Flow<CharacterEntity> =
        charactersDao.getCharacterFlow(id = characterId).map {
            it.toDomain()
        }

    override suspend fun toggleCharacterBookmarkStatus(characterId: Long): Boolean {

        val status = charactersDao.getCharacter(characterId)?.bookmarkStatus?.not() ?: false

        return charactersDao.toggleCharacterBookmarkStatus(id = characterId, status = status) > 0
    }

    override suspend fun getComics(
        dataSource: DataSource,
        characterId: Long
    ): Flow<List<ComicsEntity>> = flow {
        emitAll(
            when (dataSource) {
                is DataSource.Cache -> getComicsCache(characterId = characterId).map { list ->
                    if (list.isEmpty()) {
                        getComicsNetwork(characterId = characterId)
                    } else {
                        list.toDomain()
                    }
                }
                is DataSource.Network -> flowOf(getComicsNetwork(characterId = characterId))
                    .flowOn(ioDispatcher)
            }
        )
    }

    private suspend fun getComicsNetwork(characterId: Long): List<ComicsEntity> =
        marvelRemoteService.getComics(characterId = characterId)
            .body()?.data?.results?.let { remoteData ->
                if (remoteData.isNotEmpty()) {
                    comicsDao.upsert(remoteData.toCache(characterId = characterId))
                }
                remoteData.toDomain()
            } ?: emptyList()

    private fun getComicsCache(characterId: Long): Flow<List<ComicsCache>> =
        comicsDao.getComics(characterId = characterId)
}

Since we defined a data source to manage persistence, we also need to define the database in this layer; for this we use Room. In addition, it’s good practice to create some mappers to map the API response to the corresponding database entity.

fun List<Characters>.toCache() = map { character -> character.toCache() }

fun Characters.toCache() = CharacterCache(
    id = id ?: 0,
    name = name ?: "",
    description = description ?: "",
    imageUrl = thumbnail?.let {
        "${it.path}.${it.extension}"
    } ?: ""
)

fun List<Characters>.toDomain() = map { character -> character.toDomain() }

fun Characters.toDomain() = CharacterEntity(
    id = id ?: 0,
    name = name ?: "",
    description = description ?: "",
    imageUrl = thumbnail?.let {
        "${it.path}.${it.extension}"
    } ?: "",
    bookmarkStatus = false
)

@Entity
data class CharacterCache(
    @PrimaryKey
    val id: Long,
    val name: String,
    val description: String,
    val imageUrl: String,
    val bookmarkStatus: Boolean = false
) : BaseCache

The presentation layer

In this layer, we need a UI component such as a fragment, an activity, or a composable to display the list of characters; here, we use the widely adopted MVVM approach. The view model takes the use cases in its constructor and invokes the corresponding use case according to user actions (get a character, characters, comics, etc.).

Each use case will invoke the appropriate method in the repository.

class CharactersListViewModel(
    private val getCharacters: GetCharactersUseCase,
    private val toggleCharacterBookmarkStatus: ToggleCharacterBookmarkStatus
) : ViewModel() {

    private val _characters = MutableStateFlow<UiState<List<CharacterViewState>>>(UiState.Loading())
    val characters: StateFlow<UiState<List<CharacterViewState>>> = _characters

    init {
        _characters.value = UiState.Loading()
        getAllCharacters()
    }

    private fun getAllCharacters(forceRefresh: Boolean = false) {
        getCharacters(forceRefresh)
            .catch { error ->
                error.printStackTrace()
                when (error) {
                    is UnknownHostException, is ConnectException, is SocketTimeoutException -> _characters.value =
                        UiState.NoInternetError(error)
                    else -> _characters.value = UiState.ApiError(error)
                }
            }.onEach { list ->
                // onEach is the idiomatic operator for side effects; map is meant for transformations
                _characters.value = UiState.Loaded(list.toViewState())
            }.launchIn(viewModelScope)
    }

    fun refresh(showLoader: Boolean = false) {
        if (showLoader) {
            _characters.value = UiState.Loading()
        }
        getAllCharacters(forceRefresh = true)
    }

    fun bookmarkCharacter(characterId: Long) {
        viewModelScope.launch {
            toggleCharacterBookmarkStatus(characterId = characterId)
        }
    }
}
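
The view model above wraps its data in a UiState type whose definition is not shown in this post. Below is a minimal sketch of one shape it could take. The constructor parameters and the toUiState helper are assumptions made for illustration, not the sample app's actual code.

```kotlin
import java.net.ConnectException
import java.net.SocketTimeoutException
import java.net.UnknownHostException

// Hypothetical state wrapper: every state may carry data or an error.
sealed class UiState<T>(val data: T? = null, val error: Throwable? = null) {
    class Loading<T> : UiState<T>()
    class Loaded<T>(data: T) : UiState<T>(data = data)
    class ApiError<T>(error: Throwable) : UiState<T>(error = error)
    class NoInternetError<T>(error: Throwable) : UiState<T>(error = error)
}

// Hypothetical helper: maps connectivity failures to NoInternetError, everything else to ApiError.
fun <T> Throwable.toUiState(): UiState<T> = when (this) {
    is UnknownHostException, is ConnectException, is SocketTimeoutException ->
        UiState.NoInternetError<T>(this)
    else -> UiState.ApiError<T>(this)
}

fun main() {
    val offline = UnknownHostException("offline").toUiState<List<String>>()
    println(offline is UiState.NoInternetError<*>) // prints true
}
```

Keeping the error mapping in one place like this would let the catch block in the view model shrink to a single assignment.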

/*
* Scaffold(Layout) for Characters list page
* */


@SuppressLint("UnusedMaterialScaffoldPaddingParameter")
@Composable
fun CharactersListScaffold(
    showComics: (Long) -> Unit,
    closeAction: () -> Unit,
    modifier: Modifier = Modifier,
    charactersListViewModel: CharactersListViewModel = getViewModel()
) {
    Scaffold(
        modifier = modifier,
        topBar = {
            TopAppBar(
                title = {
                    Text(text = stringResource(id = R.string.characters))
                },
                navigationIcon = {
                    IconButton(onClick = closeAction) {
                        Icon(
                            imageVector = Icons.Filled.Close,
                            contentDescription = stringResource(id = R.string.close_icon)
                        )
                    }
                }
            )
        }
    ) {
        val state = charactersListViewModel.characters.collectAsState()

        when (state.value) {

            is UiState.Loading -> {
                Loader()
            }

            is UiState.Loaded -> {
                state.value.data?.let { characters ->
                    val isRefreshing = remember { mutableStateOf(false) }
                    SwipeRefresh(
                        state = rememberSwipeRefreshState(isRefreshing = isRefreshing.value),
                        onRefresh = {
                            isRefreshing.value = true
                            charactersListViewModel.refresh()
                        }
                    ) {
                        isRefreshing.value = false

                        if (characters.isNotEmpty()) {

                            LazyVerticalGrid(
                                columns = GridCells.Fixed(2),
                                modifier = Modifier
                                    .padding(5.dp)
                                    .fillMaxSize()
                            ) {
                                items(characters) { state ->
                                    CharacterTile(
                                        state = state,
                                        characterSelectAction = {
                                            showComics(state.id)
                                        },
                                        bookmarkAction = {
                                            charactersListViewModel.bookmarkCharacter(state.id)
                                        },
                                        modifier = Modifier
                                            .padding(5.dp)
                                            .fillMaxHeight(fraction = 0.35f)
                                    )
                                }
                            }

                        } else {
                            Info(
                                messageResource = R.string.no_characters_available,
                                iconResource = R.drawable.ic_no_data
                            )
                        }
                    }
                }
            }

            is UiState.ApiError -> {
                Info(
                    messageResource = R.string.api_error,
                    iconResource = R.drawable.ic_something_went_wrong
                )
            }

            is UiState.NoInternetError -> {
                Info(
                    messageResource = R.string.no_internet,
                    iconResource = R.drawable.ic_no_connection,
                    isInfoOnly = false,
                    buttonAction = {
                        charactersListViewModel.refresh(showLoader = true)
                    }
                )
            }
        }
    }
}

@Preview
@Composable
private fun CharactersListScaffoldPreview() {
    MarvelComicTheme {
        CharactersListScaffold(showComics = {}, closeAction = {})
    }
}

Let’s see what the communication between the layers looks like.

Source: Clean Architecture Tutorial for Android

As you can see, each layer communicates only with the closest one, keeping inner layers independent of outer layers. This way, we can quickly test each module separately, and the separation of concerns helps developers collaborate on the different modules of the project.
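
As a rough illustration of testing a module in isolation, the sketch below exercises a bookmark use case against a fake repository, with no database or network involved. BookmarkRepository, ToggleBookmark, and FakeBookmarkRepository are hypothetical, simplified stand-ins rather than the classes from the sample app.

```kotlin
// Hypothetical, simplified abstraction owned by the domain layer.
interface BookmarkRepository {
    // Flips the bookmark status and returns the new status.
    fun toggleBookmark(characterId: Long): Boolean
}

// Use case under test: depends only on the abstraction.
class ToggleBookmark(private val repository: BookmarkRepository) {
    operator fun invoke(characterId: Long): Boolean = repository.toggleBookmark(characterId)
}

// Test double that keeps bookmarks in memory instead of a real database.
class FakeBookmarkRepository : BookmarkRepository {
    val bookmarked = mutableSetOf<Long>()

    override fun toggleBookmark(characterId: Long): Boolean {
        // add() returns false when the id is already present, so remove it in that case
        if (!bookmarked.add(characterId)) bookmarked.remove(characterId)
        return characterId in bookmarked
    }
}

fun main() {
    val toggleBookmark = ToggleBookmark(FakeBookmarkRepository())
    check(toggleBookmark(42L))   // first call bookmarks the character
    check(!toggleBookmark(42L))  // second call removes the bookmark
    println("use case verified without a database or network")
}
```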

Thank you so much!

December 14, 2022

Your Quintessential Guide to AWS Athena

Introduction

Serverless has become a new trend today and is here to stay for sure! Now when you think of wireless internet, you know that it still has some wires but you don’t need to worry about them as you don’t have to maintain them. Similarly, serverless has servers but you don’t have to keep worrying about handling or maintaining them. All you need to do is focus on your code and you’re good to go.

It has some more benefits, such as:

Zero administration: You can deploy code without provisioning anything beforehand, or managing anything later. There is no concept of a fleet, an instance, or even an operating system.
Auto-scaling: It lets your service providers manage the scaling challenges. You don’t need to fire alerts or write scripts to scale up and down. It handles quick bursts of traffic and weekend lulls the same way.
Pay-per-use: The function-as-a-service compute and managed services are charged based on usage rather than pre-provisioned capacity. You can have complete resource utilization without paying a cent for idle time. The results? Cost savings of up to 90% over an always-on cloud VM, depending on the workload, and the satisfaction of knowing that you never pay for resources you don’t use.

What is AWS Athena?

AWS Athena is one such serverless service. It is an interactive query service rather than a code deployment service.

Using Athena, you can directly query data stored in S3 buckets using standard ANSI SQL.

As mentioned earlier, it works on the principle of serverless, that is, there is no infrastructure to manage, and you pay only for the queries that you run.

Athena is easy to use. You can simply point to your data in Amazon S3, define the schema, and start querying using standard SQL. Most results are delivered within seconds. With Athena, there’s no need for complex ETL jobs to prepare your data for analysis. This makes it easy for anyone with SQL skills to quickly analyze large-scale datasets.

It is based on Facebook’s PrestoDB and can be used to query structured and semi-structured data.

Some Exciting Features of Athena are:

Serverless, with no ETL: there are no servers or data warehouses to set up and manage.
Only pay for the data that is scanned.
You can ensure better performance by compressing, partitioning, and converting your data into columnar formats.
Can also handle complex analysis, including large joins, window functions, and arrays.
Athena automatically executes queries in parallel.
You only need to provide the path to the S3 folder; new files added there are automatically reflected in the table.
Supports –
CSV, JSON, Parquet, ORC, and Avro data formats
Complex Joins and datatypes
View creation
Does not Support –
User-defined functions and stored procedures
Hive or Presto transactions
LZO (Snappy is supported)

Pricing of Athena

AWS Athena is priced at $5 per TB of data scanned.
The amount of data scanned by each query is rounded up to the nearest MB, with a 10 MB minimum.
Users pay for stored data at regular S3 rates.
Amazon advises users to use compressed data files, have data in columnar formats, and routinely delete old results sets to keep charges low. Partitioning data in tables can speed up queries and reduce query bills.
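
The pricing rules above can be turned into a small estimator. This is a sketch under stated assumptions: it uses only the $5 per TB rate, the round-up to the nearest MB, and the 10 MB minimum quoted here, and it assumes binary units (1 TB = 1,048,576 MB). The function names are made up for illustration, and actual bills may differ.

```kotlin
const val BYTES_PER_MB = 1024L * 1024L
const val MB_PER_TB = 1024L * 1024L   // assumption: binary units
const val USD_PER_TB = 5.0            // rate quoted in this post
const val MIN_BILLED_MB = 10L         // minimum billed per query

// Megabytes billed for one query: rounded up to the next whole MB, never below the minimum.
fun billedMegabytes(bytesScanned: Long): Long {
    val roundedUp = (bytesScanned + BYTES_PER_MB - 1) / BYTES_PER_MB
    return maxOf(roundedUp, MIN_BILLED_MB)
}

// Estimated cost in USD for a query that scans the given number of bytes.
fun queryCostUsd(bytesScanned: Long): Double =
    billedMegabytes(bytesScanned).toDouble() / MB_PER_TB * USD_PER_TB

fun main() {
    val oneTb = BYTES_PER_MB * MB_PER_TB
    println(queryCostUsd(oneTb))   // prints 5.0
    println(billedMegabytes(1L))   // prints 10, because the 10 MB minimum applies
}
```

Because the cost depends only on bytes scanned, anything that reduces the scan (compression, columnar formats, partition pruning) lowers the estimate proportionally.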

Athena vs. Redshift Spectrum

AWS also offers Redshift as a data warehouse service, and Redshift Spectrum can be used to query S3 data. So why should you use Athena?

Advantages of Redshift Spectrum:

Allows creation of Redshift tables. You’re able to join Redshift tables with Redshift spectrum tables efficiently.

If you do not need those things, then you should consider Athena as well.

Athena’s differences from Redshift Spectrum:

Billing: This is a major difference, and depending on your use case you may find one much cheaper than the other.
Performance: Athena is slightly faster.
SQL syntax and features: Athena is derived from Presto and is a bit different from Redshift, which has its roots in Postgres.
It’s easy enough to connect to Athena using the API, JDBC, or ODBC, but many more products offer a “standard out of the box” connection to Redshift.
Athena has GIS functions and lambdas.

So, in a nutshell, if you have existing Redshift instances you would probably go for Redshift Spectrum; if not, you can opt for Athena to query the data. In some cases, you can use both in tandem.

Example

Here are sample queries that create a sample database with three tables, basic_details, contact_details, and bill_details, on top of CSV files uploaded to S3:
Basic_details:

Bill_details:

CREATE EXTERNAL TABLE `bil_details`(
  `id` int COMMENT '', 
  `amount_paid` string COMMENT '', 
  `amount_due` string COMMENT '')
ROW FORMAT DELIMITED 
  FIELDS TERMINATED BY ',' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  's3://athena-blog/bill-details'
TBLPROPERTIES (
  'has_encrypted_data'='false', 
  'skip.header.line.count'='1')

Contact_details:

CREATE EXTERNAL TABLE `contact_details`(
  `id` int COMMENT '', 
  `street` string COMMENT '', 
  `city` string COMMENT '', 
  `state` string COMMENT '', 
  `country` string COMMENT '', 
  `zip` string COMMENT '')
ROW FORMAT DELIMITED 
  FIELDS TERMINATED BY ',' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  's3://athena-blog/contact-details'
TBLPROPERTIES (
  'has_encrypted_data'='false', 
  'skip.header.line.count'='1')

Sample query: first names of male customers from Minnesota with amount_due > $100

WITH basic AS 
    (SELECT id,
         first_name
    FROM basic_details
    WHERE lower(gender) = 'male' ), bill AS 
    (SELECT id
    FROM bil_details
    WHERE CAST(amount_due AS INTEGER) > 100 ), contact AS 
    (SELECT contact_details.id
    FROM contact_details
    JOIN bill
        ON contact_details.id = bill.id
    WHERE state= 'Minnesota' )
SELECT basic.first_name
FROM basic
JOIN contact
    ON basic.id = contact.id

Output:

Some Other Sample Queries:

1. Searching for Values in JSON

WITH dataset AS (
  SELECT * FROM (VALUES
    (JSON '{"name": "Bob Smith", "org": "legal", "projects": ["project1"]}'),
    (JSON '{"name": "Susan Smith", "org": "engineering", "projects": ["project1", "project2", "project3"]}'),
    (JSON '{"name": "Jane Smith", "org": "finance", "projects": ["project1", "project2"]}')
  ) AS t (users)
)
SELECT json_extract_scalar(users, '$.name') AS user
FROM dataset
WHERE json_array_contains(json_extract(users, '$.projects'), 'project2')

Output:

2. Extracting properties

WITH dataset AS (
  SELECT '{"name": "Susan Smith",
           "org": "engineering",
           "projects": [{"name":"project1", "completed":false},
           {"name":"project2", "completed":true}]}'
    AS blob
)
SELECT
  json_extract(blob, '$.name') AS name,
  json_extract(blob, '$.projects') AS projects
FROM dataset

Output:

3. Converting JSON to Athena Data Types

WITH dataset AS (
  SELECT
    CAST(JSON '"HELLO ATHENA"' AS VARCHAR) AS hello_msg,
    CAST(JSON '12345' AS INTEGER) AS some_int,
    CAST(JSON '{"a":1,"b":2}' AS MAP(VARCHAR, INTEGER)) AS some_map
)
SELECT * FROM dataset

Output:

Conclusion

To conclude, AWS Athena gives us an efficient way to query raw data stored in different formats in S3 object storage, without provisioning dedicated infrastructure and at minimal cost.

December 12, 2022

Why You Should Prefer Next.js 12 Over Other React Setup
If you are coming from a robust framework, such as Angular or any other major full-stack framework, you have probably asked yourself why a popular library like React (yes, it’s not a framework, hence this blog) has the worst tooling and developer experience.

The React team has done the least amount of work possible to build it: no routing, no SSR support, no decent design system, and no CSS support. Some people might disagree: “The whole idea is to keep it simple so that people can bootstrap their own framework.” –Dan Abramov. But here’s the catch: most people don’t want to go through the tedious process of setting everything up.

Many just want to install and start building some robust applications, and with the new release of Next.js (12), it’s more production-ready than your own setup can ever be.

Before we get started discussing what Next.js 12 can do for us, let’s get some facts straight:
- React is indeed a library that could be used with or without JSX.
- Next.js is a framework (not just for UI) for building full-stack applications.
- Next.js is opinionated, so if you plan to do whatever you want, however you want, maybe Next isn’t the right thing for you (mind that it’s built for production).
- Although Next is one of the most actively updated code bases and has a massive community supporting it, a huge portion of it is handled by Vercel, and like other frameworks backed by a tech giant… be ready for occasional vendor lock-in (don’t forget React–[Meta]).
- This is not a Next.js tutorial; I won’t be going over the basics of Next.js. I will be going over the features released with v12 that push it past the inflection point where Next could be considered the primary framework for React apps.

ES module support

ES modules bring a standardized module system to the entire JS ecosystem. They’re supported by all major browsers and Node.js, and they enable smaller package sizes in your build. With URL imports, you can use any package through a URL, with no installation or build step required. You can use any CDN that serves ES modules, as well as the design tools of the future (Framer already does it –https://www.framer.com/ ).
```
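// URL imports are experimental in Next.js 12 and must be allowed in next.config.js,
// e.g. experimental: { urlImports: ['https://framer.com/m/'] }
import Card from 'https://framer.com/m/Card-3Yxh.js@gsb1Gjlgc5HwfhuD1VId';
import Head from 'next/head';

// A regular page component that renders the remotely imported Card
export default function CardPage() {
  return (
    <>
      <Head>
        <title>URL imports for Next 12</title>
      </Head>
      <div>
        <Card variant='R3F' />
      </div>
    </>
  );
}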
```
As you can see, we are importing a Card component directly from the Framer CDN on the go, with all its perks. This could, in turn, be the start of seamless integration with all your developer environments in the not-too-distant future. If you want to learn more about URL imports and how to enable the alpha version, go here.

New engine for faster DEV run and production build:

Next.js 12 ships with a new Rust-based compiler that runs natively. It is built on top of SWC, an open platform for fast tooling systems. The headline numbers are impressive: about 3 times faster local refresh and 5 times faster production builds.

Most production builds with React use webpack, which comes with a ton of overhead and doesn’t run natively on the system. In contrast, SWC is going to save you a lot of the time you currently waste on mundane workloads.

Source: Nextjs.org

Next.js Live:

If you are anything like me, you’ve probably had some changes that you aren’t really sure about and just want to go through them with the designer, but you don’t really wanna push the code to PROD. Taking a call with the designer and sharing your screen isn’t really the best way to do it. If only there were a way to share your workflow on the go with your team, with a collaboration feature that wouldn’t take up an entire day to set up. Well, Next.js Live lets you do just that.

Source: Next.js

With the help of the ES module system and native support for WebAssembly, Next.js Live runs entirely in the browser, irrespective of where you host it. The development engine behind it will soon be open source so that more platforms can take advantage of it, but for now, it’s all Next.js.

Go over to V and do a test run.

Middleware & serverless:

Middleware functions are the repetitive pieces of code that you think could run on their own, outside your actual backend. The best part is that you don’t need to place them close to your backend. Before the request completes, you can rewrite, redirect, add headers, or even stream HTML. Depending on how you host your middleware, using Vercel edge functions or lambdas with AWS, they can potentially handle:
- Authentication
- Bot protection
- Redirects
- Browser support
- Feature flags
- A/B tests
- Server-side analytics
- Logging
And since this is part of the Next build output, you can technically use any hosting provider with an Edge network (no vendor lock-in).

To implement middleware, we can create a _middleware file inside any folder under pages; it will run before every request to that particular route (routeName):

pages/routeName/_middleware.ts.
```
import type { NextFetchEvent } from 'next/server';
import { NextResponse } from 'next/server';
export function middleware(event: NextFetchEvent) {
  // grab the user's country, or fall back to India when geo data is unavailable
  const country = (event.request.geo?.country ?? 'IND').toLowerCase();

  // rewrite to the static, cached page for each locale
  return event.respondWith(NextResponse.rewrite(`/routeName/${country}`));
}
```
With this middleware, each request is served from a static, cached page for its country. Because a rewrite doesn’t change the URL in the client, Next.js can still tell the requests apart and serve the right country-specific page.

Server-side streaming:

React 18 now supports the server-side Suspense API and SSR streaming. One big drawback of SSR was that its render time wasn’t bounded: the server had to finish its work before anything was sent. So, in theory, any page that needed heavy lifting from the server could give you a higher FCP (first contentful paint). SSR streaming lets you stream server-rendered pages over HTTP as they are generated, which addresses the long render time. You can try the alpha version by adding the following to next.config.js:
```
module.exports = {
  experimental: {
    concurrentFeatures: true
  }
}
```
React server components:

React server components allow us to render almost everything, including the components themselves, on the server. This is fundamentally different from SSR, where you are just generating HTML on the server. With server components, there’s zero client-side JavaScript needed, which makes the rendering process much faster (basically no hydration process). This could also be seen as combining the best parts of server rendering with client-side interactivity.
```
import Footer from '../components/Footer';
import Page from '../components/Page';
import Story from '../components/Story';
import fetchData from '../lib/api';
export async function getServerSideProps() {
  const storyIds = await fetchData('storyIds');
  const data = await Promise.all(
    storyIds.slice(0, 30).map(async (id) => await fetchData(`item/${id}`))
  );

  return {
    props: {
      data,
    },
  };
}

export default function News({ data }) {
  return (
    <Page>
      {data?.map((item, i) => (
        <Story key={i} {...item} />
      ))}
      <Footer />
    </Page>
  );
}
```
As you can see in the above SSR example, while we are fetching the stories from the endpoint, our client is actually waiting for a response with a blank page. Depending on how fast your APIs are, this is a pretty big problem, and it’s the reason we don’t just use SSR blindly everywhere.

Now, let’s take a look at a server component example:

Any file ending with .server.js/.ts will be treated as a server component in your Next.js application.
```
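import { Suspense } from 'react';
import Footer from '../components/Footer';
import Page from '../components/Page';
import fetchData from '../lib/api';
// Spinner and StoryWithData are assumed to be defined elsewhere in the project;
// these two import paths are illustrative.
import Spinner from '../components/Spinner';
import StoryWithData from '../components/StoryWithData';

export async function NewsWithData() {
  const storyIds = await fetchData('storyIds');
  return (
    <>
      {storyIds.slice(0, 30).map((id) => {
        return (
          // each story gets its own Suspense boundary so it can stream independently
          <Suspense key={id} fallback={<Spinner />}>
            <StoryWithData id={id} />
          </Suspense>
        );
      })}
    </>
  );
}

export default function News() {
  return (
    <Page>
      <Suspense fallback={<Spinner />}>
        <NewsWithData />
      </Suspense>
      <Footer />
    </Page>
  );
}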
```
This implementation will stream your components progressively and show your data as it gets generated on the server, component by component. The difference is huge; it will be the next level of code-splitting. It allows you to do data fetching at the component level, and you don’t need to worry about making an API call in the browser.

Functions like getStaticProps and getServerSideProps will become a thing of the past.

This also aligns with the React Hooks model, following the decentralized component model. It removes the choice we often need to make between static and dynamic, bringing the best of both worlds. In the future, incremental static regeneration will work at a per-component level, removing all-or-nothing page caching, which in turn will allow decisive, intelligent caching based on your needs.

Next.js is internally working on a data component, which is basically the React Suspense API but with surrogate keys, revalidate, and fallback. It will help realize these things in the future by letting you define your caching semantics at the component level.

Conclusion:

Although all the features mentioned above are still in the development stage, their mere inception is steering the React world, and frontend in general, in a particular direction. That is the reason you should make Next.js your default go-to production framework.
December 12, 2022
What is Gatsby.Js and What Problems Does it Solve?
According to their site, “Gatsby is a free and open source framework based on React that helps developers build blazing fast websites and apps”. Gatsby allows developers to build a site using React and work with any data source (CMSs, Markdown, etc.) of their choice. At build time, it pulls the data from these sources and generates a set of static files that Gatsby optimizes for performance. Gatsby loads only the critical HTML, CSS and JavaScript so that the site loads as fast as possible. Once loaded, Gatsby prefetches resources for other pages, so clicking around the site feels incredibly fast.

What Gatsby Tries to Achieve?
- Construct new, higher-level web building blocks: Gatsby is trying to build abstractions like gatsby-image and gatsby-link, which make web development easier by providing building blocks instead of making a new component for each project.
- Create a cohesive “content mesh system”: The Content Management System (CMS) was developed to make the content sites possible. Traditionally, a CMS solution was a monolith application to store content, build sites and deliver them to users. But with time, the industry moved to using specialized tools to handle the key areas like search, analytics, payments, etc which have improved rapidly, while the quality of monolithic enterprise CMS applications like Adobe Experience Manager and Sitecore has stayed roughly the same.
  To tackle this modular CMS architecture, Gatsby aims to build a “content mesh” – the infrastructure layer for a decoupled website. The content mesh stitches together content systems in a modern development environment while optimizing website delivery for performance. The content mesh empowers developers while preserving content creators’ workflows. It gives you access to best-of-breed services without the pain of manual integration.
Image Source: Gatsby

- Make building websites fun by making them simple: Each stakeholder in a website project should be able to see their creation quickly. Using these building blocks along with the content mesh, website building feels fun no matter how big it gets. As Alan Kay said, “you get simplicity by finding slightly more sophisticated building blocks”.

An example of this can be seen in the gatsby-image component. First, let’s consider how a single image gets on a website:

1. A page is designed
2. Specific images are chosen
3. The images are resized (with ideally multiple thumbnails to fit different devices)
4. And finally, the image(s) are included in the HTML/CSS/JS (or React component) for the page.

gatsby-image is integrated into Gatsby’s data layer and uses its image processing capabilities along with GraphQL to query for differently sized and shaped images.

We also skip the complexity of lazy-loading images, which are placed within placeholders. The complexity of generating the right-sized image thumbnails is taken care of as well.

So instead of a long pipeline of tasks to set up optimized images for your site, the steps now are:
1. Install gatsby-image
2. Decide what size of image you need
3. Add your query and the gatsby-image component to your page
4. And…that’s it!

Now images are fun!
- Build a better web: Qualities like speed, security, maintainability, SEO, etc. should be baked into the framework being used. If they have to be implemented on a per-site basis, they become a luxury. Gatsby bakes these qualities in by default so that the right thing is the easy thing. The most high-impact way to make the web better is to make it high-quality by default.

It is More Than Just a Static Site Generator

Gatsby is not just for creating static sites. Gatsby is fully capable of generating a PWA with all the things we think that a modern web app can do, including auth, dynamic interactions, fetching data etc.

Gatsby does this by generating the static content using React DOM server-side APIs. Once this basic HTML is generated by Gatsby, React picks up where we left off. That basically means that Gatsby renders as much, upfront, as possible statically then client side React picks up and now we can do whatever a traditional React web app can do.

Best of Both Worlds

By generating static HTML and then handing control to client-side React to do whatever it needs to do, Gatsby gives us the best of both worlds.

Statically rendered pages maximize SEO and provide a better TTI and better general web performance. Static sites are also easy to distribute globally and easier to deploy.

Conclusion

If the code runs successfully in development mode (gatsby develop), it doesn’t mean that there will be no issues with the build version. An easy solution is to build the code regularly and fix the issues as they appear. This is easy enough when the build is generated after every change and the build time is a couple of minutes. But if there are frequent changes and the build is created only a few times a week or month, it gets harder, as multiple issues will have to be solved at build time.

If you have a very big site with a lot of styled components and libraries, the build time increases substantially. If the build takes half an hour, it is no longer feasible to run it after every change, which makes finding build issues regularly more complicated.
December 12, 2022
Web Scraping: Introduction, Best Practices & Caveats
Web scraping is a process to crawl various websites and extract the required data using spiders. This data is processed in a data pipeline and stored in a structured format. Today, web scraping is widely used and has many use cases:
- Using web scraping, Marketing & Sales companies can fetch lead-related information.
- Web scraping is useful for Real Estate businesses to get the data of new projects, resale properties, etc.
- Price comparison portals, like Trivago, extensively use web scraping to get the information of product and price from various e-commerce sites.
The process of web scraping usually involves spiders, which fetch the HTML documents from relevant websites, extract the needed content based on the business logic, and finally store it in a specific format. This blog is a primer on building highly scalable scrapers. We will cover the following items:
1. Ways to scrape: We’ll see basic ways to scrape data using techniques and frameworks in Python with some code snippets.
2. Scraping at scale: Scraping a single page is straightforward, but there are challenges in scraping millions of websites, including managing the spider code, collecting data, and maintaining a data warehouse. We’ll explore such challenges and their solutions to make scraping easy and accurate.
3. Scraping Guidelines: Scraping data from websites without the owner’s permission can be deemed malicious. Certain guidelines need to be followed to ensure our scrapers are not blacklisted. We’ll look at some of the best practices one should follow for crawling.
So let’s start scraping.

Different Techniques for Scraping

Here, we will discuss how to scrape a page and the different libraries available in Python.

Note: Python is the most popular language for scraping.

1. Requests – HTTP Library in Python: To scrape a website or a page, first fetch the content of the HTML page in an HTTP response object. The requests library for Python is pretty handy and easy to use. It uses urllib3 internally. I like ‘requests’ because it’s easy and it keeps the code readable too.
```
#Example showing how to use the requests library
import requests
r = requests.get("https://velotio.com") #Fetch HTML Page
```
2. BeautifulSoup: Once you get the webpage, the next step is to extract the data. BeautifulSoup is a powerful Python library that helps you extract the data from the page. It’s easy to use and has a wide range of APIs that’ll help you extract the data. We use the requests library to fetch an HTML page and then use the BeautifulSoup to parse that page. In this example, we can easily fetch the page title and all links on the page. Check out the documentation for all the possible ways in which we can use BeautifulSoup.
```
from bs4 import BeautifulSoup
import requests
r = requests.get("https://velotio.com") #Fetch HTML Page
soup = BeautifulSoup(r.text, "html.parser") #Parse HTML Page
print("Webpage Title: " + soup.title.string)  # Text of the <title> tag
print("Fetch All Links:", soup.find_all("a"))  # List of all <a> tags
```
3. Python Scrapy Framework:

Scrapy is a Python-based web scraping framework that allows you to create different kinds of spiders to fetch the source code of the target website. Scrapy starts crawling the web pages present on a certain website, and then you can write the extraction logic to get the required data. Scrapy is built on top of Twisted, a Python-based asynchronous library that performs the requests in an async fashion to boost spider performance. Because requests run concurrently, a Scrapy crawl is typically faster than fetching pages one by one and parsing them with BeautifulSoup. Moreover, it is a framework for writing scrapers, as opposed to BeautifulSoup, which is just a library for parsing HTML pages.

Here is a simple example of how to use Scrapy. Install Scrapy via pip. Scrapy gives a shell after parsing a website:
```
$ pip install scrapy #Install Scrapy
$ scrapy shell https://velotio.com
In [1]: response.xpath("//a").extract() #Fetch all a hrefs
```
Now, let’s write a custom spider to parse a website.
```
$ cat > myspider.py <<EOF
import scrapy

class BlogSpider(scrapy.Spider):
    name = 'blogspider'
    start_urls = ['https://blog.scrapinghub.com']

    def parse(self, response):
        for title in response.css('h2.entry-title'):
            yield {'title': title.css('a ::text').extract_first()}
EOF
$ scrapy runspider myspider.py
```
That’s it. Your first custom spider is created. Now, let’s understand the code.
- name: Name of the spider. In this case, it’s “blogspider”.
- start_urls: A list of URLs where the spider will begin to crawl from.
- parse(self, response): This function is called whenever the crawler successfully crawls a URL. The response object used earlier in the Scrapy shell is the same response object that is passed to the parse(..).
When you run this, Scrapy will fetch the start URL, select all the h2 elements with the entry-title class, and extract the text of the link inside each one. Alternatively, you can write your extraction logic in the parse method or create a separate class for extraction and call its object from the parse method.

You’ve seen how to extract simple items from a website using Scrapy, but this is just the surface. Scrapy provides a lot of powerful features for making scraping easy and efficient. Here is a tutorial for Scrapy and the additional documentation for LinkExtractor by which you can instruct Scrapy to extract links from a web page.

4. Python lxml.html library: This is another Python library, just like BeautifulSoup. Scrapy uses lxml internally. It comes with a list of APIs you can use for data extraction. Why would you use this when Scrapy itself can extract the data? Let’s say you want to iterate over the ‘div’ tags and perform some operation on each tag present under a “div”. This library will give you a list of ‘div’ tags, and you can simply iterate over them using the iter() function and traverse each child tag inside the parent div tag. Such traversing operations are difficult in Scrapy. Here is the documentation for this library.

Challenges while Scraping at Scale

Let’s look at the challenges and solutions while scraping at large scale, i.e., scraping 100-200 websites regularly:

1. Data warehousing: Data extraction at a large scale generates vast volumes of information. Fault tolerance, scalability, security, and high availability are the must-have features for a data warehouse. If your data warehouse is not stable or accessible, then operations like searching and filtering the data become an overhead. To achieve this, instead of maintaining your own database or infrastructure, you can use Amazon Web Services (AWS). You can use RDS (Relational Database Service) for a structured database and DynamoDB for a non-relational database. AWS takes care of backing up the data. It automatically takes snapshots of the database. It gives you database error logs as well. This blog explains how to set up infrastructure in the cloud for scraping.

2. Pattern Changes: Scraping relies heavily on the user interface and its structure, i.e., CSS selectors and XPath. If the target website changes its layout, our scraper may crash completely or return random data that we don’t want. This is a common scenario, and that’s why maintaining scrapers is harder than writing them. To handle this case, we can write test cases for the extraction logic and run them daily, either manually or from CI tools like Jenkins, to track whether the target website has changed.
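A test for extraction logic could look like the following minimal sketch. The `extract_titles` function, the `TitleParser` class, and the HTML fixture are hypothetical names made up for illustration, and the standard-library parser is used only to keep the example self-contained; in a real project the function under test would wrap your Scrapy or BeautifulSoup selectors.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text of every <h2 class="entry-title"> element."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "h2" and "entry-title" in classes:
            self._in_title = True
            self.titles.append("")

    def handle_endtag(self, tag):
        if tag == "h2":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.titles[-1] += data


def extract_titles(html):
    """Hypothetical extraction function under test."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return [title.strip() for title in parser.titles]


# Hypothetical fixture: a saved copy of the page structure the spider expects.
FIXTURE = """
<html><body>
  <h2 class="entry-title"><a href="/a">First post</a></h2>
  <h2 class="entry-title"><a href="/b">Second post</a></h2>
  <h2 class="other">Ignore me</h2>
</body></html>
"""


def test_extract_titles():
    # Fails as soon as the selector and the page structure drift apart.
    assert extract_titles(FIXTURE) == ["First post", "Second post"]
```

Run against a saved fixture, this catches regressions in your own code. Run on a schedule against a freshly fetched page, the same assertion style flags layout changes on the target site.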

3. Anti-scraping Technologies: Web scraping is common these days, and many website hosts want to prevent their data from being scraped. Anti-scraping technologies help them do this. For example, if you are hitting a particular website from the same IP address at regular intervals, the target website can block your IP. Adding a captcha to a website also helps. There are methods by which we can bypass these anti-scraping measures. For example, we can use proxy servers to hide our original IP. There are several proxy services that rotate the IP before each request. It is also easy to add support for proxy servers in the code, and in Python, the Scrapy framework supports it.
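Proxy rotation can be sketched roughly as follows. The proxy endpoints and the `proxy_rotation` helper are placeholders invented for this example; the only library convention relied on is that `requests` accepts a `proxies` mapping keyed by URL scheme.

```python
from itertools import cycle

# Hypothetical proxy endpoints; replace with the ones your proxy service provides.
PROXIES = [
    "http://proxy1.example.com:8080",
    "http://proxy2.example.com:8080",
    "http://proxy3.example.com:8080",
]


def proxy_rotation(proxies):
    """Yield a requests-style proxies mapping, cycling through the endpoints."""
    for endpoint in cycle(proxies):
        yield {"http": endpoint, "https": endpoint}


rotation = proxy_rotation(PROXIES)

# Each request then picks the next proxy in the rotation, for example:
#   requests.get(url, proxies=next(rotation), timeout=10)
```

In Scrapy, the built-in HttpProxyMiddleware reads the proxy for a request from `request.meta['proxy']`, so the same rotation idea can be applied there.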

4. JavaScript-based dynamic content: Websites that rely heavily on JavaScript and Ajax to render dynamic content make data extraction difficult. Scrapy and related frameworks/libraries only extract what they find in the HTML document. Ajax calls and JavaScript are executed at runtime, so they can’t scrape that content. This can be handled by rendering the web page in a headless browser such as Headless Chrome, which essentially allows running Chrome in a server environment. You can also use PhantomJS, which provides a headless WebKit-based environment.

5. Honeypot traps: Some websites place honeypot traps on their webpages to detect web crawlers. They are hard to spot, as most of these links are blended with the background color or have the CSS display property set to none. Setting up and avoiding such traps requires a large coding effort on both the server and the crawler side, so this method is not used frequently.

6. Quality of data: Currently, AI and ML projects are in high demand, and these projects need data at a large scale. Data integrity is also important, as one fault can cause serious problems in AI/ML algorithms. So, in scraping, it is very important to not just scrape the data but verify its integrity as well. Doing this in real time is not always possible, so I prefer to write test cases for the extraction logic to make sure that whatever the spiders extract is correct and that they are not scraping any bad data.
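An integrity check on scraped items might look like the sketch below. The field names (`title`, `price`) and the rules are assumptions chosen for illustration; real checks depend on what your spiders extract.

```python
def validate_item(item, required=("title", "price")):
    """Return a list of problems found in a scraped item (empty list means valid)."""
    problems = []

    # Every required field must be present and non-blank.
    for field in required:
        value = item.get(field)
        if value is None or str(value).strip() == "":
            problems.append(f"missing {field}")

    # If a price is present, it must parse as a non-negative number.
    price = item.get("price")
    if price is not None:
        try:
            if float(price) < 0:
                problems.append("negative price")
        except (TypeError, ValueError):
            problems.append("price is not a number")

    return problems
```

Items that come back with a non-empty list can be logged or dropped before they reach the data warehouse; in Scrapy, an item pipeline is a natural place for such a check.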

7. More Data, More Time: This one is obvious. The larger a website is, the more data it contains, and the longer it takes to scrape. This may be fine if your purpose for scraping the site isn’t time-sensitive, but that isn’t often the case. Stock prices don’t stay the same for hours. Sales listings, currency exchange rates, media trends, and market prices are just a few examples of time-sensitive data. What to do in this case? One solution is to design your spiders carefully. If you’re using a framework like Scrapy, apply proper LinkExtractor rules so that the spider does not waste time scraping unrelated URLs.

You may use multithreading scraping packages available in Python, such as Frontera and Scrapy Redis. Frontera lets you send out only one request per domain at a time, but can hit multiple domains at once, making it great for parallel scraping. Scrapy Redis lets you send out multiple requests to one domain. The right combination of these can result in a very powerful web spider that can handle both the bulk and variation for large websites.

8. Captchas: Captchas are a good way of keeping crawlers away from a website, and many website hosts use them. So, in order to scrape data from such websites, we need a mechanism to solve the captchas. There are packages and software that can solve captchas and act as middleware between the target website and your spider. You may also use libraries like Pillow and Tesseract in Python to solve simple image-based captchas.

9. Maintaining Deployment: Normally, we don’t want to limit ourselves to scraping just a few websites. We want as much of the data present on the Internet as possible, and that may mean scraping millions of websites. You can imagine the size of the code and the deployment. We can’t run spiders at this scale from a single machine. What I prefer here is to dockerize the scrapers and take advantage of technologies like AWS ECS and Kubernetes to run our scraper containers. This helps us keep our scrapers highly available and easy to maintain. We can also schedule the scrapers to run at regular intervals.

Scraping Guidelines/ Best Practices

1. Respect the robots.txt file: Robots.txt is a text file that webmasters create to instruct search engine robots on how to crawl and index pages on the website. Before even planning the extraction logic, you should check this file. You can usually find it at the root of the website (for example, /robots.txt). It contains the rules for how crawlers should interact with the website. For example, if a website has a link to download critical information, they probably don’t want to expose that to crawlers. Another important factor is the crawl interval, which means that crawlers should only hit the website at the specified intervals. If someone has asked not to have their website crawled, we better not do it. If they catch your crawlers, it can lead to serious legal issues.
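Python’s standard library can check these rules for you. The sketch below parses an inline, made-up robots.txt so that it runs without network access; the `my-crawler` user agent and the example.com URLs are placeholders.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content, inlined so the example needs no network access.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 10
"""


def build_parser(robots_txt):
    """Build a RobotFileParser from the text of a robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# Check a URL before requesting it.
allowed = parser.can_fetch("my-crawler", "https://example.com/blog/post")
blocked = parser.can_fetch("my-crawler", "https://example.com/private/report")

# Crawl-delay requested by the site, or None if it does not specify one.
delay = parser.crawl_delay("my-crawler")
```

Against a live site, you would call `set_url()` with the site’s robots.txt URL followed by `read()` instead of `parse()`.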

2. Do not hit the servers too frequently: As mentioned above, some websites specify a crawl interval for crawlers. We should respect it, because not every website is tested against high load. If you hit a server continuously at short intervals, you create huge traffic on the server side, and it may crash or fail to serve other requests. This hurts the experience of real users, who are more important than bots. So, we should make requests according to the interval specified in robots.txt, or use a standard delay of 10 seconds. This also helps you avoid getting blocked by the target website.
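One way to enforce such a delay is a small per-domain throttle. The sketch below is only an illustration: the DomainThrottle class and its method names are hypothetical, and the 10-second default simply mirrors the delay suggested above.

```python
import time
from urllib.parse import urlparse


class DomainThrottle:
    """Remembers when each domain was last hit and how long to wait next."""

    def __init__(self, default_delay: float = 10.0, clock=time.monotonic):
        self.default_delay = default_delay
        self.clock = clock      # injectable so the logic can be tested
        self.delays = {}        # per-domain overrides, e.g. from Crawl-delay
        self.last_hit = {}      # domain -> time of the last request

    def set_delay(self, domain: str, delay: float) -> None:
        self.delays[domain] = delay

    def wait_time(self, url: str) -> float:
        """Seconds to wait before it is polite to request this URL."""
        domain = urlparse(url).netloc
        last = self.last_hit.get(domain)
        if last is None:
            return 0.0
        delay = self.delays.get(domain, self.default_delay)
        return max(0.0, last + delay - self.clock())

    def record(self, url: str) -> None:
        """Call right after sending a request to this URL."""
        self.last_hit[urlparse(url).netloc] = self.clock()


def polite_fetch(throttle: DomainThrottle, url: str, fetch):
    """Sleep if needed, then call the caller-supplied fetch function."""
    time.sleep(throttle.wait_time(url))
    throttle.record(url)
    return fetch(url)
```

Because the delay is tracked per domain, requests to other websites are not slowed down while one site is cooling off.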

3. User Agent Rotation and Spoofing: Every request carries a User-Agent string in its headers. This string identifies the browser you are using, its version, and the platform. If we use the same User-Agent in every request, it’s easy for the target website to detect that the requests are coming from a crawler. To avoid this, rotate the User-Agent between requests. Examples of genuine User-Agent strings are easy to find on the Internet. If you’re using Scrapy, you can set the USER_AGENT property in settings.py.
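Outside Scrapy, the rotation can be sketched with the standard library alone. The strings in USER_AGENTS are illustrative samples and build_request is a hypothetical helper; the request is only constructed here, not sent.

```python
import random
from urllib.request import Request

# Illustrative User-Agent strings; keep your own list up to date.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64; rv:108.0) Gecko/20100101 Firefox/108.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/16.1 Safari/605.1.15",
]


def build_request(url: str, rng=random) -> Request:
    """Create a request whose User-Agent is picked at random from the pool."""
    user_agent = rng.choice(USER_AGENTS)
    return Request(url, headers={"User-Agent": user_agent})


request = build_request("https://example.com/")
# urllib stores header names capitalized, hence "User-agent"
print(request.get_header("User-agent"))
```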

4. Disguise your requests by rotating IPs and Proxy Services: We’ve discussed this in the challenges above. It’s always better to use rotating IPs and a proxy service so that your spider won’t get blocked.

5. Do not follow the same crawling pattern: Many websites use anti-scraping technologies, so it’s easy for them to detect your spider if it always crawls in the same pattern. A human would not normally follow a fixed pattern on a website. So, to keep your spiders running smoothly, introduce actions like mouse movements or clicking a random link, which make your spider look more like a human visitor.

6. Scrape during off-peak hours: Off-peak hours are suitable for bots/crawlers as the traffic on the website is considerably less. These hours can be identified by the geolocation from where the site’s traffic originates. This also helps to improve the crawling rate and avoid the extra load from spider requests. Thus, it is advisable to schedule the crawlers to run in the off-peak hours.

7. Use the scraped data responsibly: We should always take responsibility for the data we scrape. It is not acceptable to scrape data and then republish it somewhere else. This can be considered a breach of copyright law and may lead to legal issues. So, it is advisable to check the target website’s Terms of Service page before scraping.

8. Use Canonical URLs: When we scrape, we tend to hit duplicate URLs, and hence collect duplicate data, which is the last thing we want. Within a single website, multiple URLs may serve the same data. In this situation, the duplicate URLs declare a canonical URL, which points to the parent or original URL. By following it, we make sure we don’t scrape duplicate content. Frameworks like Scrapy filter duplicate URLs by default.
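A rough sketch of canonical-based de-duplication, using only the standard library, could look like this. The helper names (canonical_url, should_scrape) and the in-memory seen set are assumptions made for the example; a large crawl would keep the set in shared storage.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class CanonicalLinkParser(HTMLParser):
    """Picks up the href of the first <link rel="canonical"> tag."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.canonical is not None:
            return
        attr = dict(attrs)
        rel_values = (attr.get("rel") or "").lower().split()
        if "canonical" in rel_values and attr.get("href"):
            self.canonical = attr["href"]


def canonical_url(page_url: str, html: str) -> str:
    """Return the page's canonical URL, or the page URL if none is declared."""
    parser = CanonicalLinkParser()
    parser.feed(html)
    if parser.canonical:
        # The href may be relative, so resolve it against the page URL
        return urljoin(page_url, parser.canonical)
    return page_url


seen = set()  # canonical URLs that were already scraped


def should_scrape(page_url: str, html: str) -> bool:
    """True only the first time a given canonical URL is encountered."""
    url = canonical_url(page_url, html)
    if url in seen:
        return False
    seen.add(url)
    return True
```

Two URLs that differ only in tracking parameters but declare the same canonical URL are then processed once.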

9. Be transparent: Don’t misrepresent your purpose or use deceptive methods to gain access. If you have a login and a password that identifies you to gain access to a source, use it. Don’t hide who you are. If possible, share your credentials.

Conclusion

We’ve seen the basics of scraping, frameworks, how to crawl, and the best practices of scraping. To conclude:
- Follow the target website’s rules while scraping. Don’t give it a reason to block your spider.
- Maintenance of data and spiders at scale is difficult. Use Docker/Kubernetes and public cloud providers like AWS to easily scale your web-scraping backend.
- Always respect the rules of the websites you plan to crawl. If APIs are available, always use them first.
December 12, 2022

Building Dynamic Forms in React Using Formik

Every day we see a huge number of web applications that allow customization. They involve drag-and-drop or metadata-driven UIs that support multiple layouts on top of a single backend. A feedback-collection system is one of the simplest examples of such products: on the admin side, one can manage the layout, and on the consumer side, users are shown that layout to capture the data. This post focuses on building a microframework to support such use cases with the help of React and Formik.

Building big forms in React can be extremely time-consuming and tedious when structural changes are requested. Handling their validations also takes up a lot of time in the development life cycle. If we use Redux-based solutions such as Redux-form to simplify this, we see a lot of performance bottlenecks. So here comes Formik!

Why Formik?

“Why” is one of the most important questions while solving any problem. There are quite a few reasons to lean towards Formik for the implementation of such systems, such as:

Simplicity
Advanced validation support with Yup
Good community support with a lot of people helping on GitHub

That being said, it’s one of the easiest frameworks for building forms quickly. Formik’s clean API lets us use it without worrying much about state management.

Yup is probably the best library out there for validation, and Formik provides out-of-the-box support for Yup validations, which makes it more programmer-friendly!

API Responses:

We need to follow certain API structures to let our React code understand which component to render where.

Let’s assume we will be getting responses from the backend API in the following fashion.

[{
   "type": "text",
   "field": "name",
   "name": "User's name",
   "style": {
      "width": "50%"
   }
}]

We can have any number of fields, but each one will have two mandatory properties: type and field. We will use those properties to build the UI as well as the response.

So let’s start with building the simplest form with React and Formik.

import React from 'react';
import { useFormik } from 'formik';

const SignupForm = () => {
  const formik = useFormik({
    initialValues: {
      email: '',
    },
    onSubmit: values => {
      alert(JSON.stringify(values, null, 2));
    },
  });
  return (
    <form onSubmit={formik.handleSubmit}>
      <label htmlFor="email">Email Address</label>
      <input
        id="email"
        name="email"
        type="email"
        onChange={formik.handleChange}
        value={formik.values.email}
      />
      <button type="submit">Submit</button>
    </form>
  );
};

export default SignupForm;

import React from 'react';

export default ({ name }) => <h1>Hello {name}!</h1>;

<div id="root"></div>

import React, { Component } from 'react';
import { render } from 'react-dom';
import Basic from './Basic';
import './style.css';

class App extends Component {
  constructor() {
    super();
    this.state = {
      name: 'React'
    };
  }

  render() {
    return (
      <div>
        <Basic />
      </div>
    );
  }
}

render(<App />, document.getElementById('root'));

{
  "name": "react",
  "version": "0.0.0",
  "private": true,
  "dependencies": {
    "react": "^16.12.0",
    "react-dom": "^16.12.0",
    "formik": "latest"
  },
  "scripts": {
    "start": "react-scripts start",
    "build": "react-scripts build",
    "test": "react-scripts test --env=jsdom",
    "eject": "react-scripts eject"
  },
  "devDependencies": {
    "react-scripts": "latest"
  }
}

h1, p {
  font-family: Lato;
}

You can view the fiddle of above code here to see the live demo.

We will use functional components to build this form. You can find more information on the useFormik hook in the useFormik Hook documentation.

It’s nothing more than just a wrapper for Formik functionality.

Adding dynamic nature

So let’s first create and import the mocked API response to build the UI dynamically.

import React from 'react';
import { useFormik } from 'formik';
import response from "./apiresponse"

const SignupForm = () => {
  const formik = useFormik({
    initialValues: {
      email: '',
    },
    onSubmit: values => {
      alert(JSON.stringify(values, null, 2));
    },
  });
  return (
    <form onSubmit={formik.handleSubmit}>
      <label htmlFor="email">Email Address</label>
      <input
        id="email"
        name="email"
        type="email"
        onChange={formik.handleChange}
        value={formik.values.email}
      />
      <button type="submit">Submit</button>
    </form>
  );
};

export default SignupForm;

You can view the fiddle here.

We simply imported the file and made it available for processing. So now, we need to write the logic to build components dynamically.
So let’s visualize the DOM hierarchy of components possible:

<Container>
	<TextField />
	<NumberField />
	<Container>
		<TextField />
		<BooleanField />
	</Container>
</Container>

A container can be nested within another container, so let’s address this by adding a children attribute to the API response.

export default [
  {
    "type": "text",
    "field": "name",
    "label": "User's name"
  },
  {
    "type": "number",
    "field": "number",
    "label": "User's age",
  },
  {
    "type": "array",
    "field": "none",
    "children": [
      {
        "type": "text",
        "field": "user.hobbies",
        "label": "User's hobbies"
      }
    ]
  }
]

You can see the fiddle with response processing here with live demo.

To process the recursive nature, we will create a separate component.

import React from 'react';

const RecursiveContainer = ({config, formik}) => {
  // Build the element for a single config entry based on its type
  const builder = (individualConfig) => {
    switch (individualConfig.type) {
      case 'text':
        return (
          <div>
            <label htmlFor={individualConfig.field}>{individualConfig.label}</label>
            <input type='text'
              id={individualConfig.field}
              name={individualConfig.field}
              onChange={formik.handleChange} style={{...individualConfig.style}} />
          </div>
        );
      case 'number':
        return (
          <div>
            <label htmlFor={individualConfig.field}>{individualConfig.label}</label>
            <input type='number'
              id={individualConfig.field}
              name={individualConfig.field}
              onChange={formik.handleChange} style={{...individualConfig.style}} />
          </div>
        );
      case 'array':
        // Containers recurse into their children
        return (
          <RecursiveContainer config={individualConfig.children || []} formik={formik} />
        );
      default:
        return <div>Unsupported field</div>
    }
  }

  // Each entry gets a keyed wrapper so React can track the generated fields
  return (
    <>
      {config.map((c, index) => (
        <React.Fragment key={`${c.field}-${index}`}>{builder(c)}</React.Fragment>
      ))}
    </>
  );
};

export default RecursiveContainer;

You can view the complete fiddle of the recursive component here.

What we do here is pretty simple. We pass config, which is the JSON array retrieved from the API response. We iterate through config and build each component based on its type. When the type is array, we render the same RecursiveContainer component for its children, which is basic recursion.

We can make this safer by passing the current depth and restricting recursion to a maximum depth, which avoids stack overflow errors at runtime. There is no standard limit; it varies from use case to use case. If you are planning to build a system based on a compliance questionnaire, it can go to a max depth of 5 to 7, while for a basic signup form, it’s often only 2.

So we generated the forms but how do we validate them? How do we enforce required, min, max checks on the form?

For this, Yup is very helpful. Yup is an object schema validation library that validates an object and gives us the results back. Its chainable syntax makes it much easier to build validation functions incrementally.

Yup provides a wide variety of built-in validations. We can combine them, specify the error or warning messages to be thrown, and much more.

You can find more information on Yup in the Yup Official Documentation.

To build a validation function, we need to pass a Yup schema to Formik.

Here is a simple example:

import React from 'react';
import { useFormik } from 'formik';
import response from "./apiresponse"
import RecursiveContainer from './RecursiveContainer';
import * as yup from 'yup';

const SignupForm = () => {
  const signupSchema = yup.object().shape({
      name: yup.string().required()
  });

  const formik = useFormik({
    initialValues: {
    },
    onSubmit: values => {
      alert(JSON.stringify(values, null, 2));
    },
    validationSchema: signupSchema
  });
  console.log(formik, response)
  return (
    <form onSubmit={formik.handleSubmit}>
      <RecursiveContainer config={response} formik={formik} />
      <button type="submit">Submit</button>
    </form>
  );
};

export default SignupForm;

You can see the schema usage example here.

In this example, we simply created a schema and passed it to the useFormik hook. Notice that the form will not submit until the user fills in the name field.

Here is a simple hack to keep the button disabled until all necessary fields are filled.

import React from 'react';
import { useFormik } from 'formik';
import response from "./apiresponse"
import RecursiveContainer from './RecursiveContainer';
import * as yup from 'yup';

const SignupForm = () => {
  const signupSchema = yup.object().shape({
      name: yup.string().required()
  });

  const formik = useFormik({
    initialValues: {
    },
    onSubmit: values => {
      alert(JSON.stringify(values, null, 2));
    },
    validationSchema: signupSchema,
    // Validate on mount so isValid is already false before the first edit
    validateOnMount: true
  });
  console.log(formik, response)
  return (
    <form onSubmit={formik.handleSubmit}>
      <RecursiveContainer config={response} formik={formik} />
      <button type="submit" disabled={!formik.isValid}>Submit</button>
    </form>
  );
};

export default SignupForm;

You can see submit validation in action in the live fiddle here.

Formik exposes a wide variety of state while the form is being rendered, and we can use it in whatever way suits us. You can find the full Formik API in the Formik Official Documentation.

So existing validations are fine but we often get into cases where we would like to build our own validations. How do we write them and integrate them with Yup validations?

There are two ways to do this with Formik + Yup: either extend Yup to support the additional validation, or pass a validation function to Formik. The validation function approach is much simpler: you just write a function that returns an error object to Formik. As simple as it sounds, it does get messy at times.

So we will look at an example of adding custom validation to Yup. Yup provides an addMethod interface for adding our own user-defined validations to the application.

Let’s say we want to create aliases for existing validations to tolerate differences in casing, because that’s the most common mistake we see: url arrives as Url, or trim comes from the backend as Trim. Method names are case-sensitive, so yup.string().Url is undefined, while yup.string().url is a function. These are just some examples; you can also alias validations under other names, for instance exposing required under the more readable name NotEmpty.

The usage is very simple and straightforward as follows:

yup.addMethod(yup.string, "URL", function(...args) {
    return this.url(...args);
});

This will create an alias for url as URL.

Here is an example of custom method validation which takes Y and N as boolean values.

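const validator = function (message) {
    return this.test('is-string-boolean', message, function (value) {
      // Empty values pass; combine with required() to reject them
      if (value === undefined || value === null || value === '') {
        return true;
      }

      // Only 'Y' and 'N' are accepted as boolean-like strings
      return ['Y', 'N'].includes(value);
    });
  };

// Register the validator under both casings
yup.addMethod(yup.string, 'stringBoolean', validator);
yup.addMethod(yup.string, 'StringBoolean', validator);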

Once the validator is registered with yup.addMethod under both names, we can call yup.string().stringBoolean() and yup.string().StringBoolean().

It’s a pretty handy syntax that lets users create their own validations. You can create many more validations in your project to be used with Yup and reuse them wherever required.

Writing schemas by hand is also cumbersome, and it is not an option when the form is dynamic: when the form is dynamic, the validations need to be dynamic too. Yup’s chainable syntax lets us achieve this very easily.

Let’s assume the backend sends us the following additional validation details with the metadata.

[{
   "type": "text",
   "field": "name",
   "name": "User's name",
   "style": {
      "width": "50%"
   },
   "validationType": "string",
   "validations": [{
      "type": "required",
      "params": ["Name is required"]
   }]
}]

validationType holds one of Yup’s data types, like string, number, date, etc., and validations holds the validations that need to be applied to that field.

So let’s have a look at the following snippet which utilizes the above structure and generates dynamic validation.

import * as yup from 'yup';

/** Adding just additional methods here */

yup.addMethod(yup.string, "URL", function(...args) {
    return this.url(...args);
});


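const validator = function (message) {
    return this.test('is-string-boolean', message, function (value) {
      // Empty values pass; combine with required() to reject them
      if (value === undefined || value === null || value === '') {
        return true;
      }

      // Only 'Y' and 'N' are accepted as boolean-like strings
      return ['Y', 'N'].includes(value);
    });
  };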

yup.addMethod(yup.string, "stringBoolean", validator);
yup.addMethod(yup.string, "StringBoolean", validator);




export function createYupSchema(schema, config) {
  const { field, validationType, validations = [] } = config;
  if (!yup[validationType]) {
    return schema;
  }
  let validator = yup[validationType]();
  validations.forEach((validation) => {
    const { params, type } = validation;
    if (!validator[type]) {
      return;
    }
    validator = validator[type](...params);
  });
  if (field.indexOf('.') !== -1) {
    // nested fields are not covered in this example, though they are easy to handle
  } else {
    schema[field] = validator;
  }

  return schema;
}

export const getYupSchemaFromMetaData = (
  metadata,
  additionalValidations = {},
  forceRemove = []
) => {
  const yupSchema = metadata.reduce(createYupSchema, {});
  const mergedSchema = {
    ...yupSchema,
    ...additionalValidations,
  };

  // Drop validations for fields that should not be validated
  forceRemove.forEach((field) => {
    delete mergedSchema[field];
  });

  const validateSchema = yup.object().shape(mergedSchema);

  return validateSchema;
};

You can see the complete live fiddle with dynamic validations with formik here.

Here we have included the earlier code snippets to show how easily we can add new methods to Yup. Alongside them are two functions, createYupSchema and getYupSchemaFromMetaData, which drive the whole logic for building the dynamic schema. We pass the validations in the response and build the validation from them.

createYupSchema simply builds the Yup validation for a field based on its validations array and validationType. getYupSchemaFromMetaData iterates over the response array, builds the Yup validation for each field, and finally wraps the result in an object schema. In this way, we can generate dynamic validations. One can even go further and create nested validations with recursion.

Conclusion

In the traditional approach of writing large amounts of form boilerplate, adding just one more field is often time-consuming. This approach eliminates the need to hardcode the fields and allows them to be backend-driven.

Formik provides very optimized state management which reduces performance issues that we generally see when Redux is used and updated quite frequently.

As we saw above, it’s very easy to build dynamic forms with Formik. We can save the templates and even create template libraries, which are very common in question-and-answer systems. If utilized correctly, we can simply save the templates in a NoSQL database like MongoDB and quickly generate a vast number of forms, along with their validations.

To learn more and build optimized solutions, you can also refer to the <FastField> and <Field> APIs in the official documentation. Thanks for reading!

December 12, 2022

Using DRF Effectively to Build Cleaner and Faster APIs in Django

Django REST Framework (DRF) is a popular library choice when it comes to creating REST APIs with Django. With minimal effort and time, you can start creating APIs that support authentication, authorization, pagination, sorting, etc. Once we start creating production-level APIs, we need a lot of customization, which DRF supports well.

In this blog post, I will share some of the features that I have used extensively while working with DRF. We will be covering the following use cases:

Using serializer context to pass data from view to serializer
Handling reverse relationships in serializers
Solving slow queries by eliminating the N+1 query problem
Custom Response Format
SerializerMethodField to add read-only derived data to the response
Using Mixin to enable/disable pagination with Query Param

This will help you to write cleaner code and improve API performance.

Prerequisite:

To understand the things discussed in the blog, the reader should have some prior experience of creating REST APIs using DRF. We will not be covering the basic concepts like serializers, API view/viewsets, generic views, permissions, etc. If you need help in building the basics, here is the list of resources from official documentation.

Let’s explore Django REST Framework’s (DRF) lesser-known but useful features:

1. Using Serializer Context to Pass Data from View to Serializer

Let us consider a case when we need to write some complex validation logic in the serializer.

The validation method takes two parameters: self (the serializer object) and the field value received in the request payload. Our validation logic may sometimes need some extra information that must be taken from the database or derived from the view calling the serializer.

This is where the serializer’s context data comes in. The serializer takes a context parameter in the form of a Python dictionary, and this data is available throughout the serializer’s methods. It can be accessed using self.context in the validation methods or any other serializer method.

Passing custom context data to the serializer

To pass the context to the serializer, create a dictionary with the data and pass it in the context parameter when initializing the serializer.

context_data = {"valid_domains": ValidDomain.objects.all()}
serializer = MySerializer(data=request.data, context=context_data)

In the case of generic views and viewsets, the framework handles the serializer initialization and passes the following as the default context.

{
   'request': self.request,
   'format': self.format_kwarg,
   'view': self
}

Thanks to DRF, we can cleanly and easily customize the context data.

# override the get_serializer_context method in the generic view
from rest_framework import generics


class UserCreateListAPIView(generics.ListCreateAPIView):
    def get_serializer_context(self):
        context = super().get_serializer_context()
        # Update context data to add new data
        context.update({"valid_domains": ValidDomain.objects.all()})
        return context

# read the context data in the serializer validation method
from rest_framework import serializers

class UserSerializer(serializers.Serializer):
    email = serializers.EmailField()

    def validate_email(self, val):
        valid_domains = self.context.get("valid_domains")
        # main validation logic goes here
        return val

2. Handling Reverse Relationships in Serializers

To better understand this, take the following example.

class User(models.Model):
   name = models.CharField(max_length=60)
   email = models.EmailField()


class Address(models.Model):
   detail = models.CharField(max_length=100)
   city = models.CharField(max_length=50)
   user = models.ForeignKey(User, related_name="addresses", on_delete=models.CASCADE)

We have a User model, which contains data about the customer, and an Address model, which holds the addresses added by that user. We need to return the user details along with their addresses, as shown below.

{
   "name": "Velotio",
   "email": "velotio@example.com",
   "addresses": [
   	{
       	"detail": "Akshya Nagar 1st Block 1st Cross, Rammurthy nagar",
       	"city": "Banglore"
   	},
   	{
       	"detail": "50 nd Floor, , Narayan Dhuru Street, Mandvi",
       	"city": "Mumbai"
   	},
   	{
       	"detail": "Ground Floor, 8/5, J K Bldg, H G Marg, Opp Gamdevi Temple, Grant Road",
       	"city": "Banglore"
   	}
   ]
}

Forward model relationships are automatically included in the fields returned by the ModelSerializer.
The relationship between User and Address is a reverse relationship and needs to be explicitly added in the fields.
We have defined related_name="addresses" on the user ForeignKey in Address, so that name can be used in the fields Meta option.
If we hadn’t set a related_name, we could use address_set, which is the default name Django generates.

class UserSerializer(serializers.ModelSerializer):
   class Meta:
       model = User
       fields = ("name", "email", "addresses")

The above code will return the following response:

{
   "name": "Velotio",
   "email": "velotio@example.com",
   "addresses": [
       10,
       20,
       45
   ]
}

But this isn’t what we need. We want to return all the information about the address and not just the IDs. DRF gives us the ability to use a serializer as a field to another serializer.

The below code shows how to use the nested Serializer to return the address details.

class AddressSerializer(serializers.ModelSerializer):
   class Meta:
       model = Address
       fields = ("detail", "city") 

class UserSerializer(serializers.ModelSerializer):
   addresses = AddressSerializer(many=True, read_only=True)
   class Meta:
       model = User
       fields = ("name", "email", "addresses")

The read_only=True parameter marks the field as a read-only field.
The addresses field will only be used in GET calls and will be ignored in write operations.
Nested Serializers can also be used in write operations, but DRF doesn’t handle the creation/deletion of nested serializers by default.

3. Solving Slow Queries by Eliminating the N+1 Query Problem

When using nested serializers, the API needs to run queries over multiple tables and a large number of records. This can often lead to slower APIs. A common and easy mistake to make when using serializers with relationships is the N+1 queries problem. Let’s first understand the problem and then look at ways to solve it.

Identifying the N+1 Queries Problem

Let’s take the following API example and count the number of queries hitting the database on each API call.

class Author(models.Model):
   name = models.CharField(max_length=20)


class Book(models.Model):
   name = models.CharField(max_length=20)
   author = models.ForeignKey("Author", models.CASCADE, related_name="books")
   created_at = models.DateTimeField(auto_now_add=True)

class AuthorSerializer(serializers.ModelSerializer):
   class Meta:
       model = Author
       fields = "__all__"


class BookSerializer(serializers.ModelSerializer):
   author = AuthorSerializer()
   class Meta:
       model = Book
       fields = "__all__"

class BookListCreateAPIView(generics.ListCreateAPIView):

	serializer_class = BookSerializer
	queryset = Book.objects.all()

urlpatterns = [
	path('admin/', admin.site.urls),
	path('hello-world/', HelloWorldAPI.as_view()),
	path('books/', BookListCreateAPIView.as_view(), name="book_list")
]

We are creating a simple API to list the books along with the author’s details. Here is the output:

{
  "message": "",
  "errors": [],
  "data": [
    {
      "id": 1,
      "author": {
        "id": 3,
        "name": "Meet teacher."
      },
      "name": "Body society.",
      "created_at": "1973-08-03T02:43:22Z"
    },
    {
      "id": 2,
      "author": {
        "id": 49,
        "name": "Cause wait health."
      },
      "name": "Left next pretty.",
      "created_at": "2000-07-07T03:37:10Z"
    },
    {
      "id": 3,
      "author": {
        "id": 7,
        "name": "No figure those."
      },
      "name": "Reflect American.",
      "created_at": "1994-08-14T03:54:38Z"
    },
    {
      "id": 4,
      "author": {
        "id": 35,
        "name": "Garden order table."
      },
      "name": "Throw minute.",
      "created_at": "1993-12-30T20:50:56Z"
    },
    {
      "id": 5,
      "author": {
        "id": 49,
        "name": "Cause wait health."
      },
      "name": "Congress now build.",
      "created_at": "1977-07-21T17:35:42Z"
    },
    {
      "id": 6,
      "author": {
        "id": 39,
        "name": "Involve section."
      },
      "name": "Activity drop fight.",
      "created_at": "2011-04-21T23:09:54Z"
    },
    {
      "id": 7,
      "author": {
        "id": 44,
        "name": "Cost spring our."
      },
      "name": "Because pattern.",
      "created_at": "2010-01-04T08:21:29Z"
    },
    {
      "id": 8,
      "author": {
        "id": 45,
        "name": "Entire we certainly."
      },
      "name": "Program use feel.",
      "created_at": "1972-11-30T15:49:50Z"
    },
    {
      "id": 9,
      "author": {
        "id": 42,
        "name": "Interest drop."
      },
      "name": "Purpose live might.",
      "created_at": "1987-01-31T16:48:54Z"
    },
    {
      "id": 10,
      "author": {
        "id": 12,
        "name": "Sell data contain."
      },
      "name": "Everyone thing seem.",
      "created_at": "2007-10-19T07:16:34Z"
    }
  ],
  "status": "success"
}

Ideally, we should be able to get this data in a single SQL query. Now, let’s write a test case and see if our assumption is correct:

from django.urls import reverse
from django_seed import Seed

from core.models import Author, Book
from rest_framework.test import APITestCase

seeder = Seed.seeder()


class BooksTestCase(APITestCase):
    def test_list_books(self):
        # Add dummy data to the Author and Book Table
        seeder.add_entity(Author, 5)
        seeder.add_entity(Book, 10)
        seeder.execute()
        # we expect the result in 1 query
        with self.assertNumQueries(1):
            response = self.client.get(reverse("book_list"), format="json")

# test output
$ ./manage.py test
.
.
.
AssertionError: 11 != 1 : 11 queries executed, 1 expected
Captured queries were:
1. SELECT "core_book"."id", "core_book"."name", "core_book"."author_id", "core_book"."created_at" FROM "core_book"
2. SELECT "core_author"."id", "core_author"."name" FROM "core_author" WHERE "core_author"."id" = 4 LIMIT 21
3. SELECT "core_author"."id", "core_author"."name" FROM "core_author" WHERE "core_author"."id" = 1 LIMIT 21
4. SELECT "core_author"."id", "core_author"."name" FROM "core_author" WHERE "core_author"."id" = 4 LIMIT 21
5. SELECT "core_author"."id", "core_author"."name" FROM "core_author" WHERE "core_author"."id" = 4 LIMIT 21
6. SELECT "core_author"."id", "core_author"."name" FROM "core_author" WHERE "core_author"."id" = 5 LIMIT 21
7. SELECT "core_author"."id", "core_author"."name" FROM "core_author" WHERE "core_author"."id" = 5 LIMIT 21
8. SELECT "core_author"."id", "core_author"."name" FROM "core_author" WHERE "core_author"."id" = 1 LIMIT 21
9. SELECT "core_author"."id", "core_author"."name" FROM "core_author" WHERE "core_author"."id" = 3 LIMIT 21
10. SELECT "core_author"."id", "core_author"."name" FROM "core_author" WHERE "core_author"."id" = 3 LIMIT 21
11. SELECT "core_author"."id", "core_author"."name" FROM "core_author" WHERE "core_author"."id" = 5 LIMIT 21

----------------------------------------------------------------------
Ran 1 test in 0.027s

FAILED (failures=1)

As we can see, the test case failed: 11 queries were executed instead of one. In our test case, we added 10 records to the Book model. The number of queries hitting the database is 1 (to fetch the list of books) + the number of records in the Book model (to fetch the author details for each book record). The test output shows the SQL queries executed.

The side effects of this can easily go unnoticed while working on a test database with a small number of records. But in production, when the data grows to thousands of records, this can seriously degrade the performance of the database and application.
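
The 1 + N arithmetic can be reproduced outside Django with a small sqlite3 sketch. The table layout, the run_query helper, and the query_log list are assumptions made for illustration; they only mimic the captured-queries list from the test output above and are not how Django issues its queries.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE author (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE book (
        id INTEGER PRIMARY KEY,
        name TEXT,
        author_id INTEGER REFERENCES author(id)
    );
    """
)
# 5 authors and 10 books, the same sizes as in the test case
conn.executemany(
    "INSERT INTO author (id, name) VALUES (?, ?)",
    [(i, f"Author {i}") for i in range(1, 6)],
)
conn.executemany(
    "INSERT INTO book (id, name, author_id) VALUES (?, ?, ?)",
    [(i, f"Book {i}", (i % 5) + 1) for i in range(1, 11)],
)

query_log = []


def run_query(sql, params=()):
    """Run a SELECT and record it, mimicking a captured-queries list."""
    query_log.append(sql)
    return conn.execute(sql, params).fetchall()


# N+1 pattern: one query for the books, then one more per book for its author
books = run_query("SELECT id, name, author_id FROM book")
for _book_id, _book_name, author_id in books:
    run_query("SELECT id, name FROM author WHERE id = ?", (author_id,))
n_plus_one_queries = len(query_log)

# JOIN pattern: the same data fetched in a single query
query_log.clear()
joined_rows = run_query(
    "SELECT book.id, book.name, author.id, author.name "
    "FROM book INNER JOIN author ON book.author_id = author.id"
)
join_queries = len(query_log)

print(n_plus_one_queries, join_queries)
```

With 10 books, the loop issues 11 queries while the join needs one, and the gap grows linearly with the number of rows.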

Let’s Do It the Right Way

If we think of this in terms of a raw SQL query, it can be achieved with a simple INNER JOIN between the Book and Author tables. We need to do something similar in our Django query.

Django provides select_related and prefetch_related to handle query problems around related objects.

select_related works on forward ForeignKey, OneToOne, and backward OneToOne relationships by creating a database JOIN and fetching the related field data in one single query.
prefetch_related works on forward and reverse ManyToMany relationships and on reverse ForeignKey relationships. It runs a separate query for each relationship and performs the “joining” in Python.
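
To make the second strategy more concrete, here is a rough sqlite3 sketch of the idea behind it: one query for the parent rows, one IN query for all related rows, and the matching done in Python. The tables and variable names are hypothetical, and this illustrates the approach only; it is not Django's implementation.

```python
import sqlite3
from collections import defaultdict

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE author (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE book (id INTEGER PRIMARY KEY, name TEXT, author_id INTEGER);
    INSERT INTO author (id, name) VALUES (1, 'A'), (2, 'B'), (3, 'C');
    INSERT INTO book (id, name, author_id) VALUES
        (1, 'Book 1', 1), (2, 'Book 2', 1), (3, 'Book 3', 2);
    """
)

# Query 1: fetch the parent rows
authors = conn.execute("SELECT id, name FROM author ORDER BY id").fetchall()

# Query 2: fetch every related row for those parents with a single IN (...) query
author_ids = [author_id for author_id, _name in authors]
placeholders = ", ".join("?" for _ in author_ids)
books = conn.execute(
    f"SELECT name, author_id FROM book WHERE author_id IN ({placeholders}) ORDER BY id",
    author_ids,
).fetchall()

# The "join" happens in Python: group the books by their author_id
books_by_author = defaultdict(list)
for book_name, author_id in books:
    books_by_author[author_id].append(book_name)

result = [
    {"name": name, "books": books_by_author.get(author_id, [])}
    for author_id, name in authors
]
print(result)
```

The number of queries stays at two no matter how many authors or books exist, which is why this approach suits to-many relationships where a JOIN would multiply the parent rows.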

Let’s rewrite the above code using select_related and check the number of queries.

We only need to change the queryset in the view.

class BookListCreateAPIView(generics.ListCreateAPIView):

   serializer_class = BookSerializer

   def get_queryset(self):
       queryset = Book.objects.select_related("author").all()
       return queryset

Now, we will rerun the test, and this time it should pass:

$ ./manage.py test	 
Creating test database for alias 'default'...
System check identified no issues (0 silenced).
.
----------------------------------------------------------------------
Ran 1 test in 0.024s

OK
Destroying test database for alias 'default'...

If you are interested in knowing the SQL query executed, here it is:

>> queryset = Book.objects.select_related("author").all()
>> print(queryset.query)

SELECT "core_book"."id",
       "core_book"."name",
       "core_book"."author_id",
       "core_book"."created_at",
       "core_author"."id",
       "core_author"."name"
FROM "core_book"
         INNER JOIN "core_author" ON ("core_book"."author_id" = "core_author"."id")

4. Custom Response Format

It’s a good practice to decide the API endpoints and their request/response payloads before starting the actual implementation. If you are the developer implementing an API whose response format has already been decided, you cannot simply go with the default response returned by DRF.

Let’s assume that the agreed format for the response is the following:

{
  "message": "",
  "errors": [],
  "data": [
    {
      "id": 1,
      "author": {
        "id": 3,
        "name": "Meet teacher."
      },
      "name": "Body society.",
      "created_at": "1973-08-03T02:43:22Z"
    },
    {
      "id": 2,
      "author": {
        "id": 49,
        "name": "Cause wait health."
      },
      "name": "Left next pretty.",
      "created_at": "2000-07-07T03:37:10Z"
    }
  ],
  "status": "success"
}

We can see that the response format has message, errors, status, and data attributes. Next, we will see how to write a custom renderer to achieve the above response format. Since the format is JSON, we override rest_framework.renderers.JSONRenderer.

from rest_framework.renderers import JSONRenderer
from rest_framework.views import exception_handler


class CustomJSONRenderer(JSONRenderer):
   def render(self, data, accepted_media_type=None, renderer_context=None):
       # reformat the response
       response_data = {"message": "", "errors": [], "data": data, "status": "success"}
       # call super to render the response
       response = super(CustomJSONRenderer, self).render(
           response_data, accepted_media_type, renderer_context
       )

       return response
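
Note that the renderer above reports "status": "success" for every response, including error responses. One possible way to extend it is sketched below as a plain function, kept separate from DRF so it can be tried in isolation. The name build_envelope, the "failure" status value, and the rule that any status code of 400 or above counts as an error are assumptions for illustration, not part of the original renderer.

```python
import json


def build_envelope(data, status_code=200):
    """Wrap a payload in the message/errors/data/status format."""
    if status_code >= 400:
        # Assumption: error payloads move under "errors" and "data" is left empty
        errors = data if isinstance(data, list) else [data]
        return {"message": "", "errors": errors, "data": None, "status": "failure"}
    return {"message": "", "errors": [], "data": data, "status": "success"}


print(json.dumps(build_envelope([{"id": 1, "name": "Body society."}])))
print(json.dumps(build_envelope({"detail": "Not found."}, status_code=404)))
```

Inside CustomJSONRenderer.render, the status code should be reachable through renderer_context["response"].status_code, so a helper like this could replace the hard-coded dictionary.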

To use this new renderer, we need to add it to DRF settings:

REST_FRAMEWORK = {
   "DEFAULT_RENDERER_CLASSES": (
       "core.renderer.CustomJSONRenderer",
       "rest_framework.renderers.JSONRenderer",
       "rest_framework.renderers.BrowsableAPIRenderer",
   )
}

5. Use the SerializerMethodField to Add Read-Only Derived Data to the Response

The SerializerMethodField can be used when we want to add some derived data to the object. Consider the same Book listing API. If we want to send an additional property, book_display_name, which is the book name in uppercase, we can use the SerializerMethodField as shown below.

class BookSerializer(serializers.ModelSerializer):
   author = AuthorSerializer()
   book_display_name = serializers.SerializerMethodField(method_name="get_book_display_name")

   def get_book_display_name(self, book):
       return book.name.upper()

   class Meta:
       model = Book
       fields = "__all__"

The SerializerMethodField takes the method_name parameter (not source), where we can pass the name of the method that should be called.
The method receives self and the object being serialized as its arguments.
By default, DRF looks for a method named get_{field_name}, so in the example above, the method_name parameter can be omitted, and it will still give the same result. Older DRF releases may even raise an assertion error when method_name merely repeats the default, so omitting it is the safer choice.

book_display_name = serializers.SerializerMethodField()

6. Use a Mixin to Enable/Disable Pagination with a Query Param

If you are developing APIs for an internal application and want to support APIs with pagination both enabled and disabled, you can make use of the Mixin below. This allows the caller to use the query parameter “pagination” to enable/disable pagination. This Mixin can be used with the generic views.

class DynamicPaginationMixin(object):
    """
    Controls the pagination enable/disable option using the query param "pagination".
    If pagination=false is passed in query params, data is returned without pagination.
    """
    def paginate_queryset(self, queryset):
        pagination = self.request.query_params.get("pagination", "true")
        # Query params are strings, so compare the text; bool("false") would be True
        if pagination.lower() == "false":
            return None

        return super().paginate_queryset(queryset)

# Remember to use mixin before the generics
class BookListCreateAPIView(DynamicPaginationMixin, generics.ListCreateAPIView):

    serializer_class = BookSerializer

    def get_queryset(self):
        queryset = Book.objects.select_related("author").all()
        return queryset
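
Query parameters always arrive as strings, and any non-empty string (including "false") is truthy in Python, so calling bool() on the raw value cannot tell the two cases apart. A small sketch of a more forgiving parser follows; the function name pagination_enabled and the accepted spellings in FALSE_VALUES are assumptions for illustration.

```python
FALSE_VALUES = {"false", "0", "no", "off"}


def pagination_enabled(raw_value, default=True):
    """Interpret a "pagination" query-param string as a boolean."""
    # A missing or blank parameter falls back to the default behaviour
    if raw_value is None or raw_value.strip() == "":
        return default
    return raw_value.strip().lower() not in FALSE_VALUES


print(bool("false"))                 # True: any non-empty string is truthy
print(pagination_enabled("false"))   # False
print(pagination_enabled("TRUE"))    # True
```

In the mixin, paginate_queryset could return None whenever a helper like this reports that pagination is disabled.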

Conclusion

This was just a small selection of all the awesome features provided by Django and DRF, so keep exploring. I hope you learned something new today. If you are interested in learning more about serverless deployment of Django Applications, you can refer to our comprehensive guide to deploy serverless, event-driven Python applications using Zappa.

How to Use Pytest Fixtures With Django Models

With the test framework that Python and Django provide, boilerplate, maintainability, and duplication issues arise as your projects grow. It’s also not a very Pythonic way of writing tests.

Pytest provides a simple and more elegant way to write tests.

It lets you write tests as plain functions, which removes a lot of boilerplate code and makes your tests more readable and easier to maintain. Pytest also provides test discovery and a way to define and use fixtures.

Why Pytest Fixtures?

When writing tests, it’s very common for a test to need objects, and the same objects may be needed by multiple tests. Creating these objects can be a complicated process. Repeating that process in each test case is difficult, and whenever a model changes, we would need to update the logic in every place. This leads to code duplication and maintainability issues.

To avoid all of this, we can use the fixtures provided by pytest: we define a fixture in one place and then inject it into any test that needs it.

In the literal sense, fixtures are where we prepare everything for our test. They’re everything that the test needs to do its thing.

We are going to explore how to use fixtures with Django models effectively, in a way that is readable and easy to maintain. These are the fixtures provided by pytest, not to be confused with Django fixtures.

Installation and Setup

For this blog, we will set up a basic e-commerce application and set up the test suite for pytest.

Creating Django App

Before we begin testing, let’s create a basic e-commerce application and add a few models on which we can perform tests later.

To create a Django app, go to the folder you want to work in, open the terminal, and run the below commands:

$ django-admin startproject e_commerce_app
$ cd e_commerce_app
$ python manage.py startapp product

Once the app is created, go to the settings.py and add the newly created product app to the INSTALLED_APPS.

# Application definition
INSTALLED_APPS = [
    'django.contrib.admin',
    'django.contrib.auth',
    'django.contrib.contenttypes',
    'django.contrib.sessions',
    'django.contrib.messages',
    'django.contrib.staticfiles',
    'product'
]

Now, let’s create basic models in the models.py of the product app.

from django.db import models

class Retail(models.Model):
    name = models.CharField(max_length=128)

class Category(models.Model):
    name = models.CharField(max_length=128, unique=True)

class Product(models.Model):
    sku = models.CharField(max_length=50, unique=True)  # unique model number
    name = models.CharField(max_length=50)
    description = models.TextField(default="", blank=True)
    mrp = models.DecimalField(max_digits=10, decimal_places=2)
    weight = models.DecimalField(max_digits=10, decimal_places=2)
    retails = models.ManyToManyField(
        Retail,
        related_name="products",
        verbose_name="Retail stores that carry the product",
    )
    category = models.ForeignKey(
        Category, 
        related_name="products",
        on_delete=models.CASCADE,
        blank=True,
        null=True,
    )
    date_created = models.DateTimeField(auto_now_add=True)
    date_modified = models.DateTimeField(auto_now=True)

Here, each product will have a category and will be available at many retail stores. Now, let’s create the migration files and apply them:

$ python manage.py makemigrations
$ python manage.py migrate

The models and the database are now ready, and we can move on to writing test cases for these models.

Let’s set up pytest in our Django app first.

For testing our Django applications with pytest, we will use the pytest-django plugin, which provides a set of useful tools for testing Django apps and projects. Let’s start by installing and configuring the plugin.

Installing pytest

Pytest can be installed with pip:

$ pip install pytest-django

Installing pytest-django will also automatically install the latest version of pytest. Once installed, we need to tell pytest-django where our settings.py file is located.

The easiest way to do this is to create a pytest configuration file with this information.

Create a file called pytest.ini in your project directory and add this content:

[pytest]
DJANGO_SETTINGS_MODULE=e_commerce_app.settings

You can provide various other settings in this file to define how the tests should run.

For example, to configure how test files are detected across the project, we can add this line:

[pytest]
DJANGO_SETTINGS_MODULE=e_commerce_app.settings
python_files = tests.py test_*.py *_tests.py
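
To get a feel for which filenames those three glob patterns would pick up, they can be tried against a few names with Python's fnmatch module. This only illustrates the glob semantics and is not pytest's actual collection code; the helper is_test_file and the sample filenames are made up for the example.

```python
from fnmatch import fnmatch

# The same patterns as in the python_files setting above
PATTERNS = ["tests.py", "test_*.py", "*_tests.py"]


def is_test_file(filename):
    """Return True if the filename matches any of the configured globs."""
    return any(fnmatch(filename, pattern) for pattern in PATTERNS)


for name in ["tests.py", "test_models.py", "product_tests.py", "models.py"]:
    print(name, is_test_file(name))
```

Only models.py falls outside the patterns here, so it would not be treated as a test module.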

Adding Test Suite to the Django App

Django and pytest automatically detect and run your test cases in files whose name starts with ‘test’.

In the product app folder, create a new module named tests. Then add a file called test_models.py in which we will write all the model test cases for this app.

$ cd product
$ mkdir tests
$ cd tests && touch test_models.py

Running your Test Suite

Tests are invoked directly with the pytest command:

$ pytest
$ pytest tests                          # test a directory
$ pytest test.py                        # test file

For now, we are configured and ready for writing the first test with pytest and Django.

Writing Tests with Pytest

Here, we will write a few test cases to test the models we have written in the models.py file. To start with, let’s create a simple test case to test the category creation.

from product.models import Category

def test_create_category():
    category = Category.objects.create(name="Books")
    assert category.name == "Books"

Now, try to execute this test from your command line:

$ pytest
============================= test session starts ==============================
platform linux -- Python 3.7.0, pytest-6.2.2, py-1.10.0, pluggy-0.13.1
django: settings: pytest_fixtures.settings (from ini)
rootdir: /home/suraj/PycharmProjects/e_commerce_app, configfile: pytest.ini
plugins: django-4.1.0
collected 1 item                                                               

product/tests/test_models.py F                                           [100%]

=================================== FAILURES ===================================
_____________________________ test_create_category _____________________________

    def test_create_category():
>       category = Category.objects.create(name="Books")

product/tests/test_models.py:5: 
...
E       RuntimeError: Database access not allowed, use the "django_db" mark, or the "db" or "transactional_db" fixtures to enable it.

venv/lib/python3.7/site-packages/django/db/backends/base/base.py:235: RuntimeError
=========================== short test summary info ============================
FAILED product/tests/test_models.py::test_create_category - RuntimeError: Dat...
============================== 1 failed in 0.21s ===============================

The test failed. If you look at the error, it has something to do with the database. The pytest-django documentation says:

pytest-django takes a conservative approach to enabling database access. By default your tests will fail if they try to access the database. Only if you explicitly request database access will this be allowed. This encourages you to keep database-needing tests to a minimum which makes it very clear what code uses the database.

This means we need to explicitly provide database access to our test cases. For this, we use [pytest marks](<https://docs.pytest.org/en/stable/mark.html#mark>) to tell pytest-django that a test needs database access.

import pytest
from product.models import Category

@pytest.mark.django_db
def test_create_category():
    category = Category.objects.create(name="Books")
    assert category.name == "Books"

Alternatively, we can access the database in test cases through the db helper fixture provided by pytest-django. This fixture ensures the Django database is set up. Requesting it in a test has the same effect as the django_db mark, and it is the way to get database access from inside other fixtures, since marks cannot be applied to fixtures.

from product.models import Category

def test_create_category(db):
    category = Category.objects.create(name="Books")
    assert category.name == "Books"

Going forward, we will use the db fixture approach as it promotes code reusability using fixtures.

Run the test again:

$ pytest
============================= test session starts ==============================
platform linux -- Python 3.7.0, pytest-6.2.2, py-1.10.0, pluggy-0.13.1
django: settings: pytest_fixtures.settings (from ini)
rootdir: /home/suraj/PycharmProjects/e_commerce_app, configfile: pytest.ini
plugins: django-4.1.0
collected 1 item                                                               

product/tests/test_models.py .                                           [100%]

============================== 1 passed in 0.24s ===============================

The command completed successfully and your test passed. Great! We have successfully written our first test case using pytest.
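
A note on what db does between tests: pytest-django runs each db-enabled test inside a transaction and rolls it back afterwards, so rows created by one test are not visible to the next. The sketch below mimics that idea with the standard library's sqlite3 module. The table, the run_isolated helper, and the fake test are made up for illustration; this is not pytest-django's own code.

```python
import sqlite3

# In-memory database standing in for the test database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE category (name TEXT UNIQUE)")
conn.commit()


def run_isolated(test_body) -> None:
    """Run test_body, then roll back whatever it wrote (hypothetical helper)."""
    try:
        test_body(conn)
    finally:
        conn.rollback()


def count_categories() -> int:
    """Number of rows currently visible in the category table."""
    return conn.execute("SELECT COUNT(*) FROM category").fetchone()[0]


def fake_test_create_category(connection) -> None:
    # The INSERT opens a transaction implicitly; the row is visible inside it.
    connection.execute("INSERT INTO category (name) VALUES ('Books')")
    assert connection.execute("SELECT COUNT(*) FROM category").fetchone()[0] == 1


run_isolated(fake_test_create_category)
# The second run does not hit the UNIQUE constraint, because the first
# insert was rolled back before this one started.
run_isolated(fake_test_create_category)
```

After both runs the table is empty again, which is the same guarantee that lets every test in this post create its own "Books" category.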

Creating Fixtures for Django Models

Now that you’re familiar with Django and pytest, let’s add test cases to check that a category can be filtered and updated.

from product.models import Category

def test_filter_category(db):
    Category.objects.create(name="Books")
    assert Category.objects.filter(name="Books").exists()

def test_update_category(db):
    category = Category.objects.create(name="Books")
    category.name = "DVDs"
    category.save()
    category_from_db = Category.objects.get(name="DVDs")
    assert category_from_db.name == "DVDs"

If you look at both test cases, you can see that neither of them tests the Category creation logic, yet the Category instance is created twice, once per test case. Once the project becomes large, we might have many test cases that need a Category instance. If every test creates its own category, every one of them has to be updated whenever the Category model changes.

This is where fixtures come to the rescue. They promote code reusability in your test cases. To reuse an object in many test cases, you can create a test fixture:

import pytest
from product.models import Category

@pytest.fixture
def category(db) -> Category:
    return Category.objects.create(name="Books")

def test_filter_category(category):
    assert Category.objects.filter(name="Books").exists()

def test_update_category(category):
    category.name = "DVDs"
    category.save()
    category_from_db = Category.objects.get(name="DVDs")
    assert category_from_db.name == "DVDs"

Here, we have created a simple function called category and decorated it with @pytest.fixture to mark it as a fixture. It can now be injected into the test cases just like we injected the fixture db.

Now, if a new requirement comes in that every category should have a description and a small icon to represent it, we don’t need to go to each test case and update the category creation logic. We just need to update the fixture, i.e., one place only, and the change takes effect in every test case.

import pytest
from product.models import Category

@pytest.fixture
def category(db) -> Category:
    return Category.objects.create(
        name="Books", description="Category of Books", icon="books.png"
    )

Using fixtures, you can avoid code duplication and make tests more maintainable.

Parametrizing Fixtures

It is recommended to have a single fixture function that can be executed across different input values. This can be achieved via parameterized pytest fixtures.

Let’s write a fixture for the product, assuming each product needs a SKU (product number) that has six characters and contains only alphanumeric characters.

import pytest
from product.models import Category, Product

@pytest.fixture
def product_one(db):
    return Product.objects.create(name="Book 1", sku="ABC123")

def test_product_sku(product_one):
    assert all(letter.isalnum() for letter in product_one.sku)
    assert len(product_one.sku) == 6

We now want to run the test against multiple SKU values and make sure it passes for all of these inputs. We can parametrize the fixture so that it creates three different product_one instances. The fixture function gets access to each parameter through the special request object:

import pytest
from product.models import Product

@pytest.fixture(params=("ABC123", "123456", "ABCDEF"))
def product_one(db, request):
    return Product.objects.create(name="Book 1", sku=request.param)

def test_product_sku(product_one):
    assert all(letter.isalnum() for letter in product_one.sku)
    assert len(product_one.sku) == 6

Fixture functions can be parametrized in which case they will be called multiple times, each time executing the set of dependent tests, i.e., the tests that depend on this fixture.

Test functions usually do not need to be aware of their re-running. Fixture parametrization helps to write exhaustive functional tests for components that can be configured in multiple ways.

Open the terminal and run the test:

$ pytest
============================= test session starts ==============================
platform linux -- Python 3.7.0, pytest-6.2.2, py-1.10.0, pluggy-0.13.1
django: settings: pytest_fixtures.settings (from ini)
rootdir: /home/suraj/PycharmProjects/e_commerce_app, configfile: pytest.ini
plugins: django-4.1.0
collected 3 items                                                              

product/tests/test_models.py ...                                         [100%]

============================== 3 passed in 0.27s ===============================

We can see that our test_product_sku function ran three times, once for each SKU in params.
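
To make the "one test, several runs" idea concrete without a database, here is a plain-Python sketch. The helpers is_valid_sku and expand_runs are hypothetical names; the labels they build only resemble the test IDs that pytest prints in verbose mode.

```python
# The same three values passed to the parametrized fixture above.
SKU_PARAMS = ("ABC123", "123456", "ABCDEF")


def is_valid_sku(sku: str) -> bool:
    """Same two checks as test_product_sku: six characters, all alphanumeric."""
    return len(sku) == 6 and all(ch.isalnum() for ch in sku)


def expand_runs(test_name: str, params) -> list:
    """Build one run label per parameter, in parameter order."""
    return [f"{test_name}[{param}]" for param in params]


runs = expand_runs("test_product_sku", SKU_PARAMS)
# Map each run label to the outcome of the check for its parameter.
results = {run: is_valid_sku(param) for run, param in zip(runs, SKU_PARAMS)}
```

Each entry in results corresponds to one of the three dots in the pytest output above. A value such as "AB-123" would fail the check because of the hyphen.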

Injecting Fixtures into Other Fixtures

We will often need an object that depends on some other object. Let’s try to create a few products under the category “Books”.

import pytest

from product.models import Category, Product

@pytest.fixture
def product_one(db):
    category = Category.objects.create(name="Books")
    return Product.objects.create(name="Book 1", category=category)

@pytest.fixture
def product_two(db):
    category = Category.objects.create(name="Books")
    return Product.objects.create(name="Book 2", category=category)

def test_two_different_books_create(product_one, product_two):
    assert product_one.pk != product_two.pk

If we try to test this in the terminal, we will encounter an error:

$ pytest
============================= test session starts ==============================
platform linux -- Python 3.7.0, pytest-6.2.2, py-1.10.0, pluggy-0.13.1
django: settings: pytest_fixtures.settings (from ini)
rootdir: /home/suraj/PycharmProjects/e_commerce_app, configfile: pytest.ini
plugins: django-4.1.0
collected 1 item                                                               

product/tests/test_models.py E                                           [100%]

==================================== ERRORS ====================================
______________ ERROR at setup of test_two_different_books_create _______________
...
query = 'INSERT INTO "product_category" ("name") VALUES (?)', params = ['Books']
...
E       django.db.utils.IntegrityError: UNIQUE constraint failed: product_category.name

venv/lib/python3.7/site-packages/django/db/backends/sqlite3/base.py:413: IntegrityError
=========================== short test summary info ============================
ERROR product/tests/test_models.py::test_two_different_books_create - django....
=============================== 1 error in 0.44s ===============================

The test case throws an IntegrityError, saying we tried to create the “Books” category twice. And if you look at the code, we have created the category in both product_one and product_two fixtures. What could we have done better?
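
The same failure can be reproduced outside Django. The sketch below uses the standard library's sqlite3 module and a table named after the one in the traceback; the create_category helper is made up for illustration and assumes, as the error suggests, that the name column is unique.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumption: the name column carries a UNIQUE constraint, as the error implies.
conn.execute("CREATE TABLE product_category (name TEXT UNIQUE)")


def create_category(name: str) -> None:
    """Insert one category row (hypothetical stand-in for Category.objects.create)."""
    conn.execute("INSERT INTO product_category (name) VALUES (?)", [name])


create_category("Books")  # what the product_one fixture does
try:
    create_category("Books")  # what the product_two fixture does in the same test
    error_message = None
except sqlite3.IntegrityError as exc:
    error_message = str(exc)
```

Both inserts happen within the same test, so the second one collides with the first.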

If you look carefully, we have injected db in both the product_one and product_two fixtures, and db is just another fixture. So that means fixtures can be injected into other fixtures.

One of pytest’s greatest strengths is its extremely flexible fixture system. It allows us to boil down complex requirements for tests into more simple and organized functions, where we only need to have each one describe the things they are dependent on.

You can use this feature to address the IntegrityError above. Create the category fixture and inject it into both the product fixtures.

import pytest
from product.models import Category, Product

@pytest.fixture
def category(db) -> Category:
    return Category.objects.create(name="Books")

@pytest.fixture
def product_one(db, category):
    return Product.objects.create(name="Book 1", category=category)

@pytest.fixture
def product_two(db, category):
    return Product.objects.create(name="Book 2", category=category)

def test_two_different_books_create(product_one, product_two):
    assert product_one.pk != product_two.pk

If we try to run the test now, it should run successfully.

$ pytest
============================= test session starts ==============================
platform linux -- Python 3.7.0, pytest-6.2.2, py-1.10.0, pluggy-0.13.1
django: settings: pytest_fixtures.settings (from ini)
rootdir: /home/suraj/PycharmProjects/e_commerce_app, configfile: pytest.ini
plugins: django-4.1.0
collected 1 item                                                               

product/tests/test_models.py .                                           [100%]

============================== 1 passed in 0.20s ===============================

By restructuring the fixtures this way, we have made code easier to maintain. By simply injecting fixtures, we can maintain a lot of complex model fixtures in a much simpler way.
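
Why does the shared category fixture avoid the duplicate insert? Within a single test, pytest builds each fixture once and hands the same value to everything that requests it. The toy resolver below illustrates that behaviour with plain dictionaries. The names fixture, resolve, and run_test are invented for this sketch, and it is far simpler than pytest's real fixture machinery.

```python
import inspect
from itertools import count

FIXTURES = {}    # name -> function; a stand-in for pytest's fixture registry
_ids = count(1)  # fake primary keys


def fixture(func):
    """Register func under its own name (toy version of @pytest.fixture)."""
    FIXTURES[func.__name__] = func
    return func


def resolve(name, cache):
    """Build the named fixture once per test, resolving its own arguments first."""
    if name not in cache:
        func = FIXTURES[name]
        args = [resolve(arg, cache) for arg in inspect.signature(func).parameters]
        cache[name] = func(*args)
    return cache[name]


def run_test(test_func):
    """Call test_func with every argument filled in from the registry."""
    cache = {}  # fresh cache per test, similar to the default function scope
    args = [resolve(arg, cache) for arg in inspect.signature(test_func).parameters]
    return test_func(*args)


@fixture
def category():
    return {"pk": next(_ids), "name": "Books"}


@fixture
def product_one(category):
    return {"pk": next(_ids), "name": "Book 1", "category": category}


@fixture
def product_two(category):
    return {"pk": next(_ids), "name": "Book 2", "category": category}


def check_shared_category(product_one, product_two):
    # Both products point at the very same category object.
    return product_one["category"] is product_two["category"]


shared = run_test(check_shared_category)
```

Here category runs once even though two fixtures depend on it, which mirrors why only one "Books" row is inserted in the restructured test.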

Let’s say product one should be sold by the retail shop “ABC”. This is easily achieved by injecting a retailer fixture into the product fixture.

import pytest
from product.models import Category, Product, Retail

@pytest.fixture
def category(db) -> Category:
    return Category.objects.create(name="Books")

@pytest.fixture
def retailer_abc(db):
    return Retail.objects.create(name="ABC")

@pytest.fixture
def product_one(db, category, retailer_abc):
    product = Product.objects.create(name="Book 1", category=category)
    product.retails.add(retailer_abc)
    return product

def test_product_retailer(db, retailer_abc, product_one):
    assert product_one.retails.filter(name=retailer_abc.name).exists()

Autouse Fixtures

Sometimes, you may want to have a fixture (or even several) that you know all your tests will depend on. “Autouse” fixtures are a convenient way of making all tests automatically request them. This can cut out a lot of redundant requests, and can even provide more advanced fixture usage.

We can make a fixture an autouse fixture by passing in autouse=True to the fixture’s decorator. Here’s a simple example of how they can be used:

import pytest
from product.models import Category, Product, Retail
...

@pytest.fixture
def retailer_abc(db):
    return Retail.objects.create(name="ABC")

@pytest.fixture
def retailers(db) -> list:
    return []

@pytest.fixture(autouse=True)
def append_retailers(retailers, retailer_abc):
    return retailers.append(retailer_abc)

@pytest.fixture
def product_one(db, category, retailers):
    product = Product.objects.create(name="Book 1", category=category)
    product.retails.set(retailers)
    return product

def test_product_retailer(db, retailer_abc, product_one):
    assert product_one.retails.filter(name=retailer_abc.name).exists()

In this example, the append_retailers fixture is an autouse fixture. Because it runs automatically, test_product_retailer is affected by it even though the test did not request it. That doesn’t mean autouse fixtures can’t be requested explicitly; it just isn’t necessary.

Factories as Fixtures

So far, we have created objects with a small number of arguments. In practice, however, models are more complex and may require more inputs. Let’s say we need to store the sku, mrp, and weight information along with the name and category.

If we decide to provide every input to the product fixture, then the logic inside the product fixtures will get a little complicated.

import random
import string
import pytest
from product.models import Category, Product, Retail

@pytest.fixture
def category(db) -> Category:
    return Category.objects.create(name="Books")

@pytest.fixture
def retailer_abc(db):
    return Retail.objects.create(name="ABC")

@pytest.fixture
def product_one(db, category, retailer_abc):
    sku = "".join(random.choices(string.ascii_uppercase + string.digits, k=6))
    product = Product.objects.create(
        sku=sku,
        name="Book 1",
        description="A book for educational purpose.",
        mrp="100.00",
        is_available=True,
        category=category,
    )
    product.retails.set([retailer_abc])
    return product

@pytest.fixture
def product_two(db, category, retailer_abc):
    sku = "".join(random.choices(string.ascii_uppercase + string.digits, k=6))
    product = Product.objects.create(
        sku=sku,
        name="Book 2",
        description="A book with thriller story.",
        mrp="50.00",
        is_available=True,
        category=category,
    )
    # add() takes model instances as positional arguments, not a list
    product.retails.add(retailer_abc)
    return product

Product creation involves somewhat complex logic for managing retailers and generating a unique SKU, and that logic will grow as we keep adding requirements. Extra logic may be needed if we consider discounts and coupon codes for every retailer. There may also be many variations of the product instance that we want to test against, and you have already seen how difficult such complex code is to maintain.
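
One detail worth noting: random.choices alone does not guarantee a unique SKU. With 36 possible characters and six positions there are about 2.2 billion combinations, so a clash is unlikely but possible. A minimal sketch of a helper that retries on a clash could look like this. The name generate_sku and the in-memory set of used SKUs are assumptions for illustration; a real project would check the database instead.

```python
import random
import string

SKU_ALPHABET = string.ascii_uppercase + string.digits


def generate_sku(existing: set, length: int = 6, max_attempts: int = 100) -> str:
    """Return a random SKU that is not in existing, and record it there."""
    for _ in range(max_attempts):
        sku = "".join(random.choices(SKU_ALPHABET, k=length))
        if sku not in existing:
            existing.add(sku)
            return sku
    raise RuntimeError("could not find an unused SKU")


# Hypothetical usage: track the SKUs handed out during one test run.
used_skus = set()
first = generate_sku(used_skus)
second = generate_sku(used_skus)
```

Logic like this is exactly the kind of thing you want in one place rather than copied into every product fixture.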

The “factory as fixture” pattern can help in cases where instances of the same class are needed in different variations for different tests. Instead of returning an instance directly, the fixture returns a function; calling that function gives you the instance you want to test.

import random
import string
import pytest

from product.models import Category, Product, Retail

@pytest.fixture
def category(db) -> Category:
    return Category.objects.create(name="Books")

@pytest.fixture
def retailer_abc(db):
    return Retail.objects.create(name="ABC")

@pytest.fixture
def product_factory(db, category, retailer_abc):
    def create_product(
        name, description="A Book", mrp=None, is_available=True, retailers=None
    ):
        if retailers is None:
            retailers = []
        sku = "".join(random.choices(string.ascii_uppercase + string.digits, k=6))
        product = Product.objects.create(
            sku=sku,
            name=name,
            description=description,
            mrp=mrp,
            is_available=is_available,
            category=category,
        )
        product.retails.add(retailer_abc)
        if retailers:
            product.retails.set(retailers)
        return product

    return create_product

@pytest.fixture
def product_one(product_factory):
    return product_factory(name="Book 1", mrp="100.2")

@pytest.fixture
def product_two(product_factory):
    return product_factory(name="Novel Book", mrp="51")

def test_product_retailer(db, retailer_abc, product_one):
    assert product_one.retails.filter(name=retailer_abc.name).exists()

def test_product_one(product_one):
    assert product_one.name == "Book 1"
    assert product_one.is_available

This is not far from what you’ve already done, so let’s break it down:

- The category and retailer_abc fixtures remain the same.
- A new product_factory fixture is added, and the category and retailer_abc fixtures are injected into it.
- The product_factory fixture defines an inner function called create_product and returns it.
- product_factory is injected into other fixtures, which call it to create product instances.

The factory fixture works similarly to how decorators work in Python: an outer function returns an inner function that does the actual work.
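
Stripped of pytest and Django, the pattern is just a closure. The sketch below uses a dataclass as a stand-in for the Product model so it runs without a database. FakeProduct and make_product_factory are hypothetical names, and the retailer handling mirrors the fixture above: the default retailer is used unless a list of retailers is passed in.

```python
from dataclasses import dataclass, field


@dataclass
class FakeProduct:
    """Plain stand-in for the Product model, so no database is needed."""
    name: str
    category: str
    description: str = "A Book"
    is_available: bool = True
    retailers: list = field(default_factory=list)


def make_product_factory(category: str, default_retailer: str):
    """Outer function: captures the shared dependencies, like the fixture does."""

    def create_product(name, description="A Book", is_available=True, retailers=None):
        # Inner function: fills in defaults and builds one instance per call.
        chosen = list(retailers) if retailers else [default_retailer]
        return FakeProduct(name, category, description, is_available, chosen)

    return create_product


product_factory = make_product_factory("Books", "ABC")
book = product_factory(name="Book 1")
novel = product_factory(name="Novel Book", retailers=["XYZ"])
```

Each call returns a new instance, while the category and default retailer are supplied once by the outer function.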

Sharing Fixtures Using Scopes

Fixtures that require network or database access depend on connectivity and are usually expensive to create. In the previous example, every time a test requests a fixture, pytest runs the fixture function, creates a new instance, and passes it to the test. So if we have written n tests and every test requests the same fixture, that fixture instance is created n times during the run.

This is mainly happening because fixtures are created when first requested by a test, and are destroyed based on their scope:

- function: the default scope; the fixture is destroyed at the end of the test.
- class: the fixture is destroyed during the teardown of the last test in the class.
- module: the fixture is destroyed during the teardown of the last test in the module.
- package: the fixture is destroyed during the teardown of the last test in the package.
- session: the fixture is destroyed at the end of the test session.
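
The practical difference between scopes is how often the fixture function runs. The toy counter below compares a value rebuilt for every test with one cached for a whole module. It is a plain-Python illustration with made-up function names, not how pytest implements scopes.

```python
# How many times each kind of "fixture" was actually built.
calls = {"function": 0, "module": 0}
_module_cache = {}


def function_scoped_category():
    """Rebuilt for every test, like the default scope."""
    calls["function"] += 1
    return {"name": "Books"}


def module_scoped_category():
    """Built on first request, then reused by later tests in the module."""
    if "category" not in _module_cache:
        calls["module"] += 1
        _module_cache["category"] = {"name": "Books"}
    return _module_cache["category"]


# Pretend the module contains three tests that each request the category.
for _ in range(3):
    function_scoped_category()
    module_scoped_category()
```

After the loop, the function-scoped version has been built three times and the module-scoped version once.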

In the previous example, we can add scope="module" so that the category, retailer_abc, product_one, and product_two instances are created only once per test module.

Multiple test functions in a test module will then each receive the same category, retailer_abc, product_one, and product_two fixture instances, which saves time.

@pytest.fixture(scope="module")
def category(db) -> Category:
    return Category.objects.create(name="Books")

@pytest.fixture(scope="module")
def retailer_abc(db):
    return Retail.objects.create(name="ABC")

@pytest.fixture(scope="module")
def product_one(product_factory):
    return product_factory(name="Book 1", mrp="100.2")

@pytest.fixture(scope="module")
def product_two(product_factory):
    return product_factory(name="Novel Book", mrp="51")

This is how we can add a scope to a fixture, and you can do the same for any fixture.

But if we try to run this in the terminal, we will encounter an error:

$ pytest
============================= test session starts ==============================
platform linux -- Python 3.7.0, pytest-6.2.2, py-1.10.0, pluggy-0.13.1
django: settings: pytest_fixtures.settings (from ini)
rootdir: /home/suraj/PycharmProjects/e_commerce_app, configfile: pytest.ini
plugins: django-4.1.0
collected 2 items                                                              

product/tests/test_models.py EE                                          [100%]

==================================== ERRORS ====================================
___________________ ERROR at setup of test_product_retailer ____________________
ScopeMismatch: You tried to access the 'function' scoped fixture 'db' with a 'module' scoped request object, involved factories
product/tests/test_models.py:13:  def retailer_abc(db) -> product.models.Category
venv/lib/python3.7/site-packages/pytest_django/fixtures.py:193:  def db(request, django_db_setup, django_db_blocker)
______________________ ERROR at setup of test_product_one ______________________
ScopeMismatch: You tried to access the 'function' scoped fixture 'db' with a 'module' scoped request object, involved factories
...
============================== 2 errors in 0.24s ===============================

The error occurs because the db fixture is function-scoped by design: the transaction rollback at the end of each test ensures the database is left in the same state it was in when the test started. A module-scoped fixture cannot request a function-scoped one, hence the ScopeMismatch. Nevertheless, you can get session- or module-scoped access to the database in a fixture by using the django_db_blocker fixture:

import random
import string
import pytest

from product.models import Category, Product, Retail

@pytest.fixture(scope="module")
def category(django_db_blocker):
    with django_db_blocker.unblock():
        return Category.objects.create(name="Books")

@pytest.fixture(scope="module")
def retailer_abc(django_db_blocker):
    with django_db_blocker.unblock():
        return Retail.objects.create(name="ABC")

@pytest.fixture(scope="module")
def product_factory(django_db_blocker, category, retailer_abc):
    def create_product(
        name, description="A Book", mrp=None, is_available=True, retailers=None
    ):
        if retailers is None:
            retailers = []
        sku = "".join(
            random.choices(string.ascii_uppercase + string.digits, k=6)
        )
        with django_db_blocker.unblock():
            product = Product.objects.create(
                sku=sku,
                name=name,
                description=description,
                mrp=mrp,
                is_available=is_available,
                category=category,
            )
            product.retails.add(retailer_abc)
            if retailers:
                product.retails.set(retailers)
            return product

    return create_product

@pytest.fixture(scope="module")
def product_one(product_factory):
    return product_factory(name="Book 1", mrp="100.2")

@pytest.fixture(scope="module")
def product_two(product_factory):
    return product_factory(name="Novel Book", mrp="51")

def test_product_retailer(db, retailer_abc, product_one):
    assert product_one.retails.filter(name=retailer_abc.name).exists()

def test_product_one(product_one):
    assert product_one.name == "Book 1"
    assert product_one.is_available

Now, if we go to the terminal and run the tests, they will pass:

$ pytest
============================= test session starts ==============================
platform linux -- Python 3.7.0, pytest-6.2.2, py-1.10.0, pluggy-0.13.1
django: settings: pytest_fixtures.settings (from ini)
rootdir: /home/suraj/PycharmProjects/e_commerce_app, configfile: pytest.ini
plugins: django-4.1.0
collected 2 items                                                              

product/tests/test_models.py ..                                          [100%]

============================== 2 passed in 0.22s ===============================

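Warning: Beware that when unblocking the database in session or module scope, you’re on your own if you alter the database in other fixtures or tests. Rows created this way sit outside the per-test transaction, so they are not rolled back when a test ends.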
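
To make the warning concrete, here is a minimal sketch of the difference between data written inside a rolled-back transaction and data written outside one. It uses the standard library's sqlite3 module instead of Django so that it runs on its own; the category table and the helper names (make_connection, count_rows, run_in_rolled_back_transaction) are hypothetical and only stand in for what the db fixture and a module-scoped fixture do.

```python
import sqlite3


def make_connection():
    # isolation_level=None puts sqlite3 in autocommit mode, so transactions
    # are controlled explicitly with BEGIN / ROLLBACK below.
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.execute("CREATE TABLE category (name TEXT)")
    return conn


def count_rows(conn):
    return conn.execute("SELECT COUNT(*) FROM category").fetchone()[0]


def run_in_rolled_back_transaction(conn, test_body):
    # Stand-in for the function-scoped db fixture: run the test body inside
    # a transaction and always roll it back afterwards.
    conn.execute("BEGIN")
    try:
        test_body(conn)
    finally:
        conn.execute("ROLLBACK")


conn = make_connection()

# Stand-in for a module-scoped fixture that unblocks the database:
# this write happens outside any test transaction and is committed at once.
conn.execute("INSERT INTO category VALUES ('Books')")


def fake_test(c):
    # Data written by the test itself lives only inside the transaction.
    c.execute("INSERT INTO category VALUES ('Temporary')")
    assert count_rows(c) == 2


run_in_rolled_back_transaction(conn, fake_test)

# Only the test's own row was undone; the "fixture" row is still there.
rows_after_test = count_rows(conn)
print(rows_after_test)
```

The row inserted by the stand-in fixture survives the rollback, which is why anything created through django_db_blocker in a wider scope stays visible to every later test unless you remove it yourself.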

Conclusion

We have covered the main features that pytest fixtures provide and seen how they make test code more reusable and maintainable. With fixtures, managing dependencies and arranging test data become easy.

This post showed how to use fixtures, and the features they offer, together with Django models. To learn more about fixtures, refer to the official documentation.

December 12, 2022
