Category: Data and AI

  • The Responsible Use of Artificial Intelligence – Shaping a Safer Tomorrow

    Introduction:

    Artificial intelligence (AI) stands at the forefront of technological innovation, promising transformative changes in our lives. With its continuous advancements, AI has become an integral part of our daily routines, from virtual assistants and personalized recommendations to healthcare diagnostics and autonomous vehicles. However, the rapid integration of AI into our society raises pertinent ethical questions, necessitating a closer examination of its responsible use. In this blog post, we delve into the responsible use of artificial intelligence, exploring the principles that guide its ethical deployment and emphasizing the collaborative efforts required to shape a safer future.

    Understanding Responsible AI:

    Responsible AI signifies more than just technological progress; it embodies the ethical development and deployment of AI systems, emphasizing principles such as fairness, transparency, accountability, and privacy. To ensure that AI benefits society as a whole, it is crucial to address the following key aspects:

    1. Ethical Considerations:
      Ethics serve as the cornerstone of AI development. Collaboration among engineers, data scientists, and policymakers is paramount in establishing ethical guidelines that prevent AI from being used for harmful purposes. It is imperative to avoid deploying AI in situations that could lead to discrimination, manipulation, or privacy erosion. Ethical considerations must permeate every stage of AI development, guiding decisions and actions.

    2. Transparency and Accountability:
      Understanding the functioning of AI algorithms is pivotal for their ethical deployment. Striving for transparency, developers should elucidate, in plain language, how AI-driven decisions are made. Accountability mechanisms must be in place to address errors, biases, and unintended consequences. Regular audits and assessments ensure that AI systems remain ethical and accountable, promoting trust among users.

    3. Bias Mitigation:
      The quality of AI algorithms hinges on the data they are trained on. Identifying and mitigating biases in datasets is imperative to create fair and equitable AI applications. Diverse and representative datasets are essential to reducing biases, ensuring that AI systems work impartially for everyone, irrespective of their background. Bias mitigation is an ongoing process, demanding continuous vigilance throughout AI development.

    4. Privacy Protection:
      Responsible AI use involves safeguarding user privacy. As AI applications require extensive data, concerns arise about how this data is collected, stored, and utilized. Regulations and standards, such as GDPR in Europe, play a pivotal role in protecting user privacy rights and ensuring responsible handling of personal data. Developers and companies must prioritize user privacy in all facets of AI development to foster user trust and confidence.

    5. Continuous Monitoring and Adaptation:
      AI’s landscape is in constant flux, necessitating continuous monitoring and adaptation to evolving ethical standards. Regular updates, feedback loops, and adaptive learning enable AI technologies to remain responsive to societal needs and concerns. Developers and companies must vigilantly monitor AI system performance, ready to make necessary changes to align outcomes with ethical standards.

    6. Public Awareness and Education:
      Raising public awareness about AI and its implications is crucial. Educating the public about the ethical use of AI, potential biases, and privacy concerns empowers individuals to make informed decisions. Workshops, seminars, and accessible online resources can bridge the knowledge gap, ensuring that society comprehends the responsible use of AI and actively participates in shaping its trajectory.

    7. Collaboration Across Sectors:
      Collaboration between governments, private sectors, and non-profit organizations is vital. By working together, diverse perspectives can be integrated, leading to comprehensive policies and guidelines. Initiatives like joint research projects, cross-industry collaborations, and international summits facilitate the exchange of ideas and foster a unified approach to responsible AI deployment.

    Potential Risks of Irresponsible AI Use:

    Irresponsible AI use can have detrimental consequences, impacting individuals and society profoundly. Here are the key risks associated with irresponsible AI deployment:

    1. Bias and Discrimination:
      AI systems can perpetuate existing biases, leading to discriminatory outcomes, particularly in areas such as hiring, lending, and law enforcement.

      – For example, a study revealed that AI algorithms used in criminal justice systems exhibited racial bias, leading to disproportionately harsher sentences for people of color. This instance underscores the critical need for rigorous bias detection and mitigation strategies in AI development.

    2. Privacy Violations:
      Improper AI use can result in unauthorized access to personal data, compromising individuals’ privacy and security. This breach can lead to identity theft, financial fraud, and other cybercrimes, highlighting the urgency of robust data protection measures.

      – For instance, the Cambridge Analytica scandal demonstrated how AI-driven data analysis could lead to unauthorized access to millions of users’ personal information on social media platforms, emphasizing the need for stringent data privacy regulations and ethical data management practices.

    3. Job Displacement:
      AI-driven automation could lead to job displacement, posing economic challenges and social unrest. Industries reliant on routine tasks susceptible to automation are particularly vulnerable, necessitating proactive measures to address potential workforce transitions.

      – An example can be found in the manufacturing sector, where AI-driven robotics have significantly reduced the need for human workers in certain tasks, leading to workforce challenges and economic disparities. Initiatives focusing on retraining and upskilling programs can help mitigate these challenges.

    4. Security Threats:
      AI systems are vulnerable to attacks and manipulation. Malicious actors could exploit these vulnerabilities, causing widespread disruption, manipulating financial markets, or spreading misinformation. Vigilance and robust security measures are paramount to mitigate these threats effectively.

      – For instance, the rise of deepfake technology, enabled by AI, poses significant threats to political, social, and economic stability. Misinformation and manipulated media can influence public opinion and decision-making processes, emphasizing the need for advanced AI-driven detection tools and awareness campaigns to combat this issue.

    5. Loss of Human Control:
      Inadequate regulation could lead to AI systems, especially in military applications and autonomous vehicles, operating beyond human control. This lack of oversight might result in unintended consequences and ethical dilemmas, underscoring the need for stringent regulation and ethical guidelines.

      – A notable example is the debate surrounding autonomous vehicles, where AI-driven decision-making processes raise ethical questions, such as how these vehicles prioritize different lives in emergency situations. Robust ethical frameworks and regulatory guidelines are essential to navigate these complex scenarios responsibly.

    Best AI Practices:

    To ensure responsible AI development and usage, adhering to best practices is imperative:

    1. Research and Development Ethics:
      Organizations should establish research ethics boards to oversee AI projects, ensuring strict adherence to ethical guidelines during the development phase.
    2. Collaboration and Knowledge Sharing:
      Encourage collaboration between industry, academia, and policymakers to facilitate knowledge sharing and establish common ethical standards for AI development and deployment. Collaboration fosters a holistic approach, incorporating diverse perspectives and expertise.
    3. User Education:
      Educating users about AI capabilities and limitations is essential. Raising awareness about the collected data and how it will be used promotes informed decision-making and user empowerment.

      – For instance, companies can provide interactive online resources and workshops explaining how AI algorithms work, demystifying complex concepts for the general public. Transparent communication about data collection and usage practices can enhance user trust and confidence.
    4. Responsible Data Management:
      Implement robust data management practices to ensure data privacy, security, and compliance with regulations. Regularly update privacy policies to reflect the evolving nature of AI technology, demonstrating a commitment to user privacy and data protection.

      – Companies can employ advanced encryption techniques to protect user data and regularly undergo independent audits to assess their data security measures. Transparent reporting of these security practices can build trust with users and regulatory bodies.
    5. Ethical AI Training:
      AI developers and engineers should receive training in ethics, emphasizing the significance of fairness, accountability, and transparency in AI algorithms. Ethical AI training fosters a culture of responsibility, guiding developers to create systems that benefit society.

      Educational institutions and industry organizations can collaborate to offer specialized courses on AI ethics, ensuring that future developers are well-equipped to navigate the ethical challenges of AI technology. Industry-wide certification programs can validate developers’ expertise in ethical AI practices, setting industry standards.

    Examples of Responsible AI Implementation:

    Google’s Ethical Guidelines:

    Google has been a trailblazer in implementing responsible AI practices, demonstrating a commitment to ethical use. The establishment of the Advanced Technology External Advisory Council (ATEAC) underscores Google’s dedication to addressing ethical challenges related to AI. Additionally, their release of guidelines for the ethical use of AI emphasizes fairness, accountability, and transparency. Google’s AI Principles outline their pledge to avoid bias, ensure safety, and provide broad benefits to humanity, setting a commendable precedent for the industry.

    An illustrative example of Google’s commitment to transparency can be seen in their Explainable AI (XAI) research, where efforts are made to make AI algorithms interpretable for users. By providing users with insights into how AI systems make decisions, Google enhances transparency and user understanding, contributing to responsible AI usage.

    Microsoft’s AETHER Committee:

    Microsoft has taken significant strides to ensure the responsible use of AI. The creation of the AI and Ethics in Engineering and Research (AETHER) Committee exemplifies Microsoft’s proactive approach to addressing ethical considerations. Furthermore, their active involvement in AI policy advocacy emphasizes the need for regulation to prevent the misuse of facial recognition technology. Microsoft’s initiatives promote transparency, accountability, and privacy protection, exemplifying a commitment to responsible AI implementation.

    A noteworthy initiative by Microsoft is its collaboration with academic institutions to research bias detection techniques in AI algorithms. By actively engaging in research, Microsoft contributes valuable knowledge to the industry, addressing the challenge of bias mitigation in AI systems and promoting responsible AI development.

    Conclusion:

    Artificial Intelligence holds immense potential to revolutionize our world, but this power must be wielded responsibly. By adhering to ethical guidelines, promoting transparency, and prioritizing privacy and fairness, we can harness the benefits of AI while mitigating its risks. Understanding the responsible use of AI empowers us as users and consumers to demand ethical practices from the companies and developers creating these technologies. Let us collectively work towards ensuring that AI becomes a force for good, shaping a safer and more equitable future for all.

  • Confluent Kafka vs. Amazon Managed Streaming for Apache Kafka (AWS MSK) vs. on-premise Kafka

    Overview:

    In the evolving landscape of distributed streaming platforms, Kafka has become the foundation of real-time data processing. With multiple deployment options available, choosing between Confluent Kafka, AWS MSK, and on-premise Kafka becomes an important decision. This blog provides an overview and comparison of these three Kafka deployment methods to help readers decide based on their needs.

    Here’s a list of key bullet points to consider when comparing Confluent Kafka, AWS MSK, and on-premise Kafka:

    1. Deployment and Management
    2. Communication
    3. Scalability
    4. Performance
    5. Cost Considerations
    6. Security

    So, let’s go through all the parameters one by one.

    Deployment and Management: 

    Deployment and management refer to the processes involved in setting up, configuring, and overseeing the operation of a Kafka infrastructure. Each deployment option—Confluent Kafka, AWS MSK, and On-Premise Kafka—has its own approach to these aspects.

    1. Deployment:

    -> Confluent Kafka

    • Confluent Kafka deployment involves setting up the Confluent Platform, which extends the open-source Kafka with additional tools.
    • Users must install and configure Confluent components, such as Confluent Server, Connect, and Control Center.
    • Deployment may vary based on whether it’s on-premise or in the cloud using Confluent Cloud.

    -> AWS MSK

    • AWS MSK streamlines deployment as it is a fully managed service.
    • Users create an MSK cluster through the AWS Management Console or API, specifying configurations like instance types, storage, and networking (a brief boto3 sketch follows this list).
    • AWS handles the underlying infrastructure, automating tasks like scaling, patching, and maintenance.
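
    For illustration, here is a minimal sketch of creating an MSK cluster programmatically with boto3, the AWS SDK for Python; the region, Kafka version, subnet and security group IDs, and broker settings below are placeholder assumptions, not values from this post.

    import boto3
    
    # Hypothetical values: replace the subnets, security group, and instance type
    # with resources that exist in your own AWS account.
    msk = boto3.client("kafka", region_name="us-east-1")
    
    response = msk.create_cluster(
        ClusterName="demo-msk-cluster",
        KafkaVersion="3.5.1",
        NumberOfBrokerNodes=3,
        BrokerNodeGroupInfo={
            "InstanceType": "kafka.m5.large",
            "ClientSubnets": ["subnet-aaa", "subnet-bbb", "subnet-ccc"],
            "SecurityGroups": ["sg-0123456789abcdef0"],
            "StorageInfo": {"EbsStorageInfo": {"VolumeSize": 100}},
        },
    )
    print(response["ClusterArn"])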

    -> On-Premise Kafka:

    • On-premise deployment requires manual setup and configuration of Kafka brokers, Zookeeper, and other components.
    • Organizations must provision and maintain the necessary hardware and networking infrastructure.
    • Deployment may involve high availability, fault tolerance, and scalability considerations based on organizational needs.

    2. Management:

    -> Confluent Kafka:

    • Confluent provides additional management tools beyond what’s available in open-source Kafka, including Confluent Control Center for monitoring and Confluent Hub for extensions.
    • Users have control over configurations and can leverage Confluent’s tools for streamlining tasks like data integration and schema management.

    -> AWS MSK:

    • AWS MSK offers a managed environment where AWS takes care of routine management tasks.
    • AWS Console and APIs provide tools for monitoring, scaling, and configuring MSK clusters.
    • AWS handles maintenance tasks such as applying patches and updates to the Kafka software.

    -> On-Premise Kafka:

    • On-premise management involves manual oversight of the entire Kafka infrastructure.
    • Organizations have full control but must handle tasks like software updates, monitoring configurations, and addressing any issues that arise.
    • Management may require coordination between IT and operations teams to ensure smooth operation.

    Communication

    There are three main components of communication protocols, network configurations, and inter-cluster communication, so let’s look at them one by one.

    1. Protocols

    -> Confluent Kafka:

    • Utilizes standard Kafka protocols such as TCP for communication between Kafka brokers and clients.
    • Confluent components may communicate using REST APIs and other protocols specific to Confluent Platform extensions.

    -> AWS MSK:

    • Relies on standard Kafka protocols for communication between clients and brokers.
    • Also employs protocols like TLS/SSL for secure communication.

    -> On-Premise Kafka:

    • Standard Kafka protocols are used for communication between components, including TCP for broker-client communication.
    • Specific protocols may vary based on the organization’s network configuration.

    2. Network Configuration

    -> Confluent Kafka:

    • Requires network configuration for Kafka brokers and other Confluent components.
    • Configuration may include specifying listener ports, security settings, and inter-component communication settings.

    -> AWS MSK:

    • Network configuration involves setting up Virtual Private Cloud (VPC) settings, security groups, and subnet configurations.
    • AWS MSK integrates with AWS networking services, allowing organizations to define network parameters.

    -> On-Premise Kafka:

    • Network configuration includes defining IP addresses, ports, and firewall rules for Kafka brokers.
    • Organizations have full control over the on-premise network infrastructure, allowing for custom configurations.

    3. Inter-Cluster Communication:

    -> Confluent Kafka:

    • Confluent components within the same cluster communicate using Kafka protocols.
    • Communication between Confluent Kafka clusters or with external systems may involve additional configurations.

    -> AWS MSK:

    • AWS MSK clusters can communicate with each other within the same VPC or across different VPCs using standard Kafka protocols.
    • AWS networking services facilitate secure and efficient inter-cluster communication.

    -> On-Premise Kafka:

    • Inter-cluster communication on-premise involves configuring network settings to enable communication between Kafka clusters.
    • Organizations have full control over the network architecture for inter-cluster communication.

    Scalability

    Scalability is crucial for ensuring that Kafka deployments can handle varying workloads and grow as demands increase. Whether through vertical or horizontal scaling, the goal is to maintain performance, reliability, and efficiency in the face of changing requirements. Each deployment option—Confluent Kafka, AWS MSK, and on-premise Kafka—provides mechanisms for achieving scalability, with differences in how scaling is implemented and managed.

    -> Confluent Kafka:

    • Horizontal scaling is a fundamental feature of Kafka, allowing for adding more broker nodes to a Kafka cluster to handle increased message throughput.
    • Kafka’s partitioning mechanism allows data to be distributed across multiple broker nodes, enabling parallel processing and improved scalability (see the topic-creation sketch after this list).
    • Scaling can be achieved dynamically by adding or removing broker nodes, providing flexibility in adapting to varying workloads.
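
    To make the partitioning point concrete (as referenced above), here is a small sketch using the confluent-kafka Python client’s AdminClient to create a topic whose partitions can be spread across brokers; the broker address, topic name, and counts are assumptions for the example.

    from confluent_kafka.admin import AdminClient, NewTopic
    
    # Hypothetical broker address; point this at your own cluster.
    admin = AdminClient({"bootstrap.servers": "localhost:9092"})
    
    # Six partitions allow parallel consumption; the replication factor
    # should not exceed the number of available brokers.
    topic = NewTopic("page-views", num_partitions=6, replication_factor=3)
    
    for name, future in admin.create_topics([topic]).items():
        try:
            future.result()  # raises if topic creation failed
            print(f"Created topic {name}")
        except Exception as e:
            print(f"Failed to create topic {name}: {e}")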

    -> AWS MSK:

    • AWS MSK leverages horizontal scaling by allowing users to adjust the number of broker nodes in a cluster based on demand (a brief boto3 sketch follows this list).
    • The managed service handles the underlying infrastructure and scaling tasks automatically, simplifying the process for users.
    • Scaling with AWS MSK is designed to be seamless, with the service managing the distribution of partitions across broker nodes.
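
    As referenced above, a scaling operation of this kind might look roughly like the following boto3 sketch; the cluster ARN and target broker count are placeholders.

    import boto3
    
    msk = boto3.client("kafka", region_name="us-east-1")
    
    # Hypothetical cluster ARN; MSK requires the cluster's current version
    # string for optimistic concurrency control.
    cluster_arn = "arn:aws:kafka:us-east-1:123456789012:cluster/demo-msk-cluster/abcd"
    current = msk.describe_cluster(ClusterArn=cluster_arn)["ClusterInfo"]["CurrentVersion"]
    
    # Scale the cluster out to six broker nodes.
    msk.update_broker_count(
        ClusterArn=cluster_arn,
        CurrentVersion=current,
        TargetNumberOfBrokerNodes=6,
    )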

    -> On-Premise Kafka:

    • Scaling Kafka on-premise involves manually adding or removing broker nodes based on the organization’s infrastructure and capacity planning.
    • Organizations need to consider factors such as hardware limitations, network configurations, and load balancing when scaling on-premise Kafka.

    Performance

    Performance in Kafka, whether it’s Confluent Kafka, AWS MSK (managed streaming for Kafka), or on-premise Kafka, is a critical aspect that directly impacts the throughput, latency, and efficiency of the system. Here’s a brief overview of performance considerations for each:

    -> Confluent Kafka

    • Enhancements and Extensions: Confluent Kafka builds upon the open-source Kafka with additional tools and features, potentially providing optimizations and performance enhancements. This may include tools for monitoring, data integration, and schema management.
    • Customization: Organizations using Confluent Kafka have the flexibility to customize configurations and performance settings based on their specific use cases and requirements.
    • Scalability: Confluent Kafka inherits Kafka’s fundamental scalability features, allowing for horizontal scaling by adding more broker nodes to a cluster.

    -> AWS MSK

    • Managed Environment: AWS MSK is a fully managed service, meaning AWS takes care of many operational aspects, including patches, updates, and scaling. This managed environment can contribute to a more streamlined and optimized performance.
    • Scalability: AWS MSK allows for horizontal scaling by adjusting the number of broker nodes in a cluster dynamically. This elasticity contributes to the efficient handling of varying workloads.
    • Integration with AWS Services: Integration with other AWS services can enhance performance in specific scenarios, such as leveraging high-performance storage solutions or integrating with AWS networking services.

    -> On-Premise Kafka

    • Control and Customization: On-premise Kafka deployments provide organizations with complete control over the infrastructure, allowing for fine-tuning and customization to meet specific performance requirements.
    • Scalability Challenges: Scaling on-premise Kafka may present challenges related to manual provisioning of hardware, potential limitations in scalability due to physical constraints, and increased complexity in managing a distributed system.
    • Hardware Considerations: Performance is closely tied to the quality of hardware chosen for on-premise deployment. Organizations must invest in suitable hardware to meet performance expectations.

    Cost Considerations

    Cost considerations for Kafka deployments involve evaluating the direct and indirect expenses associated with setting up, managing, and maintaining a Kafka infrastructure. Kafka cost considerations span licensing, infrastructure, scalability, operations, data transfer, and ongoing support. The choice between Confluent Kafka, AWS MSK, or on-premise Kafka should be made by evaluating these costs in alignment with the organization’s budget, requirements, and preferences.

    Here’s a brief overview of key cost considerations:

    1. Deployment Costs

    -> Confluent Kafka:

    • Involves licensing costs for Confluent Platform, which provides additional tools and features beyond open-source Kafka.
    • Infrastructure costs for hosting Confluent Kafka, whether on-premise or in the cloud.

    -> AWS MSK:

    • Pay-as-you-go pricing model for AWS MSK, where users pay for the resources consumed by the Kafka cluster.
    • Costs include AWS MSK service charges, as well as associated AWS resource costs such as EC2 instances, storage, and data transfer.

    -> On-Premise Kafka:

    • Upfront costs for hardware, networking equipment, and software licenses.
    • Ongoing operational costs, including maintenance, power, cooling, and any required hardware upgrades.

    2. Scalability Costs

    -> Confluent Kafka:

    • Scaling Confluent Kafka may involve additional licensing costs if more resources are needed.
    • Infrastructure costs increase with the addition of more nodes or resources.

    -> AWS MSK:

    • Scaling AWS MSK is dynamic, and users are billed based on the resources consumed during scaling events.
    • Costs may increase with additional broker nodes and associated AWS resources.

    -> On-Premise Kafka:

    • Scaling on-premise Kafka requires manual provisioning of additional hardware, incurring upfront and ongoing costs.
    • Organizations must consider the total cost of ownership (TCO) when planning for scalability.

    3. Operational Costs

    -> Confluent Kafka:

    • Operational costs include manpower for managing and monitoring Confluent Kafka.
    • Costs associated with maintaining Confluent extensions and tools.

    -> AWS MSK:

    • AWS MSK is a managed service, reducing the operational burden on the user.
    • Operational costs may include AWS support charges and personnel for configuring and monitoring the Kafka environment.

    -> On-Premise Kafka:

    • Organizations bear full operational responsibility, incurring staffing, maintenance, monitoring tools, and ongoing support costs.

    4. Data Transfer Costs:

    -> Confluent Kafka:

    • Costs associated with data transfer depend on the chosen deployment model (e.g., cloud-based deployments may have data transfer costs).

    -> AWS MSK:

    • Data transfer costs may apply for communication between AWS MSK clusters and other AWS services.

    -> On-Premise Kafka:

    • Data transfer costs may be associated with network usage, especially if there are multiple geographically distributed Kafka clusters.


    Security

    Security involves implementing measures to protect data, ensure confidentiality, integrity, and availability, and mitigate risks associated with unauthorized access or data breaches. Here’s a brief overview of key security considerations for Kafka deployments:

    1. Authentication:

    -> Confluent Kafka:

    • Supports various authentication mechanisms, including SSL/TLS for encrypted communication and SASL (Simple Authentication and Security Layer) for user authentication (a client configuration sketch follows this list).
    • Role-based access control (RBAC) allows administrators to define and manage user roles and permissions.
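
    As referenced above, a client authenticating to a SASL/SSL-secured Confluent cluster might be configured roughly as in the sketch below, which uses the confluent-kafka Python client; the endpoint, topic, and credentials are placeholders.

    from confluent_kafka import Producer
    
    # Hypothetical endpoint and credentials, e.g. for a Confluent Cloud cluster.
    producer = Producer({
        "bootstrap.servers": "pkc-xxxxx.us-east-1.aws.confluent.cloud:9092",
        "security.protocol": "SASL_SSL",   # encrypt traffic with TLS
        "sasl.mechanisms": "PLAIN",        # authenticate with an API key/secret
        "sasl.username": "YOUR_API_KEY",
        "sasl.password": "YOUR_API_SECRET",
    })
    
    producer.produce("page-views", key="user-42", value="clicked-home")
    producer.flush()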

    -> AWS MSK:

    • Integrates with AWS Identity and Access Management (IAM) for user authentication and authorization.
    • IAM policies control access to AWS MSK resources and actions.

    -> On-Premise Kafka:

    • Authentication mechanisms depend on the chosen Kafka distribution, but SSL/TLS and SASL are common.
    • LDAP or other external authentication systems may be integrated for user management.

    2. Authorization:

    -> Confluent Kafka:

    • RBAC allows fine-grained control over what users and clients can do within the Kafka environment.
    • Access Control Lists (ACLs) can be used to restrict access at the topic level.

    -> AWS MSK:

    • IAM policies define permissions for accessing and managing AWS MSK resources.
    • Fine-grained access control can be enforced using IAM roles and policies.

    -> On-Premise Kafka:

    • ACLs are typically used to specify which users or applications have access to specific Kafka topics or resources.
    • Access controls depend on the Kafka distribution and configuration.

    3. Encryption:

    -> Confluent Kafka:

    • Supports encryption in transit using SSL/TLS for secure communication between clients and brokers.
    • Optionally supports encryption at rest for data stored on disks.

    -> AWS MSK:

    • Encrypts data in transit using SSL/TLS for communication between clients and brokers.
    • Provides options for encrypting data at rest using AWS Key Management Service (KMS).

    -> On-Premise Kafka:

    • Encryption configurations depend on the Kafka distribution and may involve configuring SSL/TLS for secure communication and encryption at rest using platform-specific tools.


    Conclusion

    • The best approach is to evaluate the level of Kafka expertise in-house and the consistency of your workload.
    • On one side of the spectrum, with less expertise and uncertain workloads, Confluent’s breadth of features and support comes out ahead.
    • On the other side, with experienced Kafka engineers and a very predictable workload, you can save money with MSK.
    • For specific requirements and full control, you can go with on-premise Kafka.

  • Unlocking the Potential of Knowledge Graphs: Exploring Graph Databases

    There is a growing demand for data-driven insights to help businesses make better decisions and stay competitive. To meet this need, organizations are turning to knowledge graphs as a way to access and analyze complex data sets. In this blog post, I will discuss what knowledge graphs are, what graph databases are, how they differ from hierarchical databases, the benefits of graphical representation of data, and more. Lastly, we’ll discuss some of the challenges of graph databases and how they can be overcome.

    What Is a Knowledge Graph?

    A knowledge graph is a visual representation of data or knowledge. In order to make the relationships between various types of facts and data easy to see and understand, facts and data are organized into a graph structure. A knowledge graph typically consists of nodes, which stand in for entities like people or objects, and edges, which stand in for the relationships among these entities.

    Each node in a knowledge graph has characteristics and attributes that describe it. For instance, the node of a person might contain properties like name, age, and occupation. Edges between nodes reveal information about their connections. This makes knowledge graphs a powerful tool for representing and understanding data.
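
    To make this concrete, here is a minimal sketch, using the rdflib library that also appears later in this post, of a person node with a few attributes and one relationship edge; the namespace and values are invented for illustration.

    from rdflib import Graph, Literal, Namespace, RDF
    
    EX = Namespace("http://example.com/people#")  # hypothetical namespace
    
    g = Graph()
    alice = EX.Alice
    
    # Node attributes: name, age, and occupation
    g.add((alice, RDF.type, EX.Person))
    g.add((alice, EX.name, Literal("Alice")))
    g.add((alice, EX.age, Literal(34)))
    g.add((alice, EX.occupation, Literal("Data Engineer")))
    
    # An edge describing a relationship to another entity
    g.add((alice, EX.worksFor, EX.AcmeCorp))
    
    print(g.serialize(format="turtle"))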

    Benefits of a Knowledge Graph

    There are a number of benefits to using knowledge graphs. 

    • Knowledge graphs (KG) provide a visual representation of data that can be easily understood. This makes it easier to quickly identify patterns and correlations.
    • Additionally, knowledge graphs make it simple to locate linked data, since we can quickly access a particular node and obtain all of its related information.
    • These graphs are highly scalable, meaning they can support huge volumes of data. This makes them ideal for applications such as artificial intelligence (AI) and machine learning (ML).
    • Finally, knowledge graphs can be used to connect various types of data, including text, images, and videos, in addition to plain text. This makes them a great tool for data mining and analysis.

    What Are Graph Databases?

    Graph databases are used to store and manage data in the form of a graph. Unlike traditional databases, they offer a more flexible representation of data using nodes, edges, and properties. Graph databases are designed to support queries that require traversing relationships between different types of data.

    Graph databases are well-suited for applications that require complex data relationships, such as AI and ML. They are also more efficient than traditional databases in queries that involve intricate data relationships, as they can quickly process data without having to make multiple queries.


    Comparing Graph Databases to Hierarchical Databases

    It is important to understand the differences between graph databases and hierarchical databases. But first, what is a hierarchical database? Hierarchical databases are structured in a tree-like form, with each record in the database linked to one or more other records. This structure makes hierarchical databases ideal for storing data that is organized in a hierarchical manner, such as an organizational chart. However, hierarchical databases are less efficient at handling complex data relationships. To understand with an example, suppose we have an organization with a CEO at the top, followed by several vice presidents, who are in turn responsible for several managers, who are responsible for teams of employees.

     

    In a hierarchical database, this structure would be represented as a tree, with the CEO at the root, and each level of the organization represented by a different level of the tree. For example:
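
    A plain-text sketch of that tree, using the same entities as the graph edges shown below, might look like:

    CEO
    ├── Vice President A
    │   └── Manager A1
    │       └── Employee A1.1
    └── Vice President B
        └── Manager B1
            ├── Employee B1.1
            └── Employee B1.2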

     

    In a graph database, this same structure would be represented as a graph, with each node representing an entity (e.g., a person), and each edge representing a relationship (e.g., reporting to). For example:

    (Vice President A) — reports_to –> (CEO)

    (Vice President B) — reports_to –> (CEO)

    (Vice President A) — manages –> (Manager A1)

    (Vice President B) — manages –> (Manager B1)

    (Manager A1) — manages –> (Employee A1.1)

    (Manager B1) — manages –> (Employee B1.1)

    (Manager B1) — manages –> (Employee B1.2)

     

    As you can see, in a graph database, the relationships between entities are explicit and can be easily queried and traversed. In a hierarchical database, the relationships are implicit and can be more difficult to work with as the hierarchy becomes more complex. This flexibility in storing and querying relationships is why graph databases are better suited for complex data.

    Creating a Knowledge Graph from Scratch

    We will now walk through creating a knowledge graph from a simple XML file that contains information about some movies. We’ll use an XSLT stylesheet to transform the XML data into RDF format, along with some Python libraries to help us in the overall process.

    Let’s consider an XML file having movie information:

    <movies>
      <movie id="tt0083658">
        <title>Blade Runner</title>
        <year>1982</year>
        <director rid="12341">Ridley Scott</director>
        <genre>Action</genre>
      </movie>
    
      <movie id="tt0087469">
        <title>Top Gun</title>
        <year>1986</year>
        <director rid="65217">Tony Hank</director>
        <genre>Thriller</genre>
      </movie>
    </movies>

    As discussed, to convert this data into a knowledge graph, we will be using an XSL file. A question may arise: what is an XSL file? XSL files are stylesheet documents that are used to transform XML data. There is plenty of reference material on XSL, but don’t worry, as we will be starting from scratch.

    Moving ahead, we also need to know that converting any data into graph data requires an ontology; many are available, such as ontologies expressed in OWL or the EBUCore ontology. But what is an ontology? In the context of knowledge graphs, an ontology is a formal specification of the concepts, relationships, and constraints that exist within a specific domain. It provides a vocabulary and a set of rules for representing and sharing knowledge, allowing machines to reason about the data they are working with. EBUCore is an ontology developed by the European Broadcasting Union (EBU) to provide a standardized metadata model for the broadcasting industry (OTT platforms, media companies, etc.), and further references on EBUCore are available from the EBU.

    We will be using the below XSL for transforming the above XML with movie info.

    <?xml version="1.0"?>
    <xsl:stylesheet version="1.0"
                    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
                    xmlns:ebucore="http://www.ebu.ch/metadata/ontologies/ebucore/ebucore#"
                    >
        <xsl:template match="movies">
            <rdf:RDF>
                <xsl:apply-templates select="movie"/>
            </rdf:RDF>
        </xsl:template>
    
        <xsl:template match="movie">
            <ebucore:Feature>
                <!-- Each child element is handled by its own top-level template -->
                <xsl:apply-templates select="title|year|director|genre"/>
            </ebucore:Feature>
        </xsl:template>
    
        <xsl:template match="title">
            <ebucore:title>
                <xsl:value-of select="."/>
            </ebucore:title>
        </xsl:template>
    
        <xsl:template match="year">
            <ebucore:dateBroadcast>
                <xsl:value-of select="."/>
            </ebucore:dateBroadcast>
        </xsl:template>
    
        <xsl:template match="director">
            <ebucore:hasParticipatingAgent>
                <ebucore:Agent>
                    <ebucore:hasRole>Director</ebucore:hasRole>
                    <ebucore:agentName>
                        <xsl:value-of select="."/>
                    </ebucore:agentName>
                </ebucore:Agent>
            </ebucore:hasParticipatingAgent>
        </xsl:template>
    
        <xsl:template match="genre">
            <ebucore:hasGenre>
                <xsl:value-of select="."/>
            </ebucore:hasGenre>
        </xsl:template>
    </xsl:stylesheet>

    To start with the XSL, the first line, <?xml version="1.0"?>, declares the XML version of the document. The second line opens the stylesheet, declaring the XSL version we will be using and the XSL, RDF, and EBUCore namespaces. These namespaces are required because we will be using elements from those vocabularies and want to avoid name conflicts in our document. The xsl:template match attribute defines which element in the XML to match, and since we want to start from the root of the XML and movies is the root element, we begin with xsl:template match="movies".

    After that, we open an rdf:RDF tag to start our knowledge graph. This element will contain all the movie details, so we use xsl:apply-templates on "movie", because our XML has multiple <movie> elements nested inside the <movies> tag. The template matching "movie" wraps each movie’s details in an <ebucore:Feature> element; Feature is the EBUCore term for a movie. It then applies separate templates to the title, year, director, and genre children, mapping each to its EBUCore counterpart: ebucore:title, ebucore:dateBroadcast, ebucore:hasParticipatingAgent (with an ebucore:Agent whose role is Director), and ebucore:hasGenre, respectively.

    Now that we have the XSL ready, we will need to apply this XSL on our XML and get RDF data out of it by following the below Python code:

    import lxml.etree as ET
    import xml.dom.minidom as xm
    
    movie_data = """
                <movies>
                    <movie id="tt0083658">
                        <title>Blade Runner</title>
                        <year>1982</year>
                        <director rid="12341">Ridley Scott</director>
                        <genre>Action</genre>
                    </movie>
                        <movie id="tt0087469">
                        <title>Top Gun</title>
                        <year>1986</year>
                        <director rid="65217">Tony Hank</director>
                        <genre>Thriller</genre>
                    </movie>
                </movies>
                """
    
    xslt_file = "transform_movies.xsl"
    xslt_root = ET.parse(xslt_file)
    transform = ET.XSLT(xslt_root)
    rss_root = ET.fromstring(movie_data)
    result_tree = transform(rss_root)
    result_string = ET.tostring(result_tree)
    
    # Converting bytes to string and pretty formatting
    dom_transformed = xm.parseString(result_string)
    pretty_xml_as_string = dom_transformed.toprettyxml()
    
    # Saving the output
    with open('Path_to_downloads/output_movie.xml', "w") as f:
        f.write(pretty_xml_as_string)
    # Print Output
    print(pretty_xml_as_string)

    The above code will generate the following output:

    <?xml version="1.0" ?>
    <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:ebucore="http://www.ebu.ch/metadata/ontologies/ebucore/ebucore#">
    
    <ebucore:Feature>
       <ebucore:title>Blade Runner</ebucore:title>
       <ebucore:dateBroadcast>1982</ebucore:dateBroadcast>
       <ebucore:hasParticipatingAgent>
           <ebucore:Agent>
               <ebucore:hasRole>Director</ebucore:hasRole>
               <ebucore:agentName>Ridley Scott</ebucore:agentName>
           </ebucore:Agent>
       </ebucore:hasParticipatingAgent>
       <ebucore:hasGenre>Action</ebucore:hasGenre>
    </ebucore:Feature>
    
    <ebucore:Feature>
       <ebucore:title>Top Gun</ebucore:title>
       <ebucore:dateBroadcast>1986</ebucore:dateBroadcast>
       <ebucore:hasParticipatingAgent>
           <ebucore:Agent>
               <ebucore:hasRole>Director</ebucore:hasRole>
               <ebucore:agentName>Tony Hank</ebucore:agentName>
           </ebucore:Agent>
       </ebucore:hasParticipatingAgent>
       <ebucore:hasGenre>Thriller</ebucore:hasGenre>
    </ebucore:Feature>
    
    </rdf:RDF>

    This output is RDF/XML, which we will now load into a graph and visualize using the following code:

    Note: Install the following library before proceeding.

    pip install graphviz rdflib

    from rdflib import Graph, Namespace
    from rdflib.tools.rdf2dot import rdf2dot
    from graphviz import render
    
    # Creating an empty graph object and parsing the RDF/XML string from above
    graph = Graph()
    graph.parse(data=result_string, format="xml")
    # Saving the graph as Turtle (Terse RDF Triple Language) in movies.ttl
    graph.serialize(destination="Downloads/movies.ttl", format="ttl")
    
    # Steps to Visualize the generated graph
    # Define a namespace for the RDF data
    ns = Namespace("http://example.com/movies#")
    graph.bind("ex", ns)
    
    # Serialize the RDF data to a DOT file
    dot_file = open("Downloads/movies.dot", "w")
    rdf2dot(graph, dot_file, opts={"label": "Movies Graph", "rankdir": "LR"})
    dot_file.close()
    
    # Render the DOT file to a PNG image
    
    render("dot", "png", "Downloads/movies.dot")

    Finally, the above code will yield a movies.dot file and a rendered movies.dot.png image in the Downloads folder.

    The rendered image lays out every node and edge of the graph, presenting the relationships along with all of the movie information in a well-formatted way.

    Examples of Knowledge Graphs

    Now that we have knowledge of how we can create a knowledge graph, let’s explore the big players that are using such graphs for their operations.

    Google Knowledge Graph: This is one of the most well-known examples of a knowledge graph. It is used by Google to enhance its search results with additional information about entities, such as people, places, and things. For example, if you search for “Barack Obama,” the Knowledge Graph will display a panel with information about his birthdate, family members, education, career, and more. All this information is stored in the form of nodes and edges making it easier for the Google search engine to retrieve related information of any topic.

    DBpedia: This is a community-driven project that extracts structured data from Wikipedia and makes it available as a linked data resource. It is primarily used for graph analysis and executing SPARQL queries. It contains information on millions of entities, such as people, places, and things, and their relationships with one another. DBpedia can be used to power applications like question-answering systems, recommendation engines, and more. One of the key advantages of DBpedia is that it is an open and community-driven project, which means that anyone can contribute to it and use it for their own applications. This has led to a wide variety of applications built on top of DBpedia, from academic research to commercial products.
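
    As a quick illustration of how DBpedia is typically queried, here is a sketch using the SPARQLWrapper Python library against DBpedia’s public SPARQL endpoint; the library, the specific property, and the resource name are assumptions for the example, and the query only works while the public endpoint is reachable.

    from SPARQLWrapper import SPARQLWrapper, JSON
    
    # DBpedia's public SPARQL endpoint
    sparql = SPARQLWrapper("https://dbpedia.org/sparql")
    sparql.setReturnFormat(JSON)
    
    # Ask for a handful of films directed by Ridley Scott
    sparql.setQuery("""
        PREFIX dbo: <http://dbpedia.org/ontology/>
        PREFIX dbr: <http://dbpedia.org/resource/>
        SELECT ?film WHERE {
            ?film dbo:director dbr:Ridley_Scott .
        } LIMIT 5
    """)
    
    results = sparql.query().convert()
    for row in results["results"]["bindings"]:
        print(row["film"]["value"])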

    Having looked at these examples, one should know that knowledge graphs such as DBpedia rely on SPARQL queries to retrieve data from their huge corpus of graphs. So, let’s write one such query against the knowledge graph we created for the movie data: a query that retrieves every movie’s genre along with its title.

    from rdflib.plugins.sparql import prepareQuery
    
    # Define the SPARQL query
    query = prepareQuery('''
        PREFIX ebucore: <http://www.ebu.ch/metadata/ontologies/ebucore/ebucore#>
        SELECT ?genre ?title
        WHERE {
          ?movie a ebucore:Feature ;
                 ebucore:hasGenre ?genre ;
                 ebucore:title ?title .
        }
    ''')
    
    
    # Execute the query and print the results
    results = graph.query(query)
    for row in results:
        genre, title = row
        print(f"Movie Genre: {genre}, Movie Title: {title}")
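
    Run against the graph we built above, this query should print one row per movie, for example "Movie Genre: Action, Movie Title: Blade Runner" and "Movie Genre: Thriller, Movie Title: Top Gun" (the order of the rows is not guaranteed).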

    Challenges with Graph Databases:

    Data Complexity: One of the primary challenges with graph databases and knowledge graphs is data complexity. As the size and complexity of the data increase, it can become challenging to manage and query the data efficiently.

    Data Integration: Graph databases and knowledge graphs often need to integrate data from different sources, which can be challenging due to differences in data format, schema, and structure.

    Query Performance: Knowledge graphs are often used for complex queries, which can be slow to execute, especially for large datasets.

    Knowledge Representation: Representing knowledge in a graph database or knowledge graph can be challenging due to the diversity of concepts and relationships that need to be modeled accurately. One should have experience with ontologies, relationships, and business use cases to curate an accurate representation.

    Bonus: How to Overcome These Challenges:

    • Use efficient indexing and query optimization techniques to handle data complexity and improve query performance.
    • Use data integration tools and techniques to standardize data formats and structures to improve data integration.
    • Use distributed computing and partitioning techniques to scale the database horizontally.
    • Use caching and precomputing techniques to speed up queries.
    • Use ontology modeling and semantic reasoning techniques to accurately represent knowledge and relationships in the graph database or knowledge graph.

    Conclusion

    In conclusion, graph databases, and knowledge graphs are powerful tools that offer several advantages over traditional relational databases. They enable flexible modeling of complex data and relationships, which can be difficult to achieve using a traditional tabular structure. Moreover, they enhance query performance for complex queries and enable new use cases such as recommendation engines, fraud detection, and knowledge management.

    Despite the aforementioned challenges, graph databases and knowledge graphs are gaining popularity in various industries, ranging from finance to healthcare, and are expected to continue playing a significant role in the future of data management and analysis.

  • Vector Search: The New Frontier in Personalized Recommendations

    Introduction

    Imagine you are a modern-day treasure hunter, not in search of hidden gold, but rather the wealth of knowledge and entertainment hidden within the vast digital ocean of content. In this realm, where every conceivable topic has its own sea of content, discovering what will truly captivate you is like finding a needle in an expansive haystack.

    This challenge leads us to the marvels of recommendation services, acting as your compass in this digital expanse. These services are the unsung heroes behind the scenes of your favorite platforms, from e-commerce sites that suggest enticing products to streaming services that understand your movie preferences better than you might yourself. They sift through immense datasets of user interactions and content features, striving to tailor your online experience to be more personalized, engaging, and enriching.

    But what if I told you that there is a cutting-edge technology that can take personalized recommendations to the next level? Today, I will take you through a journey to build a blog recommendation service that understands the contextual similarities between different pieces of content, transcending beyond basic keyword matching. We’ll harness the power of vector search, a technology that’s revolutionizing personalized recommendations. We’ll explore how recommendation services are traditionally implemented, and then briefly discuss how vector search enhances them.

    Finally, we’ll put this knowledge to work, using OpenAI’s embedding API and Elasticsearch to create a recommendation service that not only finds content but also understands and aligns it with your unique interests.

    Exploring the Landscape: Traditional Recommendation Systems and Their Limits

    Traditionally, these digital compasses, or recommendation systems, employ methods like collaborative and content-based filtering. Imagine sitting in a café where the barista suggests a coffee based on what others with similar tastes enjoyed (collaborative filtering) or based on your past coffee choices (content-based filtering). While these methods have been effective in many scenarios, they come with some limitations. They often stumble when faced with the vast and unstructured wilderness of web data, struggling to make sense of the diverse and ever-expanding content landscape. Additionally, when user preferences are ambiguous or when you want to recommend content by truly understanding it on a semantic level, traditional methods may fall short.

    Enhancing Recommendation with Vector Search and Vector Databases

    Our journey now takes an exciting turn with vector search and vector databases, the modern tools that help us navigate this unstructured data. These technologies transform our café into a futuristic spot where your coffee preference is understood on a deeper, more nuanced level.

    Vector Search: The Art of Finding Similarities

    Vector search operates like a seasoned traveler who understands the essence of every place visited. Text, images, or sounds can be transformed into numerical vectors, like unique coordinates on a map. The magic happens when these vectors are compared, revealing hidden similarities and connections, much like discovering that two seemingly different cities share a similar vibe.

    Vector Databases: Navigating Complex Data Landscapes

    Imagine a vast library of books where each book captures different aspects of a place along with its coordinates. Vector databases are akin to this library, designed to store and navigate these complex data points. They easily handle intricate queries over large datasets, making them perfect for our recommendation service, ensuring no blog worth reading remains undiscovered.

    Embeddings: Semantic Representation

    In our journey, embeddings are akin to a skilled artist who captures not just the visuals but the soul of a landscape. They map items like words or entire documents into real-number vectors, encapsulating their deeper meaning. This helps in understanding and comparing different pieces of content on a semantic level, letting the recommendation service show you things that really match your interests.

    Sample Project: Blog Recommendation Service

    Project Overview

    Now, let’s craft a simple blog recommendation service using OpenAI’s embedding APIs and Elasticsearch as a vector database. The goal is to recommend blogs similar to the current one the user is reading, which can be shown in the read more or recommendation section.

    Our blog service will be responsible for indexing blogs, finding similar ones, and interacting with the UI Service.

    Tools and Setup

    We will need the following tools to build our service:

    • OpenAI Account: We will be using OpenAI’s embedding API to generate the embeddings for our blog content. You will need an OpenAI account to use the APIs. Once you have created an account, please create an API key and store it in a secure location.
    • Elasticsearch: A popular database renowned for its full-text search capabilities, which can also be used as a vector database, adept at storing and querying complex embeddings with its dense_vector field type.
    • Docker: A tool that allows developers to package their applications and all the necessary dependencies into containers, ensuring that the application runs smoothly and consistently across different computing environments.
    • Python: A versatile programming language for developers across diverse fields, from web development to data science.

    The APIs will be created using the FastAPI framework, but you can choose any framework.
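
    The endpoint snippets later in this post assume a FastAPI application object and a few related imports; a minimal setup sketch (the service title is just a placeholder) could look like this:

    from fastapi import FastAPI, Response, status
    from pydantic import BaseModel
    
    # Application object referenced by the endpoint snippets below
    app = FastAPI(title="Blog Recommendation Service")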

    Steps

    First, we’ll create a BlogItem class to represent each blog. It has only three fields, which will be enough for this demonstration, but real-world entities would have more details to accommodate a wider range of properties and functionalities.

    class BlogItem:
        blog_id: int
        title: str
        content: str
    
        def __init__(self, blog_id: int, title: str, content: str):
            self.blog_id = blog_id
            self.title = title
            self.content = content

    Elasticsearch Setup:

    • To store the blog data along with its embedding in Elasticsearch, we need to set up a local Elasticsearch cluster and then create an index for our blogs. You can also use a cloud-based version if you have already procured one for personal use.
    • Install Docker or Docker Desktop on your machine and create the Elasticsearch and Kibana containers using the docker compose file below. Run the following command to create and start the services in the background: docker compose -f /path/to/your/docker-compose/file up -d (you can exclude the file path if you are in the same directory as your docker-compose.yml file).
    • The advantage of using docker compose is that it allows you to clean up these resources with just one command: docker compose -f /path/to/your/docker-compose/file down.

    version: '3'
    
    services:
    
      elasticsearch:
        image: docker.elastic.co/elasticsearch/elasticsearch:<version>
        container_name: elasticsearch
        environment:
          - node.name=docker-cluster
          - discovery.type=single-node
          - cluster.routing.allocation.disk.threshold_enabled=false
          - bootstrap.memory_lock=true
          - "ES_JAVA_OPTS=-Xms512m -Xmx512m"
          - xpack.security.enabled=true
          - ELASTIC_PASSWORD=YourElasticPassword
          - "ELASTICSEARCH_USERNAME=elastic"
        ulimits:
          memlock:
            soft: -1
            hard: -1
        volumes:
          - esdata:/usr/share/elasticsearch/data
        ports:
          - "9200:9200"
        networks:
          - esnet
    
      kibana:
        image: docker.elastic.co/kibana/kibana:<version>
        container_name: kibana
        environment:
          ELASTICSEARCH_HOSTS: http://elasticsearch:9200
          ELASTICSEARCH_USERNAME: elastic
          ELASTICSEARCH_PASSWORD: YourElasticPassword
        ports:
          - "5601:5601"
        networks:
          - esnet
        depends_on:
          - elasticsearch
    
    networks:
      esnet:
        driver: bridge
    
    volumes:
      esdata:
        driver: local

    • Connect to the local ES instance and create an index. Our “blogs” index will have a unique blog ID, blog title, blog content, and an embedding field to store the vector representation of blog content. The text-embedding-ada-002 model we have used here produces vectors with 1536 dimensions; hence, it’s important to use the same in our embeddings field in the blogs index.
    import logging
    from elasticsearch import Elasticsearch
    
    # Setting up logging
    logging.basicConfig(level=logging.INFO, format='%(asctime)s [%(levelname)s] %(filename)s:%(lineno)d - %(message)s',
                        datefmt='%Y-%m-%d %H:%M:%S')
    # Elasticsearch client setup
    es = Elasticsearch(hosts="http://localhost:9200",
                       basic_auth=("elastic", "YourElasticPassword"))
    
    
    # Create an index with a mappings for embeddings
    def create_index(index_name: str, embedding_dimensions: int):
        try:
            es.indices.create(
                index=index_name,
                body={
                    'mappings': {
                        'properties': {
                            'blog_id': {'type': 'long'},
                            'title': {'type': 'text'},
                            'content': {'type': 'text'},
                            'embedding': {'type': 'dense_vector', 'dims': embedding_dimensions}
                        }
                    }
                }
            )
        except Exception as e:
            logging.error(f"An error occurred while creating index {index_name} : {e}")
    
    
    # Sample usage
    create_index("blogs", 1536)

    Create Embeddings AND Index Blogs:

    • We use OpenAI’s Embeddings API to get a vector representation of our blog title and content. I am using the text-embedding-ada-002 model here, which OpenAI recommends for most use cases. The input to text-embedding-ada-002 should not exceed 8,191 tokens (1,000 tokens are roughly equal to 750 words) and cannot be empty.
    from openai import OpenAI, OpenAIError
    
    api_key = 'YourApiKey'
    client = OpenAI(api_key=api_key)
    
    def create_embeddings(text: str, model: str = "text-embedding-ada-002") -> list[float]:
        try:
        text = text.replace("\n", " ")
            response = client.embeddings.create(input=[text], model=model)
            logging.info(f"Embedding created successfully")
            return response.data[0].embedding
        except OpenAIError as e:
            logging.error(f"An OpenAI error occurred while creating embedding : {e}")
            raise
        except Exception as e:
            logging.exception(f"An unexpected error occurred while creating embedding  : {e}")
            raise
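
    • A minimal usage sketch of the function above (the sample text is made up; any non-empty string within the token limit works):
    # Generate an embedding for a short piece of text and inspect its size
    sample_embedding = create_embeddings("A short test sentence about heart disease research.")
    print(len(sample_embedding))  # 1536 dimensions for text-embedding-ada-002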

    • When the blogs get created or the content of the blog gets updated, we will call the create_embeddings function to get text embedding and store it in our blogs index.
    # Define the index name as a global constant
    
    ELASTICSEARCH_INDEX = 'blogs'
    
    def index_blog(blog_item: BlogItem):
        try:
            es.index(index=ELASTICSEARCH_INDEX, body={
                'blog_id': blog_item.blog_id,
                'title': blog_item.title,
                'content': blog_item.content,
            'embedding': create_embeddings(blog_item.title + "\n" + blog_item.content)
            })
        except Exception as e:
            logging.error(f"Failed to index blog with blog id {blog_item.blog_id} : {e}")
            raise

    • Create a Pydantic model for the request body:
    class BlogItemRequest(BaseModel):
        title: str
        content: str

    • Create an API to save blogs to Elasticsearch. The UI Service would call this API when a new blog post gets created.
    @app.post("/blogs/")
    def save_blog(response: Response, blog_item: BlogItemRequest) -> dict[str, str]:
        # Create a BlogItem instance from the request data
        blog_id = None  # defined up front so the error log below can reference it even if get_blog_id() fails
        try:
            blog_id = get_blog_id()
            blog_item_obj = BlogItem(
                blog_id=blog_id,
                title=blog_item.title,
                content=blog_item.content,
            )
    
            # Call the index_blog method to index the blog
            index_blog(blog_item_obj)
            return {"message": "Blog indexed successfully", "blog_id": str(blog_id)}
        except Exception as e:
            logging.error(f"An error occurred while indexing blog with blog_id {blog_id} : {e}")
            response.status_code = status.HTTP_500_INTERNAL_SERVER_ERROR
            return {"error": "Failed to index blog"}

    Finding Relevant Blogs:

    • To find blogs that are similar to the current one, we will compare the current blog’s vector representation with other blogs present in the Elasticsearch index using the cosine similarity function.
    • Cosine similarity is a mathematical measure used to determine the cosine of the angle between two vectors in a multi-dimensional space, often employed to assess the similarity between two documents or data points. 
    • The cosine similarity score ranges from -1 to 1, with higher values indicating a greater degree of similarity between the vectors, as sketched below.
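    • To make the measure concrete, here is a small, framework-free sketch of the calculation (the toy 3-dimensional vectors are purely illustrative – real ada-002 embeddings have 1536 dimensions, and Elasticsearch performs this computation for us via its cosineSimilarity script function):
    import numpy as np

    def cosine_similarity(a: list[float], b: list[float]) -> float:
        a, b = np.array(a), np.array(b)
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    print(cosine_similarity([0.1, 0.3, 0.5], [0.1, 0.28, 0.52]))   # close to 1 -> very similar
    print(cosine_similarity([0.1, 0.3, 0.5], [-0.1, -0.3, -0.5]))  # -1 -> pointing in opposite directions
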
    • Create a custom exception to handle a scenario when a blog for a given ID is not present in Elasticsearch.
    class BlogNotFoundException(Exception):
        def __init__(self, message="Blog not found"):
            self.message = message
            super().__init__(self.message)

    • First, we will check if the current blog is present in the blogs index and get its embedding. This is done to prevent unnecessary calls to the OpenAI API, as it consumes tokens. Then, we construct an Elasticsearch DSL query to find the nearest neighbors and return their blog content.
    def get_blog_embedding(blog_id: int) -> Optional[Dict]:
        try:
            response = es.search(index=ELASTICSEARCH_INDEX, body={
                'query': {
                    'term': {
                        'blog_id': blog_id
                    }
                },
                '_source': ['title', 'content', 'embedding']  # Fetch title, content and embedding
            })
    
            if response['hits']['hits']:
                logging.info(f"Blog found with blog_id {blog_id}")
                return response['hits']['hits'][0]['_source']
            else:
                logging.info(f"No blog found with blog_id {blog_id}")
                return None
        except Exception as e:
            logging.error(f"Error occurred while searching for blog with blog_id {blog_id}: {e}")
            raise
    
    
    def find_similar_blog(current_blog_id: int, num_neighbors=2) -> list[dict[str, str]]:
        try:
            blog_data = get_blog_embedding(current_blog_id)
            if not blog_data:
                raise BlogNotFoundException(f"Blog not found for id:{current_blog_id}")
            blog_embedding = blog_data['embedding']
            if not blog_embedding:
                blog_embedding = create_embeddings(blog_data['title'] + '\n' + blog_data['content'])
            # Find similar blogs using the embedding
            response = es.search(index=ELASTICSEARCH_INDEX, body={
                'size': num_neighbors + 1,  # Retrieve extra result as we'll exclude the current blog
                '_source': ['title', 'content', 'blog_id', '_score'],
                'query': {
                    'bool': {
                        'must': {
                            'script_score': {
                                'query': {'match_all': {}},
                                'script': {
                                    'source': "cosineSimilarity(params.query_vector, 'embedding')",
                                    'params': {'query_vector': blog_embedding}
                                }
                            }
                        },
                        'must_not': {
                            'term': {
                                'blog_id': current_blog_id  # Exclude the current blog
                            }
                        }
                    }
                }
            })
    
            # Extract and return the hits
            hits = [
                {
                    'title': hit['_source']['title'],
                    'content': hit['_source']['content'],
                    'blog_id': hit['_source']['blog_id'],
                    'score': f"{hit['_score'] * 100:.2f}%"
                }
                for hit in response['hits']['hits']
                if hit['_source']['blog_id'] != current_blog_id
            ]
    
            return hits
        except Exception as e:
            logging.error(f"An error while finding similar blogs: {e}")
            raise

    • Define a Pydantic model for the response:
    class BlogRecommendation(BaseModel):
        blog_id: int
        title: str
        content: str
        score: str

    • Create an API that the UI Service would use to find blogs similar to the one the user is currently reading:
    @app.get("/recommend-blogs/{current_blog_id}")
    def recommend_blogs(
            response: Response,
            current_blog_id: int,
            num_neighbors: Optional[int] = 2) -> Union[Dict[str, str], List[BlogRecommendation]]:
        try:
            # Call the find_similar_blog function to get recommended blogs
            recommended_blogs = find_similar_blog(current_blog_id, num_neighbors)
            return recommended_blogs
        except BlogNotFoundException as e:
            response.status_code = status.HTTP_400_BAD_REQUEST
            return {"error": f"Blog not found for id:{current_blog_id}"}
        except Exception as e:
            response.status_code = status.HTTP_500_INTERNAL_SERVER_ERROR
            return {"error": "Unable to process the request"}

    • The below flow diagram summarizes all the steps we have discussed so far:

    Testing the Recommendation Service

    • Ideally, we would receive the blog ID from the UI Service and pass the recommendations back, but for illustration purposes, we’ll call the recommend-blogs API with some inputs from our test dataset. The blogs in this sample dataset have concise titles and content, which are sufficient for testing purposes, but real-world blogs will be much more detailed and contain a significant amount of data. The test dataset has around 1,000 blogs in various categories like healthcare, tech, travel, entertainment, and so on.
    • A sample from the test dataset:

     

    • Test Result 1: Medical Research Blog

      Input Blog: Blog_Id: 1, Title: Breakthrough in Heart Disease Treatment, Content: Researchers have developed a new treatment for heart disease that promises to be more effective and less invasive. This breakthrough could save millions of lives every year.


    • Test Result 2: Travel Blog

      Input Blog: Blog_Id: 4, Title: Travel Tips for Sustainable Tourism, Content: How to travel responsibly and sustainably.


    I manually tested multiple blogs from the test dataset of 1,000 blogs, representing distinct topics and content, and assessed the quality and relevance of the recommendations. The recommended blogs had scores in the range of 87% to 95%, and upon examination, the blogs often appeared very similar in content and style.

    Based on the test results, it’s evident that utilizing vector search enables us to effectively recommend blogs to users that are semantically similar. This approach ensures that the recommendations are contextually relevant, even when the blogs don’t share identical keywords, enhancing the user’s experience by connecting them with content that aligns more closely with their interests and search intent.

    Limitations

    This approach for finding similar blogs is good enough for our simple recommendation service, but it might have certain limitations in real-world applications.

    • Our similarity search always returns the nearest k neighbors as recommendations, but there may be scenarios where no truly similar blog exists or where the neighbors have significantly different scores. To deal with this, you can set a threshold to filter out recommendations below a certain score, as sketched in the snippet below. Experiment with different threshold values and observe their impact on recommendation quality.
    • If your use case involves a small dataset and the relationships between user preferences and item features are straightforward and well-defined, traditional methods like content-based or collaborative filtering might be more efficient and effective than vector search.
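    • As a rough illustration of the threshold idea, the hits returned by find_similar_blog could be filtered before being sent back to the UI. The 0.90 cut-off below is an arbitrary starting point, not a recommendation:
    SCORE_THRESHOLD = 0.90  # arbitrary starting point; tune against your own data

    def filter_recommendations(hits: list[dict[str, str]], threshold: float = SCORE_THRESHOLD):
        # 'score' was formatted as a percentage string (e.g. "92.50%"), so parse it back to a float
        return [hit for hit in hits if float(hit["score"].rstrip("%")) / 100 >= threshold]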

    Further Improvements

    • Using LLM for Content Validation: Implement a verification step using large language models (LLMs) to assess the relevance and validity of recommended content. This approach can ensure that the suggestions are not only similar in context but also meaningful and appropriate for your audience.
    • Metadata-based Embeddings: Instead of generating embeddings from the entire blog content, utilize LLMs to extract key metadata such as themes, intent, tone, or key points. Create embeddings based on this extracted metadata, which can lead to more efficient and targeted recommendations, focusing on the core essence of the content rather than its entirety.

    Conclusion

    Our journey concludes here, but yours is just beginning. Armed with the knowledge of vector search, vector databases, and embeddings, you’re now ready to build a recommendation service that doesn’t just guide users to content but connects them to the stories, insights, and experiences they seek. It’s not just about building a service; it’s about enriching the digital exploration experience, one recommendation at a time.

  • Mage: Your New Go-To Tool for Data Orchestration

    In our journey to automate data pipelines, we’ve used tools like Apache Airflow, Dagster, and Prefect to manage complex workflows. However, as data automation continues to change, we’ve added a new tool to our toolkit: Mage AI.

    Mage AI isn’t just another tool; it’s a solution to the evolving demands of data automation. This blog aims to explain how Mage AI is changing the way we automate data pipelines by addressing challenges and introducing innovative features. Let’s explore this evolution, understand the problems we face, and see why we’ve adopted Mage AI.

    What is Mage AI?

    Mage is a user-friendly open-source framework created for transforming and merging data. It’s a valuable tool for developers handling substantial data volumes efficiently. At its heart, Mage relies on “data pipelines,” made up of code blocks. These blocks can run independently or as part of a larger pipeline. Together, these blocks form a structure known as a directed acyclic graph (DAG), which helps manage dependencies. For example, you can use Mage for tasks like loading data, transforming it, or exporting it.

    Mage Architecture:

    Before we delve into Mage’s features, let’s take a look at how it works.

    When you use Mage, your request begins its journey in the Mage Server Container, which serves as the central hub for handling requests, processing data, and validation. Here, tasks like data processing and real-time interactions occur. The Scheduler Process ensures tasks are scheduled with precision, while Executor Containers, designed for specific tasks like Python or AWS, carry out the instructions.

    Mage’s scalability is impressive, allowing it to handle growing workloads effectively. It can expand both vertically and horizontally to maintain top-notch performance. Mage efficiently manages project artifacts, including code, data, and logs, and takes security seriously when handling databases and sensitive information. This well-coordinated system, combined with Mage’s scalability, guarantees reliable data pipelines, blending technical precision with seamless orchestration.

    Scaling Mage:

    To enhance Mage’s performance and reliability as your workload expands, it’s crucial to scale its architecture effectively. In this concise guide, we’ll concentrate on four key strategies for optimizing Mage’s scalability:

    1. Horizontal Scaling: Ensure responsiveness by running multiple Mage Server and Scheduler instances. This approach keeps the system running smoothly, even during peak usage.
    2. Multiple Executor Containers: Deploy several Executor Containers to handle concurrent task execution. Customize them for specific executors (e.g., Python, PySpark, or AWS) to scale task processing horizontally as needed.
    3. External Load Balancers: Utilize external load balancers to distribute client requests across Mage instances. This not only boosts performance but also ensures high availability by preventing overloading of a single server.
    4. Scaling for Larger Datasets: To efficiently handle larger datasets, consider:

    a. Allocating more resources to executors, empowering them to tackle complex data transformations.

    b. Leveraging Mage’s direct data warehouse transformations and native Spark integration for massive datasets.

    Features: 

    1) Interactive Coding Experience

    Mage offers an interactive coding experience tailored for data preparation. Each block in the editor is a modular file that can be tested, reused, and chained together to create an executable data pipeline. This means you can build your data pipeline piece by piece, ensuring reliability and efficiency.

    2) UI/IDE for Building and Managing Data Pipelines

    Mage takes data pipeline development to the next level with a user-friendly integrated development environment (IDE). You can build and manage your data pipelines through an intuitive user interface, making the process efficient and accessible to both data scientists and engineers.

    3) Multiple Languages Support

    Mage supports writing pipelines in multiple languages such as Python, SQL, and R. This language versatility means you can work with the languages you’re most comfortable with, making your data preparation process more efficient.

    4) Multiple Types of Pipelines

    Mage caters to diverse data pipeline needs. Whether you require standard batch pipelines, data integration pipelines, streaming pipelines, Spark pipelines, or DBT pipelines, Mage has you covered.

    5) Built-In Engineering Best Practices

    Mage is not just a tool; it’s a promoter of good coding practices. It enables reusable code, data validation in each block, and operationalizes data pipelines with built-in observability, data quality monitoring, and lineage. This ensures that your data pipelines are not only efficient but also maintainable and reliable.

    6) Dynamic Blocks

    Dynamic blocks in Mage allow the output of a block to dynamically create additional blocks. These blocks are spawned at runtime, with the total number of blocks created being equal to the number of items in the output data of the dynamic block multiplied by the number of its downstream blocks.

    7) Triggers

    • Schedule Triggers: These triggers allow you to set specific start dates and intervals for pipeline runs. Choose from daily, weekly, or monthly, or even define custom schedules using Cron syntax. Mage’s Schedule Triggers put you in control of when your pipelines execute.
    • Event Triggers: With Event Triggers, your pipelines respond instantly to specific events, such as the completion of a database query or the creation of a new object in cloud storage services like Amazon S3 or Google Storage. Real-time automation at your fingertips.
    • API Triggers: API Triggers enable your pipelines to run in response to specific API calls. Whether it’s customer requests or external system interactions, these triggers ensure your data workflows stay synchronized with the digital world.

    Different Types of Blocks:

    Data Loader: Within Mage, Data Loaders are ready-made templates designed to seamlessly link up with a multitude of data sources. These sources span from Postgres, BigQuery, Redshift, and S3 to various others. Additionally, Mage allows for the creation of custom data loaders, enabling connections to APIs. The primary role of Data Loaders is to facilitate the retrieval of data from these designated sources.

    Data Transformer: Much like Data Loaders, Data Transformers provide predefined functions such as handling duplicates, managing missing data, and excluding specific columns. Alternatively, you can craft your own data transformations or merge outputs from multiple data loaders to preprocess and sanitize the data before it advances through the pipeline.

    Data Exporter: Data Exporters within Mage empower you to dispatch data to a diverse array of destinations, including databases, data lakes, data warehouses, or local storage. You can opt for predefined export templates or craft custom exporters tailored to your precise requirements.

    Custom Blocks: Custom blocks in the Mage framework are incredibly flexible and serve various purposes. They can store configuration data and facilitate its transmission across different pipeline stages. Additionally, they prove invaluable for logging purposes, allowing you to categorize and visually distinguish log entries for enhanced organization.

    Sensor: A Sensor, a specialized block within Mage, continuously assesses a condition until it’s met or until a specified time duration has passed. When a block depends on a sensor, it remains inactive until the sensor confirms that its condition has been satisfied. Sensors are especially valuable when there’s a need to wait for external dependencies or handle delayed data before proceeding with downstream tasks.

    Getting Started with Mage

    There are two ways to run Mage: either using Docker or using pip.
    Docker Command

    Create a new working directory where all the Mage files will be stored.

    Then, in that working directory, execute this command:

    Windows CMD: 

    docker run -it -p 6789:6789 -v %cd%:/home/src mageai/mageai /app/run_app.sh mage start [project_name]

    Linux CMD:

    docker run -it -p 6789:6789 -v $(pwd):/home/src mageai/mageai /app/run_app.sh mage start [project_name]

    Using pip (in the working directory):

    pip install mage-ai

    mage start [project_name]

    You can browse to http://localhost:6789/overview to get to the Mage UI.

    Let’s build our first pipeline to fetch CSV files for data loading, do some useful transformations, and export that data to our local database.

    The dataset is an invoices CSV file stored in the current working directory, with the following columns:

     (1) First Name; (2) Last Name; (3) E-mail; (4) Product ID; (5) Quantity; (6) Amount; (7) Invoice Date; (8) Address; (9) City; (10) Stock Code

    Create a new pipeline and select a standard batch (we’ll be implementing a batch pipeline) from the dashboard and give it a unique ID.

    Project structure:

    ├── mage_data
    └── [project_name]
        ├── charts
        ├── custom
        ├── data_exporters
        ├── data_loaders
        ├── dbt
        ├── extensions
        ├── pipelines
        │   └── [pipeline_name]
        │       ├── __init__.py
        │       └── metadata.yaml
        ├── scratchpads
        ├── transformers
        ├── utils
        ├── __init__.py
        ├── io_config.yaml
        ├── metadata.yaml
        └── requirements.txt

    The project contains all the block files for our pipeline, including data loaders, transformers, and charts, along with the configuration files io_config.yaml and metadata.yaml. Each block file contains a decorated function in which we write our code.

    1. We begin by loading a CSV file from our local directory, specifically located at /home/src/invoice.csv. To achieve this, we select the “Local File” option from the Templates dropdown and configure the Data Loader block accordingly. Running this configuration will allow us to confirm if the CSV file loads successfully.
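
    A simplified sketch of what such a Data Loader block can look like is shown below; it uses plain pandas rather than the exact generated template and assumes the CSV is mounted at /home/src/invoice.csv as described above.

    import pandas as pd

    if 'data_loader' not in globals():
        from mage_ai.data_preparation.decorators import data_loader

    @data_loader
    def load_invoice_data(*args, **kwargs):
        # Load the local CSV file mounted into the Mage container
        return pd.read_csv('/home/src/invoice.csv')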

    2. In the next step, we introduce a Transformer block using a generic template. On the right side of the user interface, we can observe the directed acyclic graph (DAG) tree. To establish the data flow, we edit the parent of the Transformer block, linking it either directly to the Data Loader block or via the user interface.

    The Transformer block operates on the data frame received from the upstream Data Loader block, which is passed as the first argument to the Transformer function.
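
    A generic Transformer block is roughly shaped like the sketch below; the column clean-up shown here is only illustrative, and the @test function mirrors the optional in-block test that Mage’s templates generate.

    if 'transformer' not in globals():
        from mage_ai.data_preparation.decorators import transformer
    if 'test' not in globals():
        from mage_ai.data_preparation.decorators import test

    @transformer
    def transform(df, *args, **kwargs):
        # Drop rows with missing values and normalize column names (illustrative only)
        df = df.dropna()
        df.columns = [c.strip().lower().replace(' ', '_') for c in df.columns]
        return df

    @test
    def test_output(output, *args) -> None:
        assert output is not None, 'The output is undefined'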

    3. Our final step involves exporting the DataFrame to a locally hosted PostgreSQL database. We incorporate a Data Exporter block and connect it to the Transformer block.

    To establish a connection with the PostgreSQL database, it is imperative to configure the database credentials in the io_config.yaml file. Alternatively, these credentials can be added to environmental variables.

    With these steps completed, we have successfully constructed a foundational batch pipeline. This pipeline efficiently loads, transforms, and exports data, serving as a fundamental building block for more advanced data processing tasks.

    Mage vs Other tools:

    Consistency Across Environments: Some orchestration tools may exhibit inconsistencies between local development and production environments due to varying configurations. Mage tackles this challenge by providing a consistent and reproducible workflow environment through a single configuration file that can be executed uniformly across different environments.

    Reusability: Achieving reusability in workflows can be complex in some tools. Mage simplifies this by allowing tasks and workflows to be defined as reusable blocks within a Mage project, making it effortless to share them across projects and teams.

    Data Passing: Efficiently passing data between tasks can be a challenge in certain tools, especially when dealing with large datasets. Mage streamlines data passing through straightforward function arguments and returns, enabling seamless data flow and versatile data handling.

    Testing: Some tools lack user-friendly testing utilities, resulting in manual testing and potential coverage gaps. Mage simplifies testing with a robust testing framework that enables the definition of test cases, inputs, and expected outputs directly within each block file.

    Debugging: Debugging failed tasks can be time-consuming with certain tools. Mage enhances debugging with detailed logs and error messages, offering clear insights into the causes of failures and expediting issue resolution.

    Conclusion: 

    Mage offers a streamlined and user-friendly approach to data pipeline orchestration, addressing common challenges with simplicity and efficiency. Its single-container deployment, visual interface, and robust features make it a valuable tool for data professionals seeking an intuitive and consistent solution for managing data workflows.

  • Unlocking Legal Insights: Effortless Document Summarization with OpenAI’s LLM and LangChain

    The Rising Demand for Legal Document Summarization:

    • In a world where data, information, and legal complexities are prevalent, the volume of legal documents is growing rapidly. Law firms, legal professionals, and businesses are dealing with an ever-increasing number of legal texts, including contracts, court rulings, statutes, and regulations.
    • These documents contain important insights, but understanding them can be overwhelming. This is where the demand for legal document summarization comes in. 
    • In this blog, we’ll discuss the increasing need for summarizing legal documents and how modern technology is changing the way we analyze legal information, making it more efficient and accessible.

    Overview of OpenAI and LangChain

    • We’ll use the LangChain framework to build our application with LLMs. These models, powered by deep learning, have been extensively trained on large text datasets. They excel in various language tasks like translation, sentiment analysis, chatbots, and more. 
    • LLMs can understand complex text, identify entities, establish connections, and generate coherent content. We can use Meta’s LLaMA models, OpenAI’s LLMs, and others; in this case, we will be using OpenAI’s LLM.

    • OpenAI is a leader in the field of artificial intelligence and machine learning. They have developed powerful Large Language Models (LLMs) that are capable of understanding and generating human-like text.
    •  These models have been trained on vast amounts of textual data and can perform a wide range of natural language processing tasks.

    LangChain is an innovative framework designed to simplify and enhance the development of applications and systems that involve natural language processing (NLP) and large language models (LLMs). 

    It provides a structured and efficient approach for working with LLMs like OpenAI’s GPT-3 and GPT-4 to tackle various NLP tasks. Here’s an overview of LangChain’s key features and capabilities:

    • Modular NLP Workflow: Build flexible NLP pipelines using modular blocks. 
    • Chain-Based Processing: Define processing flows using chain-based structures. 
    • Easy Integration: Seamlessly integrate LangChain with other tools and libraries.
    • Scalability: Scale NLP workflows to handle large datasets and complex tasks. 
    • Extensive Language Support: Work with multiple languages and models. 
    • Data Visualization: Visualize NLP pipeline results for better insights.
    • Version Control: Track changes and manage NLP workflows efficiently. 
    • Collaboration: Enable collaborative NLP development and experimentation.

    Setting Up Environment

    Setting Up Google Colab

    Google Colab provides a powerful and convenient platform for running Python code with the added benefit of free GPU support. To get started, follow these steps:

    1. Visit Google Colab: Open your web browser and navigate to Google Colab.
    2. Sign In or Create a Google Account: You’ll need to sign in with your Google account to use Google Colab. If you don’t have one, you can create an account for free.
    3. Create a New Notebook: Once signed in, click on “New Notebook” to create a new Colab notebook.
    4. Choose Python Version: In the notebook, click on “Runtime” in the menu and select “Change runtime type.” Choose your preferred Python version (usually Python 3) and set the hardware accelerator to “GPU.”

    OpenAI API Key Generation:

    1. Visit the OpenAI Website: Go to the OpenAI website.
    2. Sign In or Create an Account: Sign in or create a new OpenAI account.
    3. Generate a New API Key: Access the API section and generate a new API key.
    4. Name Your API Key: Give your API key a name that reflects its purpose.
    5. Copy the API Key: Copy the generated API key to your clipboard.
    6. Store the API Key Safely: Securely store the API key and do not share it publicly.

    Understanding Legal Document Summarization Workflow

    1. Map Step:

    • At the heart of our legal document summarization process is the Map-Reduce paradigm.
    • In the Map step, we treat each legal document individually. Think of it as dissecting a large puzzle into smaller, manageable pieces.
    • For each document, we employ a sophisticated Language Model (LLM). This LLM acts as our expert, breaking down complex legal language and extracting meaningful content.
    • The LLM generates concise summaries for each document section, essentially translating legalese into understandable insights.
    • These individual summaries become our building blocks, our pieces of the puzzle.

    2. Reduce Step:

    • Now, let’s shift our focus to the Reduce step.
    • Here’s where we bring everything together. We’ve generated summaries for all the document sections, and it’s time to assemble them into a cohesive whole.
    • Imagine the Reduce step as the puzzle solver. It takes all those individual pieces (summaries) and arranges them to form the big picture.
    • The goal is to produce a single, comprehensive summary that encapsulates the essence of the entire legal document.

    3. Compression – Ensuring a Smooth Fit:

    • One challenge we encounter is the potential length of these individual summaries. Some legal documents can produce quite lengthy summaries.
    • To ensure a smooth flow within our summarization process, we’ve introduced a compression step.

    4. Recursive Compression:

    • In some cases, even the compressed summaries might need further adjustment.
    • That’s where the concept of recursive compression comes into play.
    • If necessary, we’ll apply compression multiple times, refining and optimizing the summaries until they seamlessly fit into our summarization pipeline.

    Let’s Get Started

    Step 1: Installing python libraries

    Create a new notebook in Google Colab and install the required Python libraries.

    !pip install openai langchain tiktoken pypdf

    OpenAI: Installed to access OpenAI’s powerful language models for legal document summarization.

    LangChain: Essential for implementing document mapping, reduction, and combining workflows efficiently.

    Tiktoken: Helps manage token counts within text data, ensuring efficient usage of language models and avoiding token limit issues.

    Step 2: Adding OpenAI API key to Colab

    Add your OpenAI API key to Google Colab Secrets.

    # Read the key from Colab's Secrets manager (the secret name below is a placeholder)
    from google.colab import userdata
    API_KEY = userdata.get("YOUR_SECRET_KEY_NAME")

    Step 3: Initializing OpenAI LLM

    Here, we import the OpenAI module from LangChain and initialize it with the provided API key to utilize advanced language models for document summarization.

    from langchain.llms import OpenAI
    llm = OpenAI(openai_api_key=API_KEY)

    Step 4: Splitting text by Character

    The Text Splitter, in this case, overcomes the token limit by breaking down the text into smaller chunks that are each within the token limit. This ensures that the text can be processed effectively by the language model without exceeding its token capacity. 

    The “chunk_overlap” parameter allows for some overlap between chunks to ensure that no information is lost during the splitting process.

    from langchain.text_splitter import CharacterTextSplitter
    text_splitter = CharacterTextSplitter.from_tiktoken_encoder(
        chunk_size=1000, chunk_overlap=120
    )

    Step 5: Loading PDF documents

    from langchain.document_loaders import PyPDFLoader
    def chunks(pdf_file_path):
        loader = PyPDFLoader(pdf_file_path)
        docs = loader.load_and_split()
        return docs

    It initializes a PyPDFLoader object named “loader” using the provided PDF file path. This loader is responsible for loading and processing the contents of the PDF file. 

    It then uses the “loader” to load and split the PDF document into smaller “docs” or document chunks. These document chunks likely represent different sections or pages of the PDF file. 

    Finally, it returns the list of document chunks, making them available for further processing or analysis.
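
    A quick usage sketch of the helper above (the file path is hypothetical – point it at any PDF uploaded to your Colab session):

    docs = chunks("/content/legal.pdf")
    print(f"Loaded {len(docs)} document chunks")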

    Step 6: Map Reduce Prompt Templates

    Import libraries required for the implementation of LangChain MapReduce.

    from langchain.chains.mapreduce import MapReduceChain
    from langchain.text_splitter import CharacterTextSplitter
    from langchain.chains import ReduceDocumentsChain, MapReduceDocumentsChain
    from langchain import PromptTemplate
    from langchain.chains import LLMChain
    from langchain.chains.combine_documents.stuff import StuffDocumentsChain

    map_template = """The following is a set of documents
    {docs}
    Based on this list of docs, summarised into meaningful
    Helpful Answer:"""
    
    map_prompt = PromptTemplate.from_template(map_template)
    map_chain = LLMChain(llm=llm, prompt=map_prompt)
    
    reduce_template = """The following is set of summaries:
    {doc_summaries}
    Take these and distil it into a final consolidated summary with title(mandatory) in bold with important key points . 
    Helpful Answer:"""
    
    reduce_prompt = PromptTemplate.from_template(reduce_template)
    reduce_chain = LLMChain(llm=llm, prompt=reduce_prompt)

    Template Definition: 

    The code defines two templates, map_template and reduce_template, which serve as structured prompts for instructing a language model on how to process and summarise sets of documents. 

    LLMChains for Mapping and Reduction: 

    Two LLMChains, map_chain, and reduce_chain, are configured with these templates to execute the mapping and reduction steps in the document summarization process, making it more structured and manageable.

    Step 7: Map and Reduce LLM Chains

    combine_documents_chain = StuffDocumentsChain(
        llm_chain=reduce_chain, document_variable_name="doc_summaries"
    )
    reduce_documents_chain = ReduceDocumentsChain(
        combine_documents_chain=combine_documents_chain,
        collapse_documents_chain=combine_documents_chain,
        token_max=5000,
    )

    map_reduce_chain = MapReduceDocumentsChain(
        llm_chain=map_chain,
        reduce_documents_chain=reduce_documents_chain,
        document_variable_name="docs",
        return_intermediate_steps=False,
    )

    Combining Documents Chain (combine_documents_chain): 

    • This chain plays a crucial role in the document summarization process. It takes the individual legal document summaries, generated in the “Map” step, and combines them into a single, cohesive text string. 
    • By consolidating the summaries, it prepares the data for further processing in the “Reduce” step. The resulting combined document string is assigned the variable name “doc_summaries.” 

    Reduce Documents Chain (reduce_documents_chain): 

    • This chain represents the final phase of the summarization process. Its primary function is to take the combined document string from the combine_documents_chain and perform in-depth reduction and summarization. 
    • To address potential issues related to token limits (where documents may exceed a certain token count), this chain offers a clever solution. It can recursively collapse or compress lengthy documents into smaller, more manageable chunks. 
    • This ensures that the summarization process remains efficient and avoids token limit constraints. The maximum token limit for each chunk is set at 5,000 tokens, helping control the size of the summarization output. 

    Map-Reduce Documents Chain (map_reduce_chain): 

    • This chain follows the well-known MapReduce paradigm, a framework often used in distributed computing for processing and generating large datasets. In the “Map” step, it employs the map_chain to process each individual legal document. 
    • This results in initial document summaries. In the subsequent “Reduce” step, the chain uses the reduce_documents_chain to consolidate these initial summaries into a final, comprehensive document summary. 
    • The summarization result, representing the distilled insights from the legal documents, is stored in the variable named “docs” within the LLM chain. 

    Step 8: Summarization Function

    def summarize_pdf(file_path):
        split_docs = text_splitter.split_documents(chunks(file_path))
        return map_reduce_chain.run(split_docs)
    
    result_summary = summarize_pdf(file_path)
    print(result_summary)

    Our summarization process centers around the ‘summarize_pdf’ function. This function takes a PDF file path as input and follows a two-step approach. 

    First, it splits the PDF into manageable sections using the ‘text_splitter’ module. Then, it runs the ‘map_reduce_chain,’ which handles the summarization process. 

    By providing the PDF file path as input, you can easily generate a concise summary of the legal document within the Google Colab environment, thanks to LangChain and LLM.

    Output

    1. Original Document – https://www.safetyforward.com/docs/legal.pdf

    This document is about not using mobile phones while driving a motor vehicle and prohibits disabling its motion restriction features.

    Summarization –

    2. Original Document – https://static.abhibus.com/ks/pdf/Loan-Agreement.pdf

    India and the International Bank for Reconstruction and Development have formed an agreement for the Sustainable Urban Transport Project, focusing on sustainable transportation while adhering to anti-corruption guidelines.

    Summarization –

    Limitations:

    Complex Legal Terminology: 

    LLMs may struggle with accurately summarizing documents containing intricate legal terminology, which requires domain-specific knowledge to interpret correctly. 

    Loss of Context: 

    Summarization processes, especially in lengthy legal documents, may result in the loss of important contextual details, potentially affecting the comprehensiveness of the summaries. 

    Inherent Bias: 

    LLMs can inadvertently introduce bias into summaries based on the biases present in their training data. This is a critical concern when dealing with legal documents that require impartiality. 

    Document Structure: 

    Summarization models might not always understand the hierarchical or structural elements of legal documents, making it challenging to generate summaries that reflect the intended structure.

    Limited Abstraction: 

    LLMs excel at generating detailed summaries, but they may struggle with abstracting complex legal arguments, which is essential for high-level understanding.

    Conclusion:

    • In a nutshell, this project uses LangChain and OpenAI’s LLM to bring in a fresh way of summarizing legal documents. This collaboration makes legal document management more accurate and efficient.
    • However, we faced some big challenges, like handling lots of legal documents and dealing with AI bias. As we move forward, we need to find new ways to make our automated summarization even better and meet the demands of the legal profession.
    • In the future, we’re committed to improving our approach. We’ll focus on fine-tuning algorithms for more accuracy and exploring new techniques, like combining different methods, to keep enhancing legal document summarization. Our aim is to meet the ever-growing needs of the legal profession.

  • Apache Flink – A Solution for Real-Time Analytics

    In today’s world, data is being generated at an unprecedented rate. Every click, every tap, every swipe, every tweet, every post, every like, every share, every search, and every view generates a trail of data. Businesses are struggling to keep up with the speed and volume of this data, and traditional batch-processing systems cannot handle the scale and complexity of this data in real-time.

    This is where streaming analytics comes into play, providing faster insights and more timely decision-making. Streaming analytics is particularly useful for scenarios that require quick reactions to events, such as financial fraud detection or IoT data processing. It can handle large volumes of data and provide continuous monitoring and alerts in real-time, allowing for immediate action to be taken when necessary.

    Stream processing or real-time analytics is a method of analyzing and processing data as it is generated, rather than in batches. It allows for faster insights and more timely decision-making. Popular open-source stream processing engines include Apache Flink, Apache Spark Streaming, and Apache Kafka Streams. In this blog, we are going to talk about Apache Flink and its fundamentals and how it can be useful for streaming analytics. 

    Introduction

    Apache Flink is an open-source stream processing framework first introduced in 2014. Flink has been designed to process large amounts of data streams in real-time, and it supports both batch and stream processing. It is built on top of the Java Virtual Machine (JVM) and is written in Java and Scala.

    Flink is a distributed system that can run on a cluster of machines, and it has been designed to be highly available, fault-tolerant, and scalable. It supports a wide range of data sources and provides a unified API for batch and stream processing, which makes it easy to build complex data processing applications.

    Advantages of Apache Flink

    Real-time analytics is the process of analyzing data as it is generated. It requires a system that can handle large volumes of data in real-time and provide insights into the data as soon as possible. Apache Flink has been designed to meet these requirements and has several advantages over other real-time data processing systems.

    1. Low Latency: Flink processes data streams in real-time, which means it can provide insights into the data almost immediately. This makes it an ideal solution for applications that require low latency, such as fraud detection and real-time recommendations.
    2. High Throughput: Flink has been designed to handle large volumes of data and can scale horizontally to handle increasing volumes of data. This makes it an ideal solution for applications that require high throughput, such as log processing and IoT applications.
    3. Flexible Windowing: Flink provides a flexible windowing API that enables the creation of complex windows for processing data streams. This enables the creation of windows based on time, count, or custom triggers, which makes it easy to create complex data processing applications.
    4. Fault Tolerance: Flink is designed to be highly available and fault-tolerant. It can recover from failures quickly and can continue processing data even if some of the nodes in the cluster fail.
    5. Compatibility: Flink is compatible with a wide range of data sources, including Kafka, Hadoop, and Elasticsearch. This makes it easy to integrate with existing data processing systems.

    Flink Architecture

    Apache Flink processes data streams in a distributed manner. The Flink cluster consists of several nodes, each of which is responsible for processing a portion of the data. The nodes communicate with each other over the network, while data typically enters and leaves the cluster through messaging systems such as Apache Kafka.

    The Flink cluster processes data streams in parallel by dividing the data into small chunks, or partitions, and processing them independently. Each partition is processed by a task, which is a unit of work that runs on a node in the cluster.

    Flink provides several APIs for building data processing applications, including the DataStream API, the DataSet API, and the Table API. The below diagram illustrates what a Flink cluster looks like.

    Apache Flink Cluster
    • A Flink application runs on a cluster.
    • A Flink cluster has a job manager and a bunch of task managers.
    • A job manager is responsible for effective allocation and management of computing resources. 
    • Task managers are responsible for the execution of a job.

    Flink Job Execution

    1. Client system submits job graph to the job manager
    • A client system prepares and sends a dataflow/job graph to the job manager.
    • It can be your Java/Scala/Python Flink application or the CLI.
    • The client is not part of the runtime and program execution.
    • After submitting the job, the client can either disconnect and operate in detached mode or remain connected to receive progress reports in attached mode.

    Given below is an illustration of what the job graph converted from the code looks like:

    Job Graph
    2. The job graph is converted to an execution graph by the job manager
    • The execution graph is a parallel version of the job graph. 
    • For each job vertex, it contains an execution vertex per parallel subtask. 
    • An operator that exhibits a parallelism level of 100 will consist of a single job vertex and 100 execution vertices.

    Given below is an illustration of what an execution graph looks like:

    Execution Graph
    3. Job manager submits the parallel instances of execution graph to task managers
    • Execution resources in Flink are defined through task slots. 
    • Each task manager will have one or more task slots, each of which can run one pipeline of parallel tasks. 
    • A pipeline consists of multiple successive tasks.
    Parallel instances of execution graph being submitted to task slots

    Flink Program

    Flink programs look like regular programs that transform DataStreams. Each program consists of the same basic parts:

    • Obtain an execution environment 

    ExecutionEnvironment is the context in which a program is executed. This is how the execution environment is set up in Flink code:

    ExecutionEnvironment env = ExecutionEnvironment.createLocalEnvironment(); // if program is running on local machine
    ExecutionEnvironment env = new CollectionEnvironment(); // if source is collections
    ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment(); // will do the right thing based on context

    • Connect to data stream

    We can use an instance of the execution environment to connect to the data source, which can be a file system, a streaming application, or a collection. This is how we can connect to a data source in Flink:

    DataSet<String> data = env.readTextFile("file:///path/to/file"); // to read from file
    DataSet<User> users = env.fromCollection( /* get elements from a Java Collection */); // to read from collections
    DataSet<User> users = env.addSource(/*streaming application or database*/);

    • Perform Transformations

    We can perform transformations on the events/data that we receive from the data sources.
    A few of the data transformation operations are map, filter, keyBy, flatMap, etc.

    • Specify where to send the data

    Once we have performed the transformation/analytics on the data that is flowing through the stream, we can specify where we will send the data.
    The destination can be a filesystem, database, or data streams.

     dataStream.sinkTo(/*streaming application or database api */);

    Flink Transformations

    1. Map: Takes one element at a time from the stream and performs some transformation on it, and gives one element of any type as an output.

      Given below is an example of Flink’s map operator:

    stream.map(new MapFunction<Integer, String>() {
        @Override
        public String map(Integer integer) {
            // numberToWords converts the number to its word representation
            return " input -> " + integer + " : "
                    + " output -> " + numberToWords(integer.toString().toCharArray());
        }
    }).print();

    2. Filter: Evaluates a boolean function for each element and retains those for which the function returns true.

    Given below is an example of Flink’s filter operator:

    stream.filter(new FilterFunction<Integer>() {
        @Override
        public boolean filter(Integer integer) throws Exception {
            return integer % 2 != 0; // keep odd numbers only
        }
    }).print();

    3. Reduce: A “rolling” reduce on a keyed data stream. Combines the current element with the last reduced value and emits the new value.

    Given below is an example of Flink’s reduce operator:

    DataStream<Integer> stream = env.fromCollection(data);
    stream.countWindowAll(3)
          .reduce(new ReduceFunction<Integer>() {
              @Override
              public Integer reduce(Integer integer, Integer t1) throws Exception {
                  return integer + t1; // running sum of the elements in the window
              }
          }).print();

    Input : 

    Output : 

    4. KeyBy:
    • Logically partitions a stream into disjoint partitions. 
    • All records with the same key are assigned to the same partition. 
    • Internally, keyBy() is implemented with hash partitioning.

    The figure below illustrates how key by operator works in Flink.

    Fault Tolerance

    • Flink combines stream replay and checkpointing to achieve fault tolerance. 
    • At a checkpoint, each operator’s corresponding state and the specific point in each input stream are marked.
    • Whenever checkpointing is done, a snapshot of the state of all the operators is saved in the state backend, which is generally the job manager’s memory or configurable durable storage.
    • Whenever there is a failure, operators are reset to the most recent state in the state backend, and event processing is resumed.

    Checkpointing

    • Checkpointing is implemented using stream barriers.
    • Barriers are injected into the data stream at the source, e.g., Kafka, Kinesis, etc.
    • Barriers flow with the records as part of the data stream.

    Refer below diagram to understand how checkpoint barriers flow with the events:

    Checkpoint Barriers
    Saving Snapshots
    • Operators snapshot their state at the point in time when they have received all snapshot barriers from their input streams, and before emitting the barriers to their output streams.
    • Once a sink operator (the end of a streaming DAG) has received the barrier n from all of its input streams, it acknowledges that snapshot n to the checkpoint coordinator. 
    • After all sinks have acknowledged a snapshot, it is considered completed.

    The below diagram illustrates how checkpointing is achieved in Flink with the help of barrier events, state backends, and checkpoint table.

    Checkpointing

    Recovery

    • Flink selects the latest completed checkpoint upon failure. 
    • The system then re-deploys the entire distributed dataflow.
    • Gives each operator the state that was snapshotted as part of the checkpoint.
    • The sources are set to start reading the stream from the position given in the snapshot.
    • For example, in Apache Kafka, that means telling the consumer to start fetching messages from an offset given in the snapshot.

    Scalability  

    A Flink job can be scaled up and scaled down as per requirement.

    This can be done manually by:

    1. Triggering a savepoint (manually triggered checkpoint)
    2. Adding/Removing nodes to/from the cluster
    3. Restarting the job from savepoint

    OR 

    Automatically by Reactive Scaling

    • The configuration of a job in Reactive Mode ensures that it utilizes all available resources in the cluster at all times.
    • Adding a Task Manager will scale up your job, and removing resources will scale it down. 
    • Reactive Mode restarts a job on a rescaling event, restoring it from the latest completed checkpoint.
    • The only downside is that it works only in standalone mode.

    Alternatives  

    • Spark Streaming: It is an open-source distributed computing engine that has added streaming capabilities, but Flink is optimized for low-latency processing of real-time data streams and supports more complex processing scenarios.
    • Apache Storm: It is another open-source stream processing system that has a steeper learning curve than Flink and uses a different architecture based on spouts and bolts.
    • Apache Kafka Streams: It is a lightweight stream processing library built on top of Kafka, but it is not as feature-rich as Flink or Spark, and is better suited for simpler stream processing tasks.

    Conclusion  

    In conclusion, Apache Flink is a powerful solution for real-time analytics. With its ability to process data in real-time and support for streaming data sources, it enables businesses to make data-driven decisions with minimal delay. The Flink ecosystem also provides a variety of tools and libraries that make it easy for developers to build scalable and fault-tolerant data processing pipelines.

    One of the key advantages of Apache Flink is its support for event-time processing, which allows it to handle delayed or out-of-order data in a way that accurately reflects the sequence of events. This makes it particularly useful for use cases such as fraud detection, where timely and accurate data processing is critical.

    Additionally, Flink’s support for multiple programming languages, including Java, Scala, and Python, makes it accessible to a broad range of developers. And with its seamless integration with popular big data platforms like Hadoop and Apache Kafka, it is easy to incorporate Flink into existing data infrastructure.

    In summary, Apache Flink is a powerful and flexible solution for real-time analytics, capable of handling a wide range of use cases and delivering timely insights that drive business value.

  • An Introduction to Stream Processing & Analytics

    What is Stream Processing and Analytics?

    Stream processing is a technology used to process large amounts of data in real-time as it is generated rather than storing it and processing it later.

    Think of it like a conveyor belt in a factory. The conveyor belt constantly moves, bringing in new products that need to be processed. Similarly, stream processing deals with data that is constantly flowing, like a stream of water. Just like the factory worker needs to process each product as it moves along the conveyor belt, stream processing technology processes each piece of data as it arrives.

    Stateful and stateless processing are two different approaches to stream processing, and the right choice depends on the specific requirements and needs of the application. 

    Stateful processing is useful in scenarios where the processing of an event or data point depends on the state of previous events or data points. For example, it can be used to maintain a running total or average across multiple events or data points.

    Stateless processing, on the other hand, is useful in scenarios where the processing of an event or data point does not depend on the state of previous events or data points. For example, in a simple data transformation application, stateless processing can be used to transform each event or data point independently without the need to maintain state.
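
    As a toy, framework-agnostic illustration of the difference (the event format and values are made up for the example):

    # Stateless: each event is transformed entirely on its own
    def to_celsius(event: dict) -> dict:
        return {**event, "temp_c": round((event["temp_f"] - 32) * 5 / 9, 2)}

    # Stateful: the running average depends on every event seen so far
    class RunningAverage:
        def __init__(self):
            self.count = 0
            self.total = 0.0

        def update(self, value: float) -> float:
            self.count += 1
            self.total += value
            return self.total / self.count

    avg = RunningAverage()
    for reading in [70.0, 72.0, 68.0]:               # a tiny stream of temperature events
        event = to_celsius({"temp_f": reading})      # stateless step
        print(event["temp_c"], avg.update(event["temp_c"]))  # stateful step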

    Streaming analytics refers to the process of analyzing and processing data in real time as it is generated. Streaming analytics enable applications to react to events and make decisions in near real time.

    Why Stream Processing and Analytics?

    Stream processing is important because it allows organizations to make real-time decisions based on the data they are receiving. This is particularly useful in situations where timely information is critical, such as in financial transactions, network security, and real-time monitoring of industrial processes.

    For example, in financial trading, stream processing can be used to analyze stock market data in real time and make split-second decisions to buy or sell stocks. In network security, it can be used to detect and respond to cyber-attacks in real time. And in industrial processes, it can be used to monitor production line efficiency and quickly identify and resolve any issues.

    Stream processing is also important because it can process massive amounts of data, making it ideal for big data applications. With the growth of the Internet of Things (IoT), the amount of data being generated is growing rapidly, and stream processing provides a way to process this data in real time and derive valuable insights.

    Collectively, stream processing provides organizations with the ability to make real-time decisions based on the data they are receiving, allowing them to respond quickly to changing conditions and improve their operations.

    How is it different from Batch Processing?

    Batch Data Processing:

    Batch Data Processing is a method of processing where a group of transactions or data is collected over a period of time and is then processed all at once in a “batch”. The process begins with the extraction of data from its sources, such as IoT devices or web/application logs. This data is then transformed and integrated into a data warehouse. The process is generally called the Extract, Transform, Load (ETL) process. The data warehouse is then used as the foundation for an analytical layer, which is where the data is analyzed, and insights are generated.

    Stream/Real-time Data Processing:

    Real-Time Data Streaming involves the continuous flow of data that is generated in real time, typically from multiple sources such as IoT devices or web/application logs. A message broker is used to manage the flow of data between the stream processors, the analytical layer, and the data sink. The message broker ensures that the data is delivered in the correct order and that it is not lost. Stream processors are used to perform data ingestion and processing: they take in the data streams and process them in real time. The processed data is then sent to an analytical layer, where it is analyzed and insights are generated.

    Processes involved in Stream processing and Analytics:

    The process of stream processing can be broken down into the following steps:

    • Data Collection: The first step in stream processing is collecting data from various sources, such as sensors, social media, and transactional systems. The data is then fed into a stream processing system in real time.
    • Data Ingestion: Once the data is collected, it needs to be ingested or taken into the stream processing system. This involves converting the data into a standard format that can be processed by the system.
    • Data Processing: The next step is to process the data as it arrives. This involves applying various processing algorithms and rules to the data, such as filtering, aggregating, and transforming the data. The processing algorithms can be applied to individual events in the stream or to the entire stream of data.
    • Data Storage: After the data has been processed, it is stored in a database or data warehouse for later analysis. The storage can be configured to retain the data for a specific amount of time or to retain all the data.
    • Data Analysis: The final step is to analyze the processed data and derive insights from it. This can be done using data visualization tools or by running reports and queries on the stored data. The insights can be used to make informed decisions or to trigger actions, such as sending notifications or triggering alerts.

    It’s important to note that stream processing is an ongoing process, with data constantly being collected, processed, and analyzed in real time. The visual representation of this process can be represented as a continuous cycle of data flowing through the system, being processed and analyzed at each step along the way.
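
    As a rough illustration of this cycle (an illustration only, not a production design), a stream pipeline can be thought of as a chain of Python generators, where each stage consumes events as soon as the previous stage emits them:

    def collect():                      # Data Collection: events arrive continuously
        for i in range(5):
            yield {"device_id": i % 2, "reading": 10.0 * i}

    def ingest(events):                 # Data Ingestion: normalize into a standard format
        for e in events:
            yield {"device": f"device-{e['device_id']}", "value": e["reading"]}

    def process(events, threshold=25):  # Data Processing: filter and enrich
        for e in events:
            if e["value"] >= threshold:
                yield {**e, "alert": True}

    store = []                          # Data Storage: keep processed events for analysis
    for event in process(ingest(collect())):
        store.append(event)

    print(len(store), "alerts stored")  # Data Analysis: query or report on the stored results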

    Stream Processing Platforms & Frameworks:

    Stream Processing Platforms & Tools are software systems that enable the collection, processing, and analysis of real-time data streams.

    Stream Processing Frameworks:

    A stream processing framework is a software library or framework that provides a set of tools and APIs for developers to build custom stream processing applications. Frameworks typically require more development effort and configuration to set up and use. They provide more flexibility and control over the stream processing pipeline but also require more development and maintenance resources. 

    Examples: Apache Spark Streaming, Apache Flink, Apache Beam, Apache Storm, Apache Samza

    Let’s first look into the most commonly used stream processing frameworks: Apache Flink & Apache Spark Streaming.

    Apache Flink : 

    Flink is an open-source, unified stream-processing and batch-processing framework. Flink executes arbitrary dataflow programs in a data-parallel and pipelined manner, making it ideal for processing huge amounts of data in real-time.

    • Flink provides out-of-the-box checkpointing and state management, two features that make it possible to manage enormous amounts of data with relative ease.
    • The event processing function, the filter function, and the mapping function are other features that make handling a large amount of data easy.

    Flink also comes with real-time indicators and alerts, which make a big difference when it comes to data processing and analysis.

    Note: We discuss stream processing and analytics in detail in Stream Processing and Analytics with Apache Flink.
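
    As a quick taste of the programming model, the following is a minimal PyFlink DataStream sketch (assuming a local PyFlink installation; a real job would read from Kafka or another connector rather than an in-memory collection) that keys a stream of events by user and maintains a running count per key:

    from pyflink.datastream import StreamExecutionEnvironment

    env = StreamExecutionEnvironment.get_execution_environment()

    # A small in-memory collection stands in for a real source such as Kafka.
    events = env.from_collection([("user_a", 1), ("user_b", 1), ("user_a", 1)])

    # key_by + reduce maintains per-key state: a running count for each user.
    counts = events \
        .key_by(lambda e: e[0]) \
        .reduce(lambda a, b: (a[0], a[1] + b[1]))

    counts.print()
    env.execute("running_event_count")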

    Apache Spark Streaming : 

    Apache Spark Streaming is a scalable fault-tolerant streaming processing system that natively supports both batch and streaming workloads. Spark Streaming is an extension of the core Spark API that allows data engineers and data scientists to process real-time data.

    • Great for solving complicated transformative logic
    • Easy to program
    • Runs at blazing speeds
    • Processes large volumes of data within a fraction of a second (a minimal example is sketched below)
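
    For a feel of how little code this takes, here is a minimal PySpark Structured Streaming sketch (assuming a local Spark installation; the built-in rate source is used purely as a stand-in for a real stream) that counts events per 10-second window and prints the results to the console:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import window

    spark = SparkSession.builder.appName("streaming-window-count").getOrCreate()

    # The built-in "rate" source emits (timestamp, value) rows, useful for testing.
    stream = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

    # Count events per 10-second event-time window.
    counts = stream.groupBy(window(stream.timestamp, "10 seconds")).count()

    query = counts.writeStream.outputMode("complete").format("console").start()
    query.awaitTermination()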

    Stream Processing Platforms:

    A stream processing platform is an end-to-end solution for processing real-time data streams. Platforms typically require less development effort and maintenance as they provide pre-built tools and functionality for processing, analyzing, and visualizing data. 

    Examples: Apache Kafka, Amazon Kinesis, Google Cloud Pub/Sub

    Let’s look into the most commonly used stream processing platforms: Apache Kafka & AWS Kinesis.

    Apache Kafka: 

    Apache Kafka is an open-source distributed event streaming platform for high-performance data pipelines, streaming analytics, data integration, and mission-critical applications.

    • Because it’s open source, Kafka generally requires a higher skill set to operate and manage, so it’s typically used for development and testing.
    • APIs allow “producers” to publish data streams to “topics”; a “topic” is a partitioned log of records; a “partition” is ordered and immutable; “consumers” subscribe to “topics”.
    • It can run on a cluster of “brokers” with partitions split across cluster nodes.
    • Messages can be effectively unlimited in size (configurable up to roughly 2 GB); a minimal producer/consumer sketch follows this list.
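
    The publish/subscribe flow described above can be sketched with the third-party kafka-python client (against a hypothetical local broker on localhost:9092 and a hypothetical "orders" topic):

    import json
    from kafka import KafkaProducer, KafkaConsumer

    # Producer: publish JSON-encoded events to the "orders" topic.
    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )
    producer.send("orders", {"order_id": 1, "amount": 42.0})
    producer.flush()

    # Consumer: subscribe to the same topic and read records as they arrive.
    consumer = KafkaConsumer(
        "orders",
        bootstrap_servers="localhost:9092",
        auto_offset_reset="earliest",
        value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    )
    for message in consumer:
        print(message.topic, message.partition, message.offset, message.value)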

    AWS Kinesis:

    Amazon Kinesis is a cloud-based service on Amazon Web Services (AWS) that allows you to ingest real-time data such as application logs, website clickstreams, and IoT telemetry data for machine learning and analytics, as well as video and audio. 

    • Amazon Kinesis is a fully managed service, reducing the complexities in the design, build, and manage stages compared to open-source Apache Kafka. It’s ideally suited for building microservices architectures.
    • “Producers” push data onto the stream, and Kinesis breaks the stream across “shards” (which are like partitions).
    • Shards have a hard limit on the number of transactions and data volume per second. If you need more volume, you must subscribe to more shards. You pay for what you use.
    • Most maintenance and configuration is hidden from the user. Scaling is easy (adding shards) compared to Kafka.
    • Maximum message size is 1 MB; a minimal put/get sketch follows this list.
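
    The equivalent put/get flow can be sketched with boto3 (assuming a hypothetical stream named "clickstream" in us-east-1; error handling and shard-iteration loops are omitted for brevity):

    import json
    import boto3

    kinesis = boto3.client("kinesis", region_name="us-east-1")

    # Producer side: push a record onto the stream, routed to a shard by partition key.
    kinesis.put_record(
        StreamName="clickstream",
        Data=json.dumps({"user_id": "u-123", "page": "/pricing"}).encode("utf-8"),
        PartitionKey="u-123",
    )

    # Consumer side: read a batch of records from the first shard.
    shard_id = kinesis.describe_stream(StreamName="clickstream")["StreamDescription"]["Shards"][0]["ShardId"]
    iterator = kinesis.get_shard_iterator(
        StreamName="clickstream", ShardId=shard_id, ShardIteratorType="TRIM_HORIZON"
    )["ShardIterator"]
    for record in kinesis.get_records(ShardIterator=iterator, Limit=10)["Records"]:
        print(json.loads(record["Data"]))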

    Three Characteristics of an Event Stream Processing Platform:

    Publish and Subscribe:

    In a publish-subscribe model, producers publish events or messages to streams or topics, and consumers subscribe to streams or topics to receive the events or messages. This is similar to a message queue or enterprise messaging system. It allows for the decoupling of the producer and consumer, enabling them to operate independently and asynchronously. 

    Store streams of events in a fault-tolerant way

    This means that the platform is able to store and manage events in a reliable and resilient manner, even in the face of failures or errors. To achieve fault tolerance, event streaming platforms typically use a variety of techniques, such as replicating data across multiple nodes, and implementing data recovery and failover mechanisms.

    Process streams of events in real-time, as they occur

    This means that the platform can process and analyze data as it is generated rather than waiting for data to be batch-processed or stored for later processing.

    Challenges when designing the stream processing and analytics solution:

    Stream processing is a powerful technology, but there are also several challenges associated with it, including:

    • Late arriving data: Data that is delayed or arrives out of order can disrupt the processing pipeline and lead to inaccurate results. Stream processing systems need to be able to handle out-of-order data and reconcile it with the data that has already been processed.
    • Missing data: If data is missing or lost, it can impact the accuracy of the processing results. Stream processing systems need to be able to identify missing data and handle it appropriately, whether by skipping it, buffering it, or alerting a human operator.
    • Duplicate data: Duplicate data can lead to over-counting and skewed results. Stream processing systems need to be able to identify and de-duplicate data to ensure accurate results.
    • Data skew: data skew occurs when there is a disproportionate amount of data for certain key fields or time periods. This can lead to performance issues, processing delays, and inaccurate results. Stream processing systems need to be able to handle data skew by load balancing and scaling resources appropriately.
    • Fault tolerance: Stream processing systems need to be able to handle hardware and software failures without disrupting the processing pipeline. This requires fault-tolerant design, redundancy, and failover mechanisms.
    • Data security and privacy: Real-time data processing often involves sensitive data, such as personal information, financial data, or intellectual property. Stream processing systems need to ensure that data is securely transmitted, stored, and processed in compliance with regulatory requirements.
    • Latency: Another challenge with stream processing is latency or the amount of time it takes for data to be processed and analyzed. In many applications, the results of the analysis need to be produced in real-time, which puts pressure on the stream processing system to process the data quickly.
    • Scalability: Stream processing systems must be able to scale to handle large amounts of data as the amount of data being generated continues to grow. This can be a challenge because the systems must be designed to handle data in real-time while also ensuring that the results of the analysis are accurate and reliable.
    • Maintenance: Maintaining a stream processing system can also be challenging, as the systems are complex and require specialized knowledge to operate effectively. In addition, the systems must be able to evolve and adapt to changing requirements over time.

    Despite these challenges, stream processing remains an important technology for organizations that need to process data in real time and make informed decisions based on that data. By understanding these challenges and designing the systems to overcome them, organizations can realize the full potential of stream processing and improve their operations.
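
    To make two of these challenges tangible, here is a small, framework-free Python sketch of how a pipeline might de-duplicate events by ID and flag late arrivals against a simple watermark (real systems such as Flink provide these mechanisms out of the box; this is only a conceptual illustration):

    WATERMARK_LAG = 30  # seconds of lateness tolerated behind the newest event seen

    seen_ids = set()
    max_event_time = 0

    def handle(event):
        global max_event_time
        # Duplicate data: drop events whose ID has already been processed.
        if event["id"] in seen_ids:
            return "duplicate"
        seen_ids.add(event["id"])

        # Late arriving data: compare event time against a watermark.
        max_event_time = max(max_event_time, event["ts"])
        if event["ts"] < max_event_time - WATERMARK_LAG:
            return "late"
        return "ok"

    events = [
        {"id": 1, "ts": 100}, {"id": 2, "ts": 110},
        {"id": 1, "ts": 111},            # duplicate of an already-seen ID
        {"id": 3, "ts": 60},             # arrives 50 seconds behind the watermark
    ]
    print([handle(e) for e in events])   # ['ok', 'ok', 'duplicate', 'late']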

    Key benefits of stream processing and analytics:

    • Real-time processing keeps you in sync all the time:

    For example: Suppose an online retailer uses a distributed system to process orders. The system might include multiple components, such as a web server, a database server, and an application server. Real-time processing keeps these components in sync by processing orders as they are received and updating the database accordingly. As a result, orders are accurate and processed efficiently because the system maintains a consistent view of the data.

    • Real-time data processing is More Accurate and timely:

    For example, a financial trading system that processes data in real time can help ensure that trades are executed at the best possible prices, improving both the accuracy and the timeliness of the trades.

    • Deadlines are met with Real-time processing:

    For example: In a control system, it may be necessary to respond to changing conditions within a certain time frame in order to maintain the stability of the system. 

    • Real-time processing is quite reactive:

    For example, a real-time processing system might be used to monitor a manufacturing process and trigger an alert if it detects a problem or to analyze sensor data from a power plant and adjust the plant’s operation in response to changing conditions.

    • Real-time processing involves multitasking:

    For example, consider a real-time monitoring system that is used to track the performance of a large manufacturing plant. The system might receive data from multiple sensors and sources, including machine sensors, temperature sensors, and video cameras. In this case, the system would need to be able to multitask in order to process and analyze data from all of these sources in real time and to trigger alerts or take other actions as needed. 

    • Real-time processing works independently:

    For example, a real-time processing system may rely on a database or message queue to store and retrieve data, or it may rely on external APIs or services to access additional data or functionality.

    Use case studies:

    There are many real-life examples of stream processing in different industries that demonstrate the benefits of this technology. Here are a few examples:

    • Financial Trading: In the financial industry, stream processing is used to analyze stock market data in real time and make split-second decisions to buy or sell stocks. This allows traders to respond to market conditions in real time and improve their chances of making a profit.
    • Network Security: Stream processing is also used in network security to detect and respond to cyber-attacks in real-time. By processing network data in real time, security systems can quickly identify and respond to threats, reducing the risk of a data breach.
    • Industrial Monitoring: In the industrial sector, stream processing is used to monitor production line efficiency and quickly identify and resolve any issues. For example, it can be used to monitor the performance of machinery and identify any potential problems before they cause a production shutdown.
    • Social Media Analysis: Stream processing is also used to analyze social media data in real time. This allows organizations to monitor brand reputation, track customer sentiment, and respond to customer complaints in real time.
    • Healthcare: In the healthcare industry, stream processing is used to monitor patient data in real time and quickly identify any potential health issues. For example, it can be used to monitor vital signs and alert healthcare providers if a patient’s condition worsens.

    These are just a few examples of the many real-life applications of stream processing. Across all industries, stream processing provides organizations with the ability to process data in real time and make informed decisions based on the data they are receiving.

    How to start stream analytics?

    • Our recommendation when building a dedicated platform is to keep the focus on choosing a versatile stream processor to pair with your existing analytical interface.
    • Or, keep an eye on vendors who offer both stream processing and BI as a service.


    Conclusion:

    Real-time data analysis and decision-making require stream processing and analytics in diverse industries, including finance, healthcare, and e-commerce. Organizations can improve operational efficiency, customer satisfaction, and revenue growth by processing data in real time. A robust infrastructure, skilled personnel, and efficient algorithms are required for stream processing and analytics. Businesses need stream processing and analytics to stay competitive and agile in today’s fast-paced world as data volumes and complexity continue to increase.

  • Cube – An Innovative Framework to Build Embedded Analytics

    Historically, embedded analytics was thought of as an integral part of a comprehensive business intelligence (BI) system. However, when we considered our particular needs, we soon realized something more innovative was necessary. That is when we came across Cube (formerly CubeJS), a powerful platform that could revolutionize how we think about embedded analytics solutions.

    This new way of modularizing analytics solutions means businesses can access the exact services and features they require at any given time without purchasing a comprehensive suite of analytics services, which can often be more expensive and complex than necessary.

    Furthermore, Cube makes it very easy to link up data sources and start to get to grips with analytics, which provides clear and tangible benefits for businesses. This new tool has the potential to be a real game changer in the world of embedded analytics, and we are very excited to explore its potential.

    Understanding Embedded Analytics

    When you read a word like “embedded analytics” or something similar, you probably think of an HTML embed tag or an iFrame tag. This is because analytics was considered a separate application and not part of the SaaS application, so the market had tools specifically for analytics.

    “Embedded analytics is a digital workplace capability where data analysis occurs within a user’s natural workflow, without the need to toggle to another application. Moreover, embedded analytics tends to be narrowly deployed around specific processes such as marketing campaign optimization, sales lead conversions, inventory demand planning, and financial budgeting.” – Gartner

    Embedded Analytics is not just about importing data into an iFrame—it’s all about creating an optimal user experience where the analytics feel like they are an integral part of the native application. To ensure that the user experience is as seamless as possible, great attention must be paid to how the analytics are integrated into the application. This can be done with careful thought to design and by anticipating user needs and ensuring that the analytics are intuitive and easy to use. This way, users can get the most out of their analytics experience.

    Existing Solutions

    With the rising need for SaaS applications and the number of SaaS applications being built daily, analytics must be part of the SaaS application.

    We have identified three different categories of existing solutions available in the market.

    Traditional BI Platforms

    Many tools, such as GoodData, Tableau, Metabase, Looker, and Power BI, are part of the big and traditional BI platforms. Despite their wide range of features and capabilities, these platforms struggle with their big monolithic architectures, limited customization, and less-than-intuitive user interfaces, making them difficult and time-consuming to work with.

    Here are a few reasons these are not suitable for us:

    • They lack customization, and their UI is not intuitive, so they won’t be able to match our UX needs.
    • They charge a hefty amount, which is unsuitable for startups or small-scale companies.
    • They have a big monolith architecture, making integrating with other solutions difficult.

    New Generation Tools

    The next experiment taking place in the market is the introduction of tools such as Hex, Observable, Streamlit, etc. These tools are better suited for embedded needs and customization, but they are designed for developers and data scientists. Although the go-to-market time is shorter, these tools cannot be integrated into SaaS applications.

    Here are a few reasons why these are not suitable for us:

    • They are not suitable for non-technical people and cannot integrate with Software-as-a-Service (SaaS) applications.
    • Since they are mainly built for developers and data scientists, they don’t provide a good user experience.
    • They are not capable of handling multiple data sources simultaneously.
    • They do not provide pre-aggregation and caching solutions.

    In House Tools

    Building everything in-house from scratch, instead of paying other platforms, is possible using API servers and GraphQL. However, there is a catch: the requirements for analytics are not straightforward and demand a lot of expertise to build, creating a big hurdle in adoption and resulting in a longer time-to-market.

    Here are a few reasons why these are not suitable for us:

    • Building everything in-house requires a lot of expertise and time, thus resulting in a longer time to market.
    • It requires developing a secure authentication and authorization system, which adds to the complexity.
    • It requires the development of a caching system to improve the performance of analytics.
    • It requires the development of a real-time system for dynamic dashboards.
    • It requires the development of complex SQL queries to query multiple data sources.

    Typical Analytics Features

    If you want to build analytics features, the typical requirements look like this:

    Multi-Tenancy

    When developing software-as-a-service (SaaS) applications, it is often necessary to incorporate multi-tenancy into the architecture. This means multiple users will be accessing the same software application, but with a unique and individualized experience. To guarantee that this experience is not compromised, it is essential to ensure that the same multi-tenancy principles are carried over into the analytics solution that you are integrating into your SaaS application. It is important to remember that this will require additional configuration and setup on your part to ensure that all of your users have access to the same level of tools and insights.

    Intuitive Charts

    If you look at some of the available analytics tools, they may have good charting features, but they may not be able to meet your specific UX needs. In today’s world, many advanced UI libraries and designs are available, which are often far more effective than the charting features of analytics tools. Integrating these solutions could help you create a more user-friendly experience tailored specifically to your business requirements.

    Security

    You want to have authentication and authorization for your analytics so that managers can get an overview of the analytics for their entire team, while individual users can only see their own analytics. Furthermore, you may want to grant users with certain roles access to certain analytics charts and other data to better understand how their team is performing. To ensure that your analytics are secure and that only the right people have access to the right information, it is vital to set up an authentication and authorization system.

    Caching

    Caching is an incredibly powerful tool for improving the performance and economics of serving your analytics. By implementing a good caching solution, you can see drastic improvements in the speed and efficiency of your analytics, while also providing an improved user experience. Additionally, the cost savings associated with this approach can be quite significant, providing you with a greater return on investment. Caching can be implemented in various ways, but the most effective approaches are tailored to the specific needs of your analytics. By leveraging the right caching solutions, you can maximize the benefits of your analytics and ensure that your users have an optimized experience.

    Real-time

    Nowadays, every successful SaaS company understands the importance of having dynamic and real-time dashboards; these dashboards provide users with the ability to access the latest data without requiring them to refresh the tab each and every time. By having real-time dashboards, companies can ensure their customers have access to the latest information, which can help them make more informed decisions. This is why it is becoming increasingly important for SaaS organizations to invest in robust, low-latency dashboard solutions that can deliver accurate, up-to-date data to their customers.

    Drilldowns

    Drilldown is an incredibly powerful analytics capability that enables users to rapidly transition from an aggregated, top-level overview of their data to a more granular, in-depth view. This can be achieved simply by clicking on a metric within a dashboard or report. With drill-down, users can gain a greater understanding of the data by uncovering deeper insights, allowing them to more effectively evaluate the data and gain a more accurate understanding of their data trends.

    Data Sources

    With the prevalence of software as a service (SaaS) applications, there could be a range of different data sources used, including PostgreSQL, DynamoDB, and other types of databases. As such, it is important for analytics solutions to be capable of accommodating multiple data sources at once to provide the most comprehensive insights. By leveraging the various sources of information, in conjunction with advanced analytics, businesses can gain a thorough understanding of their customers, as well as trends and behaviors. Additionally, accessing and combining data from multiple sources can allow for more precise predictions and recommendations, thereby optimizing the customer experience and improving overall performance.

    Budget

    Pricing is one of the most vital aspects to consider when selecting an analytics tool. There are various pricing models, such as AWS QuickSight’s, which can be quite complex, or per-user costs, which can be very expensive for larger organizations. Additionally, there is custom pricing, which requires you to contact customer care to get the right pricing; this can be quite a difficult process and may create a big barrier to adoption. Ultimately, it is important to understand the different pricing models available and how they may affect your budget before selecting an analytics tool.

    After examining all the requirements, we came across a solution like Cube, which is an innovative solution with the following features:

    • Open Source: Since it is open source, you can easily do a proof-of-concept (POC) and get good support, as any vulnerabilities will be fixed quickly.
    • Modular Architecture: It can provide good customizations, such as using Cube to use any custom charting library you prefer in your current framework.
    • Embedded Analytics-as-a-Code: You can easily replicate your analytics and version control it, as Cube is analytics in the form of code.
    • Cloud Deployments: It is a new-age tool, so it comes with good support for Docker and Kubernetes (K8s). Therefore, you can easily deploy it on the cloud.

    Cube Architecture

    Let’s look at the Cube architecture to understand why Cube is an innovative solution.

    • Cube supports multiple data sources simultaneously; your data may be stored in Postgres, Snowflake, and Redshift, and you can connect to all of them simultaneously. Additionally, they have a long list of data sources they can support.
    • Cube provides analytics over a REST API; very few analytics solutions provide chart data or metrics over REST APIs (a minimal query example follows this list).
    • The security you might be using for your application can easily be mirrored for Cube. This helps simplify the security aspects, as you don’t need to maintain multiple tokens for the app and analytics tool.
    • Cube provides a unique way to model your data in JSON format; it’s more similar to an ORM. You don’t need to write complex SQL queries; once you model your data, Cube will generate the SQL to query the data source.
    • Cube has very good pre-aggregation and caching solutions.
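
    As a rough sketch of what querying that REST API looks like from application code (assuming a local Cube deployment on port 4000, the Orders cube defined later in this post, and a JWT signed with your CUBEJS_API_SECRET; Python and the requests package are used purely for illustration):

    import json
    import requests

    CUBE_API_URL = "http://localhost:4000/cubejs-api/v1/load"
    CUBE_API_TOKEN = "<jwt-signed-with-your-CUBEJS_API_SECRET>"  # placeholder token

    # A Cube query: measures and dimensions by name, plus a time dimension.
    query = {
        "measures": ["Orders.count"],
        "dimensions": ["Orders.status"],
        "timeDimensions": [
            {"dimension": "Orders.createdAt", "granularity": "day", "dateRange": "last 7 days"}
        ],
    }

    response = requests.get(
        CUBE_API_URL,
        headers={"Authorization": CUBE_API_TOKEN},
        params={"query": json.dumps(query)},
    )
    response.raise_for_status()
    for row in response.json()["data"]:
        print(row)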

    Cube Deep Dive

    Let’s look into different concepts that we just saw briefly in the architecture diagram.

    Data Modeling

    Cube

    A cube represents a table of data and is conceptually similar to a view in SQL. It’s like an ORM where you can define a schema, extend it, or define abstract cubes to make code reusable. For example, if you have a Customer table, you need to write a cube for it. Using cubes, you can build analytical queries.

    Each cube contains definitions of measures, dimensions, segments, and joins between cubes. Cube bifurcates columns into measures and dimensions. Similar to tables, every cube can be referenced in another cube. Even though a cube is a table representation, you can choose which columns you want to expose for analytics. You can only add columns you want to expose to analytics; this will translate into SQL for the dimensions and measures you use in the SQL query (Push Down Mechanism).

    cube('Orders', {
      sql: `SELECT * FROM orders`,
    });

    Dimensions

    You can think about a dimension as an attribute related to a measure, for example, the measure userCount. This measure can have different dimensions, such as country, age, occupation, etc.

    Dimensions allow us to further subdivide and analyze the measure, providing a more detailed and comprehensive picture of the data.

    cube('Orders', {
    
      ...,
    
      dimensions: {
        status: {
          sql: `status`,
          type: `string`,
        },
      },
    });
    });

    Measures

    These parameters/SQL columns allow you to define the aggregations for numeric or quantitative data. Measures can be used to perform calculations such as sum, minimum, maximum, average, and count on any set of data.

    Measures also help you define filters if you want to add some conditions for a metric calculation. For example, you can set thresholds to filter out any data that is not within the range of values you are looking for.

    Additionally, measures can be used to create additional metrics, such as the ratio between two different measures or the percentage of a measure. With these powerful tools, you can effectively analyze and interpret your data to gain valuable insights.

    cube('Orders', {
    
      ...,
    
      measures: {
        count: {
          type: `count`,
        },
      },
    });

    Joins

    Joins define the relationships between cubes, which then allows accessing and comparing properties from two or more cubes at the same time. In Cube, all joins are LEFT JOINs. This also allows you to represent one-to-one, many-to-one relationships easily.

    cube('Orders', {
    
      ...,
    
      joins: {
        LineItems: {
          relationship: `belongsTo`,
          // Here we use the `CUBE` global to refer to the current cube,
          // so the following is equivalent to `Orders.id = LineItems.order_id`
          sql: `${CUBE}.id = ${LineItems}.order_id`,
        },
      },
    });

    There are three kinds of join relationships:

    • belongsTo
    • hasOne
    • hasMany

    Segments

    Segments are filters predefined in the schema instead of a Cube query. Segments help pre-build complex filtering logic, simplifying Cube queries and making it easy to re-use common filters across a variety of queries.

    To add a segment that limits results to completed orders, we can do the following:

    cube('Orders', {
      ...,
      segments: {
        onlyCompleted: {
          sql: `${CUBE}.status = 'completed'`,
        },
      },
    });

    Pre-Aggregations

    Pre-aggregations are a powerful way of caching frequently-used, expensive queries and keeping the cache up-to-date periodically. The most popular roll-up pre-aggregation is summarized data of the original cube grouped by any selected dimensions of interest. It works on “measure types” like count, sum, min, max, etc.

    When queries execute over a small dataset, the application works well and delivers responses within acceptable thresholds. However, as the size of the dataset grows, the time-to-response from a user’s perspective can suffer quite heavily. A pre-aggregation specifies attributes from the source cube, which Cube uses to condense (or crunch) the data into a pre-aggregation table; Cube then analyzes incoming queries against the defined set of pre-aggregation rules to choose the optimal one to serve each query. This simple yet powerful optimization can reduce the size of the dataset by several orders of magnitude and ensures that subsequent queries with matching attributes can be served from the same condensed dataset.

    Granularity can also be specified, which defines the granularity of data within the pre-aggregation. If set to week, for example, Cube will pre-aggregate the data by week and persist it to Cube Store.

    Cube can also take care of keeping pre-aggregations up-to-date with the refreshKey property. By default, it is set to every: ‘1 hour’.

    cube('Orders', {
    
      ...,
    
      preAggregations: {
        main: {
          measures: [CUBE.count],
          dimensions: [CUBE.status],
          timeDimension: CUBE.createdAt,
          granularity: 'day',
        },
      },
    });

    Additional Cube Concepts

    Let’s look into some of the additional concepts that Cube provides that make it a very unique solution.

    Caching

    Cube provides a two-level caching system. The first level is in-memory cache, which is active by default. Cube in-memory cache acts as a buffer for your database when there is a burst of requests hitting the same data from multiple concurrent users, while pre-aggregations are designed to provide the right balance between time to insight and querying performance.

    The second level of caching is called pre-aggregations, and requires explicit configuration to activate.

    Drilldowns

    Drilldowns are a powerful feature to facilitate data exploration. They allow you to build an interface that lets users dive deeper into visualizations and data tables. See ResultSet.drillDown() for how to use this feature on the client side.

    A drilldown is defined on the measure level in your data schema. It is defined as a list of dimensions called drill members. Once defined, these drill members will always be used to show underlying data when drilling into that measure.

    Subquery

    You can use subqueries within dimensions to reference measures from other cubes inside a dimension. Under the hood, it behaves as a correlated subquery, but is implemented via joins for optimal performance and portability.

    For example, the following SQL can be written using a subquery in cubes as:

    SELECT
      id,
      (SELECT SUM(amount) FROM deals WHERE deals.sales_manager_id = sales_managers.id) AS deals_amount
    FROM sales_managers
    GROUP BY 1

    Cube Representation

    cube(`Deals`, {
      sql: `SELECT * FROM deals`,
      measures: {
        amount: {
          sql: `amount`,
          type: `sum`,
        },
      },
    });
    
    cube(`SalesManagers`, {
      sql: `SELECT * FROM sales_managers`,
    
      joins: {
        Deals: {
          relationship: `hasMany`,
          sql: `${SalesManagers}.id = ${Deals}.sales_manager_id`,
        },
      },
    
      measures: {
        averageDealAmount: {
          sql: `${dealsAmount}`,
          type: `avg`,
        },
      },
    
      dimensions: {
        dealsAmount: {
          sql: `${Deals.amount}`,
          type: `number`,
          subQuery: true,
        },
      },
    });

    Apart from these, Cube also provides advanced concepts such as Export and Import, Extending Cubes, Data Blending, Dynamic Schema Creation, and Polymorphic Cubes. You can read more about them in the Cube documentation.

    Getting Started with Cube

    Getting started with Cube is very easy. All you need to do is follow the instructions on the Cube documentation page.

    You can use Docker to get started quickly. With Docker, you can install Cube in a few easy steps:

    1. In a new folder for your project, run the following command:

    docker run -p 4000:4000 -p 3000:3000 \
      -v ${PWD}:/cube/conf \
      -e CUBEJS_DEV_MODE=true \
      cubejs/cube

    2. Head to http://localhost:4000 to open Developer Playground.

    The Developer Playground has a database connection wizard that loads when Cube is first started up and no .env file is found. After database credentials have been set up, an .env file will automatically be created and populated with the same credentials.

    Click on the type of database to connect to, and you’ll be able to enter credentials:

    After clicking Apply, you should see available tables from the configured database. Select one to generate a data schema. Once the schema is generated, you can execute queries on the Build tab.

    Conclusion

    Cube is a revolutionary, open-source framework for building embedded analytics applications. It offers a unified API for connecting to any data source, comprehensive visualization libraries, and a data-driven user experience that makes it easy for developers to build interactive applications quickly. With Cube, developers can focus on the application logic and let the framework take care of the data, making it an ideal platform for creating data-driven applications that can be deployed on the web, mobile, and desktop. It is an invaluable tool for any developer interested in building sophisticated analytics applications quickly and easily.

  • Modern Data Stack: The What, Why and How?

    This post will provide you with a comprehensive overview of the modern data stack (MDS), including its benefits, how its components differ from their predecessors, and what its future holds.

    “Modern” has the connotation of being up-to-date, of being better. This is true for MDS, but how exactly is MDS better than what was before?

    What was the data stack like?…

    A few decades back, the MapReduce technological breakthrough made it possible to efficiently process large amounts of data in parallel on multiple machines.

    It provided the backbone of a standard pipeline that looked like:

    It was common to see HDFS used for storage, Spark for compute, and Hive to run SQL queries on top.

    To run this, we had people handling the deployment and maintenance of Hadoop on their own.

    This core attribute of the setup eventually became a pain point and made it complex and inefficient in the long run.

    Being on-prem while facing ever-heavier loads meant scalability became a huge concern.

    Hence, unlike today, the process was much more manual. Adding more RAM, increasing storage, and rolling out updates by hand reduced productivity.

    Moreover,

    • The pipeline wasn’t modular; components were tightly coupled, causing failures when deciding to shift to something new.
    • Teams committed to specific vendors and found themselves locked in, by design, for years.
    • Setup was complex, and the infrastructure was not resilient. Random surges in data crashed the systems. (This randomness in demand has only increased since the early days of the internet, due to social media-triggered virality.)
    • Self-service was non-existent. If you wanted to do anything with your data, you needed data engineers.
    • Observability was a myth. Your pipeline is failing, but you’re unaware, and then you don’t know why, where, or how… Your customers become your testers, often knowing more about your system’s issues than you do.
    • Data protection laws weren’t as formalized, and policies within organizations were often lacking. These issues made the traditional setup inefficient at solving modern problems, and as we all know…

    For an upgraded modern setup, we needed something that is scalable, has a smaller learning curve, and is feasible for both a seed-stage startup and a Fortune 500.

    Standing on the shoulders of tech innovations from the 2000s, data engineers started building a blueprint for MDS tooling with three core attributes: 

    Cloud Native (or the ocean)

    Arguably the definitive change of the MDS era, the cloud removes the hassle of on-prem and enables horizontal or vertical auto-scaling in an era where virality and demand spikes are technical requirements.

    Modularity

    The M in MDS could stand for modular.

    You can integrate any MDS tool into your existing stack, like LEGO blocks.

    You can test out multiple tools, whether they’re open source or managed, choose the best fit, and iteratively build out your data infrastructure.

    This mindset helps instill a habit of avoiding vendor lock-in by continuously upgrading your architecture with relative ease.

    By moving away from the ancient, one-size-fits-all model, MDS recognizes the uniqueness of each company’s budget, domain, data types, and maturity—and provides the correct solution for a given use case.

    Ease of Use

    MDS tools are easier to set up. You can start playing with these tools within a day.

    Importantly, the ease of use is not limited to technical engineers.

    Owing to the rise of self-serve and no-code tools like Tableau, data is finally democratized for all kinds of consumers. SQL remains crucial, but for basic metric calculations, PMs, Sales, Marketing, and others can use simple drag and drop in the UI (sometimes even simpler than Excel pivot tables).

    MDS also enables one to experiment with different architectural frameworks for their use case. For example, ELT vs. ETL (explained under Data Transformation).

    But, one might think such improvements mean MDS is the v1.1 of Data Stack, a tech upgrade that ultimately uses data to solve similar problems.

    Fortunately, that’s far from the case.

    MDS enables data to solve more human problems across the org—problems that employees have long been facing but could never systematically solve for, helping generate much more value from the data.

    Beyond these, employees want transparency and visibility into how any metric was calculated and which data source in Snowflake was used to build which specific Tableau dashboard.

    Critically, with compliance finally being focused on, orgs need solutions for giving the right people the right access at the right time.

    Lastly, as opposed to previous eras, these days even startups have varied infrastructure components with data; if you’re a PM tasked with bringing insights, how do you know where to start? What data assets does the organization have?

    Besides these problem statements being tackled, MDS builds a culture of upskilling employees in various data concepts.

    Data security, governance, and data lineage are important irrespective of department or persona in the organization.

    From designers to support executives, the need for a data-driven culture is a given.

    You’re probably bored of hearing how good the MDS is and want to deconstruct it into its components.

    Let’s dive in.

    SOURCES

    In our modern era, every product is inevitably becoming a tech product.

    From a smart bulb to an orbiting satellite, each generates data in its own unique flavor of frequency of generation, data format, data size, etc.

    Social media, microservices, IoT devices, smart devices, DBs, CRMs, ERPs, flat files, and a lot more…

    INGESTION

    Post creation of data, how does one “ingest” or take in that data for actual usage (the whole point of investing in it)?

    Roughly, there are three categories to help describe the ingestion solutions:

    Generic tools allow us to connect various data sources with data storages.

    E.g.: we can connect Google Ads or Salesforce to dump data into BigQuery or S3.

    These generic tools highlight the modularity and low or no code barrier aspect in MDS.

    Things are as easy as drag and drop, and one doesn’t need to be fluent in scripting.

    Then we have programmable tools as well, where we get more control over how we ingest data through code.

    For example, we can write Apache Airflow DAGs in Python to load data from S3 and dump it to Redshift.
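
    A minimal sketch of such a DAG, assuming the Airflow Amazon provider package is installed and that the bucket name, schema, and connection IDs below are placeholders for your own:

    from datetime import datetime

    from airflow import DAG
    from airflow.providers.amazon.aws.transfers.s3_to_redshift import S3ToRedshiftOperator

    with DAG(
        dag_id="s3_to_redshift_daily",
        start_date=datetime(2023, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        # Copy the day's partition of raw events from S3 into a Redshift table.
        load_events = S3ToRedshiftOperator(
            task_id="load_events",
            schema="analytics",
            table="events",
            s3_bucket="my-raw-events",   # hypothetical bucket
            s3_key="events/{{ ds }}/",   # templated with the execution date
            redshift_conn_id="redshift_default",
            aws_conn_id="aws_default",
            copy_options=["FORMAT AS JSON 'auto'"],
        )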

    Intermediary – these tools cater to a specific use case or are coupled with the source itself.

    E.g. – Snowpipe, a part of Snowflake itself, allows us to load data from files as soon as it’s available at the source.

    DATA STORAGE

    Where do you ingest data into?

    Here, we’ve expanded from HDFS & SQL DBs to a wider variety of formats (NoSQL, document DBs).

    Depending on the use case and the way you interact with data, you can choose from a DW, DB, DL, ObjectStores, etc.

    You might need a standard relational DB for transactions in finance, or you might be collecting logs. You might be experimenting with your product at an early stage and be fine with noSQL without worrying about prescribing schemas.

    One key feature to note is that most are cloud-based. So, no more worrying about scalability, and we pay only for what we use.

    PS: Do stick around till the end for new concepts of Lake House and reverse ETL (already prevalent in the industry).

    DATA TRANSFORMATION

    The stored raw data must be cleaned and restructured into the shape we deem best for actual usage. This slicing and dicing is different for every kind of data.

    For example, we have tools for the E-T-L way, which can be categorized into SaaS and Frameworks, e.g., Fivetran and Spark respectively.

    Interestingly, the cloud era has given storage layers computational capability, such that we sometimes don’t even need an external system for transformation.

    With the rise of ELT, we leverage the processing capabilities of cloud data warehouses or lakehouses. Using tools like dbt, we write templated SQL queries to transform our data in the warehouse or lakehouse itself.

    This enables analysts to take on the heavy lifting of traditional data engineering (DE) problems.

    We also see stream processing, where applications work with “micro” pieces of data processed in real time (analyzed as soon as they are produced, as opposed to large batches).

    DATA VISUALIZATION

    The ability to visually learn from data has only improved in the MDS era with advanced design, methodology, and integration.

    With Embedded analytics, one can integrate analytical capabilities and data visualizations into the software application itself.

    External analytics, on the other hand, are built on top of your processed data. You choose your source, create a chart, and let it run.

    DATA SCIENCE, MACHINE LEARNING, MLOps

    Source: https://medium.com/vertexventures/thinking-data-the-modern-data-stack-d7d59e81e8c6

    In the last decade, we have moved beyond ad-hoc insight generation in Jupyter notebooks to production-ready, real-time ML workflows, like recommendation systems and price predictions. Any startup can and does integrate ML into its products.

    Most cloud service providers offer machine learning models and automated model building as a service.

    MDS concepts like data observability are used to build tools for ML practitioners, whether it’s feature stores (a feature store is a central repository that provides entity feature values as of a certain time) or model monitoring (checking data drift, tracking model performance, and improving model accuracy).

    This is extremely important, as statisticians can focus on the business problem, not the infrastructure.

    This is an ever-expanding field where concepts such as MLOps (DevOps for ML pipelines: optimizing workflows and efficient transformations) and synthetic media (using AI to generate content itself) arrive and quickly become mainstream.

    ChatGPT is the current buzz, but by the time you’re reading this, I’m sure there’s going to be an updated one—such is the pace of development.
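
    To illustrate the “as of a certain time” idea behind the feature stores mentioned above, here is a toy, in-memory sketch of point-in-time feature retrieval (real feature stores such as Feast handle this at scale; this is purely conceptual):

    # Each feature value is stored with the timestamp at which it became valid.
    feature_log = {
        ("user_42", "avg_order_value"): [(100, 18.0), (200, 21.5), (300, 25.0)],
    }

    def get_feature(entity, feature, as_of):
        """Return the latest value whose timestamp is <= as_of (point-in-time correct)."""
        history = feature_log.get((entity, feature), [])
        valid = [value for ts, value in history if ts <= as_of]
        return valid[-1] if valid else None

    # Training a model on events at t=250 must only see feature values known at t=250.
    print(get_feature("user_42", "avg_order_value", as_of=250))  # 21.5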

    DATA ORCHESTRATION

    With a higher number of modularized tools and source systems comes increased complexity.

    More steps, processes, connections, settings, and synchronization are required.

    Data orchestration in MDS needs to be Cron on steroids.

    Using a wide variety of products, MDS tools help bring the right data for the right purposes based on complex logic.

     

    DATA OBSERVABILITY

    Data observability is the ability to monitor and understand the state and behavior of data as it flows through an organization’s systems.

    In a traditional data stack, organizations often rely on reactive approaches to data management, only addressing issues as they arise. In contrast, data observability in an MDS involves adopting a proactive mindset, where organizations actively monitor and understand the state of their data pipelines to identify potential issues before they become critical.

    Monitoring – a dashboard that provides an operational view of your pipeline or system

    Alerting – both for expected events and anomalies 

    Tracking – ability to set and track specific events

    Analysis – automated issue detection that adapts to your pipeline and data health

    Logging – a record of an event in a standardized format for faster resolution

    SLA Tracking – Measure data quality against predefined standards (cost, performance, reliability)

    Data Lineage – graph representation of data assets showing upstream/downstream steps.

    DATA GOVERNANCE & SECURITY

    Data security is a critical consideration for organizations of all sizes and industries and needs to be prioritized to protect sensitive information, ensure compliance, and preserve business continuity. 

    The introduction of stricter data protection regulations, such as the General Data Protection Regulation (GDPR) and CCPA, introduced a huge need in the market for MDS tools, which efficiently and painlessly help organizations govern and secure their data.

    DATA CATALOG

    Now that we have all the components of MDS, from ingestion to BI, we have so many sources, as well as dashboards, reports, views, and other metadata, that we need a Google-like search engine just to navigate our components.

    This is where a data catalog helps; it allows people to stitch together the metadata (data about your data: the number of rows in your table, the column names, types, etc.) across sources.

    This is necessary to help efficiently discover, understand, trust, and collaborate on data assets.

    We don’t want PMs & GTM to look at different dashboards for adoption data.

    Take Netflix as an example. Previously, the sole purpose of its original data pipeline was to aggregate and upload events to Hadoop/Hive for batch processing. Chukwa collected events and wrote them to S3 in Hadoop sequence file format. In those days, end-to-end latency was up to 10 minutes, which was sufficient for batch jobs that usually scan data at daily or hourly frequency.

    With the emergence of Kafka and Elasticsearch over the last decade, there has been a growing demand for real-time analytics at Netflix. By real-time, we mean sub-minute latency. Instead of starting from scratch, Netflix was able to iteratively grow its MDS as market requirements changed.

    Source: https://blog.transform.co/data-talks/the-metric-layer-why-you-need-it-examples-and-how-it-fits-into-your-modern-data-stack/

     

    This is a snapshot of the MDS a data-mature company like Netflix ran some years back, where instead of a few all-in-one tools, each data category was handled by a specialized tool.

    FUTURE COMPONENTS OF MDS?

    DATA MESH

    Source: https://martinfowler.com/articles/data-monolith-to-mesh.html

    The top picture shows how teams currently operate, where no matter the feature or product on the Y axis, the data pipeline’s journey remains the same moving along the X. But in an ideal world of data mesh, those who know the data should own its journey.

    As decentralization is the name of the game, data mesh is MDS’s response to this demand for an architecture shift where domain owners use self-service infrastructure to shape how their data is consumed.

    DATA LAKEHOUSE

    Source: https://www.altexsoft.com/blog/data-lakehouse/

    We have talked about data warehouses and data lakes being used for data storage.

    Initially, when we only needed structured data, data warehouses were used. Later, with big data, we started getting all kinds of data, structured and unstructured.

    So, we started using Data Lakes, where we just dumped everything.

    The lakehouse tries to combine the best of both worlds by adding an intelligent metadata layer on top of the data lake. This layer basically classifies and categorizes data such that it can be interpreted in a structured manner.

    Also, all the data in the lakehouse is open, meaning that it can be utilized by all kinds of tools. Lakehouses are generally built on top of open data formats like Parquet so that they can be easily accessed by all tools.

    End users can simply run their SQLs as if they’re querying a DWH. 

    REVERSE ETL

    Suppose you’re a salesperson using Salesforce and want to know if a lead you just got is warm or cold (warm indicating a higher chance of conversion).

    The attributes of your lead, like salary and age, are fetched from your OLTP system into a DWH and analyzed, and then the flag “warm” is sent back to the Salesforce UI, ready to be used in live operations.

     METRICS LAYER

    The Metric layer will be all about consistency, accessibility, and trust in the calculations of metrics.

    Earlier, for metrics, you had v1 and v1.1 Excel files with the logic scattered around.

    Currently, in the modern data stack world, each team’s calculations are isolated in the tool they use. For example, BI would store metrics in Tableau dashboards, while DEs would define them in code.

    A metric layer would exist to ensure global access to the metrics from every other tool in the data stack.

    For example, the dbt metrics layer helps define these in the warehouse, making them accessible to both BI and engineers. Similarly, Looker, Mode, and others have their own unique approach to it.

    In summary, this blog post discussed the modern data stack and its advantages over older approaches. We examined the components of the modern data stack, including data sources, ingestion, transformation, and more, and how they work together to create an efficient and effective system for data management and analysis. We also highlighted the benefits of the modern data stack, including increased efficiency, scalability, and flexibility. 

    As technology continues to advance, the modern data stack will evolve and incorporate new components and capabilities.