Blog

  • Confluent Kafka vs. Amazon Managed Streaming for Apache Kafka (AWS MSK) vs. on-premise Kafka

    Overview:

In the evolving landscape of distributed streaming platforms, Kafka has become the foundation of real-time data processing. As organizations weigh their deployment options, choosing between Confluent Kafka, AWS MSK, and on-premise Kafka becomes an important decision. This blog provides an overview and comparison of these three Kafka deployment methods to help you decide based on your needs.

    Here’s a list of key bullet points to consider when comparing Confluent Kafka, AWS MSK, and on-premise Kafka:

    1. Deployment and Management
    2. Communication
    3. Scalability
    4. Performance
    5. Cost Considerations
    6. Security

    So, let’s go through all the parameters one by one.

    Deployment and Management: 

    Deployment and management refer to the processes involved in setting up, configuring, and overseeing the operation of a Kafka infrastructure. Each deployment option—Confluent Kafka, AWS MSK, and On-Premise Kafka—has its own approach to these aspects.

    1. Deployment:

    -> Confluent Kafka

    • Confluent Kafka deployment involves setting up the Confluent Platform, which extends the open-source Kafka with additional tools.
    • Users must install and configure Confluent components, such as Confluent Server, Connect, and Control Center.
    • Deployment may vary based on whether it’s on-premise or in the cloud using Confluent Cloud.

    -> AWS MSK

    • AWS MSK streamlines deployment as it is a fully managed service.
    • Users create an MSK cluster through the AWS Management Console or API, specifying configurations like instance types, storage, and networking.
    • AWS handles the underlying infrastructure, automating tasks like scaling, patching, and maintenance.
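To make the console/API step concrete, here is a hedged boto3 sketch. The cluster name, Kafka version, subnet IDs, and security group ID are placeholder assumptions, not values from this post:

```python
# Sketch: provisioning an MSK cluster via the AWS SDK for Python (boto3).
# All names and IDs below are illustrative placeholders.
cluster_request = {
    "ClusterName": "demo-msk-cluster",
    "KafkaVersion": "3.5.1",
    "NumberOfBrokerNodes": 3,  # commonly one broker per Availability Zone
    "BrokerNodeGroupInfo": {
        "InstanceType": "kafka.m5.large",
        "ClientSubnets": ["subnet-aaaa", "subnet-bbbb", "subnet-cccc"],
        "SecurityGroups": ["sg-xxxx"],
        "StorageInfo": {"EbsStorageInfo": {"VolumeSize": 100}},  # GiB per broker
    },
}

def create_msk_cluster(request: dict) -> dict:
    """Submit the request; AWS then provisions, patches, and maintains the brokers."""
    import boto3  # imported here so the sketch can be read without AWS dependencies
    client = boto3.client("kafka")
    return client.create_cluster(**request)

# create_msk_cluster(cluster_request)  # uncomment with valid AWS credentials
```

The same request shape applies whether you click through the console or call the API directly.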

    -> On-Premise Kafka:

    • On-premise deployment requires manual setup and configuration of Kafka brokers, Zookeeper, and other components.
    • Organizations must provision and maintain the necessary hardware and networking infrastructure.
    • Deployment may involve high availability, fault tolerance, and scalability considerations based on organizational needs.

    2. Management:

    -> Confluent Kafka:

    • Confluent provides additional management tools beyond what’s available in open-source Kafka, including Confluent Control Center for monitoring and Confluent Hub for extensions.
    • Users have control over configurations and can leverage Confluent’s tools for streamlining tasks like data integration and schema management.

    -> AWS MSK:

    • AWS MSK offers a managed environment where AWS takes care of routine management tasks.
    • AWS Console and APIs provide tools for monitoring, scaling, and configuring MSK clusters.
    • AWS handles maintenance tasks such as applying patches and updates to the Kafka software.

    -> On-Premise Kafka:

    • On-premise management involves manual oversight of the entire Kafka infrastructure.
    • Organizations have full control but must handle tasks like software updates, monitoring configurations, and addressing any issues that arise.
    • Management may require coordination between IT and operations teams to ensure smooth operation.

    Communication

    Communication spans three main areas: protocols, network configuration, and inter-cluster communication. Let’s look at them one by one.

    1. Protocols

    -> Confluent Kafka:

    • Uses the standard Kafka wire protocol over TCP for communication between Kafka brokers and clients.
    • Confluent components may communicate using REST APIs and other protocols specific to Confluent Platform extensions.

    -> AWS MSK:

    • Relies on standard Kafka protocols for communication between clients and brokers.
    • Also employs protocols like TLS/SSL for secure communication.

    -> On-Premise Kafka:

    • Standard Kafka protocols are used for communication between components, including TCP for broker-client communication.
    • Specific protocols may vary based on the organization’s network configuration.

    2. Network Configuration

    -> Confluent Kafka:

    • Requires network configuration for Kafka brokers and other Confluent components.
    • Configuration may include specifying listener ports, security settings, and inter-component communication settings.

    -> AWS MSK:

    • Network configuration involves setting up Virtual Private Cloud (VPC) settings, security groups, and subnet configurations.
    • AWS MSK integrates with AWS networking services, allowing organizations to define network parameters.

    -> On-Premise Kafka:

    • Network configuration includes defining IP addresses, ports, and firewall rules for Kafka brokers.
    • Organizations have full control over the on-premise network infrastructure, allowing for custom configurations.
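As a minimal illustration, the broker-side network settings on-premise live in server.properties; the host name and ports below are placeholders:

```
# server.properties (excerpt) -- illustrative values
listeners=PLAINTEXT://0.0.0.0:9092,SSL://0.0.0.0:9093
advertised.listeners=PLAINTEXT://kafka1.internal.example.com:9092,SSL://kafka1.internal.example.com:9093
listener.security.protocol.map=PLAINTEXT:PLAINTEXT,SSL:SSL
```

advertised.listeners is the address clients and other brokers use to reach this broker, so it must be reachable through your firewall rules.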

    3. Inter-Cluster Communication:

    -> Confluent Kafka:

    • Confluent components within the same cluster communicate using Kafka protocols.
    • Communication between Confluent Kafka clusters or with external systems may involve additional configurations.

    -> AWS MSK:

    • AWS MSK clusters can communicate with each other within the same VPC or across different VPCs using standard Kafka protocols.
    • AWS networking services facilitate secure and efficient inter-cluster communication.

    -> On-Premise Kafka:

    • Inter-cluster communication on-premise involves configuring network settings to enable communication between Kafka clusters.
    • Organizations have full control over the network architecture for inter-cluster communication.
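For reference, Apache Kafka ships with MirrorMaker 2 for replicating topics between clusters, which applies to all three deployment models. A minimal mm2.properties sketch follows; the cluster aliases and bootstrap addresses are placeholders:

```
# mm2.properties (excerpt) -- illustrative values
clusters = primary, secondary
primary.bootstrap.servers = kafka-a1:9092
secondary.bootstrap.servers = kafka-b1:9092
primary->secondary.enabled = true
primary->secondary.topics = .*
```

This file is run with the connect-mirror-maker.sh script included in the Kafka distribution.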

    Scalability

    Scalability is crucial for ensuring that Kafka deployments can handle varying workloads and grow as demands increase. Whether through vertical or horizontal scaling, the goal is to maintain performance, reliability, and efficiency in the face of changing requirements. Each deployment option—Confluent Kafka, AWS MSK, and on-premise Kafka—provides mechanisms for achieving scalability, with differences in how scaling is implemented and managed.

    -> Confluent Kafka:

    • Horizontal scaling is a fundamental feature of Kafka, allowing for adding more broker nodes to a Kafka cluster to handle increased message throughput.
    • Kafka’s partitioning mechanism allows data to be distributed across multiple broker nodes, enabling parallel processing and improved scalability.
    • Scaling can be achieved dynamically by adding or removing broker nodes, providing flexibility in adapting to varying workloads.

    -> AWS MSK:

    • AWS MSK leverages horizontal scaling by allowing users to adjust the number of broker nodes in a cluster based on demand.
    • The managed service handles the underlying infrastructure and scaling tasks automatically, simplifying the process for users.
    • Scaling with AWS MSK is designed to be seamless, with the service managing the distribution of partitions across broker nodes.

    -> On-Premise Kafka:

    • Scaling Kafka on-premise involves manually adding or removing broker nodes based on the organization’s infrastructure and capacity planning.
    • Organizations need to consider factors such as hardware limitations, network configurations, and load balancing when scaling on-premise Kafka.
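To illustrate why partitioning enables this kind of horizontal scaling across all three options, here is a toy round-robin assignment sketch. It is purely illustrative: Kafka's real assignment logic also accounts for replicas and rack awareness.

```python
def assign_partitions(num_partitions: int, brokers: list[str]) -> dict[int, str]:
    """Spread partitions over brokers round-robin; more brokers => fewer partitions each."""
    return {p: brokers[p % len(brokers)] for p in range(num_partitions)}

# With 6 partitions on 2 brokers, each broker leads 3 partitions...
small = assign_partitions(6, ["broker-1", "broker-2"])
# ...and adding a third broker drops that to 2, increasing parallelism.
large = assign_partitions(6, ["broker-1", "broker-2", "broker-3"])

print(sum(1 for b in small.values() if b == "broker-1"))  # 3
print(sum(1 for b in large.values() if b == "broker-1"))  # 2
```

The same principle underlies scaling in Confluent Kafka, AWS MSK, and on-premise clusters; what differs is who performs the rebalancing work.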

    Performance

    Performance in Kafka, whether it’s Confluent Kafka, AWS MSK (Managed Streaming for Apache Kafka), or on-premise Kafka, is a critical aspect that directly impacts the throughput, latency, and efficiency of the system. Here’s a brief overview of performance considerations for each:

    -> Confluent Kafka

    • Enhancements and Extensions: Confluent Kafka builds upon the open-source Kafka with additional tools and features, potentially providing optimizations and performance enhancements. This may include tools for monitoring, data integration, and schema management.
    • Customization: Organizations using Confluent Kafka have the flexibility to customize configurations and performance settings based on their specific use cases and requirements.
    • Scalability: Confluent Kafka inherits Kafka’s fundamental scalability features, allowing for horizontal scaling by adding more broker nodes to a cluster.

    -> AWS MSK

    • Managed Environment: AWS MSK is a fully managed service, meaning AWS takes care of many operational aspects, including patches, updates, and scaling. This managed environment can contribute to a more streamlined and optimized performance.
    • Scalability: AWS MSK allows for horizontal scaling by adjusting the number of broker nodes in a cluster dynamically. This elasticity contributes to the efficient handling of varying workloads.
    • Integration with AWS Services: Integration with other AWS services can enhance performance in specific scenarios, such as leveraging high-performance storage solutions or integrating with AWS networking services.

    -> On-Premise Kafka

    • Control and Customization: On-premise Kafka deployments provide organizations with complete control over the infrastructure, allowing for fine-tuning and customization to meet specific performance requirements.
    • Scalability Challenges: Scaling on-premise Kafka may present challenges related to manual provisioning of hardware, potential limitations in scalability due to physical constraints, and increased complexity in managing a distributed system.
    • Hardware Considerations: Performance is closely tied to the quality of hardware chosen for on-premise deployment. Organizations must invest in suitable hardware to meet performance expectations.

    Cost Considerations

    Cost considerations for Kafka deployments involve evaluating the direct and indirect expenses associated with setting up, managing, and maintaining a Kafka infrastructure. Kafka cost considerations span licensing, infrastructure, scalability, operations, data transfer, and ongoing support. The choice between Confluent Kafka, AWS MSK, or on-premise Kafka should be made by evaluating these costs in alignment with the organization’s budget, requirements, and preferences.

    Here’s a brief overview of key cost considerations:

    1. Deployment Costs

    -> Confluent Kafka:

    • Involves licensing costs for Confluent Platform, which provides additional tools and features beyond open-source Kafka.
    • Infrastructure costs for hosting Confluent Kafka, whether on-premise or in the cloud.

    -> AWS MSK:

    • Pay-as-you-go pricing model for AWS MSK, where users pay for the resources consumed by the Kafka cluster.
    • Costs include AWS MSK service charges, as well as associated AWS resource costs such as EC2 instances, storage, and data transfer.

    -> On-Premise Kafka:

    • Upfront costs for hardware, networking equipment, and software licenses.
    • Ongoing operational costs, including maintenance, power, cooling, and any required hardware upgrades.

    2. Scalability Costs

    -> Confluent Kafka:

    • Scaling Confluent Kafka may involve additional licensing costs if more resources are needed.
    • Infrastructure costs increase with the addition of more nodes or resources.

    -> AWS MSK:

    • Scaling AWS MSK is dynamic, and users are billed based on the resources consumed during scaling events.
    • Costs may increase with additional broker nodes and associated AWS resources.

    -> On-Premise Kafka:

    • Scaling on-premise Kafka requires manual provisioning of additional hardware, incurring upfront and ongoing costs.
    • Organizations must consider the total cost of ownership (TCO) when planning for scalability.

    3. Operational Costs

    -> Confluent Kafka:

    • Operational costs include manpower for managing and monitoring Confluent Kafka.
    • Costs associated with maintaining Confluent extensions and tools.

    -> AWS MSK:

    • AWS MSK is a managed service, reducing the operational burden on the user.
    • Operational costs may include AWS support charges and personnel for configuring and monitoring the Kafka environment.

    -> On-Premise Kafka:

    • Organizations bear full operational responsibility, incurring staffing, maintenance, monitoring tools, and ongoing support costs.

    4. Data Transfer Costs

    -> Confluent Kafka:

    • Costs associated with data transfer depend on the chosen deployment model (e.g., cloud-based deployments may have data transfer costs).

    -> AWS MSK:

    • Data transfer costs may apply for communication between AWS MSK clusters and other AWS services.

    -> On-Premise Kafka:

    • Data transfer costs may be associated with network usage, especially if there are multiple geographically distributed Kafka clusters.

    The images below show pricing figures for Confluent Kafka and AWS MSK side by side.

    Security

    Security involves implementing measures to protect data, ensure confidentiality, integrity, and availability, and mitigate risks associated with unauthorized access or data breaches. Here’s a brief overview of key security considerations for Kafka deployments:

    1. Authentication:

    -> Confluent Kafka:

    • Supports various authentication mechanisms, including SSL/TLS for encrypted communication and SASL (Simple Authentication and Security Layer) for user authentication.
    • Role-based access control (RBAC) allows administrators to define and manage user roles and permissions.

    -> AWS MSK:

    • Integrates with AWS Identity and Access Management (IAM) for user authentication and authorization.
    • IAM policies control access to AWS MSK resources and actions.

    -> On-Premise Kafka:

    • Authentication mechanisms depend on the chosen Kafka distribution, but SSL/TLS and SASL are common.
    • LDAP or other external authentication systems may be integrated for user management.
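For illustration, a Kafka client connecting to a SASL/SSL-secured cluster typically carries properties like the following; the mechanism, credentials, and truststore path are placeholders:

```
# client.properties (excerpt) -- illustrative values
security.protocol=SASL_SSL
sasl.mechanism=SCRAM-SHA-512
sasl.jaas.config=org.apache.kafka.common.security.scram.ScramLoginModule required \
  username="app-user" password="app-secret";
ssl.truststore.location=/etc/kafka/secrets/client.truststore.jks
```

Equivalent settings exist on the broker side; the mechanism chosen (SCRAM, PLAIN, Kerberos, etc.) must match on both ends.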

    2. Authorization:

    -> Confluent Kafka:

    • RBAC allows fine-grained control over what users and clients can do within the Kafka environment.
    • Access Control Lists (ACLs) can be used to restrict access at the topic level.

    -> AWS MSK:

    • IAM policies define permissions for accessing and managing AWS MSK resources.
    • Fine-grained access control can be enforced using IAM roles and policies.

    -> On-Premise Kafka:

    • ACLs are typically used to specify which users or applications have access to specific Kafka topics or resources.
    • Access controls depend on the Kafka distribution and configuration.
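For instance, topic-level ACLs can be granted with the kafka-acls tool that ships with Apache Kafka; the principal, topic, and group names below are placeholders:

```
# Allow user "alice" to consume from the "orders" topic
$ kafka-acls.sh --bootstrap-server localhost:9092 \
    --add --allow-principal User:alice \
    --operation Read --topic orders --group orders-consumers
```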

    3. Encryption:

    -> Confluent Kafka:

    • Supports encryption in transit using SSL/TLS for secure communication between clients and brokers.
    • Optionally supports encryption at rest for data stored on disks.

    -> AWS MSK:

    • Encrypts data in transit using SSL/TLS for communication between clients and brokers.
    • Provides options for encrypting data at rest using AWS Key Management Service (KMS).

    -> On-Premise Kafka:

    • Encryption configurations depend on the Kafka distribution and may involve configuring SSL/TLS for secure communication and encryption at rest using platform-specific tools.

    Here are some key points to evaluate AWS MSK, Confluent Kafka, and on-premise Kafka, along with their advantages and disadvantages.

    Conclusion

    • The best approach is to evaluate the level of Kafka expertise in-house and the consistency of your workload.
    • On one side of the spectrum, with less expertise and uncertain workloads, Confluent’s breadth of features and support comes out ahead.
    • On the other side, with experienced Kafka engineers and a very predictable workload, you can save money with MSK.
    • For specific requirements and full control, you can go with on-premise Kafka.
  • Streamline Kubernetes Storage Upgrades

    Introduction:

    As technology advances, organizations are constantly seeking ways to optimize their IT infrastructure to enhance performance, reduce costs, and gain a competitive edge. One such approach involves migrating from traditional storage solutions to more advanced options that offer superior performance and cost-effectiveness. 

    In this blog post, we’ll explore a recent project (on Azure) where we successfully migrated our client’s applications from the Premium SSD disk type to Premium SSD v2. This migration delivered performance improvements and cost savings for our client.

    Prerequisites:

    Before initiating this migration, ensure the following prerequisites are in place:

    1. Kubernetes Cluster: Ensure you have a working Kubernetes cluster to host your applications.
    2. Velero Backup Tool: Install Velero, a widely-used backup and restoration tool tailored for Kubernetes environments.

    Overview of Velero:

    Velero stands out as a powerful tool designed for robust backup, restore, and migration solutions within Kubernetes clusters. It plays a crucial role in ensuring data safety and continuity during complex migration operations.

    Refer to the article on Velero installation and configuration.

    Strategic Plan Overview:

    There are two methods for upgrading storage classes:

    • Migration via Velero and CSI Integration: 

    This approach leverages Velero’s capabilities in conjunction with CSI integration to achieve a seamless and efficient migration.

    • Using Cloud Methods: 

    This method involves leveraging cloud provider-specific procedures. It includes steps like taking a snapshot of the disk, creating a new disk from the snapshot, and then establishing a Kubernetes volume using disk referencing. 

    Step-by-Step Guide:

    Migration via Velero and CSI Integration:

    Step 1 : Storage Class for Premium SSD v2

    Define a new storage class that supports Azure Premium SSD v2 disks. This storage class will be used to provision new persistent volumes during the restore process.

    # Azure storage class example
    
    apiVersion: storage.k8s.io/v1
    kind: StorageClass
    metadata:
      name: premium-ssd-v2
    parameters:
      cachingMode: None
      skuName: PremiumV2_LRS # (Disk Type)
    provisioner: disk.csi.azure.com
    reclaimPolicy: Delete
    volumeBindingMode: WaitForFirstConsumer
    allowVolumeExpansion: true

    Step 2: Volume Snapshot Class

    Introduce a Volume Snapshot Class to enable snapshot creation for persistent volumes. This class will be utilized for capturing the current state of persistent volumes before restoring them using Premium SSD v2.

    apiVersion: snapshot.storage.k8s.io/v1
    kind: VolumeSnapshotClass
    metadata:
      name: disk-snapshot-class
    driver: disk.csi.azure.com
    deletionPolicy: Delete
    parameters:
      incremental: "false"

    Step 3: Update Velero Deployment and Daemonset

    Enable CSI (Container Storage Interface) support in both the Velero deployment and the node-agent daemonset. This modification allows Velero to interact with the Cloud Disk CSI driver for provisioning and managing persistent volumes. Additionally, configure the Velero client to utilize the CSI plugin, ensuring that Velero utilizes the Cloud Disk CSI driver for backup and restore operations.

    # Enable CSI Server side features 
    
    $ kubectl -n velero edit deployment/velero
    $ kubectl -n velero edit daemonset/node-agent
    
    # Add the --features=EnableCSI flag in both resources
    
        spec:
          containers:
          - args:
            - server
            - --features=EnableCSI
    
    # Enable client side features 
    
    $ velero client config set features=EnableCSI

    Step 4: Take Velero Backup

    Create a Velero backup of all existing persistent volumes stored on Premium SSD disks. These backups serve as a safety net in case of any unforeseen issues during the migration. You can use the include and exclude flags with the velero backup command to control which resources are captured.

    Reference Article : https://velero.io/docs/v1.12/resource-filtering 

    # run the below command to take a backup
    $ velero backup create backup-name --include-namespaces namespace1

    Step 5: ConfigMap Deployment 

    Deploy a ConfigMap in the Velero namespace. This ConfigMap defines the mapping between the old storage class (Premium SSD) and the new storage class (Premium SSD v2). During the restore process, Velero will use this mapping to recreate the persistent volumes using the new storage class.

    apiVersion: v1
    data:
      # <old-storage-class-name>: <new-storage-class-name>
      managed-premium: premium-ssd-v2
    kind: ConfigMap
    metadata:
      labels:
        velero.io/change-storage-class: RestoreItemAction
        velero.io/plugin-config: ""
      name: storage-class-config
      namespace: velero

    Step 6: Velero Restore Operation

    Initiate the Velero restore process. This will replace the existing persistent volumes with new ones provisioned using Disk Premium SSD v2. The ConfigMap will ensure that the restored persistent volumes utilize the new storage class. 

    Reference article: https://velero.io/docs/v1.12/restore-reference 

    # run the below command to restore from the backup into a different namespace
    $ velero restore create restore-name --from-backup backup-name --namespace-mappings namespace1:namespace2
    # verify the new restored resources in namespace2
    $ kubectl get pvc,pv,pod -n namespace2

    Step 7: Verification & Testing

    Verify that all applications continue to function correctly after the restore process. Check for any performance improvements and cost savings as a result of the migration to Premium SSD v2.

    Step 8: Post-Migration Cleanup

    Remove any temporary resources created during the migration process, such as Volume Snapshots and the custom Volume Snapshot Class. Then delete the old persistent volume claims (PVCs) that were associated with the Premium SSD disks. This will trigger the automatic deletion of the corresponding persistent volumes (PVs) and Azure Disk storage.
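Assuming the resource names used in the earlier steps (the snapshot class from Step 2 and the source namespace from the backup step), the cleanup might look like this; verify each target before deleting:

```
# Remove migration-only snapshot resources
$ kubectl delete volumesnapshot --all -n namespace1
$ kubectl delete volumesnapshotclass disk-snapshot-class

# Delete old PVCs; with a Delete reclaim policy this also removes the PVs and disks
$ kubectl delete pvc old-pvc-name -n namespace1
```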

    Impact:

    This approach is relatively low-risk because all new objects are created while snapshot copies of the originals are retained. When new pods are scheduled, the new Premium SSD v2 disks are provisioned in the same zone as the node where each pod lands. Some downtime should be expected while the content of the new disks is restored from the snapshots; the duration depends on the size of the disks being restored.

    Conclusion:

    Migrating from any storage class to a newer, more performant one using Velero can provide significant benefits for your organization. By leveraging Velero’s comprehensive backup and restore functionalities, you can migrate your applications to the new storage class while maintaining data integrity and application functionality, whether you’re upgrading from Premium SSD to Premium SSD v2 or transitioning to a completely different storage provider. By adopting this approach, organizations can reap the rewards of enhanced performance, reduced costs, and simplified storage management.

  • Vector Search: The New Frontier in Personalized Recommendations

    Introduction

    Imagine you are a modern-day treasure hunter, not in search of hidden gold, but rather the wealth of knowledge and entertainment hidden within the vast digital ocean of content. In this realm, where every conceivable topic has its own sea of content, discovering what will truly captivate you is like finding a needle in an expansive haystack.

    This challenge leads us to the marvels of recommendation services, acting as your compass in this digital expanse. These services are the unsung heroes behind the scenes of your favorite platforms, from e-commerce sites that suggest enticing products to streaming services that understand your movie preferences better than you might yourself. They sift through immense datasets of user interactions and content features, striving to tailor your online experience to be more personalized, engaging, and enriching.

    But what if I told you that there is a cutting-edge technology that can take personalized recommendations to the next level? Today, I will take you through a journey to build a blog recommendation service that understands the contextual similarities between different pieces of content, transcending beyond basic keyword matching. We’ll harness the power of vector search, a technology that’s revolutionizing personalized recommendations. We’ll explore how recommendation services are traditionally implemented, and then briefly discuss how vector search enhances them.

    Finally, we’ll put this knowledge to work, using OpenAI’s embedding API and Elasticsearch to create a recommendation service that not only finds content but also understands and aligns it with your unique interests.

    Exploring the Landscape: Traditional Recommendation Systems and Their Limits

    Traditionally, these digital compasses, or recommendation systems, employ methods like collaborative and content-based filtering. Imagine sitting in a café where the barista suggests a coffee based on what others with similar tastes enjoyed (collaborative filtering) or based on your past coffee choices (content-based filtering). While these methods have been effective in many scenarios, they come with some limitations. They often stumble when faced with the vast and unstructured wilderness of web data, struggling to make sense of the diverse and ever-expanding content landscape. Additionally, when user preferences are ambiguous or when you want to recommend content by truly understanding it on a semantic level, traditional methods may fall short.

    Enhancing Recommendation with Vector Search and Vector Databases

    Our journey now takes an exciting turn with vector search and vector databases, the modern tools that help us navigate this unstructured data. These technologies transform our café into a futuristic spot where your coffee preference is understood on a deeper, more nuanced level.

    Vector Search: The Art of Finding Similarities

    Vector search operates like a seasoned traveler who understands the essence of every place visited. Text, images, or sounds can be transformed into numerical vectors, like unique coordinates on a map. The magic happens when these vectors are compared, revealing hidden similarities and connections, much like discovering that two seemingly different cities share a similar vibe.
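To make "comparing vectors" concrete, here is a minimal cosine-similarity sketch over toy three-dimensional vectors. Real embeddings have hundreds or thousands of dimensions, and the sample values below are made up for illustration:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """1.0 = same direction (very similar), 0.0 = unrelated, -1.0 = opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings": the first two point in similar directions, the third doesn't.
kafka_blog = [0.9, 0.1, 0.2]
streaming_blog = [0.8, 0.2, 0.3]
cooking_blog = [0.1, 0.9, 0.1]

print(cosine_similarity(kafka_blog, streaming_blog) > cosine_similarity(kafka_blog, cooking_blog))  # True
```

This is the same similarity measure Elasticsearch can apply over its dense_vector fields when we query for related blogs later.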

    Vector Databases: Navigating Complex Data Landscapes

    Imagine a vast library of books where each book captures different aspects of a place along with its coordinates. Vector databases are akin to this library, designed to store and navigate these complex data points. They easily handle intricate queries over large datasets, making them perfect for our recommendation service, ensuring no blog worth reading remains undiscovered.

    Embeddings: Semantic Representation

    In our journey, embeddings are akin to a skilled artist who captures not just the visuals but the soul of a landscape. They map items like words or entire documents into real-number vectors, encapsulating their deeper meaning. This helps in understanding and comparing different pieces of content on a semantic level, letting the recommendation service show you things that really match your interests.

    Sample Project: Blog Recommendation Service

    Project Overview

    Now, let’s craft a simple blog recommendation service using OpenAI’s embedding APIs and Elasticsearch as a vector database. The goal is to recommend blogs similar to the current one the user is reading, which can be shown in the read more or recommendation section.

    Our blogs service will be responsible for indexing the blogs, finding similar ones, and interacting with the UI Service.

    Tools and Setup

    We will need the following tools to build our service:

    • OpenAI Account: We will be using OpenAI’s embedding API to generate the embeddings for our blog content. You will need an OpenAI account to use the APIs. Once you have created an account, please create an API key and store it in a secure location.
    • Elasticsearch: A popular database renowned for its full-text search capabilities, which can also be used as a vector database, adept at storing and querying complex embeddings with its dense_vector field type.
    • Docker: A tool that allows developers to package their applications and all the necessary dependencies into containers, ensuring that the application runs smoothly and consistently across different computing environments.
    • Python: A versatile programming language for developers across diverse fields, from web development to data science.

    The APIs will be created using the FastAPI framework, but you can choose any framework.

    Steps

    First, we’ll create a BlogItem class to represent each blog. It has only three fields, which will be enough for this demonstration, but real-world entities would have more details to accommodate a wider range of properties and functionalities.

    class BlogItem:
        blog_id: int
        title: str
        content: str
    
        def __init__(self, blog_id: int, title: str, content: str):
            self.blog_id = blog_id
            self.title = title
            self.content = content

    Elasticsearch Setup:

    • To store the blog data along with its embedding in Elasticsearch, we need to set up a local Elasticsearch cluster and then create an index for our blogs. You can also use a cloud-based version if you have already procured one for personal use.
    • Install Docker or Docker Desktop on your machine and create the Elasticsearch and Kibana containers using the docker compose file below. Run the following command to create and start the services in the background: docker compose -f /path/to/your/docker-compose/file up -d. You can omit the file path if you are in the same directory as your docker-compose.yml file.
    • The advantage of using docker compose is that it lets you clean up these resources with a single command: docker compose -f /path/to/your/docker-compose/file down.
    version: '3'
    
    services:
    
      elasticsearch:
        image: docker.elastic.co/elasticsearch/elasticsearch:<version>
        container_name: elasticsearch
        environment:
          - node.name=docker-cluster
          - discovery.type=single-node
          - cluster.routing.allocation.disk.threshold_enabled=false
          - bootstrap.memory_lock=true
          - "ES_JAVA_OPTS=-Xms512m -Xmx512m"
          - xpack.security.enabled=true
          - ELASTIC_PASSWORD=YourElasticPassword
          - "ELASTICSEARCH_USERNAME=elastic"
        ulimits:
          memlock:
            soft: -1
            hard: -1
        volumes:
          - esdata:/usr/share/elasticsearch/data
        ports:
          - "9200:9200"
        networks:
          - esnet
    
      kibana:
        image: docker.elastic.co/kibana/kibana:<version>
        container_name: kibana
        environment:
          ELASTICSEARCH_HOSTS: http://elasticsearch:9200
          ELASTICSEARCH_USERNAME: elastic
          ELASTICSEARCH_PASSWORD: YourElasticPassword
        ports:
          - "5601:5601"
        networks:
          - esnet
        depends_on:
          - elasticsearch
    
    networks:
      esnet:
        driver: bridge
    
    volumes:
      esdata:
        driver: local

    • Connect to the local ES instance and create an index. Our “blogs” index will have a unique blog ID, blog title, blog content, and an embedding field to store the vector representation of the blog content. The text-embedding-ada-002 model used here produces vectors with 1536 dimensions, so it’s important to declare the same number of dimensions for the embedding field in the blogs index.
    import logging
    from elasticsearch import Elasticsearch
    
    # Setting up logging
    logging.basicConfig(level=logging.INFO, format='%(asctime)s [%(levelname)s] %(filename)s:%(lineno)d - %(message)s',
                        datefmt='%Y-%m-%d %H:%M:%S')
    # Elasticsearch client setup
    es = Elasticsearch(hosts="http://localhost:9200",
                       basic_auth=("elastic", "YourElasticPassword"))
    
    
    # Create an index with a mapping for embeddings
    def create_index(index_name: str, embedding_dimensions: int):
        try:
            es.indices.create(
                index=index_name,
                body={
                    'mappings': {
                        'properties': {
                            'blog_id': {'type': 'long'},
                            'title': {'type': 'text'},
                            'content': {'type': 'text'},
                            'embedding': {'type': 'dense_vector', 'dims': embedding_dimensions}
                        }
                    }
                }
            )
        except Exception as e:
            logging.error(f"An error occurred while creating index {index_name} : {e}")
    
    
    # Sample usage
    create_index("blogs", 1536)

    Create Embeddings and Index Blogs:

    • We use OpenAI’s Embedding API to get a vector representation of our blog title and content. I am using the text-embedding-ada-002 model here, which is recommended by OpenAI for most use cases. The input to text-embedding-ada-002 must not exceed 8,191 tokens (1,000 tokens are roughly equal to 750 words) and cannot be empty.
    from openai import OpenAI, OpenAIError
    
    api_key = 'YourApiKey'
    client = OpenAI(api_key=api_key)
    
    def create_embeddings(text: str, model: str = "text-embedding-ada-002") -> list[float]:
        try:
            text = text.replace("\n", " ")
            response = client.embeddings.create(input=[text], model=model)
            logging.info(f"Embedding created successfully")
            return response.data[0].embedding
        except OpenAIError as e:
            logging.error(f"An OpenAI error occurred while creating embedding : {e}")
            raise
        except Exception as e:
            logging.exception(f"An unexpected error occurred while creating embedding  : {e}")
            raise

    • When the blogs get created or the content of the blog gets updated, we will call the create_embeddings function to get text embedding and store it in our blogs index.
    # Define the index name as a global constant
    
    ELASTICSEARCH_INDEX = 'blogs'
    
    def index_blog(blog_item: BlogItem):
        try:
            es.index(index=ELASTICSEARCH_INDEX, body={
                'blog_id': blog_item.blog_id,
                'title': blog_item.title,
                'content': blog_item.content,
            'embedding': create_embeddings(blog_item.title + "\n" + blog_item.content)
            })
        except Exception as e:
            logging.error(f"Failed to index blog with blog id {blog_item.blog_id} : {e}")
            raise

    • Create a Pydantic model for the request body:
    class BlogItemRequest(BaseModel):
        title: str
        content: str

    • Create an API to save blogs to Elasticsearch. The UI Service would call this API when a new blog post gets created.
    @app.post("/blogs/")
    def save_blog(response: Response, blog_item: BlogItemRequest) -> dict[str, str]:
        # Create a BlogItem instance from the request data
        blog_id = None  # pre-set so the except block can log it even if get_blog_id() fails
        try:
            blog_id = get_blog_id()
            blog_item_obj = BlogItem(
                blog_id=blog_id,
                title=blog_item.title,
                content=blog_item.content,
            )
    
            # Call the index_blog method to index the blog
            index_blog(blog_item_obj)
            return {"message": "Blog indexed successfully", "blog_id": str(blog_id)}
        except Exception as e:
            logging.error(f"An error occurred while indexing blog with blog_id {blog_id} : {e}")
            response.status_code = status.HTTP_500_INTERNAL_SERVER_ERROR
            return {"error": "Failed to index blog"}

    Finding Relevant Blogs:

    • To find blogs that are similar to the current one, we will compare the current blog’s vector representation with other blogs present in the Elasticsearch index using the cosine similarity function.
    • Cosine similarity is a mathematical measure used to determine the cosine of the angle between two vectors in a multi-dimensional space, often employed to assess the similarity between two documents or data points. 
    • The cosine similarity score ranges from -1 to 1. As the cosine similarity score increases from -1 to 1, it indicates an increasing degree of similarity between the vectors. Higher values represent greater similarity.
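    What Elasticsearch's cosineSimilarity function computes can be sketched from scratch in pure Python; the helper below is for intuition only, not the production search path:

    ```python
    import math

    # Cosine similarity: the cosine of the angle between two vectors
    def cosine_similarity(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(x * x for x in b))
        return dot / (norm_a * norm_b)

    # Vectors pointing the same way score ~1.0; orthogonal vectors score 0.0
    print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))  # ~1.0 (same direction)
    print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0 (orthogonal)
    ```

    Embedding vectors that encode similar meanings point in similar directions, which is why this single number works as a semantic similarity score.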
    • Create a custom exception to handle a scenario when a blog for a given ID is not present in Elasticsearch.
    class BlogNotFoundException(Exception):
        def __init__(self, message="Blog not found"):
            self.message = message
            super().__init__(self.message)

    • First, we will check if the current blog is present in the blogs index and get its embedding. This is done to prevent unnecessary calls to Open AI APIs as it consumes tokens. Then, we would construct an Elasticsearch dsl query to find the nearest neighbors and return their blog content.
    def get_blog_embedding(blog_id: int) -> Optional[Dict]:
        try:
            response = es.search(index=ELASTICSEARCH_INDEX, body={
                'query': {
                    'term': {
                        'blog_id': blog_id
                    }
                },
                '_source': ['title', 'content', 'embedding']  # Fetch title, content and embedding
            })
    
            if response['hits']['hits']:
                logging.info(f"Blog found with blog_id {blog_id}")
                return response['hits']['hits'][0]['_source']
            else:
                logging.info(f"No blog found with blog_id {blog_id}")
                return None
        except Exception as e:
            logging.error(f"Error occurred while searching for blog with blog_id {blog_id}: {e}")
            raise
    
    
    def find_similar_blog(current_blog_id: int, num_neighbors=2) -> list[dict[str, str]]:
        try:
            blog_data = get_blog_embedding(current_blog_id)
            if not blog_data:
                raise BlogNotFoundException(f"Blog not found for id:{current_blog_id}")
            blog_embedding = blog_data['embedding']
            if not blog_embedding:
                blog_embedding = create_embeddings(blog_data['title'] + '\n' + blog_data['content'])
            # Find similar blogs using the embedding
            response = es.search(index=ELASTICSEARCH_INDEX, body={
                'size': num_neighbors + 1,  # Retrieve extra result as we'll exclude the current blog
                '_source': ['title', 'content', 'blog_id'],
                'query': {
                    'bool': {
                        'must': {
                            'script_score': {
                                'query': {'match_all': {}},
                                'script': {
                                    # script_score must not be negative, so shift the
                                    # cosine similarity (range -1 to 1) up by 1.0
                                    'source': "cosineSimilarity(params.query_vector, 'embedding') + 1.0",
                                    'params': {'query_vector': blog_embedding}
                                }
                            }
                        },
                        'must_not': {
                            'term': {
                                'blog_id': current_blog_id  # Exclude the current blog
                            }
                        }
                    }
                }
            })
    
            # Extract and return the hits (undoing the +1.0 shift before reporting)
            hits = [
                {
                    'title': hit['_source']['title'],
                    'content': hit['_source']['content'],
                    'blog_id': hit['_source']['blog_id'],
                    'score': f"{(hit['_score'] - 1.0) * 100:.2f}%"
                }
                for hit in response['hits']['hits']
                if hit['_source']['blog_id'] != current_blog_id
            ]
    
            return hits
        except Exception as e:
            logging.error(f"An error occurred while finding similar blogs: {e}")
            raise

    • Define a Pydantic model for the response:
    class BlogRecommendation(BaseModel):
        blog_id: int
        title: str
        content: str
        score: str

    • Create an API that would be used by UI Service to find similar blogs as the current one user is reading:
    @app.get("/recommend-blogs/{current_blog_id}")
    def recommend_blogs(
            response: Response,
            current_blog_id: int,
            num_neighbors: Optional[int] = 2) -> Union[Dict[str, str], List[BlogRecommendation]]:
        try:
            # Call the find_similar_blog function to get recommended blogs
            recommended_blogs = find_similar_blog(current_blog_id, num_neighbors)
            return recommended_blogs
        except BlogNotFoundException as e:
            response.status_code = status.HTTP_400_BAD_REQUEST
            return {"error": f"Blog not found for id:{current_blog_id}"}
        except Exception as e:
            response.status_code = status.HTTP_500_INTERNAL_SERVER_ERROR
            return {"error": "Unable to process the request"}

    • The below flow diagram summarizes all the steps we have discussed so far:

    Testing the Recommendation Service

    • Ideally, we would be receiving the blog ID from the UI Service and passing the recommendations back, but for illustration purposes, we’ll be calling the recommend-blogs API with some test inputs from my test dataset. The blogs in this sample dataset have concise titles and content, which are sufficient for testing purposes, but real-world blogs will be much more detailed and carry a significant amount of data. The test dataset has around 1,000 blogs across various categories like healthcare, tech, travel, entertainment, and so on.
    • A sample from the test dataset:

     

    • Test Result 1: Medical Research Blog

      Input Blog: Blog_Id: 1, Title: Breakthrough in Heart Disease Treatment, Content: Researchers have developed a new treatment for heart disease that promises to be more effective and less invasive. This breakthrough could save millions of lives every year.


    • Test Result 2: Travel Blog

      Input Blog: Blog_Id: 4, Title: Travel Tips for Sustainable Tourism, Content: How to travel responsibly and sustainably.


    I manually tested multiple blogs from the test dataset of 1,000 blogs, representing distinct topics and content, and assessed the quality and relevance of the recommendations. The recommended blogs had scores in the range of 87% to 95%, and upon examination, the blogs often appeared very similar in content and style.

    Based on the test results, it’s evident that utilizing vector search enables us to effectively recommend blogs to users that are semantically similar. This approach ensures that the recommendations are contextually relevant, even when the blogs don’t share identical keywords, enhancing the user’s experience by connecting them with content that aligns more closely with their interests and search intent.

    Limitations

    This approach for finding similar blogs is good enough for our simple recommendation service, but it might have certain limitations in real-world applications.

    • Our similarity search returns the nearest k neighbors as recommendations, but there might be scenarios where no similar blog might exist or the neighbors might have significant score differences. To deal with this, you can set a threshold to filter out recommendations below a certain score. Experiment with different threshold values and observe their impact on recommendation quality. 
    • If your use case involves a small dataset and the relationships between user preferences and item features are straightforward and well-defined, traditional methods like content-based or collaborative filtering might be more efficient and effective than vector search.
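    The score-threshold idea from the first bullet can be sketched as a small post-processing step; the helper name filter_by_threshold is hypothetical, and the hit dictionaries are assumed to have the same shape as those returned by find_similar_blog:

    ```python
    # Hypothetical helper: drop recommendations whose similarity score
    # (the "91.50%"-style string in each hit) falls below a minimum threshold.
    def filter_by_threshold(hits, min_score_pct=80.0):
        return [h for h in hits if float(h["score"].rstrip("%")) >= min_score_pct]

    sample_hits = [
        {"blog_id": 2, "score": "91.50%"},
        {"blog_id": 7, "score": "62.10%"},
    ]
    print(filter_by_threshold(sample_hits))  # only blog_id 2 clears the 80% bar
    ```

    Tuning min_score_pct against your own data is the experiment the bullet suggests: too low and weak matches slip through, too high and users see no recommendations at all.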

    Further Improvements

    • Using LLM for Content Validation: Implement a verification step using large language models (LLMs) to assess the relevance and validity of recommended content. This approach can ensure that the suggestions are not only similar in context but also meaningful and appropriate for your audience.
    • Metadata-based Embeddings: Instead of generating embeddings from the entire blog content, utilize LLMs to extract key metadata such as themes, intent, tone, or key points. Create embeddings based on this extracted metadata, which can lead to more efficient and targeted recommendations, focusing on the core essence of the content rather than its entirety.

    Conclusion

    Our journey concludes here, but yours is just beginning. Armed with the knowledge of vector search, vector databases, and embeddings, you’re now ready to build a recommendation service that doesn’t just guide users to content but connects them to the stories, insights, and experiences they seek. It’s not just about building a service; it’s about enriching the digital exploration experience, one recommendation at a time.

  • Unlocking the Potential of Knowledge Graphs: Exploring Graph Databases

    There is a growing demand for data-driven insights to help businesses make better decisions and stay competitive. To meet this need, organizations are turning to knowledge graphs as a way to access and analyze complex data sets. In this blog post, I will discuss what knowledge graphs are, what graph databases are, how they differ from hierarchical databases, the benefits of graphical representation of data, and more. Lastly, we’ll discuss some of the challenges of graph databases and how they can be overcome.

    What Is a Knowledge Graph?

    A knowledge graph is a visual representation of data or knowledge. In order to make the relationships between various types of facts and data easy to see and understand, facts and data are organized into a graph structure. A knowledge graph typically consists of nodes, which stand in for entities like people or objects, and edges, which stand in for the relationships among these entities.

    Each node in a knowledge graph has characteristics and attributes that describe it. For instance, the node of a person might contain properties like name, age, and occupation. Edges between nodes reveal information about their connections. This makes knowledge graphs a powerful tool for representing and understanding data.

    Benefits of a Knowledge Graph

    There are a number of benefits to using knowledge graphs. 

    • Knowledge graphs (KG) provide a visual representation of data that can be easily understood. This makes it easier to quickly identify patterns and correlations.
    • Additionally, knowledge graphs make it simple to locate linked data by allowing us to quickly access a particular node and obtain all of its child information.
    • These graphs are highly scalable, meaning they can support huge volumes of data. This makes them ideal for applications such as artificial intelligence (AI) and machine learning (ML).
    • Finally, knowledge graphs can connect various types of data beyond plain text, including images and videos. This makes them a great tool for data mining and analysis.

    What are Graph Databases?

    Graph databases are used to store and manage data in the form of a graph. Unlike traditional databases, they offer a more flexible representation of data using nodes, edges, and properties. Graph databases are designed to support queries that require traversing relationships between different types of data.

    Graph databases are well-suited for applications that require complex data relationships, such as AI and ML. They are also more efficient than traditional databases in queries that involve intricate data relationships, as they can quickly process data without having to make multiple queries.

    Source: Techcrunch

    Comparing Graph Databases to Hierarchical Databases

    It is important to understand the differences between graph databases and hierarchical databases. But first, what is a hierarchical database? Hierarchical databases are structured in a tree-like form, with each record in the database linked to one or more other records. This structure makes hierarchical databases ideal for storing data that is organized in a hierarchical manner, such as an organizational chart. However, hierarchical databases are less efficient at handling complex data relationships. To understand with an example, suppose we have an organization with a CEO at the top, followed by several vice presidents, who are in turn responsible for several managers, who are responsible for teams of employees.

     

    In a hierarchical database, this structure would be represented as a tree, with the CEO at the root, and each level of the organization represented by a different level of the tree. For example:

     

    In a graph database, this same structure would be represented as a graph, with each node representing an entity (e.g., a person), and each edge representing a relationship (e.g., reporting to). For example:

    (Vice President A) — reports_to –> (CEO)

    (Vice President B) — reports_to –> (CEO)

    (Vice President A) — manages –> (Manager A1)

    (Vice President B) — manages –> (Manager B1)

    (Manager A1) — manages –> (Employee A1.1)

    (Manager B1) — manages –> (Employee B1.1)

    (Manager B1) — manages –> (Employee B1.2)

     

    As you can see, in a graph database, the relationships between entities are explicit and can be easily queried and traversed. In a hierarchical database, the relationships are implicit and become more difficult to work with as the hierarchy grows more complex. This is why graph databases are better suited for complex data relationships: they have the flexibility to easily store and query such data.
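    The same org chart can be sketched as a tiny in-memory property graph in plain Python, where edges are (source, relationship, target) triples and "everyone in a VP's org" becomes a traversal; this is a toy illustration, not a real graph database:

    ```python
    # Edges of the org chart as (source, relationship, target) triples
    edges = [
        ("Vice President A", "reports_to", "CEO"),
        ("Vice President B", "reports_to", "CEO"),
        ("Vice President A", "manages", "Manager A1"),
        ("Vice President B", "manages", "Manager B1"),
        ("Manager A1", "manages", "Employee A1.1"),
        ("Manager B1", "manages", "Employee B1.1"),
        ("Manager B1", "manages", "Employee B1.2"),
    ]

    def org_of(person):
        """Everyone reachable from `person` by following 'manages' edges."""
        direct = [t for s, r, t in edges if s == person and r == "manages"]
        result = []
        for member in direct:
            result.append(member)
            result.extend(org_of(member))  # recurse one level down
        return result

    print(org_of("Vice President B"))  # ['Manager B1', 'Employee B1.1', 'Employee B1.2']
    ```

    In a real graph database this traversal is a one-line query over explicit edges, whereas a hierarchical store would need the tree walked level by level.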

    Creating a Knowledge Graph from Scratch

    We will now understand how to create a knowledge graph using an example below where we’ll use a simple XML file that contains information about some movies, and we’ll use an XSLT stylesheet to transform the XML data into RDF format along with some python libraries to help us in the overall process.

    Let’s consider an XML file having movie information:

    <movies>
      <movie id="tt0083658">
        <title>Blade Runner</title>
        <year>1982</year>
        <director rid="12341">Ridley Scott</director>
        <genre>Action</genre>
      </movie>
    
      <movie id="tt0087469">
        <title>Top Gun</title>
        <year>1986</year>
        <director rid="65217">Tony Hank</director>
        <genre>Thriller</genre>
      </movie>
    </movies>

    As discussed, to convert this data into a knowledge graph, we will be using an XSL file. A question may arise: what is an XSL file? XSL files are stylesheet documents used to transform XML data. To explore XSL further, visit here, but don’t worry, as we will be starting from scratch.

    Moving ahead, we also need to know that to convert any data into graph data, we need an ontology; many ontologies are available, such as the OWL and EBUCore ontologies. But what is an ontology? In the context of knowledge graphs, an ontology is a formal specification of the relationships and constraints that exist within a specific domain. It provides a vocabulary and a set of rules for representing and sharing knowledge, allowing machines to reason about the data they are working with. EBUCore is an ontology developed by the European Broadcasting Union (EBU) to provide a standardized metadata model for the broadcasting industry (OTT platforms, media companies, etc.). Further references on EBUCore can be found here.
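    Before wiring up the XSLT, it helps to see what graph data boils down to: a set of (subject, predicate, object) triples. The hand-built sketch below reuses the EBUCore namespace purely for illustration:

    ```python
    # A knowledge graph at its core is a set of (subject, predicate, object) triples.
    EBUCORE = "http://www.ebu.ch/metadata/ontologies/ebucore/ebucore#"

    triples = [
        ("movie:tt0083658", EBUCORE + "title", "Blade Runner"),
        ("movie:tt0083658", EBUCORE + "dateBroadcast", "1982"),
        ("movie:tt0083658", EBUCORE + "hasGenre", "Action"),
    ]

    # Everything known about a subject is just a filter over the triple set
    def describe(subject):
        return {p.rsplit("#", 1)[-1]: o for s, p, o in triples if s == subject}

    print(describe("movie:tt0083658"))
    ```

    The XSLT transformation we build next produces exactly this kind of structure, only serialized as RDF/XML instead of Python tuples.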

    We will be using the below XSL for transforming the above XML with movie info.

    <?xml version="1.0"?>
    <xsl:stylesheet version="1.0"
                    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
                    xmlns:ebucore="http://www.ebu.ch/metadata/ontologies/ebucore/ebucore#"
                    >
        <xsl:template match="movies">
            <rdf:RDF>
                <xsl:apply-templates select="movie"/>
            </rdf:RDF>
        </xsl:template>
    
        <xsl:template match="movie">
            <ebucore:Feature>
                <xsl:apply-templates select="title|year|director|genre"/>
            </ebucore:Feature>
        </xsl:template>
    
        <xsl:template match="title">
            <ebucore:title>
                <xsl:value-of select="."/>
            </ebucore:title>
        </xsl:template>
    
        <xsl:template match="year">
            <ebucore:dateBroadcast>
                <xsl:value-of select="."/>
            </ebucore:dateBroadcast>
        </xsl:template>
    
        <xsl:template match="director">
            <ebucore:hasParticipatingAgent>
                <ebucore:Agent>
                    <ebucore:hasRole>Director</ebucore:hasRole>
                    <ebucore:agentName>
                        <xsl:value-of select="."/>
                    </ebucore:agentName>
                </ebucore:Agent>
            </ebucore:hasParticipatingAgent>
        </xsl:template>
    
        <xsl:template match="genre">
            <ebucore:hasGenre>
                <xsl:value-of select="."/>
            </ebucore:hasGenre>
        </xsl:template>
    </xsl:stylesheet>

    To start with the XSL, the first line, <?xml version="1.0"?>, declares the XML version of the document. The second line opens a stylesheet, defining the XSL version we will be using, and declares the XSL, RDF, and EBUCore namespaces. These namespaces are required because we will be using elements from those vocabularies, and they avoid name conflicts in our XML document. The xsl:template match attribute defines which element to match in the XML, as we want to start matching from the root of the document. Since movies is the root element of our XML, we begin with xsl:template match="movies".

    After that, we open an rdf:RDF tag to start our knowledge graph. This element will contain all the movie details, so we use xsl:apply-templates on "movie", as our XML has multiple <movie> elements nested inside the <movies> tag. To extract the details of each <movie> element, we define a template matching movie elements. The <ebucore:Feature> tag states that all of our content belongs to a Feature, which is the EBUCore ontology's term for a movie. We then match details like title, year, and genre from the XML and emit their corresponding EBUCore elements: ebucore:title, ebucore:dateBroadcast, and ebucore:hasGenre, respectively.

    Now that we have the XSL ready, we will need to apply this XSL on our XML and get RDF data out of it by following the below Python code:

    import lxml.etree as ET
    import xml.dom.minidom as xm
    
    movie_data = """
                <movies>
                    <movie id="tt0083658">
                        <title>Blade Runner</title>
                        <year>1982</year>
                        <director rid="12341">Ridley Scott</director>
                        <genre>Action</genre>
                    </movie>
                        <movie id="tt0087469">
                        <title>Top Gun</title>
                        <year>1986</year>
                        <director rid="65217">Tony Hank</director>
                        <genre>Thriller</genre>
                    </movie>
                </movies>
                """
    
    xslt_file = "transform_movies.xsl"
    xslt_root = ET.parse(xslt_file)
    transform = ET.XSLT(xslt_root)
    rss_root = ET.fromstring(movie_data)
    result_tree = transform(rss_root)
    result_string = ET.tostring(result_tree)
    
    # Converting bytes to string and pretty formatting
    dom_transformed = xm.parseString(result_string)
    pretty_xml_as_string = dom_transformed.toprettyxml()
    
    # Saving the output
    with open('Path_to_downloads/output_movie.xml', "w") as f:
        f.write(pretty_xml_as_string)
    # Print Output
    print(pretty_xml_as_string)

    The above code will generate the following output:

    <?xml version="1.0" ?>
    <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:ebucore="http://www.ebu.ch/metadata/ontologies/ebucore/ebucore#">
    
    <ebucore:Feature>
       <ebucore:title>Blade Runner</ebucore:title>
       <ebucore:dateBroadcast>1982</ebucore:dateBroadcast>
       <ebucore:hasParticipatingAgent>
           <ebucore:Agent>
               <ebucore:hasRole>Director</ebucore:hasRole>
               <ebucore:agentName>Ridley Scott</ebucore:agentName>
           </ebucore:Agent>
       </ebucore:hasParticipatingAgent>
       <ebucore:hasGenre>Action</ebucore:hasGenre>
    </ebucore:Feature>
    
    <ebucore:Feature>
       <ebucore:title>Top Gun</ebucore:title>
       <ebucore:dateBroadcast>1986</ebucore:dateBroadcast>
       <ebucore:hasParticipatingAgent>
           <ebucore:Agent>
               <ebucore:hasRole>Director</ebucore:hasRole>
               <ebucore:agentName>Tony Hank</ebucore:agentName>
           </ebucore:Agent>
       </ebucore:hasParticipatingAgent>
       <ebucore:hasGenre>Thriller</ebucore:hasGenre>
    </ebucore:Feature>
    
    </rdf:RDF>

    This output is an RDF XML, which will now be converted to a Graph and we will also visualize it using the following code:

    Note: Install the following library before proceeding.

    pip install graphviz rdflib

    from rdflib import Graph, Namespace
    from rdflib.tools.rdf2dot import rdf2dot
    from graphviz import render
    
    # Creating empty graph object
    graph = Graph()
    graph.parse(data=result_string, format="xml")
    # Saving graph as ttl (Terse RDF Triple Language) in movies.ttl file
    graph.serialize(destination="Downloads/movies.ttl", format="ttl")
    
    # Steps to Visualize the generated graph
    # Define a namespace for the RDF data
    ns = Namespace("http://example.com/movies#")
    graph.bind("ex", ns)
    
    # Serialize the RDF data to a DOT file
    dot_file = open("Downloads/movies.dot", "w")
    rdf2dot(graph, dot_file, opts={"label": "Movies Graph", "rankdir": "LR"})
    dot_file.close()
    
    # Render the DOT file to a PNG image
    
    render("dot", "png", "Downloads/movies.dot")

    Finally, the above code will yield a movies.dot.png file in the Downloads folder location, which will look something like this:

    This clearly represents the nodes, the relationships (edges) between them, and all associated information in a well-formatted way.

    Examples of Knowledge Graphs

    Now that we have knowledge of how we can create a knowledge graph, let’s explore the big players that are using such graphs for their operations.

    Google Knowledge Graph: This is one of the most well-known examples of a knowledge graph. It is used by Google to enhance its search results with additional information about entities, such as people, places, and things. For example, if you search for “Barack Obama,” the Knowledge Graph will display a panel with information about his birthdate, family members, education, career, and more. All this information is stored in the form of nodes and edges, making it easier for the Google search engine to retrieve related information about any topic.

    DBpedia: This is a community-driven project that extracts structured data from Wikipedia and makes it available as a linked data resource. It is primarily used for graph analysis and executing SPARQL queries. It contains information on millions of entities, such as people, places, and things, and their relationships with one another. DBpedia can be used to power applications like question-answering systems, recommendation engines, and more. One of the key advantages of DBpedia is that it is an open and community-driven project, which means that anyone can contribute to it and use it for their own applications. This has led to a wide variety of applications built on top of DBpedia, from academic research to commercial products.

    As we have discussed examples of knowledge graphs, one should know that many of them rely on SPARQL queries to retrieve data from their huge corpus of graphs. So, let’s write one such query to retrieve data from the knowledge graph we created for the movie data: a query that retrieves each movie’s genre along with its title.

    from rdflib.plugins.sparql import prepareQuery
    
    # Define the SPARQL query
    query = prepareQuery('''
        PREFIX ebucore: <http://www.ebu.ch/metadata/ontologies/ebucore/ebucore#>
        SELECT ?genre ?title
        WHERE {
          ?movie a ebucore:Feature ;
                 ebucore:hasGenre ?genre ;
                 ebucore:title ?title .
        }
    ''')
    
    
    # Execute the query and print the results
    results = graph.query(query)
    for row in results:
        genre, title = row
        print(f"Movie Genre: {genre}, Movie Title: {title}")

    Challenges with Graph Databases:

    Data Complexity: One of the primary challenges with graph databases and knowledge graphs is data complexity. As the size and complexity of the data increase, it can become challenging to manage and query the data efficiently.

    Data Integration: Graph databases and knowledge graphs often need to integrate data from different sources, which can be challenging due to differences in data format, schema, and structure.

    Query Performance: Knowledge graphs are often used for complex queries, which can be slow to execute, especially for large datasets.

    Knowledge Representation: Representing knowledge in a graph database or knowledge graph can be challenging due to the diversity of concepts and relationships that need to be modeled accurately. One should have experience with ontologies, relationships, and business use cases to curate a good representation.

    Bonus: How to Overcome These Challenges:

    • Use efficient indexing and query optimization techniques to handle data complexity and improve query performance.
    • Use data integration tools and techniques to standardize data formats and structures to improve data integration.
    • Use distributed computing and partitioning techniques to scale the database horizontally.
    • Use caching and precomputing techniques to speed up queries.
    • Use ontology modeling and semantic reasoning techniques to accurately represent knowledge and relationships in the graph database or knowledge graph.
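    As a toy illustration of the caching bullet (the function and its data are hypothetical stand-ins for an expensive graph query):

    ```python
    from functools import lru_cache

    # Hypothetical expensive lookup; in practice this would traverse the graph database
    @lru_cache(maxsize=256)
    def movies_by_genre(genre: str) -> tuple:
        # ...an expensive traversal of the knowledge graph would happen here...
        data = {"Action": ("Blade Runner",), "Thriller": ("Top Gun",)}
        return data.get(genre, ())

    movies_by_genre("Action")                 # computed once
    movies_by_genre("Action")                 # served from the cache
    print(movies_by_genre.cache_info().hits)  # at least one cache hit
    ```

    The same pattern extends to precomputing: popular traversals can be materialized ahead of time so queries only pay the lookup cost.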

    Conclusion

    In conclusion, graph databases, and knowledge graphs are powerful tools that offer several advantages over traditional relational databases. They enable flexible modeling of complex data and relationships, which can be difficult to achieve using a traditional tabular structure. Moreover, they enhance query performance for complex queries and enable new use cases such as recommendation engines, fraud detection, and knowledge management.

    Despite the aforementioned challenges, graph databases and knowledge graphs are gaining popularity in various industries, ranging from finance to healthcare, and are expected to continue playing a significant role in the future of data management and analysis.

  • Mage: Your New Go-To Tool for Data Orchestration

    In our journey to automate data pipelines, we’ve used tools like Apache Airflow, Dagster, and Prefect to manage complex workflows. However, as data automation continues to change, we’ve added a new tool to our toolkit: Mage AI.

    Mage AI isn’t just another tool; it’s a solution to the evolving demands of data automation. This blog aims to explain how Mage AI is changing the way we automate data pipelines by addressing challenges and introducing innovative features. Let’s explore this evolution, understand the problems we face, and see why we’ve adopted Mage AI.

    What is Mage AI?

Mage is a user-friendly open-source framework created for transforming and merging data. It’s a valuable tool for developers handling substantial data volumes efficiently. At its heart, Mage relies on “data pipelines,” made up of code blocks. These blocks can run independently or as part of a larger pipeline. Together, these blocks form a structure known as a directed acyclic graph (DAG), which helps manage dependencies. For example, you can use Mage for tasks like loading data, transforming it, or exporting it.
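The block-and-DAG idea can be sketched in plain Python. This is an illustration of the concept only, not Mage's actual API: three independent, individually testable functions wired into a linear DAG.

```python
# Illustration of the block/DAG idea (not Mage's actual API).
# Each block is an independent, individually testable function; the
# pipeline wires each block's output into the next block's input.

def load_data():
    # "Data loader" block: produce raw records.
    return [{"amount": 10}, {"amount": -5}, {"amount": 30}]

def transform(rows):
    # "Transformer" block: drop invalid records.
    return [r for r in rows if r["amount"] > 0]

def export(rows):
    # "Data exporter" block: summing stands in for writing to a sink.
    return sum(r["amount"] for r in rows)

def run(pipeline):
    """Execute blocks in dependency order (a linear DAG here)."""
    result = None
    for i, block in enumerate(pipeline):
        result = block() if i == 0 else block(result)
    return result

PIPELINE = [load_data, transform, export]
```

Because each block is just a function, it can be tested in isolation and reused in other pipelines, which is the property Mage builds on.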

    Mage Architecture:

    Before we delve into Mage’s features, let’s take a look at how it works.

    When you use Mage, your request begins its journey in the Mage Server Container, which serves as the central hub for handling requests, processing data, and validation. Here, tasks like data processing and real-time interactions occur. The Scheduler Process ensures tasks are scheduled with precision, while Executor Containers, designed for specific tasks like Python or AWS, carry out the instructions.

Mage’s scalability is impressive, allowing it to handle growing workloads effectively. It can expand both vertically and horizontally to maintain top-notch performance. Mage efficiently manages project artifacts, including code, data, and logs, and takes security seriously when handling databases and sensitive information. This well-coordinated system, combined with Mage’s scalability, guarantees reliable data pipelines, blending technical precision with seamless orchestration.

    Scaling Mage:

    To enhance Mage’s performance and reliability as your workload expands, it’s crucial to scale its architecture effectively. In this concise guide, we’ll concentrate on four key strategies for optimizing Mage’s scalability:

    1. Horizontal Scaling: Ensure responsiveness by running multiple Mage Server and Scheduler instances. This approach keeps the system running smoothly, even during peak usage.
    2. Multiple Executor Containers: Deploy several Executor Containers to handle concurrent task execution. Customize them for specific executors (e.g., Python, PySpark, or AWS) to scale task processing horizontally as needed.
    3. External Load Balancers: Utilize external load balancers to distribute client requests across Mage instances. This not only boosts performance but also ensures high availability by preventing overloading of a single server.
    4. Scaling for Larger Datasets: To efficiently handle larger datasets, consider:

    a. Allocating more resources to executors, empowering them to tackle complex data transformations.

b. Using Mage’s direct data warehouse transformations and native Spark integration for massive datasets.

    Features: 

    1) Interactive Coding Experience

    Mage offers an interactive coding experience tailored for data preparation. Each block in the editor is a modular file that can be tested, reused, and chained together to create an executable data pipeline. This means you can build your data pipeline piece by piece, ensuring reliability and efficiency.

    2) UI/IDE for Building and Managing Data Pipelines

    Mage takes data pipeline development to the next level with a user-friendly integrated development environment (IDE). You can build and manage your data pipelines through an intuitive user interface, making the process efficient and accessible to both data scientists and engineers.

    3) Multiple Languages Support

    Mage supports writing pipelines in multiple languages such as Python, SQL, and R. This language versatility means you can work with the languages you’re most comfortable with, making your data preparation process more efficient.

    4) Multiple Types of Pipelines

    Mage caters to diverse data pipeline needs. Whether you require standard batch pipelines, data integration pipelines, streaming pipelines, Spark pipelines, or DBT pipelines, Mage has you covered.

    5) Built-In Engineering Best Practices

    Mage is not just a tool; it’s a promoter of good coding practices. It enables reusable code, data validation in each block, and operationalizes data pipelines with built-in observability, data quality monitoring, and lineage. This ensures that your data pipelines are not only efficient but also maintainable and reliable.

    6) Dynamic Blocks

    Dynamic blocks in Mage allow the output of a block to dynamically create additional blocks. These blocks are spawned at runtime, with the total number of blocks created being equal to the number of items in the output data of the dynamic block multiplied by the number of its downstream blocks.
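The fan-out arithmetic above can be made concrete with a one-line helper (illustrative only, not a Mage API):

```python
def spawned_block_runs(output_items: int, downstream_blocks: int) -> int:
    """Blocks spawned at runtime by a dynamic block: items x downstream."""
    return output_items * downstream_blocks

# A dynamic block whose output has 3 items, feeding 2 downstream
# blocks, spawns 3 x 2 = 6 block runs.
print(spawned_block_runs(3, 2))  # -> 6
```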

    7) Triggers

    • Schedule Triggers: These triggers allow you to set specific start dates and intervals for pipeline runs. Choose from daily, weekly, or monthly, or even define custom schedules using Cron syntax. Mage’s Schedule Triggers put you in control of when your pipelines execute.
    • Event Triggers: With Event Triggers, your pipelines respond instantly to specific events, such as the completion of a database query or the creation of a new object in cloud storage services like Amazon S3 or Google Storage. Real-time automation at your fingertips.
    • API Triggers: API Triggers enable your pipelines to run in response to specific API calls. Whether it’s customer requests or external system interactions, these triggers ensure your data workflows stay synchronized with the digital world.

Different Types of Blocks:

Data Loader: Within Mage, Data Loaders are ready-made templates designed to seamlessly link up with a multitude of data sources. These sources span from Postgres, BigQuery, Redshift, and S3 to various others. Additionally, Mage allows for the creation of custom data loaders, enabling connections to APIs. The primary role of Data Loaders is to facilitate the retrieval of data from these designated sources.

    Data Transformer: Much like Data Loaders, Data Transformers provide predefined functions such as handling duplicates, managing missing data, and excluding specific columns. Alternatively, you can craft your own data transformations or merge outputs from multiple data loaders to preprocess and sanitize the data before it advances through the pipeline.

    Data Exporter: Data Exporters within Mage empower you to dispatch data to a diverse array of destinations, including databases, data lakes, data warehouses, or local storage. You can opt for predefined export templates or craft custom exporters tailored to your precise requirements.

    Custom Blocks: Custom blocks in the Mage framework are incredibly flexible and serve various purposes. They can store configuration data and facilitate its transmission across different pipeline stages. Additionally, they prove invaluable for logging purposes, allowing you to categorize and visually distinguish log entries for enhanced organization.

    Sensor: A Sensor, a specialized block within Mage, continuously assesses a condition until it’s met or until a specified time duration has passed. When a block depends on a sensor, it remains inactive until the sensor confirms that its condition has been satisfied. Sensors are especially valuable when there’s a need to wait for external dependencies or handle delayed data before proceeding with downstream tasks.
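The sensor behavior described above can be approximated in a few lines of plain Python. This is an illustration of the polling pattern only, not Mage's actual Sensor API:

```python
import time

def wait_for(condition, timeout_s=5.0, poll_interval_s=0.05):
    """Hypothetical sensor loop: poll `condition` until it returns True
    or `timeout_s` elapses. Downstream work should run only on success."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if condition():
            return True
        time.sleep(poll_interval_s)
    return False
```

In Mage, the block downstream of a sensor stays inactive until the condition is satisfied; in this sketch, that corresponds to running the downstream task only when `wait_for(...)` returns True.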

    Getting Started with Mage

There are two ways to run Mage: using Docker or using pip.

Docker Command

Create a new working directory where all the Mage files will be stored.

    Then, in that working directory, execute this command:

    Windows CMD: 

    docker run -it -p 6789:6789 -v %cd%:/home/src mageai/mageai /app/run_app.sh mage start [project_name]

    Linux CMD:

    docker run -it -p 6789:6789 -v $(pwd):/home/src mageai/mageai /app/run_app.sh mage start [project_name]

Using pip (from the working directory):

    pip install mage-ai

mage start [project_name]

    You can browse to http://localhost:6789/overview to get to the Mage UI.

Let’s build our first pipeline to fetch CSV files from the API for data loading, do some useful transformations, and export that data to our local database.

The dataset is an invoices CSV file stored in the current working directory, with the following columns:

 (1) First Name; (2) Last Name; (3) E-mail; (4) Product ID; (5) Quantity; (6) Amount; (7) Invoice Date; (8) Address; (9) City; (10) Stock Code

    Create a new pipeline and select a standard batch (we’ll be implementing a batch pipeline) from the dashboard and give it a unique ID.

    Project structure:

    ├── mage_data

    └── [project_name]

        ├── charts

        ├── custom

        ├── data_exporters

        ├── data_loaders

        ├── dbt

        ├── extensions

        ├── pipelines

        │   └── [pipeline_name]

        │       ├── __init__.py

        │       └── metadata.yaml

        ├── scratchpads

        ├── transformers

        ├── utils

        ├── __init__.py

        ├── io_config.yaml

        ├── metadata.yaml

        └── requirements.txt

This pipeline consists of all the block files, including the data loader, transformer, and charts, plus the pipeline’s configuration files, io_config.yaml and metadata.yaml. Each block file contains an inbuilt decorated function where we write our code.

    1. We begin by loading a CSV file from our local directory, specifically located at /home/src/invoice.csv. To achieve this, we select the “Local File” option from the Templates dropdown and configure the Data Loader block accordingly. Running this configuration will allow us to confirm if the CSV file loads successfully.

2. In the next step, we introduce a Transformer block using a generic template. On the right side of the user interface, we can observe the directed acyclic graph (DAG) tree. To establish the data flow, we set the Transformer block’s parent, linking it to the Data Loader block either in code or via the user interface.

    The Transformer block operates on the data frame received from the upstream Data Loader block, which is passed as the first argument to the Transformer function.

    3. Our final step involves exporting the DataFrame to a locally hosted PostgreSQL database. We incorporate a Data Export block and connect it to the Transformer block.

    To establish a connection with the PostgreSQL database, it is imperative to configure the database credentials in the io_config.yaml file. Alternatively, these credentials can be added to environmental variables.

    With these steps completed, we have successfully constructed a foundational batch pipeline. This pipeline efficiently loads, transforms, and exports data, serving as a fundamental building block for more advanced data processing tasks.
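As a sketch of what the Transformer block from step 2 might contain, the hypothetical function below deduplicates invoices and drops rows with a missing E-mail. Rows are modeled as plain dicts to keep the example dependency-free; in a real Mage block the function receives a pandas DataFrame, and the @transformer decorator comes from mage_ai (a no-op fallback is included so the snippet runs standalone).

```python
# Sketch of a Transformer block's body for the invoice pipeline.
# In a real Mage project the @transformer decorator is provided by
# mage_ai; the fallback below only lets this snippet run standalone.
try:
    from mage_ai.data_preparation.decorators import transformer
except ImportError:
    def transformer(func):
        # No-op stand-in for Mage's decorator.
        return func

@transformer
def transform(rows, *args, **kwargs):
    """Drop exact duplicate invoices and rows missing an E-mail."""
    seen, cleaned = set(), []
    for row in rows:
        key = tuple(sorted(row.items()))
        if row.get("E-mail") and key not in seen:
            seen.add(key)
            cleaned.append(row)
    return cleaned
```

The output of this block flows to the Data Exporter, just as the DataFrame does in the pipeline built above.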

Mage vs. Other Tools:

    Consistency Across Environments: Some orchestration tools may exhibit inconsistencies between local development and production environments due to varying configurations. Mage tackles this challenge by providing a consistent and reproducible workflow environment through a single configuration file that can be executed uniformly across different environments.

Reusability: Achieving reusability in workflows can be complex in some tools. Mage simplifies this by allowing tasks and workflows to be defined as reusable components within a Mage project, making it effortless to share them across projects and teams.

    Data Passing: Efficiently passing data between tasks can be a challenge in certain tools, especially when dealing with large datasets. Mage streamlines data passing through straightforward function arguments and returns, enabling seamless data flow and versatile data handling.

Testing: Some tools lack user-friendly testing utilities, resulting in manual testing and potential coverage gaps. Mage simplifies testing with a robust testing framework that enables the definition of test cases, inputs, and expected outputs directly within each block file.

    Debugging: Debugging failed tasks can be time-consuming with certain tools. Mage enhances debugging with detailed logs and error messages, offering clear insights into the causes of failures and expediting issue resolution.

    Conclusion: 

    Mage offers a streamlined and user-friendly approach to data pipeline orchestration, addressing common challenges with simplicity and efficiency. Its single-container deployment, visual interface, and robust features make it a valuable tool for data professionals seeking an intuitive and consistent solution for managing data workflows.

  • Discover App Features with TipKit

    In today’s digital age, the user experience is paramount. Mobile applications need to be intuitive and user-friendly so that users not only enjoy the app’s main functionalities but also easily navigate through its features. There are instances where a little extra guidance can go a long way, whether it’s introducing users to a fresh feature, showing them shortcuts to complete tasks more efficiently, or simply offering tips on getting the most out of an app. Many developers have traditionally crafted custom overlays or tooltips to bridge this gap, often requiring a considerable amount of effort. But the wait for a streamlined solution is over. After much anticipation, Apple has introduced the TipKit framework, a dedicated tool to simplify this endeavor, enhancing user experience with finesse.

    TipKit

    Introduced at WWDC 2023, TipKit emerges as a beacon for app developers aiming to enhance user engagement and experience. This framework is ingeniously crafted to present mini-tutorials, shining a spotlight on new, intriguing, or yet-to-be-discovered features within an application. Its utility isn’t just confined to a single platform—TipKit boasts integration with iCloud to ensure data synchronization across various devices.

    At the heart of TipKit lies its two cornerstone components: the Tip Protocol and the TipView. These components serve as the foundation, enabling developers to craft intuitive and informative tips that resonate with their user base.

    Tip Protocol

    The essence of TipKit lies in its Tip Protocol, which acts as the blueprint for crafting and configuring content-driven tips. To create your tips tailored to your application’s needs, it’s imperative to conform to the Tip Protocol.

    While every Tip demands a title for identification, the protocol offers flexibility by introducing a suite of properties that can be optionally integrated, allowing developers to craft a comprehensive and informative tip.

1. title(Text): The title of the Tip.
2. message(Text): A concise description that elaborates on the essence of the Tip, giving users a deeper understanding.
3. image(Image): An image to display on the left side of the Tip view.
4. id(String): A unique identifier for your tip. Defaults to the name of the type that conforms to the Tip protocol.
5. rules(Array of type Tips.Rule): Rules determine the conditions under which the Tip should be displayed.
6. options(Array of type Tips.Option): Allows adding options that define the behavior of the Tip.
7. actions(Array of type Tips.Action): Provides primary and secondary buttons in the TipView that help the user learn more about the Tip or execute a custom action on interaction.

    Creating a Custom Tip

    Let’s create our first Tip. Here, we are going to show a Tip to help the user understand the functionality of the cart button.

    struct CartItemsTip: Tip {
        var title: Text {
            Text("Click the cart button to see what's in your cart")
        }
        var message: Text? {
            Text("You can edit/remove the items from your cart")
        }
        var image: Image? {
            Image(systemName: "cart")
        }
    }

    TipView

As the name suggests, TipView is a user interface element that represents an inline tip. The initializer of TipView requires an instance of a type conforming to the Tip protocol discussed above, plus an optional Edge parameter that decides which edge of the tip view displays the arrow.

    Displaying a Tip

    Following are the two ways the Tip can be displayed.

    • Inline

You can display the tip along with other views. An object of TipView requires a type conforming to the Tip protocol to display the inline tip. As a developer, handling multiple views on the screen can be complex and time-consuming. The TipKit framework makes it easy by automatically adjusting the layout and position of the TipView to ensure other views remain accessible to the user.

    struct ProductList: View {
        private let cartTip = CartItemsTip()
        var body: some View {
        // Other views
            
            TipView(cartTip)
            
        // Other views
        }
    }

    • Popover

The TipKit framework allows you to show a popover Tip for any UI element, e.g., Button, Image, etc. The popover tip appears over the entire screen, thus blocking the other views from user interaction until the tip is dismissed. A popoverTip modifier displays a Popover Tip for any UI element. Consider the example below, where a Popover tip is displayed for a cart image.

    private let cartTip = CartItemsTip()
    Button {
       cartTip.invalidate(reason: .actionPerformed)
    } label: {
       Image(systemName: "cart")
           .popoverTip(cartTip)
    }

    Dismissing the Tip

    A TipView can be dismissed in two ways.

1. The user taps the X icon.
    2. Developers can dismiss the Tip programmatically using the invalidate(reason:) method.

    There are 3 options to pass as a reason for dismissing the tip: 

    actionPerformed, userClosedTip, maxDisplayCountExceeded

    private let cartTip = CartItemsTip()
    cartTip.invalidate(reason: .actionPerformed)

    Tips Center

We have discussed essential points to define and display a tip using the Tip protocol and TipView, respectively. Still, there is one last and most important step—to configure and load the tips using the configure method as described in the below example. This is mandatory; without it, no tips will be displayed in your application.

    import SwiftUI
    import TipKit
    
    @main
    struct TipkitDemoApp: App {
        var body: some Scene {
            WindowGroup {
                ContentView()
                    .task {
                        try? Tips.configure([
                            .displayFrequency(.immediate),
                            .datastoreLocation(.applicationDefault)
                        ])
                    }
            }
        }
    }

The definition of the configure method looks something like this:

    static func configure(@Tips.ConfigurationBuilder options: @escaping () -> some TipsConfiguration = { defaultConfiguration }) async throws

Notice that the configure method accepts a list of types conforming to TipsConfiguration. Two options are available: DisplayFrequency and DatastoreLocation.

    You can set these values as per your requirement. 

    DisplayFrequency

DisplayFrequency allows you to control how often your tips appear and has multiple options.

    • Use the immediate option when you do not want to set any restrictions.
    • Use the hourly, daily, weekly, and monthly values to display no more than one tip per hour, day, week, or month, respectively.
    • When none of the built-in options serves the purpose, you can set a custom display frequency as a TimeInterval. In the example below, we set a custom display frequency that restricts tips to being displayed once every two days.
    let customDisplayFrequency: TimeInterval = 2 * 24 * 60 * 60
    try? Tips.configure([
         .displayFrequency(customDisplayFrequency),
         .datastoreLocation(.applicationDefault)
    ])

    DatastoreLocation

    This will be used for persisting tips and associated data. 

    You can use the following initializers to decide how to persist tips and data.

    public init(url: URL, shouldReset: Bool = false)

    url: A specific URL location where you want to persist the data.

shouldReset: If set to true, it will erase all data from the datastore, resetting all tips present in the application.

    public init(_ location: DatastoreLocation, shouldReset: Bool = false)

    location: A predefined datastore location. Setting a default value ‘applicationDefault’ would persist the datastore in the app’s support directory. 

    public init(groupIdentifier: String, directoryName: String? = nil, shouldReset: Bool = false) throws

groupIdentifier: The identifier of the app group whose shared directory is used by your team’s applications.

directoryName: An optional directory name within the group’s shared directory.

    Max Display Count

    As discussed earlier, we can set options to define tip behavior. One such option is MaxDisplayCount. Consider that you want to show CartItemsTip whenever the user is on the Home screen. Showing the tip every time a user comes to the Home screen can be annoying or frustrating. To prevent this, one of the solutions, perhaps the easiest, is using MaxDisplayCount. The other solution could be defining a Rule that determines when the tip needs to be displayed. Below is an example showcasing the use of the MaxDisplayCount option for defining CartItemsTip.

    struct CartItemsTip: Tip {
        var title: Text {
            Text("Click here to see what's in your cart")
        }
        
        var message: Text? {
            Text("You can edit/remove the items from your cart")
        }
        
        var image: Image? {
            Image(systemName: "cart")
        }
        
        var options: [TipOption] {
            [ MaxDisplayCount(2) ]
        }
    }

    Rule Based Tips

    Let’s understand how Rules can help you gain more control over displaying your tips. There are two types of Rules: parameter-based rules and event-based rules.

    Parameter Rules

    These are persistent and more useful for State and Boolean comparisons. There are Macros (#Rule, @Parameter) available to define a rule. 

    In the below example, we define a rule that checks if the value stored in static itemsInCart property is greater than or equal to 3. 

Defining rules ensures tips are displayed only when all the conditions are satisfied.

    struct CartTip: Tip {
        
        var title: Text {
            Text("Proceed with buying cart items.")
        }
        
        var message: Text? {
            Text("There are 3 or more items in your cart.")
        }
        
        var image: Image? {
            Image(systemName: "cart")
        }
        
        @Parameter
        static var itemsInCart: Int = 0
        
        var rules: [Rule] {
            #Rule(Self.$itemsInCart) { $0 >= 3 }
        }
    }

    Event Rules

Event-based rules are useful when we want to track occurrences of certain actions in the app. Each event has a unique identifier id of type String, with which we can differentiate between various events. Whenever the action occurs, we need to donate the event to increment its counter.

    Let’s consider the below example where we want to show a Tip to the user when the user selects the iPhone 14 Pro (256 GB) – Purple product more than 2 times.

    The example below creates a didViewProductDetail event with an associated donation value and donates it anytime the ProductDetailsView appears:

    struct ProductDetailsView: View {
        static let didViewProductDetail = Tips.Event<DidViewProduct>(id: "didViewProductDetail")
        var product: ProductModel
        var body: some View {
            VStack(alignment: .leading) {
                HStack(alignment: .top, content: {
                    Spacer()
                    Image(product.productImage, bundle: Bundle.main)
                    Spacer()
                })
                Text(product.productName)
                    .font(.title3)
                    .lineLimit(nil)
                Text(product.productPrice)
                    .font(.title2)
                Text("Get it by Wednesday, 18 October")
                    .font(.caption2)
                    .lineLimit(nil)
                Spacer()
            }
            .padding()
            .onAppear {
                Self.didViewProductDetail.sendDonation(.init(productID: product.productID, productName: product.productName))
        }
        }
    }
    
    struct DidViewProduct: Codable, Sendable {
        let productID: UUID
        let productName: String
    }

    The example below creates a display rule for ProductDetailsTip based on the didViewProductDetail event.

    struct ProductDetailsTip: Tip {
        var title: Text {
            Text("Add iPhone 14 Pro (256 GB) - Purple to your cart")
        }
        
        var message: Text? {
            Text("You can edit/remove the items from your cart")
        }
        
        var image: Image? {
            Image(systemName: "cart")
        }
        
        var rules: [Rule] {
            // Tip will only display when the didViewProductDetail event for product name 'iPhone 14 Pro (256 GB) - Purple' has been donated 3 or more times in a day.
            #Rule(ProductDetailsView.didViewProductDetail) {
                $0.donations.donatedWithin(.day).filter( { $0.productName == "iPhone 14 Pro (256 GB) - Purple" }).count >= 3
            }
        }
        
        var actions: [Action] {
            [
                Tip.Action(id: "add-product-to-cart", title: "Add to cart", perform: {
                    print("Product added into the cart")
                })
            ]
        }
    }

    Customization for Tip

Customization is key, as every app maintains its own theme throughout. Customizing tips to align with the application theme surely enhances the user experience. As of now, the TipKit framework does not offer much customization, but we expect it to improve in the future. Below are the available methods for customizing tips.

    public func tipAssetSize(_ size: CGSize) -> some View
    
    public func tipCornerRadius(_ cornerRadius: Double, 
                                antialiased: Bool = true) -> some View
    
    public func tipBackground(_ style: some ShapeStyle) -> some View

    Testing

Testing tips is very important, as a small issue in the implementation of this framework can ruin your app’s user experience. We can construct UI test cases for various scenarios, and the following methods can be helpful for testing tips.

    • showAllTips
    • hideAllTips
    • showTips([<instance-of-your-tip>])
    • hideTips([<instance-of-your-tip>])

    Pros

    • Compatibility: TipKit is compatible across all Apple platforms, including iOS, macOS, watchOS, tvOS, and visionOS.
    • Supports both SwiftUI and UIKit
    • Easy implementation and testing
    • Avoiding dependency on third-party libraries

    Cons 

    • Availability: Only available from iOS 17.0, iPadOS 17.0, macOS 14.0, Mac Catalyst 17.0, tvOS 17.0, watchOS 10.0, and visionOS 1.0 onward, so there is no backward compatibility as of now.
    • It might frustrate the user if the application incorrectly implements this framework

    Conclusion

    The TipKit framework is a great way to introduce new features in our application to the user. It is easy to implement, and it enhances the user experience. Having said that, we should avoid extensive use of it as it may frustrate the user. We should always avoid displaying promotional and error messages in the form of tips.

  • Optimizing iOS Memory Usage with Instruments Xcode Tool

    Introduction

    Developing iOS applications that deliver a smooth user experience requires more than just clean code and engaging features. Efficient memory management helps ensure that your app performs well and avoids common pitfalls like crashes and excessive battery drain. 

    In this blog, we’ll explore how to optimize memory usage in your iOS app using Xcode’s powerful Instruments and other memory management tools.

    Memory Management and Usage

    Before we delve into the other aspects of memory optimization, it’s important to understand why it’s so essential:

    Memory management in iOS refers to the process of allocating and deallocating memory for objects in an iOS application to ensure efficient and reliable operation. Proper memory management prevents issues like memory leaks, crashes, and excessive memory usage, which can degrade an app’s performance and user experience. 

    Memory management in iOS primarily involves the use of Automatic Reference Counting (ARC) and understanding how to manage memory effectively.

    Here are some key concepts and techniques related to memory management in iOS:

1. Automatic Reference Counting (ARC): ARC is a memory management technique introduced by Apple to automate memory management in Objective-C and Swift. With ARC, the compiler automatically inserts retain, release, and autorelease calls, ensuring that memory is allocated and deallocated as needed. Developers don’t need to manually manage memory by calling retain, release, or autorelease as they did in the pre-ARC era of manual memory management.
    2. Strong and Weak References: In ARC, objects have strong, weak, and unowned references. A strong reference keeps an object in memory as long as at least one strong reference to it exists. A weak reference, on the other hand, does not keep an object alive. It’s commonly used to avoid strong reference cycles (retain cycles) and potential memory leaks.
    3. Retain Cycles: A retain cycle occurs when two or more objects hold strong references to each other, creating a situation where they cannot be deallocated, even if they are no longer needed. To prevent retain cycles, you can use weak references, unowned references, or break the cycle manually by setting references to “nil” when appropriate.
    4. Avoiding Strong Reference Cycles: To avoid retain cycles, use weak references (and unowned references when appropriate) in situations where two objects reference each other. Also, consider using closure capture lists to prevent strong reference cycles when using closures.
    5. Resource Management: Memory management also includes managing other resources like files, network connections, and graphics contexts. Ensure you release or close these resources when they are no longer needed.
    6. Memory Profiling: The Memory Report in the Debug Navigator of Xcode is a tool used for monitoring and analyzing the memory usage of your iOS or macOS application during runtime. It provides valuable insights into how your app utilizes memory, helps identify memory-related issues, and allows you to optimize the application’s performance.
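Although the examples in this post are iOS-centric, the strong-versus-weak distinction from points 2 and 3 is easy to demonstrate in any reference-counted runtime. The sketch below uses Python's weakref module with hypothetical Department and Employee classes; the weak back-reference breaks what would otherwise be a retain cycle.

```python
import weakref

# Hypothetical classes illustrating points 2 and 3 above. CPython, like
# ARC, uses reference counting, so the same cycle-breaking pattern applies.
class Department:
    def __init__(self, name):
        self.name = name
        self.head = None  # strong reference to an Employee

class Employee:
    def __init__(self, name, department):
        self.name = name
        # A weak back-reference avoids a Department <-> Employee cycle.
        self._department = weakref.ref(department)

    @property
    def department(self):
        # Returns None once the Department has been deallocated.
        return self._department()
```

With two strong references the pair could never be freed; because the Employee holds only a weak reference, dropping the last strong reference to the Department deallocates it immediately, and the Employee's back-reference simply resolves to None.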

    Also, use tools like Instruments to profile your app’s memory usage and identify memory leaks and excessive memory consumption.

    Instruments: Your Ally for Memory Optimization

In Xcode, “Instruments” refers to a set of performance analysis and debugging tools integrated into the Xcode development environment. These instruments are used by developers to monitor and analyze the performance of their iOS, macOS, watchOS, and tvOS applications during development and testing. Instruments help developers identify and address performance bottlenecks, memory issues, and other problems in their code.

    Some of the common instruments available in Xcode include:

    1. Allocations: The Allocations instrument helps you track memory allocations and deallocations in your app. It’s useful for detecting memory leaks and excessive memory usage.
    2. Leaks: The Leaks instrument finds memory leaks in your application. It can identify objects that are not properly deallocated.
    3. Time Profiler: Time Profiler helps you measure and analyze the CPU usage of your application over time. It can identify which functions or methods are consuming the most CPU resources.
    4. Custom Instruments: Xcode also allows you to create custom instruments tailored to your specific needs using the Instruments development framework.

    To use these instruments, you can run your application with profiling enabled, and then choose the instrument that best suits your performance analysis goals. 

    Launching Instruments

    Because Instruments is located inside Xcode’s app bundle, you won’t be able to find it in the Finder. 

    To launch Instruments on macOS, follow these steps:

    1. Open Xcode: Instruments is bundled with Xcode, Apple’s integrated development environment for macOS, iOS, watchOS, and tvOS app development. If you don’t have Xcode installed, you can download it from the Mac App Store or Apple’s developer website.
    2. Open Your Project: Launch Xcode and open the project for which you want to use Instruments. You can do this by selecting “File” > “Open” and then navigating to your project’s folder.
    3. Choose Instruments: Once your project is open, go to the “Xcode” menu at the top-left corner of the screen. From the drop-down menu, select “Open Developer Tool” and choose “Instruments.”
    4. Select a Template: Instruments will open, and you’ll see a window with a list of available performance templates on the left-hand side. These templates correspond to the different types of analysis you can perform. Choose the template that best matches the type of analysis you want to conduct. For example, you can select “Time Profiler” for CPU profiling or “Leaks” for memory analysis.
    5. Configure Settings: Depending on the template you selected, you may need to configure some settings or choose the target process (your app) you want to profile. These settings can typically be adjusted in the template configuration area.
    6. Start Recording: Click the red record button in the top-left corner of the Instruments window to start profiling your application. This will launch your app with the selected template and begin collecting performance data.
    7. Analyze Data: Interact with your application as you normally would to trigger the performance scenarios you want to analyze. Instruments will record data related to CPU usage, memory usage, network activity, and other aspects of your app’s performance.
    8. Stop Recording: When you’re done profiling your app, click the square “Stop” button in Instruments to stop recording data.
    9. Analyze Results: After stopping the recording, Instruments will display a detailed analysis of your app’s performance. You can explore various graphs, timelines, and reports to identify and address performance issues.
    10. Save or Share Results: You can save your Instruments session for future reference or share it with colleagues if needed.

    Using the Allocations Instrument

    The “Allocations” instrument helps you monitor memory allocation and deallocation. Here’s how to use it:

    1. Start the Allocations Instrument: In Instruments, select “Allocations” as your instrument.

    2. Profile Your App: Use your app as you normally would to trigger the scenarios you want to profile.

    3. Examine the Memory Allocation Graph: The graph displays memory usage over time. Look for spikes or steady increases in memory usage.

    4. Inspect Objects: The instrument provides a list of objects that have been allocated and deallocated. You can inspect these objects and their associated memory usage.

    5. Call Tree and Source Code: To pinpoint memory issues, use the Call Tree to identify the functions or methods responsible for memory allocation. You can then inspect the associated source code in the Source View.

    Detecting Memory Leaks with the Leaks Instrument

    Retain Cycle

    A retain cycle in Swift occurs when two or more objects hold strong references to each other in a way that prevents them from being deallocated, causing a memory leak. This situation is also known as a “strong reference cycle.” It’s essential to understand retain cycles because they can lead to increased memory usage and potential app crashes.  

    A common scenario for retain cycles is when two objects reference each other, both using strong references. 

    Here’s an example to illustrate a retain cycle:

    class Person {
        var name: String
        var pet: Pet?
    
        init(name: String) {
            self.name = name
        }
    
        deinit {
        print("\(name) has been deallocated")
        }
    }
    
    class Pet {
        var name: String
        var owner: Person?
    
        init(name: String) {
            self.name = name
        }
    
        deinit {
        print("\(name) has been deallocated")
        }
    }
    
    var rohit: Person? = Person(name: "Rohit")
    var jerry: Pet? = Pet(name: "Jerry")
    
    rohit?.pet = jerry
    jerry?.owner = rohit
    
    rohit = nil
    jerry = nil

    In this example, we have two classes, Person and Pet, representing a person and their pet. Each class has a property that holds a strong reference to the other (pet on Person and owner on Pet).  

    The “Leaks” instrument is designed to detect memory leaks in your app. 

    Here’s how to use it:

    1. Launch Instruments in Xcode: First, open your project in Xcode.  

    2. Commence Profiling: To commence the profiling process, navigate to the “Product” menu and select “Profile.”  

    3. Select the Leaks Instrument: Within the Instruments interface, choose the “Leaks” instrument from the available options.  

    4. Trigger the Memory Leak Scenario: To trigger the scenario where memory is leaked, interact with your application. This interaction, such as creating a retain cycle, will induce the memory leak.

    5. Identify Leaked Objects: The Leaks Instrument will automatically detect and pinpoint the leaked objects, offering information about their origins, including backtraces and the responsible callers.  

    6. Analyze Backtraces and Responsible Callers: To gain insights into the context in which the memory leak occurred, you can inspect the source code in the Source View provided by Instruments.  

    7. Address the Leaks: Armed with this information, you can proceed to fix the memory leaks by making the necessary adjustments in your code to ensure memory is released correctly, preventing future occurrences of memory leaks.

    Running the code above under the Leaks instrument, you should see the leaked Person and Pet objects reported.

    The issue in the above code is that both Person and Pet are holding strong references to each other. When you create a Person and a Pet and set their respective references, a retain cycle is established. Even when you set rohit and jerry to nil, the objects are not deallocated, and the deinit methods are not called. This is a memory leak caused by the retain cycle. 

    To break the retain cycle and prevent this memory leak, you can use weak or unowned references. In this case, you can make the owner property in Pet a weak reference because a pet should not own its owner:

    class Pet {
        var name: String
        weak var owner: Person?
    
        init(name: String) {
            self.name = name
        }
    
        deinit {
        print("\(name) has been deallocated")
        }
    }

    By making owner a weak reference, the retain cycle is broken, and when you set rohit and jerry to nil, the objects will be deallocated, and the deinit methods will be called. This ensures proper memory management and avoids memory leaks.

    Best Practices for Memory Optimization

    In addition to using Instruments, consider the following best practices for memory optimization:

    1. Release Memory Properly: Ensure that memory is released when objects are no longer needed.

    2. Use Weak References: Use weak references when appropriate to prevent strong reference cycles.

    3. Use Unowned References to Break Retain Cycles: An unowned reference does not increase an object’s reference count. Use it only when the referenced object is guaranteed to outlive the reference, since accessing an unowned reference after its object has been deallocated causes a crash.

    4. Minimize Singletons and Global Variables: These can lead to retained objects. Use them judiciously.

    5. Implement Lazy Loading: Load resources lazily to reduce initial memory usage.

    Conclusion

    Optimizing memory usage is an essential part of creating high-quality iOS apps. 

    Instruments, integrated into Xcode, is a versatile tool that provides insights into memory allocation, leaks, and CPU-intensive code. By mastering these tools and best practices, you can ensure your app is memory-efficient, stable, and provides a superior user experience. Happy profiling!

  • Choosing Between Scrum and Kanban: Finding the Right Fit for Your Project

    In project management and software development, two popular methodologies have emerged as dominant players: Scrum and Kanban. Both offer effective frameworks for managing work, but they have distinct characteristics and applications. Deciding which one to choose for your project can be a critical decision that significantly impacts its success. In this blog post, we’ll dive deep into Scrum and Kanban, exploring their differences, advantages, and when to use each to help you make an informed decision.

    Understanding Scrum

    Scrum is an agile framework that originated in the early 1990s and has since gained widespread adoption across various industries. The structured approach emphasizes teamwork, collaboration, and iterative development. Scrum divides work into time-bound iterations called “sprints,” typically lasting two to four weeks, during which a cross-functional team works to deliver a potentially shippable product increment.

    Key Principles of Scrum:

    1. Roles: Scrum involves three key roles: the Product Owner, Scrum Master, and Development Team. Each has distinct responsibilities to ensure smooth project progress.

    2. Artifacts: Scrum employs several artifacts, including the Product Backlog, Sprint Backlog, and Increment, to maintain transparency and track progress.

    3. Events: Scrum events, such as Sprint Planning, Daily Standup, Sprint Review, and Sprint Retrospective, provide a structured framework for communication and planning.

    Advantages vs. Disadvantages of Scrum:

    Advantages:

    1. Predictable Delivery: Scrum sprints’ time-boxed nature allows for predictable delivery timelines, making it suitable for projects with fixed deadlines.

    Example: Consider a software development project with a hard release date. By using Scrum, the team can plan sprints to ensure that key features are completed in time for the release. This predictability is crucial when dealing with external commitments.

    2. Continuous Improvement: Regular Sprint Retrospectives encourage teams to reflect on their work and identify areas for improvement, fostering a culture of continuous learning and adaptation.

    3. Clear Prioritization: The Product Backlog prioritizes the most important features and tasks, helping teams focus on delivering value.

    4. Enhanced Collaboration: Cross-functional teams collaborate closely throughout the project, leading to better communication and a shared understanding of project goals.

    Disadvantages:

    1. Rigidity: Scrum’s structured nature can be considered too rigid for some projects. It may not adapt to highly variable workloads or projects with frequently changing requirements.

    2. Role Dependencies: The success of Scrum heavily relies on having skilled and dedicated Product Owners, Scrum Masters, and Development Team members. Without these roles, the methodology’s effectiveness can be compromised.

    3. Learning Curve: Implementing Scrum effectively requires a thorough understanding of its principles and practices. Teams new to Scrum may face a learning curve that impacts productivity initially.

    Understanding Kanban

    Kanban, which means “visual signal” in Japanese, originated at Toyota in the 1940s as a manufacturing system. Kanban is a visual and flow-based method for managing work in project management. Unlike Scrum, Kanban does not prescribe specific roles, time-boxed iterations, or ceremonies. Instead, it provides a flexible framework for visualizing work, setting work-in-progress (WIP) limits, and optimizing workflow.

    Key Principles of Kanban

    1. Visualize Work: Kanban boards display work items in columns representing various workflow stages, clearly representing tasks and their status.

    2. Limit Work in Progress (WIP): Kanban sets WIP limits for each column to prevent overloading team members and maintain a steady workflow.

    3. Manage Flow: Kanban focuses on optimizing workflow, ensuring that tasks move smoothly from one stage to the next.

    Advantages vs. Disadvantages of Kanban:

    Advantages:

    1. Flexibility: Kanban adapts well to changing project requirements and workloads, making it ideal for highly variable projects.

    Example: Imagine a customer support team using Kanban. They may receive a varying number of support requests daily. Kanban allows them to adjust their work capacity based on demand, ensuring they can address urgent issues without overloading their resources.

    2. Reduced Waste: By limiting WIP, Kanban minimizes multitasking and reduces wasted effort, improving efficiency.

    3. Continuous Delivery: Teams using Kanban can continuously release features or products, allowing faster response to customer needs.

    4. Simplicity: Kanban is easy to understand and implement, making it accessible to teams with varying experience levels.

    Disadvantages:

    1. Lack of Structure: While flexibility is a strength, Kanban’s lack of defined roles and ceremonies can be a drawback for teams that thrive in more structured environments. It may lead to a lack of clarity in responsibilities and expectations.

    2. Difficulty in Predictability: Kanban’s focus on continuous flow can make it challenging to predict when specific features or tasks will be completed. This can be problematic for projects with strict deadlines.

    3. Dependency on Team Discipline: Kanban relies heavily on the discipline of team members to follow WIP limits and maintain flow. Without a disciplined team, the system can break down.

    Choosing Between Scrum and Kanban

    The decision between Scrum and Kanban should be based on your project’s specific characteristics, requirements, and organizational culture. To help you make an informed choice, consider the following factors:

    Hybrid Approaches

    In some cases, project teams may opt for hybrid approaches that combine elements of both Scrum and Kanban to suit their unique needs. For example, a team might use Scrum for high-priority feature development while employing Kanban for bug fixes and support requests.

    When to Use Scrum and Kanban Together

    In certain situations, it may be advantageous to use both Scrum and Kanban in conjunction to harness the strengths of both methodologies:

    Conclusion

    Choosing between Scrum and Kanban ultimately depends on your project’s specific needs and characteristics, your team’s experience and culture, and the level of flexibility required. Both methodologies offer valuable tools and principles for managing work effectively, but they do so in different ways.

    Scrum provides structure and predictability through time-boxed sprints, making it suitable for projects with defined requirements and fixed deadlines. On the other hand, Kanban offers flexibility and adaptability, making it an excellent choice for projects with variable workloads and evolving requirements.

    Remember that these are not rigid choices; you can always tailor the methodology to fit your project’s unique circumstances. The key is to stay true to the underlying principles of agility, collaboration, and continuous improvement, regardless of whether you choose Scrum, Kanban, or a hybrid approach. Doing so will increase your chances of project success and deliver value to your stakeholders.

  • Unlocking Legal Insights: Effortless Document Summarization with OpenAI’s LLM and LangChain

    The Rising Demand for Legal Document Summarization:

    • In a world where data, information, and legal complexity are prevalent, the volume of legal documents is growing rapidly. Law firms, legal professionals, and businesses are dealing with an ever-increasing number of legal texts, including contracts, court rulings, statutes, and regulations. 
    • These documents contain important insights, but understanding them can be overwhelming. This is where the demand for legal document summarization comes in. 
    • In this blog, we’ll discuss the increasing need for summarizing legal documents and how modern technology is changing the way we analyze legal information, making it more efficient and accessible.

    Overview of OpenAI and LangChain

    • We’ll use the LangChain framework to build our application with LLMs. These models, powered by deep learning, have been extensively trained on large text datasets. They excel in various language tasks like translation, sentiment analysis, chatbots, and more. 
    • LLMs can understand complex text, identify entities, establish connections, and generate coherent content. We could use Meta’s LLaMA models, OpenAI’s LLMs, or others; in this case, we will use OpenAI’s LLM.

    • OpenAI is a leader in the field of artificial intelligence and machine learning. They have developed powerful Large Language Models (LLMs) that are capable of understanding and generating human-like text.
    •  These models have been trained on vast amounts of textual data and can perform a wide range of natural language processing tasks.

    LangChain is an innovative framework designed to simplify and enhance the development of applications and systems that involve natural language processing (NLP) and large language models (LLMs). 

    It provides a structured and efficient approach for working with LLMs like OpenAI’s GPT-3 and GPT-4 to tackle various NLP tasks. Here’s an overview of LangChain’s key features and capabilities:

    • Modular NLP Workflow: Build flexible NLP pipelines using modular blocks. 
    • Chain-Based Processing: Define processing flows using chain-based structures. 
    • Easy Integration: Seamlessly integrate LangChain with other tools and libraries.
    • Scalability: Scale NLP workflows to handle large datasets and complex tasks. 
    • Extensive Language Support: Work with multiple languages and models. 
    • Data Visualization: Visualize NLP pipeline results for better insights.
    • Version Control: Track changes and manage NLP workflows efficiently. 
    • Collaboration: Enable collaborative NLP development and experimentation.

    Setting Up Environment

    Setting Up Google Colab

    Google Colab provides a powerful and convenient platform for running Python code with the added benefit of free GPU support. To get started, follow these steps:

    1. Visit Google Colab: Open your web browser and navigate to Google Colab.
    2. Sign In or Create a Google Account: You’ll need to sign in with your Google account to use Google Colab. If you don’t have one, you can create an account for free.
    3. Create a New Notebook: Once signed in, click on “New Notebook” to create a new Colab notebook.
    4. Choose Python Version: In the notebook, click on “Runtime” in the menu and select “Change runtime type.” Choose your preferred Python version (usually Python 3) and set the hardware accelerator to “GPU.” Also, make sure to turn on the “Internet” toggle.

    OpenAI API Key Generation

    1. Visit the OpenAI Website: Go to the OpenAI website.
    2. Sign In or Create an Account: Sign in or create a new OpenAI account. 
    3. Generate a New API Key: Access the API section and generate a new API key. 
    4. Name Your API Key: Give your API key a name that reflects its purpose. 
    5. Copy the API Key: Copy the generated API key to your clipboard. 
    6. Store the API Key Safely: Securely store the API key and do not share it publicly.

    Understanding Legal Document Summarization Workflow

    1. Map Step:

    • At the heart of our legal document summarization process is the Map-Reduce paradigm.
    • In the Map step, we treat each legal document individually. Think of it as dissecting a large puzzle into smaller, manageable pieces.
    • For each document, we employ a sophisticated Language Model (LLM). This LLM acts as our expert, breaking down complex legal language and extracting meaningful content.
    • The LLM generates concise summaries for each document section, essentially translating legalese into understandable insights.
    • These individual summaries become our building blocks, our pieces of the puzzle.

    2. Reduce Step:

    • Now, let’s shift our focus to the Reduce step.
    • Here’s where we bring everything together. We’ve generated summaries for all the document sections, and it’s time to assemble them into a cohesive whole.
    • Imagine the Reduce step as the puzzle solver. It takes all those individual pieces (summaries) and arranges them to form the big picture.
    • The goal is to produce a single, comprehensive summary that encapsulates the essence of the entire legal document.

    3. Compression – Ensuring a Smooth Fit:

    • One challenge we encounter is the potential length of these individual summaries. Some legal documents can produce quite lengthy summaries.
    • To ensure a smooth flow within our summarization process, we’ve introduced a compression step.

    4. Recursive Compression:

    • In some cases, even the compressed summaries might need further adjustment.
    • That’s where the concept of recursive compression comes into play.
    • If necessary, we’ll apply compression multiple times, refining and optimizing the summaries until they seamlessly fit into our summarization pipeline.
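    The four steps above can be sketched in plain Python, with simple stand-in functions in place of the LLM calls (summarize, combine, fits, and compress below are hypothetical placeholders, not LangChain APIs):

    ```python
    def summarize_map_reduce(sections, summarize, combine, fits, compress):
        """Toy sketch of the Map -> Compress -> Reduce flow described above."""
        # Map: summarize each document section independently.
        partials = [summarize(s) for s in sections]
        # Recursive compression: shrink the partial summaries until
        # they fit the token budget of the final combining call.
        while not fits(partials):
            partials = [compress(p) for p in partials]
        # Reduce: merge the partial summaries into one final summary.
        return combine(partials)

    # Toy stand-ins: "summarize" keeps the first three characters,
    # "compress" trims to two, and "fits" enforces a 4-character budget.
    result = summarize_map_reduce(
        sections=["aaaa", "bbbb"],
        summarize=lambda s: s[:3],
        combine=lambda ps: " ".join(ps),
        fits=lambda ps: sum(len(p) for p in ps) <= 4,
        compress=lambda p: p[:2],
    )
    print(result)  # → "aa bb"
    ```

    In the real pipeline each lambda is an LLM call, but the control flow is the same: map, compress until things fit, then reduce.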

    Let’s Get Started

    Step 1: Installing python libraries

    Create a new notebook in Google Colab and install the required Python libraries.

    !pip install openai langchain tiktoken

    OpenAI: Installed to access OpenAI’s powerful language models for legal document summarization.

    LangChain: Essential for implementing document mapping, reduction, and combining workflows efficiently.

    Tiktoken: Helps manage token counts within text data, ensuring efficient usage of language models and avoiding token limit issues.

    Step 2: Adding OpenAI API key to Colab

    Integrate your OpenAI API key in Google Colab Secrets, then read it with Colab’s userdata helper.

    from google.colab import userdata
    API_KEY = userdata.get("YOUR_SECRET_KEY_NAME")

    Step 3: Initializing OpenAI LLM

    Here, we import the OpenAI module from LangChain and initialize it with the provided API key to utilize advanced language models for document summarization.

    from langchain.llms import OpenAI
    llm = OpenAI(openai_api_key=API_KEY)

    Step 4: Splitting text by Character

    The Text Splitter, in this case, overcomes the token limit by breaking down the text into smaller chunks that are each within the token limit. This ensures that the text can be processed effectively by the language model without exceeding its token capacity. 

    The “chunk_overlap” parameter allows for some overlap between chunks to ensure that no information is lost during the splitting process.

    from langchain.text_splitter import CharacterTextSplitter
    text_splitter = CharacterTextSplitter.from_tiktoken_encoder(
        chunk_size=1000, chunk_overlap=120
    )
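    To see what chunking with overlap means, here is a rough character-level sketch (a simplification; the real CharacterTextSplitter splits on separators and, with from_tiktoken_encoder, measures sizes in tokens rather than characters):

    ```python
    def split_with_overlap(text, chunk_size=1000, chunk_overlap=120):
        """Naive fixed-size chunking: each chunk repeats the last
        `chunk_overlap` characters of the previous one."""
        chunks, start = [], 0
        step = chunk_size - chunk_overlap
        while start < len(text):
            chunks.append(text[start:start + chunk_size])
            start += step
        return chunks

    parts = split_with_overlap("x" * 2500)
    print(len(parts))  # → 3 chunks, of lengths 1000, 1000, and 740
    ```

    The overlap means the end of one chunk reappears at the start of the next, so a sentence cut at a boundary is still seen whole in at least one chunk.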

    Step 5 : Loading PDF documents

    from langchain.document_loaders import PyPDFLoader
    def chunks(pdf_file_path):
        loader = PyPDFLoader(pdf_file_path)
        docs = loader.load_and_split()
        return docs

    It initializes a PyPDFLoader object named “loader” using the provided PDF file path. This loader is responsible for loading and processing the contents of the PDF file. 

    It then uses the “loader” to load and split the PDF document into smaller “docs” or document chunks. These document chunks likely represent different sections or pages of the PDF file. 

    Finally, it returns the list of document chunks, making them available for further processing or analysis.

    Step 6: Map Reduce Prompt Templates

    Import libraries required for the implementation of LangChain MapReduce.

    from langchain.chains.mapreduce import MapReduceChain
    from langchain.text_splitter import CharacterTextSplitter
    from langchain.chains import ReduceDocumentsChain, MapReduceDocumentsChain
    from langchain import PromptTemplate
    from langchain.chains import LLMChain
    from langchain.chains.combine_documents.stuff import StuffDocumentsChain

    map_template = """The following is a set of documents
    {docs}
    Based on this list of docs, please identify and summarise the main themes.
    Helpful Answer:"""
    
    map_prompt = PromptTemplate.from_template(map_template)
    map_chain = LLMChain(llm=llm, prompt=map_prompt)
    
    reduce_template = """The following is a set of summaries:
    {doc_summaries}
    Take these and distill them into a final consolidated summary with a title (mandatory) in bold and the important key points.
    Helpful Answer:"""
    
    reduce_prompt = PromptTemplate.from_template(reduce_template)
    reduce_chain = LLMChain(llm=llm, prompt=reduce_prompt)

    Template Definition: 

    The code defines two templates, map_template and reduce_template, which serve as structured prompts for instructing a language model on how to process and summarise sets of documents. 

    LLMChains for Mapping and Reduction: 

    Two LLMChains, map_chain, and reduce_chain, are configured with these templates to execute the mapping and reduction steps in the document summarization process, making it more structured and manageable.

    Step 7 : Map and Reduce LLM Chains

    combine_documents_chain = StuffDocumentsChain(
        llm_chain=reduce_chain, document_variable_name="doc_summaries"
    )
    reduce_documents_chain = ReduceDocumentsChain(
        combine_documents_chain=combine_documents_chain,
        collapse_documents_chain=combine_documents_chain,
        token_max=5000,
    )

    map_reduce_chain = MapReduceDocumentsChain(
        llm_chain=map_chain,
        reduce_documents_chain=reduce_documents_chain,
        document_variable_name="docs",
        return_intermediate_steps=False,
    )

    Combining Documents Chain (combine_documents_chain): 

    • This chain plays a crucial role in the document summarization process. It takes the individual legal document summaries, generated in the “Map” step, and combines them into a single, cohesive text string. 
    • By consolidating the summaries, it prepares the data for further processing in the “Reduce” step. The resulting combined document string is assigned the variable name “doc_summaries.” 

    Reduce Documents Chain (reduce_documents_chain): 

    • This chain represents the final phase of the summarization process. Its primary function is to take the combined document string from the combine_documents_chain and perform in-depth reduction and summarization. 
    • To address potential issues related to token limits (where documents may exceed a certain token count), this chain offers a clever solution. It can recursively collapse or compress lengthy documents into smaller, more manageable chunks. 
    • This ensures that the summarization process remains efficient and avoids token limit constraints. The maximum token limit for each chunk is set at 5,000 tokens, helping control the size of the summarization output. 

    Map-Reduce Documents Chain (map_reduce_chain): 

    • This chain follows the well-known MapReduce paradigm, a framework often used in distributed computing for processing and generating large datasets. In the “Map” step, it employs the map_chain to process each individual legal document. 
    • This results in initial document summaries. In the subsequent “Reduce” step, the chain uses the reduce_documents_chain to consolidate these initial summaries into a final, comprehensive document summary. 
    • The summarization result, representing the distilled insights from the legal documents, is stored in the variable named “docs” within the LLM chain. 

    Step 8: Summarization Function

    def summarize_pdf(file_path):
        split_docs = text_splitter.split_documents(chunks(file_path))
        return map_reduce_chain.run(split_docs)
    
    result_summary = summarize_pdf(file_path)
    print(result_summary)

    Our summarization process centers around the ‘summarize_pdf’ function. This function takes a PDF file path as input and follows a two-step approach. 

    First, it splits the PDF into manageable sections using the ‘text_splitter’ module. Then, it runs the ‘map_reduce_chain,’ which handles the summarization process. 

    By providing the PDF file path as input, you can easily generate a concise summary of the legal document within the Google Colab environment, thanks to LangChain and LLM.

    Output

    1. Original Document – https://www.safetyforward.com/docs/legal.pdf

    This document is about not using mobile phones while driving a motor vehicle and prohibits disabling its motion restriction features.

    Summarization –

    2. Original Document – https://static.abhibus.com/ks/pdf/Loan-Agreement.pdf

    India and the International Bank for Reconstruction and Development have formed an agreement for the Sustainable Urban Transport Project, focusing on sustainable transportation while adhering to anti-corruption guidelines.

    Summarization –

    Limitations:

    Complex Legal Terminology: 

    LLMs may struggle with accurately summarizing documents containing intricate legal terminology, which requires domain-specific knowledge to interpret correctly. 

    Loss of Context: 

    Summarization processes, especially in lengthy legal documents, may result in the loss of important contextual details, potentially affecting the comprehensiveness of the summaries. 

    Inherent Bias: 

    LLMs can inadvertently introduce bias into summaries based on the biases present in their training data. This is a critical concern when dealing with legal documents that require impartiality. 

    Document Structure: 

    Summarization models might not always understand the hierarchical or structural elements of legal documents, making it challenging to generate summaries that reflect the intended structure.

    Limited Abstraction: 

    LLMs excel at generating detailed summaries, but they may struggle with abstracting complex legal arguments, which is essential for high-level understanding.

    Conclusion:

    • In a nutshell, this project uses LangChain and OpenAI’s LLM to bring in a fresh way of summarizing legal documents. This collaboration makes legal document management more accurate and efficient.
    • However, we faced some big challenges, like handling lots of legal documents and dealing with AI bias. As we move forward, we need to find new ways to make our automated summarization even better and meet the demands of the legal profession.
    • In the future, we’re committed to improving our approach. We’ll focus on fine-tuning algorithms for more accuracy and exploring new techniques, like combining different methods, to keep enhancing legal document summarization. Our aim is to meet the ever-growing needs of the legal profession.
  • Unlocking Key Insights in NATS Development: My Journey from Novice to Expert – Part 1

    By examining my personal journey from a NATS novice to mastering its intricacies, this long-form article aims to showcase the importance and applicability of NATS in the software development landscape. Through comprehensive exploration of various topics, readers will gain a solid foundation, advanced techniques, and best practices for leveraging NATS effectively in their projects.

    Introduction

    Our topic for today is how to get started with NATS. We assume you already know why you need NATS and want to learn its concepts, along with a walkthrough of how to deploy those concepts/components in your organization.

    The first part covers the basic concepts, an installation and setup guide, admin-related CRUD operations, and shell scripts that might not be needed immediately but are good to have in your arsenal. The second part will be more developer-focused: applying NATS in applications. Let’s begin.

    Understanding NATS

    In this section, we will delve into the fundamentals of NATS and its key components.

    A. Definition and Overview

    NATS (originally short for “Neural Autonomic Transport System”) is a lightweight, high-performance messaging system known for its simplicity and scalability. It enables the exchange of messages between applications in a distributed architecture, allowing for seamless communication and increased efficiency.

    B. Architecture Diagram

    To better understand the inner workings of NATS, let’s take a closer look at its architecture. The diagram below illustrates the key components involved in a typical NATS deployment:

    C. Key Features

    NATS offers several key features that make it a powerful messaging system. These include:

    • Publish-Subscribe Model: NATS follows a publish-subscribe model where publishers send messages to subjects and subscribers receive those messages based on their interest in specific subjects. This model allows for flexible and decoupled communication between different parts of an application or across multiple applications.
    • Scalability: With support for horizontal scaling, NATS can handle high loads of message traffic, making it suitable for large-scale systems.
    • Performance: NATS is built for speed, providing low-latency message delivery and high throughput.
    • Reliability: NATS ensures that messages are reliably delivered to subscribers, even in the presence of network interruptions or failures.
    • Security: NATS supports secure communication through various authentication and encryption mechanisms, protecting sensitive data.

    D. Use Cases and Applications

    NATS’ simplicity and versatility make it suitable for a wide range of use cases and applications. Some common use cases include:

    • Real-time data streaming and processing
    • Event-driven architectures
    • Microservices communication
    • IoT (Internet of Things) systems
    • Distributed systems and cloud-native applications

    E. Concepts

    To better grasp the various components and terminologies associated with NATS, let’s explore some key concepts:

    1. NATS server: The NATS server acts as the central messaging infrastructure, responsible for routing messages between publishers and subscribers.
    2. NATS CLI: The NATS command-line interface (CLI) is a tool that provides developers with a command-line interface to interact with the NATS server and perform various administrative tasks.
    3. NATS clients: NATS clients are distinct from the CLI. A client is an API/code-based way to access the NATS server from an application. Clients are not as powerful as the CLI but are used within source code to achieve a specific goal. We won’t be covering them here, as they are out of scope.
    4. Routes: Routes allow NATS clusters to bridge and share messages with other nodes within and outside clusters, enabling communication across geographically distributed systems.
    5. Accounts: Accounts in NATS provide isolation and access control mechanisms, ensuring that messages are exchanged securely and only between authorized parties.
    6. Gateway: Gateways list all the servers in different clusters that you want to connect in order to create a supercluster.
    7. SuperCluster: SuperCluster is a powerful feature that allows scaling NATS horizontally across multiple clusters, providing enhanced performance and fault tolerance.

    F. System Requirements

    Before diving into NATS, it’s important to ensure that our system meets the necessary requirements. The system requirements for NATS will vary depending on the specific deployment scenario and use case. However, in general, the minimum requirements include:

    Hardware:

    Network:

    • All the VMs should be part of the same cluster.
    • Ports 4222, 8222, 4248, and 7222 should be open for inter-server and client connections.
    • Whitelisting of GitHub EMU account on prod servers (Phase 2).
    • Get AVI VIP for all the clusters from the network team.

    Logs:

    By default, logs will be disabled, but the configuration file will have placeholders for logs enablement. Some of the important changes include:

    • debug: Shows system logs in verbose mode.
    • trace: Records every message processed by NATS.
    • logtime, logfile_size_limit, log_file: As the names suggest, these control timestamping of log entries, the size limit of an individual log file (once reached, NATS rotates it automatically), and the name of the log file, respectively.

    TLS:

    I will show how to configure and use the certs. Keep in mind that this setup targets a development environment, allowing more flexibility in explaining and executing things.

    Getting Started with NATS

    In this section, we will guide you through the installation and setup process for NATS.

    Building the Foundation

    First, we will focus on building a strong foundation in NATS by understanding its core concepts and implementing basic messaging patterns.

    A. Understanding NATS Subjects

    In NATS, subjects serve as identifiers that help publishers and subscribers establish communication channels. They are represented as hierarchical strings, allowing for flexibility in message routing and subscription matching.

    B. Exploring Messages, Publishers, and Subscribers

    Messages are the units of data exchanged between applications through NATS. Publishers create and send messages, while subscribers receive and process them based on their subscribed subjects of interest.

    C. Implementing Basic Pub/Sub Pattern

    The publish-subscribe pattern is a fundamental messaging pattern in NATS. It allows publishers to distribute messages to multiple subscribers interested in specific subjects, enabling decoupled and efficient communication between different parts of the system.
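    To make this concrete, here is a minimal sketch using the NATS CLI (installed later in this guide); it assumes a reachable nats-server and a selected context, and the subject name is illustrative:

```shell
# Terminal 1: subscribe to a subject
nats sub "orders.new"

# Terminal 2: publish to the same subject;
# every subscriber interested in "orders.new" receives the message
nats pub "orders.new" "order created"
```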

    D. JetStream

    JetStream is an advanced addition to NATS that provides durable, persistent message storage and retention policies. It is designed to handle high-throughput streaming scenarios while ensuring data integrity and fault tolerance.
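    As a sketch of what this looks like in practice (once the cluster below is running), a durable, file-backed stream can be created with the NATS CLI. The stream name, subjects, and limits here are illustrative, and the CLI will prompt interactively for any options not supplied:

```shell
# Create a replicated, file-backed stream capturing all "orders.>" subjects
nats stream add ORDERS \
  --subjects "orders.>" \
  --storage file \
  --retention limits \
  --replicas 3 \
  --max-age 24h
```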

    E. Single Cluster vs. SuperCluster

    NATS supports both single clusters and superclusters. Single clusters are ideal for smaller deployments, whereas superclusters provide the ability to horizontally scale NATS across multiple clusters, enhancing performance and fault tolerance.

    Implementation

    As this blog covers deploying NATS from an admin perspective, we will use only shell scripts for this purpose.

    Let’s start with the implementation process:

    Prerequisite

    These commands need to be run on all the servers hosting nats-server. In this blog, we will set up a 3-node cluster running JetStream.

    Installing the NATS CLI and nats-server:

    mkdir -p rpm

    # NATSCLI

    curl -o rpm/nats-0.0.35.rpm  -L https://github.com/nats-io/natscli/releases/download/v0.0.35/nats-0.0.35-amd64.rpm

    sudo yum install -y rpm/nats-0.0.35.rpm

    # NATS-server

    curl -o rpm/nats-server-2.9.20.rpm  -L https://github.com/nats-io/nats-server/releases/download/v2.9.20/nats-server-v2.9.20-amd64.rpm

    sudo yum install -y rpm/nats-server-2.9.20.rpm

    Local Machine Setup for JetStream:

    # Create User

    sudo useradd --system --home /nats --shell /bin/false nats

    # Jetstream Storage

    sudo mkdir -p /nats/storage

    # Certs

    sudo mkdir -p /nats/certs

    # Logs

    sudo mkdir -p /nats/logs

    # Setting Right Permission

    sudo chown --recursive nats:nats /nats

    sudo chmod 777 /nats

    sudo chmod 777 /nats/storage

    Next, we will create the service file in the servers at /etc/systemd/system/nats.service

    sudo bash -c 'cat <<EOF > /etc/systemd/system/nats.service

    [Unit]

    Description=NATS Streaming Daemon

    Requires=network-online.target

    After=network-online.target

    ConditionFileNotEmpty=/nats/nats.conf

    [Service]

    #Type=notify

    User=nats

    Group=nats

    ExecStart=/usr/local/bin/nats-server -config=/nats/nats.conf

    #KillMode=process

    Restart=always

    RestartSec=10

    StandardOutput=syslog

    StandardError=syslog

    #TimeoutSec=900

    #LimitNOFILE=65536

    #LimitMEMLOCK=infinity

    [Install]

    WantedBy=multi-user.target

    EOF'

    The full file will look like this:

    #!/bin/bash
    
    mkdir -p rpm
    
    # NATSCLI
    curl -o rpm/nats-0.0.35.rpm  -L https://github.com/nats-io/natscli/releases/download/v0.0.35/nats-0.0.35-amd64.rpm
    sudo yum install -y rpm/nats-0.0.35.rpm
    
    # NATS-server
    curl -o rpm/nats-server-2.9.20.rpm  -L https://github.com/nats-io/nats-server/releases/download/v2.9.20/nats-server-v2.9.20-amd64.rpm
    sudo yum install -y rpm/nats-server-2.9.20.rpm
    
    # Create User
    sudo useradd --system --home /nats --shell /bin/false nats
    
    # Jetstream Storage
    sudo mkdir -p /nats/storage
    
    # Certs
    sudo mkdir -p /nats/certs
    
    # Logs
    sudo mkdir -p /nats/logs
    
    # Setting Right Permission
    sudo chown --recursive nats:nats /nats
    sudo chmod 777 /nats
    sudo chmod 777 /nats/storage
    sudo bash -c 'cat <<EOF > /etc/systemd/system/nats.service
    [Unit]
    Description=NATS Streaming Daemon
    Requires=network-online.target
    After=network-online.target
    ConditionFileNotEmpty=/nats/nats.conf
    [Service]
    #Type=notify
    User=nats
    Group=nats
    ExecStart=/usr/local/bin/nats-server -config=/nats/nats.conf
    #KillMode=process
    Restart=always
    RestartSec=10
    StandardOutput=syslog
    StandardError=syslog
    #TimeoutSec=900
    #LimitNOFILE=65536
    #LimitMEMLOCK=infinity
    [Install]
    WantedBy=multi-user.target
    EOF'

    Create the conf file on all the servers in the /nats directory.

    Server setup

    server_name=nts0

    listen: <IP/DNS-First>:4222 # For other servers edit the IP/DNS remaining in the cluster

    https: <DNS-First>:8222

    #http: <IP/DNS-First>:8222 # Uncomment this if you are running without tls certs

    JetStream Configuration

    jetstream {

      store_dir=/nats/storage

      max_mem_store: 6GB

      max_file_store: 90GB

    }

    Intra Cluster Setup

    cluster {

      name: dev-nats # Super Cluster should have unique Cluster names

      host: <IP/DNS-First>

      port: 4248

      routes = [

        nats-route://<IP/DNS-First>:4248

        nats-route://<IP/DNS-Second>:4248

        nats-route://<IP/DNS-Third>:4248

      ]

    }

    Account Setup

    accounts: {

      $SYS: {

        users: [

          { user: admin, password: password }

        ]

      },

      B: {

        users: [

          {user: b, password: b}

        ],

        jetstream: enabled,

        imports: [

    # {stream: {account: "$G"}}

        ]

      },

      C: {

        users: [

          {user: c, password: c}

        ],

        jetstream: enabled,

        imports: [

        ]

      },

      E: {

        users: [

          {user: e, password: e}

        ],

        jetstream: enabled,

        imports: [

        ]

      }

    }

    no_auth_user: e # Unique per cluster: a user that does not need a password, enabling a local account in the supercluster

    We can use “Accounts” to provide local and global stream separation. The configuration is identical across clusters except for no_auth_user, which must be unique per cluster; this makes the corresponding stream accessible only from the given cluster, without the need to provide credentials explicitly.

    Gateway Setup: 

    The intra-cluster/route setup and account setup remain similar and must also be present in the other cluster, which has the cluster name “new-dev-nats.”

    gateway {

      name: dev-nats

      listen: <IP/DNS-First>:7222

      gateways: [

        {name: dev-nats, urls: [nats://<IP/DNS-First>:7222, nats://<IP/DNS-Second>:7222, nats://<IP/DNS-Third>:7222]},

        {name: new-dev-nats, urls: [nats://<NEW-IP/DNS-First>:7222, nats://<NEW-IP/DNS-Second>:7222, nats://<NEW-IP/DNS-Third>:7222]}

      ]

    }

    TLS setup

    tls: {

      cert_file: "/nats/certs/natsio.crt"

      key_file: "/nats/certs/natsio.key"

      ca_file: "/nats/certs/natsio_rootCA.pem"

    }

    NOTE: no_auth_user is a special directive within NATS. If you keep its value unique per cluster (user e in this configuration), you get a “local account” setup in the supercluster. This is beneficial when you want to publish data that should not be accessible from any other cluster.

    Complete conf file on <IP/DNS-First> machine would look like this:

    # `server_name`: Unique name for your node; attaching a number with increment value is recommended
    # listen: DNS name for the current node:4222
    # https: DNS name for the current node:8222
    # cluster.name: This is the name of your cluster. It is compulsory for them to be the same across all nodes.
    # cluster.host: DNS name for the current node
    # cluster.routes: List of all the DNS entries which will be part of the cluster in separate lines:4248
    # account.user: Make sure to use proper names here and also keep the same across all the nodes which will be involved as a super cluster
    # no_auth_user: To be unique for individual cluster
    # gateway.name: Should be for the current cluster the node is part of. (Best to match with cluster.name mentioned above)
    # gateway.listen: The same logic mentioned for listen is applicable here with port 7222
    # gateways: Mention all the nodes in all the clusters here, with nodes separated logically by the cluster they are part of via name
    # tls: Make sure to have the certs ready to place at /nats/certs
    
    server_name=nts0
    listen: <IP/DNS-First>:4222 # For other servers edit the IP/DNS remaining in the cluster
    https: <DNS-First>:8222
    #http: <IP/DNS-First>:8222 # Uncomment this if you are running without tls certs
    
    jetstream {
      store_dir=/nats/storage
      max_mem_store: 6GB
      max_file_store: 90GB
    }
    
    cluster {
      name: dev-nats # Super Cluster should have unique Cluster names
      host: <IP/DNS-First>
      port: 4248
      routes = [
        nats-route://<IP/DNS-First>:4248
        nats-route://<IP/DNS-Second>:4248
        nats-route://<IP/DNS-Third>:4248
      ]
    }
    
    accounts: {
      $SYS: {
        users: [
          { user: admin, password: password }
        ]
      },
      B: {
        users: [
          {user: b, password: b}
        ],
        jetstream: enabled,
        imports: [
        # {stream: {account: "$G"}}
        ]
      },
      C: {
        users: [
          {user: c, password: c}
        ],
        jetstream: enabled,
        imports: [
        ]
      },
      E: {
        users: [
          {user: e, password: e}
        ],
        jetstream: enabled,
        imports: [
        ]
      }
    }
    
    no_auth_user: e # Unique per cluster: a user that does not need a password, enabling a local account in the supercluster
    
    gateway {
      name: dev-nats
      listen: <IP/DNS-First>:7222
      gateways: [
        {name: dev-nats, urls: [nats://<IP/DNS-First>:7222, nats://<IP/DNS-Second>:7222, nats://<IP/DNS-Third>:7222]},
        {name: new-dev-nats, urls: [nats://<NEW-IP/DNS-First>:7222, nats://<NEW-IP/DNS-Second>:7222, nats://<NEW-IP/DNS-Third>:7222]}
      ]
    }
    
    tls: {
      cert_file: "/nats/certs/natsio.crt"
      key_file: "/nats/certs/natsio.key"
      ca_file: "/nats/certs/natsio_rootCA.pem"
    }

    Recap on Conf File Changes

    The configuration file on all the nodes in all the environments will need to be updated to support “gateway” and “accounts.”

    • Individual changes need to be made to each conf file.
    • The gateway changes are nearly identical everywhere, except for the name, which is specific to the local cluster the given node is part of.
    • The accounts changes are nearly identical everywhere, except for the no_auth_user parameter, which is specific to the local cluster the given node is part of.
    • The “nats-server --signal reload” command should pick up the changes.

    Starting Service

    After adding the certs, re-own the files:

    sudo chown --recursive nats:nats /nats

    Creating firewall rules:

    sudo firewall-cmd --permanent --add-port=4222/tcp

    sudo firewall-cmd --permanent --add-port=8222/tcp

    sudo firewall-cmd --permanent --add-port=4248/tcp

    sudo firewall-cmd --permanent --add-port=7222/tcp

    sudo firewall-cmd --reload

    Start the service:

    sudo systemctl start nats.service

    sudo systemctl enable nats.service

    Check status:

    sudo systemctl status nats.service -l

    Note: Remember to check logs of status commands in node2 and node3; it should show a connection with node1 and also confirm that node1 has been made the leader.
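    A quick way to verify this from any node is the NATS CLI with the system account credentials defined in the accounts block (admin/password in this setup); a sketch:

```shell
# Ping every server in the cluster
nats server ping --user admin --password password

# Per-server JetStream report; shows which node is the meta leader
nats server report jetstream --user admin --password password
```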

    Setting up the context:

    Setting up context will help us in managing our cluster better with NATSCLI.

    # pass the --tlsca flag in dev because we do not have the DNS registered. In staging and Production the `tlsca` flag will not be needed because certs will be registered.

    nats context add nats --server <IP/DNS-First>:4222,<IP/DNS-Second>:4222,<IP/DNS-Third>:4222 --description "Awesome Nats Servers List" --tlsca /nats/certs/natsio_rootCA.pem --select

    nats context ls

    nats account info

    The complete file for starting the service would look like this:

    #!/bin/bash
    
    # Own the files
    sudo chown --recursive nats:nats /nats
    
    # Create Firewall Rules
    sudo firewall-cmd --permanent --add-port=4222/tcp
    sudo firewall-cmd --permanent --add-port=8222/tcp
    sudo firewall-cmd --permanent --add-port=4248/tcp
    sudo firewall-cmd --permanent --add-port=7222/tcp
    sudo firewall-cmd --reload
    
    # Start Service
    sudo systemctl start nats.service
    sudo systemctl enable nats.service
    sudo systemctl status nats.service -l
    
    # Setup Context
    # pass the --tlsca flag in dev because we do not have the DNS registered. In staging and Production the `tlsca` flag will not be needed because certs will be registered.
    nats context add nats --server <IP/DNS-First>:4222,<IP/DNS-Second>:4222,<IP/DNS-Third>:4222 --description "Awesome Nats Servers List" --tlsca /nats/certs/natsio_rootCA.pem --select
    
    nats context ls
    nats account info

    Validation: 

    The nats account info command should rotate among the servers shown in the Connected URL string, confirming that the client is balancing across the cluster.

    Stream Listing:

    Streams that are available across the regions require credentials, and the creds should be common across all clusters. Firing the same command from a different cluster returns the same info.

    Local streams live under the no_auth_user and can be fetched without credentials; firing the same command (without credentials) from a different cluster returns that cluster’s own, different local streams.
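    As a sketch, the commands involved look like this, using the users from the accounts block above (b as the shared account):

```shell
# Global streams: authenticate as the shared account user "b";
# the same output should appear from any cluster
nats stream ls --user b --password b

# Local streams: no credentials, so the connection falls through to this
# cluster's no_auth_user; each cluster returns its own local streams
nats stream ls
```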

    Advanced Messaging Patterns with NATS

    In this section, we will explore advanced messaging patterns that leverage the capabilities of NATS for more complex communication scenarios.

    A. Request-Reply Pattern

    The request-reply pattern allows applications to send requests and receive corresponding responses through NATS. It enables synchronous communication, making it suitable for scenarios where immediate responses are required.
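    A minimal sketch with the NATS CLI (assuming a running server; the subject name is illustrative):

```shell
# Terminal 1: register a responder that answers requests on a subject
nats reply "greet.request" "hello from the responder"

# Terminal 2: send a request and wait synchronously for the reply
nats request "greet.request" "hi"
```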

    B. Publish-Subscribe Pattern with Wildcards

    NATS introduces the concept of wildcards to the publish-subscribe pattern, allowing subscribers to receive messages based on pattern matching. This enables greater flexibility in subscription matching and expands the possibilities of message distribution.
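    To illustrate the matching rules concretely (`*` matches exactly one token, `>` matches one or more trailing tokens), here is a pure-bash sketch of the semantics; it mirrors the rules conceptually and is not NATS code:

```shell
#!/bin/bash
# Illustrative re-implementation of NATS subject matching semantics.
# Subjects are dot-separated tokens, e.g. "orders.eu.new".

subject_matches() {
  local pattern="$1" subject="$2"
  local -a p s
  IFS='.' read -r -a p <<< "$pattern"
  IFS='.' read -r -a s <<< "$subject"
  local i
  for ((i = 0; i < ${#p[@]}; i++)); do
    if [[ "${p[i]}" == ">" ]]; then
      # '>' consumes one or more remaining tokens
      (( i < ${#s[@]} )) && return 0 || return 1
    fi
    (( i >= ${#s[@]} )) && return 1                       # subject ran out of tokens
    [[ "${p[i]}" == "*" || "${p[i]}" == "${s[i]}" ]] || return 1
  done
  (( ${#p[@]} == ${#s[@]} ))                              # no '>': lengths must match
}

subject_matches "orders.*" "orders.new"    && echo "match"     # match
subject_matches "orders.>" "orders.eu.new" && echo "match"     # match
subject_matches "orders.*" "orders.eu.new" || echo "no match"  # no match
```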

    C. Queue Groups for Load Balancing and Fault Tolerance

    Queue groups provide load balancing and fault tolerance capabilities in NATS. By grouping subscribers together, NATS ensures that messages are distributed evenly across the subscribers within the group, preventing any single subscriber from being overwhelmed.
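    A sketch with the NATS CLI (assuming a running server; the subject and group names are illustrative):

```shell
# Run this in two or more terminals: all subscribers join the queue group
# "workers", and each message is delivered to only ONE member of the group
nats sub "orders.new" --queue workers

# Publish several messages; they are load-balanced across the group
nats pub "orders.new" "order {{Count}}" --count 10
```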

    Overcoming Real-World Challenges

    In this section, we will discuss real-world challenges that developers may encounter when working with NATS and explore strategies to overcome them.

    A. Scalability and High Availability in NATS

    As applications grow and message traffic increases, scalability and high availability become crucial considerations. NATS offers various techniques and features to address these challenges, including clustering, load balancing, and fault tolerance mechanisms.

    B. Securing NATS Communication

    Security is paramount in any messaging system, and NATS provides several mechanisms to secure communication. These include authentication, encryption, access control, and secure network configurations.

    C. Monitoring and Debugging Techniques

    Efficiently monitoring and troubleshooting a NATS deployment is essential for maintaining system health. NATS provides tools and techniques to monitor message traffic, track performance metrics, and identify and resolve potential issues in real time.
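    In this setup the monitoring port is 8222 (the https: line in the conf), and it exposes JSON endpoints that can be polled with curl; a sketch using the placeholder hostnames from above:

```shell
curl -s https://<DNS-First>:8222/varz     # general server stats
curl -s https://<DNS-First>:8222/connz    # client connections
curl -s https://<DNS-First>:8222/routez   # cluster routes
curl -s https://<DNS-First>:8222/gatewayz # supercluster gateways
curl -s https://<DNS-First>:8222/jsz      # JetStream usage
```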

    Recovery Scenarios in NATS 

    This section is intended to help in scenarios when NATS services are not usable. Scenarios such as Node failure, Not reachable, Service down, or region down are some examples of such a situation.

    Summary

    In this article, we have embarked on a journey from being a NATS novice to mastering its intricacies. We have explored the importance and applicability of NATS in the software development landscape. Through a comprehensive exploration of NATS’ definition, architecture, key features, and use cases, we have built a strong foundation in NATS. We have also examined advanced messaging patterns and discussed strategies to overcome real-world challenges in scalability, security, and monitoring. Furthermore, we have delved into the Recovery scenarios, which might come in handy when things don’t behave as expected. Armed with this knowledge, developers can confidently utilize NATS to unlock its full potential in their projects.