  • The Next Phase of FinOps: 3 AI-Powered Moves That Matter

    Cloud costs rarely spiral out of control overnight. More often, they drift quietly and steadily until finance teams are left explaining overruns and engineering teams are asked to “optimize” after the fact.

    This reactive approach to FinOps is becoming harder to sustain. Cloud environments today are far more dynamic than the tools and processes designed to manage them. Monthly reviews, static rules, and backward-looking reports simply cannot keep up.

    This is where AI-driven FinOps steps in. Not as another dashboard, but as the next evolution of FinOps itself, one that helps teams predict what’s coming, prevent waste before it happens, and continuously improve performance.

    From Cost Visibility to Cost Intelligence

    Traditional FinOps gives you visibility. You can see where money is being spent, which teams own which resources, and how costs trend over time. That foundation still matters.

    But visibility alone doesn’t answer the questions that really matter now:

    • Where is spend likely to increase next?
    • Which workloads are behaving differently than expected?
    • What should teams act on today, not at the end of the month?

    AI adds intelligence to FinOps by connecting historical patterns with real-time data. Instead of just reporting on spend, AI helps teams understand why costs are changing and what to do about it.

    Predict: Forecasting That Keeps Up with Change

    Forecasting cloud spend has always been difficult. Usage shifts with new releases, customer demand, and infrastructure changes, often making static forecasts outdated almost as soon as they’re created.

    AI-driven FinOps improves this by:

    • Continuously forecasting spend using live usage data
    • Learning from patterns like seasonality and growth trends
    • Adjusting predictions as workloads and architectures evolve

    The result is forecasting that feels less like guesswork and more like guidance. Finance teams gain clearer budget visibility, while engineering teams better understand how their decisions shape future costs.
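
    To make this concrete, here is a toy sketch of a continuously updated forecast, not any particular product’s model: each day’s actual spend is blended into a running estimate using simple exponential smoothing. The spend figures and smoothing factor are hypothetical.

    # Toy continuous forecast: blend each day's actual spend into the estimate.
    # All numbers below are hypothetical.
    def update_forecast(forecast, actual, alpha=0.3):
        """Exponential smoothing: weight recent spend by alpha."""
        return alpha * actual + (1 - alpha) * forecast

    daily_spend = [1200.0, 1180.0, 1250.0, 1400.0, 1390.0]  # example USD/day
    forecast = daily_spend[0]
    for actual in daily_spend[1:]:
        forecast = update_forecast(forecast, actual)
        print(f"actual={actual:.0f}  next-day forecast={forecast:.0f}")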

    Prevent: Catching Anomalies Before They Become Problems

    In many organizations, cost anomalies are discovered only after the bill arrives. By then, teams are already behind.

    AI changes that dynamic. By learning what “normal” looks like for each workload, AI-powered FinOps tools can spot unusual behavior as it happens, whether it’s a sudden traffic spike, a misconfigured autoscaling rule, or resources running idle longer than expected.

    Even more important, these alerts are contextual. They don’t just flag a spike; they explain where it’s coming from and why it matters. That clarity helps teams respond faster, with less finger-pointing and fewer manual investigations.
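
    To illustrate the underlying idea, here is a minimal sketch of learning “normal” for a workload and flagging deviations. The figures are hypothetical, and real tools layer on the context described above, such as the resource, owner, and likely cause.

    import statistics

    # Flag a cost reading that sits more than `threshold` standard deviations
    # from the workload's baseline. Numbers below are hypothetical.
    def is_anomaly(history, latest, threshold=3.0):
        mean = statistics.mean(history)
        stdev = statistics.stdev(history)
        return stdev > 0 and abs(latest - mean) / stdev > threshold

    hourly_cost = [42.0, 44.1, 43.5, 41.9, 44.8, 43.2]  # example baseline, USD/hour
    print(is_anomaly(hourly_cost, latest=95.0))  # True: a spike worth investigating
    print(is_anomaly(hourly_cost, latest=43.7))  # False: within normal range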

    Perform: Continuous Optimization, Not Periodic Cleanup

    FinOps works best when finance and engineering operate as partners, not gatekeepers and enforcers. AI makes that collaboration easier by translating complex cost data into insights each team can act on.

    With predictive insights in place:

    • Finance teams can focus on planning and accountability, not policing
    • Engineering teams can design with cost in mind, without slowing delivery
    • Optimization becomes ongoing, not something squeezed into quarterly reviews

    Savings are identified earlier, responses are faster, and performance goals stay intact, all without adding operational overhead.

    Case Study: Optimizing Petabyte-Scale Workloads for Cost and Continuity

    The value of AI-driven FinOps becomes clear at scale.

    A content-intelligence platform processing petabytes of data every day needed to control cloud costs without compromising performance or availability. Manual reviews and static optimization rules were no longer enough.

    By introducing predictive planning and real-time anomaly detection, the organization gained early visibility into cost deviations and the ability to act before issues escalated.

    The results were tangible:

    • 20% reduction in cloud costs
    • Improved continuity and workload performance
    • Faster response times with minimal manual effort

    AI didn’t just reduce spend; it made cost management more predictable and less disruptive.
    Read the full story here: Optimizing Petabyte-Scale Workloads for Cost and Continuity – R Systems

    The R Systems Approach: AI-Powered FinOps, Built for Continuous Optimization

    AI is powerful, but it delivers real value only when embedded into everyday cloud operations.

    R Systems brings together AI-driven forecasting and anomaly detection with continuous optimization practices that align finance, engineering, and operations. The focus is not on one-time savings, but on building a FinOps capability that evolves alongside the cloud environment.

    The outcome is a FinOps model that is proactive, collaborative, and resilient, designed to keep pace with both growth and change.

    Explore our Cloud FinOps capabilities to learn more.

    Why AI-Driven FinOps Matters Now

    As cloud environments grow more complex, the cost of reacting late keeps rising. AI-driven FinOps offers a practical alternative: predict earlier, prevent waste, and perform with confidence.

    For organizations that see cloud efficiency as a long-term discipline and not a quarterly exercise, AI is no longer optional. It is foundational.

    Let’s move forward together. Start the journey — talk to our Cloud FinOps experts today.

  • Choosing the Right Partner: Why Agentic AI Success Depends Less on Tools and More on Who You Build With

    Agentic AI has moved quickly from experimentation to expectation. Most enterprises today have pilots in motion, proofs of concept delivering early promise, and leadership teams asking a sharper question: How do we scale this safely, reliably, and with real business impact?

    That question is often followed by fatigue. Too many pilots stall. Too many promising demos fail to survive real-world complexity. And too often, the issue isn’t the technology itself.

    The uncomfortable truth is this: most agentic AI failures are not technology failures. They are partner failures.

    As enterprises move from pilots to production, especially within Global Capability Centers (GCCs), partner selection has become a strategic decision, not a procurement one. The difference between experimentation and enterprise value increasingly comes down to who you build with.

    Why Partner Choice Matters More Than Ever

    Agentic AI is fundamentally different from earlier waves of automation. It introduces autonomy into business workflows, systems that can sense, decide, and act with limited human intervention.

    That kind of capability doesn’t scale through tools alone.

    Scaling agentic AI requires deep enterprise context, operating-model alignment, strong governance, and ownership of outcomes. Yet many organizations still choose partners based on narrow criteria: a compelling demo, a preferred toolset, or short-term cost efficiency.

    Those choices may work for pilots. They rarely work for production.

    As organizations mature, a clear realization is emerging: the partner matters as much as the platform, and often more.

    Innovation Readiness Is Not Optional

    Agentic AI is advancing faster than most enterprise operating models can comfortably absorb. New orchestration patterns, reasoning techniques, safety mechanisms, and runtime optimizations are emerging at a pace that outstrips traditional delivery and governance cycles.

    In such an environment, partner capability cannot remain static. Enterprises need partners with a sustained capacity for innovation, not merely the ability to implement what is already familiar.

    The most effective agentic AI partners operate through a mature AI Center of Excellence: one that systematically experiments, evaluates new tools and approaches, and converts what proves viable into production-ready practices before they enter core enterprise systems.

    Without this discipline, organizations risk committing too early to architectural choices that do not age well, making choices that introduce technical debt, constrain future evolution, and limit the scope of autonomy over time.

    Innovation readiness in agentic AI, then, is not a matter of chasing what is new. It is the ability to distinguish signal from noise, to decide deliberately what belongs in production, and to industrialize proven approaches with consistency, safety, and repeatability.

    The Common Partner Pitfalls

    Most enterprises don’t choose the wrong partners intentionally. They choose partners that are right for a different stage of maturity.

    Some common pitfalls we see:

    • Tool-first vendors who excel at showcasing AI capabilities but lack experience running mission-critical enterprise systems.
    • Traditional system integrators with scale and delivery muscle, but limited depth in agentic AI design and orchestration.
    • Niche AI firms that can build impressive pilots but struggle with integration, governance, and long-term operations.
    • Delivery partners focused on execution, not accountability, leaving enterprises to own risk, outcomes, and scale alone.
    • Partners who lack domain or functional depth, resulting in agents that understand tools but not the business context, decision logic, or real operational constraints.

    None of these partners are inherently flawed. But agentic AI demands a broader, more integrated capability set.

    The Agentic AI Partner Readiness Checklist

    Before trusting a partner to take agentic AI into production, leaders should ask a simpler, more direct question:

    Can this partner scale autonomy responsibly inside my enterprise?

    Here is a practical checklist to help answer that question.

    1. Enterprise & GCC Readiness

    • Has this partner run large-scale, production systems and not just pilots?
    • Do they understand GCC operating models, governance structures, and decision rights?
    • Can they embed AI ownership into teams, not just deliver projects?

    2. Agentic AI Depth

    • Do they go beyond chatbots and copilots?
    • Have they designed and deployed multi-agent systems in real environments?
    • Do they build in human-in-the-loop controls by default?

    3. Scalability & Reusability

    • Do they think in platforms, not one-off agents?
    • Can their solutions be reused across functions and workflows?
    • Are observability and lifecycle management part of the design, and not just an afterthought?

    4. Data & Integration Maturity

    • Can they work with messy, legacy, enterprise data?
    • Do they integrate cleanly with core business systems?
    • Is data governance built into the solution from day one?

    5. Security, Risk & Governance

    • Are guardrails designed in, not bolted on?
    • Can decisions be explained, audited, and governed?
    • Are solutions built for regulated, compliance-heavy environments?

    6. Outcome Ownership

    • Are success metrics tied to business outcomes, not activity?
    • Will the partner co-own KPIs, risk, and accountability?
    • Do they stay invested beyond go-live?

    This checklist shifts the conversation from capabilities to credibility.

    Why This Checklist Changes the Conversation

    Used well, this framework changes how enterprises approach agentic AI adoption.

    It shifts the focus from vendors to partners, from pilots to platforms, and from experiments to operating models.

    It also makes one thing clear: scaling agentic AI is not a one-time implementation. It is a capability that must be built, governed, and evolved over time.

    Organizations that succeed tend to work with partners who understand enterprise realities, operate comfortably inside GCC environments, and engineer autonomy with accountability at the core.

    That is where agentic AI becomes sustainable.

    The Partner as a Force Multiplier

    Agentic AI is not a shortcut. It is a long-term capability play.

    The right partner accelerates scale, reduces risk, and protects ROI by ensuring that autonomy is introduced not with disruption but with discipline.

    The wrong partner adds complexity, creates fragility, and leaves enterprises managing outcomes they never fully owned.

    As leaders move from pilots to production, the question is no longer whether agentic AI can deliver value.

    It is whether you have the right partner to deliver it at scale, in the real world, and over time.

    Why Domain & Functional Context Make or Break Agentic AI

    Agentic AI systems do not simply automate tasks; they make decisions inside business workflows. That makes domain and functional context non-negotiable.

    An agent operating in finance, supply chain, customer service, or engineering must understand far more than APIs and prompts. It must respect process boundaries, exception handling, regulatory constraints, and the implicit rules humans apply every day.

    Partners without functional or industry depth often build agents that technically work but fail operationally, producing decisions that are correct in isolation yet wrong in context.

    The most effective partners combine agentic AI engineering with deep functional understanding, enabling agents to operate with judgment, not just intelligence.

  • Less Automation, More Trust: Why Tier-2 Operators Should Start Small with AI

    Every few months, someone in the telecom space claims that the self-healing network is just around the corner. This has been happening for years. Yet, many regional operators are still handling incidents manually, with their engineers triaging alarms and switching between legacy dashboards and SNMP traps.

    And the problem isn’t that operators lack ambition, or the drive for change – it’s that they don’t trust automation enough. That’s because they’ve learned, often the hard way, that even the smallest glitch can take a stable network down in seconds. This brings us to the real barrier to AI adoption in network operations, not technology, but trust. And honestly, that’s a rational response.

    AI’s first job is to earn engineers’ trust, not to replace them

    Most automation stories start from an ideal scenario: clean data, cloud-native infrastructure, and teams fluent in DevOps and data science. However, that’s not the reality for most Tier-2 operators. These are lean teams running multi-vendor environments, juggling limited budgets and decades-old systems.

    In over 20 years in telecom, we at R Systems have worked with operators who’ve run anomaly detection pilots that technically worked but stayed in read-only mode for months, because no one in the Network Operations Center (NOC) trusted the system enough to act on its recommendations. That’s a failure of design philosophy rather than of AI. The automation model might be perfect, but if trust is low, it won’t go live.

    That’s why your first automation should first build trust and then trigger growth and digital transformation. It doesn’t need to be a “zero-touch” solution. It needs to be safe and reversible, because engineers trust what they can override.

    Start where failure costs are low and wins are visible

    From what I’ve seen in most Tier-2 operators, about half the workload of their NOC comes from low-impact, repetitive incidents, like interface flaps, link degradations, or simple routing resets.

    These are the perfect starting points for AI. They happen often enough for models to learn quickly, and even if something goes wrong, the impact is minimal. Automating such tasks can cut alert fatigue dramatically, without touching high-risk infrastructure. The goal isn’t to replace engineering teams, but to help them focus on innovation and growth, while allowing AI to handle high-frequency, low-risk tasks.

    Reversible automation builds confidence, one task at a time

    Every successful small automation builds political capital for bigger steps. Operators gain confidence when they see an AI system take on simple, reversible tasks and get them right.

    Features like explain-why outputs, detailed logs, and one-click rollbacks allow engineers to stay in control. This “supervised automation” mindset is how AI earns its place in runbooks and not the other way around. Because when the NOC team feels that AI is a partner, not a blocker, adoption accelerates naturally.
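
    As a rough illustration of that mindset, a supervised action can pair every change with an explain-why log entry and a one-click rollback. The action and runbook names here are invented for the example.

    import logging

    logging.basicConfig(level=logging.INFO)
    log = logging.getLogger("noc-automation")

    class SupervisedAction:
        """A reversible automation step: it explains itself and can be undone."""
        def __init__(self, name, apply_fn, rollback_fn, reason):
            self.name, self.apply_fn, self.rollback_fn = name, apply_fn, rollback_fn
            self.reason = reason

        def execute(self):
            log.info("Proposed: %s. Why: %s", self.name, self.reason)  # explain-why output
            self.apply_fn()
            log.info("Applied: %s (rollback available)", self.name)

        def rollback(self):
            self.rollback_fn()
            log.info("Rolled back: %s", self.name)

    action = SupervisedAction(
        name="reset-interface eth0/1",  # hypothetical runbook action
        apply_fn=lambda: None,          # placeholder for the real, known-safe step
        rollback_fn=lambda: None,
        reason="Interface flapped 14 times in 10 minutes; matches runbook RB-07",
    )
    action.execute()
    action.rollback()  # engineers stay in control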

    AI in the NOC: what your first 90 days will look like

    If you’re wondering where to start, here’s what’s worked in practice:

    Step 1: Identify your top 10 high-frequency, low-risk runbooks.

    Work with your NOC managers and subject matter experts to pinpoint repetitive incident types that drain the most time.

    Step 2: Roll out AI in read-only mode.

    Have the Ops / DevOps teams use it for auto-diagnosis and ticket enrichment. This builds trust with zero risk.

    Step 3: Move to supervised automation with rollback options.

    Let the AI recommend and occasionally execute known-safe actions, with human oversight, to reduce MTTR and false-positive rates.

    If you follow this sequence, you can realistically target a 20–30% reduction in incident triage time within 12 weeks, without ever touching core routing policies.

    What success looks like

    A regional fiber ISP ran a small pilot with AI-based anomaly detection on its edge routers. Before the pilot, the six-person NOC was logging 15+ manual tickets every night.

    After the AI grouped and labeled similar alarms automatically, that number dropped to just four incidents requiring human confirmation. The mean time to resolution (MTTR) went down by 28%.

    That’s not science fiction, it’s what happens when trust comes before automation.

    “Start Small” isn’t playing small

    Some leaders worry that starting with small, reversible AI automations means they’ll fall behind the big players. Actually, it’s the other way around. Tier-1s often spend years (and millions) chasing “autonomous” dreams, but you can deliver measurable value in 90 days with a laptop, good logs, and the right mindset.

    The key is to think of AI not as a leap of faith, but as a series of safe, reversible steps that gradually earn your confidence and your engineers’.

    Because the truth is, AI doesn’t need to replace the human operator to transform the NOC. It just needs to make their 2 a.m. shift a little quieter, a little smarter, and a lot more human.

  • The Insurance Analytics Stack: Future-Proofing Your Investments in BI Tools

    We have seen the same pattern repeat across insurance clients more times than we can count: a significant investment in a “strategic” BI platform, followed by growing frustration just a few years later. The dashboards still run, but the platform starts to feel heavy. Costs increase. New data sources take longer to onboard. Regulatory requirements evolve faster than the analytics stack can adapt.

    For data and BI leaders in insurance, this is not a hypothetical scenario — it’s a familiar one.

    The reality is simple: BI tools age faster than most organizations anticipate. Data volumes grow exponentially, operating models change, and regulatory goalposts continue to shift. In our experience at R Systems, the challenge is rarely the BI tool itself; it’s how tightly business logic, governance, and skills are coupled to that tool.

    The Reality of Today’s Insurance BI Landscape

    There is no such thing as a perfect BI tool — only the right tool for a given context. And in insurance, that context is constantly evolving.

    Over the last decade, our teams have worked across a wide spectrum of analytics environments, from mainframe-driven reporting to cloud-native, AI-enabled platforms. Insurance organizations bring unique complexity to this journey: legacy core systems, fragmented actuarial and claims data, strict compliance requirements, and constant pressure to deliver more insight with fewer resources.

    Most insurers still rely on a familiar set of BI platforms:

    • MicroStrategy
    • Tableau
    • Qlik
    • Oracle BI
    • And increasingly, Power BI

    What we see most often is not a clean replacement of one tool with another, but a multi-tool landscape where new platforms are introduced alongside existing ones. This coexistence phase is where long-term success — or failure — is determined.

    The biggest mistake organizations make is assuming that today’s “strategic BI choice” will remain optimal as business priorities, data platforms, and regulatory expectations evolve.

    A Candid View of the Major BI Platforms in Insurance

    MicroStrategy
    We’ve seen MicroStrategy perform extremely well in large insurance environments that demand strong governance, complex security models, and predictable enterprise reporting. It scales reliably and meets regulatory expectations.
    At the same time, it can feel restrictive for agile analytics or rapid experimentation, especially when business users seek faster self-service capabilities.

    Tableau
    Tableau consistently drives high adoption due to its intuitive visual experience. Actuaries, underwriters, and analysts value the ability to explore data quickly and independently.
    Where insurers often struggle is governance at scale — particularly as data sources proliferate and business logic fragments across workbooks. Without strong discipline, performance and lineage challenges emerge.

    Qlik
    Qlik is often underestimated in insurance contexts. Its associative model excels in ad hoc exploration, especially for claims analysis, fraud detection, and investigative use cases.
    Challenges tend to arise in deeply governed enterprise scenarios or where long-term extensibility and integration with modern data platforms are priorities.

    Oracle BI
    Oracle BI remains a common choice for insurers heavily invested in Oracle ecosystems. It offers robust security and strong integration.
    However, innovation cycles can be slower, and business-user agility is often limited. Many teams rely on it out of necessity rather than preference.

    Power BI and Its Growing Role
    Power BI has become a significant part of the insurance analytics conversation. Its integration with modern data platforms such as Databricks and Snowflake, improving enterprise governance, and rapidly evolving AI capabilities have made it a strategic option for many insurers.

    In practice, we frequently see Power BI introduced alongside existing BI platforms — supporting executive reporting, self-service analytics, embedded use cases, or AI-driven insights — rather than as an immediate replacement. This coexistence reinforces the need for a flexible, decoupled architecture.

    The Hidden Risk: Where Business Logic Lives

    Across migrations and modernization programs, one risk appears repeatedly: deeply embedded business logic inside BI semantic layers.

    When regulatory calculations, actuarial formulas, and financial metrics are hard-coded into a specific BI tool:

    • Migrations become slow and expensive
    • Parallel runs are difficult to validate
    • Flexibility disappears during mergers, acquisitions, or platform shifts

    At that point, the BI tool stops being a presentation layer and becomes a structural constraint.
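
    One way to avoid that trap, sketched below with hypothetical table and column names, is to compute regulated metrics once in a shared transformation layer and let every BI tool read the result, rather than re-implementing the formula in each tool’s semantic layer.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("metric-layer").getOrCreate()

    claims = spark.table("claims")      # hypothetical source tables
    premiums = spark.table("premiums")

    # Loss ratio defined once, outside any BI tool's semantic layer
    loss_ratio = (
        claims.groupBy("line_of_business")
        .agg(F.sum("paid_amount").alias("losses"))
        .join(
            premiums.groupBy("line_of_business")
            .agg(F.sum("earned_premium").alias("earned")),
            "line_of_business",
        )
        .withColumn("loss_ratio", F.col("losses") / F.col("earned"))
    )

    # Persist once; Tableau, Power BI, Qlik, and others consume the same table
    loss_ratio.write.mode("overwrite").saveAsTable("metrics_loss_ratio")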

    Five Questions We Use to Future-Proof Insurance BI Decisions

    Based on our delivery experience, we encourage insurance BI leaders to ask five critical questions before making — or renewing — a BI investment:

    How easily can BI tools be swapped or augmented as strategies and vendors change?
    Rigid architectures increase risk during integrations and modernization efforts.

    Can governance models evolve with regulatory and data privacy demands?
    Many BI failures stem from brittle access controls and manual processes.

    How well does the BI layer integrate with modern data platforms and AI services?
    Cloud-native and AI-enabled analytics are no longer optional.

    How is the balance managed between self-service and enterprise control?
    Too much freedom leads to chaos; too much control drives shadow IT.

    Are investments being made in skills and architecture, not just licenses?
    Tools change, but strong teams and sound design principles endure.

    Lessons Learned From Real Programs

    In one engagement, we supported an insurer migrating from Oracle BI to Jasper to improve operations. While the target state made sense, a significant amount of critical logic was embedded in Oracle’s semantic layer. Rebuilding these calculations extended the program timeline by nearly 40%.

    In contrast, we’ve worked with insurers who deliberately decoupled their transformation and metric layers from the BI tool. When licensing or strategic priorities shifted, they were able to introduce Power BI with minimal disruption. That architectural choice saved months of effort and reduced long-term risk.

    Trends Insurance BI Teams Can No Longer Ignore

    Across recent insurance RFPs and transformation programs, several patterns are now consistent:

    • Cloud-native data platforms (Databricks, Snowflake, BigQuery)
    • Power BI and embedded analytics for agents, partners, and customers
    • AI-driven insights and natural language querying
    • Data mesh and data fabric operating models

    These are no longer emerging trends — they are current expectations.

  • Driving Intelligence Across a Leading German Automotive Manufacturer’s Operations with AI-Powered Forecasting

    • Enterprise AI Forecasting Framework – Designed and deployed a centralized, modular AI/ML forecasting architecture to unify forecasting across Finance, Logistics, Procurement, and Sales, replacing fragmented, manual processes with a single source of truth. 
    • Accuracy & Predictive Depth – Achieved up to 80% forecast accuracy across freight costs, transport lead times, and sales, with <20% MAPE for daily and weekly bank balance forecasts—delivering reliable short- and long-term visibility across business functions. 
    • Operational Efficiency at Scale – Automated end-to-end forecasting pipelines, significantly reducing manual effort, minimizing human error, and enabling monthly forecast updates with minimal retraining overhead. 
    • Actionable Business Intelligence – Enabled finance, sales, and logistics teams with real-time, role-specific dashboards to support proactive cash flow management, inventory planning, shipment prioritization, and demand-led decision-making. 
    • Modularity, Scalability & Reuse – Implemented a reusable forecasting framework supporting both univariate and multivariate models, allowing rapid extension to new business use cases, profit centers, and data sources without architectural rework. 
    • Strategic Business Impact – Improved planning precision, strengthened cross-functional alignment, and established a scalable AI foundation to support ongoing digital transformation and enterprise-wide forecasting maturity. 
  • AI-Powered Multimodal Fusion for Health Risk Prediction

    Predict Health Risks Before They Become Diagnoses

    Chronic diseases like diabetes, cancer, and heart conditions often get detected too late. But what if early warning signals were already hidden inside your EMR data?

    Our POV on AI-Powered Multimodal Fusion reveals how healthcare providers can move from reactive treatment to proactive, data-driven, and explainable risk prediction, without the need for advanced imaging or expensive diagnostics.

    Why This POV Is a Must-Read

    Healthcare organizations are sitting on enormous amounts of clinical data but very little of it works together. Our POV uncovers how multimodal AI bridges these silos to deliver:

    • Earlier detection of diabetes, cancer, and cardiovascular risks
    • Explainable health insights powered by SHAP and attention mechanisms
    • Seamless integration with existing EMR systems
    • Improved clinical decision-making using data you already have
    • Better population health, lower long-term costs

    Who Shouldn’t Miss This POV

    • Hospital & clinical leaders
    • Digital health innovators
    • EMR/HealthTech product owners
    • Population health & payer strategy teams

    If early risk detection, preventive care, and explainable AI are priorities, this POV will equip you with high-impact insights.

  • From Connected to Intelligent: The Evolution of Smart Homes

    Overview:

    From futuristic speculation to everyday reality, smart homes can go way beyond connected devices – they can become intelligent, collaborative, reactive and adaptable environments. This can be achieved using Multi-agent AI Systems (MAS) to unify IoT devices and lay a solid foundation for innovation, for more seamless and secure living.  

    This remarkable growth of smart homes brings both opportunities and challenges. In this whitepaper, we’ll explore both, moving from the general (market overview and predictions) to the specific (blueprint architecture and use cases), using AWS Harmony.

    Here’s a breakdown of the whitepaper:

    • The Smart Homes market landscape: what is the current state and changes to expect
    • Multi-Agent AI Systems (MAS): how they work and why they’re transforming Smart Homes
    • The technology behind MAS: capabilities, practical applications and benefits
    • Smart Homes on AWS Harmony: blueprint of Agentic AI as the foundation for next-gen experiences
    • Use case for sustainable living: a hybrid Edge + Cloud IoT high-level architecture to implement for energy saving

  • The Data Lake Revolution: Unleashing the Power of Delta Lake

    Once upon a time, in the vast and ever-expanding world of data storage and processing, a new hero emerged. Its name? Delta Lake. This unsung champion was about to revolutionize the way organizations handled their data, and its journey was nothing short of remarkable.

    The Need for a Data Savior

    In this world, data was king, and it resided in various formats within the mystical realm of data lakes. Two popular formats, Parquet and Hive, had served their purposes well, but they harbored limitations that often left data warriors frustrated.

    Enterprises faced a conundrum: they needed to make changes, updates, or even deletions to individual records within these data lakes. But it wasn’t as simple as it sounded. Modifying schemas was a perilous endeavor that could potentially disrupt the entire data kingdom.

    Why? Because these traditional table formats lacked a vital attribute: ACID transactions. Without these safeguards, every change was a leap of faith.

    The Rise of Delta Lake

    Amidst this data turmoil, a new contender emerged: Delta Lake. It was more than just a format; it was a game-changer.

    Delta Lake brought with it the power of ACID transactions. Every data operation within the kingdom was now imbued with atomicity, consistency, isolation, and durability. It was as if Delta Lake had handed data warriors an enchanted sword, making them invincible in the face of chaos.

    But that was just the beginning of Delta Lake’s enchantment.

    The Secrets of Delta Lake

    Delta Lake was no ordinary table format; it was a storage layer that transcended the limits of imagination. It integrated seamlessly with Spark APIs, offering features that left data sorcerers in awe.

    • Time Travel: Delta Lake allowed users to peer into the past, accessing previous versions of data. The transaction log became a portal to different eras of data history.
    • Schema Evolution: It had the power to validate and evolve schemas as data changed. A shapeshifter of sorts, it embraced change effortlessly.
    • Change Data Feed: With this feature, it tracked data changes at the granular level. Data sorcerers could now decipher the intricate dance of inserts, updates, and deletions.
    • Data Skipping with Z-ordering: Delta Lake mastered the art of optimizing data retrieval. It skipped irrelevant files, ensuring that data requests were as swift as a summer breeze.
    • DML Operations: It wielded the power of SQL-like data manipulation language (DML) operations. Updates, deletes, and merges were but a wave of its hand.

    Delta Lake’s Allies

    Delta Lake didn’t stand alone; it forged alliances with various data processing tools and platforms. Apache Spark, Apache Flink, Presto, Trino, Hive, DBT, and many others joined its cause. They formed a coalition to champion the cause of efficient data processing.

    In the vast landscape of data management, Delta Lake stands as a beacon of innovation, offering a plethora of features that elevate your data handling capabilities to new heights. In this exhilarating adventure, we’ll explore the key features of Delta Lake and how they triumph over the limitations of traditional file formats, all while embracing the ACID properties.

    ACID Properties: A Solid Foundation

    In the realm of data, ACID isn’t just a chemical term; it’s a set of properties that ensure the reliability and integrity of your data operations. Let’s break down how Delta Lake excels in this regard.

    A for Atomicity: All or Nothing

    Imagine a tightrope walker teetering in the middle of their performance—either they make it to the other side, or they don’t. Atomicity operates on the same principle: either all changes happen, or none at all. In the world of Spark, this principle often takes a tumble. When a write operation fails midway, the old data is removed, and the new data is lost in the abyss. Delta Lake, however, comes to the rescue. It creates a transaction log, recording all changes made along with their versions. In case of a failure, data loss is averted, and your system remains consistent.

    C for Consistency: The Guardians of Validity

    Consistency is the gatekeeper of data validity. It ensures that your data remains rock-solid and valid at all times. Spark sometimes falters here. Picture this: your Spark job fails, leaving your system with invalid data remnants. Consistency crumbles. Delta Lake, on the other hand, is your data’s staunch guardian. With its transaction log, it guarantees that even in the face of job failure, data integrity is preserved.

    I for Isolation: Transactions in Solitude

    Isolation is akin to individual bubbles, where multiple transactions occur in isolation, without interfering with one another. Spark might struggle with this concept. If two Spark jobs manipulate the same dataset concurrently, chaos can ensue. One job overwrites the dataset while the other is still using it—no isolation, no guarantees. Delta Lake, however, introduces order into the chaos. Through its versioning system and transaction log, it ensures that transactions proceed in isolation, mitigating conflicts and ensuring the data’s integrity.

    D for Durability: Unyielding in the Face of Failure

    Durability means that once changes are made, they are etched in stone, impervious to system failures. Spark’s Achilles’ heel lies in its vulnerability to data loss during job failures. Delta Lake, however, boasts a different tale. It secures your data with unwavering determination. Every change is logged, and even in the event of job failure, data remains intact—a testament to true durability.

    Time Travel: Rewriting the Past

    Now, let’s embark on a fascinating journey through time. Delta Lake introduces a feature that can only be described as “time travel.” With this feature, you can revisit previous versions of your data, just like rewinding a movie. All of this magical history is stored in the transaction log, encapsulated within the mystical “_delta_log” folder. When you write data to a Delta table, it’s not just the present that’s captured; the past versions are meticulously preserved, waiting for your beck and call.

    In conclusion, Delta Lake emerges as the hero of the data world, rewriting the rules of traditional file formats and conquering the challenges of the ACID properties. With its robust transaction log, versioning system, and the ability to traverse time, Delta Lake opens up a new dimension in data management. So, if you’re on a quest for data reliability, integrity, and a touch of magic, Delta Lake is your trusted guide through this thrilling journey beyond convention.

    More Features of Delta Lake:

    • UPSERT
    • Schema Evolution
    • Change Data Feed
    • Data Skipping with Z-ordering
    • DML Operations

    The Quest for Delta Lake

    Setting up Delta Lake was like embarking on a quest. Data adventurers ventured into the cloud, AWS, GCP, Azure, or even their local domains. They armed themselves with the delta-spark spells and summoned the JARs of delta-core, delta-contrib, and delta-storage, tailored to their Spark versions.

    Requirements:

    • Python
    • Delta-spark
    • Delta jars

    You can configure the package name in the Spark session so that it is downloaded at run time. As I said, I am using Spark version 3.3, which requires delta-core, delta-contribs, and delta-storage. You can download them from here: https://github.com/delta-io/delta/releases/

    To use and configure various cloud storage options, there are separate .jars you can use: https://docs.delta.io/latest/delta-storage.html. Here, you can find .jars for AWS, GCS, and Azure to configure and use their data storage medium.

    Run this command to install delta-spark first:

    pip install delta-spark

    (If you are using Dataproc or EMR, you can install this while creating the cluster as a startup action; if you are using a serverless environment like Glue or Dataproc batches, you can create a Docker build or pass the .whl file for this package.)

    The same applies to the .jars. If the environment is serverless, download the .jar, store it in cloud storage like S3 or GCS, and use that path while running the job. If it is a cluster like Dataproc or EMR, you can download it onto the cluster.

    One can also download these .jars at run time while creating the Spark session.

    Now, create the Spark session, and you are ready to play with Delta tables.

    Environment Setup

    How do you add the Delta Lake dependencies to your environment?

    1. You can directly add them while initializing the Spark session for Delta Lake by passing the specific version, and these packages or dependencies will be downloaded during run time.
    2. You can place the required .jar files in your cluster and provide the reference while initializing the Spark session.
    3. You can download the .jar files and store them in cloud storage, and you can pass them as a run time argument if you don’t want to download the dependencies on your cluster.
    # Initialize Spark Session
    import pyspark
    from delta import *
    from pyspark.sql.types import *
    from delta.tables import *
    from pyspark.sql.functions import *

    # Option 1: download the Delta package at run time
    builder = pyspark.sql.SparkSession.builder.appName("My App") \
          .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
          .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
          .config("spark.jars.packages", "io.delta:delta-core_2.12:2.2.0")

    # Option 2: the jar is already available on the cluster
    builder = pyspark.sql.SparkSession.builder.appName("My App") \
          .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
          .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")

    spark = builder.getOrCreate()

    You have to add the following properties to use Delta in Spark:

    • spark.sql.extensions
    • spark.sql.catalog.spark_catalog

    You can see these values in the above code snippet. If you want to use cloud storage, like reading and writing data from S3, GCS, or Blob storage, then we have to set some more configs in the Spark session. Here, I am providing examples for AWS and GCS only.

    The next question that will come to your mind: how will you be able to read or write data in cloud storage?

    For different cloud storage systems, there are certain .jar files available that are used to connect and perform I/O operations on the storage. See the examples below.

    You can use the above approach to make these .jars available to the Spark session, either by downloading them at run time or by storing them on the cluster itself.

    AWS 

    spark_jars_packages = "com.amazonaws:aws-java-sdk:1.12.246,org.apache.hadoop:hadoop-aws:3.2.2,io.delta:delta-core_2.12:2.2.0"

    builder = SparkSession.builder.appName('delta') \
      .config("spark.jars.packages", spark_jars_packages) \
      .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
      .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
      .config('spark.hadoop.fs.s3a.aws.credentials.provider', 'org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider') \
      .config("spark.hadoop.fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") \
      .config("spark.hadoop.fs.AbstractFileSystem.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") \
      .config("spark.delta.logStore.class", "org.apache.spark.sql.delta.storage.S3SingleDriverLogStore")

    spark = builder.getOrCreate()

    GCS

    spark_session = SparkSession.builder.appName('delta').getOrCreate()

    spark_session.conf.set("fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
    spark_session.conf.set("spark.hadoop.fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
    spark_session.conf.set("fs.gs.auth.service.account.enable", "true")
    spark_session.conf.set("fs.AbstractFileSystem.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")
    spark_session.conf.set("fs.gs.project.id", project_id)
    spark_session.conf.set("fs.gs.auth.service.account.email", credential["client_email"])
    spark_session.conf.set("fs.gs.auth.service.account.private.key.id", credential["private_key_id"])
    spark_session.conf.set("fs.gs.auth.service.account.private.key", credential["private_key"])

    Write into Delta Tables: In the following example, we are using the local filesystem only, for reading and writing data into and from Delta Lake tables.

    Data Set Used: https://media.githubusercontent.com/media/datablist/sample-csv-files/main/files/organizations/organizations-100.zip

    For reference, I have downloaded this file to my local machine and unzipped the data:

    partition_keys = ["country"]  # example: partition by an existing column
    df = spark.read.option("header", "true").csv("organizations-100.csv")
    df.write.mode('overwrite').format("delta").partitionBy(partition_keys).save("./Documents/DE/Delta/test-db/organisatuons")

    There are two modes available in Delta Lake and Spark (Append and Overwrite) while writing the data in the Delta tables from any source.
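
    For instance, using the same df and path as above:

    df.write.mode('overwrite').format("delta").save("./Documents/DE/Delta/test-db/organisatuons")  # replace contents
    df.write.mode('append').format("delta").save("./Documents/DE/Delta/test-db/organisatuons")     # add new rows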

    For now, we have enabled the Delta catalog to store all metadata-related information. We can also use the Hive metastore to store the metadata and run SQL queries directly over the Delta tables. You can use a cloud storage path as well.

    Read data from the Delta tables:

    delta_df = spark.read.format("delta").load("./Documents/DE/Delta/test-db/organisatuons")
    delta_df.show()

    Here, you can see the folder structure: after writing the data into Delta tables, Delta creates a delta log, which keeps track of metadata, partitions, and files.
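
    You can inspect that log directly; each commit appears as a numbered JSON file:

    import os
    print(os.listdir("./Documents/DE/Delta/test-db/organisatuons/_delta_log"))
    # e.g. ['00000000000000000000.json', '00000000000000000001.json']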

    Option 2: Create Delta Table and insert data using Spark SQL.

    spark.sql("CREATE TABLE orgs_data(index String, c_name String, organization_id String, name String, website String, country String, description String, founded String, industry String, num_of_employees String, remarks String) USING DELTA")

    Insert the data:

    df.write.mode('append').format("delta").option("mergeSchema", "true").saveAsTable("orgs_data")
    spark.sql("SELECT * FROM orgs_data").show()
    spark.sql("DESCRIBE TABLE orgs_data").show()

    This way, we can read the Delta table, and you can use SQL as well if you have enabled the Hive metastore.

    Schema Enforcement: Safeguarding Your Data

    In the realm of data management, maintaining the integrity of your dataset is paramount. Delta Lake, with its schema enforcement capabilities, ensures that your data is not just welcomed with open arms but also closely scrutinized for compatibility. Let’s dive into the meticulous checks Delta Lake performs when validating incoming data against the existing schema:

    Column Presence: Delta Lake checks that every column in your DataFrame matches the columns in the target Delta table. If there’s a single mismatch, it won’t let the data in and, instead, will raise a flag in the form of an exception.

    Data Types Harmony: Data types are the secret language of your dataset. Delta Lake insists that the data types in your incoming DataFrame align harmoniously with those in the target Delta table. Any discord in data types will result in a raised exception.

    Name Consistency: In the world of data, names matter. Delta Lake meticulously examines that the column names in your incoming DataFrame are an exact match to those in the target Delta table. No aliases allowed. Any discrepancies will lead to, you guessed it, an exception.

    This meticulous schema validation guarantees that your incoming data seamlessly integrates with the target Delta table. If any aspect of your data doesn’t meet these strict criteria, it won’t find a home in the Delta Lake, and you’ll be greeted by an error message and a raised exception.
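
    A quick way to see enforcement in action; the mismatched column name below is invented for the demonstration:

    bad_df = spark.createDataFrame([("x",)], ["unexpected_column"])
    try:
        bad_df.write.format("delta").mode("append").saveAsTable("orgs_data")
    except Exception as e:
        print("Write rejected by schema enforcement:", type(e).__name__)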

    Schema Evolution: Adapting to Changing Data

    In the dynamic landscape of data, change is the only constant. Delta Lake’s schema evolution comes to the rescue when you need to adapt your table’s schema to accommodate incoming data. This powerful feature offers two distinct approaches:

    Overwrite Schema: You can choose to boldly overwrite the existing schema with the schema of your incoming data. This is an excellent option when your data’s structure undergoes significant changes. Just set the “overwriteSchema” option to true, and voila, your table is reborn with the new schema.

    Merge Schema: In some cases, you might want to embrace the new while preserving the old. Delta Lake’s “Merge Schema” property lets you merge the incoming data’s schema with the existing one. This means that if an extra column appears in your data, it elegantly melds into the target table without throwing any schema-related tantrums.

    Should you find the need to tweak column names or data types to better align with the incoming data, Delta Lake’s got you covered. The schema evolution capabilities ensure your dataset stays in tune with the ever-changing data landscape. It’s a smooth transition, no hiccups, and no surprises, just data management at its finest.

    spark.read.table(...) 
      .withColumn("birthDate", col("birthDate").cast("date")) 
      .write 
      .format("delta") 
      .mode("overwrite")
      .option("overwriteSchema", "true") 
      .saveAsTable(...)

    The above code will overwrite the existing delta table with the new schema along with the new data.

    Delta Lake has support for automatic schema evolution. For instance, if you have added two more columns to a Delta Lake table and then access the existing table, you will be able to read the data without any error.

    There is another way as well. For example, if you have three columns in a Delta table but the incoming data has four columns, you can set spark.databricks.delta.schema.autoMerge.enabled to true. This can also be done for the entire cluster.
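
    For the current session, setting that config looks like this:

    spark.conf.set("spark.databricks.delta.schema.autoMerge.enabled", "true")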

    spark.sql("DESCRIBE TABLE orgs_data").show()

    Let’s add one more column and try to access the data again:

    spark.sql("ALTER TABLE orgs_data ADD COLUMNS (extra_col String)")

    spark.sql("DESCRIBE TABLE orgs_data").show()

    As you can see, the column has been added without impacting the data. You can still read the data smoothly and seamlessly; the newly created column is set to null for existing rows.

    What happens if we receive an extra column in an incoming CSV that we want to append to the existing Delta table? You have to set one config for that:

    input_df = spark.read.format('csv').option('header', 'true').load("../Desktop/Data-Engineering/data-samples/input-data/organizations-11111.csv")
    input_df.printSchema()
    input_df.write.mode('append').format("delta").option("mergeSchema", "true").saveAsTable("orgs_data")
    spark.sql("SELECT * FROM orgs_data").show()
    spark.sql("DESCRIBE TABLE orgs_data").show()

    You have to add the config mergeSchema=true while appending the data. It will merge the schema of incoming data that contains extra columns.

    The incoming data’s schema includes an extra column, while we have already seen the schema of our Delta table above. After the append, the new column from the incoming data is merged with the existing schema of the table, and the Delta table now has the latest updated schema.

    Time Travel 

    Basically, Delta Lake keeps track of all changes in _delta_log by creating a log file per commit. Using this, we can fetch the data of a previous version by specifying the version number.

    df = spark.read.format("delta").option("versionAsOf", 0).load("orgs_data")
    df.show()

    Here, we can see the first version of the data, before we added any columns. As we know, the Delta table maintains the delta log file, which contains the information for each commit, so we can fetch the data as of a particular commit.
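
    Before time traveling, you can list the available versions through the commit history:

    spark.sql("DESCRIBE HISTORY orgs_data").select("version", "timestamp", "operation").show()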

    Upsert, Delete, and Merge

    Unlocking the Power of Upsert with Delta Lake

    In the exhilarating realm of data management, upserting shines as a vital operation, allowing you to seamlessly merge new data with your existing dataset. It’s the magic wand that updates, inserts, or even deletes records based on their status in the incoming data. However, for this enchanting process to work its wonders, you need a key—a primary key, to be precise. This key acts as the linchpin for merging data, much like a conductor orchestrating a symphony.

    A Missing Piece: Copy on Write and Merge on Read

    Now, before we delve into the mystical world of upserting with Delta Lake, it’s worth noting that Delta Lake dances to its own tune. Unlike some other table formats like Hudi and Iceberg, Delta Lake doesn’t rely on the concepts of Copy on Write and Merge on Read. These techniques are used elsewhere to speed up data operations.

    Two Paths to Merge: SQL and Spark API

    To harness the power of upserting in Delta Lake, you have two pathways at your disposal: SQL and Spark API. The choice largely depends on your Delta version. In the latest Delta version, 2.2.0, you can seamlessly execute merge operations using Spark API. It’s a breeze. However, if you’re working with an earlier Delta version, say 1.0.0, then Spark SQL is your trusty steed for upserts and merges. Remember, using the right Delta version is crucial, or you might find yourself grappling with the cryptic “Method not found” error, which can turn into a debugging labyrinth.

    In the snippet below, we showcase the elegance of upserting using Spark SQL, a technique that ensures your data management journey is smooth and error-free:

    -- Insert new data and update existing data based on the specified key
    MERGE INTO targetTable AS target
    USING sourceTable AS source
    ON target.id = source.id
    WHEN MATCHED THEN
      UPDATE SET *
    WHEN NOT MATCHED THEN
      INSERT *
    WHEN NOT MATCHED BY SOURCE THEN
      DELETE;

    today_data_df = spark.read.format('csv').option('header', 'true').load("../Desktop/Data-Engineering/data-samples/input-data/organizations-11111.csv")
    today_data_df.show()
    
    
    spark.sql("select * from orgs_data where organization_id = 'FAB0d41d5b5ddd'").show()
    
    
    # Reading Existing Delta table
    deltaTable = DeltaTable.forPath(spark, "orgs_data")
    
    today_data_df.createOrReplaceTempView("incoming_data")

    Here, we load the incoming data and show its contents, along with the existing Delta table row that shares the same primary key, so that we can compare the two after upserting or merging the data.

    spark.sql(
    """
    MERGE INTO orgs_data
    USING incoming_data
    ON orgs_data.organization_id = incoming_data.organization_id
    WHEN MATCHED THEN
      UPDATE SET
        organization_id = incoming_data.organization_id,
        name = incoming_data.name
    """
    )
    
    spark.sql("select * from orgs_data where organization_id = 'FAB0d41d5b5ddd'").show()

    deltaTable.alias("oldData").merge(
        today_data_df.alias("newData"),
        "oldData.organization_id = newData.organization_id") \
        .whenMatchedUpdateAll() \
        .whenNotMatchedInsertAll() \
        .execute()

    This is an example of how you can do upserts using the Spark API. The merge operation can create lots of small files; you can control the number of small files by setting the following properties in the Spark session.

    spark.delta.merge.repartitionBeforeWrite true

    spark.sql.shuffle.partitions 10

    This is how merge operations work. Merge supports one-to-one mapping: only one source row may update a given row in the target Delta table. If multiple source rows try to update the same target row, the merge will fail. Delta Lake matches the data on the basis of a key in the case of an update operation.
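
    A simple way to guarantee this is to deduplicate the incoming data on the merge key before running the merge:

    deduped_df = today_data_df.dropDuplicates(["organization_id"])
    deduped_df.createOrReplaceTempView("incoming_data")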

    Change Data Feed

    This is another useful feature of Delta Lake: it tracks and maintains the history of all records in the Delta table at the row level after each insert or upsert. You can enable it at the beginning while setting up the Spark session, or via Spark SQL, by enabling “change events” for all the data.

    Now, you can see the whole journey of each record in the Delta table, from its insertion to deletion. It introduces one extra column, _change_type, which contains the type of operation that was performed on that particular row.

    To enable this, you can set these configurations: 

    spark.sql("set spark.databricks.delta.properties.defaults.enableChangeDataFeed = true;") 

    Or you can set this conf while reading the delta table as well. 

    ## Stream Data Generation
    
    data = [{"Category": 'A', "ID": 1, "Value": 121.44, "Truth": True, "Year": 2022},
            {"Category": 'B', "ID": 2, "Value": 300.01, "Truth": False, "Year": 2020},
            {"Category": 'C', "ID": 3, "Value": 10.99, "Truth": None, "Year": 2022},
            {"Category": 'E', "ID": 5, "Value": 33.87, "Truth": True, "Year": 2022}
            ]
    
    df = spark.createDataFrame(data)
    
    df.show()
    
    df.write.mode('overwrite').format("delta").partitionBy("Year").save("silver_table")

    deltaTable = DeltaTable.forPath(spark, "silver_table")
                                     
    deltaTable.delete(condition = "ID == 1")
    
    delta_df = spark.read.format("delta").option("readChangeFeed", "true").option("startingVersion", 0).load("silver_table")
    delta_df.show()

    Now, after deleting something, you will be able to see the changes, like what is deleted and what is updated. If you are doing upserts on the same Delta table after enabling the change data feed, you will be able to see the update as well, and if you insert anything, you will be able to see what is inserted in your Delta table. 

    If we overwrite the complete Delta table, it will mark all past records as deleted.

    If you want to record each data change, you have to enable this before creating the table so that we can see the data changes for each version. If you’ve already created one table, you won’t be able to see the changes for the previous version once you enable the change data feed, but you will be able to see the changes in all versions that came after this configuration.
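
    For a new table, the property can be set at creation time; the table below is just an example:

    spark.sql("""
    CREATE TABLE events_cdf (id INT, value STRING)
    USING DELTA
    TBLPROPERTIES (delta.enableChangeDataFeed = true)
    """)
    # For an existing table:
    # ALTER TABLE silver_table SET TBLPROPERTIES (delta.enableChangeDataFeed = true)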

    Data Skipping with Z-ordering

    Data skipping is a technique in Delta Lake where, if you have a large number of records stored in many files, Delta reads only the files that contain the required information and skips the rest. This makes reading data from Delta tables faster.

    Z-ordering is a technique used to colocate the information in the same dataset files. If you know the column that will be more in use in the select statement and has A cardinality, you can use Z-order by that particular column. It will reduce the large number of files from being read. We can give you multiple columns for Z-order by separating them from commas.

    Suppose a table has one column that is most frequently used in query predicates. Z-ordering by that column increases the number of files that can be skipped when running those queries. Normal ordering clusters data linearly along a single column, whereas Z-ordering clusters it across multiple dimensions.

    OPTIMIZE events
    WHERE date >= current_timestamp() - INTERVAL 1 day
    ZORDER BY (eventType)

    DML Operations

    Delta Lake can run all the standard SQL DML operations directly on the data lake, including update, delete, and merge operations.
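    As an illustration, here is a minimal sketch of the same operations through the Python DeltaTable API, reusing the columns from the earlier sample table (the updates_df source DataFrame is hypothetical):

    from delta.tables import DeltaTable

    deltaTable = DeltaTable.forPath(spark, "silver_table")

    # UPDATE: set Value to 0 for all rows in Category 'B'.
    deltaTable.update(condition="Category = 'B'", set={"Value": "0"})

    # DELETE: remove all rows for the year 2020.
    deltaTable.delete("Year = 2020")

    # MERGE: upsert the incoming rows on the ID key.
    (deltaTable.alias("t")
        .merge(updates_df.alias("s"), "t.ID = s.ID")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute())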

    Integrations and Ecosystem Supported in Delta Lake

    ‍Read Delta Tables

    Unlock the Delta Tables: Tools That Bring Data to Life

    Reading data from Delta tables is like diving into a treasure trove of information, and there’s more than one way to unlock its secrets. Beyond the standard Spark API, we have a squad of powerful allies ready to assist: SQL query engines like Athena and Trino. But they’re not just passive onlookers; they bring their own magic to the table, empowering you to perform data manipulation language (DML) operations that can reshape your data universe.

    Athena: Unleash the SQL Sorcery

    Imagine Athena as the Oracle of data. With SQL as its spellbook, it delves deep into your Delta tables, fetching insights with precision and grace. But here’s the twist: Athena isn’t just for querying; like a skilled blacksmith, it can help you hammer your data into a new shape, creating a masterpiece.

    Trino: The Shape-Shifting Wizard

    Trino, on the other hand, is the shape-shifter of the data realm. It glides through Delta tables, allowing you to perform an array of DML operations that can transform your data into new, dazzling forms. Think of it as a master craftsman who can sculpt your data, creating entirely new narratives and visualizations.

    So, when it comes to Delta tables, these tools are not just readers; they are your co-creators. They enable you to not only glimpse the data’s beauty but also mold it into whatever shape serves your purpose. With Athena and Trino at your side, the possibilities are as boundless as your imagination.

    Read Delta Tables Using Spark APIs

    from delta.tables import *

    delta_df = DeltaTable.forPath(spark, "./Documents/DE/Delta/test-db/organisatuons")
    delta_df.toDF().show()

    Steps to Set Up Delta Lake with S3 on EC2 or EMR and Access Data through Athena

    Data Set Used – We have generated around 100 GB of dummy data and written it into Delta tables.

    Step 1 – Set up a Spark session configured for AWS cloud storage and Delta. Here, we have used an EC2 instance with Spark 3.3 and Delta version 2.1.1, and we are setting up the Spark config for Delta and S3.

    from pyspark.sql import SparkSession

    # Placeholder credentials; never hard-code real keys in production code.
    AWS_ACCESS_KEY_ID = "XXXXXXXXXXXXXXXXXXXXXX"
    AWS_SECRET_ACCESS_KEY = "XXXXXXXXXXXXXXXXXXXXXX+XXXXXXXXXXXXXXXXXXXXXX"

    # AWS SDK, Hadoop-AWS, and Delta jars compatible with Spark 3.3 / Delta 2.1.1.
    spark_jars_packages = "com.amazonaws:aws-java-sdk:1.12.246,org.apache.hadoop:hadoop-aws:3.2.2,io.delta:delta-core_2.12:2.1.1"
    
    spark = SparkSession.builder.appName('delta') \
       .config("spark.jars.packages", spark_jars_packages) \
       .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
       .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
       .config('spark.hadoop.fs.s3a.aws.credentials.provider', 'org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider') \
       .config("spark.hadoop.fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") \
       .config("spark.hadoop.fs.AbstractFileSystem.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") \
       .config("spark.delta.logStore.class", "org.apache.spark.sql.delta.storage.S3SingleDriverLogStore") \
       .config("spark.driver.memory", "20g") \
       .config("spark.memory.offHeap.enabled", "true") \
       .config("spark.memory.offHeap.size", "8g") \
       .getOrCreate()
    
    spark.sparkContext._jsc.hadoopConfiguration().set("fs.s3a.access.key", AWS_ACCESS_KEY_ID)
    spark.sparkContext._jsc.hadoopConfiguration().set("fs.s3a.secret.key", AWS_SECRET_ACCESS_KEY)

    Spark Version – You can use any Spark version, but Spark 3.3.1 is what the pip install ships with. Just make sure that whatever version you use is compatible with the Delta Lake version you are using; otherwise, most of the features won’t work.
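    For a quick local compatibility check, the delta-spark pip package can assemble a matching session for you; a minimal sketch, assuming `pip install pyspark==3.3.1 delta-spark==2.1.1`:

    import pyspark
    from delta import configure_spark_with_delta_pip

    # The helper adds the correct Delta jars for the installed delta-spark version.
    builder = (pyspark.sql.SparkSession.builder.appName("delta-local")
               .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
               .config("spark.sql.catalog.spark_catalog",
                       "org.apache.spark.sql.delta.catalog.DeltaCatalog"))
    spark = configure_spark_with_delta_pip(builder).getOrCreate()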

    Step 2 – Here, we create a Delta table on an S3 path. We could write the data directly into the S3 bucket as a Delta table, but it is better to create the table first and then write into S3 to make sure the schema is correct.

    If the table already exists, set the Delta location path so you can run Spark SQL queries against it; otherwise, create the Delta table on the S3 path.

    # If table is already there
    delta_path = "s3a://abhishek-test-01012023/delta-lake-sample-data/"
    spark.conf.set('table.location', delta_path)
    
    # Creating new delta table on s3 location
    spark.sql("CREATE TABLE delta.`s3://abhishek-test-01012023/delta-lake-sample-data/`(id INT, first_name String, "
             "last_name String, address String, pincocde INT, net_income INT, source_of_income String, state String, "
             "email_id String, description String, population INT, population_1 String, population_2 String, "
             "population_3 String, population_4 String, population_5 String, population_6 String, population_7 String, "
             "date String) USING DELTA PARTITIONED BY (date)")

    Step 3 – Below is a link to the script used to generate the dummy data and write it into the S3 bucket as Delta tables. Feel free to look it over. An example of the write is given below:

    df.write.format("delta").mode("append").partitionBy("date").save("s3a://abhishek-test-01012023/delta-lake-sample-data/")

    https://github.com/velotio-tech/delta-lake-iceberg-poc/blob/0396cdbf96230609695a907fdbe8c240042fce9e/delta-data-writer.py#L83

    In the above link, you will find the code for the dummy data generation.

    Step 4 – Here, we print the count and select some data from the Delta table that we have just written.

    Run the SQL query to check the table data and upsert using S3 bucket data:

    spark.sql("select count(*) from delta.`s3://abhishek-test-01012023/delta-lake-sample-data/` group by id having count("
             "*) > 1").show()
    
    spark.sql("select count(*) from delta.`s3://abhishek-test-01012023/delta-lake-sample-data/`").show()
    
    #############################################################################
    # Upsert
    #############################################################################
    
    # Upsert the first five records: read them, change some column values,
    # and merge them back into the Delta table.
    
    input_df = spark.read.csv("s3a://abhishek-test-01012023/incoming_data/delta/4e0ae9f5-8c9d-435a-a434-febff1effbc3.csv",inferSchema=True,header=True)
    input_df.printSchema()
    input_df.createOrReplaceTempView("incoming")
    spark.sql("MERGE INTO delta.`s3://abhishek-test-01012023/delta-lake-sample-data/` t USING (SELECT * FROM incoming) s ON t.id = s.id WHEN MATCHED THEN UPDATE SET * WHEN NOT MATCHED THEN INSERT *")

    This is the output of the select statement:

    This is the schema of the incoming data we are planning to merge into the existing Delta table:

    After the upsert, let’s see the data for a particular partition:

    spark.sql("select * from delta.`s3://abhishek-test-01012023/delta-lake-sample-data/` where id = 1 and date = 20221206").show()

    Access the Delta table using Hive or any other external metastore:

    For that, we have to create a link between the Delta table and the metastore. To create this link, go back to the Spark code and generate a manifest file on the S3 path where we have already written the data.

    spark.sql("GENERATE symlink_format_manifest FOR TABLE 
    delta.`s3a://abhishek-test-01012023/delta-lake-sample-data/`")

    This will create the manifest folder. Now go to Athena and run this query:

    CREATE EXTERNAL TABLE delta_db.delta_table(id INT, first_name String, last_name String, address String, pincocde INT, net_income INT, source_of_income String, state String, email_id String, description String, population INT, population_1 String, population_2 String, population_3 String, population_4 String, population_5 String, population_6 String, population_7 String) 
    PARTITIONED BY (date String)
    ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
    STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'
    OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
    LOCATION 's3://abhishek-test-01012023/delta-lake-sample-data/_symlink_format_manifest/'

    MSCK REPAIR TABLE delta_db.delta_table

    You will be able to query the data. 

    Conclusion: A New Dawn

    In a world where data continued to grow in volume and complexity, Delta Lake stood as a beacon of hope. It empowered organizations to manage their data lakes with unprecedented efficiency and extract insights with unwavering confidence.

    The adoption of Delta Lake marked a new dawn in the realm of data. Whether dealing with structured or semi-structured data, it was the answer to the prayers of data warriors. As the sun set on traditional formats, Delta Lake emerged as the hero they had been waiting for—a hero who had not only revolutionized data storage and processing but also transformed the way stories were told in the world of data.

    And so, the legend of Delta Lake continued to unfold, inspiring data adventurers across the land to embark on their own quests, armed with the power of ACID transactions, time travel, and the promise of a brighter, data-driven future.

  • Spatial Data Analytics : The What, Why, and How?

    Introduction

    Have you ever wondered how Google Maps, Starlink, Zomato, Arogya Setu, and even methods like population clustering are able to add value to the human world? Well, the common thread between these applications and technologies is the use of spatial data and analysis techniques.

    Both Google Maps and Zomato use spatial techniques to provide navigation and location-based information to their users. While Arogya Setu is a contact tracing app that uses spatial data to track the spread of infectious illnesses, Starlink uses spatial data analysis to provide internet access to remote areas around the world. Population clustering is a technique that can be useful for urban planning, public health, and disaster response. Since the use of spatial data and its analysis techniques has become increasingly critical in the current scenario, let’s understand some fundamentals and explore different aspects of spatial data analytics.

    So, welcome to the world of spatial data analytics, where data meets geography and insights come to life! The use of spatial data analytics has changed the way we understand and interact with the world around us, providing insights and solutions to some of the most pressing challenges facing humanity today. Let’s ease into the process by taking a quick tour of a spatial journey that you might have never been on before.

    What is spatial data analytics?

    Before we start talking about the process of spatial data analytics, let’s try to understand what is special about the term “spatial data.” Spatial data, also known as “Geospatial data,” refers to data representing features or objects on the Earth’s surface. Whether it’s man-made or natural, if it has to do with a specific location on the surface of the Earth, it’s spatial. Spatial data describes where things are now or, perhaps, where they were in the past or will be in the future.

    This data can be further classified as:

    Geometric Data:

    Geometric data is a type of spatial data mapped on a two-dimensional flat surface. Google Maps is an application that uses geometric data to provide accurate directions.

    Geographic Data:

    Geographic data is information mapped around the Earth that highlights the latitude and longitude relationships to a specific object or location. A familiar example of geographic data is a Global Positioning System (GPS).

    Spatial data is not limited to structured information; it also comprises imagery from satellites and drones, address data points, and longitudinal and latitudinal data. Primarily, spatial data is classified as vector data and raster data. Vector data consists of coordinate information, while raster data is all about layers of images extracted from camera sensors.

    The real world can be represented as below, where the built environment (roads, buildings) and administrative data (countries, census areas) tend to be represented as vector data. Natural environment (e.g., elevation, temperature, precipitation) is often represented using a raster grid.

    • Discrete data, stored according to its exact geographic location, is called vector data.
    • Continuous data, represented by regular grids, is called raster data.
    • Attributes describe the features in both representations.
    (Image source:  CVRD)

    Vector Data:

    • Points: Represented by a single dot on the layer, defined by an x, y (and optionally z) coordinate.
    • Lines: Defined by two or more coordinate pairs connected in sequence, giving them a definite length. They are used for rivers, roads, railways, ferry routes, and even major pipeline flows.
    • Polygons: Defined using three or more coordinates. They are used to represent features such as inland water bodies like lakes, buildings, etc.

    Raster Data:

    • Raster is all about multilayered map images from satellites, drones, and various other camera sensors (ortho-imagery).
    • It is stored in cell-based and color-pixel formats. These pixels are arranged in columns and rows.
    • Because the data is richer, certain analyses can be performed better than with vector-based data.
    • It can give you more accurate measurements than other types of data.
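    As a small illustration, here is a minimal sketch of reading raster cells in Python (assuming the third-party rasterio package is installed and a hypothetical elevation.tif file exists):

    import rasterio

    with rasterio.open("elevation.tif") as src:
        band = src.read(1)  # first band as a 2-D array of cell values
        # CRS, cell resolution, and the rows x columns grid shape.
        print(src.crs, src.res, band.shape)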

    Attributes and Properties:

    • Spatial data contains more information than just a location on the surface of the Earth.
    • Any additional information, or non-spatial data, that describes a feature is referred to as an attribute.
    • In addition to locational and attribute information, spatial data inherently contains geometric and topological properties, which help to gain deeper insights.
    • Geometric properties include position and measurements, such as length, direction, area, and volume.
    • Topological properties represent spatial relationships such as connectivity, inclusion, and adjacency.

    As seen above, spatial data includes information such as geographic coordinates, elevation, and demographic information. Hence, it can be used to identify patterns, correlations, and trends that are not readily apparent through other data sources. For instance, geospatial data can be used to map the distribution of air pollution across a city, identify areas at risk for natural disasters like floods or wildfires, or monitor changes in land use over time. Here is where the analytics process takes place to uncover insights that can aid in providing solutions.

    Spatial data analytics involves collecting, processing, and analyzing various types of spatial data to determine not only where and when something occurs but also why it occurs at that specific place and/or time. It can be further broken down into descriptive analytics, which involves summarizing and visualizing spatial data to identify patterns and trends; predictive analytics, which uses statistical models to make predictions about future events or trends based on past data; and prescriptive analytics, which uses optimization techniques to determine the best course of action given a specific set of circumstances.

    Why is spatial data analytics important?

    Spatial data analytics plays an essential role in many industries and fields, providing insights and solutions that can have a significant impact on our daily lives. It aids businesses in gaining a competitive edge through improved decision-making and time and money savings. Urban planning, telecommunications, military, public health, and emergency management are just a few examples of industries that rely heavily on spatial data analytics to make informed decisions.

    (Image source:  OneStopGIS)

    Public Health

    A patient’s location directly influences their health. Whether it’s disease prevention or clinic site selection, considering spatial aspects in healthcare analytics can have a drastic impact.

    Urban Planning

    An urban planner might want to assess the extent of urban fringe growth, quantify the population growth that some suburbs are witnessing, and understand why these particular suburbs are growing while others are not.

    Environmental and Natural Resources

    Protecting our world against climate change, promoting biodiversity, exploration, and conservation planning requires spatial storytelling and sophisticated environmental analysis.

    Space and Navigation

    Optimizing transport infrastructure and navigation spatially is key to the future of mobility. The most efficient cities are moving away from traditional methods to analyze new data. 

    Telecommunication

    Since network signal strength fluctuates by location over time, spatial analytics helps telecommunications companies understand where anomalies occur and then resolve them.

    Architecture, Engineering, and Construction

    The leading AEC firms are going beyond traditional workflows to use spatial data science in urban planning and site selection, reducing costs and boosting project profitability. A geological engineer might want to identify the best localities for constructing buildings in an earthquake-prone area by looking at rock formation characteristics.

    Military

    Spatial predictive analytics helps the military optimize the placement of resources, assess infrastructure, improve situational awareness, anticipate maintenance needs, and meet deadlines.

    Weather Forecasting

    Spatial analytics enables rapid response to extreme weather by visualizing blizzards, wildfires, and hurricanes fast enough to issue effective evacuation alerts. It also helps airlines with routing and gives insurance companies a better way to assess property risk.

    How to perform spatial data analytics?

    The process of spatial data analytics involves data gathering, data cleaning, data processing, and visualization, much like any other traditional analytics technique. The specific details of the process will be determined on the basis of the data and the goals of the analysis.

    Data Collection: The initial stage in spatial data analytics is to collect the relevant data. This involves gathering data from different sources, such as remote sensing satellites, GPS-enabled devices, social media, or survey instruments. The data may include geographic coordinates, attributes of features, and other pertinent information that can help analyze the data.

    Data Cleaning and Preprocessing: Once the data is collected, it needs to be cleaned and preprocessed to ensure that it is accurate and usable for further processing. This may involve eliminating duplicates, filling in missing values, and standardizing data formats.

    Data Transformation: Spatial data is often obtained from numerous sources and in a variety of forms, so the next step is to transform and combine the data into a single data set. This may involve joining tables or layers based on a shared attribute or location.

    Data Analysis: This part of spatial data analytics involves identifying spatial patterns and relationships in the data. This may involve various techniques such as clustering, interpolation, spatial regression, and spatial autocorrelation analysis. The analysis may also include visualizing the data using maps, charts, and graphs for spatial data exploration.

    Modeling and Prediction: Based on the results of spatial analysis, it may be possible to build models to predict future patterns or trends in the data as a part of predictive analytics. This may involve using machine learning algorithms or other statistical techniques to identify patterns and make predictions.

    Business Intelligence: Finally, the results of spatial data analytics can be used to support decision-making in a variety of contexts, such as urban planning, natural resource management, or emergency response. The decision-making process may involve evaluating trade-offs between different options and considering the potential impact of different decisions on the spatial patterns in the data.

    Tools and Techniques:

    Spatial Data Storage

    Spatial data storage is a specialized form of data storage that takes into account the spatial relationships between various data points, allowing for more efficient and effective analysis and retrieval of information. There are many tools available for spatial data storage, including both open-source and proprietary software. Here are a few instances of such tools.

    (Image source:  Safe Software)

    RDBMS (Relational Database Management Systems): RDBMS are among the most widely used methods for storing geographical data, thanks to extensions that enable spatial features. Examples of RDBMS that support geographic data include:

    • PostgreSQL with the PostGIS extension
    • MySQL with its built-in spatial data types
    • Oracle Database with Oracle Spatial
    • Microsoft SQL Server with its spatial data types

    Spatial File Formats: Spatial file formats are widely used for storing and sharing spatial data. Examples of spatial file formats include:

    • Shapefile (.shp)
    • GeoJSON (.geojson)
    • Keyhole Markup Language (KML) (.kml)
    • Geography Markup Language (GML) (.gml)

    NoSQL: NoSQL databases are becoming increasingly popular for spatial data storage due to their ability to handle large and complex datasets, flexible schemas, and scalability. Examples of NoSQL databases that support spatial data include:

    • MongoDB with geospatial indexes and query operators
    • Elasticsearch with geo-point and geo-shape field types
    • Redis with its geospatial commands

    Cloud-based Storage Services: Cloud-based storage services on AWS, GCP, and Azure are popular options for storing spatial data and often serve as the foundation for data lakes. Examples of cloud-based storage services that support spatial data include:

    • Amazon S3 with Amazon S3 GeoSpatial Indexing
    • Google Cloud Storage with Google Cloud Storage Geo-Location
    • Microsoft Azure Blob Storage with Azure Spatial Anchors

    Spatial Data Warehouses: Spatial data warehouses are specialized databases designed for spatial data analysis. Examples of data warehouses with spatial support include:

    • Google BigQuery with BigQuery GIS
    • Snowflake with its GEOGRAPHY data type
    • Amazon Redshift with its spatial functions

    It can be noted that tools, such as RDBMS and NoSQL databases, can also be used for spatial data analytics and processing in addition to storage.

    Spatial Data Processing

    Spatial data processing is an important step in spatial data analytics to ensure that the data is properly aligned and in a consistent format before further analysis. This is a must-do step because various applications and data sources use different formats and coordinate systems, which might lead to several difficulties when analyzing the data.

    Below are a few examples of processing methods in a spatial context that ensure that spatial data is compatible and consistent across different applications and data sources.

    Reprojection: Reprojection is the process of converting spatial data from one map projection to another. This is frequently necessary when working with data from multiple sources that use different projections.

    Coordinate System/Datum Transformation: This transformation involves converting spatial data from one coordinate system to another or from one geodetic datum to another. This is important when working with data from different sources that use different coordinate systems and information.

    Resampling: Resampling involves changing the resolution or scale of spatial data. This is often necessary when handling data at different scales coming from different sources.

    Geocoding: Geocoding is the process of converting a street address or other location description into a set of geographic coordinates. This allows the location to be plotted on a map and later analyzed in a spatial context.
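    To make this concrete, here is a minimal sketch using the third-party geopy library (an assumption on our part; it is not covered elsewhere in this post and requires `pip install geopy` plus network access):

    from geopy.geocoders import Nominatim

    # Nominatim is OpenStreetMap's free geocoder; the user_agent string
    # identifies your application and is required by the service.
    geolocator = Nominatim(user_agent="spatial-analytics-demo")

    location = geolocator.geocode("1600 Amphitheatre Parkway, Mountain View, CA")
    if location is not None:
        print(location.latitude, location.longitude)

    # Reverse geocoding: coordinates back to a human-readable address.
    print(geolocator.reverse("37.4220, -122.0841"))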

    Georeferencing: Georeferencing is the method of aligning geographic data to a specific coordinate system or reference system. This is often required when working with data from several sources, such as aerial photographs or satellite imagery.

    Digitizing: Digitizing is the process of converting analog maps or other spatial data into a digital format. This involves manually tracing features such as roads, buildings, and water bodies using a computer program.

    Several tools are available that can perform such data processing techniques, and a few of these tool instances are given below.

    GIS (Geographic Information Systems): GIS connects data to a map, integrating location data with all types of descriptive information. It helps users understand patterns, relationships, and geographic context. The benefits include improved communication and efficiency, as well as better management and decision-making. Examples of GIS software that supports spatial data processing include:

    • ArcGIS – A proprietary GIS software with a comprehensive set of features and tools
    • QGIS – An open-source GIS software with a wide range of plugins and tools

    Python Libraries: Python is a popular programming language for spatial data processing, and there are several libraries available for this purpose. Examples of Python libraries that support spatial data processing include:

    • GeoPandas: A library for working with geospatial data in Python
    • Shapely: A library for manipulation and analysis of planar geometric objects
    • PySAL: A library for spatial analysis and modeling
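    As an illustration of the first two libraries, here is a minimal sketch (assuming GeoPandas and Shapely are installed; the cities.geojson input file is hypothetical):

    import geopandas as gpd
    from shapely.geometry import Point

    # Load point features from a GeoJSON file into a GeoDataFrame.
    cities = gpd.read_file("cities.geojson")

    # Reprojection: convert from WGS84 (EPSG:4326) to Web Mercator (EPSG:3857).
    cities_mercator = cities.to_crs(epsg=3857)

    # Proximity: distance (in projected units) from a reference point.
    reference = gpd.GeoSeries([Point(77.5946, 12.9716)], crs=4326).to_crs(epsg=3857).iloc[0]
    cities_mercator["distance_m"] = cities_mercator.geometry.distance(reference)

    print(cities_mercator.sort_values("distance_m").head())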

    R Packages: Like Python, R is another popular programming language for spatial data processing, and there are several packages available for spatial data operations. Examples of R packages that support spatial data processing include:

    • sf: an R package for working with geospatial data
    • sp: an R package for spatial data analysis
    • raster: an R package for working with raster data

    SQL: SQL can be used for spatial data processing and analysis, especially when working with spatial databases with extensions like PostGIS. Examples of SQL spatial functions include:

    • ST_Distance – computes the distance between two geometries
    • ST_Contains – checks whether one geometry contains another
    • ST_Intersects – checks whether two geometries intersect
    • ST_Buffer – builds a buffer zone around a geometry

    Command-line Tools: There are a handful of command-line tools available for spatial data processing. Examples of command-line tools that support spatial data processing include:

    • GDAL/OGR: a suite of tools for geospatial data processing and conversion
    • GRASS GIS: a command-line tool for geospatial analysis and modeling

    There are a few other tools for data processing worth exploring, such as MATLAB, GeoServer, Global Mapper, and Mapbox. Tools like GIS software and Python libraries can also be used for spatial data storage and analysis in addition to processing.

    (Image source:  Carto)

    Spatial Data Analysis

    Spatial data analysis is the process of examining geographic data to spot trends, correlations, and patterns. It involves the use of statistical, computational, and visualization methods to explore spatial data and extract business insights. There are different categories of spatial data analysis, such as:

    Proximity Analysis: It involves measuring the distance between two or more places in a spatial dataset. It is possible to analyze proximity using methods like Euclidean distance.

    Accessibility Analysis: It is a measure of how easy it is to reach a location from other locations in the dataset. In addition to distance, accessibility analysis takes into account other factors that affect how easy it is to travel between locations, such as traffic, road conditions, and public transportation.

    Spatial Clustering: Spatial clustering is the process of identifying groups of spatially adjacent objects that have similar characteristics. Hierarchical clustering and k-means clustering are two methods that can be used to accomplish this.
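    For instance, here is a minimal k-means sketch on a handful of made-up lon/lat points (assuming scikit-learn is installed):

    import numpy as np
    from sklearn.cluster import KMeans

    # Two tight groups of points around two different cities.
    points = np.array([[77.59, 12.97], [77.60, 12.98],
                       [72.87, 19.07], [72.88, 19.08]])
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(points)
    print(labels)  # e.g., [0 0 1 1]: two spatial clusters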

    Spatial Interpolation: Spatial interpolation involves estimating values for locations where data is not available based on nearby data points. This can be done using techniques such as kriging or inverse distance weighting.
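    To build intuition, here is a minimal inverse-distance-weighting sketch in plain NumPy; the gauge stations and rainfall values are made up for illustration:

    import numpy as np

    def idw(known_xy, known_values, query_xy, power=2):
        """Estimate values at query points from nearby known points."""
        known_xy = np.asarray(known_xy, dtype=float)
        known_values = np.asarray(known_values, dtype=float)
        estimates = []
        for q in np.asarray(query_xy, dtype=float):
            d = np.linalg.norm(known_xy - q, axis=1)
            if np.any(d == 0):  # query coincides with a known point
                estimates.append(known_values[d == 0][0])
                continue
            w = 1.0 / d ** power  # closer points get higher weights
            estimates.append(np.sum(w * known_values) / np.sum(w))
        return np.array(estimates)

    # Estimate rainfall at (1, 1) from three gauge stations.
    stations = [(0, 0), (2, 0), (0, 2)]
    rainfall = [10.0, 14.0, 12.0]
    print(idw(stations, rainfall, [(1, 1)]))  # equidistant stations -> 12.0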

    Spatial Exploratory Data Analysis: It involves creating visual representations of spatial data to explore patterns and relationships. Spatial EDA helps to identify patterns and relationships that may not be immediately apparent from the data and can help guide further analysis. This can include techniques such as choropleth maps, heat maps, or scatter plots.

    Spatial Simulation: This involves using simulation models to study the behavior of spatial systems over time. Spatial simulation includes techniques such as cellular automata, agent-based models, and Monte Carlo simulations. Spatial simulation is useful for predicting the future behavior of spatial systems under different scenarios.

    There are other categories, like factor analysis, trajectory analysis, network analysis, etc., that can be used for fine-grained spatial data analysis. Below are a few examples of tools that can be used to devise an analysis of spatial data.

    GIS (Geographic Information Systems): As seen earlier, this software can be used not only for capturing and processing the data but also to analyze and display geographically referenced data. Examples include ArcGIS, QGIS, and GRASS GIS.

    Open Source Libraries and Binaries: Various programming languages offer many packages for spatial data analysis, such as the sp package for handling spatial data and the rgdal package for reading and writing geospatial data formats in R, or packages such as GeoPandas and Shapely that provide functionality for working with geospatial data in Python. The list goes on with the GDAL framework and its dependencies.

    PostGIS: PostGIS is a spatial database extender for the PostgreSQL database management system. It adds support for geographic objects, allowing you to manipulate and query geospatial data within the database for any kind of analysis purpose.

    Data Visualization Tools: These tools are used to create visual representations of spatial data for exploratory data analysis. Examples include Tableau, ArcGIS Pro, and QGIS.

    Mapbox: Mapbox is a mapping platform that provides APIs and SDKs for building custom maps and applications. It includes tools for data visualization, geocoding, routing, and more.

    ENVI: ENVI is a software package for processing and analyzing remote sensing data. It includes tools for image classification, spectral analysis, and terrain modeling, among others.

    Spatial data analysis plays an essential role in understanding complex spatial patterns and relationships and can help formulate business decisions in a wide range of areas. The choice of tool depends on the type of analysis that needs to be performed, the size of the data set, and the resources available.

    How to solve spatial big data problems?

    Big data refers to datasets that are too large and complex to process and analyze using the traditional methods that we discussed earlier. When dealing with spatial data, the challenges of big data are amplified due to the added dimensions of space and time. Spatial data is being captured at an unusual rate because of the growing numbers of sensors and devices, the networks of GPS satellites and cell towers, and the rise of the Internet of Things.

    Spatial data analytics can leverage strategies and resources, including distributed computing, cloud computing, and parallel processing, to address the above issues. These techniques allow for the processing and analysis of large spatial datasets, enabling real-time decision-making in industries including transportation, agriculture, and public safety. For instance, massive geographic data analytics are used by real-time traffic management systems to optimize traffic flows, ease congestion, and improve safety.

    (Image source:  Utilizing Cloud Computing to Address Big Geospatial Data Challenges Paper)

    Apache Sedona (formerly GeoSpark):

    • A geospatial extension of Apache Spark that provides geospatial data analytics capabilities.
    • Supports different spatial indexes, such as R-Tree, Quadtree, and K-D Tree, which can improve the performance of spatial queries and operations.
    • Supports various spatial queries, such as range queries, KNN queries, and spatial joins.
    • Designed to work with other components of the Spark ecosystem, such as Spark SQL and Spark Streaming.
    • Provides support for machine learning algorithms on geospatial data, such as clustering and classification.
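    Below is a minimal sketch of a Sedona range query (assuming the apache-sedona package and its Spark jars are installed; the sample points are made up):

    from pyspark.sql import SparkSession
    from sedona.register import SedonaRegistrator

    spark = SparkSession.builder.appName("sedona-demo").getOrCreate()
    SedonaRegistrator.registerAll(spark)  # registers the ST_* SQL functions

    # A small table of raw lon/lat records for illustration.
    spark.createDataFrame(
        [(1, 73.5, 18.5), (2, 75.0, 20.0)], ["id", "lon", "lat"]
    ).createOrReplaceTempView("raw_points")

    # Build point geometries from the coordinates.
    spark.sql("""
        SELECT id, ST_Point(CAST(lon AS Decimal(24, 20)),
                            CAST(lat AS Decimal(24, 20))) AS geom
        FROM raw_points
    """).createOrReplaceTempView("points")

    # Range query: keep only points falling inside a bounding box.
    spark.sql("""
        SELECT id FROM points
        WHERE ST_Contains(ST_PolygonFromEnvelope(73.0, 18.0, 74.0, 19.0), geom)
    """).show()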

    SpatialHadoop:

    • A geospatial extension of Apache Hadoop that provides spatial data analytics capabilities.
    • Can process and analyze large-scale spatial data in a distributed environment using the MapReduce paradigm.
    • Supports different spatial indexes, such as R-Tree and Grid File, which can improve the performance of spatial queries and operations.
    • Supports various spatial queries, such as range queries, KNN queries, and spatial joins.
    • Designed to work with other components of the Hadoop ecosystem, such as HDFS, MapReduce, and Hive.

    BigQuery GIS:

    • A Google Cloud Platform capability that provides geospatial data analytics within BigQuery.
    • It is a fully-managed service that automatically scales up or down based on the volume of data and the complexity of queries.
    • Supports different spatial indexes, such as R-Tree and Hilbert Curve, which can improve the performance of spatial queries and operations.
    • Supports various spatial queries, such as range queries, KNN queries, and spatial joins.
    • Designed to work with other components of the BigQuery ecosystem, such as BigQuery ML, BigQuery BI Engine, and BigQuery Geo Viz.

    There are a few other tools and extensions, like Esri GIS Tools for Hadoop, SpatialSpark, and Google Earth Engine, that can be used to gain insights and make informed decisions based on spatial data.

    Case Study

    In the telecommunications industry, spatial analysis can be used to optimize network coverage and capacity, plan new infrastructure, and identify areas of high network congestion.

    Let’s consider a hypothetical telecommunications company that wants to improve its network performance and customer experience by analyzing geospatial data. Specifically, the company wants to analyze call detail records (CDRs) to identify areas of high call volume and network congestion.

    (Image source:  Microsoft Azure Architectures)

    In a given solution by Azure in a published article, the suggested architecture involves:

    • Azure Data Factory, which is used to collect the CDRs from various sources (mainly geospatial databases).
    • Azure Data Factory stores them in Azure Data Lake Storage in formats such as GeoJSON, WKT, and Vector tiles. The bronze container holds raw data, the silver container holds semi-curated data, and the gold container holds fully curated data as the processing proceeds.
    • Azure Databricks and the GeoSpark/Sedona package are being used to convert data formats and efficiently load, process, and analyze large-scale spatial data across machines.
    • GeoPandas exports data in various formats, which are later used by GIS applications such as QGIS and ArcGIS for exploratory analysis.
    • Azure Machine Learning extracts insights from geospatial data, determining, for example, where and when to deploy new wireless access points.
    • Power BI or Azure Maps can be used to visualize the geospatial data and identify areas where network upgrades or infrastructure improvements are needed.
    • A log analytics system is set up to run queries against data in Azure Monitor Logs to implement a robust and fine-grained logging system to analyze events and performance.

    Overall, the Azure-based solution gives an idea about how one can try to perform geospatial analysis in the telecommunications industry and improve network performance and customer experience. You can read more about this solution here.

    Challenges and limitations

    Spatial data analytics is an essential component of decision-making across a range of industries, and it is important to understand its techniques, infrastructure, and challenges to leverage spatial data effectively. Spatial data collection can be challenging, and the data may contain faults or inconsistencies. The data may not be available for certain geographic areas or for certain time periods. Spatial data analytics can also raise privacy concerns if personal data is collected and used without consent. In addition, there may be concerns about the use of spatial data analytics for surveillance or other unethical purposes, which can lead to significant harm.

    Conclusion

    Spatial data analytics is a powerful tool that can help organizations make better-informed decisions and gain a competitive advantage. As the fields of machine learning and spatial data analysis intertwine, spatial data analytics looks promising and genuinely useful for real-life problems. The blend of vector and raster data produces a powerful product that can tackle various economic and earth-related problems. This blog is just a high-level overview of spatial data analytics, and we have only scratched the surface, but I can guarantee that the spatial ride will be smoother from here on.

  • Know Everything About Spinnaker & How to Deploy Using Kubernetes Engine

    As marketed, Spinnaker is an open-source, multi-cloud continuous delivery platform that helps you release software changes with high velocity and confidence.

    Open sourced by Netflix and heavily contributed to by Google, it supports all major cloud vendors (AWS, Azure, App Engine, OpenStack, etc.), including Kubernetes.

    In this blog I’m going to walk you through all the basic concepts in Spinnaker and help you create a continuous delivery pipeline using Kubernetes Engine, Cloud Source Repositories, Container Builder, Resource Manager, and Spinnaker. After creating a sample application, we will configure these services to automatically build, test, and deploy it. When the application code is modified, the changes trigger the continuous delivery pipeline to automatically rebuild, retest, and redeploy the new version.

    What Does Spinnaker Provide?

    Application management and Application Deployment are its two core features.

    Application Management

    Spinnaker’s application management features can be used to view and manage your cloud resources.

    Modern tech organizations operate collections of services—sometimes referred to as “applications” or “microservices”. A Spinnaker application models this concept.

    Applications, Clusters, and Server Groups are the key concepts Spinnaker uses to describe services. Load balancers and Firewalls describe how services are exposed to users.

    Application

    • An application in Spinnaker is a collection of clusters, which in turn are collections of server groups. The application also includes firewalls and load balancers. An application represents the service which needs to be deployed using Spinnaker, all configuration for that service, and all the infrastructure on which it will run. Normally, a different application is configured for each service, though Spinnaker does not enforce that.

    Cluster

    • Clusters are logical groupings of Server Groups in Spinnaker.
    • Note: Cluster, here, does not map to a Kubernetes cluster. It’s merely a collection of Server Groups, irrespective of any Kubernetes clusters that might be included in your underlying architecture.

    Server Group

    • The base resource, the Server Group, identifies the deployable artifact (VM image, Docker image, source location) and basic configuration settings such as number of instances, autoscaling policies, metadata, etc. This resource is optionally associated with a Load Balancer and a Firewall. When deployed, a Server Group is a collection of instances of the running software (VM instances, Kubernetes pods).

    Load Balancer

    • A Load Balancer is associated with an ingress protocol and port range. Traffic is balanced among the instances present in Server Groups. Optionally, health checks can be enabled for a load balancer, with flexibility to define health criteria and specify the health check endpoint.

    Firewall

    • A Firewall defines network traffic access. It is effectively a set of firewall rules defined by an IP range (CIDR) along with a communication protocol (e.g., TCP) and port range.

    Application Deployment

    Pipeline

    • The pipeline is the key deployment management construct in Spinnaker. It consists of a sequence of actions, known as stages. Parameters can be passed from one stage to the next one in the pipeline.
    • You can start a pipeline manually, or you can configure it to be automatically triggered by an event, such as a Jenkins job completing, a new Docker image being pushed in your docker registry, a CRON type schedule, or maybe a stage in another pipeline.
    • You can configure the pipeline to emit notifications, by email, SMS or HipChat, to interested parties at various points during pipeline execution (such as on pipeline start/complete/fail).

    Stage

    • A Stage in Spinnaker is an atomic building block for a pipeline, describing an action that the pipeline will perform. You can sequence stages in a Pipeline in any order, though some stage sequences may be more common than others. There are different types of stages in Spinnaker, such as Deploy, Manual Judgment, Resize, Disable, and many more. You can find the full list of stages and read about the implementation details for each provider here.

    Deployment Strategies

    • Spinnaker supports all the cloud-native deployment strategies, including Red/Black (a.k.a. Blue/Green), rolling red/black, canary deployments, and more.

    What is Spinnaker Made Of?

    Spinnaker is composed of a number of independent microservices:

    • Deck is the custom browser-based GUI.
    • Gate is the API gateway. All the API calls from UI (Deck) and other API callers go to Spinnaker through Gate.
    • Orca is the orchestration engine. It handles all ad-hoc operations and pipelines.
    • Clouddriver is responsible for all mutating calls to the cloud providers and for indexing/caching all deployed resources.
    • Front50 is used to persist the metadata of applications, pipelines, projects and notifications.
    • Rosco is the bakery. It helps to create machine images for various cloud vendors (for example GCE images for GCP, AMIs for AWS, Azure VM images). It currently wraps Packer, but will be expanded to support additional mechanisms for producing images.
    • Igor is used to trigger pipelines via continuous integration jobs in systems like Jenkins and Travis CI, and it allows Jenkins/Travis stages to be used in pipelines.
    • Echo is Spinnaker’s eventing bus. It supports sending notifications (e.g. Slack, email, Hipchat, SMS), and acts on incoming webhooks from services like GitHub.
    • Fiat is Spinnaker’s authorization service. It is used to query a user’s access permissions for accounts, applications and service accounts.
    • Kayenta provides automated canary analysis for Spinnaker.
    • Halyard is Spinnaker’s configuration service. Halyard manages the lifecycle of each of the above services. It only interacts with these services during Spinnaker start-up, updates, and rollbacks.

    By default, Spinnaker binds a port for each of the above-mentioned microservices. In our case, the UI (Deck) will be exposed on port 9000.

    What are We Going to Do?

    • Set up your environment by launching Cloud Shell, creating a Kubernetes Engine cluster, and configuring your identity and user management scheme.
    • Download a sample application, create a Git repository, and upload it to a Cloud Source Repository.
    • Deploy Spinnaker to Kubernetes Engine using Helm.
    • Build a Docker image from the source code.
    • Create triggers to create Docker images when the source code for application changes.
    • Configure a Spinnaker pipeline to reliably and continuously deploy your application to Kubernetes Engine.
    • Deploy a code change, triggering the pipeline, and watch it roll out to production.

    Note: This blog post uses various billable components in GCP, such as GKE and Container Builder.

    Pipeline Architecture

    To continuously deliver application updates to users, companies need an automated process that reliably builds, tests, and updates their software. Code changes should automatically flow through a pipeline that includes artifact creation, unit testing, functional testing, and production rollout. In some cases, a code update should apply to only a subset of users, so that it is exercised realistically before being pushed to the entire user base. If one of these canary releases proves unsatisfactory, the automated procedure must be able to quickly roll back the software changes.

    With Kubernetes Engine and Spinnaker, we can create a robust continuous delivery flow that helps us to ensure that software is shipped as quickly as it is developed and validated. Although rapid iteration is the end goal, we must first ensure that each application revision passes through a series of automated validations before becoming a candidate for production rollout. When a given change has been vetted through automation, we can also validate the application manually and conduct further pre-release testing.

    After the team decides the application is ready for production, one of the team members can approve it for production deployment.

    Application Delivery Pipeline

    We are going to build the continuous delivery pipeline shown in the following diagram.

    Prerequisites  

    • Fair bit of experience in GCP services like:  
    • GKE (Google Kubernetes Engine)
    • Google Compute
    • Google APIs
    • Cloud Source Repository
    • Container Builder
    • Cloud Storage
    • Cloud Load Balancing
    • Knowledge in K8s terminology like Services, Deployments, Pods, etc
    • Familiarity with Kubectl and Helm package manager

    Before Starting just enable the APIs needed on GCP

     Set Up a Kubernetes Cluster  

    1. Go to the Console and scroll the left panel down to Compute->Kubernetes Engine->Kubernetes Clusters.
    2. Click Create Cluster.
    3. Choose a name or leave as the default one.
    4. Under Machine Type, click Customize.
    5. Allocate at least 2 vCPU and 10GB of RAM.
    6. Change the cluster size to 2.
    7. Enable Legacy Authorization while customizing the cluster.
    8. Keep the rest of the defaults and click Create.

    In a minute or two the cluster will be created and ready to go.

    Configure identity and access management

    Create a Cloud Identity and Access Management (Cloud IAM) service account to delegate permissions to Spinnaker, allowing it to store data in Cloud Storage. Spinnaker stores its pipeline data in Cloud Storage to ensure reliability and resiliency. If our Spinnaker deployment unexpectedly fails, we can create an identical deployment in minutes with access to the same pipeline data as the original.

    1. Create the service account:

    $ gcloud iam service-accounts create spinnaker-storage-account  --display-name spinnaker-storage-account

    2.  Store the service account email address and our current project ID in environment variables for use in later commands:

    $ export SA_EMAIL=$(gcloud iam service-accounts list  --filter="displayName:spinnaker-storage-account"  --format='value(email)')
    $ export PROJECT=$(gcloud info --format='value(config.project)')

    3. Bind the storage.admin role to our service account:  

    $ gcloud projects add-iam-policy-binding $PROJECT --role roles/storage.admin --member serviceAccount:$SA_EMAIL

    4. Download the service account key. We will need this key later while installing Spinnaker and we need to also upload the key to Kubernetes Engine.  

    $ gcloud iam service-accounts keys create spinnaker-sa.json --iam-account $SA_EMAIL

    Deploying Spinnaker using Helm

    In this section, we will deploy Spinnaker onto the K8s cluster via charts, with the help of the K8s package manager Helm. Helm makes it very easy to deploy Spinnaker; deploying and configuring it manually via Halyard can be quite painful.

    Install Helm

    1. Download and install the helm binary:

    $ wget https://storage.googleapis.com/kubernetes-helm/helm-v2.9.0-linux-amd64.tar.gz

    2. Unzip the file to your local system:

    $ tar zxfv helm-v2.9.0-linux-amd64.tar.gz
    $ sudo chmod +x linux-amd64/helm && sudo mv linux-amd64/helm /usr/bin/helm

    3. Grant Tiller, the server side of Helm, the cluster-admin role in your cluster:

    $ kubectl create clusterrolebinding user-admin-binding  --clusterrole=cluster-admin --user=$(gcloud config get-value account)
    $ kubectl create serviceaccount tiller --namespace kube-system
    $ kubectl create clusterrolebinding tiller-admin-binding  --clusterrole=cluster-admin --serviceaccount=kube-system:tiller

    4. Grant Spinnaker the cluster-admin role so it can deploy resources across all namespaces:

    $ kubectl create clusterrolebinding --clusterrole=cluster-admin       --serviceaccount=default:default spinnaker-admin

    5. Initialize Helm to install Tiller in your cluster:

    $ helm init --service-account=tiller --upgrade
    $ helm repo update

    6. Ensure that Helm is properly installed by running the following command. If Helm is correctly installed, v2.9.0 appears for both client and server.

    $ helm version

    Configure Spinnaker

    1. Create a bucket for Spinnaker to store its pipeline configuration:

    $ export PROJECT=$(gcloud info --format='value(config.project)')
    $ export BUCKET=$PROJECT-spinnaker-config
    $ gsutil mb -c regional -l us-central1 gs://$BUCKET

    2. Create the configuration file:

    $ export SA_JSON=$(cat spinnaker-sa.json)
    $ export PROJECT=$(gcloud info --format='value(config.project)')
    $ export BUCKET=$PROJECT-spinnaker-config
    $ cat > spinnaker-config.yaml <<EOF
    # Disable minio as the default storage backend
    minio:
      enabled: false

    # Configure your Docker registries here
    accounts:
    - name: gcr
      address: https://gcr.io
      username: _json_key
      password: '$SA_JSON'
      email: 1234@5678.com
    EOF

    Deploy the Spinnaker chart

    1. Use the Helm command-line interface to deploy the chart with the configuration set earlier. This command typically takes five to ten minutes to complete, so we provide a deploy timeout with `--timeout`.

    $ helm install -n cd stable/spinnaker -f spinnaker-config.yaml --timeout 600 --version 0.3.1

    After the command completes, run the following command to set up port forwarding to the Spinnaker UI from Cloud Shell:

    $ export DECK_POD=$(kubectl get pods --namespace default -l  "component=deck" -o jsonpath="{.items[0].metadata.name}")
    $ kubectl port-forward --namespace default $DECK_POD 8080:9000  >> /dev/null &

    The above command exposes the Spinnaker UI on the local machine that we’re using to run all the commands. We can use any port of our choosing instead of 8080 in the above command. Now the UI can be opened at the URL http://localhost:8080.

    Building the Docker image

    In this section, we will configure Container Builder to detect changes to the application source code, build a Docker image when a change is detected, and push it to Container Registry.

    For this step, we will use a sample app provided by the Google community.

    Create your source code repository

    1. Download the source code:

    $ wget https://gke-spinnaker.storage.googleapis.com/sample-app.tgz

    2. Unpack the source code:

    $ tar xzfv sample-app.tgz

    3. Change directories to source code:

    $ cd sample-app

    4. Set the username and email address for Git commits in this repository. Replace [EMAIL_ADDRESS] with Git email address, and replace [USERNAME] with Git username.  

    $ git config --global user.email "[EMAIL_ADDRESS]"
    $ git config --global user.name "[USERNAME]"

    5. Make the initial commit to source code repository:

    $ git init
    $ git add .
    $ git commit -m "Initial commit"

    6. Create a repository to host the code:

    $ gcloud source repos create sample-app
    $ git config credential.helper gcloud.sh

    7. Add our newly created repository as remote:

    $ export PROJECT=$(gcloud info --format='value(config.project)')
    $ git remote add origin  https://source.developers.google.com/p/$PROJECT/r/sample-app

    8. Push the code to the new repository’s master branch:

    $ git push origin master

    9. Check that we can see our source code in the console.

    Configuring the build triggers  

    In this section, we configure Google Container Builder to build and push your Docker images every time we push Git tags to our source repository. Container Builder automatically checks out the source code, builds the Docker image from the Dockerfile in repository, and pushes that image to Container Registry.

    1. In the GCP Console, click Build Triggers in the Container Registry section.
    2. Select Cloud Source Repository and click Continue.
    3. Select your newly created sample-app repository from the list, and click Continue.
    4. Set the following trigger settings:
    • Name: sample-app-tags
    • Trigger type: Tag
    • Tag (regex): v.*
    • Build configuration: cloudbuild.yaml
    • cloudbuild.yaml location: /cloudbuild.yaml
    5. Click Create trigger.

    From now on, whenever we push a Git tag prefixed with the letter “v” to source code repository, Container Builder automatically builds and pushes our application as a Docker image to Container Registry.

    Let’s build our first image:

    Push the first image using the following steps:

    1. Go to source code folder in Cloud Shell.

    2. Create a Git tag:

    $ git tag v1.0.0

    3. Push the tag:  

    $ git push --tags

    4. In Container Registry, click Build History to check that the build has been triggered. If not, verify the trigger was configured properly in the previous section.

    Configuring your deployment pipelines

    Now that our images are building automatically, we need to deploy them to the Kubernetes cluster.

    We deploy to a scaled-down environment for integration testing. After the integration tests pass, we must manually approve the changes to deploy the code to production services.

    Create the application

    1. In the Spinnaker UI, click Actions, then click Create Application.

    2. In the New Application dialog, enter the following fields:

    1. Name: sample
    2. Owner Email: [your email address]

    3. Click Create.

    Create service load balancers

    To avoid having to enter the information manually in the UI, use the Kubernetes command-line interface to create load balancers for the services. Alternatively, we can perform this operation in the Spinnaker UI.

    On the local machine where the code resides, run the following command from the sample-app root directory:

    $ kubectl apply -f k8s/services

    Create the deployment pipeline

    Now we create the continuous delivery pipeline. The pipeline is configured to detect when a Docker image with a tag prefixed with “v” has arrived in your Container Registry.

    1. Create a new pipeline named, say, “Deploy”.

    2. Go to the Config page for the pipeline that we just created and click Pipeline Actions -> Edit as JSON.

    3. Change the directory to the source code directory and update the current pipeline-deploy.json at path spinnaker/pipeline-deploy.json according to our needs.

    $ export PROJECT=$(gcloud info --format='value(config.project)')
    $ sed s/PROJECT/$PROJECT/g spinnaker/pipeline-deploy.json > spinnaker/updated-pipeline-deploy.json

    4. Now in the JSON editor just copy the whole file spinnaker/updated-pipeline-deploy.json.

    5. Click on Update Pipeline and we should have an updated pipeline config now.

    6. In the Spinnaker UI, click Pipelines on the top navigation bar.

    7. Click Configure in the Deploy pipeline.

    8. The continuous delivery pipeline configuration appears in the UI:

    Running the pipeline manually

    The configuration we just created contains a trigger to start the pipeline when a new Git tag containing the prefix “v” is pushed. Now we test the pipeline by running it manually.  

    1. Return to the Pipelines page by clicking Pipelines.

    2. Click Start Manual Execution.

    3. Select the v1.0.0 tag from the Tag drop-down list, then click Run.

    4. After the pipeline starts, click Details to see more information about the build’s progress. This section shows the status of the deployment pipeline and its steps. Steps in blue are currently running, green ones have completed successfully, and red ones have failed. Click a stage to see details about it.

    5. After 3 to 5 minutes the integration test phase completes and the pipeline requires manual approval to continue the deployment.

    6. Hover over the yellow “person” icon and click Continue.

    7. Your rollout continues to the production frontend and backend deployments. It completes after a few minutes.

    8. To view the app, click Load Balancers in the top right of the Spinnaker UI.

    9. Scroll down the list of load balancers and click Default, under sample-frontend-prod.  

    10. Scroll down the details pane on the right and copy the application’s IP address by clicking the clipboard button on the Ingress IP.

    11. Paste the address into the browser to view the production version of the application.

    12. We have now manually triggered the pipeline to build, test, and deploy our application.
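
    Independently of the Spinnaker UI, you can confirm from Cloud Shell that the deployments and pods were created on the cluster:

    $ kubectl get deployments
    $ kubectl get pods -w   # -w streams pod status changes as they happen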

    Triggering the pipeline automatically via code changes

    Now let’s test the pipeline end to end by making a code change, pushing a Git tag, and watching the pipeline run in response. By pushing a Git tag that starts with “v”, we trigger Container Builder to build a new Docker image and push it to Container Registry. Spinnaker detects that the new image tag begins with “v” and triggers a pipeline to deploy the image to canaries, run tests, and roll out the same image to all pods in the deployment.

    1. Change the colour of the app from orange to blue: 

    $ sed -i 's/orange/blue/g' cmd/gke-info/common-service.go


    2. Commit your change, tag it, and push the tag to the source code repository:

    $ git commit -a -m "Change colour to blue"
    $ git tag v1.0.1
    $ git push --tags


    3. Watch the new build appear in the Container Builder Build History.

    4. Click Pipelines to watch the pipeline start to deploy the image. 

    5. Observe the canary deployment. When the deployment is paused, waiting to roll out to production, start refreshing the tab that contains the application. Nine of the backends are running the previous version, while only one backend is running the canary, so the new blue version should appear about every tenth time you refresh. (A quick way to check this from the shell is sketched after this list.)

    6. After testing completes, return to the Spinnaker tab and approve the deployment. 

    7. When the pipeline completes, the application looks like the following screenshot. Note that the colour has changed to blue because of the code change, and that the Version field now reads v1.0.1.

    8. We have now successfully rolled out the application to the entire production environment!

    9. Optionally, we can roll back this change by reverting the previous commit. Rolling back adds a new tag (v1.0.2) and pushes it through the same pipeline we used to deploy v1.0.1:

    $ git revert v1.0.1
    $ git tag v1.0.2
    $ git push --tags
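
    As mentioned in step 5, the canary split can also be observed from Cloud Shell. This sketch assumes the production frontend service is named sample-frontend-prod and that the page body contains the colour word; adjust both to match your environment:

    $ # Grab the external IP of the production frontend load balancer
    $ export SERVICE_IP=$(kubectl get svc sample-frontend-prod -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
    $ # Sample 20 requests; roughly 1 in 10 should hit the canary and return "blue"
    $ for i in $(seq 1 20); do curl -s http://$SERVICE_IP/ | grep -o 'blue\|orange'; done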


    Conclusion

    Now that you know how to get Spinnaker up and running in a development environment, you can start using it for your own workloads. In this blog, we covered everything from standing up a Kubernetes cluster on GCP to deploying an end-to-end continuous delivery pipeline, much as you would in a production environment. We hope you found it helpful.

    References

    https://cloud.google.com/solutions/continuous-delivery-spinnaker-kubernetes-engine