<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="4.3.3">Jekyll</generator><link href="https://pyblog.xyz/feed.xml" rel="self" type="application/atom+xml" /><link href="https://pyblog.xyz/" rel="alternate" type="text/html" /><updated>2026-02-27T15:24:26+00:00</updated><id>https://pyblog.xyz/feed.xml</id><title type="html">PYBLOG</title><subtitle>How do you know which &lt;br&gt; &lt;img id=&quot;showerButton&quot; class=&quot;twemoji&quot; src=&quot;https://pyblog.xyz/assets/img/emoji/watermelon.svg&quot; alt=&quot;&quot;&gt; to pick?</subtitle><author><name>Adesh Nalpet Adimurthy</name></author><entry><title type="html">Apache Flink Internals</title><link href="https://pyblog.xyz/flink-internals" rel="alternate" type="text/html" title="Apache Flink Internals" /><published>2026-02-26T00:00:00+00:00</published><updated>2026-02-26T00:00:00+00:00</updated><id>https://pyblog.xyz/flink-internals</id><content type="html" xml:base="https://pyblog.xyz/flink-internals">&lt;p&gt;Most blog posts on Flink&apos;s internals and architecture, even the official documentation, tend to be fragmented across different examples and cover components in isolation. The approach taken here is to follow a single reference Flink job end-to-end, through every component and moving part it touches, keeping the discussion grounded in the example, rather than attempting broad coverage of Flink&apos;s full capabilities. The tradeoff is intentional: depth over breadth.&lt;/p&gt;

&lt;h3&gt;1. Components&lt;/h3&gt;

&lt;p&gt;A running Flink system has two sides: the user-facing side and the system side.&lt;/p&gt;

&lt;p&gt;The user-facing side is the Client, where the application code lives. This includes the &lt;code&gt;DataStream&lt;/code&gt; API calls, job configuration, and JAR packaging. The Client&apos;s job is to compile that code into a graph representation and submit it to the cluster.&lt;/p&gt;

&lt;p&gt;&lt;img class=&quot;center-image-0 center-image-70&quot; src=&quot;./assets/posts/flink/flink-program.svg&quot; /&gt;&lt;/p&gt;

&lt;p&gt;The system side consists of the &lt;code&gt;JobManager&lt;/code&gt; and &lt;code&gt;TaskManagers&lt;/code&gt;. The &lt;code&gt;JobManager&lt;/code&gt; receives the submitted job, plans its execution, and coordinates the entire lifecycle: scheduling, checkpointing, failure recovery. &lt;code&gt;TaskManagers&lt;/code&gt; are the workers that receive individual tasks from the &lt;code&gt;JobManager&lt;/code&gt; and run the actual data processing.&lt;/p&gt;

&lt;p&gt;The journey from user code to running tasks involves a series of graph transformations, each adding the detail the runtime needs to distribute and execute the job across the cluster.&lt;/p&gt;

&lt;h3&gt;2. Code to Execution&lt;/h3&gt;

&lt;p&gt;Consider a simple streaming job: read from a source, apply a map transformation, group by key, aggregate in a window, and write to a sink.&lt;/p&gt;

&lt;p&gt;&lt;img class=&quot;center-image-0 center-image-65&quot; src=&quot;./assets/posts/flink/flink-topology.svg&quot; /&gt;&lt;/p&gt;

&lt;p&gt;The code does not execute anything until &lt;code&gt;env.execute()&lt;/code&gt; is called. Between that call and actual task execution, Flink builds a series of progressively more detailed graphs.&lt;/p&gt;

&lt;p&gt;&lt;img class=&quot;center-image-0 center-image-90&quot; src=&quot;./assets/posts/flink/flink-planning.svg&quot; /&gt;&lt;/p&gt;

&lt;h3&gt;2.1. Transformations&lt;/h3&gt;

&lt;p&gt;Each API call (&lt;code&gt;fromSource, map, keyBy, window, apply, sinkTo&lt;/code&gt;) creates a &lt;code&gt;Transformation&lt;/code&gt; object and appends it to a list inside the &lt;code&gt;StreamExecutionEnvironment&lt;/code&gt;. Each &lt;code&gt;Transformation&lt;/code&gt; holds a reference to its input, its output type, its parallelism, and the operator logic.&lt;/p&gt;
&lt;p&gt;Because each one points back to its input(s), they implicitly form a DAG.&lt;/p&gt;

&lt;p&gt;&lt;img class=&quot;center-image-0 center-image-90&quot; src=&quot;./assets/posts/flink/flink-transformations.svg&quot; /&gt;&lt;/p&gt;
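&lt;p&gt;The list-plus-back-reference structure can be sketched in a few lines (stdlib only, with hypothetical names; this is an illustration of the bookkeeping, not Flink&apos;s actual classes): each appended transformation points back to its input, so the flat list doubles as a DAG.&lt;/p&gt;

```java
import java.util.ArrayList;
import java.util.List;

// Minimal sketch of how API calls accumulate Transformation objects.
public class TransformationSketch {
    static class Transformation {
        final String name;
        final Transformation input; // null for sources; back-reference forms the DAG
        final int parallelism;
        Transformation(String name, Transformation input, int parallelism) {
            this.name = name;
            this.input = input;
            this.parallelism = parallelism;
        }
    }

    // Mirrors the list held inside StreamExecutionEnvironment.
    final List<Transformation> transformations = new ArrayList<>();

    Transformation add(String name, Transformation input, int parallelism) {
        Transformation t = new Transformation(name, input, parallelism);
        transformations.add(t);
        return t;
    }

    public static void main(String[] args) {
        TransformationSketch env = new TransformationSketch();
        Transformation source = env.add("Source", null, 2);
        Transformation map    = env.add("Map", source, 2);
        Transformation window = env.add("Window", map, 2);
        Transformation sink   = env.add("Sink", window, 1);
        // Walking the input references from the sink recovers the whole chain.
        for (Transformation t = sink; t != null; t = t.input) {
            System.out.println(t.name);
        }
    }
}
```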

&lt;details class=&quot;text-container&quot;&gt;&lt;summary class=&quot;p&quot;&gt; &amp;nbsp;Relevant Packages and Classes&lt;/summary&gt;
&lt;p&gt;In &lt;code&gt;streaming/api/transformations/&lt;/code&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Transformation, 
OneInputTransformation, 
SourceTransformation, 
PartitionTransformation, 
SinkTransformation
&lt;/code&gt;&lt;/pre&gt;
&lt;/details&gt;

&lt;h3&gt;2.2. Logical Topology&lt;/h3&gt;

&lt;p&gt;When &lt;code&gt;env.execute()&lt;/code&gt; fires, &lt;code&gt;StreamGraphGenerator&lt;/code&gt; walks the &lt;code&gt;Transformation&lt;/code&gt; list and produces a &lt;code&gt;StreamGraph&lt;/code&gt;, a DAG of &lt;code&gt;StreamNode&lt;/code&gt;s connected by &lt;code&gt;StreamEdge&lt;/code&gt;s.&lt;/p&gt;

&lt;p&gt;Each physical Transformation (Source, Map, Window/Apply, Sink) becomes a &lt;code&gt;StreamNode&lt;/code&gt;. Each &lt;code&gt;StreamNode&lt;/code&gt; holds its operator factory, parallelism, and serializers. Connections between nodes become &lt;code&gt;StreamEdges&lt;/code&gt;, each carrying a &lt;code&gt;StreamPartitioner&lt;/code&gt; that defines how data flows between operators.&lt;/p&gt;

&lt;p&gt;Non-physical Transformations like &lt;code&gt;PartitionTransformation&lt;/code&gt; (created by &lt;code&gt;keyBy&lt;/code&gt;) don&apos;t produce their own node. Instead, they attach partitioning information to the downstream edge. These are handled as virtual nodes during generation.&lt;/p&gt;

&lt;p&gt;&lt;img class=&quot;center-image-0 center-image-70&quot; src=&quot;./assets/posts/flink/flink-logical-topology.svg&quot; /&gt;&lt;/p&gt;

&lt;p&gt;The resulting &lt;code&gt;StreamGraph&lt;/code&gt; is a direct representation of the job logic. No optimization has happened yet.&lt;/p&gt;
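&lt;p&gt;The virtual-node handling can be sketched as follows (a simplified, hypothetical model; the strings &quot;FORWARD&quot; and &quot;HASH&quot; stand in for &lt;code&gt;ForwardPartitioner&lt;/code&gt; and &lt;code&gt;KeyGroupStreamPartitioner&lt;/code&gt;): physical transformations become nodes, while a partition transformation produces no node of its own and only stamps a partitioner onto the next downstream edge.&lt;/p&gt;

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of StreamGraph generation with virtual partition nodes.
public class StreamGraphSketch {
    static class Edge {
        final String from, to, partitioner;
        Edge(String from, String to, String partitioner) {
            this.from = from; this.to = to; this.partitioner = partitioner;
        }
    }

    final List<String> nodes = new ArrayList<>();
    final List<Edge> edges = new ArrayList<>();
    // Partitioner left behind by a virtual (partition) transformation.
    private String pendingPartitioner = "FORWARD";

    void addNode(String name, String input) {
        nodes.add(name); // physical transformation -> StreamNode
        if (input != null) {
            edges.add(new Edge(input, name, pendingPartitioner));
        }
        pendingPartitioner = "FORWARD"; // reset once an edge consumes it
    }

    void addPartition(String partitioner) {
        // Virtual node: no StreamNode, just remember the partitioner
        // so the next downstream edge carries it.
        pendingPartitioner = partitioner;
    }
}
```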

&lt;details class=&quot;text-container&quot;&gt;&lt;summary class=&quot;p&quot;&gt; &amp;nbsp;Relevant Packages and Classes&lt;/summary&gt;
&lt;p&gt;In &lt;code&gt;streaming/api/graph/&lt;/code&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;StreamGraphGenerator, 
StreamGraph, 
StreamNode, 
StreamEdge
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In &lt;code&gt;streaming/runtime/partitioner/&lt;/code&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;StreamPartitioner, 
ForwardPartitioner, 
KeyGroupStreamPartitioner, 
etc.
&lt;/code&gt;&lt;/pre&gt;
&lt;/details&gt;

&lt;h3&gt;2.3. Operator Chaining&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;StreamGraph&lt;/code&gt; is compiled into a &lt;code&gt;JobGraph&lt;/code&gt; by &lt;code&gt;StreamingJobGraphGenerator&lt;/code&gt;. The key optimization here is operator chaining: operators that meet certain conditions are fused into a single &lt;code&gt;JobVertex&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;img class=&quot;center-image-0 center-image-70&quot; src=&quot;./assets/posts/flink/flink-operator-chaining.svg&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Source and Map chain together (same parallelism, forward edge). The &lt;code&gt;keyBy&lt;/code&gt; between Map and Window introduces a hash partitioner, a shuffle boundary, so those two cannot chain. Window and Sink also cannot chain because their parallelism differs (2 vs 1). That gives three JobVertices.&lt;/p&gt;

&lt;p&gt;4 operators → 3 JobVertices. Chaining reduces the number of network exchanges and avoids unnecessary serialization within a chain.&lt;/p&gt;
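&lt;p&gt;The core of the chaining decision can be sketched as a predicate (simplified; the real &lt;code&gt;StreamingJobGraphGenerator&lt;/code&gt; checks more conditions, such as chaining strategy and slot sharing groups): two operators fuse only if the connecting edge is a forward edge and both sides run at the same parallelism.&lt;/p&gt;

```java
// Simplified sketch of the operator-chaining predicate.
public class ChainingSketch {
    static boolean canChain(String partitioner, int upstreamParallelism, int downstreamParallelism) {
        // Chain only across forward edges with matching parallelism.
        return partitioner.equals("FORWARD") && upstreamParallelism == downstreamParallelism;
    }

    public static void main(String[] args) {
        System.out.println(canChain("FORWARD", 2, 2)); // Source -> Map: chains
        System.out.println(canChain("HASH", 2, 2));    // Map -> Window: shuffle boundary
        System.out.println(canChain("FORWARD", 2, 1)); // Window -> Sink: parallelism differs
    }
}
```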

&lt;details class=&quot;text-container&quot;&gt;&lt;summary class=&quot;p&quot;&gt; &amp;nbsp;Relevant Packages and Classes&lt;/summary&gt;
&lt;p&gt;In &lt;code&gt;streaming/api/graph/&lt;/code&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;StreamingJobGraphGenerator 
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In &lt;code&gt;runtime/jobgraph/&lt;/code&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;JobGraph, 
JobVertex, 
JobEdge
&lt;/code&gt;&lt;/pre&gt;
&lt;/details&gt;

&lt;h3&gt;2.4. Physical Topology&lt;/h3&gt;

&lt;p&gt;The physical topology describes how the job actually runs: in parallel, distributed across machines.&lt;/p&gt;

&lt;p&gt;Each operator runs at some parallelism, the number of parallel instances (subtasks) that execute it. At parallelism N, the operator&apos;s data stream is divided into N stream partitions.&lt;/p&gt;

&lt;p&gt;Using the same example:&lt;/p&gt;
&lt;p&gt;&lt;img class=&quot;center-image-0 center-image-90&quot; src=&quot;./assets/posts/flink/flink-physical-topology.svg&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Each subtask produces a stream partition, an independent slice of the data.
Between operators, data either flows forward or gets redistributed:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Forward: data stays local, 1:1 from upstream partition to downstream partition. No serialization, no network. [Source → Map] uses this because both run at the same parallelism and no repartitioning is needed.&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;Redistribution (shuffle): data crosses the network. Every upstream partition can send to every downstream partition. Records get serialized, sent over TCP, deserialized. &lt;code&gt;keyBy&lt;/code&gt; triggers this, records are hashed by key so that all records for a given key land on the same downstream subtask. [Map → Window] in the diagram above is a hash shuffle.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Where these shuffle boundaries land is one of the most important performance factors in a Flink job. Forward connections are cheap. Shuffles are expensive.&lt;/p&gt;
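&lt;p&gt;The routing behind &lt;code&gt;keyBy&lt;/code&gt; can be sketched in two steps (the real &lt;code&gt;KeyGroupStreamPartitioner&lt;/code&gt; applies a murmur hash; plain &lt;code&gt;hashCode&lt;/code&gt; is used here purely for illustration): the key is hashed into one of &lt;code&gt;maxParallelism&lt;/code&gt; key groups, and each key group maps to exactly one downstream subtask.&lt;/p&gt;

```java
// Sketch of keyed routing: key -> key group -> target subtask.
public class KeyRoutingSketch {
    static int keyGroup(Object key, int maxParallelism) {
        // Flink murmur-hashes key.hashCode(); a bare hashCode suffices here.
        return Math.floorMod(key.hashCode(), maxParallelism);
    }

    static int targetSubtask(Object key, int maxParallelism, int parallelism) {
        // Same scaling Flink uses: keyGroup * parallelism / maxParallelism.
        return keyGroup(key, maxParallelism) * parallelism / maxParallelism;
    }
}
```

&lt;p&gt;Because the mapping depends only on the key and the (fixed) maximum parallelism, every record for a given key deterministically lands on the same downstream subtask.&lt;/p&gt;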

&lt;details class=&quot;text-container&quot;&gt;&lt;summary class=&quot;p&quot;&gt; &amp;nbsp;Relevant Packages and Classes&lt;/summary&gt;
&lt;p&gt;In &lt;code&gt;streaming/runtime/partitioner/&lt;/code&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;StreamPartitioner, 
ForwardPartitioner, 
KeyGroupStreamPartitioner, 
RebalancePartitioner, 
BroadcastPartitioner
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In &lt;code&gt;streaming/api/graph/&lt;/code&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;StreamEdge
&lt;/code&gt;&lt;/pre&gt;
&lt;/details&gt;

&lt;h3&gt;2.5. Execution Plan&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;JobGraph&lt;/code&gt; is submitted to the &lt;code&gt;JobManager&lt;/code&gt;. The &lt;code&gt;JobMaster&lt;/code&gt; takes each &lt;code&gt;JobVertex&lt;/code&gt; and expands it by parallelism to produce the &lt;code&gt;ExecutionGraph&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;img class=&quot;center-image-0 center-image-70&quot; src=&quot;./assets/posts/flink/flink-execution-graph.svg&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Each &lt;code&gt;JobVertex&lt;/code&gt; becomes an &lt;code&gt;ExecutionJobVertex&lt;/code&gt;. Each parallel instance becomes an &lt;code&gt;ExecutionVertex&lt;/code&gt;. Each &lt;code&gt;ExecutionVertex&lt;/code&gt; tracks its current Execution attempt. If a subtask fails and needs to restart, a new Execution is created for the same &lt;code&gt;ExecutionVertex&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;ExecutionGraph&lt;/code&gt; is the structure the &lt;code&gt;JobMaster&lt;/code&gt; uses for scheduling, tracking task state, coordinating checkpoints, and handling failures.&lt;/p&gt;

&lt;p&gt;Each &lt;code&gt;ExecutionVertex&lt;/code&gt; is deployed to a &lt;code&gt;TaskManager&lt;/code&gt; as a Task. A Task is the actual runtime entity: a dedicated thread that runs the &lt;code&gt;OperatorChain&lt;/code&gt;, reads from InputGates, processes records through the chained operators, and writes to &lt;code&gt;ResultPartition&lt;/code&gt;(s).&lt;/p&gt;
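&lt;p&gt;The JobGraph-to-ExecutionGraph expansion is essentially a fan-out by parallelism, sketched here with illustrative names (not Flink&apos;s actual classes):&lt;/p&gt;

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: each JobVertex expands into one ExecutionVertex per subtask.
public class ExecutionGraphSketch {
    static List<String> expand(String jobVertexName, int parallelism) {
        List<String> subtasks = new ArrayList<>();
        for (int i = 0; i < parallelism; i++) {
            // Subtasks are conventionally labeled (i/parallelism).
            subtasks.add(jobVertexName + " (" + (i + 1) + "/" + parallelism + ")");
        }
        return subtasks;
    }

    public static void main(String[] args) {
        // The three JobVertices of the example job.
        System.out.println(expand("Source -> Map", 2));
        System.out.println(expand("Window", 2));
        System.out.println(expand("Sink", 1));
    }
}
```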

&lt;details class=&quot;text-container&quot;&gt;&lt;summary class=&quot;p&quot;&gt; &amp;nbsp;Relevant Packages and Classes&lt;/summary&gt;
&lt;p&gt;In &lt;code&gt;runtime/executiongraph/&lt;/code&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;DefaultExecutionGraph, 
ExecutionJobVertex, 
ExecutionVertex, 
Execution
&lt;/code&gt;&lt;/pre&gt;
&lt;/details&gt;

&lt;h3&gt;3. State&lt;/h3&gt;

&lt;p&gt;Operators can be stateless or stateful. In the example above, the &lt;code&gt;map&lt;/code&gt; transforms each record and holds no state. The window operation, on the other hand, collects records until a trigger fires, and therefore uses state.&lt;/p&gt;

&lt;p&gt;Flink state is fault tolerant (through checkpoints) and rescalable (by redistributing it when parallelism changes). Without this, every operator would have to manage its own storage and recovery.&lt;/p&gt;

&lt;h3&gt;3.1. State Backend&lt;/h3&gt;

&lt;p&gt;Going back to the example, &lt;code&gt;keyBy(...).window(TumblingEventTimeWindows.of(Time.seconds(10)))&lt;/code&gt;, the window operator collecting events for 10 seconds needs to store that data somewhere until the window fires. Each parallel subtask of a stateful operator maintains its own local state storage. This storage is embedded within the TaskManager process, so state access is fast and does not require any network calls.&lt;/p&gt;

&lt;p&gt;&lt;img class=&quot;center-image-0 center-image-60&quot; src=&quot;./assets/posts/flink/flink-window-state.svg&quot; /&gt;&lt;/p&gt;

&lt;p&gt;The storage engine behind this is called the State Backend. Flink provides two production-ready options:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;HashMapStateBackend&lt;/code&gt;: State lives as Java objects on the JVM heap. Fast access since there is no serialization overhead, but limited by available memory.&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;&lt;code&gt;EmbeddedRocksDBStateBackend&lt;/code&gt;: State is serialized and stored in an embedded RocksDB instance on local disk. Slower per access (every read/write goes through serialization), but can hold state much larger than memory, bounded only by disk space.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The tradeoff is speed vs. capacity. For small to moderate state, heap is faster. For large state (GBs to TBs), RocksDB is the only viable option.&lt;/p&gt;

&lt;p&gt;Because each subtask has its own local state backend instance, state scales naturally with parallelism. Two parallel subtasks of the window operator means two independent state stores, each holding only the data for its own subset of keys.&lt;/p&gt;

&lt;p&gt;There is also a third option (gaining popularity), &lt;code&gt;ForStStateBackend&lt;/code&gt;, built on &lt;code&gt;ForSt&lt;/code&gt; (a fork of RocksDB). It stores SST files on remote storage (S3, HDFS) rather than local disk, using local disk only as a cache, allowing state to exceed local disk capacity entirely. It is designed for disaggregated, cloud-native setups and supports asynchronous state access.&lt;/p&gt;

&lt;p&gt;Note: &lt;code&gt;ForStStateBackend&lt;/code&gt; does not support canonical savepoints, full snapshots, changelog, or file-merging checkpoints.&lt;/p&gt;

&lt;details class=&quot;text-container&quot;&gt;&lt;summary class=&quot;p&quot;&gt; &amp;nbsp;Relevant Packages and Classes&lt;/summary&gt;
&lt;p&gt;In &lt;code&gt;flink-runtime/&lt;/code&gt;, &lt;code&gt;flink-statebackend-rocksdb/&lt;/code&gt;, &lt;code&gt;flink-statebackend-forst/&lt;/code&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;StateBackend,
HashMapStateBackend,
EmbeddedRocksDBStateBackend,
ForStStateBackend
&lt;/code&gt;&lt;/pre&gt;
&lt;/details&gt;

&lt;h3&gt;3.2. State Primitives&lt;/h3&gt;

&lt;p&gt;The state backends described above are the storage engines. What gets stored in them broadly falls into two categories.&lt;/p&gt;

&lt;h3&gt;3.2.1. Keyed State&lt;/h3&gt;
&lt;p&gt;Keyed State is partitioned by key. In the example job, the &lt;code&gt;keyBy(...)&lt;/code&gt; before the window means each window subtask only processes events for its assigned keys. The window operator internally uses keyed state to buffer incoming events until the window fires. In the example job, that buffer is a &lt;code&gt;ListState&lt;/code&gt; scoped to each key, stored in whichever state backend is configured.&lt;/p&gt;

&lt;p&gt;&lt;img class=&quot;center-image-0 center-image-90&quot; src=&quot;./assets/posts/flink/flink-state.svg&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Beyond the internal use by windows, Flink exposes keyed state primitives for custom operators:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;ValueState&amp;lt;T&amp;gt;&lt;/code&gt;: A single value per key.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;ListState&amp;lt;T&amp;gt;&lt;/code&gt;: A list of values per key.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;MapState&amp;lt;K, V&amp;gt;&lt;/code&gt;: A key-value map per key.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;ReducingState&amp;lt;T&amp;gt;&lt;/code&gt; / &lt;code&gt;AggregatingState&amp;lt;IN, OUT&amp;gt;&lt;/code&gt;: Applies a reduce or aggregate on each addition, storing only the accumulated result.&lt;/li&gt;
&lt;/ul&gt;
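&lt;p&gt;What &quot;scoped to each key&quot; means can be modeled with a plain map (stdlib only; this is an illustrative model, not Flink&apos;s state API): the runtime sets the current key before each record, and every state read or write is silently namespaced by it, so user code never passes the key explicitly.&lt;/p&gt;

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative model of ValueState-style keyed scoping.
public class KeyedStateSketch<K, V> {
    private final Map<K, V> backing = new HashMap<>(); // per-key storage
    private K currentKey;

    // In Flink the runtime sets this before invoking the operator on a record.
    void setCurrentKey(K key) { currentKey = key; }

    // ValueState-like accessors: no key argument, the scope is implicit.
    V value() { return backing.get(currentKey); }
    void update(V v) { backing.put(currentKey, v); }
}
```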

&lt;h3&gt;3.2.2. Operator State&lt;/h3&gt;
&lt;p&gt;Operator State is per subtask, not tied to keys. Each parallel instance holds its own independent state. The typical use case is a source connector tracking partition assignments and offsets.&lt;/p&gt;

&lt;p&gt;Both categories are managed by Flink: included in checkpoints, restored on failure, redistributed on rescale. Keyed state is redistributed through Key Groups, the atomic unit of state redistribution. The total number of Key Groups is fixed at the configured maximum parallelism. Each subtask is assigned a range of Key Groups, and when parallelism changes, those ranges are simply reassigned across the new set of subtasks.&lt;/p&gt;

&lt;h3&gt;3.3. Snapshots and Checkpointing&lt;/h3&gt;

&lt;p&gt;State stored locally in each subtask solves the access problem, but not the durability problem. If a &lt;code&gt;TaskManager&lt;/code&gt; crashes, that local state is gone. Flink needs a way to periodically capture a consistent snapshot of the entire job&apos;s state so it can recover from failures.&lt;/p&gt;

&lt;p&gt;This mechanism is called checkpointing, and it is based on the &lt;code&gt;Chandy-Lamport&lt;/code&gt; algorithm for distributed snapshots, adapted for Flink&apos;s dataflow model.&lt;/p&gt;

&lt;h3&gt;3.3.1. Checkpoint Barriers&lt;/h3&gt;

&lt;p&gt;The process works as follows:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;The &lt;code&gt;CheckpointCoordinator&lt;/code&gt; (running inside the &lt;code&gt;JobManager&lt;/code&gt;) periodically initiates a checkpoint by sending a trigger to all source operators.&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;Each source records its current position (e.g., Kafka partition offsets) and injects a special marker called a checkpoint barrier into the data stream. The barrier is not a separate signal; it flows with the records, in order, through the DAG.&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;When an operator receives a barrier, it snapshots its local state and forwards the barrier downstream. The state snapshot is written to durable storage (typically a distributed file system like HDFS or S3).&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;When all sinks have received the barrier and acknowledged it back to the &lt;code&gt;CheckpointCoordinator&lt;/code&gt;, the checkpoint is considered complete.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
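&lt;p&gt;The essential property, that a snapshot reflects exactly the records before the barrier, can be shown with a toy single-channel simulation (the string &quot;BARRIER&quot; stands in for Flink&apos;s &lt;code&gt;CheckpointBarrier&lt;/code&gt; flowing inline with the records):&lt;/p&gt;

```java
import java.util.List;

// Toy simulation: an operator counts records and snapshots on the barrier.
public class BarrierSketch {
    static int snapshotAtBarrier(List<String> stream) {
        int state = 0;
        for (String element : stream) {
            if (element.equals("BARRIER")) {
                // Snapshot reflects exactly the pre-barrier records;
                // everything after the barrier belongs to the next epoch.
                return state;
            }
            state++; // process a record
        }
        return state;
    }
}
```

&lt;p&gt;Because the barrier travels in order with the data, the snapshot boundary is unambiguous: no pre-barrier record is missed and no post-barrier record leaks in.&lt;/p&gt;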

&lt;div class=&quot;slider&quot; id=&quot;slider2&quot;&gt;
  &lt;div class=&quot;slides center-image-0 center-image-70&quot;&gt;
    &lt;img src=&quot;./assets/posts/flink/flink-checkpoint-1.svg&quot; class=&quot;slide&quot; /&gt;
    &lt;img src=&quot;./assets/posts/flink/flink-checkpoint-2.svg&quot; class=&quot;slide&quot; /&gt;
    &lt;img src=&quot;./assets/posts/flink/flink-checkpoint-3.svg&quot; class=&quot;slide&quot; /&gt;
    &lt;img src=&quot;./assets/posts/flink/flink-checkpoint-4.svg&quot; class=&quot;slide&quot; /&gt;
    &lt;img src=&quot;./assets/posts/flink/flink-checkpoint-5.svg&quot; class=&quot;slide&quot; /&gt;
  &lt;/div&gt;
  &lt;div class=&quot;controls&quot;&gt;
    &lt;button onclick=&quot;plusSlides(-1, &apos;slider2&apos;)&quot; class=&quot;prev black-button&quot;&gt;Prev&lt;/button&gt;
    &lt;button onclick=&quot;playSlides(&apos;slider2&apos;)&quot; class=&quot;play black-button&quot;&gt;Play&lt;/button&gt;
    &lt;button onclick=&quot;stopSlides(&apos;slider2&apos;)&quot; class=&quot;stop black-button&quot; hidden=&quot;&quot;&gt;Stop&lt;/button&gt;
    &lt;button onclick=&quot;plusSlides(1, &apos;slider2&apos;)&quot; class=&quot;next black-button&quot;&gt;Next&lt;/button&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;p&gt;The result is a consistent global snapshot: source offsets plus the state of every operator, all corresponding to the same logical point in the stream. No records are lost, no records are counted twice.&lt;/p&gt;

&lt;p&gt;A key detail: barriers never overtake records. They flow strictly in line. This is what ensures the snapshot captures exactly the state that results from processing all records before the barrier and none of the records after it.&lt;/p&gt;

&lt;details class=&quot;text-container&quot;&gt;&lt;summary class=&quot;p&quot;&gt; &amp;nbsp;Relevant Packages and Classes&lt;/summary&gt;
&lt;p&gt;In &lt;code&gt;runtime/checkpoint/&lt;/code&gt;, &lt;code&gt;runtime/io/network/api/&lt;/code&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;CheckpointCoordinator,
CheckpointBarrier
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;In &lt;code&gt;streaming/api/checkpoint/&lt;/code&gt;, &lt;code&gt;streaming/runtime/tasks/&lt;/code&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;CheckpointedFunction,
SubtaskCheckpointCoordinator
&lt;/code&gt;&lt;/pre&gt;
&lt;/details&gt;

&lt;h3&gt;3.3.2. Aligned Checkpoint&lt;/h3&gt;

&lt;p&gt;For operators with multiple inputs (like after a shuffle), the barrier must arrive from all input channels before the snapshot is taken. This is called barrier alignment, and it ensures that no pre-checkpoint and post-checkpoint data gets mixed. This alignment can briefly pause processing on the faster channels, which is a tradeoff explored further in unaligned checkpoints.&lt;/p&gt;

&lt;p&gt;&lt;img class=&quot;center-image-0 center-image-75&quot; src=&quot;./assets/posts/flink/flink-window-barrier.svg&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Aligned checkpointing (the default) guarantees a clean cut: the snapshot contains exactly the state resulting from all records before the barrier and none after. The cost is that the pausing can cause backpressure. If one channel is significantly faster than another, the fast channel&apos;s data backs up, stalling upstream operators.&lt;/p&gt;
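&lt;p&gt;A toy two-input simulation of alignment (again with &quot;BARRIER&quot; as a stand-in marker; real alignment buffers the blocked channel&apos;s records rather than discarding them): once a channel delivers its barrier, its remaining records are held back, and the snapshot is taken only after the other channel&apos;s barrier arrives.&lt;/p&gt;

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of barrier alignment on a two-input operator.
public class AlignmentSketch {
    // Returns the records processed before the snapshot is taken.
    static List<String> processedBeforeSnapshot(List<String> channel1, List<String> channel2) {
        List<String> processed = new ArrayList<>();
        consumeUntilBarrier(channel1, processed);
        consumeUntilBarrier(channel2, processed);
        return processed; // snapshot happens once both barriers have arrived
    }

    private static void consumeUntilBarrier(List<String> channel, List<String> processed) {
        for (String e : channel) {
            if (e.equals("BARRIER")) return; // block this channel; buffer the rest
            processed.add(e);
        }
    }
}
```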

&lt;details class=&quot;text-container&quot;&gt;&lt;summary class=&quot;p&quot;&gt; &amp;nbsp;Relevant Packages and Classes&lt;/summary&gt;
&lt;p&gt;In &lt;code&gt;streaming/runtime/io/checkpointing/&lt;/code&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;SingleCheckpointBarrierHandler,
AbstractAlignedBarrierHandlerState,
AlternatingCollectingBarriersUnaligned
&lt;/code&gt;&lt;/pre&gt;
&lt;/details&gt;

&lt;h3&gt;3.3.3. Unaligned Checkpoint&lt;/h3&gt;

&lt;p&gt;Instead of pausing, the operator reacts to the first barrier it sees from any channel. It immediately forwards the barrier downstream and continues processing all channels. The records that are already in the input/output buffers (in-flight data between the two barriers) are stored as part of the checkpoint state.&lt;/p&gt;

&lt;p&gt;&lt;img class=&quot;center-image-0 center-image-80&quot; src=&quot;./assets/posts/flink/flink-unaligned-checkpoint.svg&quot; /&gt;&lt;/p&gt;

&lt;p&gt;The result: checkpoint duration becomes independent of throughput and alignment time. Barriers travel through the DAG as fast as possible. The tradeoff is larger checkpoint sizes (in-flight data is included) and more I/O.&lt;/p&gt;

&lt;p&gt;Note, Unaligned checkpoints:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;require exactly-once mode, and only one concurrent checkpoint is allowed in unaligned mode, so checkpoints can take slightly longer end to end.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;break an implicit guarantee with respect to watermarks during recovery. On recovery, Flink generates watermarks after it restores in-flight data, which means pipelines that apply the latest watermark on each record may produce different results than with aligned checkpoints.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Flink also supports a hybrid approach. Checkpoints start aligned, but if alignment takes longer than a configured timeout (&lt;code&gt;execution.checkpointing.aligned-checkpoint-timeout&lt;/code&gt;), the operator switches to unaligned mid-checkpoint. This gets the benefits of aligned checkpoints under normal conditions while avoiding the stalling problem under backpressure.&lt;/p&gt;

&lt;h3&gt;3.3.4. Incremental Checkpoints&lt;/h3&gt;

&lt;p&gt;Full checkpoints upload the entire state every time. For an operator holding 10 GB of state where only 200 MB changed, uploading the full 10 GB is wasteful.&lt;/p&gt;

&lt;p&gt;Incremental checkpoints exploit how RocksDB stores data. Writes go into an in-memory MemTable. When full, it flushes to disk as an immutable SST file (Sorted String Table). A background compaction process merges smaller SST files into larger ones, discarding duplicates. The key property: SST files are never modified after creation, only created (by flush) or deleted (by compaction).&lt;/p&gt;

&lt;p&gt;Going back to the example job, the Window operator [2] buffers events in RocksDB until the 10-second window fires. With incremental checkpoints enabled and 2 retained checkpoints:&lt;/p&gt;

&lt;div style=&quot;width: 100%; overflow-x: auto;&quot;&gt;
&lt;img class=&quot;center-image-0 center-image-100&quot; style=&quot;width: 135%; max-width: none; display: block;&quot; src=&quot;./assets/posts/flink/flink-incremental-checkpoint.svg&quot; /&gt;
&lt;/div&gt;

&lt;p&gt;Flink tracks which SST files are new or deleted between checkpoints and only uploads the delta.&lt;/p&gt;

&lt;p&gt;The shared state registry tracks how many active checkpoints reference each file. When a checkpoint is pruned (retained count exceeded), Flink decrements the reference counts. Files that drop to 0 are deleted from storage.&lt;/p&gt;

&lt;p&gt;The result: instead of uploading the full state each time, only new SST files are uploaded. The tradeoff is that recovery may need to reconstruct state from multiple incremental deltas, potentially making restores slower than with full checkpoints.&lt;/p&gt;
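&lt;p&gt;The reference-counting logic can be sketched as follows (a simplified model of the shared state registry, not its actual implementation): each retained checkpoint registers the SST files it references, pruning a checkpoint decrements the counts, and files whose count reaches zero are safe to delete from checkpoint storage.&lt;/p&gt;

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of shared-state reference counting for incremental checkpoints.
public class SharedStateRegistrySketch {
    private final Map<String, Integer> refCounts = new HashMap<>();

    // A completed checkpoint registers every SST file it references.
    void register(List<String> sstFiles) {
        for (String f : sstFiles) refCounts.merge(f, 1, Integer::sum);
    }

    // Pruning a checkpoint decrements counts; returns files safe to delete.
    List<String> prune(List<String> sstFiles) {
        List<String> deletable = new ArrayList<>();
        for (String f : sstFiles) {
            int remaining = refCounts.merge(f, -1, Integer::sum);
            if (remaining == 0) {
                refCounts.remove(f);
                deletable.add(f); // no live checkpoint references this file
            }
        }
        return deletable;
    }
}
```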

&lt;h3&gt;3.3.5. Savepoint&lt;/h3&gt;

&lt;p&gt;Savepoints use the same mechanism as checkpoints (barriers, state snapshots, source offsets) but are triggered manually by the user, not by the periodic scheduler.&lt;/p&gt;

&lt;p&gt;The key differences:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Always aligned: unaligned mode does not apply to savepoints.&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;Do not expire: checkpoints are automatically cleaned up when newer ones complete; savepoints persist until explicitly deleted.&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;Triggered on demand: via the CLI (&lt;code&gt;flink savepoint &amp;lt;jobId&amp;gt;&lt;/code&gt;) or the REST API, not on a timer.&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;Portable format: savepoints can be created in canonical format, a standardized representation that is compatible across state backends. A job checkpointed with &lt;code&gt;HashMapStateBackend&lt;/code&gt; can be restored on &lt;code&gt;EmbeddedRocksDBStateBackend&lt;/code&gt; from a canonical savepoint. Native format (default and preferred) is faster to create and restore but is tied to the specific state backend and does not support cross-backend restoration.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Savepoints are used for planned operations: upgrading application code, changing parallelism, migrating to a different cluster, or switching state backends. The workflow is: take a savepoint, stop the job, make changes, restart from the savepoint.&lt;/p&gt;

&lt;p&gt;In the example job, if the parallelism of the Window operator needs to change from 2 to 4, a savepoint captures the current state (including Key Group assignments). On restart with the new parallelism, Flink redistributes the Key Groups across the 4 new subtasks and restores the state accordingly.&lt;/p&gt;

&lt;h3&gt;3.4. Recovery&lt;/h3&gt;

&lt;p&gt;When a failure occurs (TaskManager crash, network fault, user code exception, etc.), Flink stops the affected pipeline region (which, for a single-region streaming job like this example, means the entire job) and rolls back to the latest completed checkpoint.&lt;/p&gt;

&lt;p&gt;The recovery process:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;The &lt;code&gt;JobManager&lt;/code&gt; selects the most recent successfully completed checkpoint (all sinks acknowledged, all state stored durably).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;All operators are redeployed across available &lt;code&gt;TaskManagers&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Each operator&apos;s state is restored from the checkpoint storage (Remote File System, S3/HDFS). The window operator gets back its buffered events, aggregation operators get back their partial results.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Source operators rewind to the offsets recorded in the checkpoint. For Kafka, this means resetting the consumer to the checkpointed partition offsets.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Processing resumes from that point. Every record after the checkpoint offset is reprocessed, but since the state has been rolled back to match, the end result is as if the failure never happened.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;img class=&quot;center-image-0 center-image-75&quot; src=&quot;./assets/posts/flink/flink-state-restore.svg&quot; /&gt;&lt;/p&gt;

&lt;p&gt;This is what gives Flink &lt;code&gt;exactly-once&lt;/code&gt; processing semantics. Records between the checkpoint and the failure are reprocessed, but the state they are applied to has been rolled back to before those records were processed the first time. No double counting.&lt;/p&gt;

&lt;p&gt;However, the source must support replay (rewinding to a previous position). Sources like Kafka, Kinesis, and filesystems all support replay. If a source cannot rewind, exactly-once guarantees cannot be met.&lt;/p&gt;

&lt;h3&gt;4. Time&lt;/h3&gt;
&lt;p&gt;There are three notions of time:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Event Time (most common): The timestamp embedded in the event itself, representing when the event actually occurred. A sensor reading generated at &lt;code&gt;14:00:03&lt;/code&gt; carries that timestamp regardless of when Flink processes it.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Processing Time: The wall clock of the machine running the operator at the moment it processes the event. Simple and fast, but non-deterministic. The same data replayed at a different speed produces different results.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Ingestion Time (least common/discouraged): The timestamp assigned when the event enters Flink. More stable than processing time, but still does not reflect actual event occurrence.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;img class=&quot;center-image-0 center-image-90&quot; src=&quot;./assets/posts/flink/flink-times.svg&quot; /&gt;&lt;/p&gt;

&lt;p&gt;In the example job, &lt;code&gt;TumblingEventTimeWindows.of(Time.seconds(10))&lt;/code&gt; uses event time. The window boundaries are determined by the timestamps in the data, not by when the records happen to arrive. This makes the results deterministic and reproducible.&lt;/p&gt;
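&lt;p&gt;This determinism falls out of a simple fact: the tumbling window a record belongs to is a pure function of its event timestamp. The computation mirrors Flink&apos;s &lt;code&gt;TimeWindow.getWindowStartWithOffset&lt;/code&gt;, shown here with a zero offset for simplicity:&lt;/p&gt;

```java
// Window assignment for tumbling event-time windows (zero offset).
public class TumblingWindowSketch {
    static long windowStart(long timestampMillis, long windowSizeMillis) {
        // A record at t=14:00:03 lands in [14:00:00, 14:00:10) for a
        // 10-second window, regardless of when it arrives at Flink.
        return timestampMillis - Math.floorMod(timestampMillis, windowSizeMillis);
    }
}
```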

&lt;h3&gt;4.1. Disorder Problem&lt;/h3&gt;

&lt;p&gt;Processing time is always monotonically increasing, the wall clock only moves forward. Event time has no such guarantee. In distributed systems, events produced in order can arrive at Flink out of order due to network delays, partitioning, or upstream buffering.&lt;/p&gt;

&lt;p&gt;&lt;img class=&quot;center-image-0 center-image-80&quot; src=&quot;./assets/posts/flink/flink-disorder.svg&quot; /&gt;&lt;/p&gt;

&lt;p&gt;A window covering &lt;code&gt;t=1&lt;/code&gt; to &lt;code&gt;t=5&lt;/code&gt; cannot simply close when it sees &lt;code&gt;t=6&lt;/code&gt;, because &lt;code&gt;t=4&lt;/code&gt; or &lt;code&gt;t=5&lt;/code&gt; might still be in transit. The system needs a way to know when it is safe to fire the window.&lt;/p&gt;

&lt;h3&gt;4.2. Watermarks&lt;/h3&gt;

&lt;p&gt;Watermarks are Flink&apos;s solution to the disorder problem. A watermark is a special marker that flows through the data stream carrying a timestamp &lt;code&gt;t&lt;/code&gt;. It declares: &quot;no more events with a &lt;code&gt;timestamp ≤ t&lt;/code&gt; will arrive.&quot;&lt;/p&gt;

&lt;p&gt;&lt;img class=&quot;center-image-0 center-image-80&quot; src=&quot;./assets/posts/flink/flink-watermarks.svg&quot; /&gt;&lt;/p&gt;

&lt;p&gt;When the window operator receives a watermark that passes the window&apos;s end time, it knows the window is complete and fires it. Until that watermark arrives, the window holds its state.&lt;/p&gt;

&lt;p&gt;Watermarks flow inline with the data, just like checkpoint barriers. At operators with multiple inputs (after a shuffle), the effective watermark is the minimum across all input channels. The stream can only be as far along in event time as its slowest input.&lt;/p&gt;

&lt;p&gt;The gap between the actual event time and the watermark is called the bounded out-of-orderness. A larger gap tolerates more disorder but increases latency (windows fire later) and state lifetime (buffered data is held longer).&lt;/p&gt;
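&lt;p&gt;The mechanics above can be sketched in plain Java. This is a simplified model of bounded out-of-orderness watermarking, not Flink&apos;s actual classes; the 5-second bound and the sample timestamps are made up for illustration:&lt;/p&gt;

```java
// Simplified model of bounded out-of-orderness watermarking; the bound
// of 5 seconds and the sample timestamps are made up for illustration.
// The real logic lives in BoundedOutOfOrdernessWatermarks (flink-core).
public class WatermarkSketch {
    static final long BOUND_MS = 5_000; // tolerated disorder
    static long maxTs = Long.MIN_VALUE;

    // Called for every event: track the highest timestamp seen so far.
    static void onEvent(long eventTs) {
        maxTs = Math.max(maxTs, eventTs);
    }

    // Periodically emitted watermark: a promise that no event with a
    // timestamp at or below this value will arrive anymore.
    static long currentWatermark() {
        return maxTs - BOUND_MS - 1;
    }

    // At a multi-input operator the effective watermark is the minimum
    // across all input channels: the slowest input holds everyone back.
    static long combine(long[] channelWatermarks) {
        long min = Long.MAX_VALUE;
        for (long w : channelWatermarks) {
            min = Math.min(min, w);
        }
        return min;
    }

    public static void main(String[] args) {
        onEvent(10_000);
        onEvent(8_000); // late but within the bound
        onEvent(12_000);
        System.out.println(currentWatermark());                // 6999
        System.out.println(combine(new long[]{6_999, 4_200})); // 4200
    }
}
```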

&lt;h3&gt;4.3. Timers&lt;/h3&gt;

&lt;p&gt;Operators can register timers for a future point in event time or processing time. When the watermark (for event time) or the wall clock (for processing time) reaches the registered timestamp, the timer fires and triggers a callback.&lt;/p&gt;

&lt;p&gt;Windows use timers internally. When a new window is created, the window operator registers an event time timer for the window&apos;s end time. When the watermark passes that time, the timer fires and the window emits its result.&lt;/p&gt;

&lt;p&gt;Custom operators using &lt;code&gt;ProcessFunction&lt;/code&gt; can register their own timers for use cases like session timeouts, delayed cleanup of expired state, or triggering periodic aggregations.&lt;/p&gt;
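&lt;p&gt;The timer behavior can be modeled in a few lines. This is a minimal sketch loosely inspired by &lt;code&gt;InternalTimerService&lt;/code&gt;, not the real implementation, and the timestamps are invented:&lt;/p&gt;

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

// Minimal sketch of event-time timers, loosely modeled on Flink's
// InternalTimerService: a timer fires once the watermark reaches it.
// Raw collection types are used only to keep the sketch short.
public class TimerSketch {
    private final PriorityQueue timers = new PriorityQueue();

    void registerEventTimeTimer(long ts) {
        timers.add(ts);
    }

    // Called when a new watermark arrives; returns the timers that fire.
    List advanceWatermark(long watermark) {
        List fired = new ArrayList();
        while (true) {
            Long head = (Long) timers.peek();
            if (head == null) break;
            if (head > watermark) break; // not yet due
            fired.add(timers.poll());
        }
        return fired;
    }

    public static void main(String[] args) {
        TimerSketch svc = new TimerSketch();
        svc.registerEventTimeTimer(10_000); // e.g. a window ending at t=10s
        svc.registerEventTimeTimer(20_000);
        System.out.println(svc.advanceWatermark(9_999));  // []
        System.out.println(svc.advanceWatermark(10_000)); // [10000]
    }
}
```

&lt;p&gt;This is exactly why a window fires only after the watermark passes its end time: the window&apos;s end-time timer is simply not due until then.&lt;/p&gt;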

&lt;details class=&quot;text-container&quot;&gt;&lt;summary class=&quot;p&quot;&gt; &amp;nbsp;Relevant Packages and Classes&lt;/summary&gt;
&lt;p&gt;In &lt;code&gt;flink-core/api/common/eventtime/&lt;/code&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;WatermarkStrategy,
WatermarkGenerator
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;In &lt;code&gt;streaming/runtime/operators/&lt;/code&gt;, &lt;code&gt;streaming/api/operators/&lt;/code&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;TimestampsAndWatermarksOperator,
InternalTimerService
&lt;/code&gt;&lt;/pre&gt;
&lt;/details&gt;

&lt;h3&gt;5. Runtime&lt;/h3&gt;

&lt;p&gt;A running Flink cluster consists of two types of JVM processes: one &lt;code&gt;JobManager&lt;/code&gt; and one or more &lt;code&gt;TaskManagers&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;img class=&quot;center-image-0 center-image-45&quot; src=&quot;./assets/posts/flink/flink-runtime.svg&quot; /&gt;&lt;/p&gt;

&lt;h3&gt;5.1. Job Manager&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;JobManager&lt;/code&gt; is the control plane. It contains three RPC endpoints running in the same JVM: the Dispatcher, the ResourceManager, and one JobMaster per job. TaskManagers are the data plane: worker processes that execute tasks. Communication between them splits into two layers: &lt;code&gt;Pekko&lt;/code&gt; (formerly Akka) for control messages (scheduling, heartbeats, checkpoint triggers) and &lt;code&gt;Netty&lt;/code&gt; for actual data exchange between tasks.&lt;/p&gt;

&lt;h3&gt;5.1.1. Dispatcher&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;Dispatcher&lt;/code&gt; is the entry point for the cluster. It exposes the REST API, receives job submissions, and serves the Flink Web UI.&lt;/p&gt;

&lt;p&gt;When a job arrives, the Dispatcher persists it durably via the &lt;code&gt;ExecutionPlanWriter&lt;/code&gt;, then creates a &lt;code&gt;JobManagerRunner&lt;/code&gt; which starts a &lt;code&gt;JobMaster&lt;/code&gt; for that job. This persist-before-run design is what makes HA recovery possible: if the &lt;code&gt;JobManager&lt;/code&gt; crashes and a new leader takes over, the new Dispatcher recovers persisted jobs from storage and re-creates their JobMasters.&lt;/p&gt;

&lt;p&gt;In a session cluster, the Dispatcher lives for the lifetime of the cluster and handles multiple jobs. In application mode, it is scoped to a single application.&lt;/p&gt;

&lt;p&gt;The Dispatcher also participates in leader election. A &lt;code&gt;DispatcherLeaderProcess&lt;/code&gt; monitors whether this JobManager is the current leader. On gaining leadership, it reads recovered jobs from the &lt;code&gt;ExecutionPlanStore&lt;/code&gt; and recovered dirty job results from the &lt;code&gt;JobResultStore&lt;/code&gt;, then creates the actual Dispatcher instance with that recovery state.&lt;/p&gt;

&lt;h3&gt;5.1.2. Resource Manager&lt;/h3&gt;

&lt;p&gt;The ResourceManager owns the cluster&apos;s slot inventory. It maintains a registry of all TaskManagers and their slots, and a &lt;code&gt;SlotManager&lt;/code&gt; that matches slot requests from JobMasters against available slots.&lt;/p&gt;

&lt;p&gt;The flow:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;TaskManagers start up and register with the &lt;code&gt;ResourceManager&lt;/code&gt; via RPC, reporting how many slots they offer and each slot&apos;s &lt;code&gt;ResourceProfile&lt;/code&gt; (CPU, memory).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;When a &lt;code&gt;JobMaster&lt;/code&gt; needs slots, it declares resource requirements to the ResourceManager.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The &lt;code&gt;SlotManager&lt;/code&gt; checks if existing free slots can satisfy the request. If yes, it sends a &lt;code&gt;requestSlot&lt;/code&gt; RPC to the TaskManager, telling it to allocate that slot for the specific job.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;If not enough free slots exist and the ResourceManager is backed by an active resource provider (Kubernetes, YARN), it requests new TaskManagers from the provider. In standalone mode, it can only wait for TaskManagers to register on their own.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The &lt;code&gt;ResourceManager&lt;/code&gt; also monitors TaskManager health through heartbeats. If a TaskManager misses heartbeats, the ResourceManager declares it dead, removes its slots from the inventory, and notifies affected JobMasters.
Importantly, the ResourceManager knows nothing about job logic. It deals purely in slots: who has them, who needs them, and how to provision more.&lt;/p&gt;

&lt;p&gt;Slot Allocation Flow:&lt;/p&gt;
&lt;p&gt;&lt;img class=&quot;center-image-0 center-image-90&quot; src=&quot;./assets/posts/flink/flink-allocation-flow.svg&quot; /&gt;&lt;/p&gt;

&lt;h3&gt;5.1.3. Job Master&lt;/h3&gt;

&lt;p&gt;One &lt;code&gt;JobMaster&lt;/code&gt; per running job. This is where the actual job execution is managed. Internally it contains two critical components:&lt;/p&gt;

&lt;h3&gt;5.1.3a. Scheduler&lt;/h3&gt;
&lt;p&gt;The scheduler decides when and where to deploy tasks. There are multiple scheduler implementations, such as:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;DefaultScheduler&lt;/code&gt; with &lt;code&gt;PipelinedRegionSchedulingStrategy&lt;/code&gt; for streaming&lt;/li&gt;
&lt;li&gt;&lt;code&gt;AdaptiveBatchScheduler&lt;/code&gt; for batch workloads&lt;/li&gt;
&lt;li&gt;&lt;code&gt;AdaptiveScheduler&lt;/code&gt; for reactive scaling (adjusts parallelism based on available slots)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The scheduler works with the &lt;code&gt;SlotPool&lt;/code&gt;, which is the JobMaster&apos;s local view of allocated slots. The SlotPool uses a declarative resource model: it declares how many slots of what profile it needs, the &lt;code&gt;ResourceManager&lt;/code&gt; fulfills them, and TaskManagers offer the allocated slots back to the JobMaster. Once slots are available, the scheduler assigns &lt;code&gt;ExecutionVertex&lt;/code&gt; instances to them and triggers deployment.&lt;/p&gt;

&lt;p&gt;&lt;img class=&quot;center-image-0 center-image-90&quot; src=&quot;./assets/posts/flink/flink-jobmaster.svg&quot; /&gt;&lt;/p&gt;

&lt;p&gt;For a pure streaming job like &lt;code&gt;MyJob&lt;/code&gt;, the entire job is one pipelined region. On scheduling start, it finds all source regions and schedules them. Since everything is one region, all tasks launch at once.&lt;/p&gt;

&lt;p&gt;For batch jobs with blocking shuffle boundaries, each stage is a separate region. Source regions are scheduled first. Downstream regions are scheduled only when their upstream blocking partitions become consumable. This saves resources by not starting downstream tasks that have nothing to consume yet.&lt;/p&gt;

&lt;h3&gt;5.1.3b. Checkpoint Coordinator&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;CheckpointCoordinator&lt;/code&gt;: triggers checkpoint barriers, tracks acknowledgements from all tasks, manages completed checkpoint metadata, and decides when to discard old checkpoints. This is the component that drives the entire checkpointing flow described in the earlier State section.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;JobMaster&lt;/code&gt; also handles failure recovery. When a task fails, it consults a &lt;code&gt;FailoverStrategy&lt;/code&gt; (typically &lt;code&gt;RestartPipelinedRegionFailoverStrategy&lt;/code&gt;) to determine which tasks need to be restarted, cancels them, and redeploys from the last checkpoint.&lt;/p&gt;

&lt;h3&gt;5.1.4. Job Lifecycle&lt;/h3&gt;

&lt;p&gt;A job, once accepted by the &lt;code&gt;Dispatcher&lt;/code&gt;, moves through a state machine of &lt;code&gt;JobStatus&lt;/code&gt; values. The typical happy path is straightforward: &lt;code&gt;INITIALIZING → CREATED → RUNNING → FINISHED&lt;/code&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;INITIALIZING&lt;/code&gt;: The Dispatcher has received the job, but the JobMaster has not yet gained leadership or been fully created.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;CREATED&lt;/code&gt;: The JobMaster is ready. No tasks have been scheduled yet.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;RUNNING&lt;/code&gt;: At least some tasks are scheduled or executing. The job stays in this state until all tasks finish.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;FINISHED&lt;/code&gt;: All tasks completed successfully.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When a task fails during execution, the Scheduler evaluates whether the error is recoverable. If it is, the affected tasks are restarted. The job itself stays in &lt;code&gt;RUNNING&lt;/code&gt; while individual tasks are restarted at the region level.&lt;/p&gt;

&lt;p&gt;If the failure is unrecoverable (or restart attempts are exhausted), the job transitions through: &lt;code&gt;RUNNING → FAILING → FAILED&lt;/code&gt;. &lt;code&gt;FAILING&lt;/code&gt; cancels all remaining tasks. Once every task reaches a terminal state, the job moves to &lt;code&gt;FAILED&lt;/code&gt; and exits.&lt;/p&gt;

&lt;p&gt;When a user manually cancels a job (via the Web UI or CLI): &lt;code&gt;RUNNING → CANCELLING → CANCELED&lt;/code&gt;. &lt;code&gt;CANCELLING&lt;/code&gt; cancels all tasks. Once all tasks are in a terminal state, the job enters &lt;code&gt;CANCELED&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Suspension (HA only): &lt;code&gt;RUNNING → SUSPENDED&lt;/code&gt;. &lt;code&gt;SUSPENDED&lt;/code&gt; only occurs when high availability is configured and the JobMaster loses leadership. The job is not removed from the HA store; this particular JobMaster has simply stopped managing it. Another &lt;code&gt;JobMaster&lt;/code&gt; (or the same one after regaining leadership) will pick the job back up and restart it.&lt;/p&gt;
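&lt;p&gt;The transitions above can be sketched as a small state machine. This is a simplified model: the real &lt;code&gt;JobStatus&lt;/code&gt; enum in flink-core defines more states and its own transition rules, and only the paths discussed in this section are covered here:&lt;/p&gt;

```java
// Sketch of the job state machine described above. The real JobStatus
// enum in flink-core defines more states and its own transition rules;
// this covers only the paths discussed in this section.
public class JobStatusSketch {
    enum JobStatus {
        INITIALIZING, CREATED, RUNNING, FINISHED,
        FAILING, FAILED, CANCELLING, CANCELED, SUSPENDED
    }

    static boolean canTransition(JobStatus from, JobStatus to) {
        switch (from) {
            case INITIALIZING:
                return to == JobStatus.CREATED;
            case CREATED:
                return to == JobStatus.RUNNING;
            case RUNNING: // happy path, failure, cancellation, lost leadership
                return to == JobStatus.FINISHED || to == JobStatus.FAILING
                        || to == JobStatus.CANCELLING || to == JobStatus.SUSPENDED;
            case FAILING:
                return to == JobStatus.FAILED;
            case CANCELLING:
                return to == JobStatus.CANCELED;
            default:
                return false; // FINISHED, FAILED, CANCELED are terminal
        }
    }

    public static void main(String[] args) {
        System.out.println(canTransition(JobStatus.RUNNING, JobStatus.FAILING));  // true
        System.out.println(canTransition(JobStatus.FINISHED, JobStatus.RUNNING)); // false
    }
}
```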

&lt;img class=&quot;center-image-0 center-image-70&quot; src=&quot;./assets/posts/flink/flink-job-cycle.svg&quot; /&gt;

&lt;h3&gt;5.2. Task Manager&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;TaskManager&lt;/code&gt; is a JVM process that does the actual data processing. In Flink, this process is called &lt;code&gt;TaskExecutor&lt;/code&gt;. Each cluster has one or more TaskExecutors, and each one registers with the &lt;code&gt;ResourceManager&lt;/code&gt; on startup by sending a &lt;code&gt;SlotReport&lt;/code&gt; listing all available task slots.&lt;/p&gt;

&lt;h3&gt;5.2.1. Task Slots&lt;/h3&gt;

&lt;p&gt;A &lt;code&gt;TaskExecutor&lt;/code&gt; divides its resources into a fixed number of task slots. Each slot is a resource container with its own &lt;code&gt;MemoryManager&lt;/code&gt; and a defined &lt;code&gt;ResourceProfile&lt;/code&gt; (CPU, memory). The number of slots is configured via &lt;code&gt;taskmanager.numberOfTaskSlots&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;A slot has three states: &lt;code&gt;ALLOCATED&lt;/code&gt; (assigned to a job by the ResourceManager, not yet in use by the JobMaster), &lt;code&gt;ACTIVE&lt;/code&gt; (in use, tasks can be added), and &lt;code&gt;RELEASING&lt;/code&gt; (tasks have failed, waiting to be fully emptied before the slot is freed).&lt;/p&gt;

&lt;p&gt;The important detail: a slot can hold multiple tasks. The tasks map inside &lt;code&gt;TaskSlot&lt;/code&gt; is keyed by &lt;code&gt;ExecutionAttemptID&lt;/code&gt;, meaning multiple operator subtasks can share a single slot. This is where slot sharing comes in.&lt;/p&gt;

&lt;h3&gt;5.2.2. Task Slot Sharing&lt;/h3&gt;

&lt;p&gt;By default, Flink places all operators of a job into the same &lt;code&gt;SlotSharingGroup&lt;/code&gt;. This means one subtask from each operator in the pipeline can be co-located in a single slot. For the running &lt;code&gt;MyJob&lt;/code&gt; example:&lt;/p&gt;

&lt;img class=&quot;center-image-0 center-image-70&quot; src=&quot;./assets/posts/flink/flink-task-slot.svg&quot; /&gt;

&lt;p&gt;The design motivation is twofold. First, it means a job with N pipeline stages does not need &lt;code&gt;N × parallelism&lt;/code&gt; slots. The number of slots needed equals the maximum parallelism across all operators (here: 2). Second, co-locating a full pipeline slice in one slot enables forward connections to stay local (in-memory data exchange, no network serialization).&lt;/p&gt;
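&lt;p&gt;The first point reduces to a one-line computation, shown here as a sketch (assuming the default single &lt;code&gt;SlotSharingGroup&lt;/code&gt;):&lt;/p&gt;

```java
// With slot sharing (all operators in one SlotSharingGroup), the slots
// a job needs equal the maximum operator parallelism, not the sum.
public class SlotCountSketch {
    static int slotsNeeded(int[] operatorParallelisms) {
        int max = 0;
        for (int p : operatorParallelisms) {
            max = Math.max(max, p);
        }
        return max;
    }

    public static void main(String[] args) {
        // MyJob: SourceMap parallelism 2, Window 2, Sink 1
        System.out.println(slotsNeeded(new int[]{2, 2, 1})); // 2, not 5
    }
}
```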

&lt;h3&gt;5.2.3. Task Execution Model&lt;/h3&gt;

&lt;p&gt;Each task runs in a dedicated thread and typically follows a simple internal pipeline: &lt;code&gt;InputGate(s) → OperatorChain → ResultPartition(s)&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;The task reads records from its &lt;code&gt;InputGate&lt;/code&gt;, passes them through the &lt;code&gt;OperatorChain&lt;/code&gt; (the chained operators from the JobGraph), and writes output to its &lt;code&gt;ResultPartition&lt;/code&gt;. Source tasks are the exception: they generate data directly, with no &lt;code&gt;InputGate&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;A &lt;code&gt;ResultPartition&lt;/code&gt; is divided into &lt;code&gt;SubPartitions&lt;/code&gt;, one per downstream consumer subtask. An &lt;code&gt;InputGate&lt;/code&gt; is composed of &lt;code&gt;InputChannels&lt;/code&gt;, one per upstream producer subtask.&lt;/p&gt;

&lt;p&gt;The data exchange between ResultPartitions and InputGates goes through the &lt;code&gt;ShuffleEnvironment&lt;/code&gt;. The default implementation is &lt;code&gt;NettyShuffleEnvironment&lt;/code&gt;. If the producer and consumer are in the same &lt;code&gt;TaskManager&lt;/code&gt;, data can be exchanged locally without going over the network.&lt;/p&gt;

&lt;p&gt;For MyJob (Source+Map chained, parallelism 2 → Window parallelism 2 → Sink parallelism 1):&lt;/p&gt;

&lt;img class=&quot;center-image-0 center-image-50&quot; src=&quot;./assets/posts/flink/flink-task-execution-model.svg&quot; /&gt;

&lt;p&gt;Source and Map are chained, so they share a thread with no serialization between them. The &lt;code&gt;keyBy&lt;/code&gt; triggers an all-to-all shuffle: each SourceMap subtask&apos;s ResultPartition has 2 SubPartitions (one per Window subtask), and each Window subtask&apos;s InputGate has 2 InputChannels (one per SourceMap subtask).&lt;/p&gt; 

&lt;p&gt;Records are hashed by key and routed to the SubPartition responsible for that key group. Window to Sink has a parallelism change (2 → 1), so each Window subtask&apos;s ResultPartition has only 1 SubPartition (the single Sink), and the Sink&apos;s &lt;code&gt;InputGate&lt;/code&gt; has 2 InputChannels (one per Window subtask).&lt;/p&gt;
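&lt;p&gt;The routing can be sketched as follows. Note this is a simplification: Flink&apos;s real &lt;code&gt;KeyGroupRangeAssignment&lt;/code&gt; runs the key&apos;s hash through a murmur hash first, while plain &lt;code&gt;hashCode()&lt;/code&gt; is used here only to keep the sketch self-contained:&lt;/p&gt;

```java
// Simplified sketch of key-to-subtask routing. Flink's real
// KeyGroupRangeAssignment runs the key's hashCode through a murmur
// hash first; plain hashCode() is used here to stay self-contained.
public class KeyRoutingSketch {
    static int keyGroup(Object key, int maxParallelism) {
        return Math.abs(key.hashCode() % maxParallelism);
    }

    // Which downstream subtask (and hence SubPartition) owns the key group.
    static int subtaskFor(Object key, int maxParallelism, int parallelism) {
        return keyGroup(key, maxParallelism) * parallelism / maxParallelism;
    }

    public static void main(String[] args) {
        int maxParallelism = 128; // Flink's default for low-parallelism jobs
        int parallelism = 2;      // the two Window subtasks
        // The same key always lands on the same subtask.
        System.out.println(subtaskFor("user-42", maxParallelism, parallelism));
    }
}
```

&lt;p&gt;Because the subtask is derived from the key group rather than the key directly, state can later be redistributed at key-group granularity when the job is rescaled.&lt;/p&gt;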

&lt;h3&gt;5.2.4. Task Manager Services&lt;/h3&gt;

&lt;p&gt;In this post, &lt;code&gt;TaskManager&lt;/code&gt; and &lt;code&gt;TaskExecutor&lt;/code&gt; have been used interchangeably. To clarify: &lt;code&gt;TaskManager&lt;/code&gt; is the process (the JVM), and &lt;code&gt;TaskExecutor&lt;/code&gt; is the main class running inside that process. In practice they refer to the same thing, at different levels of abstraction.&lt;/p&gt;

&lt;p&gt;When a TaskManager process starts, it initializes a set of shared services before any task is deployed. These services live for the lifetime of the process and are shared across all tasks running in it. They fall into a few categories.&lt;/p&gt;

&lt;p&gt;&lt;b&gt;Slot Management&lt;/b&gt; is central. The &lt;code&gt;TaskSlotTable&lt;/code&gt; tracks which slots exist, which are free, and which tasks are running in each slot. The &lt;code&gt;JobTable&lt;/code&gt; maps each active JobID to its &lt;code&gt;JobMaster&lt;/code&gt; connection, so the &lt;code&gt;TaskManager&lt;/code&gt; knows which JobMaster to report to for each task. The &lt;code&gt;JobLeaderService&lt;/code&gt; monitors leadership changes for each job, so if a JobMaster fails over, the TaskManager reconnects to the new leader.&lt;/p&gt;

&lt;p&gt;&lt;b&gt;Network and Shuffle&lt;/b&gt; handles all data exchange. The &lt;code&gt;ShuffleEnvironment&lt;/code&gt; (default: Netty) owns the buffer pools, creates &lt;code&gt;ResultPartitions&lt;/code&gt; for task output and &lt;code&gt;InputGates&lt;/code&gt; for task input. This is where credit-based flow control and backpressure happen. The &lt;code&gt;TaskExecutorPartitionTracker&lt;/code&gt; keeps track of which result partitions this TaskManager has produced, so they can be released when no longer needed.&lt;/p&gt;

&lt;p&gt;&lt;b&gt;Memory&lt;/b&gt; is handled by the per-slot &lt;code&gt;MemoryManager&lt;/code&gt; (managed off-heap memory) and the &lt;code&gt;IOManager&lt;/code&gt; (disk spill). Within managed memory, &lt;code&gt;SharedResources&lt;/code&gt; enables reference-counted sharing of resources like RocksDB caches across operators in the same slot. State backends like RocksDB/ForSt and operators that sort or hash data use managed memory. The &lt;code&gt;IOManager&lt;/code&gt; provides temporary file channels for spilling when memory is exhausted.&lt;/p&gt;

&lt;img class=&quot;center-image-0 center-image-65&quot; src=&quot;./assets/posts/flink/flink-task-executor-services.svg&quot; /&gt;

&lt;p&gt;&lt;b&gt;State and Checkpointing&lt;/b&gt; services support fault tolerance. The &lt;code&gt;LocalStateStoresManager&lt;/code&gt; maintains local copies of state on disk for faster recovery (instead of always fetching from the distributed checkpoint store). The &lt;code&gt;FileMergingManager&lt;/code&gt; is a newer optimization that merges many small checkpoint files into fewer larger ones to reduce file system pressure. The &lt;code&gt;ChangelogStoragesManager&lt;/code&gt; supports the changelog state backend. The &lt;code&gt;ChannelStateExecutorFactory&lt;/code&gt; handles snapshotting in-flight network buffers for unaligned checkpoints.&lt;/p&gt;

&lt;p&gt;&lt;b&gt;Classloading and Artifacts&lt;/b&gt; manages user code isolation. The &lt;code&gt;LibraryCacheManager&lt;/code&gt; maintains per-job classloaders so that different jobs running on the same TaskManager do not interfere with each other. The &lt;code&gt;PermanentBlobService&lt;/code&gt; downloads JAR files from the central &lt;code&gt;BlobServer&lt;/code&gt; on the JobManager side. The &lt;code&gt;FileCache&lt;/code&gt; handles files registered through the distributed cache API.&lt;/p&gt;

&lt;p&gt;&lt;b&gt;Connectivity&lt;/b&gt; keeps the TaskManager linked to the cluster. Two heartbeat managers run continuously: one toward the &lt;code&gt;ResourceManager&lt;/code&gt; (reporting slot availability and resource usage) and one toward each JobMaster (reporting task status and metrics). If heartbeats stop, the other side assumes the TaskManager is dead and triggers failover. &lt;code&gt;HAServices&lt;/code&gt; handles leader discovery so the TaskManager always knows who the current ResourceManager leader is.&lt;/p&gt;

&lt;p&gt;When a task gets deployed into a slot, it receives references to these shared services. It does not create its own network stack. The &lt;code&gt;NetworkBufferPool&lt;/code&gt; is shared across all tasks in the TaskManager, though each task gets its own &lt;code&gt;LocalBufferPool&lt;/code&gt; drawn from it. Managed memory is scoped per slot: all tasks sharing a slot through slot sharing share the same &lt;code&gt;MemoryManager&lt;/code&gt;, but tasks in different slots have independent memory budgets. Heartbeat connections are shared across the entire TaskManager process.&lt;/p&gt;

&lt;h3&gt;5.2.5. Task Manager Memory&lt;/h3&gt;

&lt;p&gt;A &lt;code&gt;TaskManager&lt;/code&gt; is a single JVM process. Its total memory is carved into strictly defined regions at startup, each serving a different purpose. Unlike a typical Java application where the JVM manages one undifferentiated heap, Flink explicitly budgets every byte.&lt;/p&gt;

&lt;p&gt;The first distinction is between what Flink controls (Total Flink Memory) and what the JVM needs for itself (Metaspace and Overhead). Together they form Total Process Memory, which is the container or process limit. When deploying on YARN or Kubernetes, Flink uses Total Process Memory to calculate the container request size.&lt;/p&gt;

&lt;p&gt;Within Total Flink Memory, the heap is split into Framework and Task. Both live in the same JVM heap at runtime; Flink does not enforce isolation between them. The separation exists for budgeting: it ensures the framework always has enough headroom for coordination even when user code is memory intensive. Task Heap has no fixed default because it is the remainder after every other component is subtracted from Total Flink Memory.&lt;/p&gt;
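&lt;p&gt;The remainder arithmetic is simple enough to sketch. The sizes below are made-up illustrations, not Flink&apos;s defaults:&lt;/p&gt;

```java
// Task Heap is the remainder after every other component is subtracted
// from Total Flink Memory. The sizes below are made-up illustrations,
// not Flink's defaults; units are MB.
public class MemoryBudgetSketch {
    static long taskHeap(long totalFlink, long frameworkHeap, long frameworkOffHeap,
                         long taskOffHeap, long network, long managed) {
        return totalFlink - frameworkHeap - frameworkOffHeap
                - taskOffHeap - network - managed;
    }

    public static void main(String[] args) {
        // 1600 total - 128 framework heap - 128 framework off-heap
        // - 0 task off-heap - 160 network - 640 managed
        System.out.println(taskHeap(1600, 128, 128, 0, 160, 640)); // 544
    }
}
```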

&lt;img class=&quot;center-image-0 center-image-75&quot; src=&quot;./assets/posts/flink/flink-total-memory.svg&quot; /&gt;

&lt;p&gt;The off-heap region covers Framework Off-Heap, Task Off-Heap, and Network Memory. All three are counted toward &lt;code&gt;-XX:MaxDirectMemorySize&lt;/code&gt;. Network Memory is allocated as JVM direct memory (&lt;code&gt;ByteBuffer.allocateDirect()&lt;/code&gt;), used exclusively for the network buffer pool that moves data between tasks. Framework and Task Off-Heap budget for both JVM direct memory and native memory; Flink counts their full configured amount toward the JVM direct memory limit as a conservative measure.&lt;/p&gt;

&lt;p&gt;Managed Memory in practice is scoped per slot, not per task. Each slot gets its own &lt;code&gt;MemoryManager&lt;/code&gt; with a budget of total managed memory divided by the number of slots. All tasks sharing a slot (through slot sharing) share this budget. For the &lt;code&gt;MyJob&lt;/code&gt; example:&lt;/p&gt;

&lt;img class=&quot;center-image-0 center-image-75&quot; src=&quot;./assets/posts/flink/flink-task-memory-budget.svg&quot; /&gt;

&lt;p&gt;Managed Memory is different: it lives outside JVM direct memory entirely. For stateful operators using RocksDB, Flink reserves a budget and RocksDB allocates its own native memory through JNI, invisible to &lt;code&gt;-XX:MaxDirectMemorySize&lt;/code&gt;. This means Managed Memory and Network Memory never compete for the same JVM budget, and the state backend (RocksDB/ForSt) cannot accidentally starve the network layer.&lt;/p&gt;

&lt;p&gt;The tradeoff is that if Managed Memory is misconfigured and the process exceeds its container limit, the OS kills the process rather than the JVM throwing a catchable exception.&lt;/p&gt;


&lt;h3&gt;5.3. Network&lt;/h3&gt;

&lt;p&gt;Flink&apos;s network stack sits inside &lt;code&gt;flink-runtime&lt;/code&gt; and connects all subtasks across TaskManagers. It is the layer through which all shuffled data flows, making it a primary factor in both throughput and latency. Coordination between TaskManagers and the JobManager uses RPC (Pekko). Data transport between subtasks uses a lower-level API built on Netty.&lt;/p&gt;

&lt;h3&gt;5.3.1. Physical Transport&lt;/h3&gt;

&lt;p&gt;In the example job, &lt;code&gt;keyBy()&lt;/code&gt; introduces a network shuffle between &lt;code&gt;SourceMap&lt;/code&gt; and &lt;code&gt;Window&lt;/code&gt;. Records can no longer stay local to the subtask that produced them. Each record is hashed by its key and routed to whichever Window subtask is responsible for that key group. This is a full all-to-all connection: every SourceMap subtask must be able to send to every Window subtask.&lt;/p&gt;

&lt;p&gt;As covered in the Task Execution Model, slot sharing places each pipeline slice into a single slot. With a small twist for this section, the two slots sit on two different TaskManagers. This means some connections are &lt;code&gt;local&lt;/code&gt; (same TM) and some are &lt;code&gt;remote&lt;/code&gt; (cross-TM, over TCP via Netty).&lt;/p&gt;

&lt;p&gt;Whether a connection is local or remote depends entirely on where the subtasks land:&lt;/p&gt;
&lt;img class=&quot;center-image-0 center-image-75&quot; src=&quot;./assets/posts/flink/flink-example-recap.svg&quot; /&gt;

&lt;p&gt;Each remote connection gets its own &lt;code&gt;TCP&lt;/code&gt; channel. At higher parallelism, say parallelism 4 across two TaskManagers offering 2 slots each, multiple subtasks of the same task share a &lt;code&gt;TaskManager&lt;/code&gt;. Their remote connections toward the same destination TaskManager are then multiplexed over a single TCP channel, reducing resource usage.&lt;/p&gt;

&lt;img class=&quot;center-image-0 center-image-100&quot; src=&quot;./assets/posts/flink/flink-tcp-channel.svg&quot; /&gt;

&lt;p&gt;Each subtask&apos;s output is a &lt;code&gt;ResultPartition&lt;/code&gt;, split into &lt;code&gt;ResultSubpartitions&lt;/code&gt;, one per downstream consumer. In the example, each SourceMap subtask has a &lt;code&gt;ResultPartition&lt;/code&gt; with 4 ResultSubpartitions (one for each Window subtask). Each Window subtask has a ResultPartition with 1 ResultSubpartition (the single Sink subtask).&lt;/p&gt;

&lt;p&gt;On the receiving side, each subtask reads from an &lt;code&gt;InputGate&lt;/code&gt; containing &lt;code&gt;InputChannels&lt;/code&gt;, one per upstream producer. Each Window subtask&apos;s InputGate has 4 InputChannels (one from each SourceMap subtask). Sink&apos;s InputGate has 4 InputChannels (one from each Window subtask).&lt;/p&gt;

&lt;p&gt;At this layer, Flink no longer deals with individual records. Data is serialized and packed into network buffers. Each subtask has its own local buffer pool, one on the sending side and one on the receiving side, bounded by: &lt;code&gt;#channels × buffers-per-channel + floating-buffers-per-gate&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;With defaults of 2 exclusive buffers per channel and 8 floating buffers per gate, each Window subtask&apos;s receiving buffer pool is capped at &lt;code&gt;4 × 2 + 8 = 16&lt;/code&gt; buffers. These are drawn from the &lt;code&gt;NetworkBufferPool&lt;/code&gt; covered in the Memory Model section.&lt;/p&gt;
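&lt;p&gt;As a sketch, the cap from the formula above:&lt;/p&gt;

```java
// Receiving-side buffer cap from the formula above:
// channels x buffers-per-channel + floating-buffers-per-gate.
public class BufferBudgetSketch {
    static int maxBuffers(int channels, int exclusivePerChannel, int floatingPerGate) {
        return channels * exclusivePerChannel + floatingPerGate;
    }

    public static void main(String[] args) {
        // Each Window subtask: 4 input channels, defaults of 2 exclusive
        // buffers per channel and 8 floating buffers per gate.
        System.out.println(maxBuffers(4, 2, 8)); // 16
    }
}
```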

&lt;h3&gt;5.3.2. Credit-based Flow Control&lt;/h3&gt;

&lt;p&gt;Since all logical channels between two TaskManagers are multiplexed over a single TCP connection, a slow receiver on one channel could stall the connection entirely, throttling every other subtask sharing the wire. Credit-based flow control solves this by tracking buffer availability per logical channel, keeping backpressure isolated.&lt;/p&gt;

&lt;p&gt;The core rule: a sender may only forward a buffer if the receiver has announced capacity for it. &lt;code&gt;1 buffer = 1 credit&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;On the receiving side, each remote input channel has two kinds of buffers:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Exclusive buffers (2 per channel): permanently assigned, never shared.&lt;/li&gt;
&lt;li&gt;Floating buffers (8 per gate): shared across all channels in the gate, borrowed on demand.&lt;/li&gt;
&lt;/ul&gt;

&lt;img class=&quot;center-image-0 center-image-90&quot; src=&quot;./assets/posts/flink/flink-flow-control.svg&quot; /&gt;

&lt;p&gt;If there are not enough floating buffers available globally, each buffer pool receives a share of what is available, proportional to its capacity. The cycle:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;When a channel is established, the receiver announces its exclusive buffers as initial credits.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The sender tracks the credit count per subpartition. Each sent buffer decrements it by one. No credit, no sending.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Each buffer sent also carries the sender&apos;s current backlog size: how many buffers are still waiting in that subpartition&apos;s queue.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The receiver uses the backlog to request floating buffers from the gate&apos;s shared pool. It may get all, some, or none. If none are available, it registers as a listener and gets notified when one is recycled.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Every newly acquired buffer is announced back to the sender as a fresh credit, and the cycle continues.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If a receiver falls behind, its credits eventually hit 0. The sender stops forwarding buffers for that channel only. The TCP connection stays open, other channels on it continue normally. In the example: if one Window subtask on TM2 falls behind, its credit drops to 0. The SourceMap subtasks stop sending to it but keep sending to every other Window subtask. The shared TCP connection between TM1 and TM2 is never blocked.&lt;/p&gt;
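&lt;p&gt;A toy model of a single logical channel makes the stall-and-resume behavior concrete. This is a deliberately minimal sketch of the credit accounting, not the Netty handler logic, and the buffer counts are invented:&lt;/p&gt;

```java
// Toy model of credit-based flow control on one logical channel:
// the sender may only forward a buffer while it holds a credit.
public class CreditFlowSketch {
    int credits; // announced by the receiver (1 buffer = 1 credit)
    int backlog; // buffers queued at the sender
    int sent;

    CreditFlowSketch(int initialCredits) {
        this.credits = initialCredits;
    }

    void produce(int buffers) {
        backlog += buffers;
    }

    // Sender loop: each sent buffer costs one credit.
    void trySend() {
        while (credits > 0) {
            if (backlog == 0) break;
            credits--;
            backlog--;
            sent++;
        }
    }

    // Receiver recycled a buffer and announces it as a fresh credit.
    void addCredit(int n) {
        credits += n;
    }

    public static void main(String[] args) {
        CreditFlowSketch ch = new CreditFlowSketch(2); // 2 exclusive buffers
        ch.produce(5);
        ch.trySend();
        System.out.println(ch.sent);    // 2: credits exhausted, channel backpressured
        System.out.println(ch.backlog); // 3 buffers still queued at the sender
        ch.addCredit(1);                // a buffer was recycled on the receiver
        ch.trySend();
        System.out.println(ch.sent);    // 3
    }
}
```

&lt;p&gt;Only this channel stops when credits hit 0; in the real stack, other channels multiplexed on the same TCP connection keep flowing.&lt;/p&gt;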

&lt;p&gt;Because one channel in a multiplex can no longer block another, overall resource utilization improves. Full control over how much data is &quot;on the wire&quot; also improves checkpoint alignment. Without flow control, a stalled receiver would still have the lower network stack&apos;s internal buffers filling up, and checkpoint barriers would queue behind all of that data, waiting for it to drain before alignment could begin. With credit-based control, there is far less data sitting in transit, so barriers propagate faster.&lt;/p&gt;

&lt;h3&gt;5.3.3. Buffer Flushing&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;RecordWriter&lt;/code&gt; serializes each record into bytes on the heap, then writes those bytes into the network buffer currently assigned to the target subpartition. If the record doesn&apos;t fit, the remaining bytes spill into a new buffer. The deserializer on the receiving side (&lt;code&gt;SpillingAdaptiveSpanningRecordDeserializer&lt;/code&gt;) handles reassembly, including records that span multiple 32 KB buffers.&lt;/p&gt;

&lt;p&gt;A buffer becomes available for &lt;code&gt;Netty&lt;/code&gt; to consume in three situations:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Buffer full: the writer finishes the buffer and requests a new one. The finished buffer is added to the subpartition queue, which notifies Netty.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Buffer timeout: a background thread (&lt;code&gt;OutputFlusher&lt;/code&gt;) periodically calls flush (default: every 100ms, configured via &lt;code&gt;execution.buffer-timeout.interval&lt;/code&gt;). This notifies Netty to consume whatever has been written so far without closing the buffer. The buffer stays in the queue and keeps accumulating more data from the writer side.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Special event: checkpoint barriers, end-of-partition events, etc. These finish all in-progress buffers immediately and add the event to every subpartition.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;img class=&quot;center-image-0 center-image-100&quot; src=&quot;./assets/posts/flink/flink-buffer-flushing.svg&quot; /&gt;

&lt;p&gt;The buffer is added to the subpartition queue while still being written to (via the &lt;code&gt;BufferBuilder&lt;/code&gt; / &lt;code&gt;BufferConsumer&lt;/code&gt; pair). The writer appends through the &lt;code&gt;BufferBuilder&lt;/code&gt;; Netty reads through the &lt;code&gt;BufferConsumer&lt;/code&gt;. This avoids synchronization on every record: the two sides coordinate only through the buffer&apos;s reader and writer indices.&lt;/p&gt;

&lt;p&gt;In low-throughput scenarios, the output flusher drives latency. In high-throughput scenarios, buffers fill up before the flusher fires and the system self-adjusts.&lt;/p&gt;
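&lt;p&gt;As a minimal sketch (in Python, not Flink&apos;s actual classes), the writer/reader coordination through a pair of indices looks like this:&lt;/p&gt;

```python
# Minimal sketch (not Flink's actual API) of the BufferBuilder/BufferConsumer
# idea: one side appends into a fixed-size buffer and advances a writer index;
# the other side reads everything up to that index. The only shared state is
# the pair of indices, so no per-record locking is needed.
class SharedBuffer:
    def __init__(self, capacity=32 * 1024):           # 32 KB network buffer
        self.data = bytearray(capacity)
        self.writer_index = 0                         # advanced by the writer
        self.reader_index = 0                         # advanced by the consumer

class BufferBuilder:
    """Writer side: appends serialized records, returns the bytes that did not fit."""
    def __init__(self, buf):
        self.buf = buf

    def append(self, record_bytes):
        free = len(self.buf.data) - self.buf.writer_index
        n = min(free, len(record_bytes))
        self.buf.data[self.buf.writer_index:self.buf.writer_index + n] = record_bytes[:n]
        self.buf.writer_index += n
        return record_bytes[n:]                       # spills into the next buffer

class BufferConsumer:
    """Reader side (Netty): reads whatever has been written so far."""
    def __init__(self, buf):
        self.buf = buf

    def poll(self):
        chunk = bytes(self.buf.data[self.buf.reader_index:self.buf.writer_index])
        self.buf.reader_index = self.buf.writer_index
        return chunk

buf = SharedBuffer(capacity=8)
builder, consumer = BufferBuilder(buf), BufferConsumer(buf)
leftover = builder.append(b"0123456789")              # 10 bytes into an 8-byte buffer
print(leftover)                                       # b'89' spills to a new buffer
print(consumer.poll())                                # b'01234567'
```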

&lt;h3&gt;6. End to End&lt;/h3&gt;

&lt;p&gt;The very first section introduced the high-level picture: Client, JobManager, TaskManagers. Here is the same diagram, redrawn with everything covered since.&lt;/p&gt;

&lt;img class=&quot;center-image-0 center-image-100&quot; src=&quot;./assets/posts/flink/flink-end-to-end.svg&quot; /&gt;

&lt;p&gt;If you made it this far, you now have a solid mental model of what happens inside a running Flink job, from graph compilation and operator chaining to state snapshots, flow control, and much more. Not everything Flink does, but enough to reason about what is actually going on when a job runs.&lt;/p&gt;

&lt;h3&gt;7. References&lt;/h3&gt;
&lt;pre style=&quot;max-height: 180px&quot;&gt;&lt;code&gt;[1] &quot;Flink Architecture,&quot; Apache Flink, [Online]. Available: https://nightlies.apache.org/flink/flink-docs-stable/docs/concepts/flink-architecture/.
[2] &quot;A Deep Dive into Flink&apos;s Network Stack,&quot; Apache Flink, [Online]. Available: https://flink.apache.org/2019/06/05/a-deep-dive-into-flinks-network-stack/.
[3] &quot;Flink Course Series 1: A General Introduction to Apache Flink,&quot; Alibaba Cloud, [Online]. Available: https://www.alibabacloud.com/blog/flink-course-series-1-a-general-introduction-to-apache-flink_597974.
[4] &quot;Apache Flink: Concepts Overview,&quot; Apache Flink, [Online]. Available: https://nightlies.apache.org/flink/flink-docs-release-2.2/docs/concepts/overview/.
[5] &quot;DataStream V2: Watermark,&quot; Apache Flink, [Online]. Available: https://nightlies.apache.org/flink/flink-docs-release-2.2/docs/dev/datastream-v2/watermark/.
[6] &quot;DataStream V2: Building Blocks,&quot; Apache Flink, [Online]. Available: https://nightlies.apache.org/flink/flink-docs-release-2.2/docs/dev/datastream-v2/building_blocks/.
&lt;/code&gt;&lt;/pre&gt;

</content><author><name>Adesh Nalpet Adimurthy</name></author><category term="System Wisdom" /><category term="Realtime" /><summary type="html">Most blog posts on Flink&apos;s internals and architecture, even the official documentation, tend to be fragmented across different examples and cover components in isolation. The approach taken here is to follow a single reference Flink job end-to-end, through every component and moving part it touches, keeping the discussion grounded in the example, rather than attempting broad coverage of Flink&apos;s full capabilities. The tradeoff is intentional: depth over breadth.</summary></entry><entry><title type="html">Apache Kafka Internals</title><link href="https://pyblog.xyz/kafka-internals" rel="alternate" type="text/html" title="Apache Kafka Internals" /><published>2025-02-09T00:00:00+00:00</published><updated>2025-02-09T00:00:00+00:00</updated><id>https://pyblog.xyz/kafka-internals</id><content type="html" xml:base="https://pyblog.xyz/kafka-internals">&lt;div class=&quot;blog-reference&quot;&gt;
&lt;p&gt;🚧 This post is a work in progress, but feel free to explore what’s here so far. Stay tuned for more!&lt;/p&gt;
&lt;/div&gt;

&lt;p&gt;&lt;code&gt;14 years&lt;/code&gt; of &lt;a href=&quot;https://kafka.apache.org/&quot; target=&quot;_blank&quot;&gt;Apache Kafka&lt;/a&gt;! Kafka is the de facto standard for event streaming, just like AWS S3 is for object storage and PostgreSQL is for RDBMS. While every TD&amp;amp;H (SWE) has likely used Kafka, managing a Kafka cluster is a whole other game. The long list of &lt;a href=&quot;https://kafka.apache.org/documentation/#configuration&quot; target=&quot;_blank&quot;&gt;high-importance configurations&lt;/a&gt; is a testament to this. In this blog post, the goal is to understand Kafka&apos;s internals enough to make sense of its many configurations and highlight best practices.&lt;/p&gt;

&lt;p&gt;&lt;img class=&quot;center-image-0 center-image-50&quot; src=&quot;./assets/posts/kafka/kafka-api.webp&quot; /&gt;&lt;/p&gt;

&lt;p&gt;On a completely different note, the cost and operational complexity of Kafka have led to the emergence of alternatives, making the &lt;code&gt;Kafka API&lt;/code&gt; the de facto standard for event streaming, similar to the S3 API and PG Wire. Some examples include: Confluent Kafka, Redpanda, WarpStream, AutoMQ, AWS MSK, Pulsar, and many more!&lt;/p&gt;

&lt;hr class=&quot;hr&quot; /&gt;

&lt;h3&gt;1. Event Stream&lt;/h3&gt;
&lt;p&gt;The core concept of Kafka revolves around streaming events. An event can be anything, typically representing an action or a piece of information about what happened, such as a button click or a temperature reading.&lt;/p&gt;
&lt;p&gt;Each event is modeled as a &lt;code&gt;record&lt;/code&gt; in Kafka with a &lt;code&gt;timestamp&lt;/code&gt;, &lt;code&gt;key&lt;/code&gt;, &lt;code&gt;value&lt;/code&gt;, and optional &lt;code&gt;headers&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;img class=&quot;center-image-0 center-image-70&quot; src=&quot;./assets/posts/kafka/event-stream.svg&quot; /&gt;&lt;/p&gt;

&lt;p&gt;The payload or event data is included in the &lt;code&gt;value&lt;/code&gt;, and the &lt;code&gt;key&lt;/code&gt; is used for:&lt;/p&gt;

&lt;ul class=&quot;one-line-list&quot;&gt;
    &lt;li&gt;imposing the ordering of events/messages,&lt;/li&gt;
    &lt;li&gt;co-locating the events that have the same key property,&lt;/li&gt;
    &lt;li&gt;and key-based storage, retention, or compaction.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In Kafka, the &lt;code&gt;key&lt;/code&gt; and &lt;code&gt;value&lt;/code&gt; are stored as byte arrays, giving the flexibility to encode the data with any serializer. Optionally, a combination of the &lt;a target=&quot;_blank&quot; href=&quot;https://github.com/confluentinc/schema-registry&quot;&gt;Schema Registry&lt;/a&gt; and the &lt;a href=&quot;https://mvnrepository.com/artifact/io.confluent/kafka-avro-serializer&quot; target=&quot;_blank&quot;&gt;Avro serializer&lt;/a&gt; is a common practice.&lt;/p&gt;
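&lt;p&gt;A hedged sketch of the record model: Kafka itself only sees byte arrays, so the client chooses the serializers. The helper below (a made-up name, not any client API) encodes the key as UTF-8 and the value as JSON:&lt;/p&gt;

```python
import json, time

# Illustrative only: the field names (timestamp/key/value/headers) mirror the
# record model described above; make_record is a hypothetical helper, not a
# real Kafka client function.
def make_record(key, value, headers=None, timestamp=None):
    return {
        "timestamp": timestamp if timestamp is not None else int(time.time() * 1000),
        "key": key.encode("utf-8") if key is not None else None,   # bytes or None
        "value": json.dumps(value).encode("utf-8"),                # opaque bytes to Kafka
        "headers": headers or {},
    }

rec = make_record("sensor-42", {"temp_c": 21.5}, headers={"source": "iot"})
print(type(rec["value"]))                       # <class 'bytes'> — Kafka stores it opaquely
print(json.loads(rec["value"].decode("utf-8")))
```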

&lt;hr class=&quot;hr&quot; /&gt;

&lt;h3&gt;2. Kafka Topics&lt;/h3&gt;
&lt;p&gt;For comparison, &lt;code&gt;topics&lt;/code&gt; are like tables in a database. In Kafka, they are used to organize events of the same type, and hence the same schema, together. The producer specifies which topic to publish to, and the subscriber or consumer specifies which topic(s) to read from. Note: the stream is immutable and append-only.&lt;/p&gt;

&lt;p&gt;&lt;img class=&quot;center-image-0 center-image-100&quot; src=&quot;./assets/posts/kafka/kafka-cluster.svg&quot; /&gt;&lt;/p&gt;

&lt;p&gt;The immediate question is, how do we distribute data in topics across different nodes in the Kafka cluster? This calls for a way to distribute data within the topic. That&apos;s where &lt;code&gt;partitions&lt;/code&gt; come into play.&lt;/p&gt;

&lt;h3&gt;2.1. Kafka Topic Partitions&lt;/h3&gt;
&lt;p&gt;A Kafka topic can have one or more &lt;code&gt;partitions&lt;/code&gt;, and a partition can be regarded as the unit of data distribution and also the unit of &lt;code&gt;parallelism&lt;/code&gt;. Partitions of a topic can reside on different nodes of the Kafka cluster. Each partition can be accessed independently; hence, within a consumer group, you can have at most as many active consumers as there are partitions (strongly dictating the horizontal scalability of consumers).&lt;/p&gt;

&lt;p&gt;&lt;img class=&quot;center-image-0 center-image-100&quot; src=&quot;./assets/posts/kafka/kafka-partitions.svg&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Furthermore, each event/record within the partition has a unique ID called the &lt;code&gt;offset&lt;/code&gt;: a monotonically increasing number; once an offset is assigned, it is never reused. The events in a partition are delivered to the consumer in offset order.&lt;/p&gt;

&lt;h3&gt;2.2. Choosing Number of Partitions&lt;/h3&gt;

&lt;p&gt;The number of partitions dictates &lt;code&gt;parallelism&lt;/code&gt; and hence the &lt;code&gt;throughput&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The more partitions a topic has:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The higher the &lt;code&gt;throughput&lt;/code&gt;: both the producer and the broker can process different partitions independently and in parallel, leading to better utilization of resources for expensive operations such as &lt;code&gt;compression&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;The more consumers a &lt;code&gt;consumer group&lt;/code&gt; can hold, again raising throughput. Each consumer can consume messages from multiple partitions, but one partition cannot be shared across consumers in the same consumer group.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;img class=&quot;center-image-0 center-image-85&quot; src=&quot;./assets/posts/kafka/kafka-cluster-example.svg&quot; /&gt;&lt;/p&gt;

&lt;p&gt;However, it&apos;s important to strike a balance when choosing the number of partitions. More partitions may increase unavailability/downtime periods.&lt;/p&gt;
&lt;ul&gt;
    &lt;li&gt;Quick pre-context (from &lt;a href=&quot;#3-3-data-replication&quot;&gt;Section 3.3&lt;/a&gt;): A partition has multiple &lt;code&gt;replicas&lt;/code&gt;, each stored in different brokers, and one replica is assigned as the &lt;code&gt;leader&lt;/code&gt; while the rest are &lt;code&gt;followers&lt;/code&gt;. The producer and consumer requests are typically served by the leader broker (of that partition).&lt;/li&gt;
    &lt;li&gt;When a Kafka broker goes down, leadership of the partitions it hosted moves to other available replicas so client requests can still be served. When the number of partitions is high, the cumulative latency of electing new leaders adds up.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;More partitions mean more RAM is consumed by the clients (especially the producer): &lt;/p&gt;
&lt;ul&gt;
    &lt;li&gt;the producer client creates a buffer per partition (&lt;a href=&quot;#3-1-producer&quot;&gt;Section 3.1&lt;/a&gt;: accumulated by byte size or time). With more partitions, the memory consumption adds up. &lt;/li&gt;
    &lt;li&gt;Similarly, the consumer client fetches a batch of records per partition, hence increasing the memory needs (crucial for real-time low-latency consumers).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The idea behind choosing the number of partitions is to measure the maximum throughput that can be achieved on a single partition (for both production and consumption) and choose the number of partitions to accommodate the &lt;code&gt;target throughput&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;img class=&quot;center-image-0 center-image-60&quot; src=&quot;./assets/posts/kafka/partitions-equation.svg&quot; /&gt;&lt;/p&gt;

&lt;p&gt;The reason for running benchmarks to determine the number of partitions is that per-partition throughput depends on several factors, such as: batch size, compression codec, type of acknowledgment, replication factor, etc. To leave headroom, choose &lt;code&gt;(1.2 * P)&lt;/code&gt; or higher; it&apos;s a common practice to &lt;code&gt;over-partition&lt;/code&gt; by a bit.&lt;/p&gt;
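&lt;p&gt;The sizing rule above can be sketched as a small calculation (the throughput numbers are made up for illustration):&lt;/p&gt;

```python
import math

# Sketch of the sizing rule: benchmark per-partition produce throughput (t_p)
# and consume throughput (t_c), then pick enough partitions for the target
# throughput t, with ~20% headroom for over-partitioning.
def partitions_needed(target_mb_s, produce_mb_s, consume_mb_s, headroom=1.2):
    p = max(target_mb_s / produce_mb_s, target_mb_s / consume_mb_s)
    return math.ceil(p * headroom)

# e.g. target 100 MB/s, 10 MB/s per partition on produce, 20 MB/s on consume
print(partitions_needed(target_mb_s=100, produce_mb_s=10, consume_mb_s=20))  # 12
```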

&lt;hr class=&quot;hr&quot; /&gt;

&lt;p&gt;The Kafka cluster has a &lt;code&gt;control plane&lt;/code&gt; and a &lt;code&gt;data plane&lt;/code&gt;, where the control plane is responsible for handling all the metadata, and the data plane handles the actual data/events.&lt;/p&gt;

&lt;h3&gt;3. Kafka Broker (Data Plane)&lt;/h3&gt;

&lt;p&gt;Diving into the workings of the data plane, there are two types of requests the Kafka broker handles: the put requests from the producer and the get requests from the consumer.&lt;/p&gt;

&lt;p&gt;&lt;img class=&quot;center-image-0 center-image-70&quot; src=&quot;./assets/posts/kafka/record-batch.svg&quot; /&gt;&lt;/p&gt;

&lt;h3&gt;3.1. Producer&lt;/h3&gt;
&lt;p&gt;The &lt;b&gt;producer&lt;/b&gt; requests start with the producer application, sending the request with the key and value. The Kafka producer library determines which partition the messages should be produced to, using a hash of the supplied partition key. Hence, records with the same key always go to the same partition. When no partition key is supplied, older clients fall back to round-robin across partitions, while newer clients (Kafka 2.4+) use a sticky partitioner that fills a batch for one partition before moving to the next.&lt;/p&gt;
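&lt;p&gt;The key-to-partition mapping can be sketched as follows; the real Java client hashes keys with murmur2, so &lt;code&gt;crc32&lt;/code&gt; here is only a stand-in to show the idea:&lt;/p&gt;

```python
import itertools, zlib

# Illustrative partitioner, not the real client's: same key -> same partition;
# keyless records rotate (the actual modern client uses a "sticky" strategy
# that fills one batch before moving on).
def partition_for(key, num_partitions, _counter=itertools.count()):
    if key is None:
        return next(_counter) % num_partitions   # keyless: rotate partitions
    # crc32 stands in for murmur2 purely for illustration
    return zlib.crc32(key.encode("utf-8")) % num_partitions

p1 = partition_for("user-7", 6)
p2 = partition_for("user-7", 6)
print(p1 == p2)  # True — same key always maps to the same partition
```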

&lt;p&gt;Sending each record to the broker individually is not very efficient. The producer library therefore buffers data for a particular partition in an in-memory data structure (record batches). Data in the buffer is accumulated up to a limit based on the total size of all the records or on time (&lt;code&gt;batch.size&lt;/code&gt; and &lt;code&gt;linger.ms&lt;/code&gt;). That is, once enough time has passed or enough data has accumulated, the records are flushed to the corresponding broker.&lt;/p&gt;

&lt;p&gt;Lastly, batching allows records to be compressed, as it is better to compress a batch of records than a single record.&lt;/p&gt;
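&lt;p&gt;A minimal sketch of the size-or-time accumulator described above (mirroring the &lt;code&gt;batch.size&lt;/code&gt; / &lt;code&gt;linger.ms&lt;/code&gt; behavior, not the actual client code):&lt;/p&gt;

```python
import time

# Hedged sketch of the producer-side accumulator: records for a partition are
# buffered and flushed when either the byte limit or the time limit is hit.
class Accumulator:
    def __init__(self, max_bytes=16384, linger_s=0.1, clock=time.monotonic):
        self.max_bytes, self.linger_s, self.clock = max_bytes, linger_s, clock
        self.batch, self.size, self.started = [], 0, None

    def append(self, record):
        if self.started is None:            # start the linger timer on first record
            self.started = self.clock()
        self.batch.append(record)
        self.size += len(record)

    def ready(self):
        if not self.batch:
            return False
        return (self.size >= self.max_bytes
                or self.clock() - self.started >= self.linger_s)

    def drain(self):
        batch, self.batch, self.size, self.started = self.batch, [], 0, None
        return batch                        # this is what gets sent to the broker
```

&lt;p&gt;An injected clock makes the time-based trigger easy to see: after &lt;code&gt;linger_s&lt;/code&gt; elapses, &lt;code&gt;ready()&lt;/code&gt; flips to true even for a small batch.&lt;/p&gt;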

&lt;h3&gt;3.1.1. Socket Receive Buffer &amp;amp; Network Threads&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;Network threads&lt;/code&gt; in a Kafka broker are like workers that handle communication between the Kafka server (broker) and the outside world (clients), i.e., they handle messages coming into the server (data sent by producers).&lt;small&gt;&lt;br /&gt;*and also send messages back to clients (consumers fetching data)&lt;/small&gt;&lt;/p&gt;

&lt;p&gt;To avoid network threads being overwhelmed by incoming data, a &lt;code&gt;socket buffer&lt;/code&gt; stands before the network threads that buffers incoming requests.&lt;/p&gt;

&lt;p&gt;&lt;img class=&quot;center-image-0 center-image-80&quot; src=&quot;./assets/posts/kafka/network-thread-producer.svg&quot; /&gt;&lt;/p&gt;

&lt;p&gt;The network thread handles each producer/client request throughout the rest of its lifecycle (the same network thread tracks the request through the entire process, until the request is fully handled and the response is sent). For example, when a producer sends messages to a Kafka topic:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The network thread receives the request from the producer,&lt;/li&gt;
&lt;li&gt;processes the request &lt;small&gt;&lt;br /&gt;*(write the message to the Kafka commit log &amp;amp; wait for replication).&lt;/small&gt;&lt;/li&gt;
&lt;li&gt;Once processing is done, the network thread sends a response (acknowledgment that the messages were successfully received).&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;3.1.2. Request Queue &amp;amp; I/O Threads&lt;/h3&gt;

&lt;p&gt;Each network thread handles multiple requests from different clients (multiplex) and is meant to be lightweight, where it receives the bytes, forms a producer request, and publishes it to a &lt;code&gt;shared request queue&lt;/code&gt;, immediately handling the next request.&lt;/p&gt;

&lt;p&gt;Note: In order to guarantee the order of requests from a client, the network thread handles one request per client at a time; i.e., only after completing a request (with a response), does the network thread take another request from the same client.&lt;/p&gt;

&lt;p&gt;&lt;img class=&quot;center-image-0 center-image-65&quot; src=&quot;./assets/posts/kafka/i-o-threads.svg&quot; /&gt;&lt;/p&gt;

&lt;p&gt;The second main pool in Kafka, the &lt;code&gt;I/O threads&lt;/code&gt;, picks requests from the shared &lt;code&gt;request queue&lt;/code&gt;. The I/O threads handle requests from any client, unlike the network threads.&lt;/p&gt;

&lt;h3&gt;3.1.3. Commit Log&lt;/h3&gt;
&lt;p&gt;The I/O thread first validates the data (CRC) and appends data to a data structure called the &lt;code&gt;commit log&lt;/code&gt; (by partition).&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;00000000000000000000.log
00000000000000000000.index
00000000000000000025.log
00000000000000000025.index
...
00000000000000004580.log
00000000000000004580.index
...
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The number (&lt;code&gt;0, 25 &amp;amp; 4580&lt;/code&gt;) in a segment&apos;s file name represents the base offset (i.e., the offset of the first message) of that segment.&lt;/p&gt;

&lt;p&gt;&lt;img class=&quot;center-image-0 center-image-100&quot; src=&quot;./assets/posts/kafka/segment.svg&quot; /&gt;&lt;/p&gt;

&lt;p&gt;The commit log (per partition) is organized on &lt;code&gt;disk&lt;/code&gt; as &lt;code&gt;segments&lt;/code&gt;. Each segment has two main parts: the actual &lt;code&gt;data&lt;/code&gt; and the &lt;code&gt;index&lt;/code&gt; (&lt;code&gt;.log&lt;/code&gt; and &lt;code&gt;.index&lt;/code&gt;), which stores the position inside the log file. By default, the broker acknowledges the produce request only after replicating across other brokers (based on the &lt;code&gt;replication factor&lt;/code&gt;), since Kafka offers high durability via replication.&lt;/p&gt;
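&lt;p&gt;A sketch of the lookup path: a binary search over segment base offsets finds the right file, and the sparse index narrows down the byte position (the offsets and positions below are invented for illustration):&lt;/p&gt;

```python
import bisect

# Segment file names are base offsets, so a binary search over them finds the
# right segment; the sparse .index inside it maps offsets to byte positions.
# All numbers here are made up to show the mechanism.
segments = {
    0:    [(0, 0), (10, 4200)],    # (offset, byte position in the .log file)
    25:   [(25, 0), (40, 3900)],
    4580: [(4580, 0)],
}
bases = sorted(segments)

def locate(offset):
    # largest base offset <= requested offset picks the segment file
    base = bases[bisect.bisect_right(bases, offset) - 1]
    # last index entry at or before the offset; the broker scans forward from there
    entry = max((e for e in segments[base] if e[0] <= offset), key=lambda e: e[0])
    return base, entry[1]

print(locate(30))   # segment 25, start scanning at byte position 0
```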

&lt;p&gt;Note: A new batch of records (producer request) is first written into the OS&apos;s &lt;code&gt;page cache&lt;/code&gt; and flushed to disk asynchronously. If the Kafka JVM crashes, recent messages in the page cache survive; data is lost only if the machine itself crashes before the flush. &lt;code&gt;Topic replication&lt;/code&gt; mitigates this, meaning data loss is possible only if multiple brokers crash simultaneously.&lt;/p&gt;

&lt;h3&gt;3.1.4. Purgatory &amp;amp; Response Queue&lt;/h3&gt;

&lt;p&gt;While waiting for full replication, the I/O thread is not blocked. Instead, the pending produce requests are stashed in the &lt;code&gt;purgatory&lt;/code&gt;, and the I/O Thread is freed up to process the next set of requests.&lt;/p&gt;

&lt;p&gt;&lt;img class=&quot;center-image-0 center-image-65&quot; src=&quot;./assets/posts/kafka/purgatory.svg&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Once the data of the pending produce request is fully replicated, the request is moved out of the purgatory.&lt;/p&gt;

&lt;h3&gt;3.1.5. Network Thread &amp;amp; Socket Send Buffer&lt;/h3&gt;

&lt;p&gt;The response is then placed on the &lt;code&gt;shared response queue&lt;/code&gt;, picked up by the network thread, and sent out through the &lt;code&gt;socket send buffer&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;img class=&quot;center-image-0 center-image-100&quot; src=&quot;./assets/posts/kafka/broker-client.svg&quot; /&gt;&lt;/p&gt;

&lt;h3&gt;3.2. Consumer&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;The consumer client sends the fetch request, specifying the &lt;code&gt;topic&lt;/code&gt;, the &lt;code&gt;partition&lt;/code&gt;, and the &lt;code&gt;start offset&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Similar to the produce request, the fetch request goes through the &lt;code&gt;socket receive buffer&lt;/code&gt; &amp;gt; &lt;code&gt;network threads&lt;/code&gt; &amp;gt; &lt;code&gt;shared request queue&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;I/O threads&lt;/code&gt; then refer to the index structure to find the corresponding file byte range using the &lt;code&gt;offset index&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;To prevent frequent empty responses when no new data has been ingested, the consumer typically specifies the minimum number of bytes and the maximum amount of time for the response.&lt;/li&gt;
&lt;li&gt;The fetch request is parked in the &lt;code&gt;purgatory&lt;/code&gt; until either of those conditions is met.&lt;/li&gt;
&lt;li&gt;When the byte or time threshold is met, the request is taken out of purgatory and placed in the &lt;code&gt;response queue&lt;/code&gt; for the network thread, which sends the actual data as a response to the consumer/client.&lt;/li&gt;
&lt;/ul&gt;
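&lt;p&gt;The long-poll behavior in the steps above can be sketched as a simple loop (the names mirror the consumer&apos;s &lt;code&gt;fetch.min.bytes&lt;/code&gt; / &lt;code&gt;fetch.max.wait.ms&lt;/code&gt; configs; the loop itself is illustrative, not broker code):&lt;/p&gt;

```python
# Sketch of the fetch "long poll": the broker parks the request until enough
# bytes are available or the wait budget runs out, whichever comes first.
def serve_fetch(available_bytes_at, min_bytes, max_wait_ms, step_ms=10):
    waited = 0
    while waited < max_wait_ms:
        if available_bytes_at(waited) >= min_bytes:
            return waited, "min-bytes met"
        waited += step_ms                  # stays parked in purgatory
    return waited, "timed out"

# data trickles in at 1 byte per millisecond
print(serve_fetch(lambda ms: ms, min_bytes=50, max_wait_ms=500))  # (50, 'min-bytes met')
```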

&lt;p&gt;Kafka uses &lt;code&gt;zero-copy&lt;/code&gt; transfers in the network, meaning there are no intermediate memory copies. Instead, data is transferred directly from disk buffers to the remote socket, making it memory efficient.&lt;/p&gt;
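&lt;p&gt;The zero-copy path is the &lt;code&gt;sendfile(2)&lt;/code&gt; syscall (reached from Java via NIO&apos;s &lt;code&gt;transferTo&lt;/code&gt;); a minimal demonstration over a local socket pair:&lt;/p&gt;

```python
import os, socket, tempfile

# os.sendfile hands the kernel a file descriptor and a socket, so the bytes
# move from the page cache to the socket without passing through user-space
# buffers. serve_log_segment is an illustrative helper, not broker code.
def serve_log_segment(sock, path):
    with open(path, "rb") as f:
        size = os.fstat(f.fileno()).st_size
        sent = 0
        while sent < size:
            sent += os.sendfile(sock.fileno(), f.fileno(), sent, size - sent)
    return sent

a, b = socket.socketpair()
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"record-batch-bytes")
sent = serve_log_segment(a, tmp.name)
print(sent, b.recv(64))   # 18 b'record-batch-bytes'
os.unlink(tmp.name)
```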

&lt;p&gt;However, reading older data, which involves accessing the disk, can block the network thread. This isn&apos;t ideal, as the network threads are shared by several clients, so one slow read delays processing for everyone else. The &lt;code&gt;Tiered Storage&lt;/code&gt; fetch solves this very problem.&lt;/p&gt;

&lt;h3&gt;3.2.1. Tiered Storage&lt;/h3&gt;

&lt;p&gt;Tiered storage in Kafka was introduced as an early access feature in 3.6.0 (October 10, 2023).&lt;/p&gt;

&lt;p&gt;&lt;img class=&quot;center-image-0 center-image-90&quot; src=&quot;./assets/posts/kafka/broker-local-storage.svg&quot; /&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;Tiered storage&lt;/code&gt; is a common storage architecture that uses different classes/layers/tiers of storage to efficiently store and manage data based on access patterns, performance needs, and cost. A typical tier model has frequently accessed data or &quot;hot&quot; data, and less frequently accessed data is moved (not copied) to a lower-cost, lower-performance storage (&quot;warm&quot;). Outside of the tiers, &quot;cold&quot; storage is a common practice for storing backups.&lt;/p&gt;

&lt;p&gt;Kafka is designed to ingest large volumes of data. Without tiered storage, a single broker is responsible for hosting an entire replica of a topic partition, adding a limit to how much data can be stored. This isn&apos;t much of a concern in real-time applications where older data is not relevant.&lt;/p&gt;

&lt;p&gt;&lt;img class=&quot;center-image-0 center-image-90&quot; src=&quot;./assets/posts/kafka/broker-tiered-storage.svg&quot; /&gt;&lt;/p&gt;

&lt;p&gt;But in cases where historical data is necessary, tiered storage allows storing less frequently accessed data in remote storage (not present locally in the broker).&lt;/p&gt;

&lt;p&gt;Tiered storage offers several advantages:&lt;/p&gt;
&lt;ul&gt;
    &lt;li&gt;&lt;code&gt;Cost&lt;/code&gt;: It&apos;s cost-effective as inactive segments of local storage (stored on expensive fast local disks like SSDs) can be moved to remote storage (object stores such as S3), making storage cheaper and virtually unlimited.&lt;/li&gt;
    &lt;li&gt;&lt;code&gt;Elasticity&lt;/code&gt;: Now that storage and compute of brokers are separated and can be scaled independently, it also allows faster cluster operations due to less local data. Without tiered storage, needing more storage essentially meant increasing the number of brokers (which also increases compute).&lt;/li&gt;
    &lt;li&gt;&lt;code&gt;Isolation&lt;/code&gt;: It provides better isolation between real-time consumers and historical data consumers.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Coming back to the fetch request (from the consumer) with &lt;code&gt;tiered storage&lt;/code&gt; enabled: if the consumer requests an offset that is still present locally, the data is served the same way as before, from the &lt;code&gt;page cache&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;img class=&quot;center-image-0 center-image-90&quot; src=&quot;./assets/posts/kafka/broker-consumer-tiered-storage.svg&quot; /&gt;&lt;/p&gt;

&lt;p&gt;The chances of most local data being in the page cache are also higher (due to smaller local data). However, if the data is not present locally and is in the &lt;code&gt;remote store&lt;/code&gt;, the broker will stream the remote data from the object store into an in-memory buffer via the &lt;code&gt;Tiered Fetch Threads&lt;/code&gt;, all the way to the remote &lt;code&gt;socket send buffer&lt;/code&gt; in the network thread.&lt;/p&gt;

&lt;p&gt;Hence, the network thread is no longer blocked even when the consumer is accessing historical data. i.e., real-time and historical data access don&apos;t impact each other.&lt;/p&gt;
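&lt;p&gt;A toy sketch of that read-path decision (plain dicts stand in for the local segments and the remote object store):&lt;/p&gt;

```python
# Illustrative only: serve from local segments (page cache) when the offset is
# at or beyond the local log start, otherwise stream from the remote store via
# the separate tiered-fetch path so network threads aren't blocked on it.
def read(offset, local_log_start, local, remote):
    if offset >= local_log_start:
        return ("local", local[offset])
    return ("tiered-fetch", remote[offset])

local = {100: b"new"}     # recent data, still on the broker's disk
remote = {5: b"old"}      # older segments, offloaded to object storage
print(read(100, local_log_start=100, local=local, remote=remote))  # ('local', b'new')
print(read(5, local_log_start=100, local=local, remote=remote))    # ('tiered-fetch', b'old')
```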

&lt;h3&gt;3.3. Data Replication&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;Replication&lt;/code&gt; in the Data Plane is a critical feature of Kafka that offers &lt;code&gt;durability&lt;/code&gt; and &lt;code&gt;high-availability&lt;/code&gt;. Replication is typically enabled and defined at the time of creating the topic.&lt;/p&gt;

&lt;p&gt;Each partition of the topic will be replicated across replicas (&lt;code&gt;replication factor&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;&lt;img class=&quot;center-image-0 center-image-95&quot; src=&quot;./assets/posts/kafka/data-replication.svg&quot; /&gt;&lt;/p&gt;

&lt;p&gt;One of the replicas is assigned to be the &lt;code&gt;leader&lt;/code&gt; of that partition, and the rest are called &lt;code&gt;followers&lt;/code&gt;. The producer sends the data to the leader, and the followers retrieve the data from the leader for replication. In a similar fashion, the consumer reads from the leader; however, the consumer(s) can also read from the follower(s).&lt;/p&gt;

&lt;!-- &lt;h3&gt;4. Kafka Broker (Control Plane)&lt;/h3&gt; --&gt;

&lt;h3&gt;6. References&lt;/h3&gt;
&lt;pre style=&quot;max-height: 180px&quot;&gt;&lt;code&gt;[1] &quot;Apache Kafka Streams Architecture,&quot; Apache Kafka, [Online]. Available: https://kafka.apache.org/39/documentation/streams/architecture.
[2] &quot;Apache Kafka Documentation: Configuration,&quot; Apache Kafka, [Online]. Available: https://kafka.apache.org/documentation/#configuration.
[3] J. Rao, &quot;Apache Kafka Architecture and Internals,&quot; Confluent, [Online]. Available: https://www.confluent.io/blog/apache-kafka-architecture-and-internals-by-jun-rao/.
&lt;/code&gt;&lt;/pre&gt;</content><author><name>Adesh Nalpet Adimurthy</name></author><category term="System Wisdom" /><category term="System Design" /><category term="Realtime" /><summary type="html">🚧 This post is a work in progress, but feel free to explore what’s here so far. Stay tuned for more!</summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://pyblog.xyz/assets/featured/webp/the-kafka.webp" /><media:content medium="image" url="https://pyblog.xyz/assets/featured/webp/the-kafka.webp" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Breath-First Search using Stack</title><link href="https://pyblog.xyz/stack-based-bfs" rel="alternate" type="text/html" title="Breath-First Search using Stack" /><published>2024-07-21T00:00:00+00:00</published><updated>2024-07-21T00:00:00+00:00</updated><id>https://pyblog.xyz/stack-based-bfs</id><content type="html" xml:base="https://pyblog.xyz/stack-based-bfs">&lt;h3&gt;1. BFS using Queue&lt;/h3&gt;
&lt;p&gt;In the prior post on &lt;a href=&quot;https://pyblog.xyz/stack-based-bfs&quot;&gt;graph traversal&lt;/a&gt;, we went into the details of Depth-First Search (DFS) and Breadth-First Search (BFS). BFS is a way of traversing the graph level by level. Specifically, for a balanced tree, the first/root node is visited first, followed by its immediate children, then the next level of children, and so on. Here&apos;s the same example of BFS using a queue:&lt;/p&gt;

&lt;div class=&quot;slider&quot; id=&quot;slider8&quot;&gt;
  &lt;div class=&quot;slides center-image-0 center-image-80&quot;&gt;
    &lt;img src=&quot;./assets/posts/graph-theory/bfs-tree/bfs-tree-Page-1.svg&quot; class=&quot;slide&quot; /&gt;
    &lt;img src=&quot;./assets/posts/graph-theory/bfs-tree/bfs-tree-Page-2.svg&quot; class=&quot;slide&quot; /&gt;
    &lt;img src=&quot;./assets/posts/graph-theory/bfs-tree/bfs-tree-Page-3.svg&quot; class=&quot;slide&quot; /&gt;
    &lt;img src=&quot;./assets/posts/graph-theory/bfs-tree/bfs-tree-Page-4.svg&quot; class=&quot;slide&quot; /&gt;
    &lt;img src=&quot;./assets/posts/graph-theory/bfs-tree/bfs-tree-Page-5.svg&quot; class=&quot;slide&quot; /&gt;
    &lt;img src=&quot;./assets/posts/graph-theory/bfs-tree/bfs-tree-Page-6.svg&quot; class=&quot;slide&quot; /&gt;
    &lt;img src=&quot;./assets/posts/graph-theory/bfs-tree/bfs-tree-Page-7.svg&quot; class=&quot;slide&quot; /&gt;
    &lt;img src=&quot;./assets/posts/graph-theory/bfs-tree/bfs-tree-Page-8.svg&quot; class=&quot;slide&quot; /&gt;
  &lt;/div&gt;
  &lt;div class=&quot;controls&quot;&gt;
    &lt;button onclick=&quot;plusSlides(-1, &apos;slider8&apos;)&quot; class=&quot;prev black-button&quot;&gt;Prev&lt;/button&gt;
    &lt;button onclick=&quot;playSlides(&apos;slider8&apos;)&quot; class=&quot;play black-button&quot;&gt;Play&lt;/button&gt;
    &lt;button onclick=&quot;stopSlides(&apos;slider8&apos;)&quot; class=&quot;stop black-button&quot; hidden=&quot;&quot;&gt;Stop&lt;/button&gt;
    &lt;button onclick=&quot;plusSlides(1, &apos;slider8&apos;)&quot; class=&quot;next black-button&quot;&gt;Next&lt;/button&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;h3&gt;2. Problem: Space Complexity&lt;/h3&gt;
&lt;p&gt;The problem with this solution is adding all the immediate children to the queue before visiting them. While this isn&apos;t much of a concern for a binary tree, imagine a non-binary tree where at each level the number of nodes grows exponentially. In the example below, when the second-level &lt;code&gt;node G&lt;/code&gt; is visited, the queue already has 49 entries. For the Nth level: &lt;code&gt;7^(N-1)&lt;/code&gt; nodes. For level 11, there would be &lt;code&gt;7^10 = 282,475,249&lt;/code&gt; entries in the queue. Nearly 300 million entries, at a 4-byte pointer per entry, works out to roughly 1 GB.&lt;/p&gt;

&lt;p&gt;&lt;img class=&quot;center-image-0 center-image-100&quot; src=&quot;./assets/posts/graph-theory/bfs-stack-problem.svg&quot; /&gt;&lt;/p&gt;

&lt;h3&gt;3. Solution: BFS using Stack&lt;/h3&gt;
&lt;p&gt;In the recursive approach below, the space complexity depends on the number of levels (the depth of the recursion). In a balanced tree, the space complexity is now &lt;code&gt;O(log(n))&lt;/code&gt;, where &lt;code&gt;n&lt;/code&gt; is the total number of nodes.&lt;/p&gt;

&lt;p&gt;Pseudo Code:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;procedure bfs(root:NODE*);
    var target = 0;
    var node = root;
BEGIN
    for each level in tree do
    begin
        printtree(node, target, 0);
        target = target + 1;
    end
END
&lt;/code&gt;&lt;/pre&gt;

&lt;pre&gt;&lt;code&gt;procedure printtree(node:NODE*, target:int, level:int);
BEGIN
    if(target &amp;gt; level) then
    begin
        for each child of node do
            printtree(child, target, level + 1);
    end
    else
        print node;
END
&lt;/code&gt;&lt;/pre&gt;
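&lt;p&gt;The pseudocode above translates to runnable Python as follows; the only memory used is the recursion stack, one frame per level:&lt;/p&gt;

```python
# Runnable version of the pseudocode: print_tree recurses down to the target
# level and emits only the nodes at that level; bfs repeats it once per level,
# producing a level-by-level (breadth-first) ordering.
class Node:
    def __init__(self, value, children=()):
        self.value, self.children = value, list(children)

def print_tree(node, target, level, out):
    if target > level:
        for child in node.children:
            print_tree(child, target, level + 1, out)
    else:
        out.append(node.value)

def bfs(root, height):
    out = []
    for target in range(height):    # one pass per level of the tree
        print_tree(root, target, 0, out)
    return out

tree = Node("A", [Node("B", [Node("D"), Node("E")]),
                  Node("C", [Node("F"), Node("G")])])
print(bfs(tree, 3))   # ['A', 'B', 'C', 'D', 'E', 'F', 'G']
```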

&lt;p&gt;Going back to the same example for a balanced binary tree with nodes: &lt;code&gt;A, B, C, D, E, F, G&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;img class=&quot;center-image-0&quot; style=&quot;width: 48%&quot; src=&quot;./assets/posts/graph-theory/binary-tree.svg&quot; /&gt;&lt;/p&gt;

&lt;p&gt;We initialize the node to the root and set the initial target level to 0. The main BFS loop iterates through each level of the tree, incrementing the target level after processing each one.&lt;/p&gt;

&lt;p&gt;Iteration 1 (target = 0)&lt;/p&gt;
&lt;div class=&quot;table-container&quot;&gt;
&lt;table&gt;
  &lt;tr&gt;
    &lt;th&gt;Step&lt;/th&gt;
    &lt;th&gt;Action&lt;/th&gt;
    &lt;th&gt;Current Call Stack&lt;/th&gt;
    &lt;th&gt;Visited Nodes&lt;/th&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;1&lt;/td&gt;
    &lt;td&gt;Initial setup&lt;/td&gt;
    &lt;td&gt;&lt;/td&gt;
    &lt;td&gt;&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;2&lt;/td&gt;
    &lt;td&gt;Iteration with target=0&lt;/td&gt;
    &lt;td&gt;printtree(A, 0, 0)&lt;/td&gt;
    &lt;td&gt;&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;3&lt;/td&gt;
    &lt;td&gt;Visiting A&lt;/td&gt;
    &lt;td&gt;&lt;/td&gt;
    &lt;td&gt;A&lt;/td&gt;
  &lt;/tr&gt;
&lt;/table&gt;
&lt;/div&gt;

&lt;p&gt;For each level, the &lt;code&gt;printtree&lt;/code&gt; function is called with the current node, the target level, and the current level (starting from zero). Checks if the target level is greater than the current level. If so, recursively call for each child of the current node, incrementing the level by 1. This continues until the target level equals the current level, at which point the node is printed.&lt;/p&gt;

&lt;p&gt;Iteration 2 (target = 1)&lt;/p&gt;
&lt;div class=&quot;table-container&quot;&gt;
&lt;table&gt;
  &lt;tr&gt;
    &lt;th&gt;Step&lt;/th&gt;
    &lt;th&gt;Action&lt;/th&gt;
    &lt;th&gt;Current Call Stack&lt;/th&gt;
    &lt;th&gt;Visited Nodes&lt;/th&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;4&lt;/td&gt;
    &lt;td&gt;target=1&lt;/td&gt;
    &lt;td&gt;&lt;/td&gt;
    &lt;td&gt;A&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;5&lt;/td&gt;
    &lt;td&gt;Iteration with target=1&lt;/td&gt;
    &lt;td&gt;printTree(A, 1, 0)&lt;/td&gt;
    &lt;td&gt;A&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;6&lt;/td&gt;
    &lt;td&gt;Call B&lt;/td&gt;
    &lt;td&gt;printTree(A, 1, 0) → printTree(B, 1, 1)&lt;/td&gt;
    &lt;td&gt;A&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;7&lt;/td&gt;
    &lt;td&gt;Visiting B&lt;/td&gt;
    &lt;td&gt;printTree(A, 1, 0)&lt;/td&gt;
    &lt;td&gt;A, B&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;8&lt;/td&gt;
    &lt;td&gt;Call C&lt;/td&gt;
    &lt;td&gt;printTree(A, 1, 0) → printTree(C, 1, 1)&lt;/td&gt;
    &lt;td&gt;A, B&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;9&lt;/td&gt;
    &lt;td&gt;Visiting C&lt;/td&gt;
    &lt;td&gt;printTree(A, 1, 0)&lt;/td&gt;
    &lt;td&gt;A, B, C&lt;/td&gt;
  &lt;/tr&gt;
&lt;/table&gt;
&lt;/div&gt;

&lt;p&gt;By incrementing the target level and repeating the process until all levels of the tree have been processed, nodes are printed level-by-level, leading to a breadth-first traversal.&lt;/p&gt;

&lt;p&gt;Iteration 3 (target = 2)&lt;/p&gt;
&lt;div class=&quot;table-container&quot;&gt;
&lt;table&gt;
  &lt;tr&gt;
    &lt;th&gt;Step&lt;/th&gt;
    &lt;th&gt;Action&lt;/th&gt;
    &lt;th&gt;Current Call Stack&lt;/th&gt;
    &lt;th&gt;Visited Nodes&lt;/th&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;10&lt;/td&gt;
    &lt;td&gt;target=2&lt;/td&gt;
    &lt;td&gt;&lt;/td&gt;
    &lt;td&gt;A, B, C&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;11&lt;/td&gt;
    &lt;td&gt;Iteration with target=2&lt;/td&gt;
    &lt;td&gt;printTree(A, 2, 0)&lt;/td&gt;
    &lt;td&gt;A, B, C&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;12&lt;/td&gt;
    &lt;td&gt;Call B&lt;/td&gt;
    &lt;td&gt;printTree(A, 2, 0) → printTree(B, 2, 1)&lt;/td&gt;
    &lt;td&gt;A, B, C&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;13&lt;/td&gt;
    &lt;td&gt;Call D&lt;/td&gt;
    &lt;td&gt;printTree(A, 2, 0) → printTree(B, 2, 1) → printTree(D, 2, 2)&lt;/td&gt;
    &lt;td&gt;A, B, C&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;14&lt;/td&gt;
    &lt;td&gt;Visiting D&lt;/td&gt;
    &lt;td&gt;printTree(A, 2, 0) → printTree(B, 2, 1)&lt;/td&gt;
    &lt;td&gt;A, B, C, D&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;15&lt;/td&gt;
    &lt;td&gt;Call E&lt;/td&gt;
    &lt;td&gt;printTree(A, 2, 0) → printTree(B, 2, 1) → printTree(E, 2, 2)&lt;/td&gt;
    &lt;td&gt;A, B, C, D&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;16&lt;/td&gt;
    &lt;td&gt;Visiting E&lt;/td&gt;
    &lt;td&gt;printTree(A, 2, 0) → printTree(B, 2, 1)&lt;/td&gt;
    &lt;td&gt;A, B, C, D, E&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;17&lt;/td&gt;
    &lt;td&gt;Call C&lt;/td&gt;
    &lt;td&gt;printTree(A, 2, 0) → printTree(C, 2, 1)&lt;/td&gt;
    &lt;td&gt;A, B, C, D, E&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;18&lt;/td&gt;
    &lt;td&gt;Call F&lt;/td&gt;
    &lt;td&gt;printTree(A, 2, 0) → printTree(C, 2, 1) → printTree(F, 2, 2)&lt;/td&gt;
    &lt;td&gt;A, B, C, D, E&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;19&lt;/td&gt;
    &lt;td&gt;Visiting F&lt;/td&gt;
    &lt;td&gt;printTree(A, 2, 0) → printTree(C, 2, 1)&lt;/td&gt;
    &lt;td&gt;A, B, C, D, E, F&lt;/td&gt;
  &lt;/tr&gt;
&lt;/table&gt;
&lt;/div&gt;

&lt;h3&gt;4. Recursive BFS: Implementation&lt;/h3&gt;
&lt;p&gt;Without much explanation, here&apos;s an implementation in Java. In the &lt;code&gt;Node&lt;/code&gt; class, &lt;code&gt;children&lt;/code&gt; is an array of &lt;code&gt;Node&lt;/code&gt;s, but it also works with other data structures, such as a &lt;code&gt;LinkedList&lt;/code&gt;.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;class Node {
    char data;
    Node[] children;

    Node(char data, int childCount) {
        this.data = data;
        this.children = new Node[childCount];
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;pre&gt;&lt;code&gt;public class TreeTraversal {

    // BFS subroutine
    boolean printTree(Node node, int target, int level) {
        boolean returnValue = false;
        if (target &amp;gt; level) {
            for (int i = 0; i &amp;lt; node.children.length; i++) {
                if (printTree(node.children[i], target, level + 1)) {
                    returnValue = true;
                }
            }
        } else {
            System.out.print(node.data);
            if (node.children.length &amp;gt; 0) {
                returnValue = true;
            }
        }
        return returnValue;
    }

    // BFS routine
    void printBfsTree(Node root) {
        if (root == null) return;
        int target = 0;
        while (printTree(root, target++, 0)) {
            System.out.println();
        }
    }

    public static void main(String[] args) {
        Node root = new Node(&apos;A&apos;, 2);
        root.children[0] = new Node(&apos;B&apos;, 2);
        root.children[1] = new Node(&apos;C&apos;, 1);
        root.children[0].children[0] = new Node(&apos;D&apos;, 0);
        root.children[0].children[1] = new Node(&apos;E&apos;, 0);
        root.children[1].children[0] = new Node(&apos;F&apos;, 0);

        TreeTraversal treeTraversal = new TreeTraversal();
        treeTraversal.printBfsTree(root);
    }
}
&lt;/code&gt;&lt;/pre&gt;
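&lt;p&gt;As a sanity check, here&apos;s a compact, self-contained variant of the same algorithm (the helper names &lt;code&gt;collect&lt;/code&gt; and &lt;code&gt;levels&lt;/code&gt; are illustrative) that appends each level to a &lt;code&gt;StringBuilder&lt;/code&gt; instead of printing, so the traversal order can be asserted:&lt;/p&gt;

```java
// Same recursive-BFS algorithm as above, but writing into a
// StringBuilder so the level order can be checked programmatically.
class Node {
    char data;
    Node[] children;

    Node(char data, int childCount) {
        this.data = data;
        this.children = new Node[childCount];
    }
}

public class BfsCheck {
    // Returns true if any node printed at this target level has children,
    // i.e. there is at least one more level to process.
    static boolean collect(Node node, int target, int level, StringBuilder out) {
        boolean deeper = false;
        if (target > level) {
            for (Node child : node.children) {
                if (collect(child, target, level + 1, out)) deeper = true;
            }
        } else {
            out.append(node.data);
            if (node.children.length > 0) deeper = true;
        }
        return deeper;
    }

    static String levels(Node root) {
        StringBuilder out = new StringBuilder();
        int target = 0;
        while (collect(root, target++, 0, out)) out.append('\n');
        return out.toString();
    }

    public static void main(String[] args) {
        Node root = new Node('A', 2);
        root.children[0] = new Node('B', 2);
        root.children[1] = new Node('C', 1);
        root.children[0].children[0] = new Node('D', 0);
        root.children[0].children[1] = new Node('E', 0);
        root.children[1].children[0] = new Node('F', 0);
        System.out.println(levels(root)); // prints A, BC, DEF on separate lines
    }
}
```

&lt;p&gt;For the tree above, &lt;code&gt;levels(root)&lt;/code&gt; yields &lt;code&gt;A&lt;/code&gt;, &lt;code&gt;BC&lt;/code&gt;, &lt;code&gt;DEF&lt;/code&gt;, one line per level, matching the walkthrough tables.&lt;/p&gt;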

&lt;h3&gt;5. Conclusion&lt;/h3&gt;
&lt;p&gt;The prime difference between queue-based BFS and stack-based (recursive) BFS is in space complexity: the queue-based version grows with the width of the tree (the number of nodes at the widest level), while the stack-based version grows with the depth/height of the tree.&lt;/p&gt;

&lt;p&gt;Taking an example, say we have a balanced tree with 9 levels (root node being level 1) and each node has 10 children. In the queue-based BFS solution, the number of nodes in the queue at level 9 would be &lt;code&gt;C^(N - 1)&lt;/code&gt;, where &lt;code&gt;N&lt;/code&gt; is the number of levels and &lt;code&gt;C&lt;/code&gt; is the number of children per node. For &lt;code&gt;C = 10&lt;/code&gt; and &lt;code&gt;N = 9&lt;/code&gt;, this results in &lt;code&gt;10^(9 - 1) = 10^8&lt;/code&gt;. Presuming each node reference is 4 bytes, that&apos;s &lt;span class=&quot;underline&quot;&gt;400 MB&lt;/span&gt; in the queue (at level 9).&lt;/p&gt;

&lt;p&gt;In the stack-based solution, on the other hand, the call stack can hold at most &lt;code&gt;L&lt;/code&gt; (number of levels) recursive calls, since only one call per depth level is active at a time. Realistically, each stack frame also holds a return address, local variables, saved registers, etc.; assuming a frame size of 64 bytes, the call stack occupies at most &lt;code&gt;9 × 64&lt;/code&gt; bytes = &lt;span class=&quot;underline&quot;&gt;576 bytes&lt;/span&gt;.&lt;/p&gt;

&lt;p&gt;This is a considerable space saving! For wide, shallow trees, and even more so at higher branching factors, the stack-based solution vastly outperforms queue-based BFS in space. The tradeoff is time: the recursive version re-walks the upper levels once per target level, so for an irregular or very deep tree, queue-based BFS performs better.&lt;/p&gt;
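&lt;p&gt;As a sanity check on the arithmetic, the two estimates can be computed directly (assuming, as above, 4-byte node references and 64-byte stack frames):&lt;/p&gt;

```java
public class SpaceEstimate {
    public static void main(String[] args) {
        int children = 10, levels = 9;
        // Queue-based BFS: nodes held in the queue at the deepest level.
        long queueNodes = (long) Math.pow(children, levels - 1); // 10^8 nodes
        long queueBytes = queueNodes * 4;                        // 400,000,000 bytes ≈ 400 MB
        // Recursive BFS: one active frame per depth level.
        long stackBytes = levels * 64L;                          // 576 bytes
        System.out.println(queueBytes + " bytes vs " + stackBytes + " bytes");
    }
}
```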

&lt;h3&gt;6. References&lt;/h3&gt;
&lt;pre style=&quot;max-height: 180px&quot;&gt;&lt;code&gt;[1] Pravin Kumar Sinha, &quot;Stack-based breadth-first search tree traversal,&quot; IBM Developer. [Online]. Available: https://developer.ibm.com/articles/au-aix-stack-tree-traversal.
[2] Wikipedia contributors, &quot;Breadth-first search,&quot; Wikipedia, The Free Encyclopedia. [Online]. Available: https://en.wikipedia.org/wiki/Breadth-first_search.
[3] Adesh Nalpet Adimurthy, &quot;Graph Theory: Search and Traversal,&quot; PyBlog, 2024. [Online]. Available: https://www.pyblog.xyz/graph-traversal.
&lt;/code&gt;&lt;/pre&gt;</content><author><name>Adesh Nalpet Adimurthy</name></author><category term="Code on the Road" /><category term="Graph Theory" /><category term="Data Structures" /><summary type="html">1. BFS using Queue Just in the prior post on graph traversal, we went into details of Depth-First Search (DFS) and Breadth-First Search (BFS). BFS is a way of traversing down the graph, level-by-level. Specifically for a balanced-tree, the first/root node is visited first, followed by its immediate children, then followed by the next level children, and so on. Here&apos;s the same example of BFS using a queue:</summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://pyblog.xyz/assets/featured/webp/stack-bfs.webp" /><media:content medium="image" url="https://pyblog.xyz/assets/featured/webp/stack-bfs.webp" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Graph Theory: Search and Traversal</title><link href="https://pyblog.xyz/graph-traversal" rel="alternate" type="text/html" title="Graph Theory: Search and Traversal" /><published>2024-07-17T00:00:00+00:00</published><updated>2024-07-17T00:00:00+00:00</updated><id>https://pyblog.xyz/graph-traversal</id><content type="html" xml:base="https://pyblog.xyz/graph-traversal">&lt;h3&gt;0. Graph Traversal&lt;/h3&gt;
&lt;p&gt;Breadth-First Search (&lt;a href=&quot;https://en.wikipedia.org/wiki/Breadth-first_search&quot; target=&quot;_blank&quot;&gt;BFS&lt;/a&gt;) and Depth-First Search (&lt;a href=&quot;https://en.wikipedia.org/wiki/Depth-first_search&quot; target=&quot;_blank&quot;&gt;DFS&lt;/a&gt;) are two of the most commonly used graph traversal methods.&lt;/p&gt;
&lt;p&gt;The traversal of a graph, whether BFS or DFS, involves two main concepts: visiting a node and exploring a node. Exploration refers to visiting all the children/adjacent nodes.&lt;/p&gt;
&lt;p&gt;Among BFS and DFS, Depth-First Search is more intuitive to perform, so let&apos;s first explore DFS to set a clear standpoint on what BFS is not.&lt;/p&gt;

&lt;h3&gt;1. Depth First Search&lt;/h3&gt;
&lt;p&gt;Depth-First is the process of traversing (visiting and exploring) down the graph until we get to a leaf node or a cycle (re-visiting a node that&apos;s already explored). Every time we encounter one of these conditions, we head back to the last parent node (previous level node) and explore an adjacent node (until leaf or cycle) and repeat the process.&lt;/p&gt;

&lt;p&gt;In other words: traverse through the tree by visiting all of the children, grandchildren, great-grandchildren (and so on) until the end of a path, only then traverse a level back to start a new path.&lt;/p&gt;

&lt;div class=&quot;slider&quot; id=&quot;slider1&quot;&gt;
  &lt;div class=&quot;slides center-image-0 center-image-40&quot;&gt;
    &lt;img src=&quot;./assets/posts/graph-theory/dfs-basic/dfs-basic-Page-1.svg&quot; class=&quot;slide&quot; /&gt;
    &lt;img src=&quot;./assets/posts/graph-theory/dfs-basic/dfs-basic-Page-2.svg&quot; class=&quot;slide&quot; /&gt;
    &lt;img src=&quot;./assets/posts/graph-theory/dfs-basic/dfs-basic-Page-3.svg&quot; class=&quot;slide&quot; /&gt;
    &lt;img src=&quot;./assets/posts/graph-theory/dfs-basic/dfs-basic-Page-4.svg&quot; class=&quot;slide&quot; /&gt;
    &lt;img src=&quot;./assets/posts/graph-theory/dfs-basic/dfs-basic-Page-5.svg&quot; class=&quot;slide&quot; /&gt;
    &lt;img src=&quot;./assets/posts/graph-theory/dfs-basic/dfs-basic-Page-6.svg&quot; class=&quot;slide&quot; /&gt;
    &lt;img src=&quot;./assets/posts/graph-theory/dfs-basic/dfs-basic-Page-7.svg&quot; class=&quot;slide&quot; /&gt;
    &lt;img src=&quot;./assets/posts/graph-theory/dfs-basic/dfs-basic-Page-8.svg&quot; class=&quot;slide&quot; /&gt;
  &lt;/div&gt;
  &lt;div class=&quot;controls&quot;&gt;
    &lt;button onclick=&quot;plusSlides(-1, &apos;slider1&apos;)&quot; class=&quot;prev black-button&quot;&gt;Prev&lt;/button&gt;
    &lt;button onclick=&quot;playSlides(&apos;slider1&apos;)&quot; class=&quot;play black-button&quot;&gt;Play&lt;/button&gt;
    &lt;button onclick=&quot;stopSlides(&apos;slider1&apos;)&quot; class=&quot;stop black-button&quot; hidden=&quot;&quot;&gt;Stop&lt;/button&gt;
    &lt;button onclick=&quot;plusSlides(1, &apos;slider1&apos;)&quot; class=&quot;next black-button&quot;&gt;Next&lt;/button&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;p&gt;Explanation of the above example:&lt;/p&gt;
&lt;ul class=&quot;one-line-list&quot;&gt;
&lt;li&gt;Starting with &lt;code&gt;Vertex A&lt;/code&gt; - start exploration, say, we go to &lt;code&gt;Node B&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Node A&lt;/code&gt; has two other adjacent vertices, but in DFS, we go depth-first&lt;/li&gt;
&lt;li&gt;Further exploring the visited &lt;code&gt;vertex B&lt;/code&gt;, head to &lt;code&gt;vertex C&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Cannot further explore &lt;code&gt;Node C&lt;/code&gt; as it&apos;s a leaf node - hence, &lt;code&gt;Node C&lt;/code&gt; is completely explored&lt;/li&gt;
&lt;li&gt;Head back to its parent (back-track prior level) and explore the next adjacent node, &lt;code&gt;Vertex D&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Similarly, &lt;code&gt;Vertex E&lt;/code&gt;. Now that all adjacent nodes of &lt;code&gt;Vertex B&lt;/code&gt; are already explored, head back a level again (Back to &lt;code&gt;Node A&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Visit F, explore F; head back to &lt;code&gt;Node A&lt;/code&gt;. Visit G, explore G; head back to &lt;code&gt;Node A&lt;/code&gt;. DFS is now complete&lt;/li&gt;
&lt;li&gt;Order of visiting nodes: &lt;code&gt;A, B, C, D, E, F, G&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Notice that when there is more than one unvisited adjacent node, we choose the next node to explore at random, so there are several possible paths to traverse using DFS. Defining specific rules for which node to explore next brings up new DFS strategies (in the case of trees: pre-order, in-order and post-order traversal).&lt;/p&gt;

&lt;h3&gt;1.1. DFS: Detecting Cycles&lt;/h3&gt;
&lt;p&gt;In the previous example, we saw that we &lt;a href=&quot;https://en.wikipedia.org/wiki/Backtracking&quot; target=&quot;_blank&quot;&gt;back-track&lt;/a&gt; when reaching a leaf node. This time, taking a graph as the example to cover the &quot;detecting a cycle&quot; scenario, i.e., visiting a node that was previously visited.&lt;/p&gt;
&lt;div class=&quot;slider&quot; id=&quot;slider2&quot;&gt;
  &lt;div class=&quot;slides center-image-0 center-image-60&quot;&gt;
    &lt;img src=&quot;./assets/posts/graph-theory/dfs-graph/dfs-graph-Page-1.svg&quot; class=&quot;slide&quot; /&gt;
    &lt;img src=&quot;./assets/posts/graph-theory/dfs-graph/dfs-graph-Page-2.svg&quot; class=&quot;slide&quot; /&gt;
    &lt;img src=&quot;./assets/posts/graph-theory/dfs-graph/dfs-graph-Page-3.svg&quot; class=&quot;slide&quot; /&gt;
    &lt;img src=&quot;./assets/posts/graph-theory/dfs-graph/dfs-graph-Page-4.svg&quot; class=&quot;slide&quot; /&gt;
    &lt;img src=&quot;./assets/posts/graph-theory/dfs-graph/dfs-graph-Page-5.svg&quot; class=&quot;slide&quot; /&gt;
    &lt;img src=&quot;./assets/posts/graph-theory/dfs-graph/dfs-graph-Page-6.svg&quot; class=&quot;slide&quot; /&gt;
    &lt;img src=&quot;./assets/posts/graph-theory/dfs-graph/dfs-graph-Page-7.svg&quot; class=&quot;slide&quot; /&gt;
    &lt;img src=&quot;./assets/posts/graph-theory/dfs-graph/dfs-graph-Page-8.svg&quot; class=&quot;slide&quot; /&gt;
    &lt;img src=&quot;./assets/posts/graph-theory/dfs-graph/dfs-graph-Page-9.svg&quot; class=&quot;slide&quot; /&gt;
    &lt;img src=&quot;./assets/posts/graph-theory/dfs-graph/dfs-graph-Page-10.svg&quot; class=&quot;slide&quot; /&gt;
    &lt;img src=&quot;./assets/posts/graph-theory/dfs-graph/dfs-graph-Page-11.svg&quot; class=&quot;slide&quot; /&gt;
    &lt;img src=&quot;./assets/posts/graph-theory/dfs-graph/dfs-graph-Page-12.svg&quot; class=&quot;slide&quot; /&gt;
    &lt;img src=&quot;./assets/posts/graph-theory/dfs-graph/dfs-graph-Page-13.svg&quot; class=&quot;slide&quot; /&gt;
    &lt;img src=&quot;./assets/posts/graph-theory/dfs-graph/dfs-graph-Page-14.svg&quot; class=&quot;slide&quot; /&gt;
  &lt;/div&gt;
  &lt;div class=&quot;controls&quot;&gt;
    &lt;button onclick=&quot;plusSlides(-1, &apos;slider2&apos;)&quot; class=&quot;prev black-button&quot;&gt;Prev&lt;/button&gt;
    &lt;button onclick=&quot;playSlides(&apos;slider2&apos;)&quot; class=&quot;play black-button&quot;&gt;Play&lt;/button&gt;
    &lt;button onclick=&quot;stopSlides(&apos;slider2&apos;)&quot; class=&quot;stop black-button&quot; hidden=&quot;&quot;&gt;Stop&lt;/button&gt;
    &lt;button onclick=&quot;plusSlides(1, &apos;slider2&apos;)&quot; class=&quot;next black-button&quot;&gt;Next&lt;/button&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;p&gt;I have highlighted when re-visiting &lt;code&gt;Node G&lt;/code&gt; (Slide #6), followed by back-tracking and visiting &lt;code&gt;Node J&lt;/code&gt;. Again, this is one particular Depth First Search traversal, but it can be done in many other ways by choosing a different &quot;next&quot; node to visit (at every explore step).&lt;/p&gt;

&lt;h3&gt;1.2. DFS: Implementation&lt;/h3&gt;
&lt;p&gt;The core of the solution is to find a way to back-track and head on a different path when encountering two scenarios: reaching a dead-end (leaf node) and reaching an already visited node (cycle).&lt;/p&gt;

&lt;h3&gt;1.2.1. DFS: Stack&lt;/h3&gt;
&lt;p&gt;The intuition behind using a &lt;a href=&quot;https://en.wikipedia.org/wiki/Stack_(abstract_data_type)&quot; target=&quot;_blank&quot;&gt;stack&lt;/a&gt; is that when we reach a dead-end, we want to get back to the most recently added node (LIFO: Last-In First-Out) and explore other paths. The stack lets us explore each path deeply before backtracking to the last branching node.&lt;/p&gt;

&lt;p&gt;&lt;img class=&quot;slides center-image-0&quot; style=&quot;width: 28%&quot; src=&quot;./assets/posts/graph-theory/stack-ds.svg&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Easier to understand with visualization:&lt;/p&gt;

&lt;div class=&quot;slider&quot; id=&quot;slider3&quot;&gt;
  &lt;div class=&quot;slides center-image-0 center-image-80&quot;&gt;
    &lt;img src=&quot;./assets/posts/graph-theory/dfs-stack/dfs-tree-Page-1.svg&quot; class=&quot;slide&quot; /&gt;
    &lt;img src=&quot;./assets/posts/graph-theory/dfs-stack/dfs-tree-Page-2.svg&quot; class=&quot;slide&quot; /&gt;
    &lt;img src=&quot;./assets/posts/graph-theory/dfs-stack/dfs-tree-Page-3.svg&quot; class=&quot;slide&quot; /&gt;
    &lt;img src=&quot;./assets/posts/graph-theory/dfs-stack/dfs-tree-Page-4.svg&quot; class=&quot;slide&quot; /&gt;
    &lt;img src=&quot;./assets/posts/graph-theory/dfs-stack/dfs-tree-Page-5.svg&quot; class=&quot;slide&quot; /&gt;
    &lt;img src=&quot;./assets/posts/graph-theory/dfs-stack/dfs-tree-Page-6.svg&quot; class=&quot;slide&quot; /&gt;
    &lt;img src=&quot;./assets/posts/graph-theory/dfs-stack/dfs-tree-Page-7.svg&quot; class=&quot;slide&quot; /&gt;
    &lt;img src=&quot;./assets/posts/graph-theory/dfs-stack/dfs-tree-Page-8.svg&quot; class=&quot;slide&quot; /&gt;
  &lt;/div&gt;
  &lt;div class=&quot;controls&quot;&gt;
    &lt;button onclick=&quot;plusSlides(-1, &apos;slider3&apos;)&quot; class=&quot;prev black-button&quot;&gt;Prev&lt;/button&gt;
    &lt;button onclick=&quot;playSlides(&apos;slider3&apos;)&quot; class=&quot;play black-button&quot;&gt;Play&lt;/button&gt;
    &lt;button onclick=&quot;stopSlides(&apos;slider3&apos;)&quot; class=&quot;stop black-button&quot; hidden=&quot;&quot;&gt;Stop&lt;/button&gt;
    &lt;button onclick=&quot;plusSlides(1, &apos;slider3&apos;)&quot; class=&quot;next black-button&quot;&gt;Next&lt;/button&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;p&gt;The key points to notice here are the stack &lt;code&gt;pop&lt;/code&gt; operations. On reaching &lt;code&gt;node D&lt;/code&gt;, a leaf node, &lt;code&gt;pop()&lt;/code&gt; to explore other paths, i.e., &lt;code&gt;Node E&lt;/code&gt;. Similarly, &lt;code&gt;Node E&lt;/code&gt; is a leaf node, so &lt;code&gt;pop()&lt;/code&gt; to head back and explore &lt;code&gt;Node C&lt;/code&gt;.&lt;/p&gt;

&lt;div class=&quot;table-container&quot;&gt;
&lt;table style=&quot;width: 800px;&quot;&gt;
  &lt;tr&gt;
    &lt;th&gt;Step&lt;/th&gt;
    &lt;th&gt;Action&lt;/th&gt;
    &lt;th&gt;Stack State&lt;/th&gt;
    &lt;th&gt;Visited Nodes&lt;/th&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;1&lt;/td&gt;
    &lt;td&gt;Push A&lt;/td&gt;
    &lt;td&gt;[A]&lt;/td&gt;
    &lt;td&gt;{}&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;2&lt;/td&gt;
    &lt;td&gt;Pop A, Push C, B&lt;/td&gt;
    &lt;td&gt;[C, B]&lt;/td&gt;
    &lt;td&gt;{A}&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;3&lt;/td&gt;
    &lt;td&gt;Pop B, Push E, D&lt;/td&gt;
    &lt;td&gt;[C, E, D]&lt;/td&gt;
    &lt;td&gt;{A, B}&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;4&lt;/td&gt;
    &lt;td&gt;Pop D&lt;/td&gt;
    &lt;td&gt;[C, E]&lt;/td&gt;
    &lt;td&gt;{A, B, D}&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;5&lt;/td&gt;
    &lt;td&gt;Pop E&lt;/td&gt;
    &lt;td&gt;[C]&lt;/td&gt;
    &lt;td&gt;{A, B, D, E}&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;6&lt;/td&gt;
    &lt;td&gt;Pop C, Push G, F&lt;/td&gt;
    &lt;td&gt;[G, F]&lt;/td&gt;
    &lt;td&gt;{A, B, D, E, C}&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;7&lt;/td&gt;
    &lt;td&gt;Pop F&lt;/td&gt;
    &lt;td&gt;[G]&lt;/td&gt;
    &lt;td&gt;{A, B, D, E, C, F}&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;8&lt;/td&gt;
    &lt;td&gt;Pop G&lt;/td&gt;
    &lt;td&gt;[]&lt;/td&gt;
    &lt;td&gt;{A, B, D, E, C, F, G}&lt;/td&gt;
  &lt;/tr&gt;
&lt;/table&gt;
&lt;/div&gt;

&lt;p&gt;Note: When visiting a node, add all adjacent nodes to the stack to ensure all possible paths from the current node are explored. This is essential for DFS to correctly traverse the entire graph.&lt;/p&gt;

&lt;p&gt;Pseudo Code: wrapping it all up in a dozen lines&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;DFS-Iterative(graph, start):
    let stack be a stack
    let visited be a set
    stack.push(start)
    
    while stack is not empty:
        node = stack.pop()
        if node is not in visited:
            visit(node)
            visited.add(node)
            for each neighbor of node in graph (Optional: reverse order):
                if neighbor is not in visited:
                    stack.push(neighbor)
&lt;/code&gt;&lt;/pre&gt;
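&lt;p&gt;The pseudo code translates almost line-for-line into Java. A minimal sketch, assuming the graph is stored as an adjacency-list &lt;code&gt;Map&lt;/code&gt; (the representation and names are illustrative; the example graph encodes the tree from the table):&lt;/p&gt;

```java
import java.util.*;

public class DfsIterative {
    // Iterative DFS: pop a node, visit it if unseen, push its unvisited neighbors.
    static List<Character> dfs(Map<Character, List<Character>> graph, char start) {
        List<Character> order = new ArrayList<>();
        Deque<Character> stack = new ArrayDeque<>();
        Set<Character> visited = new HashSet<>();
        stack.push(start);
        while (!stack.isEmpty()) {
            char node = stack.pop();
            if (visited.add(node)) {          // add() returns false if already visited
                order.add(node);
                for (char neighbor : graph.getOrDefault(node, List.of())) {
                    if (!visited.contains(neighbor)) stack.push(neighbor);
                }
            }
        }
        return order;
    }

    public static void main(String[] args) {
        Map<Character, List<Character>> graph = new HashMap<>();
        graph.put('A', List.of('B', 'C'));
        graph.put('B', List.of('D', 'E'));
        graph.put('C', List.of('F', 'G'));
        // Pushing neighbors in listed order means the LAST child is explored first.
        System.out.println(dfs(graph, 'A')); // [A, C, G, F, B, E, D]
    }
}
```

&lt;p&gt;Since the last-pushed neighbor is popped first, this explores the last child first; to reproduce the table&apos;s order (&lt;code&gt;A, B, D, E, C, F, G&lt;/code&gt;), push the neighbors in reverse, as the pseudo code&apos;s optional step notes.&lt;/p&gt;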

&lt;h3&gt;1.2.2. DFS: Recursion&lt;/h3&gt;
&lt;p&gt;The &lt;a href=&quot;https://en.wikipedia.org/wiki/Recursion&quot; target=&quot;_blank&quot;&gt;recursion&lt;/a&gt; solution is quite similar to the above stack solution, where we rely on the call stack as opposed to a user-defined stack.&lt;/p&gt;

&lt;div class=&quot;slider&quot; id=&quot;slider4&quot;&gt;
  &lt;div class=&quot;slides center-image-0 center-image-80&quot;&gt;
    &lt;img src=&quot;./assets/posts/graph-theory/dfs-call-stack/dfs-call-stack-Page-1.svg&quot; class=&quot;slide&quot; /&gt;
    &lt;img src=&quot;./assets/posts/graph-theory/dfs-call-stack/dfs-call-stack-Page-2.svg&quot; class=&quot;slide&quot; /&gt;
    &lt;img src=&quot;./assets/posts/graph-theory/dfs-call-stack/dfs-call-stack-Page-3.svg&quot; class=&quot;slide&quot; /&gt;
    &lt;img src=&quot;./assets/posts/graph-theory/dfs-call-stack/dfs-call-stack-Page-4.svg&quot; class=&quot;slide&quot; /&gt;
    &lt;img src=&quot;./assets/posts/graph-theory/dfs-call-stack/dfs-call-stack-Page-5.svg&quot; class=&quot;slide&quot; /&gt;
    &lt;img src=&quot;./assets/posts/graph-theory/dfs-call-stack/dfs-call-stack-Page-6.svg&quot; class=&quot;slide&quot; /&gt;
    &lt;img src=&quot;./assets/posts/graph-theory/dfs-call-stack/dfs-call-stack-Page-7.svg&quot; class=&quot;slide&quot; /&gt;
    &lt;img src=&quot;./assets/posts/graph-theory/dfs-call-stack/dfs-call-stack-Page-8.svg&quot; class=&quot;slide&quot; /&gt;
    &lt;img src=&quot;./assets/posts/graph-theory/dfs-call-stack/dfs-call-stack-Page-9.svg&quot; class=&quot;slide&quot; /&gt;
    &lt;img src=&quot;./assets/posts/graph-theory/dfs-call-stack/dfs-call-stack-Page-10.svg&quot; class=&quot;slide&quot; /&gt;
    &lt;img src=&quot;./assets/posts/graph-theory/dfs-call-stack/dfs-call-stack-Page-11.svg&quot; class=&quot;slide&quot; /&gt;
    &lt;img src=&quot;./assets/posts/graph-theory/dfs-call-stack/dfs-call-stack-Page-12.svg&quot; class=&quot;slide&quot; /&gt;
  &lt;/div&gt;
  &lt;div class=&quot;controls&quot;&gt;
    &lt;button onclick=&quot;plusSlides(-1, &apos;slider4&apos;)&quot; class=&quot;prev black-button&quot;&gt;Prev&lt;/button&gt;
    &lt;button onclick=&quot;playSlides(&apos;slider4&apos;)&quot; class=&quot;play black-button&quot;&gt;Play&lt;/button&gt;
    &lt;button onclick=&quot;stopSlides(&apos;slider4&apos;)&quot; class=&quot;stop black-button&quot; hidden=&quot;&quot;&gt;Stop&lt;/button&gt;
    &lt;button onclick=&quot;plusSlides(1, &apos;slider4&apos;)&quot; class=&quot;next black-button&quot;&gt;Next&lt;/button&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;p&gt;There&apos;s a small difference (in traversal order). In the recursive solution, you handle each node when you see it. Thus, the first node you handle is the first child.&lt;/p&gt;

&lt;p&gt;Whereas in an iterative approach, you first insert all the elements into the stack and then handle the head of the stack (which is the last node inserted). Thus, the first node you handle is the last child.&lt;/p&gt;

&lt;div class=&quot;table-container&quot;&gt;
&lt;table style=&quot;width: 800px;&quot;&gt;
  &lt;tr&gt;
    &lt;th&gt;Step&lt;/th&gt;
    &lt;th&gt;Action&lt;/th&gt;
    &lt;th&gt;Call Stack State&lt;/th&gt;
    &lt;th&gt;Visited Nodes&lt;/th&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;1&lt;/td&gt;
    &lt;td&gt;Call on A&lt;/td&gt;
    &lt;td&gt;[A]&lt;/td&gt;
    &lt;td&gt;{}&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;2&lt;/td&gt;
    &lt;td&gt;Visit A, Call on B&lt;/td&gt;
    &lt;td&gt;[A, B]&lt;/td&gt;
    &lt;td&gt;{A}&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;3&lt;/td&gt;
    &lt;td&gt;Visit B, Call on D&lt;/td&gt;
    &lt;td&gt;[A, B, D]&lt;/td&gt;
    &lt;td&gt;{A, B}&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;4&lt;/td&gt;
    &lt;td&gt;Visit D, Return from D&lt;/td&gt;
    &lt;td&gt;[A, B]&lt;/td&gt;
    &lt;td&gt;{A, B, D}&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;5&lt;/td&gt;
    &lt;td&gt;Call on E&lt;/td&gt;
    &lt;td&gt;[A, B, E]&lt;/td&gt;
    &lt;td&gt;{A, B, D}&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;6&lt;/td&gt;
    &lt;td&gt;Visit E, Return from E&lt;/td&gt;
    &lt;td&gt;[A, B]&lt;/td&gt;
    &lt;td&gt;{A, B, D, E}&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;7&lt;/td&gt;
    &lt;td&gt;Return from B, Call on C&lt;/td&gt;
    &lt;td&gt;[A, C]&lt;/td&gt;
    &lt;td&gt;{A, B, D, E}&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;8&lt;/td&gt;
    &lt;td&gt;Visit C, Call on F&lt;/td&gt;
    &lt;td&gt;[A, C, F]&lt;/td&gt;
    &lt;td&gt;{A, B, D, E, C}&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;9&lt;/td&gt;
    &lt;td&gt;Visit F, Return from F&lt;/td&gt;
    &lt;td&gt;[A, C]&lt;/td&gt;
    &lt;td&gt;{A, B, D, E, C, F}&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;10&lt;/td&gt;
    &lt;td&gt;Call on G&lt;/td&gt;
    &lt;td&gt;[A, C, G]&lt;/td&gt;
    &lt;td&gt;{A, B, D, E, C, F}&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;11&lt;/td&gt;
    &lt;td&gt;Visit G, Return from G&lt;/td&gt;
    &lt;td&gt;[A, C]&lt;/td&gt;
    &lt;td&gt;{A, B, D, E, C, F, G}&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;12&lt;/td&gt;
    &lt;td&gt;Return from C&lt;/td&gt;
    &lt;td&gt;[A]&lt;/td&gt;
    &lt;td&gt;{A, B, D, E, C, F, G}&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;13&lt;/td&gt;
    &lt;td&gt;Return from A&lt;/td&gt;
    &lt;td&gt;[]&lt;/td&gt;
    &lt;td&gt;{A, B, D, E, C, F, G}&lt;/td&gt;
  &lt;/tr&gt;
&lt;/table&gt;
&lt;/div&gt;

&lt;p&gt;Pseudo Code: now down to 5 lines of code&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;DFS-Recursive(node, visited):
    if node is not in visited:
        visit(node)
        visited.add(node)
        for each neighbor of node:
            DFS-Recursive(neighbor, visited)
&lt;/code&gt;&lt;/pre&gt;
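&lt;p&gt;A minimal Java sketch of the recursive version, again assuming the graph is stored as an adjacency-list &lt;code&gt;Map&lt;/code&gt; (representation and names are illustrative):&lt;/p&gt;

```java
import java.util.*;

public class DfsRecursive {
    // Recursive DFS: visit the node, then recurse into each unvisited neighbor.
    static void dfs(Map<Character, List<Character>> graph, char node,
                    Set<Character> visited, List<Character> order) {
        if (visited.add(node)) {              // add() returns false if already visited
            order.add(node);
            for (char neighbor : graph.getOrDefault(node, List.of())) {
                dfs(graph, neighbor, visited, order);
            }
        }
    }

    public static void main(String[] args) {
        Map<Character, List<Character>> graph = new HashMap<>();
        graph.put('A', List.of('B', 'C'));
        graph.put('B', List.of('D', 'E'));
        graph.put('C', List.of('F', 'G'));
        List<Character> order = new ArrayList<>();
        dfs(graph, 'A', new HashSet<>(), order);
        System.out.println(order); // [A, B, D, E, C, F, G] — matches the call-stack table
    }
}
```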

&lt;p&gt;Note: if you want the user-defined stack solution to yield the same result as the recursive solution, you need to add elements to the stack in reverse order. For each node, insert its last child first and its first child last.&lt;/p&gt;

&lt;h3&gt;2. Breadth First Search&lt;/h3&gt;

&lt;p&gt;Also called Level Order Search. Compared to DFS, BFS explores level-by-level (in layers): start with a node and visit each of its adjacent nodes (without diving down to a leaf), repeating until all adjacent nodes are visited; then choose one of those children, move a level down, and visit all of its adjacent nodes; repeat the process.&lt;/p&gt;

&lt;p&gt;In other words: traverse through one entire level of children nodes first before moving on to traverse through the grandchildren nodes. Repeat: traverse through an entire level of grandchildren nodes before going on to traverse through great-grandchildren nodes.&lt;/p&gt;

&lt;div class=&quot;slider&quot; id=&quot;slider5&quot;&gt;
  &lt;div class=&quot;slides center-image-0 center-image-40&quot;&gt;
    &lt;img src=&quot;./assets/posts/graph-theory/bfs-basic/bfs-basic-Page-1.svg&quot; class=&quot;slide&quot; /&gt;
    &lt;img src=&quot;./assets/posts/graph-theory/bfs-basic/bfs-basic-Page-2.svg&quot; class=&quot;slide&quot; /&gt;
    &lt;img src=&quot;./assets/posts/graph-theory/bfs-basic/bfs-basic-Page-3.svg&quot; class=&quot;slide&quot; /&gt;
    &lt;img src=&quot;./assets/posts/graph-theory/bfs-basic/bfs-basic-Page-4.svg&quot; class=&quot;slide&quot; /&gt;
    &lt;img src=&quot;./assets/posts/graph-theory/bfs-basic/bfs-basic-Page-5.svg&quot; class=&quot;slide&quot; /&gt;
    &lt;img src=&quot;./assets/posts/graph-theory/bfs-basic/bfs-basic-Page-6.svg&quot; class=&quot;slide&quot; /&gt;
    &lt;img src=&quot;./assets/posts/graph-theory/bfs-basic/bfs-basic-Page-7.svg&quot; class=&quot;slide&quot; /&gt;
    &lt;img src=&quot;./assets/posts/graph-theory/bfs-basic/bfs-basic-Page-8.svg&quot; class=&quot;slide&quot; /&gt;
  &lt;/div&gt;
  &lt;div class=&quot;controls&quot;&gt;
    &lt;button onclick=&quot;plusSlides(-1, &apos;slider5&apos;)&quot; class=&quot;prev black-button&quot;&gt;Prev&lt;/button&gt;
    &lt;button onclick=&quot;playSlides(&apos;slider5&apos;)&quot; class=&quot;play black-button&quot;&gt;Play&lt;/button&gt;
    &lt;button onclick=&quot;stopSlides(&apos;slider5&apos;)&quot; class=&quot;stop black-button&quot; hidden=&quot;&quot;&gt;Stop&lt;/button&gt;
    &lt;button onclick=&quot;plusSlides(1, &apos;slider5&apos;)&quot; class=&quot;next black-button&quot;&gt;Next&lt;/button&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;p&gt;Explanation of the above example:&lt;/p&gt;
&lt;ul class=&quot;one-line-list&quot;&gt;
&lt;li&gt;Starting with &lt;code&gt;vertex A&lt;/code&gt; (Visit A) - start exploration of all adjacent vertices.&lt;/li&gt;
&lt;li&gt;Explore adjacent nodes in any order, in this case: &lt;code&gt;Node B&lt;/code&gt;, followed by &lt;code&gt;F&lt;/code&gt; and &lt;code&gt;G&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Cannot explore any further, as all adjacent nodes/children are visited.&lt;/li&gt;
&lt;li&gt;Explore any one of the children, say &lt;code&gt;Node B&lt;/code&gt;, and visit all the adjacent nodes of &lt;code&gt;B&lt;/code&gt;: &lt;code&gt;E, C, and D&lt;/code&gt; (in any order).&lt;/li&gt;
&lt;li&gt;Again, cannot explore further, as all children are visited.&lt;/li&gt;
&lt;li&gt;Similar to &lt;code&gt;Node B&lt;/code&gt;, explore &lt;code&gt;Node G&lt;/code&gt; and &lt;code&gt;F&lt;/code&gt; (nothing to explore). BFS is now complete.&lt;/li&gt;
&lt;li&gt;Order of visiting nodes: &lt;code&gt;A, B, F, G, E, C, D&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;2.1 BFS: Implementation&lt;/h3&gt;

&lt;p&gt;Similar to DFS, we need to know if a node is &quot;visited&quot; in order to prevent cycles, i.e., re-visiting a node. Typically, BFS is implemented using a &lt;a href=&quot;https://en.wikipedia.org/wiki/Queue_(abstract_data_type)&quot; target=&quot;_blank&quot;&gt;queue&lt;/a&gt; (FIFO: First-In First-Out) data structure. I wouldn&apos;t necessarily say that it&apos;s impossible to solve it with a stack, but it&apos;s definitely not conventional and introduces complexity.&lt;/p&gt;

&lt;p&gt;&lt;img class=&quot;slides center-image-0 center-image-35&quot; src=&quot;./assets/posts/graph-theory/queue-ds.svg&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Fun Fact: in the worst-case scenario (for Trees), a stack-based BFS performs better than a queue-based BFS. I&apos;ll explain more on this in a different post dedicated to Trees.&lt;/p&gt;

&lt;div class=&quot;slider&quot; id=&quot;slider6&quot;&gt;
  &lt;div class=&quot;slides center-image-0 center-image-80&quot;&gt;
    &lt;img src=&quot;./assets/posts/graph-theory/bfs-tree/bfs-tree-Page-1.svg&quot; class=&quot;slide&quot; /&gt;
    &lt;img src=&quot;./assets/posts/graph-theory/bfs-tree/bfs-tree-Page-2.svg&quot; class=&quot;slide&quot; /&gt;
    &lt;img src=&quot;./assets/posts/graph-theory/bfs-tree/bfs-tree-Page-3.svg&quot; class=&quot;slide&quot; /&gt;
    &lt;img src=&quot;./assets/posts/graph-theory/bfs-tree/bfs-tree-Page-4.svg&quot; class=&quot;slide&quot; /&gt;
    &lt;img src=&quot;./assets/posts/graph-theory/bfs-tree/bfs-tree-Page-5.svg&quot; class=&quot;slide&quot; /&gt;
    &lt;img src=&quot;./assets/posts/graph-theory/bfs-tree/bfs-tree-Page-6.svg&quot; class=&quot;slide&quot; /&gt;
    &lt;img src=&quot;./assets/posts/graph-theory/bfs-tree/bfs-tree-Page-7.svg&quot; class=&quot;slide&quot; /&gt;
    &lt;img src=&quot;./assets/posts/graph-theory/bfs-tree/bfs-tree-Page-8.svg&quot; class=&quot;slide&quot; /&gt;
  &lt;/div&gt;
  &lt;div class=&quot;controls&quot;&gt;
    &lt;button onclick=&quot;plusSlides(-1, &apos;slider6&apos;)&quot; class=&quot;prev black-button&quot;&gt;Prev&lt;/button&gt;
    &lt;button onclick=&quot;playSlides(&apos;slider6&apos;)&quot; class=&quot;play black-button&quot;&gt;Play&lt;/button&gt;
    &lt;button onclick=&quot;stopSlides(&apos;slider6&apos;)&quot; class=&quot;stop black-button&quot; hidden=&quot;&quot;&gt;Stop&lt;/button&gt;
    &lt;button onclick=&quot;plusSlides(1, &apos;slider6&apos;)&quot; class=&quot;next black-button&quot;&gt;Next&lt;/button&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;p&gt;One important observation in BFS is that we add nodes that we have discovered but not yet visited to the queue, and come back to (visit) them later.
With the source node (or root node) in the queue, the process is to visit a node (dequeue), add all the children/adjacent nodes to the queue (enqueue), and repeat the process.&lt;/p&gt;

&lt;div class=&quot;table-container&quot;&gt;
&lt;table style=&quot;width: 800px;&quot;&gt;
  &lt;tr&gt;
    &lt;th&gt;Step&lt;/th&gt;
    &lt;th&gt;Action&lt;/th&gt;
    &lt;th&gt;Queue State&lt;/th&gt;
    &lt;th&gt;Visited Nodes&lt;/th&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;1&lt;/td&gt;
    &lt;td&gt;Enqueue A&lt;/td&gt;
    &lt;td&gt;[A]&lt;/td&gt;
    &lt;td&gt;{}&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;2&lt;/td&gt;
    &lt;td&gt;Dequeue A, Enqueue B, C&lt;/td&gt;
    &lt;td&gt;[B, C]&lt;/td&gt;
    &lt;td&gt;{A}&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;3&lt;/td&gt;
    &lt;td&gt;Dequeue B, Enqueue D, E&lt;/td&gt;
    &lt;td&gt;[C, D, E]&lt;/td&gt;
    &lt;td&gt;{A, B}&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;4&lt;/td&gt;
    &lt;td&gt;Dequeue C, Enqueue F, G&lt;/td&gt;
    &lt;td&gt;[D, E, F, G]&lt;/td&gt;
    &lt;td&gt;{A, B, C}&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;5&lt;/td&gt;
    &lt;td&gt;Dequeue D&lt;/td&gt;
    &lt;td&gt;[E, F, G]&lt;/td&gt;
    &lt;td&gt;{A, B, C, D}&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;6&lt;/td&gt;
    &lt;td&gt;Dequeue E&lt;/td&gt;
    &lt;td&gt;[F, G]&lt;/td&gt;
    &lt;td&gt;{A, B, C, D, E}&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;7&lt;/td&gt;
    &lt;td&gt;Dequeue F&lt;/td&gt;
    &lt;td&gt;[G]&lt;/td&gt;
    &lt;td&gt;{A, B, C, D, E, F}&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;8&lt;/td&gt;
    &lt;td&gt;Dequeue G&lt;/td&gt;
    &lt;td&gt;[]&lt;/td&gt;
    &lt;td&gt;{A, B, C, D, E, F, G}&lt;/td&gt;
  &lt;/tr&gt;
&lt;/table&gt;
&lt;/div&gt;

&lt;p&gt;The intuition to hold on to: queues follow the first-in, first-out (FIFO) principle, meaning whatever was enqueued first is the first item read and removed from the queue. Nodes are therefore visited in the order they were discovered, level by level.&lt;/p&gt;

&lt;p&gt;Pseudo Code:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;BFS(graph, start):
    let queue be a queue
    let visited be a set
    queue.enqueue(start)
    
    while queue is not empty:
        node = queue.dequeue()
        if node is not in visited:
            visit(node)
            visited.add(node)
            for each neighbor of node in graph:
                if neighbor is not in visited:
                    queue.enqueue(neighbor)
&lt;/code&gt;&lt;/pre&gt;
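&lt;p&gt;The pseudocode maps almost line-for-line onto Python. Here is a minimal, runnable sketch using &lt;code&gt;collections.deque&lt;/code&gt; as the queue; the example graph is the tree from the table above, stored as an (illustrative) dict of adjacency lists:&lt;/p&gt;

```python
from collections import deque

def bfs(graph, start):
    """Breadth-first traversal; returns nodes in visit order."""
    queue = deque([start])
    visited = set()
    order = []
    while queue:
        node = queue.popleft()              # dequeue
        if node not in visited:
            order.append(node)              # "visit" the node
            visited.add(node)
            for neighbor in graph[node]:    # "explore": enqueue discovered nodes
                if neighbor not in visited:
                    queue.append(neighbor)
    return order

# The tree from the table above: A has children B, C; B has D, E; C has F, G.
graph = {"A": ["B", "C"], "B": ["D", "E"], "C": ["F", "G"],
         "D": [], "E": [], "F": [], "G": []}
print(bfs(graph, "A"))  # ['A', 'B', 'C', 'D', 'E', 'F', 'G']
```

&lt;p&gt;Note that &lt;code&gt;deque.popleft()&lt;/code&gt; is &lt;code&gt;O(1)&lt;/code&gt;, whereas &lt;code&gt;list.pop(0)&lt;/code&gt; is &lt;code&gt;O(n)&lt;/code&gt;, which is why &lt;code&gt;deque&lt;/code&gt; is the conventional queue in Python.&lt;/p&gt;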

&lt;p&gt;I hate to be the person who uses a tree to explain a graph. Reminds me of the physics class at school, where the lectures and exams are miles apart! So, here is the visualization of BFS for a graph:&lt;/p&gt;

&lt;div class=&quot;slider&quot; id=&quot;slider7&quot;&gt;
  &lt;div class=&quot;slides center-image-0 center-image-90&quot;&gt;
    &lt;img src=&quot;./assets/posts/graph-theory/bfs-queue/bfs-queue-Page-1.svg&quot; class=&quot;slide&quot; /&gt;
    &lt;img src=&quot;./assets/posts/graph-theory/bfs-queue/bfs-queue-Page-2.svg&quot; class=&quot;slide&quot; /&gt;
    &lt;img src=&quot;./assets/posts/graph-theory/bfs-queue/bfs-queue-Page-3.svg&quot; class=&quot;slide&quot; /&gt;
    &lt;img src=&quot;./assets/posts/graph-theory/bfs-queue/bfs-queue-Page-4.svg&quot; class=&quot;slide&quot; /&gt;
    &lt;img src=&quot;./assets/posts/graph-theory/bfs-queue/bfs-queue-Page-5.svg&quot; class=&quot;slide&quot; /&gt;
    &lt;img src=&quot;./assets/posts/graph-theory/bfs-queue/bfs-queue-Page-6.svg&quot; class=&quot;slide&quot; /&gt;
    &lt;img src=&quot;./assets/posts/graph-theory/bfs-queue/bfs-queue-Page-7.svg&quot; class=&quot;slide&quot; /&gt;
    &lt;img src=&quot;./assets/posts/graph-theory/bfs-queue/bfs-queue-Page-8.svg&quot; class=&quot;slide&quot; /&gt;
    &lt;img src=&quot;./assets/posts/graph-theory/bfs-queue/bfs-queue-Page-9.svg&quot; class=&quot;slide&quot; /&gt;
    &lt;img src=&quot;./assets/posts/graph-theory/bfs-queue/bfs-queue-Page-10.svg&quot; class=&quot;slide&quot; /&gt;
    &lt;img src=&quot;./assets/posts/graph-theory/bfs-queue/bfs-queue-Page-11.svg&quot; class=&quot;slide&quot; /&gt;
    &lt;img src=&quot;./assets/posts/graph-theory/bfs-queue/bfs-queue-Page-12.svg&quot; class=&quot;slide&quot; /&gt;
    &lt;img src=&quot;./assets/posts/graph-theory/bfs-queue/bfs-queue-Page-13.svg&quot; class=&quot;slide&quot; /&gt;
    &lt;img src=&quot;./assets/posts/graph-theory/bfs-queue/bfs-queue-Page-14.svg&quot; class=&quot;slide&quot; /&gt;
  &lt;/div&gt;
  &lt;div class=&quot;controls&quot;&gt;
    &lt;button onclick=&quot;plusSlides(-1, &apos;slider7&apos;)&quot; class=&quot;prev black-button&quot;&gt;Prev&lt;/button&gt;
    &lt;button onclick=&quot;playSlides(&apos;slider7&apos;)&quot; class=&quot;play black-button&quot;&gt;Play&lt;/button&gt;
    &lt;button onclick=&quot;stopSlides(&apos;slider7&apos;)&quot; class=&quot;stop black-button&quot; hidden=&quot;&quot;&gt;Stop&lt;/button&gt;
    &lt;button onclick=&quot;plusSlides(1, &apos;slider7&apos;)&quot; class=&quot;next black-button&quot;&gt;Next&lt;/button&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;p&gt;In the Breadth-First Search (BFS) for a graph, the same element might be added to the queue multiple times in the presence of cycles (i.e., the same node can be discovered from multiple neighbors). However, it will be ignored later by the visited check. In the above graph BFS visualization, I have skipped adding the same element into the queue and indicated it with arrows (from the other node(s)) instead.&lt;/p&gt;
&lt;p&gt;Duplicate entries can be prevented by searching the entire queue before every enqueue (increasing time complexity), by using another hash set to track already-enqueued nodes (increasing space complexity), or slightly optimized with tail checks.&lt;/p&gt;
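&lt;p&gt;The cheapest of these fixes, sketched below, marks a node as seen at enqueue time instead of dequeue time, so the same node can never sit in the queue twice (same dict-of-lists graph shape as before; the example graph with a cycle is illustrative):&lt;/p&gt;

```python
from collections import deque

def bfs_no_duplicates(graph, start):
    """BFS that never enqueues the same node twice."""
    seen = {start}                      # marked when enqueued, not when dequeued
    queue = deque([start])
    order = []
    while queue:
        node = queue.popleft()
        order.append(node)
        for neighbor in graph[node]:
            if neighbor not in seen:
                seen.add(neighbor)      # mark immediately: no duplicate entries
                queue.append(neighbor)
    return order

# An undirected graph with a cycle: A-B, A-C, B-C, plus C-D.
graph = {"A": ["B", "C"], "B": ["A", "C"], "C": ["A", "B", "D"], "D": ["C"]}
print(bfs_no_duplicates(graph, "A"))  # ['A', 'B', 'C', 'D']
```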

&lt;h3&gt;3. Conclusion&lt;/h3&gt;
&lt;p&gt;Both Breadth-First Search (BFS) and Depth-First Search (DFS) have a lot of applications and come up way too often when dealing with graphs.&lt;/p&gt;
&lt;p&gt;BFS is the first that pops up when finding the shortest path in an unweighted graph. DFS has tons of use cases: computing a spanning tree of a graph, detecting cycles, checking if a graph is bipartite, finding bridges, articulation points, and strongly connected components, topologically sorting a graph, and many more. For plain traversal and reachability problems, BFS and DFS can often be used interchangeably.&lt;/p&gt;

&lt;h3&gt;4. References&lt;/h3&gt;
&lt;pre style=&quot;max-height: 180px&quot;&gt;&lt;code&gt;[1] Wikipedia contributors, &quot;Depth-first search,&quot; Wikipedia, The Free Encyclopedia. [Online]. Available: https://en.wikipedia.org/wiki/Depth-first_search.
[2] Wikipedia contributors, &quot;Breadth-first search,&quot; Wikipedia, The Free Encyclopedia. [Online]. Available: https://en.wikipedia.org/wiki/Breadth-first_search.
[3] Abdul Bari, &quot;Graph Traversals - BFS &amp;amp; DFS -Breadth First Search and Depth First Search,&quot; YouTube. [Online]. Available: https://youtu.be/pcKY4hjDrxk.
[4] Pravin Kumar Sinha, &quot;Stack-based breadth-first search tree traversal,&quot; IBM Developer. [Online]. Available: https://developer.ibm.com/articles/au-aix-stack-tree-traversal/.
[5] W. Fiset, &quot;Algorithms repository,&quot; GitHub, 2017. [Online]. Available: https://github.com/williamfiset/Algorithms.
&lt;/code&gt;&lt;/pre&gt;</content><author><name>Adesh Nalpet Adimurthy</name></author><category term="Code on the Road" /><category term="Graph Theory" /><category term="Data Structures" /><summary type="html">0. Graph Traversal Breadth-First Search (BFS) and Depth-First Search (DFS) are two of the most commonly used graph traversal methods. The traversal of a graph, whether BFS or DFS, involves two main concepts: visiting a node and exploring a node. Exploration refers to visiting all the children/adjacent nodes. Among BFS and DFS, Depth-First Search is more intuitive to perform, so let&apos;s first explore DFS to set a clear standpoint on what BFS is not.</summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://pyblog.xyz/assets/featured/webp/graph-theory-search.webp" /><media:content medium="image" url="https://pyblog.xyz/assets/featured/webp/graph-theory-search.webp" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Graph Theory: Introduction</title><link href="https://pyblog.xyz/graph-theory-introduction" rel="alternate" type="text/html" title="Graph Theory: Introduction" /><published>2024-07-14T00:00:00+00:00</published><updated>2024-07-14T00:00:00+00:00</updated><id>https://pyblog.xyz/graph-theory-introduction</id><content type="html" xml:base="https://pyblog.xyz/graph-theory-introduction">&lt;p&gt;Before heading into details of how we store, represent, and traverse various kinds of graphs, this post is more of a ramp-up to better understand what graphs are and the different kinds from a computer science point of view, rather than a mathematical one. So, no proofs and equations, mostly just diagrams and implementation details, with an emphasis on how to apply graph theory to real-world applications.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://en.wikipedia.org/wiki/Graph_theory&quot; target=&quot;_blank&quot;&gt;Graph theory&lt;/a&gt; is the mathematical theory of the properties and applications of graphs/networks, which is just a collection of objects that are all interconnected.&lt;/p&gt;

&lt;p&gt;&lt;img class=&quot;center-image-0 center-image-60&quot; src=&quot;./assets/posts/graph-theory/gt-wardrobe.svg&quot; /&gt;&lt;/p&gt;
&lt;p&gt;Graph theory is a broad enough topic to say it can be applied to almost any problem. Take the first (maybe not first, make it 21st) thing in the morning: choosing what to wear. Given the entire wardrobe, how many sets of clothes can I make by choosing one item from each category (by category, I mean tops, bottoms, shoes, hats, and glasses)? While this sounds like a counting problem, using graphs to visualize each clothing item as a node and edges to represent the relationships between them can be helpful.&lt;/p&gt;

&lt;p&gt;&lt;img class=&quot;center-image-0 center-image-70&quot; src=&quot;./assets/posts/graph-theory/gt-social-network.svg&quot; /&gt;&lt;/p&gt;
&lt;p&gt;Another everyday example is the social network. A graph representation answers questions such as how many mutual friends or how many degrees of separation exist between two people.&lt;/p&gt;

&lt;h3&gt;1. Types of Graphs&lt;/h3&gt;

&lt;p&gt;There are a lot of types of graphs, and it&apos;s important to understand the kind of graph you are dealing with. Let&apos;s go over the most commonly known graph variants.&lt;/p&gt;

&lt;h3&gt;1.1. Undirected Graph&lt;/h3&gt;
&lt;p&gt;The simplest kind of graph, where the edges have no orientation (&lt;a href=&quot;https://en.wikipedia.org/wiki/Bidirected_graph&quot; target=&quot;_blank&quot;&gt;bi-directional&lt;/a&gt;), i.e., edge &lt;code&gt;(u, v)&lt;/code&gt; is identical to edge &lt;code&gt;(v, u)&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img class=&quot;center-image-0 center-image-50&quot; src=&quot;./assets/posts/graph-theory/gt-undirected.svg&quot; /&gt;&lt;/p&gt;
&lt;p&gt;Example: Cities interconnected by bi-directional roads. You can drive from one city to another and retrace the same path back.&lt;/p&gt;

&lt;h3&gt;1.2. Directed Graph/Digraph&lt;/h3&gt;
&lt;p&gt;In contrast to an undirected graph, &lt;a href=&quot;https://en.wikipedia.org/wiki/Directed_graph&quot; target=&quot;_blank&quot;&gt;directed graphs&lt;/a&gt; or digraphs have edges that are directed/have orientation. Edge &lt;code&gt;(u, v)&lt;/code&gt; represents that you can only go from node u to node v and not the other way around. As shown in the figure below, the edges are directed, indicated by the arrowheads on the edges between nodes.&lt;/p&gt;

&lt;p&gt;&lt;img class=&quot;center-image-0 center-image-55&quot; src=&quot;./assets/posts/graph-theory/gt-directed.svg&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Example: This graph could represent people who bought each other gifts. C and D got gifts for each other, E didn&apos;t get any nor give any, B got one from A, gave a gift to D, and sent a gift to itself.&lt;/p&gt;

&lt;h3&gt;1.3. Weighted Graphs&lt;/h3&gt;
&lt;p&gt;So far, we have seen unweighted graphs, but edges on graphs can contain weights to represent arbitrary values such as distance, cost, quantity, etc.&lt;/p&gt;
&lt;p&gt;&lt;img class=&quot;center-image-0 center-image-50&quot; src=&quot;./assets/posts/graph-theory/gt-weighted.svg&quot; /&gt;&lt;/p&gt;
&lt;p&gt;Weighted graphs can again be directed or undirected. An edge of a weighted graph can be denoted with &lt;code&gt;(u, v, w)&lt;/code&gt;, where &lt;code&gt;w&lt;/code&gt; is the weight.&lt;/p&gt;

&lt;h3&gt;2. Special Graphs&lt;/h3&gt;
&lt;p&gt;While directed, undirected, and weighted graphs cover the basic types, there are many other kinds of graphs governed by additional rules and restrictions.&lt;/p&gt;

&lt;h3&gt;2.1. Trees&lt;/h3&gt;
&lt;p&gt;A &lt;a href=&quot;https://en.wikipedia.org/wiki/Tree_(graph_theory)&quot; target=&quot;_blank&quot;&gt;tree&lt;/a&gt; is simply a connected collection of nodes, joined by directed (or undirected) edges, with no cycles or loops (no node can be its own ancestor). A tree with &lt;code&gt;N&lt;/code&gt; nodes has exactly &lt;code&gt;N-1&lt;/code&gt; edges.&lt;/p&gt;
&lt;img class=&quot;center-image-0 center-image&quot; src=&quot;./assets/posts/graph-theory/gt-trees.svg&quot; /&gt;
&lt;p&gt;All of the above are indeed trees, even the left-most graph, which has no cycles and N-1 edges.&lt;/p&gt;

&lt;h3&gt;2.2. Rooted Trees&lt;/h3&gt;
&lt;p&gt;A related but totally different kind of graph is a rooted tree. It has a designated root node, where every edge either points away from or towards the root node. When edges point away from the root, it&apos;s called an out-tree (arborescence) and an in-tree (anti-arborescence) otherwise.&lt;/p&gt;
&lt;img class=&quot;center-image-0 center-image&quot; src=&quot;./assets/posts/graph-theory/gt-rooted-trees.svg&quot; /&gt;
&lt;p&gt;Out-trees are more commonly used than in-trees, so much so that out-trees are often referred to as just &quot;trees.&quot;&lt;/p&gt;

&lt;h3&gt;2.3. Directed Acyclic Graphs (DAGs)&lt;/h3&gt;
&lt;p&gt;&lt;a href=&quot;https://en.wikipedia.org/wiki/Directed_acyclic_graph&quot; target=&quot;_blank&quot;&gt;DAGs&lt;/a&gt; are directed acyclic graphs, i.e., with directed edges and no cycles or loops. DAGs play an important role and are very common in computer science, including dependency management, workflows, schedulers, and many more.&lt;/p&gt;
&lt;p&gt;When dealing with DAGs, commonly used algorithms include finding the shortest path and topological sort (how to process nodes in a graph in the correct order considering dependencies).&lt;/p&gt;
&lt;img class=&quot;center-image-0 center-image&quot; src=&quot;./assets/posts/graph-theory/gt-dags.svg&quot; /&gt;
&lt;p&gt;Fun Fact: All out-trees are DAGs, but not all DAGs are out-trees.&lt;/p&gt;
&lt;p&gt;DAG nodes can have multiple parents, meaning there can be multiple paths that eventually merge. Out-trees are DAGs with the restriction that a child can only have one parent. Another way to see it: a tree is like single inheritance, and a DAG is like multiple inheritance.&lt;/p&gt;
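&lt;p&gt;Topological sort, mentioned above, can be sketched with Kahn&apos;s algorithm: repeatedly remove nodes that have no remaining incoming edges. A minimal sketch (the DAG here is an illustrative build pipeline, not from the figure):&lt;/p&gt;

```python
from collections import deque

def topological_sort(graph):
    """Kahn's algorithm: repeatedly take nodes with in-degree zero."""
    in_degree = {node: 0 for node in graph}
    for node in graph:
        for child in graph[node]:
            in_degree[child] += 1
    queue = deque(n for n in graph if in_degree[n] == 0)
    order = []
    while queue:
        node = queue.popleft()
        order.append(node)
        for child in graph[node]:
            in_degree[child] -= 1
            if in_degree[child] == 0:
                queue.append(child)
    if len(order) != len(graph):
        raise ValueError("graph has a cycle, so it is not a DAG")
    return order

# build feeds test and lint; both feed package (a node with two parents).
dag = {"build": ["test", "lint"], "test": ["package"],
       "lint": ["package"], "package": []}
print(topological_sort(dag))  # ['build', 'test', 'lint', 'package']
```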

&lt;h3&gt;2.4. Bipartite Graph&lt;/h3&gt;
&lt;p&gt;A &lt;a href=&quot;https://en.wikipedia.org/wiki/Bipartite_graph&quot; target=&quot;_blank&quot;&gt;bipartite graph&lt;/a&gt; is one whose vertices can be split into two independent groups, &lt;code&gt;U&lt;/code&gt; and &lt;code&gt;V&lt;/code&gt;, such that every edge connects a vertex in &lt;code&gt;U&lt;/code&gt; to a vertex in &lt;code&gt;V&lt;/code&gt;. In other words, a bipartite graph is two-colorable: every edge connects a vertex of one set (Example, set 1: red color) to a vertex of the other set (Example, set 2: blue color).&lt;/p&gt;
&lt;img class=&quot;center-image-0 center-image-65&quot; src=&quot;./assets/posts/graph-theory/gt-bipartite.svg&quot; /&gt;
&lt;p&gt;A common question is to find the maximum matching that can be created on a bipartite graph (covered in a follow-up post). For example, say red nodes are jobs and blue nodes are people. The problem is to determine how many people can be matched to jobs.&lt;/p&gt;
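&lt;p&gt;The two-colorable property also gives a simple test for bipartiteness: BFS the graph, alternating colors level by level; if an edge ever connects two nodes of the same color, the graph is not bipartite. A sketch of that idea (the example graphs are illustrative):&lt;/p&gt;

```python
from collections import deque

def is_bipartite(graph):
    """Two-color the graph with BFS; a color conflict means an odd cycle."""
    color = {}
    for source in graph:                         # handle disconnected graphs
        if source in color:
            continue
        color[source] = 0
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for neighbor in graph[node]:
                if neighbor not in color:
                    color[neighbor] = 1 - color[node]   # opposite color
                    queue.append(neighbor)
                elif color[neighbor] == color[node]:
                    return False                 # same color on both ends
    return True

square = {1: [2, 4], 2: [1, 3], 3: [2, 4], 4: [1, 3]}  # even cycle: bipartite
triangle = {1: [2, 3], 2: [1, 3], 3: [1, 2]}           # odd cycle: not
print(is_bipartite(square), is_bipartite(triangle))  # True False
```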

&lt;h3&gt;2.5. Complete Graph&lt;/h3&gt;
&lt;p&gt;In a &lt;a href=&quot;https://en.wikipedia.org/wiki/Complete_graph&quot; target=&quot;_blank&quot;&gt;complete graph&lt;/a&gt;, there is a unique edge between every pair of nodes, i.e., every node is connected to every other node except itself. A complete graph with &lt;code&gt;n&lt;/code&gt; vertices is denoted by the graph &lt;code&gt;K&lt;sub&gt;n&lt;/sub&gt;&lt;/code&gt;.&lt;/p&gt;
&lt;img class=&quot;center-image-0 center-image-100&quot; src=&quot;./assets/posts/graph-theory/gt-complete.svg&quot; /&gt;
&lt;p&gt;A complete graph is often seen as the worst-case possible graph and is used for performance testing.&lt;/p&gt;

&lt;h3&gt;3. Graph Representation&lt;/h3&gt;
&lt;p&gt;The next important aspect is the data structure we use to represent a graph, which can have a huge impact on performance. The simplest and most common way is using an adjacency matrix.&lt;/p&gt;

&lt;h3&gt;3.1. Adjacency Matrix&lt;/h3&gt;
&lt;p&gt;An &lt;a href=&quot;https://en.wikipedia.org/wiki/Adjacency_matrix&quot; target=&quot;_blank&quot;&gt;adjacency matrix&lt;/a&gt; &lt;code&gt;m&lt;/code&gt; represents a graph, where &lt;code&gt;m[i][j]&lt;/code&gt; is the edge weight of going from node &lt;code&gt;i&lt;/code&gt; to node &lt;code&gt;j&lt;/code&gt;. Unless specified otherwise, it&apos;s assumed that the edge from a node to itself has zero cost, which is why the diagonal of the matrix is all zeroes.&lt;/p&gt;
&lt;img class=&quot;center-image-0 center-image-65&quot; src=&quot;./assets/posts/graph-theory/gt-adjacency-matrix.svg&quot; /&gt;
&lt;p&gt;For example, the weight of the edge going from node D to node B is 5, as represented in the matrix.&lt;/p&gt;

&lt;p&gt;Pros:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Space efficient for representing dense graphs.&lt;/li&gt;
&lt;li&gt;Edge weight lookup is constant time: &lt;code&gt;O(1)&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Simplest graph representation.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Cons:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Requires &lt;code&gt;O(V&lt;sup&gt;2&lt;/sup&gt;)&lt;/code&gt; space, where &lt;code&gt;V&lt;/code&gt; is the number of nodes/vertices.&lt;/li&gt;
&lt;li&gt;Iterating over all edges requires &lt;code&gt;O(V&lt;sup&gt;2&lt;/sup&gt;)&lt;/code&gt; time.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The quadratic space complexity becomes impractical when dealing with sparse networks with nodes in the order of thousands or more.&lt;/p&gt;
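&lt;p&gt;As a concrete sketch, an adjacency matrix is just a 2D array where a sentinel (here &lt;code&gt;INF&lt;/code&gt;) marks a missing edge. The weights below are illustrative, except for the D to B edge of weight 5 echoed from the example above:&lt;/p&gt;

```python
INF = float("inf")   # sentinel for "no edge"

# 4-node weighted digraph; row = source, column = destination; A=0 ... D=3.
m = [
    [0,   4,   1,   INF],   # A
    [INF, 0,   6,   INF],   # B
    [4,   1,   0,   2],     # C
    [INF, 5,   4,   0],     # D
]

A, B, C, D = range(4)
print(m[D][B])  # O(1) lookup: the edge D -> B has weight 5
```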

&lt;h3&gt;3.2. Adjacency List&lt;/h3&gt;
&lt;p&gt;The other alternative to the adjacency matrix is the &lt;a href=&quot;https://en.wikipedia.org/wiki/Adjacency_list&quot; target=&quot;_blank&quot;&gt;adjacency list&lt;/a&gt;. This is a way to represent the graph as a map from nodes to lists of outgoing edges. In other words, each node tracks all its outgoing edges. i.e., &lt;code&gt;N&lt;sub&gt;1&lt;/sub&gt; = [(N&lt;sub&gt;x&lt;/sub&gt;, W), (N&lt;sub&gt;y&lt;/sub&gt;, W), ...]&lt;/code&gt;&lt;/p&gt;
&lt;img class=&quot;center-image-0 center-image-70&quot; src=&quot;./assets/posts/graph-theory/gt-adjacency-list.svg&quot; /&gt;
&lt;p&gt;For example, Node C has 3 outgoing edges, so the map entry for Node C has those 3 entries, each represented by the combination of the destination node and edge weight/cost.&lt;/p&gt;

&lt;p&gt;Pros:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Space efficient for representing sparse graphs (no extra space for unused edges).&lt;/li&gt;
&lt;li&gt;Iterating over all edges is efficient.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Cons:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Less space efficient for dense graphs.&lt;/li&gt;
&lt;li&gt;Edge weight lookup is &lt;code&gt;O(E)&lt;/code&gt;, where &lt;code&gt;E&lt;/code&gt; is the number of edges of a node.
&lt;/li&gt;
&lt;li&gt;Slightly more complex graph representation.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Adjacency lists are still very commonly used, since edge weight lookup is not a common use case and many real-world use cases involve sparse graphs.&lt;/p&gt;
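&lt;p&gt;A minimal sketch of the same idea in Python: a dict mapping each node to its list of (destination, weight) pairs. The nodes and weights are illustrative, with Node C given 3 outgoing edges as in the example:&lt;/p&gt;

```python
# Adjacency list: each node maps to a list of (destination, weight) pairs.
adj = {
    "A": [("B", 4), ("C", 1)],
    "B": [("C", 6)],
    "C": [("A", 4), ("B", 1), ("D", 2)],   # Node C has 3 outgoing edges
    "D": [("B", 5), ("C", 4)],
}

# Iterating over all edges touches only edges that exist: O(V + E).
edge_count = sum(len(out_edges) for out_edges in adj.values())
print(edge_count)  # 8

# Edge-weight lookup requires scanning the node's list: O(E) for that node.
def weight(adj, u, v):
    for dest, w in adj[u]:
        if dest == v:
            return w
    return None

print(weight(adj, "D", "B"))  # 5
```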

&lt;h3&gt;3.3. Edge List&lt;/h3&gt;
&lt;p&gt;The edge list takes the simplest possible approach: represent a graph as an unordered list of edges, each with a source node, destination node, and weight. For example, &lt;code&gt;(u, v, w)&lt;/code&gt; represents the cost from node &lt;code&gt;u&lt;/code&gt; to node &lt;code&gt;v&lt;/code&gt; as &lt;code&gt;w&lt;/code&gt;.&lt;/p&gt;
&lt;img class=&quot;center-image-0 center-image-65&quot; src=&quot;./assets/posts/graph-theory/gt-edge-list.svg&quot; /&gt;
&lt;p&gt;Pros:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Space efficient for representing sparse graphs.&lt;/li&gt;
&lt;li&gt;Iterating over all edges is efficient.&lt;/li&gt;
&lt;li&gt;Very simple structure/representation.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Cons:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Less space efficient for dense graphs.&lt;/li&gt;
&lt;li&gt;Edge weight lookup is &lt;code&gt;O(E)&lt;/code&gt;, where &lt;code&gt;E&lt;/code&gt; is the number of edges.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Despite the seeming simplicity and lack of structure, edge lists do come in handy for a variety of problems and algorithms.&lt;/p&gt;
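&lt;p&gt;A sketch of an edge list, plus a one-pass conversion back to an adjacency list (the data is illustrative):&lt;/p&gt;

```python
from collections import defaultdict

# Edge list: plain (source, destination, weight) triples.
edges = [
    ("A", "B", 4), ("A", "C", 1), ("B", "C", 6),
    ("C", "A", 4), ("C", "B", 1), ("C", "D", 2),
    ("D", "B", 5), ("D", "C", 4),
]

# Handy for algorithms that scan or sort all edges (e.g. Kruskal's MST):
print(sorted(edges, key=lambda e: e[2])[0])  # cheapest edge: ('A', 'C', 1)

# Converting to an adjacency list is a single pass over the edges:
def to_adjacency_list(edges):
    adj = defaultdict(list)
    for u, v, w in edges:
        adj[u].append((v, w))
    return adj

print(to_adjacency_list(edges)["D"])  # [('B', 5), ('C', 4)]
```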

&lt;h3&gt;4. Graph Problems&lt;/h3&gt;

&lt;p&gt;One of the best approaches to dealing with graph problems is to familiarize yourself with common graph theory algorithms, since many other problems can be reduced to a known graph problem. Before picking an algorithm and a representation, it helps to ask a few questions about the graph at hand:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Does the graph already exist, or is it to be derived/constructed?&lt;/li&gt;
&lt;li&gt;Is the graph directed or undirected?&lt;/li&gt;
&lt;li&gt;Is it a weighted graph (edges)?&lt;/li&gt;
&lt;li&gt;Is it a sparse graph or a dense graph?&lt;/li&gt;
&lt;li&gt;Based on all of the above, should I use an adjacency matrix, adjacency list, edge list, or other structures?&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;4.1. Shortest Path Problem&lt;/h3&gt;

&lt;p&gt;Given a weighted graph, find the shortest path of edges from Node A to Node B (source and destination nodes).&lt;/p&gt;
&lt;p&gt;Algorithms: &lt;a href=&quot;https://en.wikipedia.org/wiki/Breadth-first_search&quot; target=&quot;_blank&quot;&gt;Breadth First Search&lt;/a&gt; (unweighted graph), &lt;a href=&quot;https://en.wikipedia.org/wiki/Dijkstra%27s_algorithm&quot; target=&quot;_blank&quot;&gt;Dijkstra&lt;/a&gt;&apos;s, Bellman-Ford, Floyd-Warshall, A*, and many more.&lt;/p&gt;
&lt;img class=&quot;center-image-0 center-image-100&quot; src=&quot;./assets/posts/graph-theory/gt-shortest-path.svg&quot; /&gt;
&lt;p&gt;In the example, to find the shortest path from Node A to Node H, the sum of all the weights/costs of the path taken should be the least.&lt;/p&gt;

&lt;h3&gt;4.2. Connectivity&lt;/h3&gt;

&lt;p&gt;Along the same lines: determine whether connectivity exists between Node A and Node B. In other words, given two nodes, do they exist in the same network/graph? This is quite commonly used in communication networks such as WiFi, Thread, Zigbee, etc.&lt;/p&gt;
&lt;img class=&quot;center-image-0 center-image-65&quot; src=&quot;./assets/posts/graph-theory/gt-connectivity.svg&quot; /&gt;
&lt;p&gt;Algorithms: Any search algorithm such as BFS (Breadth First Search) or DFS (&lt;a href=&quot;https://en.wikipedia.org/wiki/Depth-first_search&quot; target=&quot;_blank&quot;&gt;Depth First Search&lt;/a&gt;).&lt;/p&gt;
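&lt;p&gt;Since BFS was sketched in the previous post, here is the connectivity check with an iterative DFS: push neighbors onto a stack until B is found or the component is exhausted (the network topology below is illustrative):&lt;/p&gt;

```python
def connected(graph, a, b):
    """True if b is reachable from a (iterative DFS with an explicit stack)."""
    stack, seen = [a], {a}
    while stack:
        node = stack.pop()
        if node == b:
            return True
        for neighbor in graph[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                stack.append(neighbor)
    return False

# Two separate components: {A, B, C} and {X, Y}.
network = {"A": ["B"], "B": ["A", "C"], "C": ["B"], "X": ["Y"], "Y": ["X"]}
print(connected(network, "A", "C"), connected(network, "A", "Y"))  # True False
```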

&lt;h3&gt;4.3. Negative Cycles&lt;/h3&gt;
&lt;p&gt;The problem: detect negative cycles in a directed graph. Also known as a negative-weight cycle, it is a cycle in a graph whose edges sum to a negative value.&lt;/p&gt;
&lt;p&gt;&lt;/p&gt;
&lt;img class=&quot;center-image-0 center-image-45&quot; src=&quot;./assets/posts/graph-theory/gt-cycles.svg&quot; /&gt;
&lt;p&gt;In the example, nodes B, C, and D form a negative cycle whose costs sum to -1, so an algorithm could loop through it endlessly, lowering the total cost on every iteration. For instance, a shortest-path search that doesn&apos;t detect negative cycles would fall into this trap and never escape.&lt;/p&gt;
&lt;img class=&quot;center-image-0 center-image-55&quot; src=&quot;./assets/posts/graph-theory/gt-currency.svg&quot; /&gt;
&lt;p&gt;Detecting negative cycles has other applications, such as currency arbitrage. In this context, assign currencies to different vertices, and let the edge weight represent the exchange rate.&lt;/p&gt;
&lt;p&gt;Algorithms to detect negative cycles: &lt;a href=&quot;https://en.wikipedia.org/wiki/Bellman%E2%80%93Ford_algorithm&quot; target=&quot;_blank&quot;&gt;Bellman-Ford&lt;/a&gt; and &lt;a href=&quot;https://en.wikipedia.org/wiki/Floyd%E2%80%93Warshall_algorithm&quot; target=&quot;_blank&quot;&gt;Floyd-Warshall&lt;/a&gt;.&lt;/p&gt;
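&lt;p&gt;A minimal sketch of Bellman-Ford-style detection over an edge list: after enough rounds of relaxation, distances have converged unless a negative cycle exists, so one more successful relaxation betrays the cycle. Initializing every distance to 0 (rather than picking a single source) detects negative cycles anywhere in the graph. The weights below mirror the B, C, D example, with nodes numbered 0-2:&lt;/p&gt;

```python
def has_negative_cycle(num_nodes, edges):
    """Bellman-Ford over an edge list of (u, v, w) triples."""
    dist = [0.0] * num_nodes                # 0-init: detect cycles anywhere
    for _ in range(num_nodes - 1):          # V-1 rounds of relaxation
        for u, v, w in edges:
            if dist[v] > dist[u] + w:
                dist[v] = dist[u] + w
    # If any edge can still be relaxed, a negative cycle must exist.
    return any(dist[v] > dist[u] + w for u, v, w in edges)

# A 3-node cycle whose weights sum to -1 (like B, C, D in the figure).
edges = [(0, 1, 1), (1, 2, 1), (2, 0, -3)]
print(has_negative_cycle(3, edges))  # True
```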
&lt;p&gt;&lt;/p&gt;

&lt;h3&gt;4.4. Strongly Connected Components&lt;/h3&gt;
&lt;p&gt;&lt;a href=&quot;https://en.wikipedia.org/wiki/Strongly_connected_component&quot; target=&quot;_blank&quot;&gt;SCCs&lt;/a&gt; are self-contained cycles within a directed graph, i.e., every vertex/node in a cycle can reach every other vertex in the same cycle.&lt;/p&gt;
&lt;img class=&quot;center-image-0 center-image&quot; src=&quot;./assets/posts/graph-theory/gt-ssc.svg&quot; /&gt;
&lt;p&gt;If each strongly connected component is contracted to a single vertex, the resulting graph is a directed acyclic graph (DAG), the condensation of Graph G.&lt;/p&gt;
&lt;p&gt;Algorithms: &lt;a href=&quot;https://en.wikipedia.org/wiki/Tarjan%27s_strongly_connected_components_algorithm&quot; target=&quot;_blank&quot;&gt;Tarjan&apos;s SCC&lt;/a&gt; and &lt;a href=&quot;https://en.wikipedia.org/wiki/Kosaraju%27s_algorithm&quot; target=&quot;_blank&quot;&gt;Kosaraju&lt;/a&gt;&apos;s algorithm.&lt;/p&gt;

&lt;h3&gt;4.5. Traveling Salesman Problem&lt;/h3&gt;
&lt;p&gt;or the travelling salesperson problem (&lt;a href=&quot;https://en.wikipedia.org/wiki/Travelling_salesman_problem&quot; target=&quot;_blank&quot;&gt;TSP&lt;/a&gt;) asks &quot;Given a list of cities and the distances between each pair of cities, what is the shortest possible route that visits each city exactly once and returns to the origin city?&quot; It is an NP-hard problem.&lt;/p&gt;
&lt;img class=&quot;center-image-0 center-image-65&quot; src=&quot;./assets/posts/graph-theory/gt-tsp.svg&quot; /&gt;
&lt;p&gt;For the above graph, the TSP (Traveling Salesman Problem) solution has a cost of 9 to travel from Node A to all the other nodes and back to Node A.&lt;/p&gt;
&lt;p&gt;Algorithms: &lt;a href=&quot;https://en.wikipedia.org/wiki/Held%E2%80%93Karp_algorithm&quot; target=&quot;_blank&quot;&gt;Held-Karp&lt;/a&gt;, Branch and Bound, and approximation algorithms (Ex: Ant Colony Optimization).&lt;/p&gt;

&lt;h3&gt;4.6. Bridges&lt;/h3&gt;
&lt;p&gt;A &lt;a href=&quot;https://en.wikipedia.org/wiki/Bridge_(graph_theory)&quot; target=&quot;_blank&quot;&gt;bridge&lt;/a&gt;, cut-edge, or cut-arc is an edge of a graph whose deletion increases the graph&apos;s number of connected components (islands or clusters).&lt;/p&gt;
&lt;img class=&quot;center-image-0 center-image-55&quot; src=&quot;./assets/posts/graph-theory/gt-bridge.svg&quot; /&gt;
&lt;p&gt;Detecting bridges is important as they often signify bottlenecks, weak points, or vulnerabilities in a graph. For instance, it&apos;s common to ensure that a mesh network is a bridgeless graph.&lt;/p&gt;

&lt;h3&gt;4.7. Articulation Points&lt;/h3&gt;
&lt;p&gt;An articulation point, or cut vertex, is similar to a bridge, but instead of edges, they are nodes. When removed, they increase the number of connected components.&lt;/p&gt;
&lt;img class=&quot;center-image-0 center-image-60&quot; src=&quot;./assets/posts/graph-theory/gt-ap.svg&quot; /&gt;
&lt;p&gt;In the same graph as for bridges, the nodes connected by the bridges are articulation points.&lt;/p&gt;

&lt;h3&gt;4.8. Minimum Spanning Tree (MST)&lt;/h3&gt;
&lt;p&gt;A minimum spanning tree (&lt;a href=&quot;https://en.wikipedia.org/wiki/Minimum_spanning_tree&quot; target=&quot;_blank&quot;&gt;MST&lt;/a&gt;) or minimum weight spanning tree is a subset of the edges of a connected, edge-weighted undirected graph that connects all the vertices together, without any cycles and with the minimum possible total edge weight/cost.&lt;/p&gt;
&lt;img class=&quot;center-image-0 center-image-80&quot; src=&quot;./assets/posts/graph-theory/gt-mst.svg&quot; /&gt;
&lt;p&gt;A graph can have multiple minimum spanning trees, all with the same total cost; the MST is therefore not necessarily unique. Common use cases include designing a least-cost network, transportation networks, and more.&lt;/p&gt;
&lt;p&gt;Algorithms: Kruskal&apos;s, Prim&apos;s, and Boruvka&apos;s algorithms.&lt;/p&gt;
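&lt;p&gt;As a sketch of one of them, Kruskal&apos;s algorithm sorts the edges by weight and greedily adds any edge that joins two different components, using a union-find structure to reject edges that would form a cycle (the example graph is illustrative):&lt;/p&gt;

```python
def kruskal_mst(num_nodes, edges):
    """Kruskal's algorithm: greedily add the cheapest edges that join components."""
    parent = list(range(num_nodes))

    def find(x):                        # union-find with path compression
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    mst, total = [], 0
    for u, v, w in sorted(edges, key=lambda e: e[2]):
        ru, rv = find(u), find(v)
        if ru != rv:                    # skip edges that would form a cycle
            parent[ru] = rv
            mst.append((u, v, w))
            total += w
    return mst, total

# A square 0-1-2-3 with a diagonal 0-2: edges as (u, v, weight) triples.
edges = [(0, 1, 1), (1, 2, 2), (2, 3, 3), (3, 0, 4), (0, 2, 5)]
mst, total = kruskal_mst(4, edges)
print(total)  # 6: the three cheapest edges span all four nodes
```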
&lt;p&gt;&lt;/p&gt;

&lt;h3&gt;4.9. Flow Network&lt;/h3&gt;
&lt;p&gt;&lt;a href=&quot;https://en.wikipedia.org/wiki/Flow_network&quot; target=&quot;_blank&quot;&gt;Flow network&lt;/a&gt; or the transportation network is a directed graph where the edge weight represents &quot;capacity.&quot; The amount of flow on an edge cannot exceed the capacity of the edge. Capacity can represent fluids in a pipe, currents in an electrical circuit, cars on a road, etc.&lt;/p&gt;
&lt;p&gt;Problem: with an unlimited supply entering the source, what&apos;s the maximum flow that can reach the sink? This makes it easier to spot bottlenecks in the network that slow the flow. Correlating to the examples, max flow would be the number of cars, volume of fluid, etc.&lt;/p&gt;
&lt;img class=&quot;center-image-0 center-image-65&quot; src=&quot;./assets/posts/graph-theory/gt-flow-network.svg&quot; /&gt;
&lt;p&gt;Also, there cannot be blockages in the network: the amount of flow into a node equals the amount of flow out of it (except at the source and sink).&lt;/p&gt;

&lt;h3&gt;5. Conclusion&lt;/h3&gt;
&lt;p&gt;With the basics of graph theory covered, including various types of graphs and their representations, we&apos;ve laid the groundwork for understanding how to efficiently store, represent, and traverse graphs in real-world applications. The next set of posts on Graph Theory will be a deep dive into specific problems and algorithms.&lt;/p&gt;

&lt;h3&gt;6. References&lt;/h3&gt;
&lt;pre style=&quot;max-height: 180px&quot;&gt;&lt;code&gt;[1] W. Fiset, &quot;Algorithms repository,&quot; GitHub, 2017. [Online]. Available: https://github.com/williamfiset/Algorithms.
[2] V. Schwartz, &quot;Currency Arbitrage and Graphs (2),&quot; Reasonable Deviations, Apr. 21, 2019. [Online]. Available: https://reasonabledeviations.com/2019/04/21/currency-arbitrage-graphs-2/. 
[3] Wikipedia, &quot;Graph theory,&quot; Wikipedia, The Free Encyclopedia. [Online]. Available: https://en.wikipedia.org/wiki/Graph_theory.
[4] Wikipedia, &quot;Bidirected graph,&quot; Wikipedia, The Free Encyclopedia. [Online]. Available: https://en.wikipedia.org/wiki/Bidirected_graph.
[5] Wikipedia, &quot;Directed graph,&quot; Wikipedia, The Free Encyclopedia. [Online]. Available: https://en.wikipedia.org/wiki/Directed_graph.
[6] Wikipedia, &quot;Tree (graph theory),&quot; Wikipedia, The Free Encyclopedia. [Online]. Available: https://en.wikipedia.org/wiki/Tree_(graph_theory).
[7] Wikipedia, &quot;Directed acyclic graph,&quot; Wikipedia, The Free Encyclopedia. [Online]. Available: https://en.wikipedia.org/wiki/Directed_acyclic_graph.
[8] Wikipedia, &quot;Bipartite graph,&quot; Wikipedia, The Free Encyclopedia. [Online]. Available: https://en.wikipedia.org/wiki/Bipartite_graph.
[9] Wikipedia, &quot;Complete graph,&quot; Wikipedia, The Free Encyclopedia. [Online]. Available: https://en.wikipedia.org/wiki/Complete_graph.
[10] Wikipedia, &quot;Adjacency matrix,&quot; Wikipedia, The Free Encyclopedia. [Online]. Available: https://en.wikipedia.org/wiki/Adjacency_matrix.
[11] Wikipedia, &quot;Adjacency list,&quot; Wikipedia, The Free Encyclopedia. [Online]. Available: https://en.wikipedia.org/wiki/Adjacency_list.
[12] Wikipedia, &quot;Breadth-first search,&quot; Wikipedia, The Free Encyclopedia. [Online]. Available: https://en.wikipedia.org/wiki/Breadth-first_search.
[13] Wikipedia, &quot;Depth-first search,&quot; Wikipedia, The Free Encyclopedia. [Online]. Available: https://en.wikipedia.org/wiki/Depth-first_search.
[14] Wikipedia, &quot;Bellman–Ford algorithm,&quot; Wikipedia, The Free Encyclopedia. [Online]. Available: https://en.wikipedia.org/wiki/Bellman%E2%80%93Ford_algorithm.
[15] Wikipedia, &quot;Floyd–Warshall algorithm,&quot; Wikipedia, The Free Encyclopedia. [Online]. Available: https://en.wikipedia.org/wiki/Floyd%E2%80%93Warshall_algorithm.
[16] Wikipedia, &quot;Strongly connected component,&quot; Wikipedia, The Free Encyclopedia. [Online]. Available: https://en.wikipedia.org/wiki/Strongly_connected_component.
[17] Wikipedia, &quot;Travelling salesman problem,&quot; Wikipedia, The Free Encyclopedia. [Online]. Available: https://en.wikipedia.org/wiki/Travelling_salesman_problem.
[18] Wikipedia, &quot;Held–Karp algorithm,&quot; Wikipedia, The Free Encyclopedia. [Online]. Available: https://en.wikipedia.org/wiki/Held%E2%80%93Karp_algorithm.
[19] Wikipedia, &quot;Bridge (graph theory),&quot; Wikipedia, The Free Encyclopedia. [Online]. Available: https://en.wikipedia.org/wiki/Bridge_(graph_theory).
[20] Wikipedia, &quot;Minimum spanning tree,&quot; Wikipedia, The Free Encyclopedia. [Online]. Available: https://en.wikipedia.org/wiki/Minimum_spanning_tree.
[21] Wikipedia, &quot;Flow network,&quot; Wikipedia, The Free Encyclopedia. [Online]. Available: https://en.wikipedia.org/wiki/Flow_network.
&lt;/code&gt;&lt;/pre&gt;
&lt;/p&gt;&lt;/p&gt;</content><author><name>Adesh Nalpet Adimurthy</name></author><category term="Code on the Road" /><category term="Graph Theory" /><category term="Data Structures" /><summary type="html">Before heading into details of how we store, represent, and traverse various kinds of graphs, this post is more of a ramp-up to better understand what graphs are and the different kinds from a computer science point of view, rather than a mathematical one. So, no proofs and equations, mostly just diagrams and implementation details, with an emphasis on how to apply graph theory to real-world applications.</summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://pyblog.xyz/assets/featured/webp/graph-theory-101.webp" /><media:content medium="image" url="https://pyblog.xyz/assets/featured/webp/graph-theory-101.webp" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Spatial Index: R Trees</title><link href="https://pyblog.xyz/spatial-index-r-tree" rel="alternate" type="text/html" title="Spatial Index: R Trees" /><published>2024-06-26T00:00:00+00:00</published><updated>2024-06-26T00:00:00+00:00</updated><id>https://pyblog.xyz/spatial-index-r-tree</id><content type="html" xml:base="https://pyblog.xyz/spatial-index-r-tree">&lt;p&gt;If you have been following the &lt;a href=&quot;https://pyblog.xyz/tags/spatial-index&quot;&gt;Spatial Index Series&lt;/a&gt;, it started with the need for multi-dimensional indexes and an introduction to &lt;a href=&quot;https://pyblog.xyz/spatial-index-space-filling-curve&quot;&gt;space-filling curves&lt;/a&gt;, followed by a deep dive into &lt;a href=&quot;https://pyblog.xyz/spatial-index-grid-system&quot;&gt;grid systems&lt;/a&gt; (GeoHash and Google S2) and &lt;a href=&quot;https://pyblog.xyz/spatial-index-tessellation&quot;&gt;tessellation&lt;/a&gt; (Uber H3).&lt;/p&gt;

&lt;p&gt;In this post, let&apos;s explore the &lt;a href=&quot;https://en.wikipedia.org/wiki/R-tree&quot; target=&quot;_blank&quot;&gt;R-Tree&lt;/a&gt; data structure (data-driven structure), which is popularly used to store multi-dimensional data, such as data points, segments, and rectangles.&lt;/p&gt;

&lt;h3&gt;1. R-Trees and Rectangles&lt;/h3&gt;

&lt;p&gt;For example, consider the layout plan of a university campus below. We can use the R-Tree data structure to index the buildings on the map.&lt;/p&gt;

&lt;p&gt;To do so, we can place rectangles around a building or group of buildings and then index them. Suppose there&apos;s a much bigger section of the map signifying a larger department, and we need to query all the buildings within a department. We can use the R-Tree to find all the buildings within (partially or fully contained) the larger section (query rectangle).&lt;/p&gt;

&lt;p&gt;&lt;img class=&quot;center-image-0 center-image&quot; src=&quot;./assets/posts/spatial-index/r-tree-campus-level-2.svg&quot; /&gt;&lt;/p&gt;
&lt;p class=&quot;figure-header&quot;&gt;Figure 0: Layout with MBRs and Query Rectangle&lt;/p&gt;

&lt;p&gt;In the above figure, the red rectangle represents the query rectangle, used to ask the R-Tree for all the buildings that intersect with it (&lt;code&gt;R2, R3, R6&lt;/code&gt;).&lt;/p&gt;

&lt;h3&gt;2. R-Tree - Intuition&lt;/h3&gt;

&lt;p&gt;The main idea in R-trees is the &lt;a href=&quot;https://en.wikipedia.org/wiki/Minimum_bounding_rectangle&quot; target=&quot;_blank&quot;&gt;minimum bounding rectangles&lt;/a&gt;. We&apos;ll come to what &quot;minimum&quot; implies in a second.&lt;/p&gt;

&lt;p&gt;An inner node of an R-tree works as follows: we start with the root node, representing the entire landscape. The inner nodes are guideposts that hold pointers to the child nodes we need to descend into, i.e., each entry of a node points to an area of the data space (described by an MBR).&lt;/p&gt;

&lt;p&gt;&lt;img class=&quot;center-image-0 center-image-70&quot; src=&quot;./assets/posts/spatial-index/r-tree-inner-node.svg&quot; /&gt;&lt;/p&gt;
&lt;p class=&quot;figure-header&quot;&gt;Figure 1: R-Tree Inner Node&lt;/p&gt;

&lt;p&gt;For instance, think of a &lt;a href=&quot;https://en.wikipedia.org/wiki/Binary_search_tree&quot; target=&quot;_blank&quot;&gt;Binary Search Tree&lt;/a&gt;. From the root node, we make a decision to go left or right. The R-tree is similar, but more of an &lt;a href=&quot;/b-tree&quot; target=&quot;_blank&quot;&gt;M-way tree&lt;/a&gt;, where each node can have multiple entries as seen above. Instead of having integer or string values (one-dimensional), the inner nodes consist of entries (multi-dimensional). In the example, there are 4 entries of rectangles.&lt;/p&gt;

&lt;h3&gt;2.1. MBR - Minimum Bounding Rectangle&lt;/h3&gt;

&lt;p&gt;&lt;img class=&quot;center-image-0 center-image-35&quot; src=&quot;./assets/posts/spatial-index/r-tree-mbr.svg&quot; /&gt;&lt;/p&gt;
&lt;p class=&quot;figure-header&quot;&gt;Figure 2: R-Tree Minimum Bounding Rectangle&lt;/p&gt;

&lt;p&gt;Minimum Bounding Rectangles, &lt;code&gt;R1, R2, R3, R4&lt;/code&gt;, contain the objects which are stored in the sub-trees in a minimal way. For instance, say we have 3 rectangles &lt;code&gt;R11, R12, R13&lt;/code&gt;. &lt;code&gt;R1&lt;/code&gt; is the smallest rectangle that can be created to completely contain all three rectangles, hence the name &quot;minimum.&quot;&lt;/p&gt;

&lt;h3&gt;2.2. Search Process and Overlapping MBRs&lt;/h3&gt;

&lt;p&gt;The search process in an R-tree is simple: given a query object/query rectangle, at each inner node the decision is to check which of the node&apos;s entries intersect with the query rectangle and descend into those sub-trees.&lt;/p&gt;

&lt;p&gt;&lt;img class=&quot;center-image-0 center-image-70&quot; src=&quot;./assets/posts/spatial-index/r-tree-query-rectangle.svg&quot; /&gt;&lt;/p&gt;
&lt;p class=&quot;figure-header&quot;&gt;Figure 3: R-Tree Query Rectangle(s)&lt;/p&gt;

&lt;p&gt;For example, consider a query rectangle &lt;code&gt;Q1&lt;/code&gt;. It&apos;s clear that &lt;code&gt;R1&lt;/code&gt; intersects with &lt;code&gt;Q1&lt;/code&gt;, so we would follow down the tree from &lt;code&gt;R1&lt;/code&gt;. Similarly, &lt;code&gt;Q2&lt;/code&gt; intersects with &lt;code&gt;R2&lt;/code&gt;. However, in scenarios where the query rectangle intersects with multiple entries/rectangles (&lt;code&gt;Q3&lt;/code&gt; with &lt;code&gt;R2, R3, R4&lt;/code&gt;), all the intersecting rectangles have to be searched. This can happen if the indexing is not optimized and has to be avoided, as it defeats the purpose of indexing in the first place.&lt;/p&gt;
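&lt;p&gt;The intersection test used at every step is a cheap axis-by-axis comparison. A minimal sketch, where the &lt;code&gt;Rect&lt;/code&gt; tuple layout is an illustrative assumption:&lt;/p&gt;

```python
from collections import namedtuple

# A rectangle as (xmin, ymin, xmax, ymax); the names are illustrative.
Rect = namedtuple("Rect", "xmin ymin xmax ymax")

def intersects(a, b):
    # Two rectangles overlap iff they overlap on both axes:
    # each one must start before the other ends, in x and in y.
    return (b.xmax >= a.xmin and a.xmax >= b.xmin
            and b.ymax >= a.ymin and a.ymax >= b.ymin)
```

&lt;p&gt;Note that rectangles that merely touch along an edge count as intersecting here, which matches the &quot;partially or fully contained&quot; semantics used above.&lt;/p&gt;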

&lt;h3&gt;2.3. R-Tree - Properties&lt;/h3&gt;

&lt;p&gt;Here&apos;s a bit of a larger example of an R-tree.&lt;/p&gt;

&lt;p&gt;&lt;img class=&quot;center-image-0 center-image-85&quot; src=&quot;./assets/posts/spatial-index/r-tree-l-3.svg&quot; /&gt;&lt;/p&gt;
&lt;p class=&quot;figure-header&quot;&gt;Figure 4: R-Tree Level-2&lt;/p&gt;

&lt;p&gt;Every node in an R-tree has between &lt;code&gt;m&lt;/code&gt; and &lt;code&gt;M&lt;/code&gt; entries, where the minimum fill factor satisfies &lt;code&gt;m ≤ ⌈M/2⌉&lt;/code&gt;. The root has at least 2 entries unless it is a leaf.&lt;/p&gt;

&lt;p&gt;By now, if you have also read the blog post on &lt;a href=&quot;/b-tree&quot; target=&quot;_blank&quot;&gt;B-Trees and B+ Trees&lt;/a&gt;, you&apos;ll see that an R-Tree is quite similar to a B+ Tree. It uses a similar idea to split the space at each (inner) node into multiple areas. However, B+ Trees mostly work with one-dimensional data, and their data ranges do not overlap.&lt;/p&gt;

&lt;h3&gt;3. Search using an R-Tree&lt;/h3&gt;

&lt;p&gt;Now that we know the idea behind R-Trees and the search process, let&apos;s put a clear-cut definition to the search process:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Goal: Find all rectangles that overlap with the given rectangle &lt;code&gt;S&lt;/code&gt; (query rectangle).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Let &lt;code&gt;T&lt;/code&gt; denote the node (at the current level/sub-tree).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;S1 (Search in sub-trees): If &lt;code&gt;T&lt;/code&gt; is not a leaf, check all the entries &lt;code&gt;E&lt;/code&gt; in &lt;code&gt;T&lt;/code&gt;. If the MBR of &lt;code&gt;E&lt;/code&gt; overlaps with &lt;code&gt;S&lt;/code&gt;, then continue the search in the sub-tree to which &lt;code&gt;E&lt;/code&gt; points.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;S2 (Search in Leaves): If &lt;code&gt;T&lt;/code&gt; is a leaf node, inspect all entries &lt;code&gt;E&lt;/code&gt; in &lt;code&gt;T&lt;/code&gt;. All entries that overlap with &lt;code&gt;S&lt;/code&gt; are part of the query result.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
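&lt;p&gt;Steps S1 and S2 translate directly into a short recursion. The sketch below assumes a simple &lt;code&gt;Node&lt;/code&gt; with a list of &lt;code&gt;(mbr, value)&lt;/code&gt; entries, where &lt;code&gt;value&lt;/code&gt; is a child node for inner nodes and a stored object for leaves (the names are illustrative):&lt;/p&gt;

```python
class Node:
    def __init__(self, is_leaf, entries):
        self.is_leaf = is_leaf
        # entries: list of (mbr, child_node) for inner nodes,
        # or (mbr, object) for leaves; mbr = (xmin, ymin, xmax, ymax).
        self.entries = entries

def overlaps(a, b):
    # Rectangles overlap iff they overlap on both the x and y axes.
    return (b[2] >= a[0] and a[2] >= b[0] and b[3] >= a[1] and a[3] >= b[1])

def search(node, query):
    results = []
    for mbr, value in node.entries:
        if overlaps(mbr, query):
            if node.is_leaf:           # S2: the entry itself is a result
                results.append(value)
            else:                      # S1: continue into the sub-tree
                results.extend(search(value, query))
    return results
```

&lt;p&gt;Unlike a B+ Tree lookup, several entries of one node may match, so the recursion can fan out into multiple sub-trees; this is exactly why overlapping MBRs hurt performance.&lt;/p&gt;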

&lt;h3&gt;4. Inserting to an R-Tree&lt;/h3&gt;

&lt;p&gt;Coming to inserts, consider a leaf node (MBR) as shown below with 3 entries/objects, &lt;code&gt;R1&lt;/code&gt;, &lt;code&gt;R2&lt;/code&gt;, and &lt;code&gt;R3&lt;/code&gt;. Let&apos;s assume that the leaf is not full yet (MBR has a threshold capacity on the number of objects it can hold).&lt;/p&gt;

&lt;p&gt;Say, there&apos;s a new rectangle &lt;code&gt;R4&lt;/code&gt; coming and it has to be inserted inside the leaf node. As you can see, in order to capture the new objects, the MBR is adjusted, i.e., enlarged to minimally contain &lt;code&gt;R1&lt;/code&gt; to &lt;code&gt;R4&lt;/code&gt;. Going on and inserting another object &lt;code&gt;R5&lt;/code&gt;, the MBR is once again adjusted.&lt;/p&gt;

&lt;p&gt;&lt;img class=&quot;center-image-0 center-image-100&quot; src=&quot;./assets/posts/spatial-index/r-tree-insert.svg&quot; /&gt;&lt;/p&gt;
&lt;p class=&quot;figure-header&quot;&gt;Figure 5: R-Tree Insert (Adjusting MBR)&lt;/p&gt;

&lt;p&gt;On an insert, when an MBR is updated, i.e., it now contains more objects, the new MBR has to be updated not only on the node itself but also propagated to the upper levels, potentially (though not always) up to the root node. This reflects that the sub-tree now contains more information.&lt;/p&gt;

&lt;h3&gt;4.1. Choice for Insert&lt;/h3&gt;

&lt;p&gt;Unlike the example, it&apos;s not always clear in which node/sub-tree an object should be inserted. Here: &lt;code&gt;MBR1&lt;/code&gt;, &lt;code&gt;MBR2&lt;/code&gt;, or &lt;code&gt;MBR3&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;img class=&quot;center-image-0 center-image-55&quot; src=&quot;./assets/posts/spatial-index/r-tree-insert-mbrs.svg&quot; /&gt;&lt;/p&gt;
&lt;p class=&quot;figure-header&quot;&gt;Figure 6: R-Tree Choice for Insert (1)&lt;/p&gt;

&lt;p&gt;The question is: which MBR should we insert &lt;code&gt;R1&lt;/code&gt; into? Setting aside any rules or justification for a second, &lt;code&gt;R1&lt;/code&gt; can be inserted into any MBR.&lt;/p&gt;

&lt;p&gt;&lt;img class=&quot;center-image-0 center-image-55&quot; src=&quot;./assets/posts/spatial-index/r-tree-insert-mbr1.svg&quot; /&gt;&lt;/p&gt;
&lt;p class=&quot;figure-header&quot;&gt;Figure 7: R-Tree Choice for Insert (2)&lt;/p&gt;

&lt;p&gt;Inserting into &lt;code&gt;MBR1&lt;/code&gt; would require &lt;code&gt;MBR1&lt;/code&gt; to grow immensely to fully contain &lt;code&gt;R1&lt;/code&gt;. The implication? Say there&apos;s a query rectangle &lt;code&gt;Q1&lt;/code&gt;. After descending the sub-tree to &lt;code&gt;MBR1&lt;/code&gt;, we find that there&apos;s nothing there (no objects). This is because, to contain &lt;code&gt;R1&lt;/code&gt;, we have expanded &lt;code&gt;MBR1&lt;/code&gt; so much that there is a lot of space without any objects. So, it&apos;s fair to conclude that one criterion is to insert into the MBR that needs to expand the least.&lt;/p&gt;

&lt;p&gt;&lt;img class=&quot;center-image-0 center-image-55&quot; src=&quot;./assets/posts/spatial-index/r-tree-insert-mbr2.svg&quot; /&gt;&lt;/p&gt;
&lt;p class=&quot;figure-header&quot;&gt;Figure 8: R-Tree Choice for Insert (3)&lt;/p&gt;

&lt;p&gt;Going by that, inserting into &lt;code&gt;MBR2&lt;/code&gt; is a better option as opposed to &lt;code&gt;MBR1&lt;/code&gt;. Similarly, &lt;code&gt;MBR3&lt;/code&gt; may not be a bad option either, depending on the expansion factor.&lt;/p&gt;

&lt;hr class=&quot;post-hr&quot; /&gt;

&lt;p&gt;Stating the obvious (for implementation), the minimum bounding rectangle (MBR) is the rectangle spanning the minimal and maximal coordinates of all contained rectangles in each dimension.&lt;/p&gt;

&lt;p&gt;&lt;img class=&quot;center-image-0 center-image-90&quot; src=&quot;./assets/posts/spatial-index/r-tree-overlap-criterion.svg&quot; /&gt;&lt;/p&gt;
&lt;p class=&quot;figure-header&quot;&gt;Figure 9: R-Tree MBR Implementation&lt;/p&gt;
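&lt;p&gt;In code, this is just a per-dimension min/max over the contained rectangles; a sketch assuming rectangles as &lt;code&gt;(xmin, ymin, xmax, ymax)&lt;/code&gt; tuples:&lt;/p&gt;

```python
def mbr(rects):
    # The MBR spans the minimal lower corner and the maximal
    # upper corner of all contained rectangles, per dimension.
    return (min(r[0] for r in rects),
            min(r[1] for r in rects),
            max(r[2] for r in rects),
            max(r[3] for r in rects))
```

&lt;p&gt;For example, &lt;code&gt;mbr([(0, 0, 2, 2), (1, 1, 5, 3), (-1, 0, 0, 4)])&lt;/code&gt; yields &lt;code&gt;(-1, 0, 5, 4)&lt;/code&gt;.&lt;/p&gt;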

&lt;hr class=&quot;post-hr&quot; /&gt;

&lt;p&gt;Summarizing the insertion into R-Tree so far:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;In principle, a new rectangle can be inserted into any node.&lt;/li&gt;
&lt;li&gt;If the node is full, a split needs to be performed (more on that in the next section).&lt;/li&gt;
&lt;li&gt;If not, the MBR may have to be adjusted/expanded to accommodate the new objects (as seen above).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Observations:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Extending bounding boxes is a critical factor for the performance of the R-Tree.&lt;/li&gt;
&lt;li&gt;Try to minimize overlap (of the MBRs).&lt;/li&gt;
&lt;li&gt;Try to minimize spread (the size of the MBR, as seen in section 4.1).&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;4.2. Insert - Algorithm&lt;/h3&gt;

&lt;p&gt;Here&apos;s the algorithm proposed in the original R-Tree paper, &quot;&lt;a href=&quot;https://www.researchgate.net/publication/221213205_R_Trees_A_Dynamic_Index_Structure_for_Spatial_Searching&quot; target=&quot;_blank&quot;&gt;R-Trees: A Dynamic Index Structure for Spatial Searching&lt;/a&gt;&quot; by A. Guttman, 1984.&lt;/p&gt;

&lt;p&gt;The rest of this section mostly goes over snippets and explanations from this paper, but with more examples and visualizations.&lt;/p&gt;

&lt;p&gt;Algorithm: Search for leaf to insert (&lt;a href=&quot;https://en.wikipedia.org/wiki/Hilbert_R-tree#Insertion&quot; target=&quot;_blank&quot;&gt;ChooseLeaf&lt;/a&gt;):&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;CS1: Let &lt;code&gt;N&lt;/code&gt; be the root.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;CS2:
&lt;ul&gt;
    &lt;li&gt;If &lt;code&gt;N&lt;/code&gt; is a leaf, return &lt;code&gt;N&lt;/code&gt;.&lt;/li&gt;
    &lt;li&gt;&lt;p&gt;If &lt;code&gt;N&lt;/code&gt; is not a leaf: Search for an entry in &lt;code&gt;N&lt;/code&gt; whose rectangle (MBR) requires the least area increase in order to accommodate the new rectangle. In the case where there are multiple options, consider an entry that has the smallest (in area) MBR.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;CS3: Let &lt;code&gt;N&lt;/code&gt; be the child node, then continue to step CS2 (repeat).&lt;/li&gt;
&lt;/ul&gt;
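&lt;p&gt;The ChooseLeaf steps above can be sketched as a short loop. The &lt;code&gt;Node&lt;/code&gt; layout with &lt;code&gt;(mbr, child)&lt;/code&gt; entries is an illustrative assumption; the tie-break on the smaller MBR area matches step CS2:&lt;/p&gt;

```python
def area(r):
    # r = (xmin, ymin, xmax, ymax)
    return (r[2] - r[0]) * (r[3] - r[1])

def enlargement(mbr, new):
    # Area increase needed for mbr to also cover the new rectangle.
    merged = (min(mbr[0], new[0]), min(mbr[1], new[1]),
              max(mbr[2], new[2]), max(mbr[3], new[3]))
    return area(merged) - area(mbr)

class Node:
    def __init__(self, is_leaf, entries):
        self.is_leaf = is_leaf
        self.entries = entries  # (mbr, child) or, for leaves, (mbr, object)

def choose_leaf(root, new_rect):
    # CS1-CS3: descend from the root, always picking the entry whose MBR
    # needs the least area increase; ties broken by the smallest MBR area.
    node = root
    while not node.is_leaf:
        _, child = min(node.entries,
                       key=lambda e: (enlargement(e[0], new_rect), area(e[0])))
        node = child
    return node
```

&lt;p&gt;Inserting a rectangle near an existing cluster therefore descends into that cluster&apos;s sub-tree rather than forcing a distant MBR to balloon.&lt;/p&gt;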

&lt;hr class=&quot;post-hr&quot; /&gt;

&lt;p&gt;Here&apos;s a much simpler example of 8 objects, each with one multidimensional attribute (a range/line segment on the x-axis) and one identity (color). We insert these objects one by one into an empty R-tree of degree &lt;code&gt;M = 3&lt;/code&gt; (maximum number of entries at each node) with &lt;code&gt;m = 2&lt;/code&gt; (minimum number of entries at each node).&lt;/p&gt;

&lt;p&gt;&lt;img class=&quot;center-image-0 center-image-100&quot; src=&quot;./assets/posts/spatial-index/r-tree-insert-example.svg&quot; /&gt;&lt;/p&gt;
&lt;p class=&quot;figure-header&quot;&gt;Figure 10: R-Tree Insertion Example&lt;/p&gt;

&lt;p&gt;Observation: in the case where the selected leaf is already full, a splitting operation is performed. Let&apos;s understand the overflow problem better (the split problem):&lt;/p&gt;

&lt;h3&gt;4.3. Handling Overflow&lt;/h3&gt;

&lt;p&gt;In the case a node/leaf is full and a new entry cannot be stored anymore, a split needs to be performed, just as for a B+ Tree. The difference is that the split can be done arbitrarily and not only in the middle as for a B+ Tree.&lt;/p&gt;

&lt;p&gt;&lt;img class=&quot;center-image-0 center-image-30&quot; src=&quot;./assets/posts/spatial-index/r-tree-split-problem.svg&quot; /&gt;&lt;/p&gt;
&lt;p class=&quot;figure-header&quot;&gt;Figure 11: R-Tree Insertion: Overflow&lt;/p&gt;

&lt;h3&gt;4.3.1. The Split Problem&lt;/h3&gt;
&lt;p&gt;Given &lt;code&gt;M + 1&lt;/code&gt; entries in a node (the maximum capacity per node is exceeded), how should these entries be partitioned into two subsets, one for the old node and one for the new node?&lt;/p&gt;

&lt;p&gt;To better understand the split problem, let&apos;s take a step back and consider 4 rectangles (&lt;code&gt;R1, R2, R3, R4&lt;/code&gt;) that need to be assigned to two nodes (MBRs) in a meaningful way.&lt;/p&gt;

&lt;p&gt;&lt;img class=&quot;center-image-0 center-image-90&quot; src=&quot;./assets/posts/spatial-index/r-tree-split-problem-example.svg&quot; /&gt;&lt;/p&gt;
&lt;p class=&quot;figure-header&quot;&gt;Figure 12: R-Tree Insertion: Split Problem&lt;/p&gt;

&lt;p&gt;Why is one better than the other? As mentioned before (Section 4.1), the area of expansion of the poor split is much larger compared to the good split (despite the overlap). This leads to more empty spaces in the node/MBR that do not have any objects.&lt;/p&gt;

&lt;p&gt;A realistic value for an R-Tree is &lt;code&gt;M = 50&lt;/code&gt;, which already gives on the order of &lt;code&gt;2^(M-1)&lt;/code&gt; possible splits. Hence, a naive approach that looks at all possible subsets and chooses the best one is not practical (too expensive!).&lt;/p&gt;

&lt;h3&gt;4.3.2. The Split Problem: Quadratic Cost&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Search for the split with the smallest possible total area.&lt;/li&gt;
&lt;li&gt;The cost is quadratic in &lt;code&gt;M&lt;/code&gt; and linear in the number of dimensions &lt;code&gt;d&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Idea:
&lt;ul&gt;
&lt;li&gt;Search for the pair of entries that would create the largest MBR area if placed in the same node, and put these two entries into two different nodes (the seeds).&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Then, among all remaining entries, consider the one for which the increase in MBR area has the largest possible difference between the two nodes.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Assign this entry to the node with the smaller increase. Repeat until all entries are assigned.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;img class=&quot;center-image-0 center-image-80&quot; src=&quot;./assets/posts/spatial-index/r-tree-split-quadratic.svg&quot; /&gt;&lt;/p&gt;
&lt;p class=&quot;figure-header&quot;&gt;Figure 13: R-Tree Insertion: Choosing MBR&lt;/p&gt;

&lt;p&gt;In this example, two nodes, &lt;code&gt;MBR1&lt;/code&gt; and &lt;code&gt;MBR2&lt;/code&gt;, are created. Placing &lt;code&gt;R1&lt;/code&gt; and &lt;code&gt;R2&lt;/code&gt; in the same MBR would create the largest MBR, so they seed the two nodes. &lt;code&gt;R3&lt;/code&gt; is then inserted into &lt;code&gt;MBR1&lt;/code&gt; and not &lt;code&gt;MBR2&lt;/code&gt;, as the area increase of &lt;code&gt;MBR1&lt;/code&gt; is smaller compared to &lt;code&gt;MBR2&lt;/code&gt;.&lt;/p&gt;
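&lt;p&gt;A hedged sketch of the quadratic split (Guttman&apos;s PickSeeds/PickNext), with rectangles as plain &lt;code&gt;(xmin, ymin, xmax, ymax)&lt;/code&gt; tuples; the tie-breaking here is simplified compared to the paper:&lt;/p&gt;

```python
def area(r):
    return (r[2] - r[0]) * (r[3] - r[1])

def union(a, b):
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))

def quadratic_split(rects):
    # PickSeeds: the pair that wastes the most area if grouped together.
    n = len(rects)
    best, seeds = None, (0, 1)
    for i in range(n):
        for j in range(i + 1, n):
            waste = area(union(rects[i], rects[j])) - area(rects[i]) - area(rects[j])
            if best is None or waste > best:
                best, seeds = waste, (i, j)
    g1, g2 = [rects[seeds[0]]], [rects[seeds[1]]]
    m1, m2 = g1[0], g2[0]
    remaining = [r for k, r in enumerate(rects) if k not in seeds]
    # PickNext: repeatedly take the entry with the greatest difference in
    # enlargement, and assign it to the group it enlarges less.
    while remaining:
        def cost(r):
            d1 = area(union(m1, r)) - area(m1)
            d2 = area(union(m2, r)) - area(m2)
            return abs(d1 - d2), d1, d2
        r = max(remaining, key=lambda r: cost(r)[0])
        remaining.remove(r)
        _, d1, d2 = cost(r)
        if d2 > d1:
            g1.append(r); m1 = union(m1, r)
        else:
            g2.append(r); m2 = union(m2, r)
    return g1, g2
```

&lt;p&gt;Two well-separated clusters end up in separate groups, which is exactly the &quot;good split&quot; from the earlier figure.&lt;/p&gt;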

&lt;p&gt;The method &quot;AdjustTree&quot; is called whenever a new entry is inserted. It is responsible for adapting the parent&apos;s MBR and propagating the changes bottom-up, handling both splits and changes to MBRs. In the worst case, the propagation goes all the way up to the root node.&lt;/p&gt;

&lt;h3&gt;5. R-Tree Variants&lt;/h3&gt;

&lt;p&gt;R-trees do not guarantee good worst-case performance, but generally speaking, they perform well with real-world data. Addressing this specific problem, the &lt;a href=&quot;https://en.wikipedia.org/wiki/Priority_R-tree&quot; target=&quot;_blank&quot;&gt;Priority R-tree&lt;/a&gt; is a worst-case &lt;a href=&quot;https://en.wikipedia.org/wiki/Asymptotically_optimal_algorithm&quot; target=&quot;_blank&quot;&gt;asymptotically optimal&lt;/a&gt; alternative to the R-tree; it is essentially a hybrid between a k-dimensional tree (&lt;a href=&quot;https://en.wikipedia.org/wiki/K-d_tree&quot; target=&quot;_blank&quot;&gt;k-d tree&lt;/a&gt;) and an R-tree.&lt;/p&gt;

&lt;p&gt;Another commonly used variant is the &lt;a href=&quot;https://en.wikipedia.org/wiki/R*-tree&quot; target=&quot;_blank&quot;&gt;R*-Tree&lt;/a&gt;, which uses the same algorithm as the regular R-tree for query and delete operations. However, while inserting, the R*-tree uses a combined strategy: for leaf nodes, overlap is minimized, and for inner nodes, enlargement and area are minimized, making the tree construction slightly more expensive.&lt;/p&gt;

&lt;p&gt;The &lt;a href=&quot;https://en.wikipedia.org/wiki/R%2B_tree&quot; target=&quot;_blank&quot;&gt;R+-Tree&lt;/a&gt;, on the other hand, ensures that nodes do not overlap with each other, leading to better point query performance. However, it does so by inserting an object into multiple leaves if necessary, which is a disadvantage due to duplicate entries and a larger tree size.&lt;/p&gt;

&lt;p&gt;The &lt;a href=&quot;https://en.wikipedia.org/wiki/Hilbert_R-tree&quot; target=&quot;_blank&quot;&gt;Hilbert R-Tree&lt;/a&gt; uses &lt;a href=&quot;/spatial-index-space-filling-curve&quot;&gt;space-filling curves&lt;/a&gt;, specifically the Hilbert curve, to impose a linear ordering on the data rectangles. It has two variants: packed Hilbert R-trees, suitable for static databases in which updates are very rare, and dynamic Hilbert R-trees, suitable for databases where insertions, deletions, or updates may occur in real time.&lt;/p&gt;

&lt;h3&gt;6. Conclusion&lt;/h3&gt;

&lt;p&gt;R-trees have come a long way since the first paper was published in 1984. Today, their applications span multi-dimensional indexes, computer graphics, video games, spatial data management systems, and more.&lt;/p&gt;

&lt;p&gt;On the flip side, R-trees can degrade badly with discrete data; hence, it&apos;s highly recommended to understand the data representation before using them. R-trees are also relatively slow under a very high mutation rate, i.e., when the index changes often: constructing and updating the index is expensive (due to tree rebalancing), and the structure is optimized for search operations instead. Lastly, R-trees can be a poor choice when primarily dealing with points as opposed to polygons/regions.&lt;/p&gt;

&lt;h3&gt;7. References&lt;/h3&gt;
&lt;pre style=&quot;max-height: 300px&quot;&gt;&lt;code&gt;[1] A. Guttman, &quot;A Dynamic Index Structure for Spatial Searching,&quot; presented at the ACM SIGMOD International Conference on Management of Data, 1984. [Online]. Available: https://www.researchgate.net/publication/220805321_A_Dynamic_Index_Structure_for_Spatial_Searching.
[2] &quot;R-Tree,&quot; Wikipedia. [Online]. Available: https://en.wikipedia.org/wiki/R-tree.
[3] &quot;B-Trees and B+ Trees,&quot; PyBlog. [Online]. Available: https://www.pyblog.xyz/b-trees-b-plus-trees.
[4] &quot;Spatial Index R-Tree,&quot; YouTube, https://www.youtube.com/watch?v=U0jUvvQkaFw.
&lt;/code&gt;&lt;/pre&gt;</content><author><name>Adesh Nalpet Adimurthy</name></author><category term="System Wisdom" /><category term="Database" /><category term="Spatial Index" /><summary type="html">If you have been following the Spatial Index Series, it started with the need for multi-dimensional indexes and an introduction to space-filling curves, followed by a deep dive into grid systems (GeoHash and Google S2) and tessellation (Uber H3).</summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://pyblog.xyz/assets/featured/webp/rtree-spatial-index.webp" /><media:content medium="image" url="https://pyblog.xyz/assets/featured/webp/rtree-spatial-index.webp" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Spatial Index: Tessellation</title><link href="https://pyblog.xyz/spatial-index-tessellation" rel="alternate" type="text/html" title="Spatial Index: Tessellation" /><published>2024-06-17T00:00:00+00:00</published><updated>2024-06-17T00:00:00+00:00</updated><id>https://pyblog.xyz/spatial-index-tessellation</id><content type="html" xml:base="https://pyblog.xyz/spatial-index-tessellation">&lt;p&gt;Brewing! This post is a continuation of &lt;a href=&quot;/spatial-index-grid-system&quot;&gt;Spatial Index: Grid Systems&lt;/a&gt;, where we will set the foundation for tessellation and delve into the details of &lt;a href=&quot;https://github.com/uber/h3&quot; target=&quot;_blank&quot;&gt;Uber H3&lt;/a&gt;.&lt;/p&gt;

&lt;details open=&quot;&quot;&gt;&lt;summary class=&quot;h3&quot;&gt;0. Foundation&lt;/summary&gt;
&lt;p&gt;Tessellation or tiling is the process of covering/dividing a space into smaller, non-overlapping shapes that fit together perfectly without gaps or overlaps. In spatial indexing, tessellation is used to break down the Earth&apos;s surface into manageable units for efficient data storage, querying, and analysis.&lt;/p&gt;

&lt;p&gt;The rationale behind why a geographical grid system (&lt;a href=&quot;cartograms-documentation#tessellation&quot; target=&quot;_blank&quot;&gt;Tessellation system&lt;/a&gt;) is necessary: The real world is cluttered with various geographical elements, both natural and man-made, none of which follow any consistent structure. To perform geographic algorithms or analyses on it, we need a more abstract form.&lt;/p&gt;

&lt;p&gt;Maps are a good start and are the most common abstraction, with which most people are familiar. However, maps still contain all sorts of inconsistencies. This calls for a grid system, which takes the cluttered geographic space and provides a more clean and structured mathematical space, making it much easier to perform computations and queries.&lt;/p&gt;

&lt;img class=&quot;center-image-0 center-image-100&quot; src=&quot;./assets/posts/spatial-index/h3-why-grids.png&quot; /&gt;
&lt;p class=&quot;figure-header&quot;&gt;Figure 0: Tessellated View of Halifax&lt;/p&gt;

&lt;p&gt;The primary principle of the grid is to break the space into uniform cells. These cells are the units of analysis used in geographic systems. Think of it as pixels in an image.&lt;/p&gt;

&lt;p&gt;A grid system adds a couple more layers on top of this, consisting of a series of nested grids, usually at increasingly fine resolutions. They include a way to uniquely identify any cell in the system. Other common grid systems include &lt;a href=&quot;https://en.wikipedia.org/wiki/Graticule_(cartography)&quot; target=&quot;_blank&quot;&gt;Graticule&lt;/a&gt; (latitude and longitude), &lt;a href=&quot;https://learn.microsoft.com/en-us/bingmaps/articles/bing-maps-tile-system#tile-coordinates-and-quadkeys&quot; target=&quot;_blank&quot;&gt;Quad Key&lt;/a&gt;  (Mercator projection), &lt;a href=&quot;/spatial-index-grid-system#3-geohash&quot; target=&quot;_blank&quot;&gt;Geohash&lt;/a&gt; (Equirectangular projection) and &lt;a href=&quot;/spatial-index-grid-system#4-google-s2&quot; target=&quot;_blank&quot;&gt;Google S2&lt;/a&gt; (Spherical projection).&lt;/p&gt;
&lt;/details&gt;

&lt;hr class=&quot;clear-hr&quot; /&gt;

&lt;details open=&quot;&quot;&gt;&lt;summary class=&quot;h3&quot;&gt;1. Uber H3 - Intuition&lt;/summary&gt;
&lt;p&gt;Most systems use four-sided cells (squares, rectangles, or other quadrilaterals). H3 is the grid system developed by Uber, which uses hexagonal cells as its base. It covers the space/world with hexagons and has different levels of resolution, with the smallest cells representing about &lt;code&gt;1 m²&lt;/code&gt; of space.&lt;/p&gt;

&lt;h3&gt;1.1. Why Hexagons?&lt;/h3&gt;

&lt;p&gt;Start by imposing requirements on the choice of tile, such as:&lt;/p&gt;
&lt;ul style=&quot;list-style-type:none;&quot;&gt;
&lt;li&gt;(a) Uniform shape&lt;/li&gt;
&lt;li&gt;(b) Uniform edge length&lt;/li&gt;
&lt;li&gt;(c) Uniform angles&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These requirements narrow down the options, with the most commonly used shapes being squares, equilateral triangles, and hexagons.&lt;/p&gt;

&lt;img class=&quot;center-image-0 center-image-70&quot; src=&quot;./assets/posts/spatial-index/h3-tile-options-2.svg&quot; /&gt;
&lt;p class=&quot;figure-header&quot;&gt;Figure 1: Triangle vs Square vs Hexagon (neighbors)&lt;/p&gt;

&lt;p&gt;Another important property of tiles is uniform adjacency, i.e., how unambiguous the neighbors are. For example, squares have 4 unambiguous neighbors but also have 4 ambiguous neighbors at the corners, which may not provide the best perception of neighbors if you consider a circular radius.&lt;/p&gt; 

&lt;p&gt;Equilateral triangles are much worse, with 3 unambiguous neighbors and 9 ambiguous neighbors, which is one of the reasons why triangles are not commonly used, along with the rotation of cells necessary for tessellation. Lastly, hexagons are the best, with 6 unambiguous neighbors and a structure very close to finding neighbors by radius.&lt;/p&gt;

&lt;img class=&quot;center-image-0 center-image&quot; src=&quot;./assets/posts/spatial-index/hex-square-tessellation.png&quot; /&gt;
&lt;p class=&quot;figure-header&quot;&gt;Figure 2: Square vs Hexagon (Optimal Space-Filling)&lt;/p&gt;

&lt;p&gt;Hexagons are more space-efficient and have optimal space-filling properties. This means that when filling a polygon with uniform cells, hexagons generally result in less over/under filling compared to squares.&lt;/p&gt;

&lt;img class=&quot;center-image-0 center-image-50&quot; src=&quot;./assets/posts/spatial-index/h3-tile-options-3.svg&quot; /&gt;
&lt;p class=&quot;figure-header&quot;&gt;Figure 3: Square vs Hexagon (Child Containment)&lt;/p&gt;

&lt;p&gt;Hierarchical relationships between resolutions are another important property. Squares have clean hierarchical relationships with perfect child containment and can use algorithms such as quad trees to navigate up and down the hierarchy and space-filling curves to traverse the grid. Hexagons, while not having perfect child containment, can still function effectively with a tolerable margin of error.&lt;/p&gt;

&lt;p&gt;Setting triangles aside, here is a summary of the comparison between squares and hexagons:&lt;/p&gt;

&lt;img class=&quot;center-image-0 center-image-50&quot; src=&quot;./assets/posts/spatial-index/h3-tile-options.svg&quot; /&gt;
&lt;p class=&quot;figure-header&quot;&gt;Figure 4: Squares vs Hexagons (Full Comparison)&lt;/p&gt;

&lt;p&gt;More on Hexagons vs Squares at &lt;a href=&quot;/cartograms-documentation#hexagonsvssquares&quot;&gt;Conceptualization of a Cartogram&lt;/a&gt;&lt;/p&gt;

&lt;hr class=&quot;sub-hr&quot; /&gt;

&lt;h3&gt;1.2. Why Icosahedron?&lt;/h3&gt;

&lt;p&gt;Lastly, low shape and area distortion depends more on the projection than on the shape of the tile. There are many types of projections; grid systems most commonly project onto polyhedra. An alternative is the &lt;a href=&quot;/spatial-index-grid-system#3-1-geohash-intuition&quot;&gt;cylindrical projection&lt;/a&gt;, used in &lt;a href=&quot;/spatial-index-grid-system#3-geohash&quot;&gt;Geohash&lt;/a&gt;, which works well for squares but suffers from distortion near the poles, making it hard to get cells of equal surface area across the projection.&lt;/p&gt;

&lt;img class=&quot;center-image-0 center-image-80&quot; src=&quot;./assets/posts/spatial-index/uniform-shape-polyhedrons.png&quot; /&gt;
&lt;p class=&quot;figure-header&quot;&gt;Figure 5: Uniform Shape Polyhedrons&lt;/p&gt;

&lt;p&gt;The smaller the face, the smaller the distortion. An icosahedron, with 20 faces, is the best option among the uniform-face polyhedra for fitting hexagons and triangles. Fitting squares on an icosahedron or even a tetrahedron is not ideal; squares are mostly suitable for cubes (as seen in &lt;a href=&quot;/spatial-index-grid-system#4-google-s2&quot;&gt;S2&lt;/a&gt;). Taking the best of both worlds, an icosahedron with hexagons is the way to go.&lt;/p&gt;

&lt;h3&gt;1.3. H3 Grid System&lt;/h3&gt;

&lt;p&gt;Putting it all together, we take the polyhedron, the &lt;a href=&quot;https://en.wikipedia.org/wiki/Icosahedron&quot; target=&quot;_blank&quot;&gt;icosahedron&lt;/a&gt;, project it onto the surface of the Earth, and split each face into hexagonal cells. More specifically, 4 full hexagon cells are completely contained by a face, 3 cells are half contained, and the 3 corners each contribute a slice of a vertex-centered pentagon.&lt;/p&gt;

&lt;img class=&quot;center-image-0 center-image-70&quot; src=&quot;./assets/posts/spatial-index/h3-tessellation.svg&quot; /&gt;
&lt;p&gt;Each hexagonal cell can be further subdivided into 7 hexagon cells with marginal error for containment. The number of levels decides the resolution.&lt;/p&gt;
&lt;img class=&quot;center-image-0 center-image&quot; src=&quot;./assets/posts/spatial-index/h3-tessellation-2.svg&quot; /&gt;
&lt;p class=&quot;figure-header&quot;&gt;Figure 6: H3 Projection and Tessellation&lt;/p&gt;

&lt;p&gt;The H3 grid system divides the surface of the Earth into &lt;code&gt;122&lt;/code&gt; (110 hexagons and 12 icosahedron vertex-centered pentagons) base cells (resolution 0), which are used as the foundation for higher resolution cells. Each base cell has a specific orientation relative to the face of the icosahedron it is on. This orientation determines how cells at higher resolutions are positioned and indexed.&lt;/p&gt;
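&lt;p&gt;Since each of the 122 base cells subdivides with aperture 7 (and each of the 12 pentagons has one fewer child than a hexagon), the total number of cells at resolution &lt;code&gt;r&lt;/code&gt; works out to the closed form &lt;code&gt;2 + 120·7^r&lt;/code&gt;. A minimal sketch to tabulate this (the class and method names are illustrative, not part of the H3 library):&lt;/p&gt;

```java
public class H3CellCounts {
    // Total number of H3 cells at a given resolution:
    // 122 base cells at resolution 0 (110 hexagons + 12 pentagons);
    // each refinement step yields the closed form 2 + 120 * 7^r.
    static long cellCount(int res) {
        long pow7 = 1;
        for (int i = 0; i < res; i++) pow7 *= 7;
        return 2 + 120 * pow7;
    }

    public static void main(String[] args) {
        for (int r = 0; r <= 15; r++) {
            System.out.println("res " + r + ": " + cellCount(r) + " cells");
        }
    }
}
```

&lt;p&gt;For resolution 0 this gives the 122 base cells; resolution 15 works out to roughly &lt;code&gt;5.7 × 10^14&lt;/code&gt; cells, comfortably within a 64-bit integer.&lt;/p&gt;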

&lt;h3&gt;1.4. Why Pentagons?&lt;/h3&gt;

&lt;p&gt;Looking at the icosahedron, 5 faces come together at every vertex, and truncating a vertex creates a pentagonal base cell. Pentagons are unavoidable at the vertices; however, there are only 12 of them at every resolution. And for most use cases dealing with spaces within a city, where the resolution is higher than 9, the pentagons sit far off in the water and are safe to ignore.&lt;/p&gt;

&lt;img class=&quot;center-image-0 center-image-80&quot; src=&quot;./assets/posts/spatial-index/dymaxion-layout.svg&quot; /&gt;
&lt;p class=&quot;figure-header&quot;&gt;Figure 7: Dymaxion layout (12 Vertices in Water)&lt;/p&gt;

&lt;p&gt;While the layout of the faces on the icosahedron can be done in any fashion, H3 uses the layout developed by Buckminster Fuller called the &lt;a href=&quot;https://en.wikipedia.org/wiki/Dymaxion_map&quot; target=&quot;_blank&quot;&gt;Dymaxion layout&lt;/a&gt;.&lt;/p&gt;

&lt;img class=&quot;center-image-0 center-image-30&quot; src=&quot;./assets/posts/spatial-index/h3-tessellation.gif&quot; /&gt;
&lt;p class=&quot;figure-header&quot;&gt;Figure 8: H3 Projection and Tessellation (Animated)&lt;/p&gt;

&lt;p&gt;The benefit is that all the vertices end up in the water. For most applications, land is more important than water, and since the vertices are in the water, it reduces the need to deal with pentagons.&lt;/p&gt;

&lt;h3&gt;1.5. Cell ID&lt;/h3&gt;
&lt;p&gt;A cell ID is a 64-bit integer that uniquely identifies a hexagonal cell at a particular resolution. The composition of an H3 cell ID is as follows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Mode (4 bits): Identifies the H3 mode, which indicates the type of the identifier. For cell IDs, this value is set to 1.&lt;/li&gt;
&lt;li&gt;Edge Mode (Reserved, 3 bits): Indicates the edge mode, which is 0 for cell IDs.&lt;/li&gt;
&lt;li&gt;Resolution (4 bits): Specifies the resolution of the cell. H3 supports resolutions from 0 (coarsest) to 15 (finest).&lt;/li&gt;
&lt;li&gt;Base Cell (7 bits): Identifies the base cell, which is one of the 122 base cells that form the foundation of the H3 grid.&lt;/li&gt;
&lt;li&gt;Cell Index (45 bits): Contains the specific index of the cell within the base cell and resolution.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This structure (&lt;a href=&quot;#2-5-faceijk-to-h3-index&quot;&gt;Figure 14&lt;/a&gt;) allows H3 to efficiently encode the hierarchical location and resolution of each hexagonal cell in a compact 64-bit integer.&lt;/p&gt;
&lt;/details&gt;

&lt;hr class=&quot;clear-hr&quot; /&gt;

&lt;details open=&quot;&quot;&gt;&lt;summary class=&quot;h3&quot;&gt;2. H3 - Implementation&lt;/summary&gt;

&lt;p&gt;The implementation below loosely follows the steps of the actual H3 index calculation, for demonstration purposes (to better understand the H3 index). Here&apos;s a step-by-step process with reasonable simplifications:&lt;/p&gt;

&lt;h3&gt;2.1. LatLong to Vec3D&lt;/h3&gt;
&lt;p&gt;Convert latitude and longitude to &lt;a href=&quot;https://en.wikipedia.org/wiki/Cartesian_coordinate_system&quot; target=&quot;_blank&quot;&gt;3D Cartesian coordinates&lt;/a&gt; using the formulas (similar to Section &lt;a href=&quot;/spatial-index-grid-system#4-2-1-lat-long-to-x-y-z-&quot;&gt;4.2.1 in S2&lt;/a&gt;):&lt;/p&gt;

&lt;img class=&quot;center-image-0 center-image-80&quot; src=&quot;./assets/posts/spatial-index/ecef.svg&quot; /&gt; 
&lt;p class=&quot;figure-header&quot;&gt;Figure 9: (lat, long) to (x, y, z) Transformation&lt;/p&gt;

&lt;details class=&quot;code-container&quot; open=&quot;&quot;&gt;&lt;summary class=&quot;p&quot;&gt;2.1a. LatLong to Vec3D - Snippet&lt;/summary&gt;
&lt;pre&gt;&lt;code&gt;private static double[] latLonToVec3D(double lat, double lon) {
    double r = Math.cos(Math.toRadians(lat));
    double x = r * Math.cos(Math.toRadians(lon));
    double y = r * Math.sin(Math.toRadians(lon));
    double z = Math.sin(Math.toRadians(lat));
    return new double[]{x, y, z};
}
&lt;/code&gt;&lt;/pre&gt;
&lt;/details&gt;

&lt;hr class=&quot;hr&quot; /&gt;

&lt;h3&gt;2.2. Icosahedron Properties&lt;/h3&gt;
&lt;p&gt;We can identify the &lt;code&gt;12&lt;/code&gt; vertices of the icosahedron using the &lt;a href=&quot;https://en.wikipedia.org/wiki/Golden_ratio&quot; target=&quot;_blank&quot;&gt;golden ratio&lt;/a&gt; &lt;code&gt;(ϕ)&lt;/code&gt;. It is a well-known property of a regular icosahedron that three mutually perpendicular rectangles of aspect ratio &lt;code&gt;(ϕ)&lt;/code&gt; can be arranged to share a common center, with the icosahedron&apos;s vertices at their corners.&lt;/p&gt;

&lt;p&gt;The icosahedron has 12 vertices, 20 faces, and 30 edges. The 12 vertices are given by: &lt;code&gt;(±1, ±ϕ, 0)&lt;/code&gt;, &lt;code&gt;(±ϕ, 0, ±1)&lt;/code&gt;, &lt;code&gt;(0, ±1, ±ϕ)&lt;/code&gt;. Lastly, the vertices need to be normalized to lie on the surface of a unit sphere.&lt;/p&gt;

&lt;img class=&quot;center-image-0 center-image&quot; src=&quot;./assets/posts/spatial-index/golden-ratio.svg&quot; /&gt; 
&lt;p class=&quot;figure-header&quot;&gt;Figure 10: Golden Ratio Rectangles&lt;/p&gt;

&lt;p&gt;To calculate the &lt;code&gt;20&lt;/code&gt; face centers of the icosahedron:&lt;/p&gt;
&lt;p&gt;For each face, average the coordinates of its three vertices and normalize the resulting vector to lie on the unit sphere. Use the formula:&lt;/p&gt;

&lt;img class=&quot;center-image-0 center-image-65&quot; src=&quot;./assets/posts/spatial-index/face-centers.svg&quot; /&gt; 
&lt;p class=&quot;figure-header&quot;&gt;Figure 11: Icosahedron Face Center&lt;/p&gt;

&lt;details class=&quot;code-container&quot;&gt;&lt;summary class=&quot;p&quot;&gt;2.2a. Icosahedron Vertices - Snippet&lt;/summary&gt;
&lt;pre&gt;&lt;code&gt;double PHI = (1.0 + Math.sqrt(5.0)) / 2.0;
double[][] vertices = {
        {-1, PHI, 0}, {1, PHI, 0}, {-1, -PHI, 0}, {1, -PHI, 0},
        {0, -1, PHI}, {0, 1, PHI}, {0, -1, -PHI}, {0, 1, -PHI},
        {PHI, 0, -1}, {PHI, 0, 1}, {-PHI, 0, -1}, {-PHI, 0, 1}
};

// Normalize the vertices to lie on the unit sphere
for (int i = 0; i &amp;lt; vertices.length; i++) {
    vertices[i] = normalize(vertices[i]);
}
&lt;/code&gt;&lt;/pre&gt;

&lt;pre&gt;&lt;code&gt;// Computes the center of a face defined by three vertices.
private static double[] computeFaceCenter(double[] a, double[] b, double[] c) {
    double[] center = new double[3];
    center[0] = (a[0] + b[0] + c[0]) / 3.0;
    center[1] = (a[1] + b[1] + c[1]) / 3.0;
    center[2] = (a[2] + b[2] + c[2]) / 3.0;
    return normalize(center);
}
&lt;/code&gt;&lt;/pre&gt;

&lt;pre&gt;&lt;code&gt;// Normalizes a vector to lie on the unit sphere.
private static double[] normalize(double[] v) {
    double length = Math.sqrt(v[0] * v[0] + v[1] * v[1] + v[2] * v[2]);
    return new double[]{v[0] / length, v[1] / length, v[2] / length};
}
&lt;/code&gt;&lt;/pre&gt;
&lt;/details&gt;

&lt;hr class=&quot;hr&quot; /&gt;

&lt;h3&gt;2.3. Vec3D to Vec2D&lt;/h3&gt;
&lt;p&gt;The &lt;code&gt;Vec2D&lt;/code&gt; represents the cartesian coordinates on the face of the icosahedron. It provides a 2D projection (&lt;a href=&quot;#1-4-why-pentagons-&quot;&gt;Figure 7&lt;/a&gt;) of a point on the spherical surface of the Earth onto one of the icosahedron&apos;s faces, used to map geographic coordinates (latitude and longitude) onto a planar hexagonal grid. The conversion involves &lt;a href=&quot;https://en.wikipedia.org/wiki/Gnomonic_projection&quot; target=&quot;_blank&quot;&gt;gnomonic projection&lt;/a&gt;, which translates 3D coordinates to a 2D plane by projecting from the center of the sphere to the plane tangent to the face of the icosahedron.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Calculate &lt;code&gt;r&lt;/code&gt; (Radial Distance): Convert the distance from the face center to an angle using the inverse cosine function.&lt;/li&gt;
&lt;li&gt;Gnomonic Scaling: Scale the angle &lt;code&gt;r&lt;/code&gt; for the hexagonal grid at the given resolution.&lt;/li&gt;
&lt;li&gt;Calculate θ (Azimuthal Angle): Determine the angle from the face center, adjusting for face orientation and resolution.&lt;/li&gt;
&lt;li&gt;Convert to local 2D Coordinates: Transform polar coordinates &lt;code&gt;(r, θ)&lt;/code&gt; into Cartesian coordinates &lt;code&gt;(x, y)&lt;/code&gt;.&lt;/li&gt;   
&lt;/ul&gt;

&lt;img class=&quot;center-image-0 center-image&quot; src=&quot;./assets/posts/spatial-index/h3-to-vec2d.svg&quot; /&gt; 
&lt;p class=&quot;figure-header&quot;&gt;Figure 12: Gnomonic Projection (XYZ to rθ)&lt;/p&gt;

&lt;details class=&quot;code-container&quot;&gt;&lt;summary class=&quot;p&quot;&gt;2.3a. Vec3D to Vec2D - Snippet&lt;/summary&gt;
&lt;pre&gt;&lt;code&gt;// faceAxesAzRadsCII: Icosahedron face `ijk` axes as azimuth in radians from face center to vertex
// faceCenterGeo: Icosahedron face centers in lat/lng radians.
// RES0_U_GNOMONIC: Scaling factor from `Vec2d` resolution 0 unit length (or distance between adjacent cell center points on the plane) to gnomonic unit length.
// SQRT7_POWERS: Power of √7 for each resolution.
// AP7_ROT_RADS: Rotation angle between Class II and Class III resolution axes: asin(sqrt(3/28))

public Vec2d toVec2d(int resolution, int face, double distance) {
    // cos(r) = 1 - 2 * sin^2(r/2) = 1 - 2 * (sqd / 4) = 1 - sqd/2
    double r = acos(1.0 - distance / 2.0);
    if (r &amp;lt; EPSILON) {
        return new Vec2d(0.0, 0.0);
    }
    
    // Perform gnomonic scaling of `r` (`tan(r)`) and scale for current
    r = (tan(r) / RES0_U_GNOMONIC) * SQRT7_POWERS[resolution];
    
    // Compute counter-clockwise `theta` from Class II i-axis.
    double theta = faceAxesAzRadsCII[face][0] - this.azimuth(faceCenterGeo[face]);
    
    // Adjust `theta` for Class III.
    if ((resolution % 2) != 0) {
        theta -= AP7_ROT_RADS;
    }
    
    // Convert to local x, y.
    return new Vec2d(r * cos(theta), r * sin(theta));
}
&lt;/code&gt;&lt;/pre&gt;
&lt;/details&gt;

&lt;p&gt;About &lt;code&gt;SQRT7_POWERS&lt;/code&gt;: each resolution beyond 0 is created using an aperture-7 resolution spacing, i.e. each cell is subdivided into 7 cells at the next finer resolution (&lt;a href=&quot;#1-uber-h3-intuition&quot;&gt;Figures 1 and 3&lt;/a&gt;). So, as resolution increases, the unit length is scaled by &lt;code&gt;sqrt(7)&lt;/code&gt;. H3 has 15 resolutions/levels beyond resolution 0, i.e. 16 in total.&lt;/p&gt;
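&lt;p&gt;As a quick sketch (assuming only that the table holds &lt;code&gt;√7&lt;/code&gt; raised to each resolution), &lt;code&gt;SQRT7_POWERS&lt;/code&gt; can be precomputed once for all 16 resolutions:&lt;/p&gt;

```java
public class Sqrt7Powers {
    // sqrt(7)^r for resolutions 0..15: the unit-length scaling factor
    // between resolution 0 and resolution r in an aperture-7 grid.
    static final double[] SQRT7_POWERS = new double[16];
    static {
        SQRT7_POWERS[0] = 1.0;
        for (int r = 1; r < SQRT7_POWERS.length; r++) {
            SQRT7_POWERS[r] = SQRT7_POWERS[r - 1] * Math.sqrt(7.0);
        }
    }

    public static void main(String[] args) {
        // every two resolutions, the scale grows by a factor of 7
        System.out.println(SQRT7_POWERS[2] / SQRT7_POWERS[0]);
    }
}
```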

&lt;hr class=&quot;hr&quot; /&gt;

&lt;h3&gt;2.4. Vec2D to FaceIJK&lt;/h3&gt;
&lt;p&gt;Hexagonal grids have three primary axes, unlike the two we have for square grids. In &lt;a href=&quot;https://www.redblobgames.com/grids/hexagons/#coordinates&quot; target=&quot;_blank&quot;&gt;Axial&lt;/a&gt; or Cube coordinates, the three coordinates (i, j, k) ensure that any point in the hexagonal grid can be described without ambiguity.&lt;/p&gt;

&lt;img class=&quot;center-image-0 center-image-70&quot; src=&quot;./assets/posts/spatial-index/h3-axial.png&quot; /&gt; 
&lt;p class=&quot;figure-header&quot;&gt;Figure 13: Axial Coordinates (Class II and Class III)&lt;/p&gt;

&lt;p&gt;There are several other hex coordinate systems; in this case, the constraint is &lt;code&gt;i + j + k = 0&lt;/code&gt;, with &lt;code&gt;120°&lt;/code&gt; separation between the axes.&lt;/p&gt;
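&lt;p&gt;One handy consequence of the &lt;code&gt;i + j + k = 0&lt;/code&gt; constraint (a standard cube-coordinate property, shown here purely for illustration, not part of H3&apos;s indexing pipeline): the grid distance between two hexagons is half the Manhattan distance between their coordinate triples.&lt;/p&gt;

```java
public class HexGridDistance {
    // Grid distance between two hexagons in cube coordinates,
    // where each triple satisfies i + j + k = 0:
    // half the L1 (Manhattan) distance between the triples.
    static int distance(int i1, int j1, int k1, int i2, int j2, int k2) {
        return (Math.abs(i1 - i2) + Math.abs(j1 - j2) + Math.abs(k1 - k2)) / 2;
    }

    public static void main(String[] args) {
        // an immediate neighbor along the i-axis is at distance 1
        System.out.println(distance(0, 0, 0, 1, -1, 0)); // 1
        // two steps away in the grid
        System.out.println(distance(0, 0, 0, 2, -1, -1)); // 2
    }
}
```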

&lt;p&gt;The &lt;code&gt;faceIJK&lt;/code&gt; represents the position/location of a hexagon within a face of the icosahedron using three coordinates &lt;code&gt;(i, j, k)&lt;/code&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Reverse Conversion: Translate Cartesian coordinates into the hexagonal coordinate system by aligning them with the hex grid&apos;s axes.&lt;/li&gt;
&lt;li&gt;Quantize and Round: Convert floating-point coordinates to integer grid positions, determining the closest hexagon center.&lt;/li&gt;
&lt;/ul&gt;
&lt;img class=&quot;center-image-0 center-image-70&quot; src=&quot;./assets/posts/spatial-index/h3-vec2d-facexyz.svg&quot; /&gt; 
&lt;ul&gt;
&lt;li&gt;Check Hex Center and Round: Use remainders to accurately determine which hexagon the point falls into by rounding to the nearest hex center.&lt;/li&gt;
&lt;pre&gt;&lt;code&gt;// Determine i and j based on r1 and r2
IF r1 &amp;lt; 0.5 THEN
    IF r1 &amp;lt; 1 / 3 THEN
        i = m1
        j = m2 + (r2 &amp;gt;= (1 + r1) / 2)
    ELSE
        i = m1 + ((1 - r1) &amp;lt;= r2 &amp;amp;&amp;amp; r2 &amp;lt; (2 * r1))
        j = m2 + (r2 &amp;gt;= (1 - r1))
ELSE IF r1 &amp;lt; 2 / 3 THEN
    j = m2 + (r2 &amp;gt;= (1 - r1))
    i = m1 + ((2 * r1 - 1) &amp;gt;= r2 || r2 &amp;gt;= (1 - r1))
ELSE
    i = m1 + 1
    j = m2 + (r2 &amp;gt;= (r1 / 2))
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;/p&gt;
&lt;li&gt;Fold Across Axes if Necessary: Correct the coordinates if they fall into negative regions, ensuring the coordinates remain within the valid grid.&lt;/li&gt;
&lt;pre&gt;&lt;code&gt;IF value.x &amp;lt; 0 THEN
    offset = j % 2
    axis_i = (j + offset) / 2
    diff = i - axis_i
    i = i - 2 * diff - offset

IF value.y &amp;lt; 0 THEN
    i = i - (2 * j + 1) / 2
    j = -j
&lt;/code&gt;&lt;/pre&gt;
&lt;li&gt;Normalize: Adjust the coordinates to maintain the properties of the hexagonal grid, ensuring &lt;code&gt;i + j + k = 0&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;details class=&quot;code-container&quot;&gt;&lt;summary class=&quot;p&quot;&gt;2.4a. Vec2D to FaceIJK - Snippet&lt;/summary&gt;
&lt;pre&gt;&lt;code&gt;public static CoordIJK fromVec2d(Vec2d value) {
    int k = 0;

    double a1 = Math.abs(value.x);
    double a2 = Math.abs(value.y);

    // Reverse conversion
    double x2 = a2 / SIN60;
    double x1 = a1 + x2 / 2.0;

    // Quantize and round
    int m1 = (int) x1;
    int m2 = (int) x2;

    double r1 = x1 - m1;
    double r2 = x2 - m2;

    int i, j;
    if (r1 &amp;lt; 0.5) {
        if (r1 &amp;lt; 1.0 / 3.0) {
            i = m1;
            j = m2 + (r2 &amp;gt;= (1.0 + r1) / 2.0 ? 1 : 0);
        } else {
            i = m1 + ((1.0 - r1) &amp;lt;= r2 &amp;amp;&amp;amp; r2 &amp;lt; (2.0 * r1) ? 1 : 0);
            j = m2 + (r2 &amp;gt;= (1.0 - r1) ? 1 : 0);
        }
    } else if (r1 &amp;lt; 2.0 / 3.0) {
        j = m2 + (r2 &amp;gt;= (1.0 - r1) ? 1 : 0);
        i = m1 + ((2.0 * r1 - 1.0) &amp;gt;= r2 || r2 &amp;gt;= (1.0 - r1) ? 1 : 0);
    } else {
        i = m1 + 1;
        j = m2 + (r2 &amp;gt;= (r1 / 2.0) ? 1 : 0);
    }

    // Fold Across Axes if Necessary
    if (value.x &amp;lt; 0) {
        int offset = j % 2;
        int axis_i = (j + offset) / 2;
        int diff = i - axis_i;
        i = i - 2 * diff - offset;
    }

    if (value.y &amp;lt; 0) {
        i = i - (2 * j + 1) / 2;
        j = -j;
    }

    return new CoordIJK(i, j, k).normalize();
}
&lt;/code&gt;&lt;/pre&gt;
&lt;/details&gt;

&lt;p&gt;Each grid resolution is rotated &lt;code&gt;~19.1°&lt;/code&gt; relative to the next coarser resolution. The rotation alternates between counterclockwise (CCW) and clockwise (CW) at each successive resolution, so that each resolution will have one of two possible orientations as shown in Figure 13: &lt;code&gt;Class II&lt;/code&gt; or &lt;code&gt;Class III&lt;/code&gt;. The base cells, which make up resolution 0, are &lt;code&gt;Class II&lt;/code&gt;.&lt;/p&gt;

&lt;hr class=&quot;hr&quot; /&gt;

&lt;h3&gt;2.5. FaceIJK to H3 Index&lt;/h3&gt;
&lt;p&gt;Lastly, the &lt;a href=&quot;https://h3geo.org/docs/core-library/latLngToCellDesc&quot; target=&quot;_blank&quot;&gt;face and face-centered ijk coordinates are converted to H3 Index&lt;/a&gt;.&lt;/p&gt; 

&lt;img class=&quot;center-image-0 center-image-100&quot; src=&quot;./assets/posts/spatial-index/h3-index-structure.svg&quot; /&gt; 
&lt;p class=&quot;figure-header&quot;&gt;Figure 14: H3 Index Structure&lt;/p&gt;

&lt;p&gt;If the cell&apos;s resolution is below 15, the remaining digit bits are set to 1s, for example: &lt;code&gt;83001dfffffffff&lt;/code&gt;. The binary representation is as below (Figure 15): &lt;code&gt;Index mode = 1&lt;/code&gt;, i.e. it indexes the regular hexagon type; Resolution = 3; Base Cell = 0; the digits at resolutions 1, 2 and 3 are 0, 3 and 5, and the rest are all 1s.&lt;/p&gt;
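&lt;p&gt;To make the example concrete, the fields can be unpacked with plain bit shifts following the layout above (45 digit bits, then 7 base-cell bits, 4 resolution bits, 3 reserved bits, and 4 mode bits). This is a small sketch, not the library&apos;s API:&lt;/p&gt;

```java
public class H3BitFields {
    // Field offsets per the H3 cell index layout:
    // bits 0-44: fifteen 3-bit resolution digits, bits 45-51: base cell,
    // bits 52-55: resolution, bits 56-58: reserved, bits 59-62: mode.
    static int mode(long h)       { return (int) ((h >>> 59) & 0xF); }
    static int resolution(long h) { return (int) ((h >>> 52) & 0xF); }
    static int baseCell(long h)   { return (int) ((h >>> 45) & 0x7F); }

    // 3-bit digit for resolution r (1-based); resolution 1 occupies
    // the highest digit slot, resolution 15 the lowest.
    static int digit(long h, int r) { return (int) ((h >>> (3 * (15 - r))) & 0x7); }

    public static void main(String[] args) {
        long h = 0x83001dfffffffffL; // the example index from Figure 15
        System.out.println(mode(h));       // 1 (cell index)
        System.out.println(resolution(h)); // 3
        System.out.println(baseCell(h));   // 0
        System.out.println(digit(h, 1) + " " + digit(h, 2) + " " + digit(h, 3)); // 0 3 5
        System.out.println(digit(h, 4));   // 7: unused digits are all 1s
    }
}
```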

&lt;img class=&quot;center-image-0 center-image-100&quot; src=&quot;./assets/posts/spatial-index/h3-index-structure-example.svg&quot; /&gt; 
&lt;p class=&quot;figure-header&quot;&gt;Figure 15: H3 Index Structure (Example: 83001dfffffffff)&lt;/p&gt;

&lt;p&gt;This primarily involves converting to Direction bits, representing the hierarchical path from a base cell to a specific cell at a given resolution. These bits encode the sequence of directional steps taken within the hexagonal grid to reach the target cell from the base cell.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Handle Base Cell: If the resolution is 0 (base cell), directly set the base cell in the index.&lt;/li&gt;
&lt;pre&gt;&lt;code&gt;// Convert IJK to Direction Bits
faceIJK.coord = directions_bits_from_ijk(faceIJK.coord, resolution)

// Set the Base Cell
base_cell = get_base_cell(faceIJK)
bits = set_base_cell(bits, base_cell)
&lt;/code&gt;&lt;/pre&gt;
&lt;li&gt;Build from Finest Resolution Up and Set Base Cell: Convert IJK coordinates to direction bits starting from the finest resolution (r), updating the index progressively. Identify and set the correct base cell for the given IJK coordinates.&lt;/li&gt;
&lt;pre&gt;&lt;code&gt;// Handle Pentagon Cells
IF base_cell.is_pentagon() THEN
    IF first_axe(bits) == Direction.K THEN
        // Check for a CW/CCW offset face (default is CCW).
        IF base_cell.is_cw_offset(faceIJK.face) THEN
            bits = rotate60(bits, 1, CW)
        ELSE
            bits = rotate60(bits, 1, CCW)
        END IF
    END IF
    FOR i = 0 TO rotation_count DO
        bits = pentagon_rotate60(bits, CCW)
    END FOR
ELSE
    bits = rotate60(bits, rotation_count, CCW)
END IF
&lt;/code&gt;&lt;/pre&gt;
&lt;li&gt;Handle Pentagon Cells: Apply necessary rotations if the base cell is a pentagon to ensure the correct orientation and avoid the missing k-axes subsequence (if the direction bits indicate a move along the &lt;code&gt;k-axis&lt;/code&gt;).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Since each base cell can be oriented differently (&lt;a href=&quot;#1-3-h3-grid-system&quot;&gt;Section 1.3&lt;/a&gt;) on the icosahedron&apos;s faces, rotations are needed to standardize these orientations. &lt;code&gt;rotation_count&lt;/code&gt; refers to the number of 60-degree rotations that need to be applied to the H3 cell index to align it with the canonical orientation of the base cell (also &lt;a href=&quot;https://h3geo.org/docs/core-library/latLngToCellDesc&quot; target=&quot;_blank&quot;&gt;refer&lt;/a&gt;).&lt;/p&gt;


&lt;hr class=&quot;hr&quot; /&gt;

&lt;h3&gt;2.6. Official H3 library&lt;/h3&gt;
&lt;p&gt;Here&apos;s a Java snippet using the official H3 library provided by Uber:&lt;/p&gt;
&lt;details open=&quot;&quot; class=&quot;code-container&quot;&gt;&lt;summary class=&quot;p&quot;&gt;2.6a. Official H3 - Snippet&lt;/summary&gt;
&lt;pre&gt;&lt;code&gt;import com.uber.h3core.H3Core;

public class H3Index {
    public static void main(String[] args) throws Exception {
        H3Core h3 = H3Core.newInstance();
        double lat = 37.7749;
        double lon = -122.4194;
        int resolution = 9;

        long h3Index = h3.geoToH3(lat, lon, resolution);
        System.out.println(Long.toHexString(h3Index));
    }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;/details&gt;


&lt;/details&gt;

&lt;hr class=&quot;clear-hr&quot; /&gt;

&lt;details open=&quot;&quot;&gt;&lt;summary class=&quot;h3&quot;&gt;3. H3 - Conclusion&lt;/summary&gt;
&lt;p&gt;So far, in the Spatial Index Series, we have seen the use of space-filling curves and their application in grid systems like Geohash and S2. Finally, we explored Uber&apos;s H3, which falls under grid systems and, more specifically, relies on tessellation. By now, it&apos;s likely clear that H3 indexes are not directly queryable on the database by ranges or prefixes; instead, they shine at accurately filling a polygon, nearby search by radius, high resolutions, and more.&lt;/p&gt;

&lt;img class=&quot;center-image-0 center-image-70&quot; src=&quot;./assets/posts/spatial-index/h3_level_0_1.png&quot; /&gt; 
&lt;p class=&quot;figure-header&quot;&gt;Figure 16: H3 grid segmentation (Level 0 and Level 1)&lt;/p&gt;

&lt;p&gt;If you missed the series, it starts with &lt;a href=&quot;/spatial-index-space-filling-curve&quot;&gt;Spatial Index: Space-Filling Curves&lt;/a&gt;, followed by &lt;a href=&quot;/spatial-index-grid-system&quot;&gt;Spatial Index: Grid Systems&lt;/a&gt;, and finally, the current post, &lt;a href=&quot;#spatial-index-tessellation&quot;&gt;Spatial Index: Tessellation&lt;/a&gt;.&lt;/p&gt;
&lt;/details&gt;

&lt;hr class=&quot;clear-hr&quot; /&gt;

&lt;details&gt;&lt;summary class=&quot;h3&quot;&gt;4. References&lt;/summary&gt;
&lt;pre style=&quot;max-height: 300px&quot;&gt;&lt;code&gt;1. Uber Technologies, Inc., &quot;H3: A Hexagonal Hierarchical Spatial Index,&quot; GitHub. [Online]. Available: https://github.com/uber/h3.
2. Wikipedia, &quot;Graticule,&quot; [Online]. Available: https://en.wikipedia.org/wiki/Graticule.
3. Microsoft, &quot;QuadKey,&quot; Microsoft Docs. [Online]. Available: https://learn.microsoft.com/en-us/bingmaps/articles/bing-maps-tile-system.
4. Wikipedia, &quot;Geohash,&quot; [Online]. Available: https://en.wikipedia.org/wiki/Geohash.
5. Google, &quot;Google S2 Geometry Library,&quot; [Online]. Available: https://s2geometry.io/.
6. Wikipedia, &quot;Icosahedron,&quot; [Online]. Available: https://en.wikipedia.org/wiki/Icosahedron.
7. Wikipedia, &quot;Dot product,&quot; [Online]. Available: https://en.wikipedia.org/wiki/Dot_product.
8. Wikipedia, &quot;Basis vectors,&quot; [Online]. Available: https://en.wikipedia.org/wiki/Basis_(linear_algebra).
9. Wikipedia, &quot;3D Cartesian coordinates,&quot; [Online]. Available: https://en.wikipedia.org/wiki/Cartesian_coordinate_system.
10. A. N. Adimurthy, &quot;Spatial Index: Tessellation,&quot; PyBlog. [Online]. Available: https://www.pyblog.xyz/spatial-index-tessellation.
11. Wikipedia, &quot;Conceptualization of a Cartogram,&quot; [Online]. Available: https://en.wikipedia.org/wiki/Cartogram.
12. Wikipedia, &quot;Golden ratio,&quot; [Online]. Available: https://en.wikipedia.org/wiki/Golden_ratio.
13. Wikipedia, &quot;Icosahedron vertices,&quot; [Online]. Available: https://en.wikipedia.org/wiki/Icosahedron#Vertices.
14. Wikipedia, &quot;H3: A Hexagonal Hierarchical Spatial Index,&quot; [Online]. Available: https://en.wikipedia.org/wiki/H3_(spatial_index).
15. Wikipedia, &quot;Dymaxion map,&quot; [Online]. Available: https://en.wikipedia.org/wiki/Dymaxion_map.
16. K. Sahr, &quot;Geodesic Discrete Global Grid Systems,&quot; Southern Oregon University. [Online]. Available: https://webpages.sou.edu/~sahrk/sqspc/pubs/gdggs03.pdf.
17. D. F. Marble, &quot;The Fundamental Data Structures for Implementing Digital Tessellation,&quot; University of Edinburgh. [Online]. Available: https://www.geos.ed.ac.uk/~gisteac/gis_book_abridged/files/ch36.pdf.
18. J. Castner, &quot;The Application of Tessellation in Geographic Data Handling,&quot; Semantic Scholar. [Online]. Available: https://pdfs.semanticscholar.org/feb2/3e69e19875817848ac8694b15f58d2ef52b0.pdf.
19. &quot;Hexagonal Tessellation and Its Application in Geographic Information Systems,&quot; YouTube. [Online]. Available: https://www.youtube.com/watch?v=wDuKeUkNLkQ&amp;amp;list=PL0HGds8aHQsAYm86RzQdZtFFeLpIOjk00.
20. Hydronium Labs. &quot;h3o: A safer, faster, and more flexible H3 library written in Rust.&quot; GitHub Repository. Available: https://github.com/HydroniumLabs/h3o/tree/master.
&lt;/code&gt;&lt;/pre&gt;
&lt;/details&gt;</content><author><name>Adesh Nalpet Adimurthy</name></author><category term="System Wisdom" /><category term="Database" /><category term="Spatial Index" /><summary type="html">Brewing! this post a continuation of Spatial Index: Grid Systems where we will set the foundation for tessellation and delve into the details of Uber H3</summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://pyblog.xyz/assets/featured/webp/space-tessellation.webp" /><media:content medium="image" url="https://pyblog.xyz/assets/featured/webp/space-tessellation.webp" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Spatial Index: Grid Systems</title><link href="https://pyblog.xyz/spatial-index-grid-system" rel="alternate" type="text/html" title="Spatial Index: Grid Systems" /><published>2024-06-12T00:00:00+00:00</published><updated>2024-06-12T00:00:00+00:00</updated><id>https://pyblog.xyz/spatial-index-grid-system</id><content type="html" xml:base="https://pyblog.xyz/spatial-index-grid-system">&lt;p&gt;This post is a continuation of &lt;a href=&quot;/spatial-index-space-filling-curve&quot;&gt;Stomping Grounds: Spatial Indexes&lt;/a&gt;, but don’t worry if you missed the first part—you’ll still find plenty of new insights right here.&lt;/p&gt;

&lt;h3&gt;3. Geohash&lt;/h3&gt;

&lt;p&gt;&lt;a href=&quot;https://en.wikipedia.org/wiki/Geohash&quot; target=&quot;_blank&quot;&gt;Geohash&lt;/a&gt;, invented in 2008 by Gustavo Niemeyer, encodes a geographic location into a short string of letters and digits. It&apos;s a hierarchical spatial data structure that subdivides space into grid-shaped buckets using a Z-order curve (&lt;a href=&quot;/spatial-index-space-filling-curve#2-space-filling-curves&quot;&gt;Section 2.&lt;/a&gt;).&lt;/p&gt;

&lt;details open=&quot;&quot; class=&quot;text-container&quot;&gt;&lt;summary class=&quot;h4&quot;&gt;3.1. Geohash - Intuition&lt;/summary&gt;

&lt;p&gt;The Earth is round, or more accurately, an ellipsoid. A map projection is a set of transformations to represent the globe on a plane: coordinates (latitude and longitude) of locations on the surface of the globe are transformed to coordinates on a plane. Geohash uses the &lt;a href=&quot;https://en.wikipedia.org/wiki/Equirectangular_projection&quot; target=&quot;_blank&quot;&gt;Equirectangular projection&lt;/a&gt;.&lt;/p&gt;

&lt;img class=&quot;center-image-0 center-image&quot; src=&quot;./assets/posts/spatial-index/projection.svg&quot; /&gt; 
&lt;p class=&quot;figure-header&quot;&gt;Figure 21: Equirectangular projection/ Equidistant Cylindrical Projection&lt;/p&gt;

&lt;p&gt;The core of GeoHash is just a clever use of Z-order curves. Split the map projection (a rectangle) into 2 equal rectangles, each identified by a unique bit string.&lt;/p&gt;

&lt;img class=&quot;center-image-0 center-image-90&quot; src=&quot;./assets/posts/spatial-index/geohash-level-0.svg&quot; /&gt; 
&lt;p class=&quot;figure-header&quot;&gt;Figure 22: GeoHash Level 1 - Computation&lt;/p&gt;

&lt;p&gt;Observation: the divisions along X and Y axes are interleaved between bit strings. For example: an arbitrary bit string &lt;code&gt;01110 01011 00000&lt;/code&gt;, follows:&lt;/p&gt;

&lt;img class=&quot;center-image-0 center-image&quot; src=&quot;./assets/posts/spatial-index/geohash-bit-interleave.svg&quot; /&gt;

&lt;p&gt;By further encoding this to Base32 (&lt;code&gt;0123456789bcdefghjkmnpqrstuvwxyz&lt;/code&gt;), we map a unique string to a quadrant in a grid, and quadrants that share the same prefix are closer to each other; e.g. &lt;code&gt;000000&lt;/code&gt; and &lt;code&gt;000001&lt;/code&gt;. By now we know that such interleaving traces out a Z-order curve.&lt;/p&gt;

&lt;img class=&quot;center-image-0 center-image-90&quot; src=&quot;./assets/posts/spatial-index/geohash-z-order.svg&quot; /&gt; 
&lt;p class=&quot;figure-header&quot;&gt;Figure 23: GeoHash Level 1 - Z-Order Curve&lt;/p&gt;

&lt;p&gt;Higher levels (higher-order Z-curves) lead to higher precision, and the geohash algorithm can be repeated iteratively to reach it. That&apos;s one cool property of geohash: adding more characters increases the precision of the location.&lt;/p&gt;

&lt;img class=&quot;center-image-0 center-image-80&quot; src=&quot;./assets/posts/spatial-index/geohash-level-1.svg&quot; /&gt; 
&lt;img class=&quot;center-image-0 center-image-80&quot; src=&quot;./assets/posts/spatial-index/geohash-level-2.svg&quot; /&gt; 
&lt;p class=&quot;figure-header&quot;&gt;Figure 24: GeoHash Level 2&lt;/p&gt;

&lt;p&gt;Despite the easy implementation and wide usage of geohash, it inherits the disadvantages of Z-order curves (&lt;a href=&quot;/spatial-index-space-filling-curve#2-5-z-order-curve-implementation&quot;&gt;Section 2.5&lt;/a&gt;): it weakly preserves latitude-longitude proximity and does not always guarantee that locations that are physically close are also close on the Z-curve.&lt;/p&gt;

&lt;p&gt;Adding to this is the use of the &lt;a href=&quot;https://en.wikipedia.org/wiki/Tissot%27s_indicatrix&quot; target=&quot;_blank&quot;&gt;equirectangular projection&lt;/a&gt;, where dividing the map into equal subspaces leads to unequal/disproportionate surface areas, especially near the poles. There are, however, alternatives such as &lt;a href=&quot;https://www.researchgate.net/publication/328727378_GEOHASH-EAS_-_A_MODIFIED_GEOHASH_GEOCODING_SYSTEM_WITH_EQUAL-AREA_SPACES&quot; target=&quot;_blank&quot;&gt;Geohash-EAS&lt;/a&gt; (Equal-Area Spaces).&lt;/p&gt;
&lt;/details&gt;
&lt;hr class=&quot;sub-hr&quot; /&gt;

&lt;details open=&quot;&quot; class=&quot;text-container&quot;&gt;&lt;summary class=&quot;h4&quot;&gt;3.2. Geohash - Implementation&lt;/summary&gt;
&lt;p&gt;To convert a geographical location (latitude, longitude) into a concise string of characters and vice versa:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Convert latitude and longitude to binary strings.&lt;/li&gt;
&lt;li&gt;Interleave the binary strings of latitude and longitude.&lt;/li&gt;
&lt;li&gt;Geohash: Convert the interleaved binary string into a base32 string.&lt;/li&gt;
&lt;/ul&gt;

&lt;hr class=&quot;sub-hr&quot; /&gt;

&lt;details class=&quot;code-container&quot;&gt;&lt;summary class=&quot;p&quot;&gt;3.2a. Geohash Encoder - Snippet&lt;/summary&gt;

&lt;pre&gt;&lt;code&gt;public class GeohashEncoder {

    public static String encodeGeohash(double latitude, double longitude, int precision) {
        // 1. Convert Lat and Long into a binary string based on the range.
        // Use (precision * 5 + 1) / 2 bits per coordinate so the interleaved string
        // always has at least precision * 5 bits (precision * 5 / 2 falls one bit
        // short for odd precision); extra bits are dropped by the final substring.
        String latBin = convertToBinary(latitude, -90, 90, (precision * 5 + 1) / 2);
        String lonBin = convertToBinary(longitude, -180, 180, (precision * 5 + 1) / 2);

        // 2. Interweave the binary strings.
        String interwovenBin = interweave(lonBin, latBin);

        // 3. Converts a binary string to a base32 geohash.
        String geohash = binaryToBase32(interwovenBin);

        return geohash.substring(0, precision);
    }

    private static String convertToBinary(double value, double min, double max, int precision) {
        StringBuilder binaryStr = new StringBuilder();
        for (int i = 0; i &amp;lt; precision; i++) {
            double mid = (min + max) / 2;
            if (value &amp;gt;= mid) {
                binaryStr.append(&apos;1&apos;);
                min = mid;
            } else {
                binaryStr.append(&apos;0&apos;);
                max = mid;
            }
        }
        return binaryStr.toString();
    }

    private static String interweave(String str1, String str2) {
        StringBuilder interwoven = new StringBuilder();
        for (int i = 0; i &amp;lt; str1.length(); i++) {
            interwoven.append(str1.charAt(i));
            interwoven.append(str2.charAt(i));
        }
        return interwoven.toString();
    }

    private static String binaryToBase32(String binaryStr) {
        String base32Alphabet = &quot;0123456789bcdefghjkmnpqrstuvwxyz&quot;;
        StringBuilder base32Str = new StringBuilder();
        for (int i = 0; i &amp;lt; binaryStr.length(); i += 5) {
            String chunk = binaryStr.substring(i, Math.min(i + 5, binaryStr.length()));
            int decimalVal = Integer.parseInt(chunk, 2);
            base32Str.append(base32Alphabet.charAt(decimalVal));
        }
        return base32Str.toString();
    }

    public static void main(String[] args) {
        double latitude = 37.7749;
        double longitude = -122.4194;
        int precision = 5;
        String geohash = encodeGeohash(latitude, longitude, precision);
        System.out.println(&quot;Geohash: &quot; + geohash);
    }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;/details&gt;
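&lt;p&gt;For the inverse direction, decoding walks the same bits in reverse: each Base32 character contributes 5 bits that alternately halve the longitude and latitude ranges, and the decoded point is the center of the resulting bounding box. A minimal sketch (class and method names are illustrative, not from any library):&lt;/p&gt;

```java
public class GeohashDecoder {
    private static final String BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz";

    // Decodes a geohash into the center of its bounding box: {lat, lon}.
    public static double[] decodeGeohash(String geohash) {
        double latMin = -90, latMax = 90, lonMin = -180, lonMax = 180;
        boolean isLonBit = true; // interleaving starts with longitude
        for (char c : geohash.toCharArray()) {
            int val = BASE32.indexOf(c);
            for (int bit = 4; bit >= 0; bit--) {
                boolean isSet = ((val >> bit) & 1) == 1;
                if (isLonBit) {
                    double mid = (lonMin + lonMax) / 2;
                    if (isSet) lonMin = mid; else lonMax = mid;
                } else {
                    double mid = (latMin + latMax) / 2;
                    if (isSet) latMin = mid; else latMax = mid;
                }
                isLonBit = !isLonBit;
            }
        }
        return new double[]{(latMin + latMax) / 2, (lonMin + lonMax) / 2};
    }

    public static void main(String[] args) {
        double[] center = decodeGeohash("9q8yy"); // San Francisco area
        System.out.println("lat: " + center[0] + ", lon: " + center[1]);
    }
}
```

&lt;p&gt;The error of the decoded center shrinks as the geohash gets longer, which is the same &quot;more characters, more precision&quot; property seen during encoding.&lt;/p&gt;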
&lt;/details&gt;

&lt;hr class=&quot;sub-hr&quot; /&gt;

&lt;details open=&quot;&quot; class=&quot;text-container&quot;&gt;&lt;summary class=&quot;h4&quot;&gt;3.3. Geohash - Conclusion&lt;/summary&gt;
&lt;p&gt;Similar to &lt;a href=&quot;/spatial-index-space-filling-curve#2-7-z-order-curve-and-hilbert-curve-conclusion&quot;&gt;Section 2.7&lt;/a&gt; (Indexing the Z-values), geohashes convert latitude and longitude into a single, sortable string, simplifying spatial data management. A &lt;a href=&quot;/b-tree&quot;&gt;B-tree&lt;/a&gt; or a search tree such as GiST/SP-GiST (Generalized Search Tree) is commonly used for geohash indexing in databases.&lt;/p&gt;

&lt;p&gt;Prefix Search: nearby locations share common geohash prefixes, enabling efficient filtering of locations by performing prefix searches on the geohash column.&lt;/p&gt;

&lt;p&gt;Neighbor Searches: generate geohashes for a target location and its neighbors to quickly retrieve nearby points. This also extends to Area Searches: calculate geohash ranges that cover a specific area and perform range queries to find all relevant points within that region.&lt;/p&gt;
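&lt;p&gt;Because geohashes sort lexicographically, a prefix search maps directly onto a contiguous range scan of an ordered index. A minimal in-memory sketch, using a &lt;code&gt;TreeSet&lt;/code&gt; as a stand-in for a B-tree-indexed geohash column (the sample geohashes are illustrative):&lt;/p&gt;

```java
import java.util.NavigableSet;
import java.util.TreeSet;

public class GeohashPrefixSearch {
    // Returns all geohashes that start with the given prefix, as a single
    // contiguous range scan [prefix, prefix + '\uffff') on the sorted structure.
    public static NavigableSet<String> prefixScan(TreeSet<String> index, String prefix) {
        return index.subSet(prefix, true, prefix + '\uffff', false);
    }

    public static void main(String[] args) {
        TreeSet<String> index = new TreeSet<>(); // stands in for a B-tree index
        index.add("9q8yyk"); // San Francisco area (illustrative values)
        index.add("9q8yym");
        index.add("9q9p1d");
        index.add("dr5reg"); // New York area

        // All entries under cell "9q8yy" come back in one range scan.
        System.out.println(prefixScan(index, "9q8yy"));
    }
}
```

&lt;p&gt;A database executes the same idea with a &lt;code&gt;LIKE &apos;prefix%&apos;&lt;/code&gt; predicate or an explicit range condition over the indexed geohash column.&lt;/p&gt;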

&lt;p&gt;Popular databases such as &lt;a href=&quot;https://clickhouse.com/docs/en/sql-reference/functions/geo/geohash&quot; target=&quot;_blank&quot;&gt;ClickHouse&lt;/a&gt;, &lt;a href=&quot;https://dev.mysql.com/doc/refman/8.4/en/spatial-geohash-functions.html&quot; target=&quot;_blank&quot;&gt;MySQL&lt;/a&gt;, &lt;a href=&quot;https://postgis.net/docs/ST_GeoHash.html&quot; target=&quot;_blank&quot;&gt;PostGIS&lt;/a&gt;, &lt;a href=&quot;https://cloud.google.com/bigquery/docs/reference/standard-sql/geography_functions#st_geohash&quot; target=&quot;_blank&quot;&gt;BigQuery&lt;/a&gt;, &lt;a href=&quot;https://docs.aws.amazon.com/redshift/latest/dg/ST_GeoHash-function.html&quot; target=&quot;_blank&quot;&gt;RedShift&lt;/a&gt; and many others offer built-in geohash functions. Many variations have been developed as well, such as the &lt;a href=&quot;https://github.com/yinqiwen/geohash-int&quot; target=&quot;_blank&quot;&gt;64-bit Geohash&lt;/a&gt; and &lt;a href=&quot;https://ntnuopen.ntnu.no/ntnu-xmlui/handle/11250/2404058&quot; target=&quot;_blank&quot;&gt;Hilbert-Geohash&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Interactive Geohash Visualization: &lt;a href=&quot;/geohash&quot; target=&quot;_blank&quot;&gt;/geohash&lt;/a&gt;&lt;/p&gt;
&lt;/details&gt;

&lt;hr class=&quot;clear-hr&quot; /&gt;

&lt;h3&gt;4. Google S2&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;

&lt;details open=&quot;&quot; class=&quot;text-container&quot;&gt;&lt;summary class=&quot;h4&quot;&gt;4.1. S2 - Intuition&lt;/summary&gt;

&lt;p&gt;Google&apos;s S2 library was released more than 10 years ago and didn&apos;t exactly get the attention it deserved; much later, in 2017, Google announced the release of the open-source C++ &lt;a href=&quot;https://github.com/google/s2geometry&quot; target=&quot;_blank&quot;&gt;s2geometry library&lt;/a&gt;. With the use of the Hilbert Curve (&lt;a href=&quot;/spatial-index-space-filling-curve#2-2-hilbert-curve-intuition&quot;&gt;Section 2.2&lt;/a&gt;) and cube face (spherical) projection instead of geohash&apos;s Z-order curve and equirectangular projection, S2 addresses (to an extent) the large-jumps problem (&lt;a href=&quot;/spatial-index-space-filling-curve#2-5-z-order-curve-implementation&quot;&gt;Section 2.5&lt;/a&gt;) of Z-order curves and the disproportionate surface areas associated with the equirectangular projection.&lt;/p&gt;

&lt;p&gt;The core of S2 is the hierarchical decomposition of the sphere into &quot;cells&quot;, done using a &lt;a href=&quot;/quadtree&quot; target=&quot;_blank&quot;&gt;Quad-tree&lt;/a&gt;, where a quadrant is recursively subdivided into four equal sub-cells. The Hilbert Curve goes hand-in-hand with this: it runs across the centers of the quad-tree&apos;s leaf nodes.&lt;/p&gt;
&lt;/details&gt;

&lt;hr class=&quot;sub-hr&quot; /&gt;

&lt;details open=&quot;&quot; class=&quot;text-container&quot;&gt;&lt;summary class=&quot;h4&quot;&gt;4.2. S2 - Implementation&lt;/summary&gt;

&lt;p&gt;An overview of the solution:&lt;/p&gt;
&lt;ul&gt;
    &lt;li&gt;Enclose sphere in cube&lt;/li&gt;
    &lt;li&gt;Project point(s) &lt;code&gt;p&lt;/code&gt; onto the cube&lt;/li&gt;
    &lt;li&gt;Build a quad-tree/hilbert-curve on each cube face (6 faces)&lt;/li&gt;
    &lt;li&gt;Assign ID to the quad-tree cell that contains the projection of point(s) &lt;code&gt;p&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Starting with the input &lt;a href=&quot;https://en.wikipedia.org/wiki/Geographic_coordinate_system#Latitude_and_longitude&quot; target=&quot;_blank&quot;&gt;co-ordinates&lt;/a&gt;: latitude (degrees: -90° to +90°; radians: -π/2 to π/2) and longitude (degrees: -180° to +180°; radians: -π to π). &lt;a href=&quot;https://en.wikipedia.org/wiki/World_Geodetic_System&quot; target=&quot;_blank&quot;&gt;WGS84&lt;/a&gt; is a commonly used standard in &lt;a href=&quot;https://en.wikipedia.org/wiki/Earth-centered,_Earth-fixed_coordinate_system&quot; target=&quot;_blank&quot;&gt;geocentric coordinate systems&lt;/a&gt;.&lt;/p&gt;

&lt;hr class=&quot;hr&quot; /&gt;

&lt;h3&gt;4.2.1. (Lat, Long) to (X,Y,Z)&lt;/h3&gt;

&lt;p&gt;Convert &lt;code&gt;p = (latitude, longitude) =&amp;gt; (x, y, z)&lt;/code&gt; in the XYZ coordinate system (&lt;code&gt;x = [-1.0, 1.0], y = [-1.0, 1.0], z = [-1.0, 1.0]&lt;/code&gt;), based on coordinates on the unit sphere (unit radius), which is similar to the &lt;a href=&quot;https://en.wikipedia.org/wiki/Earth-centered,_Earth-fixed_coordinate_system&quot; target=&quot;_blank&quot;&gt;Earth-centered, Earth-fixed coordinate system&lt;/a&gt;.&lt;/p&gt;

&lt;img class=&quot;center-image-0 center-image-80&quot; src=&quot;./assets/posts/spatial-index/ecef.svg&quot; /&gt; 
&lt;p class=&quot;figure-header&quot;&gt;Figure 25: (lat, long) to (x, y, z) Transformation with ECEF&lt;/p&gt;

&lt;p&gt;Where &lt;code&gt;(x, y, z)&lt;/code&gt;: X-axis at latitude 0°, longitude 0° (equator and prime meridian intersection); Y-axis at latitude 0°, longitude 90° (equator and 90°E meridian intersection); Z-axis at latitude 90° (North Pole); Altitude (&lt;code&gt;PM&lt;/code&gt; in Figure 25) = height above the reference ellipsoid/sphere (zero for a round-planet approximation).&lt;/p&gt;
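&lt;p&gt;For the round-planet approximation (unit radius, zero altitude), the transformation is a direct spherical-to-Cartesian conversion; a minimal sketch (class and method names are illustrative):&lt;/p&gt;

```java
public class LatLngToXyz {
    // Converts (lat, lng) in degrees to a point on the unit sphere.
    public static double[] latLngToXyz(double latDeg, double lngDeg) {
        double lat = Math.toRadians(latDeg);
        double lng = Math.toRadians(lngDeg);
        return new double[]{
            Math.cos(lat) * Math.cos(lng), // x: lat 0, lng 0  -> (1, 0, 0)
            Math.cos(lat) * Math.sin(lng), // y: lat 0, lng 90 -> (0, 1, 0)
            Math.sin(lat)                  // z: lat 90 (North Pole) -> (0, 0, 1)
        };
    }

    public static void main(String[] args) {
        double[] p = latLngToXyz(37.7749, -122.4194);
        System.out.println("x: " + p[0] + ", y: " + p[1] + ", z: " + p[2]);
    }
}
```

&lt;p&gt;The three comments mirror the axis definitions above: the equator/prime-meridian intersection maps to the X-axis, and the North Pole to the Z-axis.&lt;/p&gt;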

&lt;hr class=&quot;hr&quot; /&gt;

&lt;h3&gt;4.2.2. (X,Y,Z) to (Face,U,V)&lt;/h3&gt;

&lt;p&gt;To map &lt;code&gt;(x,y,z)&lt;/code&gt; to &lt;code&gt;(face, u, v)&lt;/code&gt;, each of the six faces of the cube is projected onto the sphere. The process is similar to &lt;a href=&quot;https://en.wikipedia.org/wiki/UV_mapping&quot; target=&quot;_blank&quot;&gt;UV Mapping&lt;/a&gt;: projecting a 3D model&apos;s surface onto a 2D coordinate space, where &lt;code&gt;u&lt;/code&gt; and &lt;code&gt;v&lt;/code&gt; denote the axes of the 2D plane. In this case, &lt;code&gt;(u, v)&lt;/code&gt; represents the location of a point on one face of the cube.&lt;/p&gt;

&lt;p&gt;The projection can be imagined as a unit sphere circumscribed by a cube: a ray emitted from the center of the sphere through a point on the sphere hits one of the 6 faces of the cube; that is, the sphere is projected onto the cube.&lt;/p&gt;

&lt;img class=&quot;center-image-0 center-image&quot; src=&quot;./assets/posts/spatial-index/s2-cell-step-1-2.svg&quot; /&gt; 
&lt;p class=&quot;figure-header&quot;&gt;Figure 26: (lat, long) to (x, y, z) and (x, y, z) to (face, u, v)&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;face&lt;/code&gt; denotes which of the 6 (0 to 5) cube faces a point on the sphere is mapped onto. Figure 27 shows the 6 faces of the cube (&lt;a href=&quot;https://en.wikipedia.org/wiki/Cube_mapping&quot; target=&quot;_blank&quot;&gt;cube mapping&lt;/a&gt;) after the projection. For a unit sphere, on each face, the point &lt;code&gt;u,v = (0,0)&lt;/code&gt; represents the center of the face.&lt;/p&gt;

&lt;img class=&quot;center-image-0 center-image-100&quot; src=&quot;./assets/posts/spatial-index/s2-globe.svg&quot; /&gt; 
&lt;p class=&quot;figure-header&quot;&gt;Figure 27: Cube Face (Spherical) Projection&lt;/p&gt;

&lt;p&gt;The evident problem here is that the linear projection leads to same-area cells on the cube having different sizes on the sphere (length and area distortion), with a highest-to-lowest area ratio of &lt;code&gt;5.2&lt;/code&gt; (cells of equal area on the cube can differ in area by up to a factor of 5.2 on the sphere).&lt;/p&gt;

&lt;details class=&quot;code-container&quot;&gt;&lt;summary class=&quot;p&quot;&gt;4.2.2a. S2 FaceXYZ to UV - Snippet&lt;/summary&gt;
&lt;pre&gt;&lt;code&gt;public static class Vector3 {
    public double x;
    public double y;
    public double z;

    public Vector3(double x, double y, double z) {
        this.x = x;
        this.y = y;
        this.z = z;
    }
}

public static int findFace(Vector3 r) {
    double absX = Math.abs(r.x);
    double absY = Math.abs(r.y);
    double absZ = Math.abs(r.z);

    if (absX &amp;gt;= absY &amp;amp;&amp;amp; absX &amp;gt;= absZ) {
        return r.x &amp;gt; 0 ? 0 : 3;
    } else if (absY &amp;gt;= absX &amp;amp;&amp;amp; absY &amp;gt;= absZ) {
        return r.y &amp;gt; 0 ? 1 : 4;
    } else {
        return r.z &amp;gt; 0 ? 2 : 5;
    }
}

public static double[] validFaceXYZToUV(int face, Vector3 r) {
    switch (face) {
        case 0:
            return new double[]{r.y / r.x, r.z / r.x};
        case 1:
            return new double[]{-r.x / r.y, r.z / r.y};
        case 2:
            return new double[]{-r.x / r.z, -r.y / r.z};
        case 3:
            return new double[]{r.z / r.x, r.y / r.x};
        case 4:
            return new double[]{r.z / r.y, -r.x / r.y};
        default:
            return new double[]{-r.y / r.z, -r.x / r.z};
    }
}

public static void main(String[] args) {
    Vector3 r = new Vector3(1.0, 2.0, 3.0);
    int face = findFace(r); // z has the largest magnitude here, so face = 2
    double[] uv = validFaceXYZToUV(face, r);
    System.out.println(&quot;u: &quot; + uv[0] + &quot;, v: &quot; + uv[1]);
}
&lt;/code&gt;&lt;/pre&gt;
&lt;/details&gt;

&lt;p&gt;The cube &lt;code&gt;Face&lt;/code&gt; is determined by the largest absolute X, Y, Z component; when that component is negative, the back faces (3, 4, 5) are used.&lt;/p&gt;
&lt;img class=&quot;center-image-0 center-image-60&quot; src=&quot;./assets/posts/spatial-index/s2-xyz-uv.svg&quot; /&gt; 
&lt;p&gt;Face and XYZ are mapped to UV by taking the other two components (the ones that did not pick the face) and dividing them by the largest component, giving values in &lt;code&gt;[-1, 1]&lt;/code&gt;. Additionally, some faces of the cube are negated (transposed) to produce a single continuous Hilbert curve across the cube.&lt;/p&gt;

&lt;hr class=&quot;hr&quot; /&gt;

&lt;h3&gt;4.2.3. (Face,U,V) to (Face,S,T)&lt;/h3&gt;

&lt;p&gt;The ST coordinate system is an extension of UV with an additional non-linear transformation layer to address the disproportionate sphere-surface-area to cube-cell mapping (area preservation). Without it, cells near the cube face edges would be smaller than those near the cube face centers.&lt;/p&gt;

&lt;img class=&quot;center-image-0 center-image-80&quot; src=&quot;./assets/posts/spatial-index/s2-cell-step-3.svg&quot; /&gt; 
&lt;p class=&quot;figure-header&quot;&gt;Figure 28: (u, v) to (s, t)&lt;/p&gt;

&lt;p&gt;S2 uses a quadratic projection for &lt;code&gt;(u,v)&lt;/code&gt; =&amp;gt; &lt;code&gt;(s,t)&lt;/code&gt;. Comparing the &lt;code&gt;tan&lt;/code&gt; and &lt;code&gt;quadratic&lt;/code&gt; projections: the tan projection has the least area/distance distortion; however, the quadratic projection, an approximation of the tan projection, is much faster and almost as good.&lt;/p&gt;
&lt;table&gt;
        &lt;tr&gt;
            &lt;td&gt;&lt;/td&gt;
            &lt;td&gt;Area Ratio&lt;/td&gt;
            &lt;td&gt;Cell → Point (µs)&lt;/td&gt;
            &lt;td&gt;Point → Cell (µs)&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;td&gt;Linear&lt;/td&gt;
            &lt;td&gt;5.20&lt;/td&gt;
            &lt;td&gt;0.087&lt;/td&gt;
            &lt;td&gt;0.085&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;td&gt;Tangent&lt;/td&gt;
            &lt;td&gt;1.41&lt;/td&gt;
            &lt;td&gt;0.299&lt;/td&gt;
            &lt;td&gt;0.258&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr style=&quot;background-color: rgb(213, 232, 212);&quot;&gt;
            &lt;td&gt;Quadratic&lt;/td&gt;
            &lt;td&gt;2.08&lt;/td&gt;
            &lt;td&gt;0.096&lt;/td&gt;
            &lt;td&gt;0.108&lt;/td&gt;
        &lt;/tr&gt;
    &lt;/table&gt;

&lt;p&gt;&lt;code&gt;Cell → Point&lt;/code&gt; and &lt;code&gt;Point → Cell&lt;/code&gt; represents the transformation from (U, V) to (S, T) coordinates and vice versa.&lt;/p&gt;

&lt;img class=&quot;center-image-0 center-image-90&quot; src=&quot;./assets/posts/spatial-index/s2-uv-st-face-0.svg&quot; /&gt; 
&lt;p class=&quot;figure-header&quot;&gt;Figure 29: (face, u, v) to (face, s, t); for face = 0&lt;/p&gt;

&lt;p&gt;For the quadratic transformation, apply a square-root correction, &lt;code&gt;0.5 * sqrt(1 + 3u)&lt;/code&gt; for &lt;code&gt;u &amp;gt;= 0&lt;/code&gt;, to maintain the uniformity of the grid cells.&lt;/p&gt;

&lt;details class=&quot;code-container&quot;&gt;&lt;summary class=&quot;p&quot;&gt;4.2.3a. S2 UV to ST - Snippet&lt;/summary&gt;
&lt;pre&gt;&lt;code&gt;public static double uvToST(double u) {
    if (u &amp;gt;= 0) {
        return 0.5 * Math.sqrt(1 + 3 * u);
    } else {
        return 1 - 0.5 * Math.sqrt(1 - 3 * u);
    }
}

public static void main(String[] args) {
    // (u, v) values in the range [-1, 1]
    double u1 = 0.5;
    double v1 = -0.5;
    
    // Convert (u, v) to (s, t)
    double s1 = uvToST(u1);
    double t1 = uvToST(v1);

    System.out.println(&quot;For (u, v) = (&quot; + u1 + &quot;, &quot; + v1 + &quot;):&quot;);
    System.out.println(&quot;s: &quot; + s1);
    System.out.println(&quot;t: &quot; + t1);
}
&lt;/code&gt;&lt;/pre&gt;
&lt;/details&gt;

&lt;hr class=&quot;hr&quot; /&gt;

&lt;h3&gt;4.2.4. (Face,S,T) to (Face,I,J)&lt;/h3&gt;

&lt;p&gt;The IJ coordinates are discretized ST coordinates, dividing the ST plane into a &lt;code&gt;2&lt;sup&gt;30&lt;/sup&gt; × 2&lt;sup&gt;30&lt;/sup&gt;&lt;/code&gt; grid; i.e. the i and j coordinates in S2 range from &lt;code&gt;0 to 2&lt;sup&gt;30&lt;/sup&gt; - 1&lt;/code&gt;. They represent the two dimensions of the leaf cells (lowest-level cells) on a cube face.&lt;/p&gt;

&lt;p&gt;Why 2&lt;sup&gt;30&lt;/sup&gt;? The i and j coordinates are each represented using 30 bits, giving &lt;code&gt;2&lt;sup&gt;30&lt;/sup&gt;&lt;/code&gt; distinct values per coordinate (roughly every cm² of the earth); this large range allows precise positioning within each face of the cube (high spatial resolution). The total number of unique leaf cells is &lt;code&gt;6 x (2&lt;sup&gt;30&lt;/sup&gt; × 2&lt;sup&gt;30&lt;/sup&gt;)&lt;/code&gt;.&lt;/p&gt;

&lt;img class=&quot;center-image-0 center-image-100&quot; src=&quot;./assets/posts/spatial-index/s2-st-ij.svg&quot; /&gt;
&lt;p class=&quot;figure-header&quot;&gt;Figure 30: (face, s, t) to (face, i, j); for face = 0&lt;/p&gt;
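&lt;p&gt;As a rough sanity check on the &quot;every cm²&quot; claim, dividing Earth&apos;s surface area by the total leaf-cell count gives the average leaf-cell size; the surface-area constant below is an approximation assumed for this sketch:&lt;/p&gt;

```java
public class S2LeafCellArea {
    // Average leaf-cell area in cm^2, assuming ~510.1 million km^2 of Earth surface.
    public static double averageLeafCellAreaCm2() {
        double earthAreaKm2 = 510_065_622.0;                        // approximation
        double leafCells = 6.0 * Math.pow(2, 30) * Math.pow(2, 30); // 6 faces of 2^30 x 2^30
        return (earthAreaKm2 * 1e10) / leafCells;                   // 1 km^2 = 1e10 cm^2
    }

    public static void main(String[] args) {
        System.out.println("Average leaf cell area: " + averageLeafCellAreaCm2() + " cm^2");
    }
}
```

&lt;p&gt;This works out to well under 1 cm² on average; individual leaf cells vary around that figure because of the projection distortion discussed earlier.&lt;/p&gt;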

&lt;details class=&quot;code-container&quot;&gt;&lt;summary class=&quot;p&quot;&gt;4.2.4a. S2 ST to IJ - Snippet&lt;/summary&gt;
&lt;pre&gt;&lt;code&gt;public static int stToIj(double s) {
  // Scale s in [0, 1] to a discrete grid index in [0, 2^30 - 1]; 1073741824 = 2^30
  return Math.max(
    0, Math.min(1073741824 - 1, (int) Math.round(1073741824 * s))
  );
}
&lt;/code&gt;&lt;/pre&gt;
&lt;/details&gt;

&lt;hr class=&quot;hr&quot; /&gt;

&lt;h3&gt;4.2.5. (Face,I,J) to S2 Cell ID&lt;/h3&gt;
&lt;p&gt;The hierarchical sub-division of each cube face into 4 equal quadrants calls for the Hilbert space-filling curve (&lt;a href=&quot;/spatial-index-space-filling-curve#2-2-hilbert-curve-intuition&quot;&gt;Section 2.2&lt;/a&gt;): cells are enumerated along a Hilbert curve.&lt;/p&gt;

&lt;img class=&quot;center-image-0 center-image-100&quot; src=&quot;./assets/posts/spatial-index/s2-ij-cell.svg&quot; /&gt;
&lt;p class=&quot;figure-header&quot;&gt;Figure 31: (face, i, j) to Hilbert Curve Position&lt;/p&gt;

&lt;p&gt;The Hilbert Curve preserves spatial locality: points that are close on the cube face/surface are numerically close in Hilbert curve position (illustration in Figure 31 - Level 3).&lt;/p&gt;

&lt;p&gt;Transformation: the Hilbert curve transforms the IJ coordinate position on the cube face from 2D to 1D, given by a &lt;code&gt;60 bit&lt;/code&gt; integer (&lt;code&gt;0 to 2&lt;sup&gt;60&lt;/sup&gt; - 1&lt;/code&gt;).&lt;/p&gt;

&lt;details class=&quot;code-container&quot;&gt;&lt;summary class=&quot;p&quot;&gt;4.2.5a. S2 IJ to S2 Cell ID - Snippet&lt;/summary&gt;
&lt;pre&gt;&lt;code&gt;public class S2CellId {
    private static final int MAX_LEVEL = 30;
    private static final int POS_BITS = 2 * MAX_LEVEL + 1;

    // Simplified: interleaves i and j bits directly (Z-order) for illustration;
    // the real S2 library also applies Hilbert-curve rotations via lookup tables.
    public static long faceIjToCellId(int face, int i, int j) {
        // Face encoding: top 3 bits
        long cellId = ((long) face) &amp;lt;&amp;lt; POS_BITS;
        // Loop from MAX_LEVEL - 1 down to 0
        for (int k = MAX_LEVEL - 1; k &amp;gt;= 0; --k) {
            // Hierarchical position encoding: two bits per level
            int mask = 1 &amp;lt;&amp;lt; k;
            long bits = ((((i &amp;amp; mask) != 0) ? 1 : 0) &amp;lt;&amp;lt; 1) | (((j &amp;amp; mask) != 0) ? 1 : 0);
            cellId |= bits &amp;lt;&amp;lt; (2 * k + 1); // bit 0 is reserved for the level marker
        }
        return cellId | 1; // trailing 1 marks a level-30 (leaf) cell
    }

    public static void main(String[] args) {
        int face = 2; 
        int i = 536870912;
        int j = 536870912;

        long cellId = faceIjToCellId(face, i, j);
        System.out.println(&quot;S2 Cell ID: &quot; + cellId);
    }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;/details&gt;

&lt;p&gt;The &lt;b&gt;S2 Cell ID&lt;/b&gt; is represented by a &lt;code&gt;64-bit&lt;/code&gt; integer,&lt;/p&gt; 
&lt;ul&gt;
&lt;img class=&quot;center-image-0 center-image-70&quot; src=&quot;./assets/posts/spatial-index/s2-cell-id.svg&quot; /&gt;
&lt;p class=&quot;figure-header&quot;&gt;Figure 32: (face, i, j) to S2 Cell ID&lt;/p&gt;
&lt;li&gt;the left &lt;code&gt;3 bits&lt;/code&gt; are used to represent the cube face &lt;code&gt;[0-5],&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;the following &lt;code&gt;60 bits&lt;/code&gt; represent the Hilbert curve position,&lt;/li&gt;
&lt;li&gt;with &lt;code&gt;[0-30]&lt;/code&gt; levels; two bits per level, followed by a trailing &lt;code&gt;1&lt;/code&gt; bit, a marker whose position identifies the level of the cell,&lt;/li&gt;
&lt;li&gt;and the remaining low bits are padded with 0s.&lt;/li&gt;
&lt;/ul&gt;

&lt;pre&gt;&lt;code&gt;fffpppp...pppppppp1  # Level 30 cell ID
fffpppp...pppppp100  # Level 29 cell ID
fffpppp...pppp10000  # Level 28 cell ID
...
...
...
fffpp10...000000000  # Level 1 cell ID
fff1000...000000000  # Level 0 cell ID
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Notice the positions of the trailing &lt;code&gt;1&lt;/code&gt; and the padded &lt;code&gt;0&lt;/code&gt;s, which correlate with the level.&lt;/p&gt;
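&lt;p&gt;One consequence of this layout: a cell&apos;s level can be recovered by counting the trailing zeros below the marker bit (two per level), and its parent by moving the marker up two bits. A sketch consistent with the bit layout above (class and method names are illustrative):&lt;/p&gt;

```java
public class S2CellLevel {
    private static final int MAX_LEVEL = 30;

    // Level = 30 minus half the number of trailing zeros below the 1-bit marker.
    public static int level(long cellId) {
        return MAX_LEVEL - (Long.numberOfTrailingZeros(cellId) >> 1);
    }

    // Parent at (level - 1): clear the two level bits and set the new marker.
    // Assumes the cell is at level >= 1.
    public static long parent(long cellId) {
        long newLsb = Long.lowestOneBit(cellId) << 2;
        return (cellId & -newLsb) | newLsb;
    }

    public static void main(String[] args) {
        long leaf = 1L; // face 0 leaf cell: marker at bit 0, so level 30
        System.out.println("level: " + level(leaf));
        System.out.println("parent level: " + level(parent(leaf)));
    }
}
```

&lt;p&gt;Because the parent is just a bit operation on the child&apos;s ID, containment checks between cells reduce to cheap integer arithmetic.&lt;/p&gt;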
&lt;hr class=&quot;hr&quot; /&gt;

&lt;p&gt;&lt;b&gt;S2 Tokens&lt;/b&gt; are a string representation of S2 Cell IDs (uint64), which can be more convenient for storage.&lt;/p&gt;

&lt;details class=&quot;code-container&quot;&gt;&lt;summary class=&quot;p&quot;&gt;4.2.5b. S2 Cell ID to S2 Token - Snippet&lt;/summary&gt;
&lt;pre&gt;&lt;code&gt;public static String cellIdToToken(long cellId) {
    // The zero token is encoded as &apos;X&apos; rather than as a zero-length string
    if (cellId == 0) {
        return &quot;X&quot;;
    }

    // Pad to 16 hex digits (leading zeros are significant), then strip trailing zeros
    String hexString = String.format(&quot;%016x&quot;, cellId).replaceAll(&quot;0*$&quot;, &quot;&quot;);
    return hexString;
}

public static void main(String[] args) {
    long cellId = 3383821801271328768L; // Given example value

    // Convert S2 Cell ID to S2 Token
    String token = cellIdToToken(cellId);

    System.out.println(&quot;S2 Cell ID: &quot; + cellId);
    System.out.println(&quot;S2 Token: &quot; + token);
}
&lt;/code&gt;&lt;/pre&gt;
&lt;/details&gt;
&lt;p&gt;An S2 Cell ID is converted to an S2 Token by encoding the ID into a base-16 (hexadecimal) string. It&apos;s similar to Geohash; however, truncating a high-level S2 token does not generally yield the parent lower-level token, because the trailing 1 bit of the S2 cell ID wouldn&apos;t be set correctly.&lt;/p&gt;
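&lt;p&gt;The inverse, token back to cell ID, must first restore the stripped trailing zeros by right-padding the hex string to 16 digits; a minimal sketch (the sample token value is illustrative):&lt;/p&gt;

```java
public class S2TokenDecoder {
    // Restores a 64-bit cell ID from a token by right-padding to 16 hex digits.
    public static long tokenToCellId(String token) {
        if ("X".equals(token)) {
            return 0L; // the zero cell ID is encoded as 'X'
        }
        StringBuilder hex = new StringBuilder(token);
        while (hex.length() < 16) {
            hex.append('0'); // re-append the trailing zeros stripped during encoding
        }
        return Long.parseUnsignedLong(hex.toString(), 16);
    }

    public static void main(String[] args) {
        System.out.println(tokenToCellId("2ef")); // illustrative token
    }
}
```

&lt;p&gt;Using &lt;code&gt;parseUnsignedLong&lt;/code&gt; keeps the full uint64 range intact, since cell IDs on faces 4 and 5 set the top bit.&lt;/p&gt;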
&lt;/details&gt;

&lt;hr class=&quot;sub-hr&quot; /&gt;

&lt;details open=&quot;&quot; class=&quot;text-container&quot;&gt;&lt;summary class=&quot;h4&quot;&gt;4.3. S2 - Conclusion&lt;/summary&gt;
&lt;p&gt;Google&apos;s S2 provides spatial indexing by using hierarchical decomposition of the sphere into cells through a combination of Hilbert curves and cube face (spherical) projection. This approach mitigates some of the spatial locality issues present in Z-order curves and offers more balanced surface area representations. S2&apos;s use of (face, u, v) coordinates, quadratic projection, and Hilbert space-filling curves ensures efficient and precise spatial indexing.&lt;/p&gt;

&lt;img class=&quot;center-image-0 center-image-100&quot; src=&quot;./assets/posts/spatial-index/s2-stats.svg&quot; /&gt;

&lt;p&gt;Closing with a strong pro and a con, S2 offers a high resolution of as low as &lt;code&gt;0.48 cm²&lt;/code&gt; cell size (level 30), but the number of cells required to cover a given polygon isn&apos;t the best. This makes it a good transition to talk about Uber&apos;s &lt;a href=&quot;https://www.uber.com/en-CA/blog/h3/&quot; target=&quot;_blank&quot;&gt;H3&lt;/a&gt;. The question is, &lt;a href=&quot;/cartograms-documentation#hexagonsvssquares&quot;&gt;Why Hexagons?&lt;/a&gt;&lt;/p&gt;
&lt;/details&gt;

&lt;hr class=&quot;clear-hr&quot; /&gt;

&lt;details&gt;&lt;summary class=&quot;h3&quot;&gt;3. References&lt;/summary&gt;

&lt;pre style=&quot;max-height: 300px&quot;&gt;&lt;code&gt;6. Christian S. Perone, &quot;Google’s S2, geometry on the sphere, cells and Hilbert curve,&quot; in Terra Incognita, 14/08/2015, https://blog.christianperone.com/2015/08/googles-s2-geometry-on-the-sphere-cells-and-hilbert-curve/. [Accessed: 12-Jun-2024].
7. B. Feifke, &quot;Geospatial Indexing Explained,&quot; Ben Feifke, Dec. 2022. [Online]. Available: https://benfeifke.com/posts/geospatial-indexing-explained/. [Accessed: 12-Jun-2024].
8. &quot;S2 Concepts,&quot; S2 Geometry Library Documentation, 2024. [Online]. Available: https://docs.s2cell.aliddell.com/en/stable/s2_concepts.html. [Accessed: 13-Jun-2024].
9. &quot;Geospatial Indexing: A Look at Google&apos;s S2 Library,&quot; CNIter Blog, Mar. 2023. [Online]. Available: https://cniter.github.io/posts/720275bd.html. [Accessed: 13-Jun-2024].
10. &quot;S2 Geometry Library,&quot; S2 Geometry, 2024. [Online]. Available: https://s2geometry.io/. [Accessed: 13-Jun-2024].
&lt;/code&gt;&lt;/pre&gt;
&lt;/details&gt;</content><author><name>Adesh Nalpet Adimurthy</name></author><category term="System Wisdom" /><category term="Database" /><category term="Spatial Index" /><summary type="html">This post is a continuation of Stomping Grounds: Spatial Indexes, but don’t worry if you missed the first part—you’ll still find plenty of new insights right here.</summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://pyblog.xyz/assets/featured/webp/space-grids.webp" /><media:content medium="image" url="https://pyblog.xyz/assets/featured/webp/space-grids.webp" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Spatial Index: Space-Filling Curves</title><link href="https://pyblog.xyz/spatial-index-space-filling-curve" rel="alternate" type="text/html" title="Spatial Index: Space-Filling Curves" /><published>2024-06-11T00:00:00+00:00</published><updated>2024-06-11T00:00:00+00:00</updated><id>https://pyblog.xyz/spatial-index-space-filling-curve</id><content type="html" xml:base="https://pyblog.xyz/spatial-index-space-filling-curve">&lt;details open=&quot;&quot;&gt;&lt;summary class=&quot;h3&quot;&gt;0. Overview&lt;/summary&gt;
&lt;p&gt;Spatial data has grown, and is still growing, rapidly thanks to web services tracking where and when users do things. Most applications add location tags and often allow users to check in at specific places and times. This surge is largely due to smartphones, which act as location sensors, making it easier than ever to capture and analyze this type of data.&lt;/p&gt;

&lt;p&gt;The goal of this post is to dive into the different spatial indexes that are widely used in both relational and non-relational databases. We&apos;ll look at the pros and cons of each type, and also discuss which indexes are the most popular today.&lt;/p&gt;

&lt;img class=&quot;center-image-0&quot; src=&quot;./assets/posts/spatial-index/spatial-index-types.svg&quot; /&gt; 
&lt;p class=&quot;figure-header&quot;&gt;Figure 0: Types of Spatial Indexes&lt;/p&gt;

&lt;p&gt;Spatial indexes fall into two main categories: space-driven and data-driven structures. Data-driven structures, like the R-tree family, are tailored to the distribution of the data itself. Space-driven structures include partitioning trees (kd-trees, quad-trees), space-filling curves (Z-order, Hilbert), and grid systems (H3, S2, Geohash), each partitioning space to optimize spatial queries. This classification isn&apos;t exhaustive, as many other methods cater to specific needs in spatial data management.&lt;/p&gt;

&lt;/details&gt;

&lt;hr class=&quot;clear-hr&quot; /&gt;

&lt;details open=&quot;&quot;&gt;&lt;summary class=&quot;h3&quot;&gt;1. Foundation&lt;/summary&gt;
&lt;p&gt;Let&apos;s start with why we need spatial indexes, or more generally, a way to index multi-dimensional data.&lt;/p&gt;
&lt;img class=&quot;center-image-40&quot; src=&quot;./assets/posts/spatial-index/no-sort-no-partition-table.svg&quot; /&gt; 
&lt;p class=&quot;figure-header&quot;&gt;Figure 1: Initial Table Structure&lt;/p&gt;
&lt;p&gt;Consider a table with the following fields: &lt;code&gt;device&lt;/code&gt;, &lt;code&gt;X&lt;/code&gt;, and &lt;code&gt;Y&lt;/code&gt;, all of which are integers ranging from 1 to 4. Data is inserted into this table randomly by an external application.&lt;/p&gt;

&lt;img class=&quot;center-image&quot; src=&quot;./assets/posts/spatial-index/no-sort-no-partition-full-scan.svg&quot; /&gt; 
&lt;p class=&quot;figure-header&quot;&gt;Figure 2: Unpartitioned and Unsorted Table&lt;/p&gt;
&lt;p&gt;Currently, the table is neither partitioned nor sorted. As a result, the data is distributed across all files (8 files), each containing a mix of all ranges. This means all files are similar in nature. Running a query like &lt;code&gt;Device = 1 and X = 2&lt;/code&gt; requires a full scan of all files, which is inefficient.&lt;/p&gt;

&lt;img class=&quot;center-image-90&quot; src=&quot;./assets/posts/spatial-index/no-sort-full-scan.svg&quot; /&gt; 
&lt;p class=&quot;figure-header&quot;&gt;Figure 3: Partitioning by Device&lt;/p&gt;
&lt;p&gt;To optimize this, we partition the table by the &lt;code&gt;device&lt;/code&gt; field into 4 partitions: &lt;code&gt;Device = 1&lt;/code&gt;, &lt;code&gt;Device = 2&lt;/code&gt;, &lt;code&gt;Device = 3&lt;/code&gt;, and &lt;code&gt;Device = 4&lt;/code&gt;. Now, the same query (&lt;code&gt;Device = 1 and X = 2&lt;/code&gt;) only needs to scan the relevant partition. This reduces the scan to just 2 files.&lt;/p&gt;

&lt;img class=&quot;center-image-90&quot; src=&quot;./assets/posts/spatial-index/partial-scan-x.svg&quot; /&gt; 
&lt;p class=&quot;figure-header&quot;&gt;Figure 4: Sorting Data Within Partitions&lt;/p&gt;
&lt;p&gt;Further optimization can be achieved by sorting the data within each partition by the &lt;code&gt;X&lt;/code&gt; field. With this setup, each file in a partition holds a specific range of &lt;code&gt;X&lt;/code&gt; values. For example, one file in the &lt;code&gt;Device = 1&lt;/code&gt; partition holds &lt;code&gt;X = 1 to 2&lt;/code&gt;. This makes the query &lt;code&gt;Device = 1 and X = 2&lt;/code&gt; even more efficient.&lt;/p&gt;

&lt;img class=&quot;center-image-90&quot; src=&quot;./assets/posts/spatial-index/no-sort-full-scan-y.svg&quot; /&gt; 
&lt;p class=&quot;figure-header&quot;&gt;Figure 5: Limitation with Sorting on a Single Field&lt;/p&gt;
&lt;p&gt;However, if the query changes to &lt;code&gt;Device = 1 and Y = 2&lt;/code&gt;, the optimization is lost because the sorting was done on &lt;code&gt;X&lt;/code&gt; and not &lt;code&gt;Y&lt;/code&gt;. This means the query will still require scanning the entire partition for &lt;code&gt;Device = 1&lt;/code&gt;, bringing us back to a less efficient state.&lt;/p&gt;

&lt;p&gt;At this point, there&apos;s a clear need for efficiently partitioning 2-dimensional data. Why not use a &lt;a href=&quot;/b-tree&quot;&gt;B-tree&lt;/a&gt; with a composite index? A composite index prioritizes the first column in the index, leading to inefficient querying on the second column. This leads us back to the same problem, particularly when both dimensions need to be considered simultaneously for efficient querying.&lt;/p&gt;
&lt;/details&gt;

&lt;hr class=&quot;clear-hr&quot; /&gt;

&lt;details open=&quot;&quot;&gt;&lt;summary class=&quot;h3&quot;&gt;2. Space-Filling Curves&lt;/summary&gt;

&lt;p&gt;Consider &lt;code&gt;X&lt;/code&gt; and &lt;code&gt;Y&lt;/code&gt; from 1 to 4 plotted on a 2D axis. The goal is to traverse the data points and number them along the way (the path), using Space-Filling Curves, AKA squiggly lines.&lt;/p&gt;

&lt;img class=&quot;center-image-0 center-image-90&quot; src=&quot;./assets/posts/spatial-index/space-filling-trivial-details.svg&quot; /&gt; 
&lt;p class=&quot;figure-header&quot;&gt;Figure 6: Exploring Space-Filling Curve and Traversing the X-Y Axis&lt;/p&gt;

&lt;p&gt;Starting from &lt;code&gt;Y = 1&lt;/code&gt; and &lt;code&gt;X = 1&lt;/code&gt;, as we traverse up to &lt;code&gt;X = 1&lt;/code&gt; and &lt;code&gt;Y = 4&lt;/code&gt;, it&apos;s evident that there is no locality preservation (lexicographical order). The distance between points &lt;code&gt;(1, 4)&lt;/code&gt; and &lt;code&gt;(1, 3)&lt;/code&gt; is 6, a significant gap for points that are quite close to each other. Grouping this data into files keeps unrelated data together and amounts to sorting by one column while ignoring the information in the other column (back to square one), i.e. &lt;code&gt;X = 2&lt;/code&gt; leads to a full scan.&lt;/p&gt;


&lt;details open=&quot;&quot; class=&quot;text-container&quot;&gt;&lt;summary class=&quot;h4&quot;&gt;2.1. Z-Order Curve - Intuition&lt;/summary&gt;
&lt;p&gt;A recursive Z pattern, also known as the Z-order curve, is an effective way to preserve locality in many cases.&lt;/p&gt;

&lt;img class=&quot;center-image-0 center-image-70&quot; src=&quot;./assets/posts/spatial-index/z-order-types.svg&quot; /&gt; 
&lt;p class=&quot;figure-header&quot;&gt;Figure 7: Z-Order Curve Types&lt;/p&gt;
&lt;p&gt;The Z-order curve can take many shapes, depending on which coordinate goes first. The typical Z-shape occurs when the Y-coordinate goes first (most significant bit), and the upper left corner is the base. A mirror image Z-shape occurs when the Y-coordinate goes first and the lower left corner is the base. An N-shape occurs when the X-coordinate goes first and the lower left corner is the base.&lt;/p&gt;

&lt;p&gt;The Z-order curve grows exponentially: the next size up is the second-order curve, with 2-bit dimensions. Duplicate the first-order curve four times and connect the copies to form a continuous curve.&lt;/p&gt;

&lt;img class=&quot;center-image-0 center-image-90&quot; src=&quot;./assets/posts/spatial-index/z-order.svg&quot; /&gt; 
&lt;p class=&quot;figure-header&quot;&gt;Figure 8: Z-Order Curve&lt;/p&gt;

&lt;p&gt;Points &lt;code&gt;(1, 4)&lt;/code&gt; and &lt;code&gt;(1, 3)&lt;/code&gt; are separated by a single square. With 4 files based on this curve, the data is not spread out along a single dimension. Instead, the 4 files are clustered across both dimensions, making the data selective on both &lt;code&gt;X&lt;/code&gt; and &lt;code&gt;Y&lt;/code&gt; dimensions.&lt;/p&gt;
&lt;/details&gt;

&lt;hr class=&quot;sub-hr&quot; /&gt;

&lt;details open=&quot;&quot; class=&quot;text-container&quot;&gt;&lt;summary class=&quot;h4&quot;&gt;2.2. Hilbert Curve - Intuition&lt;/summary&gt;

&lt;p&gt;The Hilbert curve is another type of space-filling curve that serves a similar purpose; rather than the Z-shaped pattern of the Z-order curve, it uses a gentler U-shaped pattern. When compared with the Z-order curve in Figure 9, it&apos;s quite clear that the Hilbert curve always maintains the same distance between adjacent data points.&lt;/p&gt;

&lt;img class=&quot;center-image-0 center-image-90&quot; src=&quot;./assets/posts/spatial-index/hilbert-second-order.svg&quot; /&gt;
&lt;p class=&quot;figure-header&quot;&gt;Figure 9: First Order and Second Order Hilbert Curve&lt;/p&gt;
&lt;p&gt;The Hilbert curve also grows exponentially: duplicate the first-order curve and connect the copies. Additionally, some of the first-order curves are rotated to ensure that the interconnections are never longer than 1 step.&lt;/p&gt;

&lt;img class=&quot;center-image-0 center-image-90&quot; src=&quot;./assets/posts/spatial-index/hilbert-exponent.svg&quot; /&gt; 
&lt;p&gt;Comparing with the Z-curves (from Figure 8, higher-order in Figure 18), the Z-order curve is longer than the Hilbert curve at all levels, for the same area.&lt;/p&gt;

&lt;img class=&quot;center-image-0 center-image-90&quot; src=&quot;./assets/posts/spatial-index/hilbert-types.svg&quot; /&gt; 
&lt;p class=&quot;figure-header&quot;&gt;Figure 10: Hilbert Curve Types&lt;/p&gt;
&lt;p&gt;Although there are quite a lot of variants of the Hilbert curve, the common pattern is to rotate by 90 degrees and repeat the pattern in the next higher order(s).&lt;/p&gt;

&lt;img class=&quot;center-image-0 center-image-90&quot; src=&quot;./assets/posts/spatial-index/hilbert-curve.svg&quot; /&gt; 
&lt;p class=&quot;figure-header&quot;&gt;Figure 11: Hilbert Curve&lt;/p&gt;
&lt;p&gt;Hilbert curves traverse through the data, ensuring that multi-dimensional data points that are close together in 2D space remain close together along the 1D line or curve, thus preserving locality and enhancing query efficiency across both dimensions.&lt;/p&gt;
&lt;/details&gt;

&lt;hr class=&quot;sub-hr&quot; /&gt;

&lt;details open=&quot;&quot; class=&quot;text-container&quot;&gt;&lt;summary class=&quot;h4&quot;&gt;2.3. Z-Order Curve and Hilbert Curve - Comparison&lt;/summary&gt;

&lt;p&gt;Taking an example, if we query for &lt;code&gt;X = 3&lt;/code&gt;, we only need to search 2 of the files. Similarly, for &lt;code&gt;Y = 3&lt;/code&gt;, the search is also limited to 2 files for both the Z-order and Hilbert curves.&lt;/p&gt;

&lt;img class=&quot;center-image-0 center-image-100&quot; src=&quot;./assets/posts/spatial-index/z-order-curve-example.svg&quot; /&gt; 
&lt;p class=&quot;figure-header&quot;&gt;Figure 12: Z-Order Curve - Example&lt;/p&gt;

&lt;p&gt;Unlike a hierarchical sort on only one dimension, the data is selective across both dimensions, making the multi-dimensional search more efficient.&lt;/p&gt;

&lt;img class=&quot;center-image-0 center-image-100&quot; src=&quot;./assets/posts/spatial-index/hilbert-curve-example.svg&quot; /&gt; 
&lt;p class=&quot;figure-header&quot;&gt;Figure 13: Hilbert Curve - Example&lt;/p&gt;

&lt;p&gt;Although both curves offer a similar advantage, the main shortcoming of the Z-order curve is that it fails to maintain perfect data locality across all the data points on the curve. In Figure 12, notice that the data points between index 8 and 9 are further apart. As the size of the Z-curve increases, so does the distance between such points that connect different parts of the curve.&lt;/p&gt;

&lt;p&gt;The Hilbert curve is preferred over the Z-order curve for ensuring better data locality, while the Z-order curve is still widely used because of its simplicity.&lt;/p&gt;
&lt;/details&gt;

&lt;hr class=&quot;sub-hr&quot; /&gt;

&lt;details open=&quot;&quot; class=&quot;text-container&quot;&gt;&lt;summary class=&quot;h4&quot;&gt;2.4. Optimizing with Z-Values&lt;/summary&gt;

&lt;p&gt;In the examples so far, we have presumed that the &lt;code&gt;X&lt;/code&gt; and &lt;code&gt;Y&lt;/code&gt; values are dense, meaning that there is a value for every combination of &lt;code&gt;X&lt;/code&gt; and &lt;code&gt;Y&lt;/code&gt;. However, in real-world scenarios, data can be sparse, with many &lt;code&gt;X, Y&lt;/code&gt; combinations missing.&lt;/p&gt;

&lt;img class=&quot;center-image-0 center-image-80&quot; src=&quot;./assets/posts/spatial-index/3-partition-curves.svg&quot; /&gt; 
&lt;p class=&quot;figure-header&quot;&gt;Figure 14: Flexibility in Number of Files&lt;/p&gt;
&lt;p&gt;The number of files (4 in the prior examples) isn&apos;t fixed either. Here&apos;s what 3 files would look like using both Z-order and Hilbert curves. The benefit still holds to an extent, because the space-filling curve efficiently clusters related data points.&lt;/p&gt;

&lt;img class=&quot;center-image-0 center-image-90&quot; src=&quot;./assets/posts/spatial-index/z-order-sparse.svg&quot; /&gt; 
&lt;p class=&quot;figure-header&quot;&gt;Figure 15: Optimizing with Z-Values&lt;/p&gt;
&lt;p&gt;To improve efficiency, we can use Z-values. If files are organized by Z-values, each file has a min-max Z-value range. Filters on &lt;code&gt;X&lt;/code&gt; and &lt;code&gt;Y&lt;/code&gt; can be transformed into Z-values, enabling efficient querying by limiting the search to relevant files based on their Z-value ranges.&lt;/p&gt;

&lt;img class=&quot;center-image-0&quot; src=&quot;./assets/posts/spatial-index/z-order-z-values.svg&quot; /&gt; 
&lt;p class=&quot;figure-header&quot;&gt;Figure 16: Efficient Querying with Min-Max Z-Values&lt;/p&gt;
&lt;p&gt;Consider a scenario where the min-max Z-values of 3 files are &lt;code&gt;1 to 5&lt;/code&gt;, &lt;code&gt;6 to 9&lt;/code&gt;, and &lt;code&gt;13 to 16&lt;/code&gt;. Querying by &lt;code&gt;2 ≤ X ≤ 3&lt;/code&gt; and &lt;code&gt;3 ≤ Y ≤ 4&lt;/code&gt; would initially require scanning 2 files. However, if we convert these ranges to their Z-value equivalent, which is &lt;code&gt;10 ≤ Z ≤ 15&lt;/code&gt;, we only need to scan one file, since the min-max Z-values are known.&lt;/p&gt;
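&lt;p&gt;The pruning step itself is simple; below is a minimal sketch (the file layout is hypothetical, not tied to any particular table format): a file is scanned only if its min-max Z-range overlaps the query&apos;s Z-range.&lt;/p&gt;

```java
import java.util.ArrayList;
import java.util.List;

public class ZValuePruning {
    // A file's min-max Z-value range, as tracked in file-level metadata
    record ZRange(long min, long max) {}

    // Keep only the files whose [min, max] Z-range overlaps the query's Z-range
    static List<ZRange> prune(List<ZRange> files, long queryMin, long queryMax) {
        List<ZRange> hits = new ArrayList<>();
        for (ZRange f : files) {
            if (f.max() >= queryMin && f.min() <= queryMax) {
                hits.add(f);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // The three files from Figure 16: Z-values 1-5, 6-9, and 13-16
        List<ZRange> files = List.of(new ZRange(1, 5), new ZRange(6, 9), new ZRange(13, 16));
        // 2 <= X <= 3 and 3 <= Y <= 4 translates to 10 <= Z <= 15:
        // only the file covering Z-values 13 to 16 survives the pruning
        System.out.println(prune(files, 10, 15));
    }
}
```

&lt;p&gt;This is the same min-max data-skipping idea that lakehouse table formats apply at the file level when data is Z-ordered.&lt;/p&gt;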
&lt;/details&gt;

&lt;hr class=&quot;sub-hr&quot; /&gt;

&lt;details open=&quot;&quot; class=&quot;text-container&quot;&gt;&lt;summary class=&quot;h4&quot;&gt;2.5. Z-Order Curve - Implementation&lt;/summary&gt;

&lt;p&gt;So far, we know that Z-ordering arranges the 2D pairs on a 1-dimensional line. More importantly, values that were close together in the 2D plane remain close to each other on the Z-order line. The implementation goal is to derive Z-values that preserve spatial locality from M-dimensional data points (Z-ordering is not limited to 2-dimensional space; it can be generalized to any number of dimensions).&lt;/p&gt;

&lt;p&gt;Z-order bit-interleaving is a technique that interleaves the bits of two or more values to create a 1-D value while preserving spatial locality:&lt;/p&gt;

&lt;img class=&quot;center-image-0 center-image-45&quot; src=&quot;./assets/posts/spatial-index/interleave.svg&quot; /&gt; 
&lt;p class=&quot;figure-header&quot;&gt;Figure 17: Bit Interleaving&lt;/p&gt;
&lt;p&gt;Example: 4-bit values &lt;code&gt;X = 10&lt;/code&gt; and &lt;code&gt;Y = 12&lt;/code&gt; on a 2D grid: &lt;code&gt;X = 1010&lt;/code&gt;, &lt;code&gt;Y = 1100&lt;/code&gt;, so the interleaved value is &lt;code&gt;Z = 1110 0100&lt;/code&gt; (&lt;code&gt;228&lt;/code&gt;)&lt;/p&gt;

&lt;details class=&quot;code-container&quot;&gt;&lt;summary class=&quot;p&quot;&gt;2.5a. Z-Order Curve - Snippet&lt;/summary&gt;

&lt;pre&gt;&lt;code&gt;public class ZOrderCurve {

    // Function to interleave bits of two integers x and y
    public static long interleaveBits(int x, int y) {
        long z = 0;
        for (int i = 0; i &amp;lt; 32; i++) {
            // Cast to long before shifting: x bit i goes to position 2i and
            // y bit i to position 2i + 1 (shifting in int overflows for i &amp;gt;= 16)
            z |= (((long) (x &amp;gt;&amp;gt;&amp;gt; i) &amp;amp; 1L) &amp;lt;&amp;lt; (2 * i)) | (((long) (y &amp;gt;&amp;gt;&amp;gt; i) &amp;amp; 1L) &amp;lt;&amp;lt; (2 * i + 1));
        }
        return z;
    }

    // Function to compute the Z-order curve values for a list of points
    public static long[] zOrderCurve(int[][] points) {
        long[] zValues = new long[points.length];
        for (int i = 0; i &amp;lt; points.length; i++) {
            int x = points[i][0];
            int y = points[i][1];
            zValues[i] = interleaveBits(x, y);
        }
        return zValues;
    }

    public static void main(String[] args) {
        int[][] points = { {1, 2}, {3, 4}, {5, 6} };
        long[] zValues = zOrderCurve(points);

        System.out.println(&quot;Z-order values:&quot;);
        for (long z : zValues) {
            System.out.println(z);
        }
    }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;/details&gt;

&lt;hr class=&quot;hr&quot; /&gt;

&lt;img class=&quot;center-image-0 center-image-70&quot; src=&quot;./assets/posts/spatial-index/z-order-2d-plane.svg&quot; /&gt; 
&lt;p class=&quot;figure-header&quot;&gt;Figure 18: 2-D Z-Order Curve Space&lt;/p&gt;

&lt;p&gt;From the above Z-order keys, we see that points that are close to each other in the original space have close Z-order keys. For instance, points sharing the prefix &lt;code&gt;000&lt;/code&gt; in their Z-order keys are close in 2D space, while a differing prefix, such as &lt;code&gt;110&lt;/code&gt;, indicates greater distance.&lt;/p&gt;

&lt;img class=&quot;center-image-0 center-image-70&quot; src=&quot;./assets/posts/spatial-index/z-order-success.svg&quot; /&gt; 
&lt;p class=&quot;figure-header&quot;&gt;Figure 19: 2-D Z-Order Curve Space and a Query Region&lt;/p&gt;
&lt;p&gt;Now that we know how to calculate the Z-order keys, we can use them to define a range of values to read (a range query); to do so, we have to find the lower and upper bounds. For example, for the query rectangle &lt;code&gt;2 ≤ X ≤ 3&lt;/code&gt; and &lt;code&gt;4 ≤ Y ≤ 5&lt;/code&gt;, the lower bound is &lt;code&gt;Z-Order(X = 2, Y = 4) = 100100&lt;/code&gt; and the upper bound is &lt;code&gt;Z-Order(X = 3, Y = 5) = 100111&lt;/code&gt;, which translates to Z-order values of &lt;code&gt;36&lt;/code&gt; and &lt;code&gt;39&lt;/code&gt;.&lt;/p&gt;
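&lt;p&gt;A minimal sketch of computing these bounds (assuming the interleaving order from the figures, with &lt;code&gt;Y&lt;/code&gt; taking the more significant bit of each pair): because interleaving is monotone in each coordinate, interleaving the two corners of the rectangle yields the lower and upper bounds of the Z-range.&lt;/p&gt;

```java
public class ZRangeBounds {
    // Interleave x and y bit by bit; y takes the more significant
    // bit of each pair, matching the ordering used in the figures
    static long zOrder(int x, int y) {
        long z = 0;
        for (int i = 0; i < 32; i++) {
            z |= ((long) (x >>> i) & 1L) << (2 * i);
            z |= ((long) (y >>> i) & 1L) << (2 * i + 1);
        }
        return z;
    }

    public static void main(String[] args) {
        // Query rectangle 2 <= X <= 3 and 4 <= Y <= 5: interleaving the
        // bottom-left and top-right corners gives the Z-range to read
        long lower = zOrder(2, 4); // 0b100100 = 36
        long upper = zOrder(3, 5); // 0b100111 = 39
        System.out.println(lower + " to " + upper);
    }
}
```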

&lt;img class=&quot;center-image-0 center-image-70&quot; src=&quot;./assets/posts/spatial-index/z-order-danger.svg&quot; /&gt; 
&lt;p class=&quot;figure-header&quot;&gt;Figure 20: 2-D Z-Order Curve Space and a Query Region (The Problem)&lt;/p&gt;
&lt;p&gt;However, the points matching a range query are not always contiguous along the Z path. For example, for the query rectangle &lt;code&gt;1 ≤ X ≤ 3&lt;/code&gt; and &lt;code&gt;3 ≤ Y ≤ 4&lt;/code&gt;, the lower bound is &lt;code&gt;Z-Order(X = 1, Y = 3) = 001011&lt;/code&gt; and the upper bound is &lt;code&gt;Z-Order(X = 3, Y = 4) = 100101&lt;/code&gt;, which translates to Z-order values of &lt;code&gt;11&lt;/code&gt; and &lt;code&gt;37&lt;/code&gt;: a range that also covers many points outside the rectangle, typically optimized by splitting it into subranges.&lt;/p&gt;

&lt;p&gt;The Z-order curve only weakly preserves latitude-longitude proximity, i.e. two locations that are close in physical distance are not guaranteed to be close along the Z-curve.&lt;/p&gt;
&lt;/details&gt;

&lt;hr class=&quot;sub-hr&quot; /&gt;

&lt;details open=&quot;&quot; class=&quot;text-container&quot;&gt;&lt;summary class=&quot;h4&quot;&gt;2.6. Hilbert Curve - Implementation&lt;/summary&gt;
&lt;p&gt;From &lt;a href=&quot;#2-2-hilbert-curve-intuition&quot;&gt;Section 2.2&lt;/a&gt;, we know that the Hilbert curve implementation converts 2D coordinates to a single scalar value that preserves spatial locality by recursively rotating and transforming the coordinate space.&lt;/p&gt;

&lt;p&gt;In the code snippet: The &lt;code&gt;xyToHilbert&lt;/code&gt; function computes this scalar value using bitwise operations, while the &lt;code&gt;hilbertToXy&lt;/code&gt; function reverses this process. This method ensures that points close in 2D space remain close in the 1D Hilbert curve index, making it useful for spatial indexing.&lt;/p&gt;

&lt;details class=&quot;code-container&quot;&gt;&lt;summary class=&quot;p&quot;&gt;2.6a. Hilbert Curve - Snippet&lt;/summary&gt;
&lt;pre&gt;&lt;code&gt;public class HilbertCurve {
    // Rotate/flip a quadrant appropriately
    private static void rot(int n, int[] x, int[] y, int rx, int ry) {
        if (ry == 0) {
            if (rx == 1) {
                x[0] = n - 1 - x[0];
                y[0] = n - 1 - y[0];
            }
            // Swap x and y
            int temp = x[0];
            x[0] = y[0];
            y[0] = temp;
        }
    }

    // Convert (x, y) to Hilbert curve distance
    public static int xyToHilbert(int n, int x, int y) {
        int d = 0;
        int[] ix = { x };
        int[] iy = { y };

        for (int s = n / 2; s &amp;gt; 0; s /= 2) {
            int rx = (ix[0] &amp;amp; s) &amp;gt; 0 ? 1 : 0;
            int ry = (iy[0] &amp;amp; s) &amp;gt; 0 ? 1 : 0;
            d += s * s * ((3 * rx) ^ ry);
            rot(s, ix, iy, rx, ry);
        }

        return d;
    }

    // Convert Hilbert curve distance to (x, y)
    public static void hilbertToXy(int n, int d, int[] x, int[] y) {
        int rx, ry, t = d;
        x[0] = y[0] = 0;
        for (int s = 1; s &amp;lt; n; s *= 2) {
            rx = (t / 2) % 2;
            ry = (t ^ rx) % 2;
            rot(s, x, y, rx, ry);
            x[0] += s * rx;
            y[0] += s * ry;
            t /= 4;
        }
    }

    public static void main(String[] args) {
        int n = 16; // size of the grid (must be a power of 2)
        int x = 5;
        int y = 10;
        int d = xyToHilbert(n, x, y);
        System.out.println(&quot;The Hilbert curve distance for (&quot; + x + &quot;, &quot; + y + &quot;) is: &quot; + d);

        // Use separate arrays for x and y; passing the same array for both
        // would make x[0] and y[0] alias the same element
        int[] px = new int[1];
        int[] py = new int[1];
        hilbertToXy(n, d, px, py);
        System.out.println(&quot;The coordinates for Hilbert curve distance &quot; + d + &quot; are: (&quot; + px[0] + &quot;, &quot; + py[0] + &quot;)&quot;);
    }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;/details&gt;
&lt;/details&gt;

&lt;hr class=&quot;sub-hr&quot; /&gt;

&lt;details open=&quot;&quot; class=&quot;text-container&quot;&gt;&lt;summary class=&quot;h4&quot;&gt;2.7. Z-Order Curve and Hilbert Curve - Conclusion&lt;/summary&gt;

&lt;p&gt;Usage: Insert data points and their Z-order/Hilbert keys (let&apos;s call them Z and H keys) into a one-dimensional hierarchical index structure, such as a &lt;a href=&quot;/b-tree&quot;&gt;B-Tree&lt;/a&gt; or Quad-Tree. For range or nearest-neighbor queries, convert the search criteria into Z/H keys or ranges of keys. After retrieval, filter the results as necessary to remove any false positives.&lt;/p&gt;
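&lt;p&gt;A minimal sketch of this usage pattern, with Java&apos;s &lt;code&gt;TreeMap&lt;/code&gt; (a red-black tree) standing in for the 1-D index, using the query rectangle from Figure 20:&lt;/p&gt;

```java
import java.util.Map;
import java.util.TreeMap;

public class ZKeyIndex {
    // Interleave x and y (y takes the more significant bit of each pair)
    static long zOrder(int x, int y) {
        long z = 0;
        for (int i = 0; i < 32; i++) {
            z |= ((long) (x >>> i) & 1L) << (2 * i);
            z |= ((long) (y >>> i) & 1L) << (2 * i + 1);
        }
        return z;
    }

    // Index an 8x8 grid by Z-key, run the Figure 20 query (1 <= X <= 3, 3 <= Y <= 4),
    // and return { matches after post-filtering, candidates in the Z-range }
    static int[] rangeQuery() {
        TreeMap<Long, int[]> index = new TreeMap<>(); // stand-in for a B-tree on Z-keys
        for (int x = 0; x < 8; x++)
            for (int y = 0; y < 8; y++)
                index.put(zOrder(x, y), new int[] { x, y });

        int matches = 0, candidates = 0;
        // Scan the Z-range between the interleaved corner points: 11 to 37
        for (Map.Entry<Long, int[]> e : index.subMap(zOrder(1, 3), true, zOrder(3, 4), true).entrySet()) {
            candidates++;
            int[] p = e.getValue();
            // Post-filter: the Z-range also contains points outside the rectangle
            if (p[0] >= 1 && p[0] <= 3 && p[1] >= 3 && p[1] <= 4) matches++;
        }
        return new int[] { matches, candidates };
    }

    public static void main(String[] args) {
        int[] r = rangeQuery();
        System.out.println(r[0] + " matches out of " + r[1] + " candidate Z-values");
    }
}
```

&lt;p&gt;Of the 27 candidate Z-values in the range, only 6 fall inside the rectangle; the rest are the false positives that the post-filter removes.&lt;/p&gt;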

&lt;p&gt;To conclude: space-filling curves such as the Z-order and Hilbert curves are a powerful technique for querying higher-dimensional data, especially as data volumes grow. By combining bits from multiple dimensions into a single value, space-filling curve indexing preserves spatial locality, enabling efficient data indexing and retrieval.&lt;/p&gt;

&lt;p&gt;However, as seen in &lt;a href=&quot;#2-5-z-order-curve-implementation&quot;&gt;Section 2.5&lt;/a&gt;, large jumps along the Z-Order curve can affect certain types of queries (better with Hilbert curves &lt;a href=&quot;#2-2-hilbert-curve-intuition&quot;&gt;Section 2.2&lt;/a&gt;). The success of Z-Order indexing relies on the data&apos;s distribution and cardinality. Therefore, it is essential to evaluate the nature of the data, query patterns, performance needs and limitation(s) of indexing strategies.&lt;/p&gt;
&lt;/details&gt;

&lt;/details&gt;

&lt;hr class=&quot;clear-hr&quot; /&gt;

&lt;details&gt;&lt;summary class=&quot;h3&quot;&gt;3. References&lt;/summary&gt;

&lt;pre style=&quot;max-height: 300px&quot;&gt;&lt;code&gt;1. &quot;Programming the Hilbert curve,&quot; American Institute of Physics (AIP) Conf. Proc. 707, 381 (2004).
2. Wikipedia. “Z-order curve,” [Online]. Available: https://en.wikipedia.org/wiki/Z-order_curve.
3. Amazon Web Services, “Z-order indexing for multifaceted queries in Amazon DynamoDB – Part 1,” [Online]. Available: https://aws.amazon.com/blogs/database/z-order-indexing-for-multifaceted-queries-in-amazon-dynamodb-part-1/. [Accessed: 10-Jun-2024].
4. N. Chandra, &quot;Z-order indexing for efficient queries in Data Lake,&quot; Medium, 20-Sep-2021. [Online]. Available: https://medium.com/@nishant.chandra/z-order-indexing-for-efficient-queries-in-data-lake-48eceaeb2320. [Accessed: 10-Jun-2024].
5. YouTube, “Z-order indexing for efficient queries in Data Lake,” [Online]. Available: https://www.youtube.com/watch?v=YLVkITvF6KU. [Accessed: 10-Jun-2024].
&lt;/code&gt;&lt;/pre&gt;

&lt;/details&gt;</content><author><name>Adesh Nalpet Adimurthy</name></author><category term="System Wisdom" /><category term="Database" /><category term="Spatial Index" /><summary type="html">0. Overview Spatial data has grown (/is growing) rapidly thanks to web services tracking where and when users do things. Most applications add location tags and often allow users check in specific places and times. This surge is largely due to smartphones, which act as location sensors, making it easier than ever to capture and analyze this type of data.</summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://pyblog.xyz/assets/featured/webp/spatio-temporal-index.webp" /><media:content medium="image" url="https://pyblog.xyz/assets/featured/webp/spatio-temporal-index.webp" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Real-time insights: Telemetry Pipeline</title><link href="https://pyblog.xyz/telemetry-pipeline" rel="alternate" type="text/html" title="Real-time insights: Telemetry Pipeline" /><published>2024-06-07T00:00:00+00:00</published><updated>2024-06-07T00:00:00+00:00</updated><id>https://pyblog.xyz/telemetry-pipeline</id><content type="html" xml:base="https://pyblog.xyz/telemetry-pipeline">&lt;details open=&quot;&quot;&gt;&lt;summary class=&quot;h3&quot;&gt;0. Overview&lt;/summary&gt;
&lt;p&gt;&lt;/p&gt;
&lt;details open=&quot;&quot; class=&quot;text-container&quot;&gt;&lt;summary class=&quot;h4&quot;&gt;0.1. Architecture&lt;/summary&gt;
&lt;p&gt;A &lt;a href=&quot;https://en.wikipedia.org/wiki/Telemetry&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;telemetry&lt;/a&gt; pipeline is a system that collects, ingests, processes, stores, and analyzes telemetry data (metrics, logs, traces) from various sources in real-time or near real-time to provide insights into the performance and health of applications and infrastructure.&lt;/p&gt;

&lt;img class=&quot;telemetry-barebone center-image-90&quot; src=&quot;./assets/posts/telemetry/telemetry-barebone.svg&quot; /&gt; 
&lt;p class=&quot;figure-header&quot;&gt;Figure 0: Barebone Telemetry Pipeline Architecture&lt;/p&gt;

&lt;p&gt;It typically involves tools like Telegraf for data collection, Kafka for ingestion, Flink for processing, and &lt;a href=&quot;https://cassandra.apache.org/&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;Cassandra&lt;/a&gt; and &lt;a href=&quot;https://victoriametrics.com/&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;VictoriaMetrics&lt;/a&gt; for storage and analysis.&lt;/p&gt;

&lt;img class=&quot;telemetry-architecture&quot; src=&quot;./assets/posts/telemetry/telemetry-architecture.svg&quot; /&gt; 
&lt;p class=&quot;figure-header&quot;&gt;Figure 1: Detailed Telemetry Pipeline Architecture&lt;/p&gt;
&lt;/details&gt;

&lt;hr class=&quot;sub-hr&quot; /&gt;

&lt;details open=&quot;&quot; class=&quot;text-container&quot;&gt;&lt;summary class=&quot;h4&quot;&gt;0.2. Stages&lt;/summary&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;b&gt;Collection&lt;/b&gt;: Telemetry data is collected from various sources using agents like Telegraf and &lt;a href=&quot;https://www.fluentd.org/&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;Fluentd&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;b&gt;Ingestion&lt;/b&gt;: Data is ingested through message brokers such as Apache Kafka or Kinesis to handle high throughput.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;b&gt;Processing&lt;/b&gt;: Real-time processing is done using stream processing frameworks like Apache Flink for filtering, aggregating, and enriching data.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;b&gt;Storage and Analysis&lt;/b&gt;: Processed data is stored in systems like Cassandra, &lt;a href=&quot;https://clickhouse.com/&quot; target=&quot;_blank&quot;&gt;ClickHouse&lt;/a&gt; and &lt;a href=&quot;https://www.elastic.co/downloads/elasticsearch&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;Elasticsearch&lt;/a&gt;, and analyzed using tools like &lt;a href=&quot;https://grafana.com/&quot; target=&quot;_blank&quot;&gt;Grafana&lt;/a&gt; and &lt;a href=&quot;https://www.elastic.co/kibana&quot; target=&quot;_blank&quot;&gt;Kibana&lt;/a&gt; for visualization and alerting.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/details&gt;

&lt;/details&gt;

&lt;hr class=&quot;clear-hr&quot; /&gt;

&lt;details open=&quot;&quot;&gt;&lt;summary class=&quot;h3&quot;&gt;1. Collection&lt;/summary&gt;
&lt;p&gt;&lt;/p&gt;
&lt;details open=&quot;&quot; class=&quot;text-container&quot;&gt;&lt;summary class=&quot;h4&quot;&gt;1.1. Collection Agent&lt;/summary&gt;

&lt;p&gt;To start, we&apos;ll use &lt;a href=&quot;https://www.influxdata.com/time-series-platform/telegraf/&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;Telegraf&lt;/a&gt;, a versatile open-source agent that collects metrics from various sources and writes them to different outputs. Telegraf supports a wide range of &lt;a href=&quot;https://docs.influxdata.com/telegraf/v1/plugins/#input-plugins&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;input&lt;/a&gt; and &lt;a href=&quot;https://docs.influxdata.com/telegraf/v1/plugins/#output-plugins&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;output plugins&lt;/a&gt;, making it easy to gather data from sensors, servers, GPS systems, and more.&lt;/p&gt;

&lt;p&gt;&lt;img class=&quot;center-image-0 center-image-80 telegraf-overview&quot; src=&quot;./assets/posts/telemetry/telegraf-overview.svg&quot; /&gt; &lt;/p&gt;
&lt;p class=&quot;figure-header&quot;&gt;Figure 2: Telegraf for collecting metrics &amp;amp; data&lt;/p&gt;

&lt;p&gt;For this example, we&apos;ll focus on collecting the CPU temperature and Fan speed from a macOS system using the &lt;a href=&quot;https://github.com/influxdata/telegraf/blob/release-1.30/plugins/inputs/exec/README.md&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;exec plugin&lt;/a&gt; in Telegraf. And leverage the &lt;a href=&quot;https://github.com/lavoiesl/osx-cpu-temp&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;osx-cpu-temp&lt;/a&gt; command line tool to fetch the CPU temperature.&lt;/p&gt;

&lt;p&gt;🌵 &lt;a href=&quot;https://github.com/inlets/inlets-pro&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;Inlets&lt;/a&gt; allows devices behind firewalls or NAT to securely expose local services to the public internet by tunneling traffic through a public-facing Inlets server&lt;/p&gt;
&lt;/details&gt;

&lt;hr class=&quot;sub-hr&quot; /&gt;

&lt;details class=&quot;code-container&quot;&gt;&lt;summary class=&quot;h4&quot;&gt;1.2. Dependencies&lt;/summary&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Using Homebrew: &lt;code&gt;brew install telegraf&lt;/code&gt;&lt;br /&gt;
For other OS, refer: &lt;a href=&quot;https://docs.influxdata.com/telegraf/v1/install/&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;docs.influxdata.com/telegraf/v1/install&lt;/a&gt;. &lt;br /&gt;
Optionally, download the latest telegraf release from: &lt;a href=&quot;https://www.influxdata.com/downloads&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;https://www.influxdata.com/downloads&lt;/a&gt;&lt;br /&gt;&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;Using Homebrew: &lt;code&gt;brew install osx-cpu-temp&lt;/code&gt;&lt;br /&gt;
Refer: &lt;a href=&quot;https://github.com/lavoiesl/osx-cpu-temp&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;github.com/lavoiesl/osx-cpu-temp&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/details&gt;
&lt;hr class=&quot;sub-hr&quot; /&gt;

&lt;details class=&quot;code-container&quot;&gt;&lt;summary class=&quot;h4&quot;&gt;1.3. Events&lt;/summary&gt;

&lt;p&gt;Here&apos;s a &lt;b&gt;custom script&lt;/b&gt; to get the CPU and Fan Speed:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;#!/bin/bash
timestamp=$(date +%s)000000000
hostname=$(hostname | tr &quot;[:upper:]&quot; &quot;[:lower:]&quot;)
cpu=$(osx-cpu-temp -c | sed -e &apos;s/\([0-9.]*\).*/\1/&apos;)
fans=$(osx-cpu-temp -f | grep &apos;^Fan&apos; | sed -e &apos;s/^Fan \([0-9]\) - \([a-zA-Z]*\) side *at \([0-9]*\) RPM (\([0-9]*\)%).*/\1,\2,\3,\4/&apos;)
echo &quot;cpu_temp,device_id=$hostname temp=$cpu $timestamp&quot;
for f in $fans; do
  side=$(echo &quot;$f&quot; | cut -d, -f2 | tr &quot;[:upper:]&quot; &quot;[:lower:]&quot;)
  rpm=$(echo &quot;$f&quot; | cut -d, -f3)
  pct=$(echo &quot;$f&quot; | cut -d, -f4)
  echo &quot;fan_speed,device_id=$hostname,side=$side rpm=$rpm,percent=$pct $timestamp&quot;
done
&lt;/code&gt;&lt;/pre&gt;

&lt;hr class=&quot;hr&quot; /&gt;

&lt;p&gt;&lt;b&gt;Output Format&lt;/b&gt;: &lt;code&gt;measurement,host=foo,tag=measure val1=5,val2=3234.34 1609459200000000000&lt;/code&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;The output is of &lt;a href=&quot;https://docs.influxdata.com/influxdb/v1/write_protocols/line_protocol_reference/&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;Line protocol syntax&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Where &lt;code&gt;measurement&lt;/code&gt; is the &quot;table&quot; (&quot;measurement&quot; in InfluxDB terms) to which the metrics are written.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;host=foo,tag=measure&lt;/code&gt; are tags you can group and filter by.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;val1=5,val2=3234.34&lt;/code&gt; are values, to display in graphs.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;1716425990000000000&lt;/code&gt; is the current unix timestamp in seconds followed by 9 x &quot;0&quot;, representing a nanosecond-precision timestamp.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;b&gt;Sample Output&lt;/b&gt;: &lt;code&gt;cpu_temp,device_id=adeshs-mbp temp=0.0 1716425990000000000&lt;/code&gt;&lt;/p&gt;
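&lt;p&gt;The same point can be assembled programmatically; a minimal sketch (the helper below is illustrative, not part of Telegraf, and mirrors the fields emitted by the custom script):&lt;/p&gt;

```java
public class LineProtocolPoint {
    // Build one line-protocol point: measurement,tagKey=tagValue fieldKey=fieldValue <ns-timestamp>
    static String cpuTemp(String deviceId, double temp, long epochSeconds) {
        // Line protocol expects a nanosecond timestamp: epoch seconds plus nine zeros
        return "cpu_temp,device_id=" + deviceId + " temp=" + temp + " " + epochSeconds + "000000000";
    }

    public static void main(String[] args) {
        System.out.println(cpuTemp("adeshs-mbp", 0.0, 1716425990L));
    }
}
```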
&lt;/details&gt;

&lt;hr class=&quot;sub-hr&quot; /&gt;

&lt;details class=&quot;code-container&quot;&gt;&lt;summary class=&quot;h4&quot;&gt;1.4. Configuration&lt;/summary&gt;
&lt;p&gt;The location of &lt;code&gt;telegraf.conf&lt;/code&gt; installed using homebrew: &lt;code&gt;/opt/homebrew/etc/telegraf.conf&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Telegraf&apos;s configuration file is written using &lt;a href=&quot;https://github.com/toml-lang/toml#toml&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;TOML&lt;/a&gt; and is composed of three sections: &lt;a href=&quot;https://github.com/influxdata/telegraf/blob/master/docs/CONFIGURATION.md#global-tags&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;global tags&lt;/a&gt;, &lt;a href=&quot;https://github.com/influxdata/telegraf/blob/master/docs/CONFIGURATION.md#agent&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;agent&lt;/a&gt; settings, and &lt;a href=&quot;https://github.com/influxdata/telegraf/blob/master/docs/CONFIGURATION.md#plugins&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;plugins&lt;/a&gt; (inputs, outputs, processors, and aggregators).&lt;/p&gt;

&lt;p&gt;Once Telegraf collects the data, we need to transmit it to a designated endpoint for further processing. For this, we&apos;ll use the &lt;a href=&quot;https://github.com/influxdata/telegraf/blob/release-1.30/plugins/outputs/http/README.md&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;HTTP output plugin&lt;/a&gt; in Telegraf to send the data in JSON format to a Flask application (covered in the next section).&lt;/p&gt;

&lt;p&gt;Below is what the &lt;code&gt;telegraf.conf&lt;/code&gt; file looks like, with the &lt;code&gt;exec&lt;/code&gt; input plugin (format: &lt;code&gt;influx&lt;/code&gt;) and the &lt;code&gt;HTTP&lt;/code&gt; output plugin (format: &lt;code&gt;JSON&lt;/code&gt;).&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;[agent]
  interval = &quot;10s&quot;
  round_interval = true
  metric_buffer_limit = 10000
  flush_buffer_when_full = true
  collection_jitter = &quot;0s&quot;
  flush_interval = &quot;10s&quot;
  flush_jitter = &quot;0s&quot;
  precision = &quot;&quot;
  debug = false
  quiet = false
  logfile = &quot;/path to telegraf log/telegraf.log&quot;
  hostname = &quot;host&quot;
  omit_hostname = false

[[inputs.exec]]
  commands = [&quot;/path to custom script/osx_metrics.sh&quot;]
  timeout = &quot;5s&quot;
  name_suffix = &quot;_custom&quot;
  data_format = &quot;influx&quot;
  interval = &quot;10s&quot;

[[outputs.http]]
  url = &quot;http://127.0.0.1:5000/metrics&quot;
  method = &quot;POST&quot;
  timeout = &quot;5s&quot;
  data_format = &quot;json&quot;
  [outputs.http.headers]
    Content-Type = &quot;application/json&quot;
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Edit &lt;code&gt;telegraf.conf&lt;/code&gt; (use above config):&lt;br /&gt; &lt;code&gt;vi /opt/homebrew/etc/telegraf.conf&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;🚧: Don&apos;t forget to explore the many other input and output plugins: &lt;a href=&quot;https://docs.influxdata.com/telegraf/v1/plugins/&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;docs.influxdata.com/telegraf/v1/plugins&lt;/a&gt;&lt;/p&gt;
&lt;/details&gt;

&lt;hr class=&quot;sub-hr&quot; /&gt;

&lt;details class=&quot;code-container&quot;&gt;&lt;summary class=&quot;h4&quot; id=&quot;telemetry-1-5&quot;&gt;1.5. Start Capture&lt;/summary&gt;
&lt;p&gt;Run &lt;code&gt;telegraf&lt;/code&gt; (when installed from Homebrew):&lt;br /&gt; &lt;code&gt;/opt/homebrew/opt/telegraf/bin/telegraf -config /opt/homebrew/etc/telegraf.conf&lt;/code&gt;&lt;/p&gt;
&lt;/details&gt;

&lt;/details&gt;

&lt;hr class=&quot;clear-hr&quot; /&gt;

&lt;details open=&quot;&quot;&gt;&lt;summary class=&quot;h3&quot;&gt;2. Ingestion&lt;/summary&gt;
&lt;p&gt;&lt;/p&gt;
&lt;details open=&quot;&quot; class=&quot;text-container&quot;&gt;&lt;summary class=&quot;h4&quot;&gt;2.1. Telemetry Server&lt;/summary&gt;

&lt;p&gt;The telemetry server layer is designed to be &lt;u&gt;lightweight&lt;/u&gt;. Its primary function is to authenticate incoming requests and publish raw events directly to Message Broker/Kafka. Further processing of these events will be carried out by the stream processing framework.&lt;/p&gt;

&lt;p&gt;For our example, the &lt;a href=&quot;https://flask.palletsprojects.com/en/3.0.x/&quot; target=&quot;_blank&quot;&gt;Flask&lt;/a&gt; application serves as the telemetry server, acting as the entry point (via load-balancer) for the requests. It receives the data from a POST request, validates it, and publishes the messages to a &lt;a href=&quot;https://kafka.apache.org/&quot; target=&quot;_blank&quot;&gt;Kafka&lt;/a&gt; topic.&lt;/p&gt;

&lt;p&gt;Topic partition is the unit of parallelism in Kafka. Choose a partition key (ex: client_id) that distributes records evenly to avoid hotspots, and a &lt;a href=&quot;https://www.confluent.io/blog/how-choose-number-topics-partitions-kafka-cluster&quot; target=&quot;_blank&quot;&gt;number of partitions&lt;/a&gt; that achieves good throughput.&lt;/p&gt;
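&lt;p&gt;To see why an even key distribution matters, here is a toy illustration of key-based partition assignment (Kafka&apos;s default partitioner actually uses murmur2 on the key bytes; the MD5-based hash and the partition count below are stand-ins for the idea, not Kafka&apos;s code):&lt;/p&gt;

```python
# Toy partitioner: hash the key and take it modulo the partition count,
# then check how evenly 10,000 distinct client_ids spread across partitions.
import hashlib
from collections import Counter

NUM_PARTITIONS = 6  # assumption for the illustration

def partition_for(key: str) -> int:
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PARTITIONS

counts = Counter(partition_for(f"client-{i}") for i in range(10_000))
print(sorted(counts.values()))  # roughly equal counts per partition
```

A skewed key (say, a constant tenant id shared by most traffic) would pile records onto one partition, capping consumer parallelism at one.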

&lt;p&gt;🚧 Message Broker Alternatives: &lt;a href=&quot;https://aws.amazon.com/kinesis/&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;Amazon Kinesis&lt;/a&gt;, &lt;a href=&quot;https://redpanda.com/&quot; target=&quot;_blank&quot;&gt;Redpanda&lt;/a&gt;&lt;/p&gt;
&lt;/details&gt;

&lt;hr class=&quot;sub-hr&quot; /&gt;

&lt;details class=&quot;code-container&quot;&gt;&lt;summary class=&quot;h4&quot;&gt;2.2. Dependencies&lt;/summary&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Using PIP: &lt;code&gt;pip3 install Flask flask-cors kafka-python&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;
&lt;b&gt;For Local Kafka Set-up&lt;/b&gt; (Or use Docker from next sub-section):
&lt;li&gt;&lt;p&gt;Using Homebrew: &lt;code&gt;brew install kafka&lt;/code&gt; &lt;br /&gt;Refer: &lt;a href=&quot;https://formulae.brew.sh/formula/kafka&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;Homebrew Kafka&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Start Zookeeper: &lt;code&gt;zookeeper-server-start /opt/homebrew/etc/kafka/zookeeper.properties&lt;/code&gt;&lt;br /&gt;
Start Kafka: &lt;code&gt;brew services restart kafka&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;Create Topic: &lt;code&gt;kafka-topics --create --bootstrap-server localhost:9092 --replication-factor 1 --partitions 1 --topic learn&lt;/code&gt; &lt;br /&gt;Usage: &lt;a href=&quot;https://kafka.apache.org/documentation/#topicconfigs&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;Kafka CLI&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/details&gt;

&lt;hr class=&quot;sub-hr&quot; /&gt;

&lt;details class=&quot;code-container&quot;&gt;&lt;summary class=&quot;h4&quot; id=&quot;telemetry-2-3&quot;&gt;2.3. Docker Compose&lt;/summary&gt;

&lt;p&gt;To set up Kafka using Docker Compose, ensure Docker is installed on your machine by following the instructions on the &lt;a href=&quot;https://docs.docker.com/get-docker/&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;Docker installation&lt;/a&gt; page. Once Docker is installed, create a &lt;code&gt;docker-compose.yml&lt;/code&gt; for &lt;code&gt;Kafka&lt;/code&gt; and &lt;code&gt;Zookeeper&lt;/code&gt;:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;version: &apos;3.7&apos;

services:
  zookeeper:
    image: confluentinc/cp-zookeeper:7.3.5
    environment:
      ZOOKEEPER_CLIENT_PORT: 2181
    ports:
      - &quot;2181:2181&quot;

  kafka:
    image: confluentinc/cp-kafka:7.3.5
    ports:
      - &quot;9092:9092&quot;  # Internal port
      - &quot;9094:9094&quot;  # External port
    environment:
      KAFKA_BROKER_ID: 1
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: INTERNAL:PLAINTEXT,OUTSIDE:PLAINTEXT
      KAFKA_ADVERTISED_LISTENERS: INTERNAL://kafka:9092,OUTSIDE://localhost:9094
      KAFKA_LISTENERS: INTERNAL://0.0.0.0:9092,OUTSIDE://0.0.0.0:9094
      KAFKA_INTER_BROKER_LISTENER_NAME: INTERNAL
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
      CONFLUENT_SUPPORT_METRICS_ENABLE: &quot;false&quot;
    depends_on:
      - zookeeper

  kafka-topics-creator:
    image: confluentinc/cp-kafka:7.3.5
    depends_on:
      - kafka
    entrypoint: [&quot;/bin/sh&quot;, &quot;-c&quot;]
    command: |
      &quot;
      # blocks until kafka is reachable
      kafka-topics --bootstrap-server kafka:9092 --list

      echo -e &apos;Creating kafka topics&apos;
      kafka-topics --bootstrap-server kafka:9092 --create --if-not-exists --topic raw-events --replication-factor 1 --partitions 1

      echo -e &apos;Successfully created the following topics:&apos;
      kafka-topics --bootstrap-server kafka:9092 --list
      &quot;

  schema-registry:
    image: confluentinc/cp-schema-registry:7.3.5
    environment:
      - SCHEMA_REGISTRY_KAFKASTORE_CONNECTION_URL=zookeeper:2181
      - SCHEMA_REGISTRY_HOST_NAME=schema-registry
      - SCHEMA_REGISTRY_LISTENERS=http://schema-registry:8085,http://localhost:8085
    ports:
      - 8085:8085
    depends_on: [zookeeper, kafka]
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Run &lt;code&gt;docker-compose up&lt;/code&gt; to start the services (Kafka + Zookeeper).&lt;/p&gt;
&lt;/details&gt;

&lt;hr class=&quot;sub-hr&quot; /&gt;

&lt;details class=&quot;code-container&quot;&gt;&lt;summary class=&quot;h4&quot; id=&quot;telemetry-2-4&quot;&gt;2.4. Start Server&lt;/summary&gt;

&lt;p&gt;The Flask application includes a &lt;code&gt;/metrics&lt;/code&gt; endpoint, matching the output URL configured in &lt;code&gt;telegraf.conf&lt;/code&gt;. When data is sent to this endpoint, the Flask app receives it and publishes the message to &lt;code&gt;Kafka&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;New to Flask? Refer: &lt;a href=&quot;https://flask.palletsprojects.com/en/3.0.x/quickstart/&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;Flask Quickstart&lt;/a&gt;&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import os
from flask_cors import CORS
from flask import Flask, jsonify, request
from dotenv import load_dotenv
from kafka import KafkaProducer
import json


app = Flask(__name__)
cors = CORS(app)
load_dotenv()

producer = KafkaProducer(bootstrap_servers=&apos;localhost:9094&apos;, 
                         value_serializer=lambda v: json.dumps(v).encode(&apos;utf-8&apos;))

@app.route(&apos;/metrics&apos;, methods=[&apos;POST&apos;])
def process_metrics():
    data = request.get_json()
    print(data)
    producer.send(&apos;raw-events&apos;, data)
    return jsonify({&apos;status&apos;: &apos;success&apos;}), 200


if __name__ == &quot;__main__&quot;:
    app.run(debug=True, host=&quot;0.0.0.0&quot;, port=int(os.environ.get(&quot;PORT&quot;, 8080)))
&lt;/code&gt;&lt;/pre&gt;
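&lt;p&gt;A quick way to smoke-test the endpoint without waiting for a Telegraf flush is to POST a hand-built payload. The payload shape below mimics Telegraf&apos;s JSON serializer (metric name, tags, fields, timestamp); the exact field values are assumptions for illustration, and nothing is sent until &lt;code&gt;send_metrics()&lt;/code&gt; is called against a running Flask app:&lt;/p&gt;

```python
# Build a Telegraf-style JSON payload and (optionally) POST it to /metrics.
import json
import urllib.request

def build_payload(temp: float, device_id: str) -> dict:
    # Shaped like Telegraf's JSON output: a list of metrics with
    # name/tags/fields/timestamp (values here are illustrative).
    return {
        "metrics": [{
            "name": "cpu_temp_custom",
            "tags": {"device_id": device_id},
            "fields": {"temp": temp},
            "timestamp": 1716425990,
        }]
    }

def send_metrics(url: str, payload: dict) -> int:
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:  # requires the Flask app running
        return resp.status

print(json.dumps(build_payload(42.0, "adeshs-mbp")))
```

With the server up, `send_metrics("http://127.0.0.1:5000/metrics", build_payload(42.0, "adeshs-mbp"))` should return 200 and the message should land on the `raw-events` topic.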

&lt;p&gt;Start all services 🚀:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Run Flask App (Telemetry Server):&lt;br /&gt; &lt;code&gt;flask run&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Ensure &lt;code&gt;telegraf&lt;/code&gt; is running (Refer: &lt;a href=&quot;#telemetry-1-5&quot;&gt;Section 1.5&lt;/a&gt;)&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/details&gt;

&lt;/details&gt;

&lt;hr class=&quot;clear-hr&quot; /&gt;

&lt;details open=&quot;&quot;&gt;&lt;summary class=&quot;h3&quot;&gt;3. Processing&lt;/summary&gt;
&lt;p&gt;&lt;/p&gt;
&lt;details open=&quot;&quot; class=&quot;text-container&quot;&gt;&lt;summary class=&quot;h4&quot;&gt;3.1. Stream Processor&lt;/summary&gt;
&lt;p&gt;The Stream Processor is responsible for data transformation, enrichment, and stateful computations/updates over unbounded (push-model) and bounded (pull-model) data streams, and for sinking the enriched, transformed data to various data stores or applications. Key features to look for in a stream processing framework:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;b&gt;Scalability and Performance&lt;/b&gt;: Scale by adding nodes, efficiently use resources, process data with minimal delay, and handle large volumes.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;b&gt;Fault Tolerance and Data Consistency&lt;/b&gt;: Ensure fault tolerance with state saving for failure recovery and exactly-once processing.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;b&gt;Ease of Use and Community Support&lt;/b&gt;: Provide user-friendly APIs in multiple languages, comprehensive documentation, and active community support.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;img src=&quot;./assets/posts/telemetry/stateful-stream-processing.svg&quot; /&gt;
&lt;p class=&quot;figure-header&quot;&gt;Figure 3: Stateful Stream Processing&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;b&gt;Integration and Compatibility&lt;/b&gt;: Seamlessly integrate with various data sources and sinks, and be compatible with other tools in your tech stack.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;b&gt;Windowing and Event Time Processing&lt;/b&gt;: Support various &lt;a href=&quot;https://nightlies.apache.org/flink/flink-docs-release-1.19/docs/dev/datastream/operators/windows/&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;windowing strategies&lt;/a&gt; (tumbling, sliding, session) and manage late-arriving data based on event timestamps.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;b&gt;Security and Monitoring&lt;/b&gt;: Include security features like data encryption and robust access controls, and provide tools for monitoring performance and logging.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
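&lt;p&gt;To make the windowing idea above concrete, here is a toy tumbling-window aggregation in plain Python (not Flink&apos;s API; the 10-second window and the averaging are arbitrary choices for illustration):&lt;/p&gt;

```python
# Toy tumbling window: bucket events by event time into fixed 10-second
# windows and compute the average value per window.
from collections import defaultdict

WINDOW_SEC = 10

def tumbling_avg(events):
    """events: iterable of (epoch_seconds, value) pairs."""
    buckets = defaultdict(list)
    for ts, value in events:
        window_start = (ts // WINDOW_SEC) * WINDOW_SEC  # window the event falls in
        buckets[window_start].append(value)
    return {w: sum(vals) / len(vals) for w, vals in sorted(buckets.items())}

events = [(100, 40.0), (104, 42.0), (111, 50.0), (119, 54.0)]
print(tumbling_avg(events))  # {100: 41.0, 110: 52.0}
```

A real stream processor does the same bucketing continuously and in event time, which is where watermarks and late-data handling come in.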
&lt;p&gt;Although I have set the context to use Flink for this example:&lt;br /&gt;
☢️ Note: While &lt;a href=&quot;https://flink.apache.org/&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;Apache Flink&lt;/a&gt; is a powerful choice for stream processing due to its rich feature set, scalability, and advanced capabilities, it can be overkill for many use cases, particularly those with simpler requirements and/or lower data volumes.&lt;/p&gt;

&lt;p&gt;🚧 Open Source Alternatives: &lt;a href=&quot;https://kafka.apache.org/documentation/streams/&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;Apache Kafka Streams&lt;/a&gt;, &lt;a href=&quot;https://storm.apache.org/&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;Apache Storm&lt;/a&gt;, &lt;a href=&quot;https://samza.apache.org/&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;Apache Samza&lt;/a&gt;&lt;/p&gt;
&lt;/details&gt;

&lt;hr class=&quot;sub-hr&quot; /&gt;

&lt;details class=&quot;code-container&quot;&gt;&lt;summary class=&quot;h4&quot;&gt;3.2. Dependencies&lt;/summary&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Install PyFlink Using PIP: &lt;code&gt;pip3 install apache-flink==1.18.1&lt;/code&gt;&lt;br /&gt;Usage examples: &lt;a href=&quot;https://github.com/apache/flink/tree/release-1.19/flink-python/pyflink/examples&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;flink-python/pyflink/examples&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;

&lt;b&gt;For Local Flink Set-up:&lt;/b&gt; (Or use Docker from next sub-section)
&lt;li&gt;&lt;p&gt;Download Flink and extract the archive: &lt;a href=&quot;https://www.apache.org/dyn/closer.lua/flink/flink-1.18.1/flink-1.18.1-bin-scala_2.12.tgz&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;www.apache.org/dyn/closer.lua/flink/flink-1.18.1/flink-1.18.1-bin-scala_2.12.tgz&lt;/a&gt;&lt;br /&gt;☢️ At the time of writing this post &lt;code&gt;Flink 1.18.1&lt;/code&gt; is the latest stable version that supports &lt;a href=&quot;https://www.apache.org/dyn/closer.lua/flink/flink-connector-kafka-3.1.0/flink-connector-kafka-3.1.0-src.tgz&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;kafka connector plugin&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Download Kafka Connector and extract the archive: &lt;a href=&quot;https://www.apache.org/dyn/closer.lua/flink/flink-connector-kafka-3.1.0/flink-connector-kafka-3.1.0-src.tgz&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;www.apache.org/dyn/closer.lua/flink/flink-connector-kafka-3.1.0/flink-connector-kafka-3.1.0-src.tgz&lt;/a&gt;&lt;br /&gt;Copy/Move the &lt;code&gt;flink-connector-kafka-3.1.0-1.18.jar&lt;/code&gt; to &lt;code&gt;flink-1.18.1/lib&lt;/code&gt; (&lt;code&gt;$FLINK_HOME/lib&lt;/code&gt;)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Ensure Flink Path is set &lt;code&gt;export FLINK_HOME=/full-path/flink-1.18.1&lt;/code&gt; (add to &lt;code&gt;.bashrc&lt;/code&gt;/&lt;code&gt;.zshrc&lt;/code&gt;)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Start Flink Cluster: &lt;code&gt;cd flink-1.18.1 &amp;amp;&amp;amp; ./bin/start-cluster.sh&lt;/code&gt;
&lt;br /&gt;Flink dashboard at: &lt;a href=&quot;http://localhost:8081&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;localhost:8081&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;To Stop Flink Cluster: &lt;code&gt;./bin/stop-cluster.sh&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/details&gt;

&lt;hr class=&quot;sub-hr&quot; /&gt;

&lt;details class=&quot;code-container&quot;&gt;&lt;summary class=&quot;h4&quot; id=&quot;telemetry-3-3&quot;&gt;3.3. Docker Compose&lt;/summary&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Create &lt;code&gt;flink_init/Dockerfile&lt;/code&gt; file for Flink and Kafka Connector:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;FROM flink:1.18.1-scala_2.12

RUN wget -P /opt/flink/lib https://repo.maven.apache.org/maven2/org/apache/flink/flink-connector-kafka/3.1.0-1.18/flink-connector-kafka-3.1.0-1.18.jar

RUN chown -R flink:flink /opt/flink/lib
&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;

&lt;li&gt;&lt;p&gt;Add Flink to &lt;code&gt;docker-compose.yml&lt;/code&gt; (in-addition to Kafka, from &lt;a href=&quot;#telemetry-2-3&quot;&gt;Section 2.3&lt;/a&gt;)&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;version: &apos;3.8&apos;
services:
  jobmanager:
    build: flink_init/.
    ports:
      - &quot;8081:8081&quot;
    command: jobmanager
    environment:
      - |
        FLINK_PROPERTIES=
        jobmanager.rpc.address: jobmanager

  taskmanager:
    build: flink_init/.
    depends_on:
      - jobmanager
    command: taskmanager
    environment:
      - |
        FLINK_PROPERTIES=
        jobmanager.rpc.address: jobmanager
        taskmanager.numberOfTaskSlots: 2
&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;p&gt;Run &lt;code&gt;docker-compose up&lt;/code&gt; to start the services (Kafka + Zookeeper, Flink).&lt;/p&gt;
&lt;/ul&gt;
&lt;/details&gt;

&lt;hr class=&quot;sub-hr&quot; /&gt;

&lt;details class=&quot;code-container&quot;&gt;&lt;summary class=&quot;h4&quot;&gt;3.4. Start Cluster&lt;/summary&gt;
&lt;p&gt;⚠️ PyFlink Job (a minimal sketch: it consumes the &lt;code&gt;raw-events&lt;/code&gt; topic from the earlier sections, parses the JSON, and prints it; the group id and job name are placeholders, and a real job would replace &lt;code&gt;print()&lt;/code&gt; with transformations and a sink):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import json

from pyflink.common.serialization import SimpleStringSchema
from pyflink.common.watermark_strategy import WatermarkStrategy
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.datastream.connectors.kafka import KafkaOffsetsInitializer, KafkaSource

env = StreamExecutionEnvironment.get_execution_environment()

# Kafka source for the &apos;raw-events&apos; topic (external listener from Section 2.3)
source = KafkaSource.builder() \
    .set_bootstrap_servers(&apos;localhost:9094&apos;) \
    .set_topics(&apos;raw-events&apos;) \
    .set_group_id(&apos;telemetry-processor&apos;) \
    .set_starting_offsets(KafkaOffsetsInitializer.latest()) \
    .set_value_only_deserializer(SimpleStringSchema()) \
    .build()

stream = env.from_source(source, WatermarkStrategy.no_watermarks(), &apos;kafka-source&apos;)

# Parse the raw JSON and print; swap print() for real transformations + a sink
stream.map(lambda raw: json.loads(raw)).print()

env.execute(&apos;telemetry-job&apos;)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Start all services 🚀:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Ensure all the services are running (Refer: Section &lt;a href=&quot;#telemetry-1-5&quot;&gt;1.5&lt;/a&gt;, &lt;a href=&quot;#telemetry-2-4&quot;&gt;2.4&lt;/a&gt;, &lt;a href=&quot;#telemetry-3-3&quot;&gt;3.3&lt;/a&gt;)&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/details&gt;

&lt;/details&gt;

&lt;hr class=&quot;clear-hr&quot; /&gt;

&lt;details open=&quot;&quot;&gt;&lt;summary class=&quot;h3&quot;&gt;4. Storage and Analysis &lt;/summary&gt;
&lt;p&gt;The code snippets stop here! The rest of the post covers key conventions, strategies, and factors for selecting the right data store, performing real-time analytics, and setting up alerts.&lt;/p&gt;
&lt;details open=&quot;&quot; class=&quot;text-container&quot;&gt;&lt;summary class=&quot;h4&quot;&gt;4.1. Datastore &lt;/summary&gt;
&lt;p&gt;When choosing the right database for telemetry data, it&apos;s crucial to consider several factors:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;b&gt;Read and Write Patterns&lt;/b&gt;: Understanding the frequency and volume of read and write operations is key. High write and read throughput require different database optimizations and consistencies.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;b&gt;Data Amplification&lt;/b&gt;: Be mindful of how the data volume might grow over time (+&lt;a href=&quot;https://en.wikipedia.org/wiki/Write_amplification&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;Write Amplification&lt;/a&gt;) and how the database handles this increase without significant performance degradation.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;b&gt;Cost&lt;/b&gt;: Evaluate the cost implications, including storage, processing, and any associated services.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;b&gt;Analytics Use Cases&lt;/b&gt;: Determine whether the primary need is for real-time analytics, historical data analysis, or both.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;b&gt;Transactions&lt;/b&gt;: Consider the nature and complexity of transactions that will be performed. For example: Batch write transactions&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;b&gt;Read and Write Consistency&lt;/b&gt;: Decide on the level of consistency required for the application. For example, OLTP (Online Transaction Processing) systems prioritize consistency and transaction integrity, while OLAP (Online Analytical Processing) systems are optimized for complex queries and read-heavy workloads.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;🌵 &lt;a href=&quot;https://tikv.github.io/deep-dive-tikv/key-value-engine/B-Tree-vs-Log-Structured-Merge-Tree.html&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;LSM-Tree&lt;/a&gt; favors write-intensive applications.&lt;/p&gt;

&lt;hr class=&quot;hr&quot; /&gt;

&lt;p&gt;For example, to decide between row-based and columnar storage, or between OLTP (Online Transaction Processing), OLAP (Online Analytical Processing), and a hybrid approach:&lt;/p&gt;

&lt;img class=&quot;center-image-90&quot; src=&quot;./assets/posts/telemetry/storage-scan-direction.png&quot; /&gt;
&lt;p class=&quot;figure-header&quot;&gt;Figure 4: Row vs Columnar Storage&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;b&gt;Transactional and High Throughput Needs&lt;/b&gt;: For high write throughput and transactional batches (all or nothing), with queries needing wide-column family fetches and indexed queries within the partition, Cassandra/&lt;a href=&quot;https://www.scylladb.com/&quot; target=&quot;_blank&quot;&gt;ScyllaDB&lt;/a&gt; is better suited.&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;&lt;b&gt;Complex Analytical Queries&lt;/b&gt;: For more complex analytical queries, aggregations on specific columns, and machine learning models, data stores such as &lt;a href=&quot;https://clickhouse.com/&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;ClickHouse&lt;/a&gt; or &lt;a href=&quot;https://druid.apache.org/&quot; target=&quot;_blank&quot;&gt;Druid&lt;/a&gt; are more appropriate. Their optimized columnar storage and powerful query capabilities make them ideal for handling large-scale analytical tasks. Several others include VictoriaMetrics and InfluxDB (with an emphasis on time-series); closed-source: &lt;a href=&quot;https://www.snowflake.com/&quot; target=&quot;_blank&quot;&gt;Snowflake&lt;/a&gt;, &lt;a href=&quot;https://cloud.google.com/bigquery&quot; target=&quot;_blank&quot;&gt;BigQuery&lt;/a&gt;, and &lt;a href=&quot;https://aws.amazon.com/redshift/&quot; target=&quot;_blank&quot;&gt;Redshift&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;&lt;b&gt;Hybrid Approach&lt;/b&gt;: In scenarios requiring both fast write-heavy transactional processing and complex analytics, a common approach is to use Cassandra for real-time data ingestion and storage, and periodically perform ETL (Extract, Transform, Load) or CDC (Change Data Capture) processes to batch insert data into an OLAP DB for analytical processing. This leverages the strengths of both databases, ensuring efficient data handling and comprehensive analytical capabilities. Proper indexing and data modeling go without saying 🧐&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;🌵 &lt;a href=&quot;https://debezium.io/&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;Debezium&lt;/a&gt;: Distributed platform for change data capture (more on &lt;a href=&quot;/debezium-postgres-cdc&quot;&gt;previous post&lt;/a&gt;).&lt;/p&gt;

&lt;hr class=&quot;hr&quot; /&gt;

&lt;p&gt;Using an HTAP (Hybrid Transactional/Analytical Processing) database that&apos;s suitable for both transactional and analytical workloads is worth considering. Examples: &lt;a href=&quot;https://github.com/pingcap/tidb&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;TiDB&lt;/a&gt;, &lt;a href=&quot;https://www.timescale.com/&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;TimescaleDB&lt;/a&gt; (kind of).&lt;/p&gt;

&lt;p&gt;While you get some of the best from both worlds 🌎, you also inherit a few of the worst from each! &lt;br /&gt;Lucky for you, I have first-hand experience with it 🤭:&lt;/p&gt;
&lt;img src=&quot;./assets/posts/telemetry/of-both-worlds-h.png&quot; /&gt;
&lt;p class=&quot;figure-header&quot;&gt;Figure 5: Detailed comparison of OLTP, OLAP and HTAP&lt;/p&gt;

&lt;p&gt;&lt;b&gt;Analogy&lt;/b&gt;: Choosing the right database is like picking the perfect ride. Need pay-as-you-go flexibility? Grab a taxi. Tackling heavy-duty tasks? 🚜 Bring in the bulldozer. For everyday use, 🚗 a Toyota fits. Bringing a war tank to a community center is overkill. Sometimes, you need a fleet—a car for daily use, and a truck for heavy loads.&lt;/p&gt;

&lt;p&gt;☢️ &lt;a&gt;InfluxDB&lt;/a&gt;: Stagnant &lt;a href=&quot;https://github.com/influxdata/influxdb/graphs/contributors&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;contribution&lt;/a&gt; graph, &lt;a href=&quot;https://community.influxdata.com/t/is-flux-being-deprecated-with-influxdb-3-0/30992/4&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;Flux&lt;/a&gt; deprecation, but new &lt;a href=&quot;https://www.influxdata.com/benchmarks/&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;benchmarks&lt;/a&gt;!&lt;/p&gt;
&lt;/details&gt;

&lt;hr class=&quot;sub-hr&quot; /&gt;

&lt;details open=&quot;&quot; class=&quot;text-container&quot;&gt;&lt;summary class=&quot;h4&quot;&gt;4.2. Partition and Indexes&lt;/summary&gt;

&lt;p&gt;Without getting into too much detail, it&apos;s crucial to choose the right partitioning strategy (ex: range, list, hash) to ensure partitions don&apos;t bloat and to effectively support the primary read patterns (in this context, for example: client_id + region + first day of the month).&lt;/p&gt;
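&lt;p&gt;The composite key just mentioned can be sketched as a small helper (the &lt;code&gt;#&lt;/code&gt; separator and function name are my own conventions for illustration):&lt;/p&gt;

```python
# Build a composite partition key: client_id + region + first day of the
# event's month, so each partition holds one client/region/month slice.
from datetime import date, datetime

def partition_key(client_id: str, region: str, event_time: datetime) -> str:
    month_bucket = date(event_time.year, event_time.month, 1).isoformat()
    return f"{client_id}#{region}#{month_bucket}"

print(partition_key("client-42", "us-east-1", datetime(2024, 6, 5, 12, 30)))
# client-42#us-east-1#2024-06-01
```

Bucketing by month keeps any one partition from growing without bound while still letting the common "this client, this region, this month" reads hit a single partition.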

&lt;img class=&quot;center-image-0 center-image-70&quot; src=&quot;./assets/posts/telemetry/index-types.svg&quot; /&gt;
&lt;p class=&quot;figure-header&quot;&gt;Figure 6: Types of Indexes and Materialized view&lt;/p&gt;

&lt;p&gt;Following this, clustering columns and indexes help organize data within partitions to optimize range queries and sorting. Secondary indexes (local, within the partition, or global, across partitions) are valuable for query patterns where partition or primary keys don&apos;t apply. Materialized views precompute and store complex query results, speeding up reads for frequently accessed data.&lt;/p&gt;

&lt;img src=&quot;./assets/posts/telemetry/partition-view.svg&quot; /&gt;
&lt;p class=&quot;figure-header&quot;&gt;Figure 7: Partition Key, Clustering Keys, Local/Global Secondary Indexes and Materialized views&lt;/p&gt;

&lt;p&gt;&lt;b&gt;Multi-dimensional Index (Spatial/Spatio-temporal)&lt;/b&gt;: Indexes such as B+ trees and LSM trees are not designed to directly store higher-dimensional data. Spatial indexing uses structures like R-trees and &lt;a href=&quot;/hybrid-spatial-index-conclusion&quot; target=&quot;_blank&quot;&gt;Quad-trees&lt;/a&gt; and techniques like &lt;a href=&quot;/geohash&quot; target=&quot;_blank&quot;&gt;geohash&lt;/a&gt;. Space-filling curves like Z-order (Morton) and Hilbert curves interleave spatial and temporal dimensions, preserving locality and enabling efficient queries.&lt;/p&gt;

&lt;img class=&quot;center-image-0&quot; src=&quot;./assets/posts/spatial-index/spatial-index-types.svg&quot; /&gt; 
&lt;p class=&quot;figure-header&quot;&gt;Figure 8: Commonly Used: Types of Spatial Indexes&lt;/p&gt;

&lt;p&gt;🌵 &lt;a href=&quot;https://www.geomesa.org/documentation/stable/index.html&quot; target=&quot;_blank&quot;&gt;GeoMesa&lt;/a&gt;: spatio-temporal indexing on top of Accumulo, HBase, Redis, Kafka, PostGIS, and Cassandra. &lt;a href=&quot;https://www.geomesa.org/documentation/stable/user/datastores/index_overview.html&quot; target=&quot;_blank&quot;&gt;XZ-Ordering&lt;/a&gt;: Customizing Index Creation.&lt;/p&gt;

&lt;p&gt;&lt;img class=&quot;twemoji&quot; src=&quot;../assets/img/emoji/rocket.svg&quot; alt=&quot;&quot; /&gt; &lt;a href=&quot;/spatial-index-space-filling-curve&quot;&gt;Next blog&lt;/a&gt; post is all about spatial indexes!&lt;/p&gt;

&lt;/details&gt;

&lt;hr class=&quot;sub-hr&quot; /&gt;

&lt;details open=&quot;&quot; class=&quot;text-container&quot;&gt;&lt;summary class=&quot;h4&quot;&gt;4.3. Analytics and Alerts&lt;/summary&gt;

&lt;p&gt;Typically, analytics are performed as batch queries on bounded datasets of recorded events, requiring reruns to incorporate new data.&lt;/p&gt;

&lt;img class=&quot;center-image-65&quot; src=&quot;./assets/posts/telemetry/telemetry-analytics.svg&quot; /&gt;
&lt;p class=&quot;figure-header&quot;&gt;Figure 9: Analytics on Static, Relative and In-Motion Data&lt;/p&gt;

&lt;p&gt;In contrast, streaming queries ingest real-time event streams, continuously updating results as events are consumed, with outputs either written to an external database or maintained as internal state.&lt;/p&gt;

&lt;img src=&quot;./assets/posts/telemetry/usecases-analytics.svg&quot; /&gt;
&lt;p class=&quot;figure-header&quot;&gt;Figure 10: Batch Analytics vs Stream Analytics&lt;/p&gt;
&lt;div class=&quot;table-container&quot;&gt;
&lt;table style=&quot;width: 800px;&quot;&gt;
    &lt;tr&gt;
        &lt;td&gt;Feature&lt;/td&gt;
        &lt;td&gt;Batch Analytics&lt;/td&gt;
        &lt;td&gt;Stream Analytics&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;td&gt;Data Processing&lt;/td&gt;
        &lt;td&gt;Processes large volumes of stored data&lt;/td&gt;
        &lt;td&gt;Processes data in real-time as it arrives&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;td&gt;Result Latency&lt;/td&gt;
        &lt;td&gt;Produces results with some delay; near real-time results with frequent query runs&lt;/td&gt;
        &lt;td&gt;Provides immediate insights and actions&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;td&gt;Resource Efficiency&lt;/td&gt;
        &lt;td&gt;Requires querying the database often for necessary data&lt;/td&gt;
        &lt;td&gt;Continuously updates results in transient data stores without re-querying the database&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;td&gt;Typical Use&lt;/td&gt;
        &lt;td&gt;Ideal for historical analysis and periodic reporting&lt;/td&gt;
        &lt;td&gt;Best for real-time monitoring, alerting, and dynamic applications&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;td&gt;Complexity Handling&lt;/td&gt;
        &lt;td&gt;Can handle complex queries and computations&lt;/td&gt;
        &lt;td&gt;Less effective for highly complex queries&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;td&gt;Backfill&lt;/td&gt;
        &lt;td&gt;Easy to backfill historical data and re-run queries&lt;/td&gt;
        &lt;td&gt;Backfill can potentially introduce complexity&lt;/td&gt;
    &lt;/tr&gt;
&lt;/table&gt;
&lt;/div&gt;

&lt;p&gt;🌵 &lt;a href=&quot;/anomaly-detection-and-remediation&quot; target=&quot;_blank&quot;&gt;Anomaly Detection and Remediation&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;🌵 &lt;a href=&quot;https://docs.mindsdb.com/what-is-mindsdb&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;MindsDB&lt;/a&gt;: Connect Data Source, Configure AI Engine, Create AI Tables, Query for predictions and Automate workflows.&lt;/p&gt;
&lt;/details&gt;

&lt;/details&gt;

&lt;hr class=&quot;clear-hr&quot; /&gt;

&lt;details&gt;&lt;summary class=&quot;h3&quot;&gt;5. References&lt;/summary&gt;

&lt;pre style=&quot;height: 300px&quot;&gt;&lt;code&gt;1. Wikipedia, &quot;Telemetry,&quot; available: https://en.wikipedia.org/wiki/Telemetry. [Accessed: June 5, 2024].
2. Apache Cassandra, &quot;Cassandra,&quot; available: https://cassandra.apache.org. [Accessed: June 5, 2024].
3. VictoriaMetrics, &quot;VictoriaMetrics,&quot; available: https://victoriametrics.com. [Accessed: June 6, 2024].
4. Fluentd, &quot;Fluentd,&quot; available: https://www.fluentd.org. [Accessed: June 5, 2024].
5. Elasticsearch, &quot;Elasticsearch,&quot; available: https://www.elastic.co. [Accessed: June 5, 2024].
6. InfluxData, &quot;Telegraf,&quot; available: https://www.influxdata.com. [Accessed: June 5, 2024].
7. InfluxData, &quot;Telegraf Plugins,&quot; available: https://docs.influxdata.com. [Accessed: June 5, 2024].
8. GitHub, &quot;osx-cpu-temp,&quot; available: https://github.com/lavoiesl/osx-cpu-temp. [Accessed: June 5, 2024].
9. GitHub, &quot;Inlets,&quot; available: https://github.com/inlets/inlets. [Accessed: June 5, 2024].
10. InfluxData, &quot;Telegraf Installation,&quot; available: https://docs.influxdata.com/telegraf/v1. [Accessed: June 5, 2024].
11. InfluxData, &quot;InfluxDB Line Protocol,&quot; available: https://docs.influxdata.com/influxdb/v1.8/write_protocols/line_protocol. [Accessed: June 5, 2024].
12. GitHub, &quot;Telegraf Exec Plugin,&quot; available: https://github.com/influxdata/telegraf/tree/master/plugins/inputs/exec. [Accessed: June 5, 2024].
13. GitHub, &quot;Telegraf Output Plugins,&quot; available: https://github.com/influxdata/telegraf/tree/master/plugins/outputs. [Accessed: June 5, 2024].
14. Pallets Projects, &quot;Flask,&quot; available: https://flask.palletsprojects.com. [Accessed: June 5, 2024].
15. Apache Kafka, &quot;Kafka,&quot; available: https://kafka.apache.org. [Accessed: June 5, 2024].
16. Confluent, &quot;Kafka Partitions,&quot; available: https://www.confluent.io. [Accessed: June 5, 2024].
17. AWS, &quot;Amazon Kinesis,&quot; available: https://aws.amazon.com/kinesis. [Accessed: June 5, 2024].
18. Redpanda, &quot;Redpanda,&quot; available: https://redpanda.com. [Accessed: June 5, 2024].
19. Apache, &quot;Apache Flink,&quot; available: https://flink.apache.org. [Accessed: June 6, 2024].
20. GitHub, &quot;flink-python/pyflink/examples,&quot; available: https://github.com/apache/flink/tree/master/flink-python/pyflink/examples. [Accessed: June 6, 2024].
21. Apache, &quot;Flink Download,&quot; available: https://www.apache.org/dyn/closer.lua/flink. [Accessed: June 6, 2024].
22. Apache, &quot;Flink Kafka Connector,&quot; available: https://www.apache.org/dyn/closer.lua/flink/flink-connector-kafka-3.1.0. [Accessed: June 6, 2024].
23. Docker, &quot;Docker Installation,&quot; available: https://docs.docker.com. [Accessed: June 6, 2024].
24. Apache Kafka, &quot;Kafka CLI,&quot; available: https://kafka.apache.org/quickstart. [Accessed: June 6, 2024].
25. Homebrew, &quot;Kafka Installation,&quot; available: https://formulae.brew.sh/formula/kafka. [Accessed: June 6, 2024].
26. Apache, &quot;Apache Storm,&quot; available: https://storm.apache.org. [Accessed: June 6, 2024].
27. Apache, &quot;Apache Samza,&quot; available: https://samza.apache.org. [Accessed: June 6, 2024].
28. ClickHouse, &quot;ClickHouse,&quot; available: https://clickhouse.com. [Accessed: June 6, 2024].
29. InfluxData, &quot;InfluxDB Benchmarks,&quot; available: https://www.influxdata.com/benchmarks. [Accessed: June 6, 2024].
30. TiDB, &quot;TiDB,&quot; available: https://github.com/pingcap/tidb. [Accessed: June 6, 2024].
31. Timescale, &quot;TimescaleDB,&quot; available: https://www.timescale.com. [Accessed: June 6, 2024].
32. MindsDB, &quot;MindsDB,&quot; available: https://docs.mindsdb.com. [Accessed: June 6, 2024].
33. Wikipedia, &quot;Write Amplification,&quot; available: https://en.wikipedia.org/wiki/Write_amplification. [Accessed: June 6, 2024].
34. GitHub, &quot;LSM-Tree,&quot; available: https://tikv.github.io/deep-dive/introduction/theory/lsm-tree.html. [Accessed: June 6, 2024].
&lt;/code&gt;&lt;/pre&gt;

&lt;/details&gt;
&lt;p&gt;&lt;/p&gt;</content><author><name>Adesh Nalpet Adimurthy</name></author><category term="System Wisdom" /><category term="Realtime" /><category term="Database" /><summary type="html">0. Overview 0.1. Architecture A telemetry pipeline is a system that collects, ingests, processes, stores, and analyzes telemetry data (metrics, logs, traces) from various sources in real-time or near real-time to provide insights into the performance and health of applications and infrastructure.</summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://pyblog.xyz/assets/featured/webp/telemetry-pipeline.webp" /><media:content medium="image" url="https://pyblog.xyz/assets/featured/webp/telemetry-pipeline.webp" xmlns:media="http://search.yahoo.com/mrss/" /></entry></feed>