<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="4.3.3">Jekyll</generator><link href="https://pyblog.xyz/feed.xml" rel="self" type="application/atom+xml" /><link href="https://pyblog.xyz/" rel="alternate" type="text/html" /><updated>2026-02-27T15:24:26+00:00</updated><id>https://pyblog.xyz/feed.xml</id><title type="html">PYBLOG</title><subtitle>How do you know which &lt;br&gt; &lt;img id=&quot;showerButton&quot; class=&quot;twemoji&quot; src=&quot;https://pyblog.xyz/assets/img/emoji/watermelon.svg&quot; alt=&quot;&quot;&gt; to pick?</subtitle><author><name>Adesh Nalpet Adimurthy</name></author><entry><title type="html">Apache Flink Internals</title><link href="https://pyblog.xyz/flink-internals" rel="alternate" type="text/html" title="Apache Flink Internals" /><published>2026-02-26T00:00:00+00:00</published><updated>2026-02-26T00:00:00+00:00</updated><id>https://pyblog.xyz/flink-internals</id><content type="html" xml:base="https://pyblog.xyz/flink-internals">&lt;p&gt;Most blog posts on Flink&apos;s internals and architecture, even the official documentation, tend to be fragmented across different examples and cover components in isolation. The approach taken here is to follow a single reference Flink job end-to-end, through every component and moving part it touches, keeping the discussion grounded in the example, rather than attempting broad coverage of Flink&apos;s full capabilities. The tradeoff is intentional: depth over breadth.&lt;/p&gt;

&lt;h3&gt;1. Components&lt;/h3&gt;

&lt;p&gt;A running Flink system has two sides: the user-facing side and the system side.&lt;/p&gt;

&lt;p&gt;The user-facing side is the Client, where the application code lives. This includes the &lt;code&gt;DataStream&lt;/code&gt; API calls, job configuration, and JAR packaging. The Client&apos;s job is to compile that code into a graph representation and submit it to the cluster.&lt;/p&gt;

&lt;p&gt;&lt;img class=&quot;center-image-0 center-image-70&quot; src=&quot;./assets/posts/flink/flink-program.svg&quot; /&gt;&lt;/p&gt;

&lt;p&gt;The system side consists of the &lt;code&gt;JobManager&lt;/code&gt; and &lt;code&gt;TaskManagers&lt;/code&gt;. The &lt;code&gt;JobManager&lt;/code&gt; receives the submitted job, plans its execution, and coordinates the entire lifecycle: scheduling, checkpointing, failure recovery. &lt;code&gt;TaskManagers&lt;/code&gt; are the workers that receive individual tasks from the &lt;code&gt;JobManager&lt;/code&gt; and run the actual data processing.&lt;/p&gt;

&lt;p&gt;The journey from user code to running tasks involves a series of graph transformations, each adding the detail the runtime needs to distribute and execute the job across the cluster.&lt;/p&gt;

&lt;h3&gt;2. Code to Execution&lt;/h3&gt;

&lt;p&gt;Consider a simple streaming job: read from a source, apply a map transformation, group by key, aggregate in a window, and write to a sink.&lt;/p&gt;

&lt;p&gt;&lt;img class=&quot;center-image-0 center-image-65&quot; src=&quot;./assets/posts/flink/flink-topology.svg&quot; /&gt;&lt;/p&gt;

&lt;p&gt;The code does not execute anything until &lt;code&gt;env.execute()&lt;/code&gt; is called. Between that call and actual task execution, Flink builds a series of progressively more detailed graphs.&lt;/p&gt;

&lt;p&gt;&lt;img class=&quot;center-image-0 center-image-90&quot; src=&quot;./assets/posts/flink/flink-planning.svg&quot; /&gt;&lt;/p&gt;

&lt;h3&gt;2.1. Transformations&lt;/h3&gt;

&lt;p&gt;Each API call (&lt;code&gt;fromSource, map, keyBy, window, apply, sinkTo&lt;/code&gt;) creates a &lt;code&gt;Transformation&lt;/code&gt; object and appends it to a list inside the &lt;code&gt;StreamExecutionEnvironment&lt;/code&gt;. Each &lt;code&gt;Transformation&lt;/code&gt; holds a reference to its input, its output type, its parallelism, and the operator logic.&lt;/p&gt;
&lt;p&gt;Because each one points back to its input(s), they implicitly form a DAG.&lt;/p&gt;

&lt;p&gt;&lt;img class=&quot;center-image-0 center-image-90&quot; src=&quot;./assets/posts/flink/flink-transformations.svg&quot; /&gt;&lt;/p&gt;
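&lt;p&gt;The list-plus-back-reference structure can be sketched in a few lines (stdlib only, with hypothetical names; this is an illustration of the bookkeeping, not Flink&apos;s actual classes): each appended transformation points back to its input, so the flat list doubles as a DAG.&lt;/p&gt;

```java
import java.util.ArrayList;
import java.util.List;

// Minimal sketch of how API calls accumulate Transformation objects.
public class TransformationSketch {
    static class Transformation {
        final String name;
        final Transformation input; // null for sources; back-reference forms the DAG
        final int parallelism;
        Transformation(String name, Transformation input, int parallelism) {
            this.name = name;
            this.input = input;
            this.parallelism = parallelism;
        }
    }

    // Mirrors the list held inside StreamExecutionEnvironment.
    final List<Transformation> transformations = new ArrayList<>();

    Transformation add(String name, Transformation input, int parallelism) {
        Transformation t = new Transformation(name, input, parallelism);
        transformations.add(t);
        return t;
    }

    public static void main(String[] args) {
        TransformationSketch env = new TransformationSketch();
        Transformation source = env.add("Source", null, 2);
        Transformation map    = env.add("Map", source, 2);
        Transformation window = env.add("Window", map, 2);
        Transformation sink   = env.add("Sink", window, 1);
        // Walking the input references from the sink recovers the whole chain.
        for (Transformation t = sink; t != null; t = t.input) {
            System.out.println(t.name);
        }
    }
}
```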

&lt;details class=&quot;text-container&quot;&gt;&lt;summary class=&quot;p&quot;&gt; &amp;nbsp;Relevant Packages and Classes&lt;/summary&gt;
&lt;p&gt;In &lt;code&gt;streaming/api/transformations/&lt;/code&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;Transformation, 
OneInputTransformation, 
SourceTransformation, 
PartitionTransformation, 
SinkTransformation
&lt;/code&gt;&lt;/pre&gt;
&lt;/details&gt;

&lt;h3&gt;2.2. Logical Topology&lt;/h3&gt;

&lt;p&gt;When &lt;code&gt;env.execute()&lt;/code&gt; fires, &lt;code&gt;StreamGraphGenerator&lt;/code&gt; walks the &lt;code&gt;Transformation&lt;/code&gt; list and produces a &lt;code&gt;StreamGraph&lt;/code&gt;, a DAG of &lt;code&gt;StreamNode&lt;/code&gt;s connected by &lt;code&gt;StreamEdge&lt;/code&gt;s.&lt;/p&gt;

&lt;p&gt;Each physical Transformation (Source, Map, Window/Apply, Sink) becomes a &lt;code&gt;StreamNode&lt;/code&gt;. Each &lt;code&gt;StreamNode&lt;/code&gt; holds its operator factory, parallelism, and serializers. Connections between nodes become &lt;code&gt;StreamEdges&lt;/code&gt;, each carrying a &lt;code&gt;StreamPartitioner&lt;/code&gt; that defines how data flows between operators.&lt;/p&gt;

&lt;p&gt;Non-physical Transformations like &lt;code&gt;PartitionTransformation&lt;/code&gt; (created by &lt;code&gt;keyBy&lt;/code&gt;) don&apos;t produce their own node. Instead, they attach partitioning information to the downstream edge. These are handled as virtual nodes during generation.&lt;/p&gt;

&lt;p&gt;&lt;img class=&quot;center-image-0 center-image-70&quot; src=&quot;./assets/posts/flink/flink-logical-topology.svg&quot; /&gt;&lt;/p&gt;

&lt;p&gt;The resulting &lt;code&gt;StreamGraph&lt;/code&gt; is a direct representation of the job logic. No optimization has happened yet.&lt;/p&gt;
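&lt;p&gt;The virtual-node handling can be sketched as follows (a simplified, hypothetical model; the strings &quot;FORWARD&quot; and &quot;HASH&quot; stand in for &lt;code&gt;ForwardPartitioner&lt;/code&gt; and &lt;code&gt;KeyGroupStreamPartitioner&lt;/code&gt;): physical transformations become nodes, while a partition transformation produces no node of its own and only stamps a partitioner onto the next downstream edge.&lt;/p&gt;

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of StreamGraph generation with virtual partition nodes.
public class StreamGraphSketch {
    static class Edge {
        final String from, to, partitioner;
        Edge(String from, String to, String partitioner) {
            this.from = from; this.to = to; this.partitioner = partitioner;
        }
    }

    final List<String> nodes = new ArrayList<>();
    final List<Edge> edges = new ArrayList<>();
    // Partitioner left behind by a virtual (partition) transformation.
    private String pendingPartitioner = "FORWARD";

    void addNode(String name, String input) {
        nodes.add(name); // physical transformation -> StreamNode
        if (input != null) {
            edges.add(new Edge(input, name, pendingPartitioner));
        }
        pendingPartitioner = "FORWARD"; // reset once an edge consumes it
    }

    void addPartition(String partitioner) {
        // Virtual node: no StreamNode, just remember the partitioner
        // so the next downstream edge carries it.
        pendingPartitioner = partitioner;
    }
}
```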

&lt;details class=&quot;text-container&quot;&gt;&lt;summary class=&quot;p&quot;&gt; &amp;nbsp;Relevant Packages and Classes&lt;/summary&gt;
&lt;p&gt;In &lt;code&gt;streaming/api/graph/&lt;/code&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;StreamGraphGenerator, 
StreamGraph, 
StreamNode, 
StreamEdge
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In &lt;code&gt;streaming/runtime/partitioner/&lt;/code&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;StreamPartitioner, 
ForwardPartitioner, 
KeyGroupStreamPartitioner, 
etc.
&lt;/code&gt;&lt;/pre&gt;
&lt;/details&gt;

&lt;h3&gt;2.3. Operator Chaining&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;StreamGraph&lt;/code&gt; is compiled into a &lt;code&gt;JobGraph&lt;/code&gt; by &lt;code&gt;StreamingJobGraphGenerator&lt;/code&gt;. The key optimization here is operator chaining: operators that meet certain conditions are fused into a single &lt;code&gt;JobVertex&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;img class=&quot;center-image-0 center-image-70&quot; src=&quot;./assets/posts/flink/flink-operator-chaining.svg&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Source and Map chain together (same parallelism, forward edge). The &lt;code&gt;keyBy&lt;/code&gt; between Map and Window introduces a hash partitioner, a shuffle boundary, so those two cannot chain. Window and Sink also cannot chain because their parallelism differs (2 vs 1). That gives three JobVertices.&lt;/p&gt;

&lt;p&gt;4 operators → 3 JobVertices. Chaining reduces the number of network exchanges and avoids unnecessary serialization within a chain.&lt;/p&gt;
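&lt;p&gt;The core of the chaining decision can be sketched as a predicate (simplified; the real &lt;code&gt;StreamingJobGraphGenerator&lt;/code&gt; checks more conditions, such as chaining strategy and slot sharing groups): two operators fuse only if the connecting edge is a forward edge and both sides run at the same parallelism.&lt;/p&gt;

```java
// Simplified sketch of the operator-chaining predicate.
public class ChainingSketch {
    static boolean canChain(String partitioner, int upstreamParallelism, int downstreamParallelism) {
        // Chain only across forward edges with matching parallelism.
        return partitioner.equals("FORWARD") && upstreamParallelism == downstreamParallelism;
    }

    public static void main(String[] args) {
        System.out.println(canChain("FORWARD", 2, 2)); // Source -> Map: chains
        System.out.println(canChain("HASH", 2, 2));    // Map -> Window: shuffle boundary
        System.out.println(canChain("FORWARD", 2, 1)); // Window -> Sink: parallelism differs
    }
}
```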

&lt;details class=&quot;text-container&quot;&gt;&lt;summary class=&quot;p&quot;&gt; &amp;nbsp;Relevant Packages and Classes&lt;/summary&gt;
&lt;p&gt;In &lt;code&gt;streaming/api/graph/&lt;/code&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;StreamingJobGraphGenerator 
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In &lt;code&gt;runtime/jobgraph/&lt;/code&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;JobGraph, 
JobVertex, 
JobEdge
&lt;/code&gt;&lt;/pre&gt;
&lt;/details&gt;

&lt;h3&gt;2.4. Physical Topology&lt;/h3&gt;

&lt;p&gt;The physical topology describes how the job actually runs: in parallel, distributed across machines.&lt;/p&gt;

&lt;p&gt;Each operator runs at some parallelism, the number of parallel instances (subtasks) that execute it. At parallelism N, the operator&apos;s data stream is divided into N stream partitions.&lt;/p&gt;

&lt;p&gt;Using the same example:&lt;/p&gt;
&lt;p&gt;&lt;img class=&quot;center-image-0 center-image-90&quot; src=&quot;./assets/posts/flink/flink-physical-topology.svg&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Each subtask produces a stream partition, an independent slice of the data.
Between operators, data either flows forward or gets redistributed:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Forward: data stays local, 1:1 from upstream partition to downstream partition. No serialization, no network. [Source → Map] uses this because both run at the same parallelism and no repartitioning is needed.&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;Redistribution (shuffle): data crosses the network. Every upstream partition can send to every downstream partition. Records get serialized, sent over TCP, deserialized. &lt;code&gt;keyBy&lt;/code&gt; triggers this, records are hashed by key so that all records for a given key land on the same downstream subtask. [Map → Window] in the diagram above is a hash shuffle.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Where these shuffle boundaries land is one of the most important performance factors in a Flink job. Forward connections are cheap. Shuffles are expensive.&lt;/p&gt;
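&lt;p&gt;The routing behind &lt;code&gt;keyBy&lt;/code&gt; can be sketched in two steps (the real &lt;code&gt;KeyGroupStreamPartitioner&lt;/code&gt; applies a murmur hash; plain &lt;code&gt;hashCode&lt;/code&gt; is used here purely for illustration): the key is hashed into one of &lt;code&gt;maxParallelism&lt;/code&gt; key groups, and each key group maps to exactly one downstream subtask.&lt;/p&gt;

```java
// Sketch of keyed routing: key -> key group -> target subtask.
public class KeyRoutingSketch {
    static int keyGroup(Object key, int maxParallelism) {
        // Flink murmur-hashes key.hashCode(); a bare hashCode suffices here.
        return Math.floorMod(key.hashCode(), maxParallelism);
    }

    static int targetSubtask(Object key, int maxParallelism, int parallelism) {
        // Same scaling Flink uses: keyGroup * parallelism / maxParallelism.
        return keyGroup(key, maxParallelism) * parallelism / maxParallelism;
    }
}
```

&lt;p&gt;Because the mapping depends only on the key and the (fixed) maximum parallelism, every record for a given key deterministically lands on the same downstream subtask.&lt;/p&gt;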

&lt;details class=&quot;text-container&quot;&gt;&lt;summary class=&quot;p&quot;&gt; &amp;nbsp;Relevant Packages and Classes&lt;/summary&gt;
&lt;p&gt;In &lt;code&gt;streaming/runtime/partitioner/&lt;/code&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;StreamPartitioner, 
ForwardPartitioner, 
KeyGroupStreamPartitioner, 
RebalancePartitioner, 
BroadcastPartitioner
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In &lt;code&gt;streaming/api/graph/&lt;/code&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;StreamEdge
&lt;/code&gt;&lt;/pre&gt;
&lt;/details&gt;

&lt;h3&gt;2.5. Execution Plan&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;JobGraph&lt;/code&gt; is submitted to the &lt;code&gt;JobManager&lt;/code&gt;. The &lt;code&gt;JobMaster&lt;/code&gt; takes each &lt;code&gt;JobVertex&lt;/code&gt; and expands it by parallelism to produce the &lt;code&gt;ExecutionGraph&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;img class=&quot;center-image-0 center-image-70&quot; src=&quot;./assets/posts/flink/flink-execution-graph.svg&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Each &lt;code&gt;JobVertex&lt;/code&gt; becomes an &lt;code&gt;ExecutionJobVertex&lt;/code&gt;. Each parallel instance becomes an &lt;code&gt;ExecutionVertex&lt;/code&gt;. Each &lt;code&gt;ExecutionVertex&lt;/code&gt; tracks its current Execution attempt. If a subtask fails and needs to restart, a new Execution is created for the same &lt;code&gt;ExecutionVertex&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;ExecutionGraph&lt;/code&gt; is the structure the &lt;code&gt;JobMaster&lt;/code&gt; uses for scheduling, tracking task state, coordinating checkpoints, and handling failures.&lt;/p&gt;

&lt;p&gt;Each &lt;code&gt;ExecutionVertex&lt;/code&gt; is deployed to a &lt;code&gt;TaskManager&lt;/code&gt; as a Task. A Task is the actual runtime entity: a dedicated thread that runs the &lt;code&gt;OperatorChain&lt;/code&gt;, reads from InputGates, processes records through the chained operators, and writes to &lt;code&gt;ResultPartition&lt;/code&gt;(s).&lt;/p&gt;
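&lt;p&gt;The JobGraph-to-ExecutionGraph expansion is essentially a fan-out by parallelism, sketched here with illustrative names (not Flink&apos;s actual classes):&lt;/p&gt;

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: each JobVertex expands into one ExecutionVertex per subtask.
public class ExecutionGraphSketch {
    static List<String> expand(String jobVertexName, int parallelism) {
        List<String> subtasks = new ArrayList<>();
        for (int i = 0; i < parallelism; i++) {
            // Subtasks are conventionally labeled (i/parallelism).
            subtasks.add(jobVertexName + " (" + (i + 1) + "/" + parallelism + ")");
        }
        return subtasks;
    }

    public static void main(String[] args) {
        // The three JobVertices of the example job.
        System.out.println(expand("Source -> Map", 2));
        System.out.println(expand("Window", 2));
        System.out.println(expand("Sink", 1));
    }
}
```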

&lt;details class=&quot;text-container&quot;&gt;&lt;summary class=&quot;p&quot;&gt; &amp;nbsp;Relevant Packages and Classes&lt;/summary&gt;
&lt;p&gt;In &lt;code&gt;runtime/executiongraph/&lt;/code&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;DefaultExecutionGraph, 
ExecutionJobVertex, 
ExecutionVertex, 
Execution
&lt;/code&gt;&lt;/pre&gt;
&lt;/details&gt;

&lt;h3&gt;3. State&lt;/h3&gt;

&lt;p&gt;Operators can be stateless or stateful. In the example above, the &lt;code&gt;map&lt;/code&gt; transforms each record and holds no state. The window operation, on the other hand, collects records until a trigger fires, and therefore uses state.&lt;/p&gt;

&lt;p&gt;Flink state is fault tolerant (through checkpoints) and rescalable (by redistributing it when parallelism changes). Without this, every operator would have to manage its own storage and recovery.&lt;/p&gt;

&lt;h3&gt;3.1. State Backend&lt;/h3&gt;

&lt;p&gt;Going back to the example, &lt;code&gt;keyBy(...).window(TumblingEventTimeWindows.of(Time.seconds(10)))&lt;/code&gt;, the window operator collecting events for 10 seconds needs to store that data somewhere until the window fires. Each parallel subtask of a stateful operator maintains its own local state storage. This storage is embedded within the TaskManager process, so state access is fast and does not require any network calls.&lt;/p&gt;

&lt;p&gt;&lt;img class=&quot;center-image-0 center-image-60&quot; src=&quot;./assets/posts/flink/flink-window-state.svg&quot; /&gt;&lt;/p&gt;

&lt;p&gt;The storage engine behind this is called the State Backend. Flink provides two production-ready options:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;HashMapStateBackend&lt;/code&gt;: State lives as Java objects on the JVM heap. Fast access since there is no serialization overhead, but limited by available memory.&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;&lt;code&gt;EmbeddedRocksDBStateBackend&lt;/code&gt;: State is serialized and stored in an embedded RocksDB instance on local disk. Slower per access (every read/write goes through serialization), but can hold state much larger than memory, bounded only by disk space.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The tradeoff is speed vs. capacity. For small to moderate state, heap is faster. For large state (GBs to TBs), RocksDB is the only viable option.&lt;/p&gt;

&lt;p&gt;Because each subtask has its own local state backend instance, state scales naturally with parallelism. Two parallel subtasks of the window operator means two independent state stores, each holding only the data for its own subset of keys.&lt;/p&gt;

&lt;p&gt;There is also a third option (gaining popularity), &lt;code&gt;ForStStateBackend&lt;/code&gt;, built on &lt;code&gt;ForSt&lt;/code&gt; (a fork of RocksDB). It stores SST files on remote storage (S3, HDFS) rather than local disk, using local disk only as a cache, allowing state to exceed local disk capacity entirely. It is designed for disaggregated, cloud-native setups and supports asynchronous state access.&lt;/p&gt;

&lt;p&gt;Note: &lt;code&gt;ForStStateBackend&lt;/code&gt; does not support canonical savepoints, full snapshots, changelog, or file-merging checkpoints.&lt;/p&gt;

&lt;details class=&quot;text-container&quot;&gt;&lt;summary class=&quot;p&quot;&gt; &amp;nbsp;Relevant Packages and Classes&lt;/summary&gt;
&lt;p&gt;In &lt;code&gt;flink-runtime/&lt;/code&gt;, &lt;code&gt;flink-statebackend-rocksdb/&lt;/code&gt;, &lt;code&gt;flink-statebackend-forst/&lt;/code&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;StateBackend,
HashMapStateBackend,
EmbeddedRocksDBStateBackend,
ForStStateBackend
&lt;/code&gt;&lt;/pre&gt;
&lt;/details&gt;

&lt;h3&gt;3.2. State Primitives&lt;/h3&gt;

&lt;p&gt;The state backends described above are the storage engines. What gets stored in them broadly falls into two categories.&lt;/p&gt;

&lt;h3&gt;3.2.1. Keyed State&lt;/h3&gt;
&lt;p&gt;Keyed State is partitioned by key. In the example job, the &lt;code&gt;keyBy(...)&lt;/code&gt; before the window means each window subtask only processes events for its assigned keys. The window operator internally uses keyed state to buffer incoming events until the window fires. In the example job, that buffer is a &lt;code&gt;ListState&lt;/code&gt; scoped to each key, stored in whichever state backend is configured.&lt;/p&gt;

&lt;p&gt;&lt;img class=&quot;center-image-0 center-image-90&quot; src=&quot;./assets/posts/flink/flink-state.svg&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Beyond the internal use by windows, Flink exposes keyed state primitives for custom operators:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;ValueState&amp;lt;T&amp;gt;&lt;/code&gt;: A single value per key.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;ListState&amp;lt;T&amp;gt;&lt;/code&gt;: A list of values per key.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;MapState&amp;lt;K, V&amp;gt;&lt;/code&gt;: A key-value map per key.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;ReducingState&amp;lt;T&amp;gt;&lt;/code&gt; / &lt;code&gt;AggregatingState&amp;lt;IN, OUT&amp;gt;&lt;/code&gt;: Applies a reduce or aggregate on each addition, storing only the accumulated result.&lt;/li&gt;
&lt;/ul&gt;
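&lt;p&gt;What &quot;scoped to each key&quot; means can be modeled with a plain map (stdlib only; this is an illustrative model, not Flink&apos;s state API): the runtime sets the current key before each record, and every state read or write is silently namespaced by it, so user code never passes the key explicitly.&lt;/p&gt;

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative model of ValueState-style keyed scoping.
public class KeyedStateSketch<K, V> {
    private final Map<K, V> backing = new HashMap<>(); // per-key storage
    private K currentKey;

    // In Flink the runtime sets this before invoking the operator on a record.
    void setCurrentKey(K key) { currentKey = key; }

    // ValueState-like accessors: no key argument, the scope is implicit.
    V value() { return backing.get(currentKey); }
    void update(V v) { backing.put(currentKey, v); }
}
```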

&lt;h3&gt;3.2.2. Operator State&lt;/h3&gt;
&lt;p&gt;Operator State is per subtask, not tied to keys. Each parallel instance holds its own independent state. The typical use case is a source connector tracking partition assignments and offsets.&lt;/p&gt;

&lt;p&gt;Both categories are managed by Flink: included in checkpoints, restored on failure, redistributed on rescale. Keyed state is redistributed through Key Groups, the atomic unit of state redistribution. The total number of Key Groups is fixed at the configured maximum parallelism. Each subtask is assigned a range of Key Groups, and when parallelism changes, those ranges are simply reassigned across the new set of subtasks.&lt;/p&gt;

&lt;h3&gt;3.3. Snapshots and Checkpointing&lt;/h3&gt;

&lt;p&gt;State stored locally in each subtask solves the access problem, but not the durability problem. If a &lt;code&gt;TaskManager&lt;/code&gt; crashes, that local state is gone. Flink needs a way to periodically capture a consistent snapshot of the entire job&apos;s state so it can recover from failures.&lt;/p&gt;

&lt;p&gt;This mechanism is called checkpointing, and it is based on the &lt;code&gt;Chandy-Lamport&lt;/code&gt; algorithm for distributed snapshots, adapted for Flink&apos;s dataflow model.&lt;/p&gt;

&lt;h3&gt;3.3.1. Checkpoint Barriers&lt;/h3&gt;

&lt;p&gt;The process works as follows:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;The &lt;code&gt;CheckpointCoordinator&lt;/code&gt; (running inside the &lt;code&gt;JobManager&lt;/code&gt;) periodically initiates a checkpoint by sending a trigger to all source operators.&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;Each source records its current position (e.g., Kafka partition offsets) and injects a special marker called a checkpoint barrier into the data stream. The barrier is not a separate signal; it flows with the records, in order, through the DAG.&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;When an operator receives a barrier, it snapshots its local state and forwards the barrier downstream. The state snapshot is written to durable storage (typically a distributed file system like HDFS or S3).&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;When all sinks have received the barrier and acknowledged it back to the &lt;code&gt;CheckpointCoordinator&lt;/code&gt;, the checkpoint is considered complete.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
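&lt;p&gt;The essential property, that a snapshot reflects exactly the records before the barrier, can be shown with a toy single-channel simulation (the string &quot;BARRIER&quot; stands in for Flink&apos;s &lt;code&gt;CheckpointBarrier&lt;/code&gt; flowing inline with the records):&lt;/p&gt;

```java
import java.util.List;

// Toy simulation: an operator counts records and snapshots on the barrier.
public class BarrierSketch {
    static int snapshotAtBarrier(List<String> stream) {
        int state = 0;
        for (String element : stream) {
            if (element.equals("BARRIER")) {
                // Snapshot reflects exactly the pre-barrier records;
                // everything after the barrier belongs to the next epoch.
                return state;
            }
            state++; // process a record
        }
        return state;
    }
}
```

&lt;p&gt;Because the barrier travels in order with the data, the snapshot boundary is unambiguous: no pre-barrier record is missed and no post-barrier record leaks in.&lt;/p&gt;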

&lt;div class=&quot;slider&quot; id=&quot;slider2&quot;&gt;
  &lt;div class=&quot;slides center-image-0 center-image-70&quot;&gt;
    &lt;img src=&quot;./assets/posts/flink/flink-checkpoint-1.svg&quot; class=&quot;slide&quot; /&gt;
    &lt;img src=&quot;./assets/posts/flink/flink-checkpoint-2.svg&quot; class=&quot;slide&quot; /&gt;
    &lt;img src=&quot;./assets/posts/flink/flink-checkpoint-3.svg&quot; class=&quot;slide&quot; /&gt;
    &lt;img src=&quot;./assets/posts/flink/flink-checkpoint-4.svg&quot; class=&quot;slide&quot; /&gt;
    &lt;img src=&quot;./assets/posts/flink/flink-checkpoint-5.svg&quot; class=&quot;slide&quot; /&gt;
  &lt;/div&gt;
  &lt;div class=&quot;controls&quot;&gt;
    &lt;button onclick=&quot;plusSlides(-1, &apos;slider2&apos;)&quot; class=&quot;prev black-button&quot;&gt;Prev&lt;/button&gt;
    &lt;button onclick=&quot;playSlides(&apos;slider2&apos;)&quot; class=&quot;play black-button&quot;&gt;Play&lt;/button&gt;
    &lt;button onclick=&quot;stopSlides(&apos;slider2&apos;)&quot; class=&quot;stop black-button&quot; hidden=&quot;&quot;&gt;Stop&lt;/button&gt;
    &lt;button onclick=&quot;plusSlides(1, &apos;slider2&apos;)&quot; class=&quot;next black-button&quot;&gt;Next&lt;/button&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;p&gt;The result is a consistent global snapshot: source offsets plus the state of every operator, all corresponding to the same logical point in the stream. No records are lost, no records are counted twice.&lt;/p&gt;

&lt;p&gt;A key detail: barriers never overtake records. They flow strictly in line. This is what ensures the snapshot captures exactly the state that results from processing all records before the barrier and none of the records after it.&lt;/p&gt;

&lt;details class=&quot;text-container&quot;&gt;&lt;summary class=&quot;p&quot;&gt; &amp;nbsp;Relevant Packages and Classes&lt;/summary&gt;
&lt;p&gt;In &lt;code&gt;runtime/checkpoint/&lt;/code&gt;, &lt;code&gt;runtime/io/network/api/&lt;/code&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;CheckpointCoordinator,
CheckpointBarrier
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;In &lt;code&gt;streaming/api/checkpoint/&lt;/code&gt;, &lt;code&gt;streaming/runtime/tasks/&lt;/code&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;CheckpointedFunction,
SubtaskCheckpointCoordinator
&lt;/code&gt;&lt;/pre&gt;
&lt;/details&gt;

&lt;h3&gt;3.3.2. Aligned Checkpoint&lt;/h3&gt;

&lt;p&gt;For operators with multiple inputs (like after a shuffle), the barrier must arrive from all input channels before the snapshot is taken. This is called barrier alignment, and it ensures that no pre-checkpoint and post-checkpoint data gets mixed. This alignment can briefly pause processing on the faster channels, which is a tradeoff explored further in unaligned checkpoints.&lt;/p&gt;

&lt;p&gt;&lt;img class=&quot;center-image-0 center-image-75&quot; src=&quot;./assets/posts/flink/flink-window-barrier.svg&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Aligned checkpointing (the default) guarantees a clean cut: the snapshot contains exactly the state resulting from all records before the barrier and none after. The cost is that the pausing can cause backpressure. If one channel is significantly faster than another, the fast channel&apos;s data backs up, stalling upstream operators.&lt;/p&gt;
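&lt;p&gt;A toy two-input simulation of alignment (again with &quot;BARRIER&quot; as a stand-in marker; real alignment buffers the blocked channel&apos;s records rather than discarding them): once a channel delivers its barrier, its remaining records are held back, and the snapshot is taken only after the other channel&apos;s barrier arrives.&lt;/p&gt;

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of barrier alignment on a two-input operator.
public class AlignmentSketch {
    // Returns the records processed before the snapshot is taken.
    static List<String> processedBeforeSnapshot(List<String> channel1, List<String> channel2) {
        List<String> processed = new ArrayList<>();
        consumeUntilBarrier(channel1, processed);
        consumeUntilBarrier(channel2, processed);
        return processed; // snapshot happens once both barriers have arrived
    }

    private static void consumeUntilBarrier(List<String> channel, List<String> processed) {
        for (String e : channel) {
            if (e.equals("BARRIER")) return; // block this channel; buffer the rest
            processed.add(e);
        }
    }
}
```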

&lt;details class=&quot;text-container&quot;&gt;&lt;summary class=&quot;p&quot;&gt; &amp;nbsp;Relevant Packages and Classes&lt;/summary&gt;
&lt;p&gt;In &lt;code&gt;streaming/runtime/io/checkpointing/&lt;/code&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;SingleCheckpointBarrierHandler,
AbstractAlignedBarrierHandlerState,
AlternatingCollectingBarriersUnaligned
&lt;/code&gt;&lt;/pre&gt;
&lt;/details&gt;

&lt;h3&gt;3.3.3. Unaligned Checkpoint&lt;/h3&gt;

&lt;p&gt;Instead of pausing, the operator reacts to the first barrier it sees from any channel. It immediately forwards the barrier downstream and continues processing all channels. The records that are already in the input/output buffers (in-flight data between the two barriers) are stored as part of the checkpoint state.&lt;/p&gt;

&lt;p&gt;&lt;img class=&quot;center-image-0 center-image-80&quot; src=&quot;./assets/posts/flink/flink-unaligned-checkpoint.svg&quot; /&gt;&lt;/p&gt;

&lt;p&gt;The result: checkpoint duration becomes independent of throughput and alignment time. Barriers travel through the DAG as fast as possible. The tradeoff is larger checkpoint sizes (in-flight data is included) and more I/O.&lt;/p&gt;

&lt;p&gt;Note, Unaligned checkpoints:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;require exactly-once mode, and only one concurrent checkpoint is allowed in unaligned mode, so checkpoints can take slightly longer end to end.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;break an implicit guarantee with respect to watermarks during recovery. On recovery, Flink generates watermarks after it restores in-flight data, which means pipelines that apply the latest watermark on each record may produce different results than with aligned checkpoints.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Flink also supports a hybrid approach. Checkpoints start aligned, but if alignment takes longer than a configured timeout (&lt;code&gt;execution.checkpointing.aligned-checkpoint-timeout&lt;/code&gt;), the operator switches to unaligned mid-checkpoint. This gets the benefits of aligned checkpoints under normal conditions while avoiding the stalling problem under backpressure.&lt;/p&gt;

&lt;h3&gt;3.3.4. Incremental Checkpoints&lt;/h3&gt;

&lt;p&gt;Full checkpoints upload the entire state every time. For an operator holding 10 GB of state where only 200 MB changed, uploading the full 10 GB is wasteful.&lt;/p&gt;

&lt;p&gt;Incremental checkpoints exploit how RocksDB stores data. Writes go into an in-memory MemTable. When full, it flushes to disk as an immutable SST file (Sorted String Table). A background compaction process merges smaller SST files into larger ones, discarding duplicates. The key property: SST files are never modified after creation, only created (by flush) or deleted (by compaction).&lt;/p&gt;

&lt;p&gt;Going back to the example job, the Window operator [2] buffers events in RocksDB until the 10-second window fires. With incremental checkpoints enabled and 2 retained checkpoints:&lt;/p&gt;

&lt;div style=&quot;width: 100%; overflow-x: auto;&quot;&gt;
&lt;img class=&quot;center-image-0 center-image-100&quot; style=&quot;width: 135%; max-width: none; display: block;&quot; src=&quot;./assets/posts/flink/flink-incremental-checkpoint.svg&quot; /&gt;
&lt;/div&gt;

&lt;p&gt;Flink tracks which SST files are new or deleted between checkpoints and only uploads the delta.&lt;/p&gt;

&lt;p&gt;The shared state registry tracks how many active checkpoints reference each file. When a checkpoint is pruned (retained count exceeded), Flink decrements the reference counts. Files that drop to 0 are deleted from storage.&lt;/p&gt;

&lt;p&gt;The result: instead of uploading the full state each time, only new SST files are uploaded. The tradeoff is that recovery may need to reconstruct state from multiple incremental deltas, potentially making restores slower than with full checkpoints.&lt;/p&gt;
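&lt;p&gt;The reference-counting logic can be sketched as follows (a simplified model of the shared state registry, not its actual implementation): each retained checkpoint registers the SST files it references, pruning a checkpoint decrements the counts, and files whose count reaches zero are safe to delete from checkpoint storage.&lt;/p&gt;

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of shared-state reference counting for incremental checkpoints.
public class SharedStateRegistrySketch {
    private final Map<String, Integer> refCounts = new HashMap<>();

    // A completed checkpoint registers every SST file it references.
    void register(List<String> sstFiles) {
        for (String f : sstFiles) refCounts.merge(f, 1, Integer::sum);
    }

    // Pruning a checkpoint decrements counts; returns files safe to delete.
    List<String> prune(List<String> sstFiles) {
        List<String> deletable = new ArrayList<>();
        for (String f : sstFiles) {
            int remaining = refCounts.merge(f, -1, Integer::sum);
            if (remaining == 0) {
                refCounts.remove(f);
                deletable.add(f); // no live checkpoint references this file
            }
        }
        return deletable;
    }
}
```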

&lt;h3&gt;3.3.5. Savepoint&lt;/h3&gt;

&lt;p&gt;Savepoints use the same mechanism as checkpoints (barriers, state snapshots, source offsets) but are triggered manually by the user, not by the periodic scheduler.&lt;/p&gt;

&lt;p&gt;The key differences:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Always aligned: unaligned mode does not apply to savepoints.&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;Do not expire: checkpoints are automatically cleaned up when newer ones complete; savepoints persist until explicitly deleted.&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;Triggered on demand: via the CLI (&lt;code&gt;flink savepoint &amp;lt;jobId&amp;gt;&lt;/code&gt;) or the REST API, not on a timer.&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;Portable format: savepoints can be created in canonical format, a standardized representation that is compatible across state backends. A job checkpointed with &lt;code&gt;HashMapStateBackend&lt;/code&gt; can be restored on &lt;code&gt;EmbeddedRocksDBStateBackend&lt;/code&gt; from a canonical savepoint. Native format (default and preferred) is faster to create and restore but is tied to the specific state backend and does not support cross-backend restoration.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Savepoints are used for planned operations: upgrading application code, changing parallelism, migrating to a different cluster, or switching state backends. The workflow is: take a savepoint, stop the job, make changes, restart from the savepoint.&lt;/p&gt;

&lt;p&gt;In the example job, if the parallelism of the Window operator needs to change from 2 to 4, a savepoint captures the current state (including Key Group assignments). On restart with the new parallelism, Flink redistributes the Key Groups across the 4 new subtasks and restores the state accordingly.&lt;/p&gt;

&lt;h3&gt;3.4. Recovery&lt;/h3&gt;

&lt;p&gt;When a failure occurs (TaskManager crash, network fault, user code exception, etc.), Flink stops the affected pipeline region (which, for a single-region streaming job like this example, means the entire job) and rolls back to the latest completed checkpoint.&lt;/p&gt;

&lt;p&gt;The recovery process:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;The &lt;code&gt;JobManager&lt;/code&gt; selects the most recent successfully completed checkpoint (all sinks acknowledged, all state stored durably).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;All operators are redeployed across available &lt;code&gt;TaskManagers&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Each operator&apos;s state is restored from the checkpoint storage (Remote File System, S3/HDFS). The window operator gets back its buffered events, aggregation operators get back their partial results.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Source operators rewind to the offsets recorded in the checkpoint. For Kafka, this means resetting the consumer to the checkpointed partition offsets.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Processing resumes from that point. Every record after the checkpoint offset is reprocessed, but since the state has been rolled back to match, the end result is as if the failure never happened.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;img class=&quot;center-image-0 center-image-75&quot; src=&quot;./assets/posts/flink/flink-state-restore.svg&quot; /&gt;&lt;/p&gt;

&lt;p&gt;This is what gives Flink &lt;code&gt;exactly-once&lt;/code&gt; processing semantics. Records between the checkpoint and the failure are reprocessed, but the state they are applied to has been rolled back to before those records were processed the first time. No double counting.&lt;/p&gt;

&lt;p&gt;However, the source must support replay (rewinding to a previous position). Sources like Kafka, Kinesis, and filesystems all support replay. If a source cannot rewind, exactly-once guarantees cannot be met.&lt;/p&gt;

&lt;h3&gt;4. Time&lt;/h3&gt;
&lt;p&gt;There are three notions of time:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Event Time (most common): The timestamp embedded in the event itself, representing when the event actually occurred. A sensor reading generated at &lt;code&gt;14:00:03&lt;/code&gt; carries that timestamp regardless of when Flink processes it.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Processing Time: The wall clock of the machine running the operator at the moment it processes the event. Simple and fast, but non-deterministic. The same data replayed at a different speed produces different results.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Ingestion Time (least common/discouraged): The timestamp assigned when the event enters Flink. More stable than processing time, but still does not reflect actual event occurrence.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;img class=&quot;center-image-0 center-image-90&quot; src=&quot;./assets/posts/flink/flink-times.svg&quot; /&gt;&lt;/p&gt;

&lt;p&gt;In the example job, &lt;code&gt;TumblingEventTimeWindows.of(Time.seconds(10))&lt;/code&gt; uses event time. The window boundaries are determined by the timestamps in the data, not by when the records happen to arrive. This makes the results deterministic and reproducible.&lt;/p&gt;
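&lt;p&gt;This determinism falls out of a simple fact: the tumbling window a record belongs to is a pure function of its event timestamp. The computation mirrors Flink&apos;s &lt;code&gt;TimeWindow.getWindowStartWithOffset&lt;/code&gt;, shown here with a zero offset for simplicity:&lt;/p&gt;

```java
// Window assignment for tumbling event-time windows (zero offset).
public class TumblingWindowSketch {
    static long windowStart(long timestampMillis, long windowSizeMillis) {
        // A record at t=14:00:03 lands in [14:00:00, 14:00:10) for a
        // 10-second window, regardless of when it arrives at Flink.
        return timestampMillis - Math.floorMod(timestampMillis, windowSizeMillis);
    }
}
```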

&lt;h3&gt;4.1. Disorder Problem&lt;/h3&gt;

&lt;p&gt;Processing time is always monotonically increasing, the wall clock only moves forward. Event time has no such guarantee. In distributed systems, events produced in order can arrive at Flink out of order due to network delays, partitioning, or upstream buffering.&lt;/p&gt;

&lt;p&gt;&lt;img class=&quot;center-image-0 center-image-80&quot; src=&quot;./assets/posts/flink/flink-disorder.svg&quot; /&gt;&lt;/p&gt;

&lt;p&gt;A window covering &lt;code&gt;t=1&lt;/code&gt; to &lt;code&gt;t=5&lt;/code&gt; cannot simply close when it sees &lt;code&gt;t=6&lt;/code&gt;, because &lt;code&gt;t=4&lt;/code&gt; or &lt;code&gt;t=5&lt;/code&gt; might still be in transit. The system needs a way to know when it is safe to fire the window.&lt;/p&gt;

&lt;h3&gt;4.2. Watermarks&lt;/h3&gt;

&lt;p&gt;Watermarks are Flink&apos;s solution to the disorder problem. A watermark is a special marker that flows through the data stream carrying a timestamp &lt;code&gt;t&lt;/code&gt;. It declares: &quot;no more events with a &lt;code&gt;timestamp ≤ t&lt;/code&gt; will arrive.&quot;&lt;/p&gt;

&lt;p&gt;&lt;img class=&quot;center-image-0 center-image-80&quot; src=&quot;./assets/posts/flink/flink-watermarks.svg&quot; /&gt;&lt;/p&gt;

&lt;p&gt;When the window operator receives a watermark that passes the window&apos;s end time, it knows the window is complete and fires it. Until that watermark arrives, the window holds its state.&lt;/p&gt;

&lt;p&gt;Watermarks flow inline with the data, just like checkpoint barriers. At operators with multiple inputs (after a shuffle), the effective watermark is the minimum across all input channels. The stream can only be as far along in event time as its slowest input.&lt;/p&gt;

&lt;p&gt;The gap between the actual event time and the watermark is called the bounded out-of-orderness. A larger gap tolerates more disorder but increases latency (windows fire later) and state lifetime (buffered data is held longer).&lt;/p&gt;
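&lt;p&gt;The mechanics above can be sketched in plain Java. This is a simplified model of bounded out-of-orderness watermarking, not Flink&apos;s actual classes; the 5-second bound and the sample timestamps are made up for illustration:&lt;/p&gt;

```java
// Simplified model of bounded out-of-orderness watermarking; the bound
// of 5 seconds and the sample timestamps are made up for illustration.
// The real logic lives in BoundedOutOfOrdernessWatermarks (flink-core).
public class WatermarkSketch {
    static final long BOUND_MS = 5_000; // tolerated disorder
    static long maxTs = Long.MIN_VALUE;

    // Called for every event: track the highest timestamp seen so far.
    static void onEvent(long eventTs) {
        maxTs = Math.max(maxTs, eventTs);
    }

    // Periodically emitted watermark: a promise that no event with a
    // timestamp at or below this value will arrive anymore.
    static long currentWatermark() {
        return maxTs - BOUND_MS - 1;
    }

    // At a multi-input operator the effective watermark is the minimum
    // across all input channels: the slowest input holds everyone back.
    static long combine(long[] channelWatermarks) {
        long min = Long.MAX_VALUE;
        for (long w : channelWatermarks) {
            min = Math.min(min, w);
        }
        return min;
    }

    public static void main(String[] args) {
        onEvent(10_000);
        onEvent(8_000); // late but within the bound
        onEvent(12_000);
        System.out.println(currentWatermark());                // 6999
        System.out.println(combine(new long[]{6_999, 4_200})); // 4200
    }
}
```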

&lt;h3&gt;4.3. Timers&lt;/h3&gt;

&lt;p&gt;Operators can register timers for a future point in event time or processing time. When the watermark (for event time) or the wall clock (for processing time) reaches the registered timestamp, the timer fires and triggers a callback.&lt;/p&gt;

&lt;p&gt;Windows use timers internally. When a new window is created, the window operator registers an event time timer for the window&apos;s end time. When the watermark passes that time, the timer fires and the window emits its result.&lt;/p&gt;

&lt;p&gt;Custom operators using &lt;code&gt;ProcessFunction&lt;/code&gt; can register their own timers for use cases like session timeouts, delayed cleanup of expired state, or triggering periodic aggregations.&lt;/p&gt;
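&lt;p&gt;The timer behavior can be modeled in a few lines. This is a minimal sketch loosely inspired by &lt;code&gt;InternalTimerService&lt;/code&gt;, not the real implementation, and the timestamps are invented:&lt;/p&gt;

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

// Minimal sketch of event-time timers, loosely modeled on Flink's
// InternalTimerService: a timer fires once the watermark reaches it.
// Raw collection types are used only to keep the sketch short.
public class TimerSketch {
    private final PriorityQueue timers = new PriorityQueue();

    void registerEventTimeTimer(long ts) {
        timers.add(ts);
    }

    // Called when a new watermark arrives; returns the timers that fire.
    List advanceWatermark(long watermark) {
        List fired = new ArrayList();
        while (true) {
            Long head = (Long) timers.peek();
            if (head == null) break;
            if (head > watermark) break; // not yet due
            fired.add(timers.poll());
        }
        return fired;
    }

    public static void main(String[] args) {
        TimerSketch svc = new TimerSketch();
        svc.registerEventTimeTimer(10_000); // e.g. a window ending at t=10s
        svc.registerEventTimeTimer(20_000);
        System.out.println(svc.advanceWatermark(9_999));  // []
        System.out.println(svc.advanceWatermark(10_000)); // [10000]
    }
}
```

&lt;p&gt;This is exactly why a window fires only after the watermark passes its end time: the window&apos;s end-time timer is simply not due until then.&lt;/p&gt;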

&lt;details class=&quot;text-container&quot;&gt;&lt;summary class=&quot;p&quot;&gt; &amp;nbsp;Relevant Packages and Classes&lt;/summary&gt;
&lt;p&gt;In &lt;code&gt;flink-core/api/common/eventtime/&lt;/code&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;WatermarkStrategy,
WatermarkGenerator
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;In &lt;code&gt;streaming/runtime/operators/&lt;/code&gt;, &lt;code&gt;streaming/api/operators/&lt;/code&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;TimestampsAndWatermarksOperator,
InternalTimerService
&lt;/code&gt;&lt;/pre&gt;
&lt;/details&gt;

&lt;h3&gt;5. Runtime&lt;/h3&gt;

&lt;p&gt;A running Flink cluster consists of two types of JVM processes: one &lt;code&gt;JobManager&lt;/code&gt; and one or more &lt;code&gt;TaskManagers&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;img class=&quot;center-image-0 center-image-45&quot; src=&quot;./assets/posts/flink/flink-runtime.svg&quot; /&gt;&lt;/p&gt;

&lt;h3&gt;5.1. Job Manager&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;JobManager&lt;/code&gt; is the control plane. It contains three RPC endpoints running in the same JVM: the Dispatcher, the ResourceManager, and one JobMaster per job. TaskManagers are the data plane: worker processes that execute tasks. Communication between them splits into two layers: &lt;code&gt;Pekko&lt;/code&gt; (formerly Akka) for control messages (scheduling, heartbeats, checkpoint triggers) and &lt;code&gt;Netty&lt;/code&gt; for actual data exchange between tasks.&lt;/p&gt;

&lt;h3&gt;5.1.1. Dispatcher&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;Dispatcher&lt;/code&gt; is the entry point for the cluster. It exposes the REST API, receives job submissions, and serves the Flink Web UI.&lt;/p&gt;

&lt;p&gt;When a job arrives, the Dispatcher persists it durably via the &lt;code&gt;ExecutionPlanWriter&lt;/code&gt;, then creates a &lt;code&gt;JobManagerRunner&lt;/code&gt; which starts a &lt;code&gt;JobMaster&lt;/code&gt; for that job. This persist-before-run design is what makes HA recovery possible: if the &lt;code&gt;JobManager&lt;/code&gt; crashes and a new leader takes over, the new Dispatcher recovers persisted jobs from storage and re-creates their JobMasters.&lt;/p&gt;

&lt;p&gt;In a session cluster, the Dispatcher lives for the lifetime of the cluster and handles multiple jobs. In application mode, it is scoped to a single application.&lt;/p&gt;

&lt;p&gt;The Dispatcher also participates in leader election. A &lt;code&gt;DispatcherLeaderProcess&lt;/code&gt; monitors whether this JobManager is the current leader. On gaining leadership, it reads recovered jobs from the &lt;code&gt;ExecutionPlanStore&lt;/code&gt; and recovered dirty job results from the &lt;code&gt;JobResultStore&lt;/code&gt;, then creates the actual Dispatcher instance with that recovery state.&lt;/p&gt;

&lt;h3&gt;5.1.2. Resource Manager&lt;/h3&gt;

&lt;p&gt;The ResourceManager owns the cluster&apos;s slot inventory. It maintains a registry of all TaskManagers and their slots, and a &lt;code&gt;SlotManager&lt;/code&gt; that matches slot requests from JobMasters against available slots.&lt;/p&gt;

&lt;p&gt;The flow:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;TaskManagers start up and register with the &lt;code&gt;ResourceManager&lt;/code&gt; via RPC, reporting how many slots they offer and each slot&apos;s &lt;code&gt;ResourceProfile&lt;/code&gt; (CPU, memory).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;When a &lt;code&gt;JobMaster&lt;/code&gt; needs slots, it declares resource requirements to the ResourceManager.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The &lt;code&gt;SlotManager&lt;/code&gt; checks if existing free slots can satisfy the request. If yes, it sends a &lt;code&gt;requestSlot&lt;/code&gt; RPC to the TaskManager, telling it to allocate that slot for the specific job.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;If not enough free slots exist and the ResourceManager is backed by an active resource provider (Kubernetes, YARN), it requests new TaskManagers from the provider. In standalone mode, it can only wait for TaskManagers to register on their own.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The &lt;code&gt;ResourceManager&lt;/code&gt; also monitors TaskManager health through heartbeats. If a TaskManager misses heartbeats, the ResourceManager declares it dead, removes its slots from the inventory, and notifies affected JobMasters.
Importantly, the ResourceManager knows nothing about job logic. It deals purely in slots: who has them, who needs them, and how to provision more.&lt;/p&gt;

&lt;p&gt;Slot Allocation Flow:&lt;/p&gt;
&lt;p&gt;&lt;img class=&quot;center-image-0 center-image-90&quot; src=&quot;./assets/posts/flink/flink-allocation-flow.svg&quot; /&gt;&lt;/p&gt;

&lt;h3&gt;5.1.3. Job Master&lt;/h3&gt;

&lt;p&gt;One &lt;code&gt;JobMaster&lt;/code&gt; per running job. This is where the actual job execution is managed. Internally it contains two critical components:&lt;/p&gt;

&lt;h3&gt;5.1.3a. Scheduler&lt;/h3&gt;
&lt;p&gt;The scheduler decides when and where to deploy tasks. There are multiple scheduler implementations, such as:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;DefaultScheduler&lt;/code&gt; with &lt;code&gt;PipelinedRegionSchedulingStrategy&lt;/code&gt; for streaming&lt;/li&gt;
&lt;li&gt;&lt;code&gt;AdaptiveBatchScheduler&lt;/code&gt; for batch workloads&lt;/li&gt;
&lt;li&gt;&lt;code&gt;AdaptiveScheduler&lt;/code&gt; for reactive scaling (adjusts parallelism based on available slots)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The scheduler works with the &lt;code&gt;SlotPool&lt;/code&gt;, which is the JobMaster&apos;s local view of allocated slots. The SlotPool uses a declarative resource model: it declares how many slots of what profile it needs, the &lt;code&gt;ResourceManager&lt;/code&gt; fulfills them, and TaskManagers offer the allocated slots back to the JobMaster. Once slots are available, the scheduler assigns &lt;code&gt;ExecutionVertex&lt;/code&gt; instances to them and triggers deployment.&lt;/p&gt;

&lt;p&gt;&lt;img class=&quot;center-image-0 center-image-90&quot; src=&quot;./assets/posts/flink/flink-jobmaster.svg&quot; /&gt;&lt;/p&gt;

&lt;p&gt;For a pure streaming job like &lt;code&gt;MyJob&lt;/code&gt;, the entire job is one pipelined region. On scheduling start, it finds all source regions and schedules them. Since everything is one region, all tasks launch at once.&lt;/p&gt;

&lt;p&gt;For batch jobs with blocking shuffle boundaries, each stage is a separate region. Source regions are scheduled first. Downstream regions are scheduled only when their upstream blocking partitions become consumable. This saves resources by not starting downstream tasks that have nothing to consume yet.&lt;/p&gt;

&lt;h3&gt;5.1.3b. Checkpoint Coordinator&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;CheckpointCoordinator&lt;/code&gt;: triggers checkpoint barriers, tracks acknowledgements from all tasks, manages completed checkpoint metadata, and decides when to discard old checkpoints. This is the component that drives the entire checkpointing flow described in the earlier State section.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;JobMaster&lt;/code&gt; also handles failure recovery. When a task fails, it consults a &lt;code&gt;FailoverStrategy&lt;/code&gt; (typically &lt;code&gt;RestartPipelinedRegionFailoverStrategy&lt;/code&gt;) to determine which tasks need to be restarted, cancels them, and redeploys from the last checkpoint.&lt;/p&gt;

&lt;h3&gt;5.1.4. Job Lifecycle&lt;/h3&gt;

&lt;p&gt;A job, once accepted by the &lt;code&gt;Dispatcher&lt;/code&gt;, moves through a state machine of &lt;code&gt;JobStatus&lt;/code&gt; values. The typical happy path is straightforward: &lt;code&gt;INITIALIZING → CREATED → RUNNING → FINISHED&lt;/code&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;INITIALIZING&lt;/code&gt;: The Dispatcher has received the job, but the JobMaster has not yet gained leadership or been fully created.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;CREATED&lt;/code&gt;: The JobMaster is ready. No tasks have been scheduled yet.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;RUNNING&lt;/code&gt;: At least some tasks are scheduled or executing. The job stays in this state until all tasks finish.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;FINISHED&lt;/code&gt;: All tasks completed successfully.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When a task fails during execution, the Scheduler evaluates whether the error is recoverable. If it is, the affected tasks are restarted. The job itself stays in &lt;code&gt;RUNNING&lt;/code&gt; while individual tasks are restarted at the region level.&lt;/p&gt;

&lt;p&gt;If the failure is unrecoverable (or restart attempts are exhausted), the job transitions through: &lt;code&gt;RUNNING → FAILING → FAILED&lt;/code&gt;. &lt;code&gt;FAILING&lt;/code&gt; cancels all remaining tasks. Once every task reaches a terminal state, the job moves to &lt;code&gt;FAILED&lt;/code&gt; and exits.&lt;/p&gt;

&lt;p&gt;When a user manually cancels a job (via the Web UI or CLI): &lt;code&gt;RUNNING → CANCELLING → CANCELED&lt;/code&gt;. &lt;code&gt;CANCELLING&lt;/code&gt; cancels all tasks. Once all tasks are in a terminal state, the job enters &lt;code&gt;CANCELED&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Suspension (HA only): &lt;code&gt;RUNNING → SUSPENDED&lt;/code&gt;. &lt;code&gt;SUSPENDED&lt;/code&gt; only occurs when high availability is configured and the JobMaster loses leadership. The job is not removed from the HA store; this particular JobMaster has simply stopped managing it. Another &lt;code&gt;JobMaster&lt;/code&gt; (or the same one after regaining leadership) will pick the job back up and restart it.&lt;/p&gt;
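&lt;p&gt;The transitions above can be sketched as a small state machine. This is a simplified model: the real &lt;code&gt;JobStatus&lt;/code&gt; enum in flink-core defines more states and its own transition rules, and only the paths discussed in this section are covered here:&lt;/p&gt;

```java
// Sketch of the job state machine described above. The real JobStatus
// enum in flink-core defines more states and its own transition rules;
// this covers only the paths discussed in this section.
public class JobStatusSketch {
    enum JobStatus {
        INITIALIZING, CREATED, RUNNING, FINISHED,
        FAILING, FAILED, CANCELLING, CANCELED, SUSPENDED
    }

    static boolean canTransition(JobStatus from, JobStatus to) {
        switch (from) {
            case INITIALIZING:
                return to == JobStatus.CREATED;
            case CREATED:
                return to == JobStatus.RUNNING;
            case RUNNING: // happy path, failure, cancellation, lost leadership
                return to == JobStatus.FINISHED || to == JobStatus.FAILING
                        || to == JobStatus.CANCELLING || to == JobStatus.SUSPENDED;
            case FAILING:
                return to == JobStatus.FAILED;
            case CANCELLING:
                return to == JobStatus.CANCELED;
            default:
                return false; // FINISHED, FAILED, CANCELED are terminal
        }
    }

    public static void main(String[] args) {
        System.out.println(canTransition(JobStatus.RUNNING, JobStatus.FAILING));  // true
        System.out.println(canTransition(JobStatus.FINISHED, JobStatus.RUNNING)); // false
    }
}
```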

&lt;img class=&quot;center-image-0 center-image-70&quot; src=&quot;./assets/posts/flink/flink-job-cycle.svg&quot; /&gt;

&lt;h3&gt;5.2. Task Manager&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;TaskManager&lt;/code&gt; is a JVM process that does the actual data processing. In Flink, this process is called &lt;code&gt;TaskExecutor&lt;/code&gt;. Each cluster has one or more TaskExecutors, and each one registers with the &lt;code&gt;ResourceManager&lt;/code&gt; on startup by sending a &lt;code&gt;SlotReport&lt;/code&gt; listing all available task slots.&lt;/p&gt;

&lt;h3&gt;5.2.1. Task Slots&lt;/h3&gt;

&lt;p&gt;A &lt;code&gt;TaskExecutor&lt;/code&gt; divides its resources into a fixed number of task slots. Each slot is a resource container with its own &lt;code&gt;MemoryManager&lt;/code&gt; and a defined &lt;code&gt;ResourceProfile&lt;/code&gt; (CPU, memory). The number of slots is configured via &lt;code&gt;taskmanager.numberOfTaskSlots&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;A slot has three states: &lt;code&gt;ALLOCATED&lt;/code&gt; (assigned to a job by the ResourceManager, not yet in use by the JobMaster), &lt;code&gt;ACTIVE&lt;/code&gt; (in use, tasks can be added), and &lt;code&gt;RELEASING&lt;/code&gt; (tasks have failed, waiting to be fully emptied before the slot is freed).&lt;/p&gt;

&lt;p&gt;The important detail: a slot can hold multiple tasks. The tasks map inside &lt;code&gt;TaskSlot&lt;/code&gt; is keyed by &lt;code&gt;ExecutionAttemptID&lt;/code&gt;, meaning multiple operator subtasks can share a single slot. This is where slot sharing comes in.&lt;/p&gt;

&lt;h3&gt;5.2.2. Task Slot Sharing&lt;/h3&gt;

&lt;p&gt;By default, Flink places all operators of a job into the same &lt;code&gt;SlotSharingGroup&lt;/code&gt;. This means one subtask from each operator in the pipeline can be co-located in a single slot. For the running &lt;code&gt;MyJob&lt;/code&gt; example:&lt;/p&gt;

&lt;img class=&quot;center-image-0 center-image-70&quot; src=&quot;./assets/posts/flink/flink-task-slot.svg&quot; /&gt;

&lt;p&gt;The design motivation is twofold. First, it means a job with N pipeline stages does not need &lt;code&gt;N × parallelism&lt;/code&gt; slots. The number of slots needed equals the maximum parallelism across all operators (here: 2). Second, co-locating a full pipeline slice in one slot enables forward connections to stay local (in-memory data exchange, no network serialization).&lt;/p&gt;
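&lt;p&gt;The first point reduces to a one-line computation, shown here as a sketch (assuming the default single &lt;code&gt;SlotSharingGroup&lt;/code&gt;):&lt;/p&gt;

```java
// With slot sharing (all operators in one SlotSharingGroup), the slots
// a job needs equal the maximum operator parallelism, not the sum.
public class SlotCountSketch {
    static int slotsNeeded(int[] operatorParallelisms) {
        int max = 0;
        for (int p : operatorParallelisms) {
            max = Math.max(max, p);
        }
        return max;
    }

    public static void main(String[] args) {
        // MyJob: SourceMap parallelism 2, Window 2, Sink 1
        System.out.println(slotsNeeded(new int[]{2, 2, 1})); // 2, not 5
    }
}
```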

&lt;h3&gt;5.2.3. Task Execution Model&lt;/h3&gt;

&lt;p&gt;Each task runs in a dedicated thread and typically follows a simple internal pipeline: &lt;code&gt;InputGate(s) → OperatorChain → ResultPartition(s)&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;The task reads records from its &lt;code&gt;InputGate&lt;/code&gt;, passes them through the &lt;code&gt;OperatorChain&lt;/code&gt; (the chained operators from the JobGraph), and writes output to its &lt;code&gt;ResultPartition&lt;/code&gt;. Source tasks are the exception: they generate data directly, with no &lt;code&gt;InputGate&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;A &lt;code&gt;ResultPartition&lt;/code&gt; is divided into &lt;code&gt;SubPartitions&lt;/code&gt;, one per downstream consumer subtask. An &lt;code&gt;InputGate&lt;/code&gt; is composed of &lt;code&gt;InputChannels&lt;/code&gt;, one per upstream producer subtask.&lt;/p&gt;

&lt;p&gt;The data exchange between ResultPartitions and InputGates goes through the &lt;code&gt;ShuffleEnvironment&lt;/code&gt;. The default implementation is &lt;code&gt;NettyShuffleEnvironment&lt;/code&gt;. If the producer and consumer are in the same &lt;code&gt;TaskManager&lt;/code&gt;, data can be exchanged locally without going over the network.&lt;/p&gt;

&lt;p&gt;For MyJob (Source+Map chained, parallelism 2 → Window parallelism 2 → Sink parallelism 1):&lt;/p&gt;

&lt;img class=&quot;center-image-0 center-image-50&quot; src=&quot;./assets/posts/flink/flink-task-execution-model.svg&quot; /&gt;

&lt;p&gt;Source and Map are chained, so they share a thread with no serialization between them. The &lt;code&gt;keyBy&lt;/code&gt; triggers an all-to-all shuffle: each SourceMap subtask&apos;s ResultPartition has 2 SubPartitions (one per Window subtask), and each Window subtask&apos;s InputGate has 2 InputChannels (one per SourceMap subtask).&lt;/p&gt; 

&lt;p&gt;Records are hashed by key and routed to the SubPartition responsible for that key group. Window to Sink has a parallelism change (2 → 1), so each Window subtask&apos;s ResultPartition has only 1 SubPartition (the single Sink), and the Sink&apos;s &lt;code&gt;InputGate&lt;/code&gt; has 2 InputChannels (one per Window subtask).&lt;/p&gt;
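&lt;p&gt;The routing can be sketched as follows. Note this is a simplification: Flink&apos;s real &lt;code&gt;KeyGroupRangeAssignment&lt;/code&gt; runs the key&apos;s hash through a murmur hash first, while plain &lt;code&gt;hashCode()&lt;/code&gt; is used here only to keep the sketch self-contained:&lt;/p&gt;

```java
// Simplified sketch of key-to-subtask routing. Flink's real
// KeyGroupRangeAssignment runs the key's hashCode through a murmur
// hash first; plain hashCode() is used here to stay self-contained.
public class KeyRoutingSketch {
    static int keyGroup(Object key, int maxParallelism) {
        return Math.abs(key.hashCode() % maxParallelism);
    }

    // Which downstream subtask (and hence SubPartition) owns the key group.
    static int subtaskFor(Object key, int maxParallelism, int parallelism) {
        return keyGroup(key, maxParallelism) * parallelism / maxParallelism;
    }

    public static void main(String[] args) {
        int maxParallelism = 128; // Flink's default for low-parallelism jobs
        int parallelism = 2;      // the two Window subtasks
        // The same key always lands on the same subtask.
        System.out.println(subtaskFor("user-42", maxParallelism, parallelism));
    }
}
```

&lt;p&gt;Because the subtask is derived from the key group rather than the key directly, state can later be redistributed at key-group granularity when the job is rescaled.&lt;/p&gt;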

&lt;h3&gt;5.2.4. Task Manager Services&lt;/h3&gt;

&lt;p&gt;In this post, &lt;code&gt;TaskManager&lt;/code&gt; and &lt;code&gt;TaskExecutor&lt;/code&gt; have been used interchangeably. To clarify: &lt;code&gt;TaskManager&lt;/code&gt; is the process (the JVM), and &lt;code&gt;TaskExecutor&lt;/code&gt; is the main class running inside that process. In practice they refer to the same thing, at different levels of abstraction.&lt;/p&gt;

&lt;p&gt;When a TaskManager process starts, it initializes a set of shared services before any task is deployed. These services live for the lifetime of the process and are shared across all tasks running in it. They fall into a few categories.&lt;/p&gt;

&lt;p&gt;&lt;b&gt;Slot Management&lt;/b&gt; is central. The &lt;code&gt;TaskSlotTable&lt;/code&gt; tracks which slots exist, which are free, and which tasks are running in each slot. The &lt;code&gt;JobTable&lt;/code&gt; maps each active JobID to its &lt;code&gt;JobMaster&lt;/code&gt; connection, so the &lt;code&gt;TaskManager&lt;/code&gt; knows which JobMaster to report to for each task. The &lt;code&gt;JobLeaderService&lt;/code&gt; monitors leadership changes for each job, so if a JobMaster fails over, the TaskManager reconnects to the new leader.&lt;/p&gt;

&lt;p&gt;&lt;b&gt;Network and Shuffle&lt;/b&gt; handles all data exchange. The &lt;code&gt;ShuffleEnvironment&lt;/code&gt; (default: Netty) owns the buffer pools, creates &lt;code&gt;ResultPartitions&lt;/code&gt; for task output and &lt;code&gt;InputGates&lt;/code&gt; for task input. This is where credit-based flow control and backpressure happen. The &lt;code&gt;TaskExecutorPartitionTracker&lt;/code&gt; keeps track of which result partitions this TaskManager has produced, so they can be released when no longer needed.&lt;/p&gt;

&lt;p&gt;&lt;b&gt;Memory&lt;/b&gt; is handled by the per-slot &lt;code&gt;MemoryManager&lt;/code&gt; (managed off-heap memory) and the &lt;code&gt;IOManager&lt;/code&gt; (disk spill). Within managed memory, &lt;code&gt;SharedResources&lt;/code&gt; enables reference-counted sharing of resources like RocksDB caches across operators in the same slot. State backends like RocksDB/ForSt and operators that sort or hash data use managed memory. The &lt;code&gt;IOManager&lt;/code&gt; provides temporary file channels for spilling when memory is exhausted.&lt;/p&gt;

&lt;img class=&quot;center-image-0 center-image-65&quot; src=&quot;./assets/posts/flink/flink-task-executor-services.svg&quot; /&gt;

&lt;p&gt;&lt;b&gt;State and Checkpointing&lt;/b&gt; services support fault tolerance. The &lt;code&gt;LocalStateStoresManager&lt;/code&gt; maintains local copies of state on disk for faster recovery (instead of always fetching from the distributed checkpoint store). The &lt;code&gt;FileMergingManager&lt;/code&gt; is a newer optimization that merges many small checkpoint files into fewer larger ones to reduce file system pressure. The &lt;code&gt;ChangelogStoragesManager&lt;/code&gt; supports the changelog state backend. The &lt;code&gt;ChannelStateExecutorFactory&lt;/code&gt; handles snapshotting in-flight network buffers for unaligned checkpoints.&lt;/p&gt;

&lt;p&gt;&lt;b&gt;Classloading and Artifacts&lt;/b&gt; manages user code isolation. The &lt;code&gt;LibraryCacheManager&lt;/code&gt; maintains per-job classloaders so that different jobs running on the same TaskManager do not interfere with each other. The &lt;code&gt;PermanentBlobService&lt;/code&gt; downloads JAR files from the central &lt;code&gt;BlobServer&lt;/code&gt; on the JobManager side. The &lt;code&gt;FileCache&lt;/code&gt; handles files registered through the distributed cache API.&lt;/p&gt;

&lt;p&gt;&lt;b&gt;Connectivity&lt;/b&gt; keeps the TaskManager linked to the cluster. Two heartbeat managers run continuously: one toward the &lt;code&gt;ResourceManager&lt;/code&gt; (reporting slot availability and resource usage) and one toward each JobMaster (reporting task status and metrics). If heartbeats stop, the other side assumes the TaskManager is dead and triggers failover. &lt;code&gt;HAServices&lt;/code&gt; handles leader discovery so the TaskManager always knows who the current ResourceManager leader is.&lt;/p&gt;

&lt;p&gt;When a task gets deployed into a slot, it receives references to these shared services. It does not create its own network stack. The &lt;code&gt;NetworkBufferPool&lt;/code&gt; is shared across all tasks in the TaskManager, though each task gets its own &lt;code&gt;LocalBufferPool&lt;/code&gt; drawn from it. Managed memory is scoped per slot: all tasks sharing a slot through slot sharing share the same &lt;code&gt;MemoryManager&lt;/code&gt;, but tasks in different slots have independent memory budgets. Heartbeat connections are shared across the entire TaskManager process.&lt;/p&gt;

&lt;h3&gt;5.2.5. Task Manager Memory&lt;/h3&gt;

&lt;p&gt;A &lt;code&gt;TaskManager&lt;/code&gt; is a single JVM process. Its total memory is carved into strictly defined regions at startup, each serving a different purpose. Unlike a typical Java application where the JVM manages one undifferentiated heap, Flink explicitly budgets every byte.&lt;/p&gt;

&lt;p&gt;The first distinction is between what Flink controls (Total Flink Memory) and what the JVM needs for itself (Metaspace and Overhead). Together they form Total Process Memory, which is the container or process limit. When deploying on YARN or Kubernetes, Flink uses Total Process Memory to calculate the container request size.&lt;/p&gt;

&lt;p&gt;Within Total Flink Memory, the heap is split into Framework and Task. Both live in the same JVM heap at runtime; Flink does not enforce isolation between them. The separation exists for budgeting: it ensures the framework always has enough headroom for coordination even when user code is memory intensive. Task Heap has no fixed default because it is the remainder after every other component is subtracted from Total Flink Memory.&lt;/p&gt;
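&lt;p&gt;The remainder arithmetic is simple enough to sketch. The sizes below are made-up illustrations, not Flink&apos;s defaults:&lt;/p&gt;

```java
// Task Heap is the remainder after every other component is subtracted
// from Total Flink Memory. The sizes below are made-up illustrations,
// not Flink's defaults; units are MB.
public class MemoryBudgetSketch {
    static long taskHeap(long totalFlink, long frameworkHeap, long frameworkOffHeap,
                         long taskOffHeap, long network, long managed) {
        return totalFlink - frameworkHeap - frameworkOffHeap
                - taskOffHeap - network - managed;
    }

    public static void main(String[] args) {
        // 1600 total - 128 framework heap - 128 framework off-heap
        // - 0 task off-heap - 160 network - 640 managed
        System.out.println(taskHeap(1600, 128, 128, 0, 160, 640)); // 544
    }
}
```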

&lt;img class=&quot;center-image-0 center-image-75&quot; src=&quot;./assets/posts/flink/flink-total-memory.svg&quot; /&gt;

&lt;p&gt;The off-heap region covers Framework Off-Heap, Task Off-Heap, and Network Memory. All three are counted toward &lt;code&gt;-XX:MaxDirectMemorySize&lt;/code&gt;. Network Memory is allocated as JVM direct memory (&lt;code&gt;ByteBuffer.allocateDirect()&lt;/code&gt;), used exclusively for the network buffer pool that moves data between tasks. Framework and Task Off-Heap budget for both JVM direct memory and native memory; Flink counts their full configured amount toward the JVM direct memory limit as a conservative measure.&lt;/p&gt;

&lt;p&gt;Managed Memory in practice is scoped per slot, not per task. Each slot gets its own &lt;code&gt;MemoryManager&lt;/code&gt; with a budget of total managed memory divided by the number of slots. All tasks sharing a slot (through slot sharing) share this budget. For the &lt;code&gt;MyJob&lt;/code&gt; example:&lt;/p&gt;

&lt;img class=&quot;center-image-0 center-image-75&quot; src=&quot;./assets/posts/flink/flink-task-memory-budget.svg&quot; /&gt;

&lt;p&gt;Managed Memory is different: it lives outside JVM direct memory entirely. For stateful operators using RocksDB, Flink reserves a budget and RocksDB allocates its own native memory through JNI, invisible to &lt;code&gt;-XX:MaxDirectMemorySize&lt;/code&gt;. This means Managed Memory and Network Memory never compete for the same JVM budget, and the state backend (RocksDB/ForSt) cannot accidentally starve the network layer.&lt;/p&gt;

&lt;p&gt;The tradeoff is that if Managed Memory is misconfigured and the process exceeds its container limit, the OS kills the process rather than the JVM throwing a catchable exception.&lt;/p&gt;


&lt;h3&gt;5.3. Network&lt;/h3&gt;

&lt;p&gt;Flink&apos;s network stack sits inside &lt;code&gt;flink-runtime&lt;/code&gt; and connects all subtasks across TaskManagers. It is the layer through which all shuffled data flows, making it a primary factor in both throughput and latency. Coordination between TaskManagers and the JobManager uses RPC (Pekko). Data transport between subtasks uses a lower-level API built on Netty.&lt;/p&gt;

&lt;h3&gt;5.3.1. Physical Transport&lt;/h3&gt;

&lt;p&gt;In the example job, &lt;code&gt;keyBy()&lt;/code&gt; introduces a network shuffle between &lt;code&gt;SourceMap&lt;/code&gt; and &lt;code&gt;Window&lt;/code&gt;. Records can no longer stay local to the subtask that produced them. Each record is hashed by its key and routed to whichever Window subtask is responsible for that key group. This is a full all-to-all connection: every SourceMap subtask must be able to send to every Window subtask.&lt;/p&gt;

&lt;p&gt;As covered in the Task Execution Model, slot sharing places each pipeline slice into a single slot. With a small twist for this section, the two slots sit on two different TaskManagers. This means some connections are &lt;code&gt;local&lt;/code&gt; (same TM) and some are &lt;code&gt;remote&lt;/code&gt; (cross-TM, over TCP via Netty).&lt;/p&gt;

&lt;p&gt;Whether a connection is local or remote depends entirely on where the subtasks land:&lt;/p&gt;
&lt;img class=&quot;center-image-0 center-image-75&quot; src=&quot;./assets/posts/flink/flink-example-recap.svg&quot; /&gt;

&lt;p&gt;Each remote connection gets its own &lt;code&gt;TCP&lt;/code&gt; channel. At higher parallelism, say parallelism 4 across two TaskManagers offering 2 slots each, multiple subtasks of the same task share a &lt;code&gt;TaskManager&lt;/code&gt;. Their remote connections toward the same destination TaskManager are then multiplexed over a single TCP channel, reducing resource usage.&lt;/p&gt;

&lt;img class=&quot;center-image-0 center-image-100&quot; src=&quot;./assets/posts/flink/flink-tcp-channel.svg&quot; /&gt;

&lt;p&gt;Each subtask&apos;s output is a &lt;code&gt;ResultPartition&lt;/code&gt;, split into &lt;code&gt;ResultSubpartitions&lt;/code&gt;, one per downstream consumer. In the example, each SourceMap subtask has a &lt;code&gt;ResultPartition&lt;/code&gt; with 4 ResultSubpartitions (one for each Window subtask). Each Window subtask has a ResultPartition with 1 ResultSubpartition (the single Sink subtask).&lt;/p&gt;

&lt;p&gt;On the receiving side, each subtask reads from an &lt;code&gt;InputGate&lt;/code&gt; containing &lt;code&gt;InputChannels&lt;/code&gt;, one per upstream producer. Each Window subtask&apos;s InputGate has 4 InputChannels (one from each SourceMap subtask). Sink&apos;s InputGate has 4 InputChannels (one from each Window subtask).&lt;/p&gt;

&lt;p&gt;At this layer, Flink no longer deals with individual records. Data is serialized and packed into network buffers. Each subtask has its own local buffer pool, one on the sending side and one on the receiving side, bounded by: &lt;code&gt;#channels × buffers-per-channel + floating-buffers-per-gate&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;With defaults of 2 exclusive buffers per channel and 8 floating buffers per gate, each Window subtask&apos;s receiving buffer pool is capped at &lt;code&gt;4 × 2 + 8 = 16&lt;/code&gt; buffers. These are drawn from the &lt;code&gt;NetworkBufferPool&lt;/code&gt; covered in the Memory Model section.&lt;/p&gt;
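&lt;p&gt;As a sketch, the cap from the formula above:&lt;/p&gt;

```java
// Receiving-side buffer cap from the formula above:
// channels x buffers-per-channel + floating-buffers-per-gate.
public class BufferBudgetSketch {
    static int maxBuffers(int channels, int exclusivePerChannel, int floatingPerGate) {
        return channels * exclusivePerChannel + floatingPerGate;
    }

    public static void main(String[] args) {
        // Each Window subtask: 4 input channels, defaults of 2 exclusive
        // buffers per channel and 8 floating buffers per gate.
        System.out.println(maxBuffers(4, 2, 8)); // 16
    }
}
```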

&lt;h3&gt;5.3.2. Credit-based Flow Control&lt;/h3&gt;

&lt;p&gt;Since all logical channels between two TaskManagers are multiplexed over a single TCP connection, a slow receiver on one channel could stall the connection entirely, throttling every other subtask sharing the wire. Credit-based flow control solves this by tracking buffer availability per logical channel, keeping backpressure isolated.&lt;/p&gt;

&lt;p&gt;The core rule: a sender may only forward a buffer if the receiver has announced capacity for it. &lt;code&gt;1 buffer = 1 credit&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;On the receiving side, each remote input channel has two kinds of buffers:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Exclusive buffers (2 per channel): permanently assigned, never shared.&lt;/li&gt;
&lt;li&gt;Floating buffers (8 per gate): shared across all channels in the gate, borrowed on demand.&lt;/li&gt;
&lt;/ul&gt;

&lt;img class=&quot;center-image-0 center-image-90&quot; src=&quot;./assets/posts/flink/flink-flow-control.svg&quot; /&gt;

&lt;p&gt;If there are not enough floating buffers available globally, each buffer pool receives a share of what is available, proportional to its capacity. The cycle:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;When a channel is established, the receiver announces its exclusive buffers as initial credits.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The sender tracks the credit count per subpartition. Each sent buffer decrements it by one. No credit, no sending.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Each buffer sent also carries the sender&apos;s current backlog size: how many buffers are still waiting in that subpartition&apos;s queue.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The receiver uses the backlog to request floating buffers from the gate&apos;s shared pool. It may get all, some, or none. If none are available, it registers as a listener and gets notified when one is recycled.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Every newly acquired buffer is announced back to the sender as a fresh credit, and the cycle continues.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If a receiver falls behind, its credits eventually hit 0. The sender stops forwarding buffers for that channel only. The TCP connection stays open, other channels on it continue normally. In the example: if one Window subtask on TM2 falls behind, its credit drops to 0. The SourceMap subtasks stop sending to it but keep sending to every other Window subtask. The shared TCP connection between TM1 and TM2 is never blocked.&lt;/p&gt;
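&lt;p&gt;A toy model of a single logical channel makes the stall-and-resume behavior concrete. This is a deliberately minimal sketch of the credit accounting, not the Netty handler logic, and the buffer counts are invented:&lt;/p&gt;

```java
// Toy model of credit-based flow control on one logical channel:
// the sender may only forward a buffer while it holds a credit.
public class CreditFlowSketch {
    int credits; // announced by the receiver (1 buffer = 1 credit)
    int backlog; // buffers queued at the sender
    int sent;

    CreditFlowSketch(int initialCredits) {
        this.credits = initialCredits;
    }

    void produce(int buffers) {
        backlog += buffers;
    }

    // Sender loop: each sent buffer costs one credit.
    void trySend() {
        while (credits > 0) {
            if (backlog == 0) break;
            credits--;
            backlog--;
            sent++;
        }
    }

    // Receiver recycled a buffer and announces it as a fresh credit.
    void addCredit(int n) {
        credits += n;
    }

    public static void main(String[] args) {
        CreditFlowSketch ch = new CreditFlowSketch(2); // 2 exclusive buffers
        ch.produce(5);
        ch.trySend();
        System.out.println(ch.sent);    // 2: credits exhausted, channel backpressured
        System.out.println(ch.backlog); // 3 buffers still queued at the sender
        ch.addCredit(1);                // a buffer was recycled on the receiver
        ch.trySend();
        System.out.println(ch.sent);    // 3
    }
}
```

&lt;p&gt;Only this channel stops when credits hit 0; in the real stack, other channels multiplexed on the same TCP connection keep flowing.&lt;/p&gt;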

&lt;p&gt;Because one channel in a multiplex can no longer block another, overall resource utilization improves. Full control over how much data is &quot;on the wire&quot; also improves checkpoint alignment. Without flow control, a stalled receiver would still have the lower network stack&apos;s internal buffers filling up, and checkpoint barriers would queue behind all of that data, waiting for it to drain before alignment could begin. With credit-based control, there is far less data sitting in transit, so barriers propagate faster.&lt;/p&gt;

&lt;h3&gt;5.3.3. Buffer Flushing&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;RecordWriter&lt;/code&gt; serializes each record into bytes on the heap, then writes those bytes into the network buffer currently assigned to the target subpartition. If the record doesn&apos;t fit, the remaining bytes spill into a new buffer. The deserializer on the receiving side (&lt;code&gt;SpillingAdaptiveSpanningRecordDeserializer&lt;/code&gt;) handles reassembly, including records that span multiple 32 KB buffers.&lt;/p&gt;

&lt;p&gt;A buffer becomes available for &lt;code&gt;Netty&lt;/code&gt; to consume in three situations:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Buffer full: the writer finishes the buffer and requests a new one. The finished buffer is added to the subpartition queue, which notifies Netty.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Buffer timeout: a background thread (&lt;code&gt;OutputFlusher&lt;/code&gt;) periodically calls flush (default: every 100ms, configured via &lt;code&gt;execution.buffer-timeout.interval&lt;/code&gt;). This notifies Netty to consume whatever has been written so far without closing the buffer. The buffer stays in the queue and keeps accumulating more data from the writer side.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Special event: checkpoint barriers, end-of-partition events, etc. These finish all in-progress buffers immediately and add the event to every subpartition.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;img class=&quot;center-image-0 center-image-100&quot; src=&quot;./assets/posts/flink/flink-buffer-flushing.svg&quot; /&gt;

&lt;p&gt;The buffer is added to the subpartition queue while still being written to (via the &lt;code&gt;BufferBuilder&lt;/code&gt; / &lt;code&gt;BufferConsumer&lt;/code&gt; pair). The writer appends through the &lt;code&gt;BufferBuilder&lt;/code&gt;; Netty reads through the &lt;code&gt;BufferConsumer&lt;/code&gt;. This avoids synchronization on every record: the two sides coordinate only through the buffer&apos;s reader and writer indices.&lt;/p&gt;

&lt;p&gt;In low-throughput scenarios, the output flusher drives latency. In high-throughput scenarios, buffers fill up before the flusher fires and the system self-adjusts.&lt;/p&gt;
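&lt;p&gt;As a minimal sketch (in Python, not Flink&apos;s actual classes), the writer/reader coordination through a pair of indices looks like this:&lt;/p&gt;

```python
# Minimal sketch (not Flink's actual API) of the BufferBuilder/BufferConsumer
# idea: one side appends into a fixed-size buffer and advances a writer index;
# the other side reads everything up to that index. The only shared state is
# the pair of indices, so no per-record locking is needed.
class SharedBuffer:
    def __init__(self, capacity=32 * 1024):           # 32 KB network buffer
        self.data = bytearray(capacity)
        self.writer_index = 0                         # advanced by the writer
        self.reader_index = 0                         # advanced by the consumer

class BufferBuilder:
    """Writer side: appends serialized records, returns the bytes that did not fit."""
    def __init__(self, buf):
        self.buf = buf

    def append(self, record_bytes):
        free = len(self.buf.data) - self.buf.writer_index
        n = min(free, len(record_bytes))
        self.buf.data[self.buf.writer_index:self.buf.writer_index + n] = record_bytes[:n]
        self.buf.writer_index += n
        return record_bytes[n:]                       # spills into the next buffer

class BufferConsumer:
    """Reader side (Netty): reads whatever has been written so far."""
    def __init__(self, buf):
        self.buf = buf

    def poll(self):
        chunk = bytes(self.buf.data[self.buf.reader_index:self.buf.writer_index])
        self.buf.reader_index = self.buf.writer_index
        return chunk

buf = SharedBuffer(capacity=8)
builder, consumer = BufferBuilder(buf), BufferConsumer(buf)
leftover = builder.append(b"0123456789")              # 10 bytes into an 8-byte buffer
print(leftover)                                       # b'89' spills to a new buffer
print(consumer.poll())                                # b'01234567'
```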

&lt;h3&gt;6. End to End&lt;/h3&gt;

&lt;p&gt;The very first section introduced the high-level picture: Client, JobManager, TaskManagers. Here is the same diagram, redrawn with everything covered since.&lt;/p&gt;

&lt;img class=&quot;center-image-0 center-image-100&quot; src=&quot;./assets/posts/flink/flink-end-to-end.svg&quot; /&gt;

&lt;p&gt;If you made it this far, you now have a solid mental model of what happens inside a running Flink job, from graph compilation and operator chaining to state snapshots, flow control, and much more. Not everything Flink does, but enough to reason about what is actually going on when a job runs.&lt;/p&gt;

&lt;h3&gt;7. References&lt;/h3&gt;
&lt;pre style=&quot;max-height: 180px&quot;&gt;&lt;code&gt;[1] &quot;Flink Architecture,&quot; Apache Flink, [Online]. Available: https://nightlies.apache.org/flink/flink-docs-stable/docs/concepts/flink-architecture/.
[2] &quot;A Deep Dive into Flink&apos;s Network Stack,&quot; Apache Flink, [Online]. Available: https://flink.apache.org/2019/06/05/a-deep-dive-into-flinks-network-stack/.
[3] &quot;Flink Course Series 1: A General Introduction to Apache Flink,&quot; Alibaba Cloud, [Online]. Available: https://www.alibabacloud.com/blog/flink-course-series-1-a-general-introduction-to-apache-flink_597974.
[4] &quot;Apache Flink: Concepts Overview,&quot; Apache Flink, [Online]. Available: https://nightlies.apache.org/flink/flink-docs-release-2.2/docs/concepts/overview/.
[5] &quot;DataStream V2: Watermark,&quot; Apache Flink, [Online]. Available: https://nightlies.apache.org/flink/flink-docs-release-2.2/docs/dev/datastream-v2/watermark/.
[6] &quot;DataStream V2: Building Blocks,&quot; Apache Flink, [Online]. Available: https://nightlies.apache.org/flink/flink-docs-release-2.2/docs/dev/datastream-v2/building_blocks/.
&lt;/code&gt;&lt;/pre&gt;

</content><author><name>Adesh Nalpet Adimurthy</name></author><category term="System Wisdom" /><category term="Realtime" /><summary type="html">Most blog posts on Flink&apos;s internals and architecture, even the official documentation, tend to be fragmented across different examples and cover components in isolation. The approach taken here is to follow a single reference Flink job end-to-end, through every component and moving part it touches, keeping the discussion grounded in the example, rather than attempting broad coverage of Flink&apos;s full capabilities. The tradeoff is intentional: depth over breadth.</summary></entry><entry><title type="html">Apache Kafka Internals</title><link href="https://pyblog.xyz/kafka-internals" rel="alternate" type="text/html" title="Apache Kafka Internals" /><published>2025-02-09T00:00:00+00:00</published><updated>2025-02-09T00:00:00+00:00</updated><id>https://pyblog.xyz/kafka-internals</id><content type="html" xml:base="https://pyblog.xyz/kafka-internals">&lt;div class=&quot;blog-reference&quot;&gt;
&lt;p&gt;🚧 This post is a work in progress, but feel free to explore what’s here so far. Stay tuned for more!&lt;/p&gt;
&lt;/div&gt;

&lt;p&gt;&lt;code&gt;14 years&lt;/code&gt; of &lt;a href=&quot;https://kafka.apache.org/&quot; target=&quot;_blank&quot;&gt;Apache Kafka&lt;/a&gt;! Kafka is the de facto standard for event streaming, just like AWS S3 is for object storage and PostgreSQL is for RDBMS. While every TD&amp;amp;H (SWE) has likely used Kafka, managing a Kafka cluster is a whole other game. The long list of &lt;a href=&quot;https://kafka.apache.org/documentation/#configuration&quot; target=&quot;_blank&quot;&gt;high-importance configurations&lt;/a&gt; is a testament to this. In this blog post, the goal is to understand Kafka&apos;s internals enough to make sense of its many configurations and highlight best practices.&lt;/p&gt;

&lt;p&gt;&lt;img class=&quot;center-image-0 center-image-50&quot; src=&quot;./assets/posts/kafka/kafka-api.webp&quot; /&gt;&lt;/p&gt;

&lt;p&gt;On a completely different note, the cost and operational complexity of Kafka have led to the emergence of alternatives, making the &lt;code&gt;Kafka API&lt;/code&gt; the de facto standard for event streaming, similar to the S3 API and PG Wire. Some examples include: Confluent Kafka, Redpanda, WarpStream, AutoMQ, AWS MSK, Pulsar, and many more!&lt;/p&gt;

&lt;hr class=&quot;hr&quot; /&gt;

&lt;h3&gt;1. Event Stream&lt;/h3&gt;
&lt;p&gt;The core concept of Kafka revolves around streaming events. An event can be anything, typically representing an action or a piece of information about what happened, such as a button click or a temperature reading.&lt;/p&gt;
&lt;p&gt;Each event is modeled as a &lt;code&gt;record&lt;/code&gt; in Kafka with a &lt;code&gt;timestamp&lt;/code&gt;, &lt;code&gt;key&lt;/code&gt;, &lt;code&gt;value&lt;/code&gt;, and optional &lt;code&gt;headers&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;img class=&quot;center-image-0 center-image-70&quot; src=&quot;./assets/posts/kafka/event-stream.svg&quot; /&gt;&lt;/p&gt;

&lt;p&gt;The payload or event data is included in the &lt;code&gt;value&lt;/code&gt;, and the &lt;code&gt;key&lt;/code&gt; is used for:&lt;/p&gt;

&lt;ul class=&quot;one-line-list&quot;&gt;
    &lt;li&gt;imposing the ordering of events/messages,&lt;/li&gt;
    &lt;li&gt;co-locating the events that have the same key property,&lt;/li&gt;
    &lt;li&gt;and key-based storage, retention, or compaction.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In Kafka, the &lt;code&gt;key&lt;/code&gt; and &lt;code&gt;value&lt;/code&gt; are stored as byte arrays, giving the flexibility to encode the data with any serializer. Optionally, a combination of the &lt;a target=&quot;_blank&quot; href=&quot;https://github.com/confluentinc/schema-registry&quot;&gt;Schema Registry&lt;/a&gt; and the &lt;a href=&quot;https://mvnrepository.com/artifact/io.confluent/kafka-avro-serializer&quot; target=&quot;_blank&quot;&gt;Avro serializer&lt;/a&gt; is a common practice.&lt;/p&gt;
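&lt;p&gt;A hedged sketch of the record model: Kafka itself only sees byte arrays, so the client chooses the serializers. The helper below (a made-up name, not any client API) encodes the key as UTF-8 and the value as JSON:&lt;/p&gt;

```python
import json, time

# Illustrative only: the field names (timestamp/key/value/headers) mirror the
# record model described above; make_record is a hypothetical helper, not a
# real Kafka client function.
def make_record(key, value, headers=None, timestamp=None):
    return {
        "timestamp": timestamp if timestamp is not None else int(time.time() * 1000),
        "key": key.encode("utf-8") if key is not None else None,   # bytes or None
        "value": json.dumps(value).encode("utf-8"),                # opaque bytes to Kafka
        "headers": headers or {},
    }

rec = make_record("sensor-42", {"temp_c": 21.5}, headers={"source": "iot"})
print(type(rec["value"]))                       # <class 'bytes'> — Kafka stores it opaquely
print(json.loads(rec["value"].decode("utf-8")))
```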

&lt;hr class=&quot;hr&quot; /&gt;

&lt;h3&gt;2. Kafka Topics&lt;/h3&gt;
&lt;p&gt;For comparison, &lt;code&gt;topics&lt;/code&gt; are like tables in a database. In Kafka, they are used to organize events of the same type, and hence the same schema, together. The producer specifies which topic to publish to, and the subscriber or consumer specifies which topic(s) to read from. Note: the stream is immutable and append-only.&lt;/p&gt;

&lt;p&gt;&lt;img class=&quot;center-image-0 center-image-100&quot; src=&quot;./assets/posts/kafka/kafka-cluster.svg&quot; /&gt;&lt;/p&gt;

&lt;p&gt;The immediate question is, how do we distribute data in topics across different nodes in the Kafka cluster? This calls for a way to distribute data within the topic. That&apos;s where &lt;code&gt;partitions&lt;/code&gt; come into play.&lt;/p&gt;

&lt;h3&gt;2.1. Kafka Topic Partitions&lt;/h3&gt;
&lt;p&gt;A Kafka topic can have one or more &lt;code&gt;partitions&lt;/code&gt;, and a partition can be regarded as the unit of data distribution and also the unit of &lt;code&gt;parallelism&lt;/code&gt;. Partitions of a topic can reside on different nodes of the Kafka cluster. Each partition can be accessed independently; hence, within a consumer group, you can have at most as many active consumers as there are partitions (strongly dictating the horizontal scalability of consumers).&lt;/p&gt;

&lt;p&gt;&lt;img class=&quot;center-image-0 center-image-100&quot; src=&quot;./assets/posts/kafka/kafka-partitions.svg&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Furthermore, each event/record within the partition has a unique ID called the &lt;code&gt;offset&lt;/code&gt;: a monotonically increasing number; once an offset is assigned, it is never reused. The events in a partition are delivered to the consumer in offset order.&lt;/p&gt;

&lt;h3&gt;2.2. Choosing Number of Partitions&lt;/h3&gt;

&lt;p&gt;The number of partitions dictates &lt;code&gt;parallelism&lt;/code&gt; and hence the &lt;code&gt;throughput&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The more partitions a topic has:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The higher the &lt;code&gt;throughput&lt;/code&gt;: both the producer and the broker can process different partitions independently and in parallel, leading to better utilization of resources for expensive operations such as &lt;code&gt;compression&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;The more consumers a &lt;code&gt;consumer group&lt;/code&gt; can hold, again raising throughput. Each consumer can consume messages from multiple partitions, but one partition cannot be shared across consumers in the same consumer group.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;img class=&quot;center-image-0 center-image-85&quot; src=&quot;./assets/posts/kafka/kafka-cluster-example.svg&quot; /&gt;&lt;/p&gt;

&lt;p&gt;However, it&apos;s important to strike a balance when choosing the number of partitions. More partitions may increase unavailability/downtime periods.&lt;/p&gt;
&lt;ul&gt;
    &lt;li&gt;Quick pre-context (from &lt;a href=&quot;#3-3-data-replication&quot;&gt;Section 3.3&lt;/a&gt;): A partition has multiple &lt;code&gt;replicas&lt;/code&gt;, each stored in different brokers, and one replica is assigned as the &lt;code&gt;leader&lt;/code&gt; while the rest are &lt;code&gt;followers&lt;/code&gt;. The producer and consumer requests are typically served by the leader broker (of that partition).&lt;/li&gt;
    &lt;li&gt;When a Kafka broker goes down, leadership of the partitions it hosted moves to other available replicas so client requests can still be served. When the number of partitions is high, the cumulative latency of electing new leaders adds up.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;More partitions mean more RAM is consumed by the clients (especially the producer): &lt;/p&gt;
&lt;ul&gt;
    &lt;li&gt;the producer client creates a buffer per partition (&lt;a href=&quot;#3-1-producer&quot;&gt;Section 3.1&lt;/a&gt;: accumulated by byte size or time). With more partitions, the memory consumption adds up. &lt;/li&gt;
    &lt;li&gt;Similarly, the consumer client fetches a batch of records per partition, hence increasing the memory needs (crucial for real-time low-latency consumers).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The idea behind choosing the number of partitions is to measure the maximum throughput that can be achieved on a single partition (for both production and consumption) and choose the number of partitions to accommodate the &lt;code&gt;target throughput&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;img class=&quot;center-image-0 center-image-60&quot; src=&quot;./assets/posts/kafka/partitions-equation.svg&quot; /&gt;&lt;/p&gt;

&lt;p&gt;The reason for running benchmarks to determine the number of partitions is that per-partition throughput depends on several factors, such as: batch size, compression codec, type of acknowledgment, replication factor, etc. To leave headroom, choose &lt;code&gt;(1.2 * P)&lt;/code&gt; or higher; it&apos;s a common practice to &lt;code&gt;over-partition&lt;/code&gt; by a bit.&lt;/p&gt;
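&lt;p&gt;The sizing rule above can be sketched as a small calculation (the throughput numbers are made up for illustration):&lt;/p&gt;

```python
import math

# Sketch of the sizing rule: benchmark per-partition produce throughput (t_p)
# and consume throughput (t_c), then pick enough partitions for the target
# throughput t, with ~20% headroom for over-partitioning.
def partitions_needed(target_mb_s, produce_mb_s, consume_mb_s, headroom=1.2):
    p = max(target_mb_s / produce_mb_s, target_mb_s / consume_mb_s)
    return math.ceil(p * headroom)

# e.g. target 100 MB/s, 10 MB/s per partition on produce, 20 MB/s on consume
print(partitions_needed(target_mb_s=100, produce_mb_s=10, consume_mb_s=20))  # 12
```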

&lt;hr class=&quot;hr&quot; /&gt;

&lt;p&gt;The Kafka cluster has a &lt;code&gt;control plane&lt;/code&gt; and a &lt;code&gt;data plane&lt;/code&gt;, where the control plane is responsible for handling all the metadata, and the data plane handles the actual data/events.&lt;/p&gt;

&lt;h3&gt;3. Kafka Broker (Data Plane)&lt;/h3&gt;

&lt;p&gt;Diving into the workings of the data plane, there are two types of requests the Kafka broker handles: the put requests from the producer and the get requests from the consumer.&lt;/p&gt;

&lt;p&gt;&lt;img class=&quot;center-image-0 center-image-70&quot; src=&quot;./assets/posts/kafka/record-batch.svg&quot; /&gt;&lt;/p&gt;

&lt;h3&gt;3.1. Producer&lt;/h3&gt;
&lt;p&gt;The &lt;b&gt;producer&lt;/b&gt; requests start with the producer application, sending the request with the key and value. The Kafka producer library determines which partition the messages should be produced to, using a hash of the supplied partition key. Hence, records with the same key always go to the same partition. When no partition key is supplied, older clients fall back to round-robin across partitions, while newer clients (Kafka 2.4+) use a sticky partitioner that fills a batch for one partition before moving to the next.&lt;/p&gt;
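&lt;p&gt;The key-to-partition mapping can be sketched as follows; the real Java client hashes keys with murmur2, so &lt;code&gt;crc32&lt;/code&gt; here is only a stand-in to show the idea:&lt;/p&gt;

```python
import itertools, zlib

# Illustrative partitioner, not the real client's: same key -> same partition;
# keyless records rotate (the actual modern client uses a "sticky" strategy
# that fills one batch before moving on).
def partition_for(key, num_partitions, _counter=itertools.count()):
    if key is None:
        return next(_counter) % num_partitions   # keyless: rotate partitions
    # crc32 stands in for murmur2 purely for illustration
    return zlib.crc32(key.encode("utf-8")) % num_partitions

p1 = partition_for("user-7", 6)
p2 = partition_for("user-7", 6)
print(p1 == p2)  # True — same key always maps to the same partition
```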

&lt;p&gt;Sending each record to the broker individually is not very efficient. The producer library therefore buffers data for a particular partition in an in-memory data structure (record batches). Data in the buffer is accumulated up to a limit based on the total size of all the records or on time (&lt;code&gt;batch.size&lt;/code&gt; and &lt;code&gt;linger.ms&lt;/code&gt;). That is, once enough time has passed or enough data has accumulated, the records are flushed to the corresponding broker.&lt;/p&gt;

&lt;p&gt;Lastly, batching allows records to be compressed, as it is better to compress a batch of records than a single record.&lt;/p&gt;
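&lt;p&gt;A minimal sketch of the size-or-time accumulator described above (mirroring the &lt;code&gt;batch.size&lt;/code&gt; / &lt;code&gt;linger.ms&lt;/code&gt; behavior, not the actual client code):&lt;/p&gt;

```python
import time

# Hedged sketch of the producer-side accumulator: records for a partition are
# buffered and flushed when either the byte limit or the time limit is hit.
class Accumulator:
    def __init__(self, max_bytes=16384, linger_s=0.1, clock=time.monotonic):
        self.max_bytes, self.linger_s, self.clock = max_bytes, linger_s, clock
        self.batch, self.size, self.started = [], 0, None

    def append(self, record):
        if self.started is None:            # start the linger timer on first record
            self.started = self.clock()
        self.batch.append(record)
        self.size += len(record)

    def ready(self):
        if not self.batch:
            return False
        return (self.size >= self.max_bytes
                or self.clock() - self.started >= self.linger_s)

    def drain(self):
        batch, self.batch, self.size, self.started = self.batch, [], 0, None
        return batch                        # this is what gets sent to the broker
```

&lt;p&gt;An injected clock makes the time-based trigger easy to see: after &lt;code&gt;linger_s&lt;/code&gt; elapses, &lt;code&gt;ready()&lt;/code&gt; flips to true even for a small batch.&lt;/p&gt;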

&lt;h3&gt;3.1.1. Socket Receive Buffer &amp;amp; Network Threads&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;Network threads&lt;/code&gt; in a Kafka broker are like workers that handle communication between the Kafka server (broker) and the outside world (clients), i.e., they handle messages coming into the server (data sent by producers).&lt;small&gt;&lt;br /&gt;*and also send messages back to clients (consumers fetching data)&lt;/small&gt;&lt;/p&gt;

&lt;p&gt;To avoid network threads being overwhelmed by incoming data, a &lt;code&gt;socket buffer&lt;/code&gt; stands before the network threads that buffers incoming requests.&lt;/p&gt;

&lt;p&gt;&lt;img class=&quot;center-image-0 center-image-80&quot; src=&quot;./assets/posts/kafka/network-thread-producer.svg&quot; /&gt;&lt;/p&gt;

&lt;p&gt;The network thread handles each producer/client request throughout the rest of its lifecycle (the same network thread tracks the request through the entire process, until the request is fully handled and the response is sent). For example, when a producer sends messages to a Kafka topic:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The network thread receives the request from the producer,&lt;/li&gt;
&lt;li&gt;processes the request &lt;small&gt;&lt;br /&gt;*(write the message to the Kafka commit log &amp;amp; wait for replication).&lt;/small&gt;&lt;/li&gt;
&lt;li&gt;Once processing is done, the network thread sends a response (acknowledgment that the messages were successfully received).&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;3.1.2. Request Queue &amp;amp; I/O Threads&lt;/h3&gt;

&lt;p&gt;Each network thread handles multiple requests from different clients (multiplex) and is meant to be lightweight, where it receives the bytes, forms a producer request, and publishes it to a &lt;code&gt;shared request queue&lt;/code&gt;, immediately handling the next request.&lt;/p&gt;

&lt;p&gt;Note: In order to guarantee the order of requests from a client, the network thread handles one request per client at a time; i.e., only after completing a request (with a response), does the network thread take another request from the same client.&lt;/p&gt;

&lt;p&gt;&lt;img class=&quot;center-image-0 center-image-65&quot; src=&quot;./assets/posts/kafka/i-o-threads.svg&quot; /&gt;&lt;/p&gt;

&lt;p&gt;The second main pool in Kafka, the &lt;code&gt;I/O threads&lt;/code&gt;, picks requests from the shared &lt;code&gt;request queue&lt;/code&gt;. The I/O threads handle requests from any client, unlike the network threads.&lt;/p&gt;

&lt;h3&gt;3.1.3. Commit Log&lt;/h3&gt;
&lt;p&gt;The I/O thread first validates the data (CRC) and appends data to a data structure called the &lt;code&gt;commit log&lt;/code&gt; (by partition).&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;00000000000000000000.log
00000000000000000000.index
00000000000000000025.log
00000000000000000025.index
...
00000000000000004580.log
00000000000000004580.index
...
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The number (&lt;code&gt;0, 25 &amp;amp; 4580&lt;/code&gt;) in a segment&apos;s file name represents the base offset (i.e., the offset of the first message) of that segment.&lt;/p&gt;

&lt;p&gt;&lt;img class=&quot;center-image-0 center-image-100&quot; src=&quot;./assets/posts/kafka/segment.svg&quot; /&gt;&lt;/p&gt;

&lt;p&gt;The commit log (per partition) is organized on &lt;code&gt;disk&lt;/code&gt; as &lt;code&gt;segments&lt;/code&gt;. Each segment has two main parts: the actual &lt;code&gt;data&lt;/code&gt; and the &lt;code&gt;index&lt;/code&gt; (&lt;code&gt;.log&lt;/code&gt; and &lt;code&gt;.index&lt;/code&gt;), which stores the position inside the log file. By default, the broker acknowledges the produce request only after replicating across other brokers (based on the &lt;code&gt;replication factor&lt;/code&gt;), since Kafka offers high durability via replication.&lt;/p&gt;
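&lt;p&gt;A sketch of the lookup path: a binary search over segment base offsets finds the right file, and the sparse index narrows down the byte position (the offsets and positions below are invented for illustration):&lt;/p&gt;

```python
import bisect

# Segment file names are base offsets, so a binary search over them finds the
# right segment; the sparse .index inside it maps offsets to byte positions.
# All numbers here are made up to show the mechanism.
segments = {
    0:    [(0, 0), (10, 4200)],    # (offset, byte position in the .log file)
    25:   [(25, 0), (40, 3900)],
    4580: [(4580, 0)],
}
bases = sorted(segments)

def locate(offset):
    # largest base offset <= requested offset picks the segment file
    base = bases[bisect.bisect_right(bases, offset) - 1]
    # last index entry at or before the offset; the broker scans forward from there
    entry = max((e for e in segments[base] if e[0] <= offset), key=lambda e: e[0])
    return base, entry[1]

print(locate(30))   # segment 25, start scanning at byte position 0
```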

&lt;p&gt;Note: A new batch of records (producer request) is first written into the OS&apos;s &lt;code&gt;page cache&lt;/code&gt; and flushed to disk asynchronously. If the Kafka JVM crashes, recent messages in the page cache survive; data is lost only if the machine itself crashes before the flush. &lt;code&gt;Topic replication&lt;/code&gt; mitigates this, meaning data loss is possible only if multiple brokers crash simultaneously.&lt;/p&gt;

&lt;h3&gt;3.1.4. Purgatory &amp;amp; Response Queue&lt;/h3&gt;

&lt;p&gt;While waiting for full replication, the I/O thread is not blocked. Instead, the pending produce requests are stashed in the &lt;code&gt;purgatory&lt;/code&gt;, and the I/O Thread is freed up to process the next set of requests.&lt;/p&gt;

&lt;p&gt;&lt;img class=&quot;center-image-0 center-image-65&quot; src=&quot;./assets/posts/kafka/purgatory.svg&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Once the data of the pending produce request is fully replicated, the request is moved out of the purgatory.&lt;/p&gt;

&lt;h3&gt;3.1.5. Network Thread &amp;amp; Socket Send Buffer&lt;/h3&gt;

&lt;p&gt;The response is then placed on the &lt;code&gt;shared response queue&lt;/code&gt;, picked up by the network thread, and sent out through the &lt;code&gt;socket send buffer&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;img class=&quot;center-image-0 center-image-100&quot; src=&quot;./assets/posts/kafka/broker-client.svg&quot; /&gt;&lt;/p&gt;

&lt;h3&gt;3.2. Consumer&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;The consumer client sends the fetch request, specifying the &lt;code&gt;topic&lt;/code&gt;, the &lt;code&gt;partition&lt;/code&gt;, and the &lt;code&gt;start offset&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Similar to the produce request, the fetch request goes through the &lt;code&gt;socket receive buffer&lt;/code&gt; &amp;gt; &lt;code&gt;network threads&lt;/code&gt; &amp;gt; &lt;code&gt;shared request queue&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;I/O threads&lt;/code&gt; then refer to the index structure to find the corresponding file byte range using the &lt;code&gt;offset index&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;To prevent frequent empty responses when no new data has been ingested, the consumer typically specifies the minimum number of bytes and the maximum amount of time for the response.&lt;/li&gt;
&lt;li&gt;The fetch request is parked in the &lt;code&gt;purgatory&lt;/code&gt; until either of those conditions is met.&lt;/li&gt;
&lt;li&gt;When the byte or time threshold is met, the request is taken out of purgatory and placed in the &lt;code&gt;response queue&lt;/code&gt; for the network thread, which sends the actual data as a response to the consumer/client.&lt;/li&gt;
&lt;/ul&gt;
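&lt;p&gt;The long-poll behavior in the steps above can be sketched as a simple loop (the names mirror the consumer&apos;s &lt;code&gt;fetch.min.bytes&lt;/code&gt; / &lt;code&gt;fetch.max.wait.ms&lt;/code&gt; configs; the loop itself is illustrative, not broker code):&lt;/p&gt;

```python
# Sketch of the fetch "long poll": the broker parks the request until enough
# bytes are available or the wait budget runs out, whichever comes first.
def serve_fetch(available_bytes_at, min_bytes, max_wait_ms, step_ms=10):
    waited = 0
    while waited < max_wait_ms:
        if available_bytes_at(waited) >= min_bytes:
            return waited, "min-bytes met"
        waited += step_ms                  # stays parked in purgatory
    return waited, "timed out"

# data trickles in at 1 byte per millisecond
print(serve_fetch(lambda ms: ms, min_bytes=50, max_wait_ms=500))  # (50, 'min-bytes met')
```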

&lt;p&gt;Kafka uses &lt;code&gt;zero-copy&lt;/code&gt; transfers in the network, meaning there are no intermediate memory copies. Instead, data is transferred directly from disk buffers to the remote socket, making it memory efficient.&lt;/p&gt;
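&lt;p&gt;The zero-copy path is the &lt;code&gt;sendfile(2)&lt;/code&gt; syscall (reached from Java via NIO&apos;s &lt;code&gt;transferTo&lt;/code&gt;); a minimal demonstration over a local socket pair:&lt;/p&gt;

```python
import os, socket, tempfile

# os.sendfile hands the kernel a file descriptor and a socket, so the bytes
# move from the page cache to the socket without passing through user-space
# buffers. serve_log_segment is an illustrative helper, not broker code.
def serve_log_segment(sock, path):
    with open(path, "rb") as f:
        size = os.fstat(f.fileno()).st_size
        sent = 0
        while sent < size:
            sent += os.sendfile(sock.fileno(), f.fileno(), sent, size - sent)
    return sent

a, b = socket.socketpair()
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"record-batch-bytes")
sent = serve_log_segment(a, tmp.name)
print(sent, b.recv(64))   # 18 b'record-batch-bytes'
os.unlink(tmp.name)
```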

&lt;p&gt;However, reading older data, which involves accessing the disk, can block the network thread. This isn&apos;t ideal, as the network threads are shared by several clients, so one slow read delays processing for everyone else. The &lt;code&gt;Tiered Storage&lt;/code&gt; fetch solves this very problem.&lt;/p&gt;

&lt;h3&gt;3.2.1. Tiered Storage&lt;/h3&gt;

&lt;p&gt;Tiered storage in Kafka was introduced as an early access feature in 3.6.0 (October 10, 2023).&lt;/p&gt;

&lt;p&gt;&lt;img class=&quot;center-image-0 center-image-90&quot; src=&quot;./assets/posts/kafka/broker-local-storage.svg&quot; /&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;Tiered storage&lt;/code&gt; is a common storage architecture that uses different classes/layers/tiers of storage to efficiently store and manage data based on access patterns, performance needs, and cost. A typical tier model has frequently accessed data or &quot;hot&quot; data, and less frequently accessed data is moved (not copied) to a lower-cost, lower-performance storage (&quot;warm&quot;). Outside of the tiers, &quot;cold&quot; storage is a common practice for storing backups.&lt;/p&gt;

&lt;p&gt;Kafka is designed to ingest large volumes of data. Without tiered storage, a single broker is responsible for hosting an entire replica of a topic partition, adding a limit to how much data can be stored. This isn&apos;t much of a concern in real-time applications where older data is not relevant.&lt;/p&gt;

&lt;p&gt;&lt;img class=&quot;center-image-0 center-image-90&quot; src=&quot;./assets/posts/kafka/broker-tiered-storage.svg&quot; /&gt;&lt;/p&gt;

&lt;p&gt;But in cases where historical data is necessary, tiered storage allows storing less frequently accessed data in remote storage (not present locally in the broker).&lt;/p&gt;

&lt;p&gt;Tiered storage offers several advantages:&lt;/p&gt;
&lt;ul&gt;
    &lt;li&gt;&lt;code&gt;Cost&lt;/code&gt;: It&apos;s cost-effective as inactive segments of local storage (stored on expensive fast local disks like SSDs) can be moved to remote storage (object stores such as S3), making storage cheaper and virtually unlimited.&lt;/li&gt;
    &lt;li&gt;&lt;code&gt;Elasticity&lt;/code&gt;: Now that storage and compute of brokers are separated and can be scaled independently, it also allows faster cluster operations due to less local data. Without tiered storage, needing more storage essentially meant increasing the number of brokers (which also increases compute).&lt;/li&gt;
    &lt;li&gt;&lt;code&gt;Isolation&lt;/code&gt;: It provides better isolation between real-time consumers and historical data consumers.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Coming back to the fetch request (from the consumer) with &lt;code&gt;tiered storage&lt;/code&gt; enabled: if the consumer requests an offset that is still present locally, the data is served the same way as before, from the &lt;code&gt;page cache&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;img class=&quot;center-image-0 center-image-90&quot; src=&quot;./assets/posts/kafka/broker-consumer-tiered-storage.svg&quot; /&gt;&lt;/p&gt;

&lt;p&gt;The chances of most local data being in the page cache are also higher (due to smaller local data). However, if the data is not present locally and is in the &lt;code&gt;remote store&lt;/code&gt;, the broker will stream the remote data from the object store into an in-memory buffer via the &lt;code&gt;Tiered Fetch Threads&lt;/code&gt;, all the way to the remote &lt;code&gt;socket send buffer&lt;/code&gt; in the network thread.&lt;/p&gt;

&lt;p&gt;Hence, the network thread is no longer blocked even when the consumer is accessing historical data. i.e., real-time and historical data access don&apos;t impact each other.&lt;/p&gt;
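&lt;p&gt;A toy sketch of that read-path decision (plain dicts stand in for the local segments and the remote object store):&lt;/p&gt;

```python
# Illustrative only: serve from local segments (page cache) when the offset is
# at or beyond the local log start, otherwise stream from the remote store via
# the separate tiered-fetch path so network threads aren't blocked on it.
def read(offset, local_log_start, local, remote):
    if offset >= local_log_start:
        return ("local", local[offset])
    return ("tiered-fetch", remote[offset])

local = {100: b"new"}     # recent data, still on the broker's disk
remote = {5: b"old"}      # older segments, offloaded to object storage
print(read(100, local_log_start=100, local=local, remote=remote))  # ('local', b'new')
print(read(5, local_log_start=100, local=local, remote=remote))    # ('tiered-fetch', b'old')
```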

&lt;h3&gt;3.3. Data Replication&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;Replication&lt;/code&gt; in the Data Plane is a critical feature of Kafka that offers &lt;code&gt;durability&lt;/code&gt; and &lt;code&gt;high-availability&lt;/code&gt;. Replication is typically enabled and defined at the time of creating the topic.&lt;/p&gt;

&lt;p&gt;Each partition of the topic will be replicated across replicas (&lt;code&gt;replication factor&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;&lt;img class=&quot;center-image-0 center-image-95&quot; src=&quot;./assets/posts/kafka/data-replication.svg&quot; /&gt;&lt;/p&gt;

&lt;p&gt;One of the replicas is assigned to be the &lt;code&gt;leader&lt;/code&gt; of that partition, and the rest are called &lt;code&gt;followers&lt;/code&gt;. The producer sends the data to the leader, and the followers retrieve the data from the leader for replication. In a similar fashion, the consumer reads from the leader; however, the consumer(s) can also read from the follower(s).&lt;/p&gt;

&lt;!-- &lt;h3&gt;4. Kafka Broker (Control Plane)&lt;/h3&gt; --&gt;

&lt;h3&gt;6. References&lt;/h3&gt;
&lt;pre style=&quot;max-height: 180px&quot;&gt;&lt;code&gt;[1] &quot;Apache Kafka Streams Architecture,&quot; Apache Kafka, [Online]. Available: https://kafka.apache.org/39/documentation/streams/architecture.
[2] &quot;Apache Kafka Documentation: Configuration,&quot; Apache Kafka, [Online]. Available: https://kafka.apache.org/documentation/#configuration.
[3] J. Rao, &quot;Apache Kafka Architecture and Internals,&quot; Confluent, [Online]. Available: https://www.confluent.io/blog/apache-kafka-architecture-and-internals-by-jun-rao/.
&lt;/code&gt;&lt;/pre&gt;</content><author><name>Adesh Nalpet Adimurthy</name></author><category term="System Wisdom" /><category term="System Design" /><category term="Realtime" /><summary type="html">🚧 This post is a work in progress, but feel free to explore what’s here so far. Stay tuned for more!</summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://pyblog.xyz/assets/featured/webp/the-kafka.webp" /><media:content medium="image" url="https://pyblog.xyz/assets/featured/webp/the-kafka.webp" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Breath-First Search using Stack</title><link href="https://pyblog.xyz/stack-based-bfs" rel="alternate" type="text/html" title="Breath-First Search using Stack" /><published>2024-07-21T00:00:00+00:00</published><updated>2024-07-21T00:00:00+00:00</updated><id>https://pyblog.xyz/stack-based-bfs</id><content type="html" xml:base="https://pyblog.xyz/stack-based-bfs">&lt;h3&gt;1. BFS using Queue&lt;/h3&gt;
&lt;p&gt;In the prior post on &lt;a href=&quot;https://pyblog.xyz/stack-based-bfs&quot;&gt;graph traversal&lt;/a&gt;, we went into the details of Depth-First Search (DFS) and Breadth-First Search (BFS). BFS is a way of traversing the graph level by level. Specifically, for a balanced tree, the first/root node is visited first, followed by its immediate children, then the next level of children, and so on. Here&apos;s the same example of BFS using a queue:&lt;/p&gt;

&lt;div class=&quot;slider&quot; id=&quot;slider8&quot;&gt;
  &lt;div class=&quot;slides center-image-0 center-image-80&quot;&gt;
    &lt;img src=&quot;./assets/posts/graph-theory/bfs-tree/bfs-tree-Page-1.svg&quot; class=&quot;slide&quot; /&gt;
    &lt;img src=&quot;./assets/posts/graph-theory/bfs-tree/bfs-tree-Page-2.svg&quot; class=&quot;slide&quot; /&gt;
    &lt;img src=&quot;./assets/posts/graph-theory/bfs-tree/bfs-tree-Page-3.svg&quot; class=&quot;slide&quot; /&gt;
    &lt;img src=&quot;./assets/posts/graph-theory/bfs-tree/bfs-tree-Page-4.svg&quot; class=&quot;slide&quot; /&gt;
    &lt;img src=&quot;./assets/posts/graph-theory/bfs-tree/bfs-tree-Page-5.svg&quot; class=&quot;slide&quot; /&gt;
    &lt;img src=&quot;./assets/posts/graph-theory/bfs-tree/bfs-tree-Page-6.svg&quot; class=&quot;slide&quot; /&gt;
    &lt;img src=&quot;./assets/posts/graph-theory/bfs-tree/bfs-tree-Page-7.svg&quot; class=&quot;slide&quot; /&gt;
    &lt;img src=&quot;./assets/posts/graph-theory/bfs-tree/bfs-tree-Page-8.svg&quot; class=&quot;slide&quot; /&gt;
  &lt;/div&gt;
  &lt;div class=&quot;controls&quot;&gt;
    &lt;button onclick=&quot;plusSlides(-1, &apos;slider8&apos;)&quot; class=&quot;prev black-button&quot;&gt;Prev&lt;/button&gt;
    &lt;button onclick=&quot;playSlides(&apos;slider8&apos;)&quot; class=&quot;play black-button&quot;&gt;Play&lt;/button&gt;
    &lt;button onclick=&quot;stopSlides(&apos;slider8&apos;)&quot; class=&quot;stop black-button&quot; hidden=&quot;&quot;&gt;Stop&lt;/button&gt;
    &lt;button onclick=&quot;plusSlides(1, &apos;slider8&apos;)&quot; class=&quot;next black-button&quot;&gt;Next&lt;/button&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;h3&gt;2. Problem: Space Complexity&lt;/h3&gt;
&lt;p&gt;The problem with this solution is adding all the immediate children to the queue before visiting them. While this isn&apos;t much of a concern for a binary tree, imagine a non-binary tree where at each level the number of nodes grows exponentially. In the example below, when the second-level &lt;code&gt;node G&lt;/code&gt; is visited, the queue already has 49 entries. For the Nth level: &lt;code&gt;7^(N-1)&lt;/code&gt; nodes. For level 11, there would be &lt;code&gt;7^10 = 282,475,249&lt;/code&gt; entries in the queue. Nearly 300 million entries, at a 4-byte pointer per entry, works out to roughly 1 GB.&lt;/p&gt;

&lt;p&gt;&lt;img class=&quot;center-image-0 center-image-100&quot; src=&quot;./assets/posts/graph-theory/bfs-stack-problem.svg&quot; /&gt;&lt;/p&gt;

&lt;h3&gt;3. Solution: BFS using Stack&lt;/h3&gt;
&lt;p&gt;In the recursive approach below, the space complexity depends on the number of levels (the depth of the recursion). In a balanced tree, the space complexity is now &lt;code&gt;O(log(n))&lt;/code&gt;, where &lt;code&gt;n&lt;/code&gt; is the total number of nodes.&lt;/p&gt;

&lt;p&gt;Pseudo Code:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;procedure bfs(root:NODE*);
    var target = 0;
    var node = root;
BEGIN
    for each level in tree do
    begin
        printtree(node, target, 0);
        target = target + 1;
    end
END
&lt;/code&gt;&lt;/pre&gt;

&lt;pre&gt;&lt;code&gt;procedure printtree(node:NODE*, target:int, level:int);
BEGIN
    if(target &amp;gt; level) then
    begin
        for each child of node do
            printtree(child, target, level + 1);
    end
    else
        print node;
END
&lt;/code&gt;&lt;/pre&gt;
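&lt;p&gt;The pseudocode above translates to runnable Python as follows; the only memory used is the recursion stack, one frame per level:&lt;/p&gt;

```python
# Runnable version of the pseudocode: print_tree recurses down to the target
# level and emits only the nodes at that level; bfs repeats it once per level,
# producing a level-by-level (breadth-first) ordering.
class Node:
    def __init__(self, value, children=()):
        self.value, self.children = value, list(children)

def print_tree(node, target, level, out):
    if target > level:
        for child in node.children:
            print_tree(child, target, level + 1, out)
    else:
        out.append(node.value)

def bfs(root, height):
    out = []
    for target in range(height):    # one pass per level of the tree
        print_tree(root, target, 0, out)
    return out

tree = Node("A", [Node("B", [Node("D"), Node("E")]),
                  Node("C", [Node("F"), Node("G")])])
print(bfs(tree, 3))   # ['A', 'B', 'C', 'D', 'E', 'F', 'G']
```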

&lt;p&gt;Going back to the same example for a balanced binary tree with nodes: &lt;code&gt;A, B, C, D, E, F, G&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;img class=&quot;center-image-0&quot; style=&quot;width: 48%&quot; src=&quot;./assets/posts/graph-theory/binary-tree.svg&quot; /&gt;&lt;/p&gt;

&lt;p&gt;We initialize the node to the root and set the initial target level to 0. The main BFS loop iterates through each level of the tree, incrementing the target level after processing each one.&lt;/p&gt;

&lt;p&gt;Iteration 1 (target = 0)&lt;/p&gt;
&lt;div class=&quot;table-container&quot;&gt;
&lt;table&gt;
  &lt;tr&gt;
    &lt;th&gt;Step&lt;/th&gt;
    &lt;th&gt;Action&lt;/th&gt;
    &lt;th&gt;Current Call Stack&lt;/th&gt;
    &lt;th&gt;Visited Nodes&lt;/th&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;1&lt;/td&gt;
    &lt;td&gt;Initial setup&lt;/td&gt;
    &lt;td&gt;&lt;/td&gt;
    &lt;td&gt;&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;2&lt;/td&gt;
    &lt;td&gt;Iteration with target=0&lt;/td&gt;
    &lt;td&gt;printtree(A, 0, 0)&lt;/td&gt;
    &lt;td&gt;&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;3&lt;/td&gt;
    &lt;td&gt;Visiting A&lt;/td&gt;
    &lt;td&gt;&lt;/td&gt;
    &lt;td&gt;A&lt;/td&gt;
  &lt;/tr&gt;
&lt;/table&gt;
&lt;/div&gt;

&lt;p&gt;For each level, the &lt;code&gt;printtree&lt;/code&gt; function is called with the current node, the target level, and the current level (starting from zero). Checks if the target level is greater than the current level. If so, recursively call for each child of the current node, incrementing the level by 1. This continues until the target level equals the current level, at which point the node is printed.&lt;/p&gt;

&lt;p&gt;Iteration 2 (target = 1)&lt;/p&gt;
&lt;div class=&quot;table-container&quot;&gt;
&lt;table&gt;
  &lt;tr&gt;
    &lt;th&gt;Step&lt;/th&gt;
    &lt;th&gt;Action&lt;/th&gt;
    &lt;th&gt;Current Call Stack&lt;/th&gt;
    &lt;th&gt;Visited Nodes&lt;/th&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;4&lt;/td&gt;
    &lt;td&gt;target=1&lt;/td&gt;
    &lt;td&gt;&lt;/td&gt;
    &lt;td&gt;A&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;5&lt;/td&gt;
    &lt;td&gt;Iteration with target=1&lt;/td&gt;
    &lt;td&gt;printTree(A, 1, 0)&lt;/td&gt;
    &lt;td&gt;A&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;6&lt;/td&gt;
    &lt;td&gt;Call B&lt;/td&gt;
    &lt;td&gt;printTree(A, 1, 0) → printTree(B, 1, 1)&lt;/td&gt;
    &lt;td&gt;A&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;7&lt;/td&gt;
    &lt;td&gt;Visiting B&lt;/td&gt;
    &lt;td&gt;printTree(A, 1, 0)&lt;/td&gt;
    &lt;td&gt;A, B&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;8&lt;/td&gt;
    &lt;td&gt;Call C&lt;/td&gt;
    &lt;td&gt;printTree(A, 1, 0) → printTree(C, 1, 1)&lt;/td&gt;
    &lt;td&gt;A, B&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;9&lt;/td&gt;
    &lt;td&gt;Visiting C&lt;/td&gt;
    &lt;td&gt;printTree(A, 1, 0)&lt;/td&gt;
    &lt;td&gt;A, B, C&lt;/td&gt;
  &lt;/tr&gt;
&lt;/table&gt;
&lt;/div&gt;

&lt;p&gt;By incrementing the target level and repeating the process until all levels of the tree have been processed, nodes are printed level-by-level, leading to a breadth-first traversal.&lt;/p&gt;

&lt;p&gt;Iteration 3 (target = 2)&lt;/p&gt;
&lt;div class=&quot;table-container&quot;&gt;
&lt;table&gt;
  &lt;tr&gt;
    &lt;th&gt;Step&lt;/th&gt;
    &lt;th&gt;Action&lt;/th&gt;
    &lt;th&gt;Current Call Stack&lt;/th&gt;
    &lt;th&gt;Visited Nodes&lt;/th&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;10&lt;/td&gt;
    &lt;td&gt;target=2&lt;/td&gt;
    &lt;td&gt;&lt;/td&gt;
    &lt;td&gt;A, B, C&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;11&lt;/td&gt;
    &lt;td&gt;Iteration with target=2&lt;/td&gt;
    &lt;td&gt;printTree(A, 2, 0)&lt;/td&gt;
    &lt;td&gt;A, B, C&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;12&lt;/td&gt;
    &lt;td&gt;Call B&lt;/td&gt;
    &lt;td&gt;printTree(A, 2, 0) → printTree(B, 2, 1)&lt;/td&gt;
    &lt;td&gt;A, B, C&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;13&lt;/td&gt;
    &lt;td&gt;Call D&lt;/td&gt;
    &lt;td&gt;printTree(A, 2, 0) → printTree(B, 2, 1) → printTree(D, 2, 2)&lt;/td&gt;
    &lt;td&gt;A, B, C&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;14&lt;/td&gt;
    &lt;td&gt;Visiting D&lt;/td&gt;
    &lt;td&gt;printTree(A, 2, 0) → printTree(B, 2, 1)&lt;/td&gt;
    &lt;td&gt;A, B, C, D&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;15&lt;/td&gt;
    &lt;td&gt;Call E&lt;/td&gt;
    &lt;td&gt;printTree(A, 2, 0) → printTree(B, 2, 1) → printTree(E, 2, 2)&lt;/td&gt;
    &lt;td&gt;A, B, C, D&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;16&lt;/td&gt;
    &lt;td&gt;Visiting E&lt;/td&gt;
    &lt;td&gt;printTree(A, 2, 0) → printTree(B, 2, 1)&lt;/td&gt;
    &lt;td&gt;A, B, C, D, E&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;17&lt;/td&gt;
    &lt;td&gt;Call C&lt;/td&gt;
    &lt;td&gt;printTree(A, 2, 0) → printTree(C, 2, 1)&lt;/td&gt;
    &lt;td&gt;A, B, C, D, E&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;18&lt;/td&gt;
    &lt;td&gt;Call F&lt;/td&gt;
    &lt;td&gt;printTree(A, 2, 0) → printTree(C, 2, 1) → printTree(F, 2, 2)&lt;/td&gt;
    &lt;td&gt;A, B, C, D, E&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;19&lt;/td&gt;
    &lt;td&gt;Visiting F&lt;/td&gt;
    &lt;td&gt;printTree(A, 2, 0) → printTree(C, 2, 1)&lt;/td&gt;
    &lt;td&gt;A, B, C, D, E, F&lt;/td&gt;
  &lt;/tr&gt;
&lt;/table&gt;
&lt;/div&gt;

&lt;h3&gt;4. Recursive BFS: Implementation&lt;/h3&gt;
&lt;p&gt;Without much explanation, here&apos;s an implementation in Java. In the &lt;code&gt;Node&lt;/code&gt; class, &lt;code&gt;children&lt;/code&gt; is an array of &lt;code&gt;Node&lt;/code&gt;s, but it also works with other data structures, such as a &lt;code&gt;LinkedList&lt;/code&gt;.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;class Node {
    char data;
    Node[] children;

    Node(char data, int childCount) {
        this.data = data;
        this.children = new Node[childCount];
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;pre&gt;&lt;code&gt;public class TreeTraversal {

    // BFS subroutine
    boolean printTree(Node node, int target, int level) {
        boolean returnValue = false;
        if (target &amp;gt; level) {
            for (int i = 0; i &amp;lt; node.children.length; i++) {
                if (printTree(node.children[i], target, level + 1)) {
                    returnValue = true;
                }
            }
        } else {
            System.out.print(node.data);
            if (node.children.length &amp;gt; 0) {
                returnValue = true;
            }
        }
        return returnValue;
    }

    // BFS routine
    void printBfsTree(Node root) {
        if (root == null) return;
        int target = 0;
        while (printTree(root, target++, 0)) {
            System.out.println();
        }
    }

    public static void main(String[] args) {
        Node root = new Node(&apos;A&apos;, 2);
        root.children[0] = new Node(&apos;B&apos;, 2);
        root.children[1] = new Node(&apos;C&apos;, 1);
        root.children[0].children[0] = new Node(&apos;D&apos;, 0);
        root.children[0].children[1] = new Node(&apos;E&apos;, 0);
        root.children[1].children[0] = new Node(&apos;F&apos;, 0);

        TreeTraversal treeTraversal = new TreeTraversal();
        treeTraversal.printBfsTree(root);
    }
}
&lt;/code&gt;&lt;/pre&gt;
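&lt;p&gt;As a sanity check, here&apos;s a compact, self-contained variant of the same algorithm (the helper names &lt;code&gt;collect&lt;/code&gt; and &lt;code&gt;levels&lt;/code&gt; are illustrative) that appends each level to a &lt;code&gt;StringBuilder&lt;/code&gt; instead of printing, so the traversal order can be asserted:&lt;/p&gt;

```java
// Same recursive-BFS algorithm as above, but writing into a
// StringBuilder so the level order can be checked programmatically.
class Node {
    char data;
    Node[] children;

    Node(char data, int childCount) {
        this.data = data;
        this.children = new Node[childCount];
    }
}

public class BfsCheck {
    // Returns true if any node printed at this target level has children,
    // i.e. there is at least one more level to process.
    static boolean collect(Node node, int target, int level, StringBuilder out) {
        boolean deeper = false;
        if (target > level) {
            for (Node child : node.children) {
                if (collect(child, target, level + 1, out)) deeper = true;
            }
        } else {
            out.append(node.data);
            if (node.children.length > 0) deeper = true;
        }
        return deeper;
    }

    static String levels(Node root) {
        StringBuilder out = new StringBuilder();
        int target = 0;
        while (collect(root, target++, 0, out)) out.append('\n');
        return out.toString();
    }

    public static void main(String[] args) {
        Node root = new Node('A', 2);
        root.children[0] = new Node('B', 2);
        root.children[1] = new Node('C', 1);
        root.children[0].children[0] = new Node('D', 0);
        root.children[0].children[1] = new Node('E', 0);
        root.children[1].children[0] = new Node('F', 0);
        System.out.println(levels(root)); // prints A, BC, DEF on separate lines
    }
}
```

&lt;p&gt;For the tree above, &lt;code&gt;levels(root)&lt;/code&gt; yields &lt;code&gt;A&lt;/code&gt;, &lt;code&gt;BC&lt;/code&gt;, &lt;code&gt;DEF&lt;/code&gt;, one line per level, matching the walkthrough tables.&lt;/p&gt;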

&lt;h3&gt;5. Conclusion&lt;/h3&gt;
&lt;p&gt;The prime difference between queue-based BFS and stack-based (recursive) BFS is in space complexity: the queue-based version grows with the width of the tree (the number of nodes at the widest level), while the stack-based version grows with the depth/height of the tree.&lt;/p&gt;

&lt;p&gt;Taking an example, say we have a balanced tree with 9 levels (root node being level 1) and each node has 10 children. In the queue-based BFS solution, the number of nodes in the queue at level 9 would be &lt;code&gt;C^(N - 1)&lt;/code&gt;, where &lt;code&gt;N&lt;/code&gt; is the number of levels and &lt;code&gt;C&lt;/code&gt; is the number of children per node. For &lt;code&gt;C = 10&lt;/code&gt; and &lt;code&gt;N = 9&lt;/code&gt;, this results in &lt;code&gt;10^(9 - 1) = 10^8&lt;/code&gt;. Presuming each node reference is 4 bytes, that&apos;s &lt;span class=&quot;underline&quot;&gt;400 MB&lt;/span&gt; in the queue (at level 9).&lt;/p&gt;

&lt;p&gt;In the stack-based solution, on the other hand, the call stack can hold at most &lt;code&gt;L&lt;/code&gt; (number of levels) recursive calls, since only one call per depth level is active at a time. Realistically, each stack frame also holds a return address, local variables, saved registers, etc.; assuming a frame size of 64 bytes, the call stack occupies at most &lt;code&gt;9 × 64&lt;/code&gt; bytes = &lt;span class=&quot;underline&quot;&gt;576 bytes&lt;/span&gt;.&lt;/p&gt;

&lt;p&gt;This is a considerable space saving! For wide, shallow trees, and even more so at higher branching factors, the stack-based solution vastly outperforms queue-based BFS in space. The tradeoff is time: the recursive version re-walks the upper levels once per target level, so for an irregular or very deep tree, queue-based BFS performs better.&lt;/p&gt;
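&lt;p&gt;As a sanity check on the arithmetic, the two estimates can be computed directly (assuming, as above, 4-byte node references and 64-byte stack frames):&lt;/p&gt;

```java
public class SpaceEstimate {
    public static void main(String[] args) {
        int children = 10, levels = 9;
        // Queue-based BFS: nodes held in the queue at the deepest level.
        long queueNodes = (long) Math.pow(children, levels - 1); // 10^8 nodes
        long queueBytes = queueNodes * 4;                        // 400,000,000 bytes ≈ 400 MB
        // Recursive BFS: one active frame per depth level.
        long stackBytes = levels * 64L;                          // 576 bytes
        System.out.println(queueBytes + " bytes vs " + stackBytes + " bytes");
    }
}
```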

&lt;h3&gt;6. References&lt;/h3&gt;
&lt;pre style=&quot;max-height: 180px&quot;&gt;&lt;code&gt;[1] Pravin Kumar Sinha, &quot;Stack-based breadth-first search tree traversal,&quot; IBM Developer. [Online]. Available: https://developer.ibm.com/articles/au-aix-stack-tree-traversal.
[2] Wikipedia contributors, &quot;Breadth-first search,&quot; Wikipedia, The Free Encyclopedia. [Online]. Available: https://en.wikipedia.org/wiki/Breadth-first_search.
[3] Adesh Nalpet Adimurthy, &quot;Graph Theory: Search and Traversal,&quot; PyBlog, 2024. [Online]. Available: https://www.pyblog.xyz/graph-traversal.
&lt;/code&gt;&lt;/pre&gt;</content><author><name>Adesh Nalpet Adimurthy</name></author><category term="Code on the Road" /><category term="Graph Theory" /><category term="Data Structures" /><summary type="html">1. BFS using Queue Just in the prior post on graph traversal, we went into details of Depth-First Search (DFS) and Breadth-First Search (BFS). BFS is a way of traversing down the graph, level-by-level. Specifically for a balanced-tree, the first/root node is visited first, followed by its immediate children, then followed by the next level children, and so on. Here&apos;s the same example of BFS using a queue:</summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://pyblog.xyz/assets/featured/webp/stack-bfs.webp" /><media:content medium="image" url="https://pyblog.xyz/assets/featured/webp/stack-bfs.webp" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Graph Theory: Search and Traversal</title><link href="https://pyblog.xyz/graph-traversal" rel="alternate" type="text/html" title="Graph Theory: Search and Traversal" /><published>2024-07-17T00:00:00+00:00</published><updated>2024-07-17T00:00:00+00:00</updated><id>https://pyblog.xyz/graph-traversal</id><content type="html" xml:base="https://pyblog.xyz/graph-traversal">&lt;h3&gt;0. Graph Traversal&lt;/h3&gt;
&lt;p&gt;Breadth-First Search (&lt;a href=&quot;https://en.wikipedia.org/wiki/Breadth-first_search&quot; target=&quot;_blank&quot;&gt;BFS&lt;/a&gt;) and Depth-First Search (&lt;a href=&quot;https://en.wikipedia.org/wiki/Depth-first_search&quot; target=&quot;_blank&quot;&gt;DFS&lt;/a&gt;) are two of the most commonly used graph traversal methods.&lt;/p&gt;
&lt;p&gt;The traversal of a graph, whether BFS or DFS, involves two main concepts: visiting a node and exploring a node. Exploration refers to visiting all the children/adjacent nodes.&lt;/p&gt;
&lt;p&gt;Among BFS and DFS, Depth-First Search is more intuitive to perform, so let&apos;s first explore DFS to set a clear standpoint on what BFS is not.&lt;/p&gt;

&lt;h3&gt;1. Depth First Search&lt;/h3&gt;
&lt;p&gt;Depth-First is the process of traversing (visiting and exploring) down the graph until we get to a leaf node or a cycle (re-visiting a node that&apos;s already explored). Every time we encounter one of these conditions, we head back to the last parent node (previous level node) and explore an adjacent node (until leaf or cycle) and repeat the process.&lt;/p&gt;

&lt;p&gt;In other words: traverse through the tree by visiting all of the children, grandchildren, great-grandchildren (and so on) until the end of a path, only then traverse a level back to start a new path.&lt;/p&gt;

&lt;div class=&quot;slider&quot; id=&quot;slider1&quot;&gt;
  &lt;div class=&quot;slides center-image-0 center-image-40&quot;&gt;
    &lt;img src=&quot;./assets/posts/graph-theory/dfs-basic/dfs-basic-Page-1.svg&quot; class=&quot;slide&quot; /&gt;
    &lt;img src=&quot;./assets/posts/graph-theory/dfs-basic/dfs-basic-Page-2.svg&quot; class=&quot;slide&quot; /&gt;
    &lt;img src=&quot;./assets/posts/graph-theory/dfs-basic/dfs-basic-Page-3.svg&quot; class=&quot;slide&quot; /&gt;
    &lt;img src=&quot;./assets/posts/graph-theory/dfs-basic/dfs-basic-Page-4.svg&quot; class=&quot;slide&quot; /&gt;
    &lt;img src=&quot;./assets/posts/graph-theory/dfs-basic/dfs-basic-Page-5.svg&quot; class=&quot;slide&quot; /&gt;
    &lt;img src=&quot;./assets/posts/graph-theory/dfs-basic/dfs-basic-Page-6.svg&quot; class=&quot;slide&quot; /&gt;
    &lt;img src=&quot;./assets/posts/graph-theory/dfs-basic/dfs-basic-Page-7.svg&quot; class=&quot;slide&quot; /&gt;
    &lt;img src=&quot;./assets/posts/graph-theory/dfs-basic/dfs-basic-Page-8.svg&quot; class=&quot;slide&quot; /&gt;
  &lt;/div&gt;
  &lt;div class=&quot;controls&quot;&gt;
    &lt;button onclick=&quot;plusSlides(-1, &apos;slider1&apos;)&quot; class=&quot;prev black-button&quot;&gt;Prev&lt;/button&gt;
    &lt;button onclick=&quot;playSlides(&apos;slider1&apos;)&quot; class=&quot;play black-button&quot;&gt;Play&lt;/button&gt;
    &lt;button onclick=&quot;stopSlides(&apos;slider1&apos;)&quot; class=&quot;stop black-button&quot; hidden=&quot;&quot;&gt;Stop&lt;/button&gt;
    &lt;button onclick=&quot;plusSlides(1, &apos;slider1&apos;)&quot; class=&quot;next black-button&quot;&gt;Next&lt;/button&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;p&gt;Explanation of the above example:&lt;/p&gt;
&lt;ul class=&quot;one-line-list&quot;&gt;
&lt;li&gt;Starting with &lt;code&gt;Vertex A&lt;/code&gt; - start exploration, say, we go to &lt;code&gt;Node B&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Node A&lt;/code&gt; has two other adjacent vertices, but in DFS, we go depth-first&lt;/li&gt;
&lt;li&gt;Further exploring the visited &lt;code&gt;vertex B&lt;/code&gt;, head to &lt;code&gt;vertex C&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Cannot further explore &lt;code&gt;Node C&lt;/code&gt; as it&apos;s a leaf node - hence, &lt;code&gt;Node C&lt;/code&gt; is completely explored&lt;/li&gt;
&lt;li&gt;Head back to its parent (back-track prior level) and explore the next adjacent node, &lt;code&gt;Vertex D&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Similarly, &lt;code&gt;Vertex E&lt;/code&gt;. Now that all adjacent nodes of &lt;code&gt;Vertex B&lt;/code&gt; are already explored, head back a level again (Back to &lt;code&gt;Node A&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Visit F, explore F; head back to &lt;code&gt;Node A&lt;/code&gt;. Visit G, explore G; head back to &lt;code&gt;Node A&lt;/code&gt;. DFS is now complete&lt;/li&gt;
&lt;li&gt;Order of visiting nodes: &lt;code&gt;A, B, C, D, E, F, G&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Notice that when there is more than one unvisited adjacent node, we choose the next node to explore at random, so there are several possible paths to traverse using DFS. Defining specific rules for which node to explore next brings up new DFS strategies (in the case of trees: pre-order, in-order and post-order traversal).&lt;/p&gt;

&lt;h3&gt;1.1. DFS: Detecting Cycles&lt;/h3&gt;
&lt;p&gt;In the previous example, we saw that we &lt;a href=&quot;https://en.wikipedia.org/wiki/Backtracking&quot; target=&quot;_blank&quot;&gt;back-track&lt;/a&gt; when reaching a leaf node. This time, taking a graph as the example to cover the &quot;detecting a cycle&quot; scenario, i.e., visiting a node that was previously visited.&lt;/p&gt;
&lt;div class=&quot;slider&quot; id=&quot;slider2&quot;&gt;
  &lt;div class=&quot;slides center-image-0 center-image-60&quot;&gt;
    &lt;img src=&quot;./assets/posts/graph-theory/dfs-graph/dfs-graph-Page-1.svg&quot; class=&quot;slide&quot; /&gt;
    &lt;img src=&quot;./assets/posts/graph-theory/dfs-graph/dfs-graph-Page-2.svg&quot; class=&quot;slide&quot; /&gt;
    &lt;img src=&quot;./assets/posts/graph-theory/dfs-graph/dfs-graph-Page-3.svg&quot; class=&quot;slide&quot; /&gt;
    &lt;img src=&quot;./assets/posts/graph-theory/dfs-graph/dfs-graph-Page-4.svg&quot; class=&quot;slide&quot; /&gt;
    &lt;img src=&quot;./assets/posts/graph-theory/dfs-graph/dfs-graph-Page-5.svg&quot; class=&quot;slide&quot; /&gt;
    &lt;img src=&quot;./assets/posts/graph-theory/dfs-graph/dfs-graph-Page-6.svg&quot; class=&quot;slide&quot; /&gt;
    &lt;img src=&quot;./assets/posts/graph-theory/dfs-graph/dfs-graph-Page-7.svg&quot; class=&quot;slide&quot; /&gt;
    &lt;img src=&quot;./assets/posts/graph-theory/dfs-graph/dfs-graph-Page-8.svg&quot; class=&quot;slide&quot; /&gt;
    &lt;img src=&quot;./assets/posts/graph-theory/dfs-graph/dfs-graph-Page-9.svg&quot; class=&quot;slide&quot; /&gt;
    &lt;img src=&quot;./assets/posts/graph-theory/dfs-graph/dfs-graph-Page-10.svg&quot; class=&quot;slide&quot; /&gt;
    &lt;img src=&quot;./assets/posts/graph-theory/dfs-graph/dfs-graph-Page-11.svg&quot; class=&quot;slide&quot; /&gt;
    &lt;img src=&quot;./assets/posts/graph-theory/dfs-graph/dfs-graph-Page-12.svg&quot; class=&quot;slide&quot; /&gt;
    &lt;img src=&quot;./assets/posts/graph-theory/dfs-graph/dfs-graph-Page-13.svg&quot; class=&quot;slide&quot; /&gt;
    &lt;img src=&quot;./assets/posts/graph-theory/dfs-graph/dfs-graph-Page-14.svg&quot; class=&quot;slide&quot; /&gt;
  &lt;/div&gt;
  &lt;div class=&quot;controls&quot;&gt;
    &lt;button onclick=&quot;plusSlides(-1, &apos;slider2&apos;)&quot; class=&quot;prev black-button&quot;&gt;Prev&lt;/button&gt;
    &lt;button onclick=&quot;playSlides(&apos;slider2&apos;)&quot; class=&quot;play black-button&quot;&gt;Play&lt;/button&gt;
    &lt;button onclick=&quot;stopSlides(&apos;slider2&apos;)&quot; class=&quot;stop black-button&quot; hidden=&quot;&quot;&gt;Stop&lt;/button&gt;
    &lt;button onclick=&quot;plusSlides(1, &apos;slider2&apos;)&quot; class=&quot;next black-button&quot;&gt;Next&lt;/button&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;p&gt;I have highlighted when re-visiting &lt;code&gt;Node G&lt;/code&gt; (Slide #6), followed by back-tracking and visiting &lt;code&gt;Node J&lt;/code&gt;. Again, this is one particular Depth First Search traversal, but it can be done in many other ways by choosing a different &quot;next&quot; node to visit (at every explore step).&lt;/p&gt;

&lt;h3&gt;1.2. DFS: Implementation&lt;/h3&gt;
&lt;p&gt;The core of the solution is to find a way to back-track and head on a different path when encountering two scenarios: reaching a dead-end (leaf node) and reaching an already visited node (cycle).&lt;/p&gt;

&lt;h3&gt;1.2.1. DFS: Stack&lt;/h3&gt;
&lt;p&gt;The intuition behind using a &lt;a href=&quot;https://en.wikipedia.org/wiki/Stack_(abstract_data_type)&quot; target=&quot;_blank&quot;&gt;stack&lt;/a&gt; is that when we reach a dead-end, we want to get back to the most recently added node (LIFO: Last-In First-Out) and explore other paths. The stack lets us explore each path deeply before backtracking to the last branching node.&lt;/p&gt;

&lt;p&gt;&lt;img class=&quot;slides center-image-0&quot; style=&quot;width: 28%&quot; src=&quot;./assets/posts/graph-theory/stack-ds.svg&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Easier to understand with visualization:&lt;/p&gt;

&lt;div class=&quot;slider&quot; id=&quot;slider3&quot;&gt;
  &lt;div class=&quot;slides center-image-0 center-image-80&quot;&gt;
    &lt;img src=&quot;./assets/posts/graph-theory/dfs-stack/dfs-tree-Page-1.svg&quot; class=&quot;slide&quot; /&gt;
    &lt;img src=&quot;./assets/posts/graph-theory/dfs-stack/dfs-tree-Page-2.svg&quot; class=&quot;slide&quot; /&gt;
    &lt;img src=&quot;./assets/posts/graph-theory/dfs-stack/dfs-tree-Page-3.svg&quot; class=&quot;slide&quot; /&gt;
    &lt;img src=&quot;./assets/posts/graph-theory/dfs-stack/dfs-tree-Page-4.svg&quot; class=&quot;slide&quot; /&gt;
    &lt;img src=&quot;./assets/posts/graph-theory/dfs-stack/dfs-tree-Page-5.svg&quot; class=&quot;slide&quot; /&gt;
    &lt;img src=&quot;./assets/posts/graph-theory/dfs-stack/dfs-tree-Page-6.svg&quot; class=&quot;slide&quot; /&gt;
    &lt;img src=&quot;./assets/posts/graph-theory/dfs-stack/dfs-tree-Page-7.svg&quot; class=&quot;slide&quot; /&gt;
    &lt;img src=&quot;./assets/posts/graph-theory/dfs-stack/dfs-tree-Page-8.svg&quot; class=&quot;slide&quot; /&gt;
  &lt;/div&gt;
  &lt;div class=&quot;controls&quot;&gt;
    &lt;button onclick=&quot;plusSlides(-1, &apos;slider3&apos;)&quot; class=&quot;prev black-button&quot;&gt;Prev&lt;/button&gt;
    &lt;button onclick=&quot;playSlides(&apos;slider3&apos;)&quot; class=&quot;play black-button&quot;&gt;Play&lt;/button&gt;
    &lt;button onclick=&quot;stopSlides(&apos;slider3&apos;)&quot; class=&quot;stop black-button&quot; hidden=&quot;&quot;&gt;Stop&lt;/button&gt;
    &lt;button onclick=&quot;plusSlides(1, &apos;slider3&apos;)&quot; class=&quot;next black-button&quot;&gt;Next&lt;/button&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;p&gt;The key points to notice here are the stack &lt;code&gt;pop&lt;/code&gt; operations. On reaching &lt;code&gt;node D&lt;/code&gt;, a leaf node, &lt;code&gt;pop()&lt;/code&gt; to explore other paths, i.e., &lt;code&gt;Node E&lt;/code&gt;. Similarly, &lt;code&gt;Node E&lt;/code&gt; is a leaf node, so &lt;code&gt;pop()&lt;/code&gt; to head back and explore &lt;code&gt;Node C&lt;/code&gt;.&lt;/p&gt;

&lt;div class=&quot;table-container&quot;&gt;
&lt;table style=&quot;width: 800px;&quot;&gt;
  &lt;tr&gt;
    &lt;th&gt;Step&lt;/th&gt;
    &lt;th&gt;Action&lt;/th&gt;
    &lt;th&gt;Stack State&lt;/th&gt;
    &lt;th&gt;Visited Nodes&lt;/th&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;1&lt;/td&gt;
    &lt;td&gt;Push A&lt;/td&gt;
    &lt;td&gt;[A]&lt;/td&gt;
    &lt;td&gt;{}&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;2&lt;/td&gt;
    &lt;td&gt;Pop A, Push C, B&lt;/td&gt;
    &lt;td&gt;[C, B]&lt;/td&gt;
    &lt;td&gt;{A}&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;3&lt;/td&gt;
    &lt;td&gt;Pop B, Push E, D&lt;/td&gt;
    &lt;td&gt;[C, E, D]&lt;/td&gt;
    &lt;td&gt;{A, B}&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;4&lt;/td&gt;
    &lt;td&gt;Pop D&lt;/td&gt;
    &lt;td&gt;[C, E]&lt;/td&gt;
    &lt;td&gt;{A, B, D}&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;5&lt;/td&gt;
    &lt;td&gt;Pop E&lt;/td&gt;
    &lt;td&gt;[C]&lt;/td&gt;
    &lt;td&gt;{A, B, D, E}&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;6&lt;/td&gt;
    &lt;td&gt;Pop C, Push G, F&lt;/td&gt;
    &lt;td&gt;[G, F]&lt;/td&gt;
    &lt;td&gt;{A, B, D, E, C}&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;7&lt;/td&gt;
    &lt;td&gt;Pop F&lt;/td&gt;
    &lt;td&gt;[G]&lt;/td&gt;
    &lt;td&gt;{A, B, D, E, C, F}&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;8&lt;/td&gt;
    &lt;td&gt;Pop G&lt;/td&gt;
    &lt;td&gt;[]&lt;/td&gt;
    &lt;td&gt;{A, B, D, E, C, F, G}&lt;/td&gt;
  &lt;/tr&gt;
&lt;/table&gt;
&lt;/div&gt;

&lt;p&gt;Note: When visiting a node, add all adjacent nodes to the stack to ensure all possible paths from the current node are explored. This is essential for DFS to correctly traverse the entire graph.&lt;/p&gt;

&lt;p&gt;Pseudo Code: wrapping it all up in a dozen lines&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;DFS-Iterative(graph, start):
    let stack be a stack
    let visited be a set
    stack.push(start)
    
    while stack is not empty:
        node = stack.pop()
        if node is not in visited:
            visit(node)
            visited.add(node)
            for each neighbor of node in graph (Optional: reverse order):
                if neighbor is not in visited:
                    stack.push(neighbor)
&lt;/code&gt;&lt;/pre&gt;
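&lt;p&gt;The pseudo code translates almost line-for-line into Java. A minimal sketch, assuming the graph is stored as an adjacency-list &lt;code&gt;Map&lt;/code&gt; (the representation and names are illustrative; the example graph encodes the tree from the table):&lt;/p&gt;

```java
import java.util.*;

public class DfsIterative {
    // Iterative DFS: pop a node, visit it if unseen, push its unvisited neighbors.
    static List<Character> dfs(Map<Character, List<Character>> graph, char start) {
        List<Character> order = new ArrayList<>();
        Deque<Character> stack = new ArrayDeque<>();
        Set<Character> visited = new HashSet<>();
        stack.push(start);
        while (!stack.isEmpty()) {
            char node = stack.pop();
            if (visited.add(node)) {          // add() returns false if already visited
                order.add(node);
                for (char neighbor : graph.getOrDefault(node, List.of())) {
                    if (!visited.contains(neighbor)) stack.push(neighbor);
                }
            }
        }
        return order;
    }

    public static void main(String[] args) {
        Map<Character, List<Character>> graph = new HashMap<>();
        graph.put('A', List.of('B', 'C'));
        graph.put('B', List.of('D', 'E'));
        graph.put('C', List.of('F', 'G'));
        // Pushing neighbors in listed order means the LAST child is explored first.
        System.out.println(dfs(graph, 'A')); // [A, C, G, F, B, E, D]
    }
}
```

&lt;p&gt;Since the last-pushed neighbor is popped first, this explores the last child first; to reproduce the table&apos;s order (&lt;code&gt;A, B, D, E, C, F, G&lt;/code&gt;), push the neighbors in reverse, as the pseudo code&apos;s optional step notes.&lt;/p&gt;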

&lt;h3&gt;1.2.2. DFS: Recursion&lt;/h3&gt;
&lt;p&gt;The &lt;a href=&quot;https://en.wikipedia.org/wiki/Recursion&quot; target=&quot;_blank&quot;&gt;recursion&lt;/a&gt; solution is quite similar to the above stack solution, where we rely on the call stack as opposed to a user-defined stack.&lt;/p&gt;

&lt;div class=&quot;slider&quot; id=&quot;slider4&quot;&gt;
  &lt;div class=&quot;slides center-image-0 center-image-80&quot;&gt;
    &lt;img src=&quot;./assets/posts/graph-theory/dfs-call-stack/dfs-call-stack-Page-1.svg&quot; class=&quot;slide&quot; /&gt;
    &lt;img src=&quot;./assets/posts/graph-theory/dfs-call-stack/dfs-call-stack-Page-2.svg&quot; class=&quot;slide&quot; /&gt;
    &lt;img src=&quot;./assets/posts/graph-theory/dfs-call-stack/dfs-call-stack-Page-3.svg&quot; class=&quot;slide&quot; /&gt;
    &lt;img src=&quot;./assets/posts/graph-theory/dfs-call-stack/dfs-call-stack-Page-4.svg&quot; class=&quot;slide&quot; /&gt;
    &lt;img src=&quot;./assets/posts/graph-theory/dfs-call-stack/dfs-call-stack-Page-5.svg&quot; class=&quot;slide&quot; /&gt;
    &lt;img src=&quot;./assets/posts/graph-theory/dfs-call-stack/dfs-call-stack-Page-6.svg&quot; class=&quot;slide&quot; /&gt;
    &lt;img src=&quot;./assets/posts/graph-theory/dfs-call-stack/dfs-call-stack-Page-7.svg&quot; class=&quot;slide&quot; /&gt;
    &lt;img src=&quot;./assets/posts/graph-theory/dfs-call-stack/dfs-call-stack-Page-8.svg&quot; class=&quot;slide&quot; /&gt;
    &lt;img src=&quot;./assets/posts/graph-theory/dfs-call-stack/dfs-call-stack-Page-9.svg&quot; class=&quot;slide&quot; /&gt;
    &lt;img src=&quot;./assets/posts/graph-theory/dfs-call-stack/dfs-call-stack-Page-10.svg&quot; class=&quot;slide&quot; /&gt;
    &lt;img src=&quot;./assets/posts/graph-theory/dfs-call-stack/dfs-call-stack-Page-11.svg&quot; class=&quot;slide&quot; /&gt;
    &lt;img src=&quot;./assets/posts/graph-theory/dfs-call-stack/dfs-call-stack-Page-12.svg&quot; class=&quot;slide&quot; /&gt;
  &lt;/div&gt;
  &lt;div class=&quot;controls&quot;&gt;
    &lt;button onclick=&quot;plusSlides(-1, &apos;slider4&apos;)&quot; class=&quot;prev black-button&quot;&gt;Prev&lt;/button&gt;
    &lt;button onclick=&quot;playSlides(&apos;slider4&apos;)&quot; class=&quot;play black-button&quot;&gt;Play&lt;/button&gt;
    &lt;button onclick=&quot;stopSlides(&apos;slider4&apos;)&quot; class=&quot;stop black-button&quot; hidden=&quot;&quot;&gt;Stop&lt;/button&gt;
    &lt;button onclick=&quot;plusSlides(1, &apos;slider4&apos;)&quot; class=&quot;next black-button&quot;&gt;Next&lt;/button&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;p&gt;There&apos;s a small difference (in traversal order). In the recursive solution, you handle each node when you see it. Thus, the first node you handle is the first child.&lt;/p&gt;

&lt;p&gt;Whereas in an iterative approach, you first insert all the elements into the stack and then handle the head of the stack (which is the last node inserted). Thus, the first node you handle is the last child.&lt;/p&gt;

&lt;div class=&quot;table-container&quot;&gt;
&lt;table style=&quot;width: 800px;&quot;&gt;
  &lt;tr&gt;
    &lt;th&gt;Step&lt;/th&gt;
    &lt;th&gt;Action&lt;/th&gt;
    &lt;th&gt;Call Stack State&lt;/th&gt;
    &lt;th&gt;Visited Nodes&lt;/th&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;1&lt;/td&gt;
    &lt;td&gt;Call on A&lt;/td&gt;
    &lt;td&gt;[A]&lt;/td&gt;
    &lt;td&gt;{}&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;2&lt;/td&gt;
    &lt;td&gt;Visit A, Call on B&lt;/td&gt;
    &lt;td&gt;[A, B]&lt;/td&gt;
    &lt;td&gt;{A}&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;3&lt;/td&gt;
    &lt;td&gt;Visit B, Call on D&lt;/td&gt;
    &lt;td&gt;[A, B, D]&lt;/td&gt;
    &lt;td&gt;{A, B}&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;4&lt;/td&gt;
    &lt;td&gt;Visit D, Return from D&lt;/td&gt;
    &lt;td&gt;[A, B]&lt;/td&gt;
    &lt;td&gt;{A, B, D}&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;5&lt;/td&gt;
    &lt;td&gt;Call on E&lt;/td&gt;
    &lt;td&gt;[A, B, E]&lt;/td&gt;
    &lt;td&gt;{A, B, D}&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;6&lt;/td&gt;
    &lt;td&gt;Visit E, Return from E&lt;/td&gt;
    &lt;td&gt;[A, B]&lt;/td&gt;
    &lt;td&gt;{A, B, D, E}&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;7&lt;/td&gt;
    &lt;td&gt;Return from B, Call on C&lt;/td&gt;
    &lt;td&gt;[A, C]&lt;/td&gt;
    &lt;td&gt;{A, B, D, E}&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;8&lt;/td&gt;
    &lt;td&gt;Visit C, Call on F&lt;/td&gt;
    &lt;td&gt;[A, C, F]&lt;/td&gt;
    &lt;td&gt;{A, B, D, E, C}&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;9&lt;/td&gt;
    &lt;td&gt;Visit F, Return from F&lt;/td&gt;
    &lt;td&gt;[A, C]&lt;/td&gt;
    &lt;td&gt;{A, B, D, E, C, F}&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;10&lt;/td&gt;
    &lt;td&gt;Call on G&lt;/td&gt;
    &lt;td&gt;[A, C, G]&lt;/td&gt;
    &lt;td&gt;{A, B, D, E, C, F}&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;11&lt;/td&gt;
    &lt;td&gt;Visit G, Return from G&lt;/td&gt;
    &lt;td&gt;[A, C]&lt;/td&gt;
    &lt;td&gt;{A, B, D, E, C, F, G}&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;12&lt;/td&gt;
    &lt;td&gt;Return from C&lt;/td&gt;
    &lt;td&gt;[A]&lt;/td&gt;
    &lt;td&gt;{A, B, D, E, C, F, G}&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;13&lt;/td&gt;
    &lt;td&gt;Return from A&lt;/td&gt;
    &lt;td&gt;[]&lt;/td&gt;
    &lt;td&gt;{A, B, D, E, C, F, G}&lt;/td&gt;
  &lt;/tr&gt;
&lt;/table&gt;
&lt;/div&gt;

&lt;p&gt;Pseudo Code: now down to 5 lines of code&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;DFS-Recursive(node, visited):
    if node is not in visited:
        visit(node)
        visited.add(node)
        for each neighbor of node:
            DFS-Recursive(neighbor, visited)
&lt;/code&gt;&lt;/pre&gt;
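&lt;p&gt;A minimal Java sketch of the recursive version, again assuming the graph is stored as an adjacency-list &lt;code&gt;Map&lt;/code&gt; (representation and names are illustrative):&lt;/p&gt;

```java
import java.util.*;

public class DfsRecursive {
    // Recursive DFS: visit the node, then recurse into each unvisited neighbor.
    static void dfs(Map<Character, List<Character>> graph, char node,
                    Set<Character> visited, List<Character> order) {
        if (visited.add(node)) {              // add() returns false if already visited
            order.add(node);
            for (char neighbor : graph.getOrDefault(node, List.of())) {
                dfs(graph, neighbor, visited, order);
            }
        }
    }

    public static void main(String[] args) {
        Map<Character, List<Character>> graph = new HashMap<>();
        graph.put('A', List.of('B', 'C'));
        graph.put('B', List.of('D', 'E'));
        graph.put('C', List.of('F', 'G'));
        List<Character> order = new ArrayList<>();
        dfs(graph, 'A', new HashSet<>(), order);
        System.out.println(order); // [A, B, D, E, C, F, G] — matches the call-stack table
    }
}
```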

&lt;p&gt;Note: if you want the user-defined stack solution to yield the same result as the recursive solution, you need to add elements to the stack in reverse order. For each node, insert its last child first and its first child last.&lt;/p&gt;

&lt;h3&gt;2. Breadth First Search&lt;/h3&gt;

&lt;p&gt;Also called Level Order Search. Compared to DFS, BFS explores level-by-level (in layers): start with a node and visit each of its adjacent nodes (without diving down to a leaf), repeating until all adjacent nodes are visited; then choose one of those children, move a level down, and visit all of its adjacent nodes; repeat the process.&lt;/p&gt;

&lt;p&gt;In other words: traverse through one entire level of children nodes first before moving on to traverse through the grandchildren nodes. Repeat: traverse through an entire level of grandchildren nodes before going on to traverse through great-grandchildren nodes.&lt;/p&gt;

&lt;div class=&quot;slider&quot; id=&quot;slider5&quot;&gt;
  &lt;div class=&quot;slides center-image-0 center-image-40&quot;&gt;
    &lt;img src=&quot;./assets/posts/graph-theory/bfs-basic/bfs-basic-Page-1.svg&quot; class=&quot;slide&quot; /&gt;
    &lt;img src=&quot;./assets/posts/graph-theory/bfs-basic/bfs-basic-Page-2.svg&quot; class=&quot;slide&quot; /&gt;
    &lt;img src=&quot;./assets/posts/graph-theory/bfs-basic/bfs-basic-Page-3.svg&quot; class=&quot;slide&quot; /&gt;
    &lt;img src=&quot;./assets/posts/graph-theory/bfs-basic/bfs-basic-Page-4.svg&quot; class=&quot;slide&quot; /&gt;
    &lt;img src=&quot;./assets/posts/graph-theory/bfs-basic/bfs-basic-Page-5.svg&quot; class=&quot;slide&quot; /&gt;
    &lt;img src=&quot;./assets/posts/graph-theory/bfs-basic/bfs-basic-Page-6.svg&quot; class=&quot;slide&quot; /&gt;
    &lt;img src=&quot;./assets/posts/graph-theory/bfs-basic/bfs-basic-Page-7.svg&quot; class=&quot;slide&quot; /&gt;
    &lt;img src=&quot;./assets/posts/graph-theory/bfs-basic/bfs-basic-Page-8.svg&quot; class=&quot;slide&quot; /&gt;
  &lt;/div&gt;
  &lt;div class=&quot;controls&quot;&gt;
    &lt;button onclick=&quot;plusSlides(-1, &apos;slider5&apos;)&quot; class=&quot;prev black-button&quot;&gt;Prev&lt;/button&gt;
    &lt;button onclick=&quot;playSlides(&apos;slider5&apos;)&quot; class=&quot;play black-button&quot;&gt;Play&lt;/button&gt;
    &lt;button onclick=&quot;stopSlides(&apos;slider5&apos;)&quot; class=&quot;stop black-button&quot; hidden=&quot;&quot;&gt;Stop&lt;/button&gt;
    &lt;button onclick=&quot;plusSlides(1, &apos;slider5&apos;)&quot; class=&quot;next black-button&quot;&gt;Next&lt;/button&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;p&gt;Explanation of the above example:&lt;/p&gt;
&lt;ul class=&quot;one-line-list&quot;&gt;
&lt;li&gt;Starting with &lt;code&gt;vertex A&lt;/code&gt; (Visit A) - start exploration of all adjacent vertices.&lt;/li&gt;
&lt;li&gt;Explore adjacent nodes in any order, in this case: &lt;code&gt;Node B&lt;/code&gt;, followed by &lt;code&gt;F&lt;/code&gt; and &lt;code&gt;G&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Cannot explore any further, as all adjacent nodes/children are visited.&lt;/li&gt;
&lt;li&gt;Explore any one of the children, say &lt;code&gt;Node B&lt;/code&gt;, and visit all the adjacent nodes of &lt;code&gt;B&lt;/code&gt;: &lt;code&gt;E, C, and D&lt;/code&gt; (in any order).&lt;/li&gt;
&lt;li&gt;Again, cannot explore further, as all children are visited.&lt;/li&gt;
&lt;li&gt;Similar to &lt;code&gt;Node B&lt;/code&gt;, explore &lt;code&gt;Node G&lt;/code&gt; and &lt;code&gt;F&lt;/code&gt; (nothing to explore). BFS is now complete.&lt;/li&gt;
&lt;li&gt;Order of visiting nodes: &lt;code&gt;A, B, F, G, E, C, D&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;2.1 BFS: Implementation&lt;/h3&gt;

&lt;p&gt;Similar to DFS, we need to know if a node is &quot;visited&quot; in order to prevent cycles, i.e., re-visiting a node. Typically, BFS is implemented using a &lt;a href=&quot;https://en.wikipedia.org/wiki/Queue_(abstract_data_type)&quot; target=&quot;_blank&quot;&gt;queue&lt;/a&gt; (FIFO: First-In First-Out) data structure. I wouldn&apos;t necessarily say that it&apos;s impossible to solve it with a stack, but it&apos;s definitely not conventional and introduces complexity.&lt;/p&gt;

&lt;p&gt;&lt;img class=&quot;slides center-image-0 center-image-35&quot; src=&quot;./assets/posts/graph-theory/queue-ds.svg&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Fun Fact: in the worst-case scenario (for Trees), a stack-based BFS performs better than a queue-based BFS. I&apos;ll explain more on this in a different post dedicated to Trees.&lt;/p&gt;

&lt;div class=&quot;slider&quot; id=&quot;slider6&quot;&gt;
  &lt;div class=&quot;slides center-image-0 center-image-80&quot;&gt;
    &lt;img src=&quot;./assets/posts/graph-theory/bfs-tree/bfs-tree-Page-1.svg&quot; class=&quot;slide&quot; /&gt;
    &lt;img src=&quot;./assets/posts/graph-theory/bfs-tree/bfs-tree-Page-2.svg&quot; class=&quot;slide&quot; /&gt;
    &lt;img src=&quot;./assets/posts/graph-theory/bfs-tree/bfs-tree-Page-3.svg&quot; class=&quot;slide&quot; /&gt;
    &lt;img src=&quot;./assets/posts/graph-theory/bfs-tree/bfs-tree-Page-4.svg&quot; class=&quot;slide&quot; /&gt;
    &lt;img src=&quot;./assets/posts/graph-theory/bfs-tree/bfs-tree-Page-5.svg&quot; class=&quot;slide&quot; /&gt;
    &lt;img src=&quot;./assets/posts/graph-theory/bfs-tree/bfs-tree-Page-6.svg&quot; class=&quot;slide&quot; /&gt;
    &lt;img src=&quot;./assets/posts/graph-theory/bfs-tree/bfs-tree-Page-7.svg&quot; class=&quot;slide&quot; /&gt;
    &lt;img src=&quot;./assets/posts/graph-theory/bfs-tree/bfs-tree-Page-8.svg&quot; class=&quot;slide&quot; /&gt;
  &lt;/div&gt;
  &lt;div class=&quot;controls&quot;&gt;
    &lt;button onclick=&quot;plusSlides(-1, &apos;slider6&apos;)&quot; class=&quot;prev black-button&quot;&gt;Prev&lt;/button&gt;
    &lt;button onclick=&quot;playSlides(&apos;slider6&apos;)&quot; class=&quot;play black-button&quot;&gt;Play&lt;/button&gt;
    &lt;button onclick=&quot;stopSlides(&apos;slider6&apos;)&quot; class=&quot;stop black-button&quot; hidden=&quot;&quot;&gt;Stop&lt;/button&gt;
    &lt;button onclick=&quot;plusSlides(1, &apos;slider6&apos;)&quot; class=&quot;next black-button&quot;&gt;Next&lt;/button&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;p&gt;One important observation in BFS is that we add nodes that we have discovered but not yet visited to the queue, and come back to (visit) them later.
With the source node (or root node) in the queue, the process is to visit a node (dequeue), add all the children/adjacent nodes to the queue (enqueue), and repeat the process.&lt;/p&gt;

&lt;div class=&quot;table-container&quot;&gt;
&lt;table style=&quot;width: 800px;&quot;&gt;
  &lt;tr&gt;
    &lt;th&gt;Step&lt;/th&gt;
    &lt;th&gt;Action&lt;/th&gt;
    &lt;th&gt;Queue State&lt;/th&gt;
    &lt;th&gt;Visited Nodes&lt;/th&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;1&lt;/td&gt;
    &lt;td&gt;Enqueue A&lt;/td&gt;
    &lt;td&gt;[A]&lt;/td&gt;
    &lt;td&gt;{}&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;2&lt;/td&gt;
    &lt;td&gt;Dequeue A, Enqueue B, C&lt;/td&gt;
    &lt;td&gt;[B, C]&lt;/td&gt;
    &lt;td&gt;{A}&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;3&lt;/td&gt;
    &lt;td&gt;Dequeue B, Enqueue D, E&lt;/td&gt;
    &lt;td&gt;[C, D, E]&lt;/td&gt;
    &lt;td&gt;{A, B}&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;4&lt;/td&gt;
    &lt;td&gt;Dequeue C, Enqueue F, G&lt;/td&gt;
    &lt;td&gt;[D, E, F, G]&lt;/td&gt;
    &lt;td&gt;{A, B, C}&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;5&lt;/td&gt;
    &lt;td&gt;Dequeue D&lt;/td&gt;
    &lt;td&gt;[E, F, G]&lt;/td&gt;
    &lt;td&gt;{A, B, C, D}&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;6&lt;/td&gt;
    &lt;td&gt;Dequeue E&lt;/td&gt;
    &lt;td&gt;[F, G]&lt;/td&gt;
    &lt;td&gt;{A, B, C, D, E}&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;7&lt;/td&gt;
    &lt;td&gt;Dequeue F&lt;/td&gt;
    &lt;td&gt;[G]&lt;/td&gt;
    &lt;td&gt;{A, B, C, D, E, F}&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;8&lt;/td&gt;
    &lt;td&gt;Dequeue G&lt;/td&gt;
    &lt;td&gt;[]&lt;/td&gt;
    &lt;td&gt;{A, B, C, D, E, F, G}&lt;/td&gt;
  &lt;/tr&gt;
&lt;/table&gt;
&lt;/div&gt;

&lt;p&gt;The intuition to hold on to: queues follow the first-in, first-out (FIFO) principle, meaning whatever was enqueued first is the first item read and removed from the queue. Nodes are therefore visited in the order they were discovered, level by level.&lt;/p&gt;

&lt;p&gt;Pseudo Code:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;BFS(graph, start):
    let queue be a queue
    let visited be a set
    queue.enqueue(start)
    
    while queue is not empty:
        node = queue.dequeue()
        if node is not in visited:
            visit(node)
            visited.add(node)
            for each neighbor of node in graph:
                if neighbor is not in visited:
                    queue.enqueue(neighbor)
&lt;/code&gt;&lt;/pre&gt;
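&lt;p&gt;The pseudocode maps almost line-for-line onto Python. Here is a minimal, runnable sketch using &lt;code&gt;collections.deque&lt;/code&gt; as the queue; the example graph is the tree from the table above, stored as an (illustrative) dict of adjacency lists:&lt;/p&gt;

```python
from collections import deque

def bfs(graph, start):
    """Breadth-first traversal; returns nodes in visit order."""
    queue = deque([start])
    visited = set()
    order = []
    while queue:
        node = queue.popleft()              # dequeue
        if node not in visited:
            order.append(node)              # "visit" the node
            visited.add(node)
            for neighbor in graph[node]:    # "explore": enqueue discovered nodes
                if neighbor not in visited:
                    queue.append(neighbor)
    return order

# The tree from the table above: A has children B, C; B has D, E; C has F, G.
graph = {"A": ["B", "C"], "B": ["D", "E"], "C": ["F", "G"],
         "D": [], "E": [], "F": [], "G": []}
print(bfs(graph, "A"))  # ['A', 'B', 'C', 'D', 'E', 'F', 'G']
```

&lt;p&gt;Note that &lt;code&gt;deque.popleft()&lt;/code&gt; is &lt;code&gt;O(1)&lt;/code&gt;, whereas &lt;code&gt;list.pop(0)&lt;/code&gt; is &lt;code&gt;O(n)&lt;/code&gt;, which is why &lt;code&gt;deque&lt;/code&gt; is the conventional queue in Python.&lt;/p&gt;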

&lt;p&gt;I hate to be the person who uses a tree to explain a graph. Reminds me of the physics class at school, where the lectures and exams are miles apart! So, here is the visualization of BFS for a graph:&lt;/p&gt;

&lt;div class=&quot;slider&quot; id=&quot;slider7&quot;&gt;
  &lt;div class=&quot;slides center-image-0 center-image-90&quot;&gt;
    &lt;img src=&quot;./assets/posts/graph-theory/bfs-queue/bfs-queue-Page-1.svg&quot; class=&quot;slide&quot; /&gt;
    &lt;img src=&quot;./assets/posts/graph-theory/bfs-queue/bfs-queue-Page-2.svg&quot; class=&quot;slide&quot; /&gt;
    &lt;img src=&quot;./assets/posts/graph-theory/bfs-queue/bfs-queue-Page-3.svg&quot; class=&quot;slide&quot; /&gt;
    &lt;img src=&quot;./assets/posts/graph-theory/bfs-queue/bfs-queue-Page-4.svg&quot; class=&quot;slide&quot; /&gt;
    &lt;img src=&quot;./assets/posts/graph-theory/bfs-queue/bfs-queue-Page-5.svg&quot; class=&quot;slide&quot; /&gt;
    &lt;img src=&quot;./assets/posts/graph-theory/bfs-queue/bfs-queue-Page-6.svg&quot; class=&quot;slide&quot; /&gt;
    &lt;img src=&quot;./assets/posts/graph-theory/bfs-queue/bfs-queue-Page-7.svg&quot; class=&quot;slide&quot; /&gt;
    &lt;img src=&quot;./assets/posts/graph-theory/bfs-queue/bfs-queue-Page-8.svg&quot; class=&quot;slide&quot; /&gt;
    &lt;img src=&quot;./assets/posts/graph-theory/bfs-queue/bfs-queue-Page-9.svg&quot; class=&quot;slide&quot; /&gt;
    &lt;img src=&quot;./assets/posts/graph-theory/bfs-queue/bfs-queue-Page-10.svg&quot; class=&quot;slide&quot; /&gt;
    &lt;img src=&quot;./assets/posts/graph-theory/bfs-queue/bfs-queue-Page-11.svg&quot; class=&quot;slide&quot; /&gt;
    &lt;img src=&quot;./assets/posts/graph-theory/bfs-queue/bfs-queue-Page-12.svg&quot; class=&quot;slide&quot; /&gt;
    &lt;img src=&quot;./assets/posts/graph-theory/bfs-queue/bfs-queue-Page-13.svg&quot; class=&quot;slide&quot; /&gt;
    &lt;img src=&quot;./assets/posts/graph-theory/bfs-queue/bfs-queue-Page-14.svg&quot; class=&quot;slide&quot; /&gt;
  &lt;/div&gt;
  &lt;div class=&quot;controls&quot;&gt;
    &lt;button onclick=&quot;plusSlides(-1, &apos;slider7&apos;)&quot; class=&quot;prev black-button&quot;&gt;Prev&lt;/button&gt;
    &lt;button onclick=&quot;playSlides(&apos;slider7&apos;)&quot; class=&quot;play black-button&quot;&gt;Play&lt;/button&gt;
    &lt;button onclick=&quot;stopSlides(&apos;slider7&apos;)&quot; class=&quot;stop black-button&quot; hidden=&quot;&quot;&gt;Stop&lt;/button&gt;
    &lt;button onclick=&quot;plusSlides(1, &apos;slider7&apos;)&quot; class=&quot;next black-button&quot;&gt;Next&lt;/button&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;p&gt;In the Breadth-First Search (BFS) for a graph, the same element might be added to the queue multiple times in the presence of cycles (i.e., the same node can be discovered from multiple neighbors). However, it will be ignored later by the visited check. In the above graph BFS visualization, I have skipped adding the same element into the queue and indicated it with arrows (from the other node(s)) instead.&lt;/p&gt;
&lt;p&gt;Duplicate entries can be prevented by searching the entire queue before every enqueue (increasing time complexity), by using another hash set to track already-enqueued nodes (increasing space complexity), or slightly optimized with tail checks.&lt;/p&gt;
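&lt;p&gt;The cheapest of these fixes, sketched below, marks a node as seen at enqueue time instead of dequeue time, so the same node can never sit in the queue twice (same dict-of-lists graph shape as before; the example graph with a cycle is illustrative):&lt;/p&gt;

```python
from collections import deque

def bfs_no_duplicates(graph, start):
    """BFS that never enqueues the same node twice."""
    seen = {start}                      # marked when enqueued, not when dequeued
    queue = deque([start])
    order = []
    while queue:
        node = queue.popleft()
        order.append(node)
        for neighbor in graph[node]:
            if neighbor not in seen:
                seen.add(neighbor)      # mark immediately: no duplicate entries
                queue.append(neighbor)
    return order

# An undirected graph with a cycle: A-B, A-C, B-C, plus C-D.
graph = {"A": ["B", "C"], "B": ["A", "C"], "C": ["A", "B", "D"], "D": ["C"]}
print(bfs_no_duplicates(graph, "A"))  # ['A', 'B', 'C', 'D']
```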

&lt;h3&gt;3. Conclusion&lt;/h3&gt;
&lt;p&gt;Both Breadth-First Search (BFS) and Depth-First Search (DFS) have a lot of applications and come up way too often when dealing with graphs.&lt;/p&gt;
&lt;p&gt;BFS is the first that pops up when finding the shortest path in an unweighted graph. DFS has tons of use cases: computing a spanning tree of a graph, detecting cycles, checking if a graph is bipartite, finding bridges, articulation points, and strongly connected components, topologically sorting a graph, and many more. For plain traversal and reachability problems, BFS and DFS can often be used interchangeably.&lt;/p&gt;

&lt;h3&gt;4. References&lt;/h3&gt;
&lt;pre style=&quot;max-height: 180px&quot;&gt;&lt;code&gt;[1] Wikipedia contributors, &quot;Depth-first search,&quot; Wikipedia, The Free Encyclopedia. [Online]. Available: https://en.wikipedia.org/wiki/Depth-first_search.
[2] Wikipedia contributors, &quot;Breadth-first search,&quot; Wikipedia, The Free Encyclopedia. [Online]. Available: https://en.wikipedia.org/wiki/Breadth-first_search.
[3] Abdul Bari, &quot;Graph Traversals - BFS &amp;amp; DFS -Breadth First Search and Depth First Search,&quot; YouTube. [Online]. Available: https://youtu.be/pcKY4hjDrxk.
[4] Pravin Kumar Sinha, &quot;Stack-based breadth-first search tree traversal,&quot; IBM Developer. [Online]. Available: https://developer.ibm.com/articles/au-aix-stack-tree-traversal/.
[5] W. Fiset, &quot;Algorithms repository,&quot; GitHub, 2017. [Online]. Available: https://github.com/williamfiset/Algorithms.
&lt;/code&gt;&lt;/pre&gt;</content><author><name>Adesh Nalpet Adimurthy</name></author><category term="Code on the Road" /><category term="Graph Theory" /><category term="Data Structures" /><summary type="html">0. Graph Traversal Breadth-First Search (BFS) and Depth-First Search (DFS) are two of the most commonly used graph traversal methods. The traversal of a graph, whether BFS or DFS, involves two main concepts: visiting a node and exploring a node. Exploration refers to visiting all the children/adjacent nodes. Among BFS and DFS, Depth-First Search is more intuitive to perform, so let&apos;s first explore DFS to set a clear standpoint on what BFS is not.</summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://pyblog.xyz/assets/featured/webp/graph-theory-search.webp" /><media:content medium="image" url="https://pyblog.xyz/assets/featured/webp/graph-theory-search.webp" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Graph Theory: Introduction</title><link href="https://pyblog.xyz/graph-theory-introduction" rel="alternate" type="text/html" title="Graph Theory: Introduction" /><published>2024-07-14T00:00:00+00:00</published><updated>2024-07-14T00:00:00+00:00</updated><id>https://pyblog.xyz/graph-theory-introduction</id><content type="html" xml:base="https://pyblog.xyz/graph-theory-introduction">&lt;p&gt;Before heading into details of how we store, represent, and traverse various kinds of graphs, this post is more of a ramp-up to better understand what graphs are and the different kinds from a computer science point of view, rather than a mathematical one. So, no proofs and equations, mostly just diagrams and implementation details, with an emphasis on how to apply graph theory to real-world applications.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://en.wikipedia.org/wiki/Graph_theory&quot; target=&quot;_blank&quot;&gt;Graph theory&lt;/a&gt; is the mathematical theory of the properties and applications of graphs/networks, which is just a collection of objects that are all interconnected.&lt;/p&gt;

&lt;p&gt;&lt;img class=&quot;center-image-0 center-image-60&quot; src=&quot;./assets/posts/graph-theory/gt-wardrobe.svg&quot; /&gt;&lt;/p&gt;
&lt;p&gt;Graph theory is a broad enough topic to say it can be applied to almost any problem. Take the first (maybe not first, make it 21st) thing in the morning: choosing what to wear. Given the entire wardrobe, how many sets of clothes can I make by choosing one item from each category (by category, I mean tops, bottoms, shoes, hats, and glasses)? While this sounds like a counting problem, using graphs to visualize each clothing item as a node and edges to represent the relationships between them can be helpful.&lt;/p&gt;

&lt;p&gt;&lt;img class=&quot;center-image-0 center-image-70&quot; src=&quot;./assets/posts/graph-theory/gt-social-network.svg&quot; /&gt;&lt;/p&gt;
&lt;p&gt;Another everyday example is the social network. A graph representation answers questions such as how many mutual friends or how many degrees of separation exist between two people.&lt;/p&gt;

&lt;h3&gt;1. Types of Graphs&lt;/h3&gt;

&lt;p&gt;There are a lot of types of graphs, and it&apos;s important to understand the kind of graph you are dealing with. Let&apos;s go over the most commonly known graph variants.&lt;/p&gt;

&lt;h3&gt;1.1. Undirected Graph&lt;/h3&gt;
&lt;p&gt;The simplest kind of graph, where the edges have no orientation (&lt;a href=&quot;https://en.wikipedia.org/wiki/Bidirected_graph&quot; target=&quot;_blank&quot;&gt;bi-directional&lt;/a&gt;), i.e., edge &lt;code&gt;(u, v)&lt;/code&gt; is identical to edge &lt;code&gt;(v, u)&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img class=&quot;center-image-0 center-image-50&quot; src=&quot;./assets/posts/graph-theory/gt-undirected.svg&quot; /&gt;&lt;/p&gt;
&lt;p&gt;Example: Cities interconnected by bi-directional roads. You can drive from one city to another and retrace the same path back.&lt;/p&gt;

&lt;h3&gt;1.2. Directed Graph/Digraph&lt;/h3&gt;
&lt;p&gt;In contrast to an undirected graph, &lt;a href=&quot;https://en.wikipedia.org/wiki/Directed_graph&quot; target=&quot;_blank&quot;&gt;directed graphs&lt;/a&gt; or digraphs have edges that are directed/have orientation. Edge &lt;code&gt;(u, v)&lt;/code&gt; represents that you can only go from node u to node v and not the other way around. As shown in the figure below, the edges are directed, indicated by the arrowheads on the edges between nodes.&lt;/p&gt;

&lt;p&gt;&lt;img class=&quot;center-image-0 center-image-55&quot; src=&quot;./assets/posts/graph-theory/gt-directed.svg&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Example: This graph could represent people who bought each other gifts. C and D got gifts for each other, E didn&apos;t get any nor give any, B got one from A, gave a gift to D, and sent a gift to itself.&lt;/p&gt;

&lt;h3&gt;1.3. Weighted Graphs&lt;/h3&gt;
&lt;p&gt;So far, we have seen unweighted graphs, but edges on graphs can contain weights to represent arbitrary values such as distance, cost, quantity, etc.&lt;/p&gt;
&lt;p&gt;&lt;img class=&quot;center-image-0 center-image-50&quot; src=&quot;./assets/posts/graph-theory/gt-weighted.svg&quot; /&gt;&lt;/p&gt;
&lt;p&gt;Weighted graphs can again be directed or undirected. An edge of a weighted graph can be denoted with &lt;code&gt;(u, v, w)&lt;/code&gt;, where &lt;code&gt;w&lt;/code&gt; is the weight.&lt;/p&gt;

&lt;h3&gt;2. Special Graphs&lt;/h3&gt;
&lt;p&gt;While directed, undirected, and weighted graphs cover the basic types, there are many other kinds of graphs governed by additional rules and restrictions.&lt;/p&gt;

&lt;h3&gt;2.1. Trees&lt;/h3&gt;
&lt;p&gt;A &lt;a href=&quot;https://en.wikipedia.org/wiki/Tree_(graph_theory)&quot; target=&quot;_blank&quot;&gt;tree&lt;/a&gt; is simply a connected collection of nodes, joined by directed (or undirected) edges, with no cycles or loops (no node can be its own ancestor). A tree with &lt;code&gt;N&lt;/code&gt; nodes has exactly &lt;code&gt;N-1&lt;/code&gt; edges.&lt;/p&gt;
&lt;img class=&quot;center-image-0 center-image&quot; src=&quot;./assets/posts/graph-theory/gt-trees.svg&quot; /&gt;
&lt;p&gt;All of the above are indeed trees, even the left-most graph, which has no cycles and N-1 edges.&lt;/p&gt;

&lt;h3&gt;2.2. Rooted Trees&lt;/h3&gt;
&lt;p&gt;A related but totally different kind of graph is a rooted tree. It has a designated root node, where every edge either points away from or towards the root node. When edges point away from the root, it&apos;s called an out-tree (arborescence) and an in-tree (anti-arborescence) otherwise.&lt;/p&gt;
&lt;img class=&quot;center-image-0 center-image&quot; src=&quot;./assets/posts/graph-theory/gt-rooted-trees.svg&quot; /&gt;
&lt;p&gt;Out-trees are more commonly used than in-trees, so much so that out-trees are often referred to as just &quot;trees.&quot;&lt;/p&gt;

&lt;h3&gt;2.3. Directed Acyclic Graphs (DAGs)&lt;/h3&gt;
&lt;p&gt;&lt;a href=&quot;https://en.wikipedia.org/wiki/Directed_acyclic_graph&quot; target=&quot;_blank&quot;&gt;DAGs&lt;/a&gt; are directed acyclic graphs, i.e., with directed edges and no cycles or loops. DAGs play an important role and are very common in computer science, including dependency management, workflows, schedulers, and many more.&lt;/p&gt;
&lt;p&gt;When dealing with DAGs, commonly used algorithms include finding the shortest path and topological sort (how to process nodes in a graph in the correct order considering dependencies).&lt;/p&gt;
&lt;img class=&quot;center-image-0 center-image&quot; src=&quot;./assets/posts/graph-theory/gt-dags.svg&quot; /&gt;
&lt;p&gt;Fun Fact: All out-trees are DAGs, but not all DAGs are out-trees.&lt;/p&gt;
&lt;p&gt;DAG nodes can have multiple parents, meaning there can be multiple paths that eventually merge. Out-trees are DAGs with the restriction that a child can only have one parent. Another way to see it: a tree is like single inheritance, and a DAG is like multiple inheritance.&lt;/p&gt;
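&lt;p&gt;Topological sort, mentioned above, can be sketched with Kahn&apos;s algorithm: repeatedly remove nodes that have no remaining incoming edges. A minimal sketch (the DAG here is an illustrative build pipeline, not from the figure):&lt;/p&gt;

```python
from collections import deque

def topological_sort(graph):
    """Kahn's algorithm: repeatedly take nodes with in-degree zero."""
    in_degree = {node: 0 for node in graph}
    for node in graph:
        for child in graph[node]:
            in_degree[child] += 1
    queue = deque(n for n in graph if in_degree[n] == 0)
    order = []
    while queue:
        node = queue.popleft()
        order.append(node)
        for child in graph[node]:
            in_degree[child] -= 1
            if in_degree[child] == 0:
                queue.append(child)
    if len(order) != len(graph):
        raise ValueError("graph has a cycle, so it is not a DAG")
    return order

# build feeds test and lint; both feed package (a node with two parents).
dag = {"build": ["test", "lint"], "test": ["package"],
       "lint": ["package"], "package": []}
print(topological_sort(dag))  # ['build', 'test', 'lint', 'package']
```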

&lt;h3&gt;2.4. Bipartite Graph&lt;/h3&gt;
&lt;p&gt;A &lt;a href=&quot;https://en.wikipedia.org/wiki/Bipartite_graph&quot; target=&quot;_blank&quot;&gt;bipartite graph&lt;/a&gt; is one whose vertices can be split into two independent groups, &lt;code&gt;U&lt;/code&gt; and &lt;code&gt;V&lt;/code&gt;, such that every edge connects a vertex in &lt;code&gt;U&lt;/code&gt; to a vertex in &lt;code&gt;V&lt;/code&gt;. In other words, a bipartite graph is two-colorable: every edge connects a vertex of one set (Example, set 1: red color) to a vertex of the other set (Example, set 2: blue color).&lt;/p&gt;
&lt;img class=&quot;center-image-0 center-image-65&quot; src=&quot;./assets/posts/graph-theory/gt-bipartite.svg&quot; /&gt;
&lt;p&gt;A common question is to find the maximum matching that can be created on a bipartite graph (covered in a follow-up post). For example, say red nodes are jobs and blue nodes are people. The problem is to determine how many people can be matched to jobs.&lt;/p&gt;
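&lt;p&gt;The two-colorable property also gives a simple test for bipartiteness: BFS the graph, alternating colors level by level; if an edge ever connects two nodes of the same color, the graph is not bipartite. A sketch of that idea (the example graphs are illustrative):&lt;/p&gt;

```python
from collections import deque

def is_bipartite(graph):
    """Two-color the graph with BFS; a color conflict means an odd cycle."""
    color = {}
    for source in graph:                         # handle disconnected graphs
        if source in color:
            continue
        color[source] = 0
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for neighbor in graph[node]:
                if neighbor not in color:
                    color[neighbor] = 1 - color[node]   # opposite color
                    queue.append(neighbor)
                elif color[neighbor] == color[node]:
                    return False                 # same color on both ends
    return True

square = {1: [2, 4], 2: [1, 3], 3: [2, 4], 4: [1, 3]}  # even cycle: bipartite
triangle = {1: [2, 3], 2: [1, 3], 3: [1, 2]}           # odd cycle: not
print(is_bipartite(square), is_bipartite(triangle))  # True False
```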

&lt;h3&gt;2.5. Complete Graph&lt;/h3&gt;
&lt;p&gt;In a &lt;a href=&quot;https://en.wikipedia.org/wiki/Complete_graph&quot; target=&quot;_blank&quot;&gt;complete graph&lt;/a&gt;, there is a unique edge between every pair of nodes, i.e., every node is connected to every other node except itself. A complete graph with &lt;code&gt;n&lt;/code&gt; vertices is denoted by the graph &lt;code&gt;K&lt;sub&gt;n&lt;/sub&gt;&lt;/code&gt;.&lt;/p&gt;
&lt;img class=&quot;center-image-0 center-image-100&quot; src=&quot;./assets/posts/graph-theory/gt-complete.svg&quot; /&gt;
&lt;p&gt;A complete graph is often seen as the worst-case possible graph and is used for performance testing.&lt;/p&gt;

&lt;h3&gt;3. Graph Representation&lt;/h3&gt;
&lt;p&gt;The next important aspect is the data structure we use to represent a graph, which can have a huge impact on performance. The simplest and most common way is using an adjacency matrix.&lt;/p&gt;

&lt;h3&gt;3.1. Adjacency Matrix&lt;/h3&gt;
&lt;p&gt;An &lt;a href=&quot;https://en.wikipedia.org/wiki/Adjacency_matrix&quot; target=&quot;_blank&quot;&gt;adjacency matrix&lt;/a&gt; &lt;code&gt;m&lt;/code&gt; represents a graph, where &lt;code&gt;m[i][j]&lt;/code&gt; is the edge weight of going from node &lt;code&gt;i&lt;/code&gt; to node &lt;code&gt;j&lt;/code&gt;. Unless specified otherwise, it&apos;s assumed that the edge from a node to itself has zero cost, which is why the diagonal of the matrix is all zeroes.&lt;/p&gt;
&lt;img class=&quot;center-image-0 center-image-65&quot; src=&quot;./assets/posts/graph-theory/gt-adjacency-matrix.svg&quot; /&gt;
&lt;p&gt;For example, the weight of the edge going from node D to node B is 5, as represented in the matrix.&lt;/p&gt;

&lt;p&gt;Pros:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Space efficient for representing dense graphs.&lt;/li&gt;
&lt;li&gt;Edge weight lookup is constant time: &lt;code&gt;O(1)&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Simplest graph representation.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Cons:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Requires &lt;code&gt;O(V&lt;sup&gt;2&lt;/sup&gt;)&lt;/code&gt; space, where &lt;code&gt;V&lt;/code&gt; is the number of nodes/vertices.&lt;/li&gt;
&lt;li&gt;Iterating over all edges requires &lt;code&gt;O(V&lt;sup&gt;2&lt;/sup&gt;)&lt;/code&gt; time.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The quadratic space complexity becomes impractical when dealing with sparse networks with nodes in the order of thousands or more.&lt;/p&gt;
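&lt;p&gt;As a concrete sketch, an adjacency matrix is just a 2D array where a sentinel (here &lt;code&gt;INF&lt;/code&gt;) marks a missing edge. The weights below are illustrative, except for the D to B edge of weight 5 echoed from the example above:&lt;/p&gt;

```python
INF = float("inf")   # sentinel for "no edge"

# 4-node weighted digraph; row = source, column = destination; A=0 ... D=3.
m = [
    [0,   4,   1,   INF],   # A
    [INF, 0,   6,   INF],   # B
    [4,   1,   0,   2],     # C
    [INF, 5,   4,   0],     # D
]

A, B, C, D = range(4)
print(m[D][B])  # O(1) lookup: the edge D -> B has weight 5
```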

&lt;h3&gt;3.2. Adjacency List&lt;/h3&gt;
&lt;p&gt;The other alternative to the adjacency matrix is the &lt;a href=&quot;https://en.wikipedia.org/wiki/Adjacency_list&quot; target=&quot;_blank&quot;&gt;adjacency list&lt;/a&gt;. This is a way to represent the graph as a map from nodes to lists of outgoing edges. In other words, each node tracks all its outgoing edges. i.e., &lt;code&gt;N&lt;sub&gt;1&lt;/sub&gt; = [(N&lt;sub&gt;x&lt;/sub&gt;, W), (N&lt;sub&gt;y&lt;/sub&gt;, W), ...]&lt;/code&gt;&lt;/p&gt;
&lt;img class=&quot;center-image-0 center-image-70&quot; src=&quot;./assets/posts/graph-theory/gt-adjacency-list.svg&quot; /&gt;
&lt;p&gt;For example, Node C has 3 outgoing edges, so the map entry for Node C has those 3 entries, each represented by the combination of the destination node and edge weight/cost.&lt;/p&gt;

&lt;p&gt;Pros:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Space efficient for representing sparse graphs (no extra space for unused edges).&lt;/li&gt;
&lt;li&gt;Iterating over all edges is efficient.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Cons:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Less space efficient for dense graphs.&lt;/li&gt;
&lt;li&gt;Edge weight lookup is &lt;code&gt;O(E)&lt;/code&gt;, where &lt;code&gt;E&lt;/code&gt; is the number of edges of a node.
&lt;/li&gt;
&lt;li&gt;Slightly more complex graph representation.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Adjacency lists are still very commonly used, since edge weight lookup is not a common use case and many real-world use cases involve sparse graphs.&lt;/p&gt;
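&lt;p&gt;A minimal sketch of the same idea in Python: a dict mapping each node to its list of (destination, weight) pairs. The nodes and weights are illustrative, with Node C given 3 outgoing edges as in the example:&lt;/p&gt;

```python
# Adjacency list: each node maps to a list of (destination, weight) pairs.
adj = {
    "A": [("B", 4), ("C", 1)],
    "B": [("C", 6)],
    "C": [("A", 4), ("B", 1), ("D", 2)],   # Node C has 3 outgoing edges
    "D": [("B", 5), ("C", 4)],
}

# Iterating over all edges touches only edges that exist: O(V + E).
edge_count = sum(len(out_edges) for out_edges in adj.values())
print(edge_count)  # 8

# Edge-weight lookup requires scanning the node's list: O(E) for that node.
def weight(adj, u, v):
    for dest, w in adj[u]:
        if dest == v:
            return w
    return None

print(weight(adj, "D", "B"))  # 5
```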

&lt;h3&gt;3.3. Edge List&lt;/h3&gt;
&lt;p&gt;The edge list takes the simplest possible approach: represent a graph as an unordered list of edges, each with a source node, destination node, and weight. For example, &lt;code&gt;(u, v, w)&lt;/code&gt; represents the cost from node &lt;code&gt;u&lt;/code&gt; to node &lt;code&gt;v&lt;/code&gt; as &lt;code&gt;w&lt;/code&gt;.&lt;/p&gt;
&lt;img class=&quot;center-image-0 center-image-65&quot; src=&quot;./assets/posts/graph-theory/gt-edge-list.svg&quot; /&gt;
&lt;p&gt;Pros:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Space efficient for representing sparse graphs.&lt;/li&gt;
&lt;li&gt;Iterating over all edges is efficient.&lt;/li&gt;
&lt;li&gt;Very simple structure/representation.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Cons:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Less space efficient for dense graphs.&lt;/li&gt;
&lt;li&gt;Edge weight lookup is &lt;code&gt;O(E)&lt;/code&gt;, where &lt;code&gt;E&lt;/code&gt; is the number of edges.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Despite the seeming simplicity and lack of structure, edge lists do come in handy for a variety of problems and algorithms.&lt;/p&gt;
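&lt;p&gt;A sketch of an edge list, plus a one-pass conversion back to an adjacency list (the data is illustrative):&lt;/p&gt;

```python
from collections import defaultdict

# Edge list: plain (source, destination, weight) triples.
edges = [
    ("A", "B", 4), ("A", "C", 1), ("B", "C", 6),
    ("C", "A", 4), ("C", "B", 1), ("C", "D", 2),
    ("D", "B", 5), ("D", "C", 4),
]

# Handy for algorithms that scan or sort all edges (e.g. Kruskal's MST):
print(sorted(edges, key=lambda e: e[2])[0])  # cheapest edge: ('A', 'C', 1)

# Converting to an adjacency list is a single pass over the edges:
def to_adjacency_list(edges):
    adj = defaultdict(list)
    for u, v, w in edges:
        adj[u].append((v, w))
    return adj

print(to_adjacency_list(edges)["D"])  # [('B', 5), ('C', 4)]
```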

&lt;h3&gt;4. Graph Problems&lt;/h3&gt;

&lt;p&gt;One of the best approaches to dealing with graph problems is to familiarize yourself with common graph theory algorithms, since many other problems can be reduced to a known graph problem. Before picking an algorithm and a representation, it helps to ask a few questions about the graph at hand:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Does the graph already exist, or is it to be derived/constructed?&lt;/li&gt;
&lt;li&gt;Is the graph directed or undirected?&lt;/li&gt;
&lt;li&gt;Is it a weighted graph (edges)?&lt;/li&gt;
&lt;li&gt;Is it a sparse graph or a dense graph?&lt;/li&gt;
&lt;li&gt;Based on all of the above, should I use an adjacency matrix, adjacency list, edge list, or other structures?&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;4.1. Shortest Path Problem&lt;/h3&gt;

&lt;p&gt;Given a weighted graph, find the shortest path of edges from Node A to Node B (source and destination nodes).&lt;/p&gt;
&lt;p&gt;Algorithms: &lt;a href=&quot;https://en.wikipedia.org/wiki/Breadth-first_search&quot; target=&quot;_blank&quot;&gt;Breadth First Search&lt;/a&gt; (unweighted graph), &lt;a href=&quot;https://en.wikipedia.org/wiki/Dijkstra%27s_algorithm&quot; target=&quot;_blank&quot;&gt;Dijkstra&lt;/a&gt;&apos;s, Bellman-Ford, Floyd-Warshall, A*, and many more.&lt;/p&gt;
&lt;img class=&quot;center-image-0 center-image-100&quot; src=&quot;./assets/posts/graph-theory/gt-shortest-path.svg&quot; /&gt;
&lt;p&gt;In the example, to find the shortest path from Node A to Node H, the sum of all the weights/costs of the path taken should be the least.&lt;/p&gt;

&lt;h3&gt;4.2. Connectivity&lt;/h3&gt;

&lt;p&gt;Along the same lines: determine whether connectivity exists between Node A and Node B. In other words, given two nodes, do they exist in the same network/graph? This is quite commonly used in communication networks such as WiFi, Thread, Zigbee, etc.&lt;/p&gt;
&lt;img class=&quot;center-image-0 center-image-65&quot; src=&quot;./assets/posts/graph-theory/gt-connectivity.svg&quot; /&gt;
&lt;p&gt;Algorithms: Any search algorithm such as BFS (Breadth First Search) or DFS (&lt;a href=&quot;https://en.wikipedia.org/wiki/Depth-first_search&quot; target=&quot;_blank&quot;&gt;Depth First Search&lt;/a&gt;).&lt;/p&gt;
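&lt;p&gt;Since BFS was sketched in the previous post, here is the connectivity check with an iterative DFS: push neighbors onto a stack until B is found or the component is exhausted (the network topology below is illustrative):&lt;/p&gt;

```python
def connected(graph, a, b):
    """True if b is reachable from a (iterative DFS with an explicit stack)."""
    stack, seen = [a], {a}
    while stack:
        node = stack.pop()
        if node == b:
            return True
        for neighbor in graph[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                stack.append(neighbor)
    return False

# Two separate components: {A, B, C} and {X, Y}.
network = {"A": ["B"], "B": ["A", "C"], "C": ["B"], "X": ["Y"], "Y": ["X"]}
print(connected(network, "A", "C"), connected(network, "A", "Y"))  # True False
```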

&lt;h3&gt;4.3. Negative Cycles&lt;/h3&gt;
&lt;p&gt;The problem: detect negative cycles in a directed graph. Also known as a negative-weight cycle, it is a cycle in a graph whose edges sum to a negative value.&lt;/p&gt;
&lt;p&gt;&lt;/p&gt;
&lt;img class=&quot;center-image-0 center-image-45&quot; src=&quot;./assets/posts/graph-theory/gt-cycles.svg&quot; /&gt;
&lt;p&gt;In the example, nodes B, C, and D form a negative cycle whose costs sum to -1, so an algorithm could loop through it endlessly, lowering the total cost on every iteration. For instance, a shortest-path search that doesn&apos;t detect negative cycles would fall into this trap and never escape.&lt;/p&gt;
&lt;img class=&quot;center-image-0 center-image-55&quot; src=&quot;./assets/posts/graph-theory/gt-currency.svg&quot; /&gt;
&lt;p&gt;Detecting negative cycles has other applications, such as currency arbitrage. In this context, assign currencies to different vertices, and let the edge weight represent the exchange rate.&lt;/p&gt;
&lt;p&gt;Algorithms to detect negative cycles: &lt;a href=&quot;https://en.wikipedia.org/wiki/Bellman%E2%80%93Ford_algorithm&quot; target=&quot;_blank&quot;&gt;Bellman-Ford&lt;/a&gt; and &lt;a href=&quot;https://en.wikipedia.org/wiki/Floyd%E2%80%93Warshall_algorithm&quot; target=&quot;_blank&quot;&gt;Floyd-Warshall&lt;/a&gt;.&lt;/p&gt;
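&lt;p&gt;A minimal sketch of Bellman-Ford-style detection over an edge list: after enough rounds of relaxation, distances have converged unless a negative cycle exists, so one more successful relaxation betrays the cycle. Initializing every distance to 0 (rather than picking a single source) detects negative cycles anywhere in the graph. The weights below mirror the B, C, D example, with nodes numbered 0-2:&lt;/p&gt;

```python
def has_negative_cycle(num_nodes, edges):
    """Bellman-Ford over an edge list of (u, v, w) triples."""
    dist = [0.0] * num_nodes                # 0-init: detect cycles anywhere
    for _ in range(num_nodes - 1):          # V-1 rounds of relaxation
        for u, v, w in edges:
            if dist[v] > dist[u] + w:
                dist[v] = dist[u] + w
    # If any edge can still be relaxed, a negative cycle must exist.
    return any(dist[v] > dist[u] + w for u, v, w in edges)

# A 3-node cycle whose weights sum to -1 (like B, C, D in the figure).
edges = [(0, 1, 1), (1, 2, 1), (2, 0, -3)]
print(has_negative_cycle(3, edges))  # True
```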
&lt;p&gt;&lt;/p&gt;

&lt;h3&gt;4.4. Strongly Connected Components&lt;/h3&gt;
&lt;p&gt;&lt;a href=&quot;https://en.wikipedia.org/wiki/Strongly_connected_component&quot; target=&quot;_blank&quot;&gt;SCCs&lt;/a&gt; are self-contained cycles within a directed graph, i.e., every vertex/node in a cycle can reach every other vertex in the same cycle.&lt;/p&gt;
&lt;img class=&quot;center-image-0 center-image&quot; src=&quot;./assets/posts/graph-theory/gt-ssc.svg&quot; /&gt;
&lt;p&gt;If each strongly connected component is contracted to a single vertex, the resulting graph is a directed acyclic graph (DAG), the condensation of Graph G.&lt;/p&gt;
&lt;p&gt;Algorithms: &lt;a href=&quot;https://en.wikipedia.org/wiki/Tarjan%27s_strongly_connected_components_algorithm&quot; target=&quot;_blank&quot;&gt;Tarjan&apos;s SCC&lt;/a&gt; and &lt;a href=&quot;https://en.wikipedia.org/wiki/Kosaraju%27s_algorithm&quot; target=&quot;_blank&quot;&gt;Kosaraju&lt;/a&gt;&apos;s algorithm.&lt;/p&gt;

&lt;h3&gt;4.5. Traveling Salesman Problem&lt;/h3&gt;
&lt;p&gt;or the travelling salesperson problem (&lt;a href=&quot;https://en.wikipedia.org/wiki/Travelling_salesman_problem&quot; target=&quot;_blank&quot;&gt;TSP&lt;/a&gt;) asks &quot;Given a list of cities and the distances between each pair of cities, what is the shortest possible route that visits each city exactly once and returns to the origin city?&quot; It is an NP-hard problem.&lt;/p&gt;
&lt;img class=&quot;center-image-0 center-image-65&quot; src=&quot;./assets/posts/graph-theory/gt-tsp.svg&quot; /&gt;
&lt;p&gt;For the above graph, the TSP (Traveling Salesman Problem) solution has a cost of 9 to travel from Node A to all the other nodes and back to Node A.&lt;/p&gt;
&lt;p&gt;Algorithms: &lt;a href=&quot;https://en.wikipedia.org/wiki/Held%E2%80%93Karp_algorithm&quot; target=&quot;_blank&quot;&gt;Held-Karp&lt;/a&gt;, Branch and Bound, and approximation algorithms (Ex: Ant Colony Optimization).&lt;/p&gt;

&lt;h3&gt;4.6. Bridges&lt;/h3&gt;
&lt;p&gt;A &lt;a href=&quot;https://en.wikipedia.org/wiki/Bridge_(graph_theory)&quot; target=&quot;_blank&quot;&gt;bridge&lt;/a&gt;, cut-edge, or cut-arc is an edge of a graph whose deletion increases the graph&apos;s number of connected components (islands or clusters).&lt;/p&gt;
&lt;img class=&quot;center-image-0 center-image-55&quot; src=&quot;./assets/posts/graph-theory/gt-bridge.svg&quot; /&gt;
&lt;p&gt;Detecting bridges is important as they often signify bottlenecks, weak points, or vulnerabilities in a graph. For instance, it&apos;s common to ensure that a mesh network is a bridgeless graph.&lt;/p&gt;

&lt;h3&gt;4.7. Articulation Points&lt;/h3&gt;
&lt;p&gt;An articulation point, or cut vertex, is similar to a bridge, but instead of edges, they are nodes. When removed, they increase the number of connected components.&lt;/p&gt;
&lt;img class=&quot;center-image-0 center-image-60&quot; src=&quot;./assets/posts/graph-theory/gt-ap.svg&quot; /&gt;
&lt;p&gt;In the same graph as for bridges, the nodes connected by the bridges are articulation points.&lt;/p&gt;

&lt;h3&gt;4.8. Minimum Spanning Tree (MST)&lt;/h3&gt;
&lt;p&gt;A minimum spanning tree (&lt;a href=&quot;https://en.wikipedia.org/wiki/Minimum_spanning_tree&quot; target=&quot;_blank&quot;&gt;MST&lt;/a&gt;) or minimum weight spanning tree is a subset of the edges of a connected, edge-weighted undirected graph that connects all the vertices together, without any cycles and with the minimum possible total edge weight/cost.&lt;/p&gt;
&lt;img class=&quot;center-image-0 center-image-80&quot; src=&quot;./assets/posts/graph-theory/gt-mst.svg&quot; /&gt;
&lt;p&gt;A graph can have multiple minimum spanning trees, all with the same total cost; the MST is therefore not necessarily unique. Common use cases include designing a least-cost network, transportation networks, and more.&lt;/p&gt;
&lt;p&gt;Algorithms: Kruskal&apos;s, Prim&apos;s, and Boruvka&apos;s algorithms.&lt;/p&gt;
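&lt;p&gt;As a sketch of one of them, Kruskal&apos;s algorithm sorts the edges by weight and greedily adds any edge that joins two different components, using a union-find structure to reject edges that would form a cycle (the example graph is illustrative):&lt;/p&gt;

```python
def kruskal_mst(num_nodes, edges):
    """Kruskal's algorithm: greedily add the cheapest edges that join components."""
    parent = list(range(num_nodes))

    def find(x):                        # union-find with path compression
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    mst, total = [], 0
    for u, v, w in sorted(edges, key=lambda e: e[2]):
        ru, rv = find(u), find(v)
        if ru != rv:                    # skip edges that would form a cycle
            parent[ru] = rv
            mst.append((u, v, w))
            total += w
    return mst, total

# A square 0-1-2-3 with a diagonal 0-2: edges as (u, v, weight) triples.
edges = [(0, 1, 1), (1, 2, 2), (2, 3, 3), (3, 0, 4), (0, 2, 5)]
mst, total = kruskal_mst(4, edges)
print(total)  # 6: the three cheapest edges span all four nodes
```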
&lt;p&gt;&lt;/p&gt;

&lt;h3&gt;4.9. Flow Network&lt;/h3&gt;
&lt;p&gt;&lt;a href=&quot;https://en.wikipedia.org/wiki/Flow_network&quot; target=&quot;_blank&quot;&gt;Flow network&lt;/a&gt; or the transportation network is a directed graph where the edge weight represents &quot;capacity.&quot; The amount of flow on an edge cannot exceed the capacity of the edge. Capacity can represent fluids in a pipe, currents in an electrical circuit, cars on a road, etc.&lt;/p&gt;
&lt;p&gt;Problem: with an unlimited supply entering the source, what&apos;s the maximum flow that can reach the sink? This makes it easier to spot bottlenecks in the network that slow the flow. Correlating to the examples, max flow would be the number of cars, volume of fluid, etc.&lt;/p&gt;
&lt;img class=&quot;center-image-0 center-image-65&quot; src=&quot;./assets/posts/graph-theory/gt-flow-network.svg&quot; /&gt;
&lt;p&gt;Also, there cannot be blockages in the network: the amount of flow into a node equals the amount of flow out of it (except at the source and sink).&lt;/p&gt;

&lt;h3&gt;5. Conclusion&lt;/h3&gt;
&lt;p&gt;With the basics of graph theory covered, including various types of graphs and their representations, we&apos;ve laid the groundwork for understanding how to efficiently store, represent, and traverse graphs in real-world applications. The next set of posts on Graph Theory will be a deep dive into specific problems and algorithms.&lt;/p&gt;

&lt;h3&gt;6. References&lt;/h3&gt;
&lt;pre style=&quot;max-height: 180px&quot;&gt;&lt;code&gt;[1] W. Fiset, &quot;Algorithms repository,&quot; GitHub, 2017. [Online]. Available: https://github.com/williamfiset/Algorithms.
[2] V. Schwartz, &quot;Currency Arbitrage and Graphs (2),&quot; Reasonable Deviations, Apr. 21, 2019. [Online]. Available: https://reasonabledeviations.com/2019/04/21/currency-arbitrage-graphs-2/. 
[3] Wikipedia, &quot;Graph theory,&quot; Wikipedia, The Free Encyclopedia. [Online]. Available: https://en.wikipedia.org/wiki/Graph_theory.
[4] Wikipedia, &quot;Bidirected graph,&quot; Wikipedia, The Free Encyclopedia. [Online]. Available: https://en.wikipedia.org/wiki/Bidirected_graph.
[5] Wikipedia, &quot;Directed graph,&quot; Wikipedia, The Free Encyclopedia. [Online]. Available: https://en.wikipedia.org/wiki/Directed_graph.
[6] Wikipedia, &quot;Tree (graph theory),&quot; Wikipedia, The Free Encyclopedia. [Online]. Available: https://en.wikipedia.org/wiki/Tree_(graph_theory).
[7] Wikipedia, &quot;Directed acyclic graph,&quot; Wikipedia, The Free Encyclopedia. [Online]. Available: https://en.wikipedia.org/wiki/Directed_acyclic_graph.
[8] Wikipedia, &quot;Bipartite graph,&quot; Wikipedia, The Free Encyclopedia. [Online]. Available: https://en.wikipedia.org/wiki/Bipartite_graph.
[9] Wikipedia, &quot;Complete graph,&quot; Wikipedia, The Free Encyclopedia. [Online]. Available: https://en.wikipedia.org/wiki/Complete_graph.
[10] Wikipedia, &quot;Adjacency matrix,&quot; Wikipedia, The Free Encyclopedia. [Online]. Available: https://en.wikipedia.org/wiki/Adjacency_matrix.
[11] Wikipedia, &quot;Adjacency list,&quot; Wikipedia, The Free Encyclopedia. [Online]. Available: https://en.wikipedia.org/wiki/Adjacency_list.
[12] Wikipedia, &quot;Breadth-first search,&quot; Wikipedia, The Free Encyclopedia. [Online]. Available: https://en.wikipedia.org/wiki/Breadth-first_search.
[13] Wikipedia, &quot;Depth-first search,&quot; Wikipedia, The Free Encyclopedia. [Online]. Available: https://en.wikipedia.org/wiki/Depth-first_search.
[14] Wikipedia, &quot;Bellman–Ford algorithm,&quot; Wikipedia, The Free Encyclopedia. [Online]. Available: https://en.wikipedia.org/wiki/Bellman%E2%80%93Ford_algorithm.
[15] Wikipedia, &quot;Floyd–Warshall algorithm,&quot; Wikipedia, The Free Encyclopedia. [Online]. Available: https://en.wikipedia.org/wiki/Floyd%E2%80%93Warshall_algorithm.
[16] Wikipedia, &quot;Strongly connected component,&quot; Wikipedia, The Free Encyclopedia. [Online]. Available: https://en.wikipedia.org/wiki/Strongly_connected_component.
[17] Wikipedia, &quot;Travelling salesman problem,&quot; Wikipedia, The Free Encyclopedia. [Online]. Available: https://en.wikipedia.org/wiki/Travelling_salesman_problem.
[18] Wikipedia, &quot;Held–Karp algorithm,&quot; Wikipedia, The Free Encyclopedia. [Online]. Available: https://en.wikipedia.org/wiki/Held%E2%80%93Karp_algorithm.
[19] Wikipedia, &quot;Bridge (graph theory),&quot; Wikipedia, The Free Encyclopedia. [Online]. Available: https://en.wikipedia.org/wiki/Bridge_(graph_theory).
[20] Wikipedia, &quot;Minimum spanning tree,&quot; Wikipedia, The Free Encyclopedia. [Online]. Available: https://en.wikipedia.org/wiki/Minimum_spanning_tree.
[21] Wikipedia, &quot;Flow network,&quot; Wikipedia, The Free Encyclopedia. [Online]. Available: https://en.wikipedia.org/wiki/Flow_network.
&lt;/code&gt;&lt;/pre&gt;
&lt;/p&gt;&lt;/p&gt;</content><author><name>Adesh Nalpet Adimurthy</name></author><category term="Code on the Road" /><category term="Graph Theory" /><category term="Data Structures" /><summary type="html">Before heading into details of how we store, represent, and traverse various kinds of graphs, this post is more of a ramp-up to better understand what graphs are and the different kinds from a computer science point of view, rather than a mathematical one. So, no proofs and equations, mostly just diagrams and implementation details, with an emphasis on how to apply graph theory to real-world applications.</summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://pyblog.xyz/assets/featured/webp/graph-theory-101.webp" /><media:content medium="image" url="https://pyblog.xyz/assets/featured/webp/graph-theory-101.webp" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Spatial Index: R Trees</title><link href="https://pyblog.xyz/spatial-index-r-tree" rel="alternate" type="text/html" title="Spatial Index: R Trees" /><published>2024-06-26T00:00:00+00:00</published><updated>2024-06-26T00:00:00+00:00</updated><id>https://pyblog.xyz/spatial-index-r-tree</id><content type="html" xml:base="https://pyblog.xyz/spatial-index-r-tree">&lt;p&gt;If you have been following the &lt;a href=&quot;https://pyblog.xyz/tags/spatial-index&quot;&gt;Spatial Index Series&lt;/a&gt;, it started with the need for multi-dimensional indexes and an introduction to &lt;a href=&quot;https://pyblog.xyz/spatial-index-space-filling-curve&quot;&gt;space-filling curves&lt;/a&gt;, followed by a deep dive into &lt;a href=&quot;https://pyblog.xyz/spatial-index-grid-system&quot;&gt;grid systems&lt;/a&gt; (GeoHash and Google S2) and &lt;a href=&quot;https://pyblog.xyz/spatial-index-tessellation&quot;&gt;tessellation&lt;/a&gt; (Uber H3).&lt;/p&gt;

&lt;p&gt;In this post, let&apos;s explore the &lt;a href=&quot;https://en.wikipedia.org/wiki/R-tree&quot; target=&quot;_blank&quot;&gt;R-Tree&lt;/a&gt; data structure (data-driven structure), which is popularly used to store multi-dimensional data, such as data points, segments, and rectangles.&lt;/p&gt;

&lt;h3&gt;1. R-Trees and Rectangles&lt;/h3&gt;

&lt;p&gt;For example, consider the layout plan of a university campus below. We can use the R-Tree data structure to index the buildings on the map.&lt;/p&gt;

&lt;p&gt;To do so, we can place rectangles around a building or group of buildings and then index them. Suppose there&apos;s a much bigger section of the map signifying a larger department, and we need to query all the buildings within a department. We can use the R-Tree to find all the buildings within (partially or fully contained) the larger section (query rectangle).&lt;/p&gt;

&lt;p&gt;&lt;img class=&quot;center-image-0 center-image&quot; src=&quot;./assets/posts/spatial-index/r-tree-campus-level-2.svg&quot; /&gt;&lt;/p&gt;
&lt;p class=&quot;figure-header&quot;&gt;Figure 0: Layout with MBRs and Query Rectangle&lt;/p&gt;

&lt;p&gt;In the above figure, the red rectangle represents the query rectangle, used to ask the R-Tree for all the buildings that intersect with it (&lt;code&gt;R2, R3, R6&lt;/code&gt;).&lt;/p&gt;

&lt;h3&gt;2. R-Tree - Intuition&lt;/h3&gt;

&lt;p&gt;The main idea in R-trees is the &lt;a href=&quot;https://en.wikipedia.org/wiki/Minimum_bounding_rectangle&quot; target=&quot;_blank&quot;&gt;minimum bounding rectangles&lt;/a&gt;. We&apos;ll come to what &quot;minimum&quot; implies in a second.&lt;/p&gt;

&lt;p&gt;An inner node of an R-tree works as follows: we start with the root node, representing the entire landscape. The inner nodes are guideposts that hold pointers to the child nodes we need to descend into, i.e., each entry of a node points to an area of the data space (described by an MBR).&lt;/p&gt;

&lt;p&gt;&lt;img class=&quot;center-image-0 center-image-70&quot; src=&quot;./assets/posts/spatial-index/r-tree-inner-node.svg&quot; /&gt;&lt;/p&gt;
&lt;p class=&quot;figure-header&quot;&gt;Figure 1: R-Tree Inner Node&lt;/p&gt;

&lt;p&gt;For instance, think of a &lt;a href=&quot;https://en.wikipedia.org/wiki/Binary_search_tree&quot; target=&quot;_blank&quot;&gt;Binary Search Tree&lt;/a&gt;. From the root node, we make a decision to go left or right. The R-tree is similar, but more of an &lt;a href=&quot;/b-tree&quot; target=&quot;_blank&quot;&gt;M-way tree&lt;/a&gt;, where each node can have multiple entries as seen above. Instead of having integer or string values (one-dimensional), the inner nodes consist of entries (multi-dimensional). In the example, there are 4 entries of rectangles.&lt;/p&gt;

&lt;h3&gt;2.1. MBR - Minimum Bounding Rectangle&lt;/h3&gt;

&lt;p&gt;&lt;img class=&quot;center-image-0 center-image-35&quot; src=&quot;./assets/posts/spatial-index/r-tree-mbr.svg&quot; /&gt;&lt;/p&gt;
&lt;p class=&quot;figure-header&quot;&gt;Figure 2: R-Tree Minimum Bounding Rectangle&lt;/p&gt;

&lt;p&gt;Minimum Bounding Rectangles, &lt;code&gt;R1, R2, R3, R4&lt;/code&gt;, contain the objects which are stored in the sub-trees in a minimal way. For instance, say we have 3 rectangles &lt;code&gt;R11, R12, R13&lt;/code&gt;. &lt;code&gt;R1&lt;/code&gt; is the smallest rectangle that can be created to completely contain all three rectangles, hence the name &quot;minimum.&quot;&lt;/p&gt;

&lt;h3&gt;2.2. Search Process and Overlapping MBRs&lt;/h3&gt;

&lt;p&gt;The search process in an R-tree is simple: given a query object/query rectangle, at each inner node the decision is to check which of the node&apos;s entries intersect with the query rectangle and descend into those sub-trees.&lt;/p&gt;

&lt;p&gt;&lt;img class=&quot;center-image-0 center-image-70&quot; src=&quot;./assets/posts/spatial-index/r-tree-query-rectangle.svg&quot; /&gt;&lt;/p&gt;
&lt;p class=&quot;figure-header&quot;&gt;Figure 3: R-Tree Query Rectangle(s)&lt;/p&gt;

&lt;p&gt;For example, consider a query rectangle &lt;code&gt;Q1&lt;/code&gt;. It&apos;s clear that &lt;code&gt;R1&lt;/code&gt; intersects with &lt;code&gt;Q1&lt;/code&gt;, so we would follow down the tree from &lt;code&gt;R1&lt;/code&gt;. Similarly, &lt;code&gt;Q2&lt;/code&gt; intersects with &lt;code&gt;R2&lt;/code&gt;. However, in scenarios where the query rectangle intersects with multiple entries/rectangles (&lt;code&gt;Q3&lt;/code&gt; with &lt;code&gt;R2, R3, R4&lt;/code&gt;), all the intersecting rectangles have to be searched. This can happen if the indexing is not optimized and has to be avoided, as it defeats the purpose of indexing in the first place.&lt;/p&gt;
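&lt;p&gt;The intersection test used at every step is a cheap axis-by-axis comparison. A minimal sketch, where the &lt;code&gt;Rect&lt;/code&gt; tuple layout is an illustrative assumption:&lt;/p&gt;

```python
from collections import namedtuple

# A rectangle as (xmin, ymin, xmax, ymax); the names are illustrative.
Rect = namedtuple("Rect", "xmin ymin xmax ymax")

def intersects(a, b):
    # Two rectangles overlap iff they overlap on both axes:
    # each one must start before the other ends, in x and in y.
    return (b.xmax >= a.xmin and a.xmax >= b.xmin
            and b.ymax >= a.ymin and a.ymax >= b.ymin)
```

&lt;p&gt;Note that rectangles that merely touch along an edge count as intersecting here, which matches the &quot;partially or fully contained&quot; semantics used above.&lt;/p&gt;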

&lt;h3&gt;2.3. R-Tree - Properties&lt;/h3&gt;

&lt;p&gt;Here&apos;s a bit of a larger example of an R-tree.&lt;/p&gt;

&lt;p&gt;&lt;img class=&quot;center-image-0 center-image-85&quot; src=&quot;./assets/posts/spatial-index/r-tree-l-3.svg&quot; /&gt;&lt;/p&gt;
&lt;p class=&quot;figure-header&quot;&gt;Figure 4: R-Tree Level-2&lt;/p&gt;

&lt;p&gt;Every node in an R-tree has between &lt;code&gt;m&lt;/code&gt; and &lt;code&gt;M&lt;/code&gt; entries, where the minimum fill factor satisfies &lt;code&gt;m ≤ ⌈M/2⌉&lt;/code&gt;. The root has at least 2 entries unless it is a leaf.&lt;/p&gt;

&lt;p&gt;By now, if you have also read the blog post on &lt;a href=&quot;/b-tree&quot; target=&quot;_blank&quot;&gt;B-Trees and B+ Trees&lt;/a&gt;, you&apos;ll see that an R-Tree is quite similar to a B+ Tree. It uses a similar idea to split the space at each (inner) node into multiple areas. However, B+ Trees mostly work with one-dimensional data, and their data ranges do not overlap.&lt;/p&gt;

&lt;h3&gt;3. Search using an R-Tree&lt;/h3&gt;

&lt;p&gt;Now that we know the idea behind R-Trees and the search process, let&apos;s put a clear-cut definition to the search process:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Goal: Find all rectangles that overlap with the given rectangle &lt;code&gt;S&lt;/code&gt; (query rectangle).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Let &lt;code&gt;T&lt;/code&gt; denote the node (at the current level/sub-tree).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;S1 (Search in sub-trees): If &lt;code&gt;T&lt;/code&gt; is not a leaf, check all the entries &lt;code&gt;E&lt;/code&gt; in &lt;code&gt;T&lt;/code&gt;. If the MBR of &lt;code&gt;E&lt;/code&gt; overlaps with &lt;code&gt;S&lt;/code&gt;, then continue the search in the sub-tree to which &lt;code&gt;E&lt;/code&gt; points.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;S2 (Search in Leaves): If &lt;code&gt;T&lt;/code&gt; is a leaf node, inspect all entries &lt;code&gt;E&lt;/code&gt; in &lt;code&gt;T&lt;/code&gt;. All entries that overlap with &lt;code&gt;S&lt;/code&gt; are part of the query result.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
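&lt;p&gt;Steps S1 and S2 translate directly into a short recursion. The sketch below assumes a simple &lt;code&gt;Node&lt;/code&gt; with a list of &lt;code&gt;(mbr, value)&lt;/code&gt; entries, where &lt;code&gt;value&lt;/code&gt; is a child node for inner nodes and a stored object for leaves (the names are illustrative):&lt;/p&gt;

```python
class Node:
    def __init__(self, is_leaf, entries):
        self.is_leaf = is_leaf
        # entries: list of (mbr, child_node) for inner nodes,
        # or (mbr, object) for leaves; mbr = (xmin, ymin, xmax, ymax).
        self.entries = entries

def overlaps(a, b):
    # Rectangles overlap iff they overlap on both the x and y axes.
    return (b[2] >= a[0] and a[2] >= b[0] and b[3] >= a[1] and a[3] >= b[1])

def search(node, query):
    results = []
    for mbr, value in node.entries:
        if overlaps(mbr, query):
            if node.is_leaf:           # S2: the entry itself is a result
                results.append(value)
            else:                      # S1: continue into the sub-tree
                results.extend(search(value, query))
    return results
```

&lt;p&gt;Unlike a B+ Tree lookup, several entries of one node may match, so the recursion can fan out into multiple sub-trees; this is exactly why overlapping MBRs hurt performance.&lt;/p&gt;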

&lt;h3&gt;4. Inserting to an R-Tree&lt;/h3&gt;

&lt;p&gt;Coming to inserts, consider a leaf node (MBR) as shown below with 3 entries/objects, &lt;code&gt;R1&lt;/code&gt;, &lt;code&gt;R2&lt;/code&gt;, and &lt;code&gt;R3&lt;/code&gt;. Let&apos;s assume that the leaf is not full yet (MBR has a threshold capacity on the number of objects it can hold).&lt;/p&gt;

&lt;p&gt;Say, there&apos;s a new rectangle &lt;code&gt;R4&lt;/code&gt; coming and it has to be inserted inside the leaf node. As you can see, in order to capture the new objects, the MBR is adjusted, i.e., enlarged to minimally contain &lt;code&gt;R1&lt;/code&gt; to &lt;code&gt;R4&lt;/code&gt;. Going on and inserting another object &lt;code&gt;R5&lt;/code&gt;, the MBR is once again adjusted.&lt;/p&gt;

&lt;p&gt;&lt;img class=&quot;center-image-0 center-image-100&quot; src=&quot;./assets/posts/spatial-index/r-tree-insert.svg&quot; /&gt;&lt;/p&gt;
&lt;p class=&quot;figure-header&quot;&gt;Figure 5: R-Tree Insert (Adjusting MBR)&lt;/p&gt;

&lt;p&gt;On an insert, when an MBR is updated, i.e., it now contains more objects, the new MBR has to be updated not only on the node itself but also propagated to the upper levels, potentially (though not always) up to the root node. This reflects that the sub-tree now contains more information.&lt;/p&gt;

&lt;h3&gt;4.1. Choice for Insert&lt;/h3&gt;

&lt;p&gt;Unlike the example, it&apos;s not always clear in which node/sub-tree an object should be inserted. Here: &lt;code&gt;MBR1&lt;/code&gt;, &lt;code&gt;MBR2&lt;/code&gt;, or &lt;code&gt;MBR3&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;img class=&quot;center-image-0 center-image-55&quot; src=&quot;./assets/posts/spatial-index/r-tree-insert-mbrs.svg&quot; /&gt;&lt;/p&gt;
&lt;p class=&quot;figure-header&quot;&gt;Figure 6: R-Tree Choice for Insert (1)&lt;/p&gt;

&lt;p&gt;The question is: which MBR should we insert &lt;code&gt;R1&lt;/code&gt; into? Setting aside any rules or justification for a second, &lt;code&gt;R1&lt;/code&gt; can be inserted into any MBR.&lt;/p&gt;

&lt;p&gt;&lt;img class=&quot;center-image-0 center-image-55&quot; src=&quot;./assets/posts/spatial-index/r-tree-insert-mbr1.svg&quot; /&gt;&lt;/p&gt;
&lt;p class=&quot;figure-header&quot;&gt;Figure 7: R-Tree Choice for Insert (2)&lt;/p&gt;

&lt;p&gt;Inserting into &lt;code&gt;MBR1&lt;/code&gt; would require &lt;code&gt;MBR1&lt;/code&gt; to grow immensely to fully contain &lt;code&gt;R1&lt;/code&gt;. The implication? Say there&apos;s a query rectangle &lt;code&gt;Q1&lt;/code&gt;. After descending the sub-tree to &lt;code&gt;MBR1&lt;/code&gt;, we find that there&apos;s nothing there (no objects). This is because, to contain &lt;code&gt;R1&lt;/code&gt;, we have expanded &lt;code&gt;MBR1&lt;/code&gt; so much that there is a lot of space without any objects. So, it&apos;s fair to conclude that one criterion is to insert into the MBR that needs to expand the least.&lt;/p&gt;

&lt;p&gt;&lt;img class=&quot;center-image-0 center-image-55&quot; src=&quot;./assets/posts/spatial-index/r-tree-insert-mbr2.svg&quot; /&gt;&lt;/p&gt;
&lt;p class=&quot;figure-header&quot;&gt;Figure 8: R-Tree Choice for Insert (3)&lt;/p&gt;

&lt;p&gt;Going by that, inserting into &lt;code&gt;MBR2&lt;/code&gt; is a better option as opposed to &lt;code&gt;MBR1&lt;/code&gt;. Similarly, &lt;code&gt;MBR3&lt;/code&gt; may not be a bad option either, depending on the expansion factor.&lt;/p&gt;

&lt;hr class=&quot;post-hr&quot; /&gt;

&lt;p&gt;Stating the obvious (for implementation), the minimum bounding rectangle (MBR) is the rectangle spanning the minimal and maximal coordinates of all contained rectangles in each dimension.&lt;/p&gt;

&lt;p&gt;&lt;img class=&quot;center-image-0 center-image-90&quot; src=&quot;./assets/posts/spatial-index/r-tree-overlap-criterion.svg&quot; /&gt;&lt;/p&gt;
&lt;p class=&quot;figure-header&quot;&gt;Figure 9: R-Tree MBR Implementation&lt;/p&gt;
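&lt;p&gt;In code, this is just a per-dimension min/max over the contained rectangles; a sketch assuming rectangles as &lt;code&gt;(xmin, ymin, xmax, ymax)&lt;/code&gt; tuples:&lt;/p&gt;

```python
def mbr(rects):
    # The MBR spans the minimal lower corner and the maximal
    # upper corner of all contained rectangles, per dimension.
    return (min(r[0] for r in rects),
            min(r[1] for r in rects),
            max(r[2] for r in rects),
            max(r[3] for r in rects))
```

&lt;p&gt;For example, &lt;code&gt;mbr([(0, 0, 2, 2), (1, 1, 5, 3), (-1, 0, 0, 4)])&lt;/code&gt; yields &lt;code&gt;(-1, 0, 5, 4)&lt;/code&gt;.&lt;/p&gt;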

&lt;hr class=&quot;post-hr&quot; /&gt;

&lt;p&gt;Summarizing the insertion into R-Tree so far:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;In principle, a new rectangle can be inserted into any node.&lt;/li&gt;
&lt;li&gt;If the node is full, a split needs to be performed (more on that in the next section).&lt;/li&gt;
&lt;li&gt;If not, the MBR may have to be adjusted/expanded to accommodate the new objects (as seen above).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Observations:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Extending bounding boxes is a critical factor for the performance of the R-Tree.&lt;/li&gt;
&lt;li&gt;Try to minimize overlap (of the MBRs).&lt;/li&gt;
&lt;li&gt;Try to minimize spread (the size of the MBR, as seen in section 4.1).&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;4.2. Insert - Algorithm&lt;/h3&gt;

&lt;p&gt;Here&apos;s the algorithm proposed in the original R-Tree paper, &quot;&lt;a href=&quot;https://www.researchgate.net/publication/221213205_R_Trees_A_Dynamic_Index_Structure_for_Spatial_Searching&quot; target=&quot;_blank&quot;&gt;R-Trees: A Dynamic Index Structure for Spatial Searching&lt;/a&gt;&quot; by A. Guttman, 1984.&lt;/p&gt;

&lt;p&gt;The rest of this section mostly goes over snippets and explanations from this paper, but with more examples and visualizations.&lt;/p&gt;

&lt;p&gt;Algorithm: Search for leaf to insert (&lt;a href=&quot;https://en.wikipedia.org/wiki/Hilbert_R-tree#Insertion&quot; target=&quot;_blank&quot;&gt;ChooseLeaf&lt;/a&gt;):&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;CS1: Let &lt;code&gt;N&lt;/code&gt; be the root.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;CS2:
&lt;ul&gt;
    &lt;li&gt;If &lt;code&gt;N&lt;/code&gt; is a leaf, return &lt;code&gt;N&lt;/code&gt;.&lt;/li&gt;
    &lt;li&gt;&lt;p&gt;If &lt;code&gt;N&lt;/code&gt; is not a leaf: Search for an entry in &lt;code&gt;N&lt;/code&gt; whose rectangle (MBR) requires the least area increase in order to accommodate the new rectangle. In the case where there are multiple options, consider an entry that has the smallest (in area) MBR.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;CS3: Let &lt;code&gt;N&lt;/code&gt; be the child node, then continue to step CS2 (repeat).&lt;/li&gt;
&lt;/ul&gt;
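&lt;p&gt;The ChooseLeaf steps above can be sketched as a short loop. The &lt;code&gt;Node&lt;/code&gt; layout with &lt;code&gt;(mbr, child)&lt;/code&gt; entries is an illustrative assumption; the tie-break on the smaller MBR area matches step CS2:&lt;/p&gt;

```python
def area(r):
    # r = (xmin, ymin, xmax, ymax)
    return (r[2] - r[0]) * (r[3] - r[1])

def enlargement(mbr, new):
    # Area increase needed for mbr to also cover the new rectangle.
    merged = (min(mbr[0], new[0]), min(mbr[1], new[1]),
              max(mbr[2], new[2]), max(mbr[3], new[3]))
    return area(merged) - area(mbr)

class Node:
    def __init__(self, is_leaf, entries):
        self.is_leaf = is_leaf
        self.entries = entries  # (mbr, child) or, for leaves, (mbr, object)

def choose_leaf(root, new_rect):
    # CS1-CS3: descend from the root, always picking the entry whose MBR
    # needs the least area increase; ties broken by the smallest MBR area.
    node = root
    while not node.is_leaf:
        _, child = min(node.entries,
                       key=lambda e: (enlargement(e[0], new_rect), area(e[0])))
        node = child
    return node
```

&lt;p&gt;Inserting a rectangle near an existing cluster therefore descends into that cluster&apos;s sub-tree rather than forcing a distant MBR to balloon.&lt;/p&gt;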

&lt;hr class=&quot;post-hr&quot; /&gt;

&lt;p&gt;Here&apos;s a much simpler example of 8 objects, each with one multidimensional attribute (a range/line segment on the x-axis) and one identity (color). We insert these objects one by one into an empty R-tree of degree &lt;code&gt;M = 3&lt;/code&gt; (maximum number of entries at each node) with &lt;code&gt;m = 2&lt;/code&gt; (minimum number of entries at each node).&lt;/p&gt;

&lt;p&gt;&lt;img class=&quot;center-image-0 center-image-100&quot; src=&quot;./assets/posts/spatial-index/r-tree-insert-example.svg&quot; /&gt;&lt;/p&gt;
&lt;p class=&quot;figure-header&quot;&gt;Figure 10: R-Tree Insertion Example&lt;/p&gt;

&lt;p&gt;Observation: in the case where the selected leaf is already full, a splitting operation is performed. Let&apos;s understand the overflow problem better (the split problem):&lt;/p&gt;

&lt;h3&gt;4.3. Handling Overflow&lt;/h3&gt;

&lt;p&gt;In the case a node/leaf is full and a new entry cannot be stored anymore, a split needs to be performed, just as for a B+ Tree. The difference is that the split can be done arbitrarily and not only in the middle as for a B+ Tree.&lt;/p&gt;

&lt;p&gt;&lt;img class=&quot;center-image-0 center-image-30&quot; src=&quot;./assets/posts/spatial-index/r-tree-split-problem.svg&quot; /&gt;&lt;/p&gt;
&lt;p class=&quot;figure-header&quot;&gt;Figure 11: R-Tree Insertion: Overflow&lt;/p&gt;

&lt;h3&gt;4.3.1. The Split Problem&lt;/h3&gt;
&lt;p&gt;Given &lt;code&gt;M + 1&lt;/code&gt; entries in a node (the maximum capacity per node is exceeded), how should these entries be partitioned into two subsets, one for the old node and one for the new node?&lt;/p&gt;

&lt;p&gt;To better understand the split problem, let&apos;s take a step back and consider 4 rectangles (&lt;code&gt;R1, R2, R3, R4&lt;/code&gt;) that need to be assigned to two nodes (MBRs) in a meaningful way.&lt;/p&gt;

&lt;p&gt;&lt;img class=&quot;center-image-0 center-image-90&quot; src=&quot;./assets/posts/spatial-index/r-tree-split-problem-example.svg&quot; /&gt;&lt;/p&gt;
&lt;p class=&quot;figure-header&quot;&gt;Figure 12: R-Tree Insertion: Split Problem&lt;/p&gt;

&lt;p&gt;Why is one better than the other? As mentioned before (Section 4.1), the area of expansion of the poor split is much larger compared to the good split (despite the overlap). This leads to more empty spaces in the node/MBR that do not have any objects.&lt;/p&gt;

&lt;p&gt;A realistic value for an R-Tree is &lt;code&gt;M = 50&lt;/code&gt;, which already gives on the order of &lt;code&gt;2^(M-1)&lt;/code&gt; possible splits. Hence, a naive approach that looks at all possible subsets and chooses the best one is not practical (too expensive!).&lt;/p&gt;

&lt;h3&gt;4.3.2. The Split Problem: Quadratic Cost&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Search for the split with the smallest possible total area.&lt;/li&gt;
&lt;li&gt;The cost is quadratic in &lt;code&gt;M&lt;/code&gt; and linear in the number of dimensions &lt;code&gt;d&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Idea:
&lt;ul&gt;
&lt;li&gt;Search for the pair of entries that would create the largest MBR area if placed in the same node, and put these two entries into two different nodes (the seeds).&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Then, among all remaining entries, consider the one for which the increase in MBR area has the largest possible difference between the two nodes.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Assign this entry to the node with the smaller increase. Repeat until all entries are assigned.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;img class=&quot;center-image-0 center-image-80&quot; src=&quot;./assets/posts/spatial-index/r-tree-split-quadratic.svg&quot; /&gt;&lt;/p&gt;
&lt;p class=&quot;figure-header&quot;&gt;Figure 13: R-Tree Insertion: Choosing MBR&lt;/p&gt;

&lt;p&gt;In this example, two nodes, &lt;code&gt;MBR1&lt;/code&gt; and &lt;code&gt;MBR2&lt;/code&gt;, are created. Placing &lt;code&gt;R1&lt;/code&gt; and &lt;code&gt;R2&lt;/code&gt; in the same MBR would create the largest MBR, so they seed the two nodes. &lt;code&gt;R3&lt;/code&gt; is then inserted into &lt;code&gt;MBR1&lt;/code&gt; and not &lt;code&gt;MBR2&lt;/code&gt;, as the area increase of &lt;code&gt;MBR1&lt;/code&gt; is smaller compared to &lt;code&gt;MBR2&lt;/code&gt;.&lt;/p&gt;
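&lt;p&gt;A hedged sketch of the quadratic split (Guttman&apos;s PickSeeds/PickNext), with rectangles as plain &lt;code&gt;(xmin, ymin, xmax, ymax)&lt;/code&gt; tuples; the tie-breaking here is simplified compared to the paper:&lt;/p&gt;

```python
def area(r):
    return (r[2] - r[0]) * (r[3] - r[1])

def union(a, b):
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))

def quadratic_split(rects):
    # PickSeeds: the pair that wastes the most area if grouped together.
    n = len(rects)
    best, seeds = None, (0, 1)
    for i in range(n):
        for j in range(i + 1, n):
            waste = area(union(rects[i], rects[j])) - area(rects[i]) - area(rects[j])
            if best is None or waste > best:
                best, seeds = waste, (i, j)
    g1, g2 = [rects[seeds[0]]], [rects[seeds[1]]]
    m1, m2 = g1[0], g2[0]
    remaining = [r for k, r in enumerate(rects) if k not in seeds]
    # PickNext: repeatedly take the entry with the greatest difference in
    # enlargement, and assign it to the group it enlarges less.
    while remaining:
        def cost(r):
            d1 = area(union(m1, r)) - area(m1)
            d2 = area(union(m2, r)) - area(m2)
            return abs(d1 - d2), d1, d2
        r = max(remaining, key=lambda r: cost(r)[0])
        remaining.remove(r)
        _, d1, d2 = cost(r)
        if d2 > d1:
            g1.append(r); m1 = union(m1, r)
        else:
            g2.append(r); m2 = union(m2, r)
    return g1, g2
```

&lt;p&gt;Two well-separated clusters end up in separate groups, which is exactly the &quot;good split&quot; from the earlier figure.&lt;/p&gt;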

&lt;p&gt;The method &quot;AdjustTree&quot; is called whenever a new entry is inserted. It is responsible for adapting the parent&apos;s MBR and propagating the changes bottom-up, handling both splits and changes to MBRs. In the worst case, the propagation goes all the way up to the root node.&lt;/p&gt;

&lt;h3&gt;5. R-Tree Variants&lt;/h3&gt;

&lt;p&gt;R-trees do not guarantee good worst-case performance, but generally speaking, they perform well with real-world data. Addressing this specific problem, the &lt;a href=&quot;https://en.wikipedia.org/wiki/Priority_R-tree&quot; target=&quot;_blank&quot;&gt;Priority R-tree&lt;/a&gt; is a worst-case &lt;a href=&quot;https://en.wikipedia.org/wiki/Asymptotically_optimal_algorithm&quot; target=&quot;_blank&quot;&gt;asymptotically optimal&lt;/a&gt; alternative to the R-tree; it is essentially a hybrid between a k-dimensional tree (&lt;a href=&quot;https://en.wikipedia.org/wiki/K-d_tree&quot; target=&quot;_blank&quot;&gt;k-d tree&lt;/a&gt;) and an R-tree.&lt;/p&gt;

&lt;p&gt;Another commonly used variant is the &lt;a href=&quot;https://en.wikipedia.org/wiki/R*-tree&quot; target=&quot;_blank&quot;&gt;R*-Tree&lt;/a&gt;, which uses the same algorithm as the regular R-tree for query and delete operations. However, while inserting, the R*-tree uses a combined strategy: for leaf nodes, overlap is minimized, and for inner nodes, enlargement and area are minimized, making the tree construction slightly more expensive.&lt;/p&gt;

&lt;p&gt;The &lt;a href=&quot;https://en.wikipedia.org/wiki/R%2B_tree&quot; target=&quot;_blank&quot;&gt;R+-Tree&lt;/a&gt;, on the other hand, ensures that nodes do not overlap with each other, leading to better point query performance. However, it does so by inserting an object into multiple leaves if necessary, which is a disadvantage due to duplicate entries and a larger tree size.&lt;/p&gt;

&lt;p&gt;The &lt;a href=&quot;https://en.wikipedia.org/wiki/Hilbert_R-tree&quot; target=&quot;_blank&quot;&gt;Hilbert R-Tree&lt;/a&gt; uses &lt;a href=&quot;/spatial-index-space-filling-curve&quot;&gt;space-filling curves&lt;/a&gt;, specifically the Hilbert curve, to impose a linear ordering on the data rectangles. It has two variants: packed Hilbert R-trees, suitable for static databases in which updates are very rare, and dynamic Hilbert R-trees, suitable for databases where insertions, deletions, or updates may occur in real time.&lt;/p&gt;

&lt;h3&gt;6. Conclusion&lt;/h3&gt;

&lt;p&gt;R-trees have come a long way since the first paper was published in 1984. Today, their applications span multi-dimensional indexes, computer graphics, video games, spatial data management systems, and more.&lt;/p&gt;

&lt;p&gt;On the flip side, R-trees can degrade badly with discrete data; hence, it&apos;s highly recommended to understand the data representation before using them. R-trees are also relatively slow under a very high mutation rate, i.e., when the index changes often: constructing and updating the index is expensive (due to tree rebalancing), and the structure is optimized for search operations instead. Lastly, R-trees can be a poor choice when primarily dealing with points as opposed to polygons/regions.&lt;/p&gt;

&lt;h3&gt;7. References&lt;/h3&gt;
&lt;pre style=&quot;max-height: 300px&quot;&gt;&lt;code&gt;[1] A. Guttman, &quot;A Dynamic Index Structure for Spatial Searching,&quot; presented at the ACM SIGMOD International Conference on Management of Data, 1984. [Online]. Available: https://www.researchgate.net/publication/220805321_A_Dynamic_Index_Structure_for_Spatial_Searching.
[2] &quot;R-Tree,&quot; Wikipedia. [Online]. Available: https://en.wikipedia.org/wiki/R-tree.
[3] &quot;B-Trees and B+ Trees,&quot; PyBlog. [Online]. Available: https://www.pyblog.xyz/b-trees-b-plus-trees.
[4] &quot;Spatial Index R-Tree,&quot; YouTube, https://www.youtube.com/watch?v=U0jUvvQkaFw.
&lt;/code&gt;&lt;/pre&gt;</content><author><name>Adesh Nalpet Adimurthy</name></author><category term="System Wisdom" /><category term="Database" /><category term="Spatial Index" /><summary type="html">If you have been following the Spatial Index Series, it started with the need for multi-dimensional indexes and an introduction to space-filling curves, followed by a deep dive into grid systems (GeoHash and Google S2) and tessellation (Uber H3).</summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://pyblog.xyz/assets/featured/webp/rtree-spatial-index.webp" /><media:content medium="image" url="https://pyblog.xyz/assets/featured/webp/rtree-spatial-index.webp" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Spatial Index: Tessellation</title><link href="https://pyblog.xyz/spatial-index-tessellation" rel="alternate" type="text/html" title="Spatial Index: Tessellation" /><published>2024-06-17T00:00:00+00:00</published><updated>2024-06-17T00:00:00+00:00</updated><id>https://pyblog.xyz/spatial-index-tessellation</id><content type="html" xml:base="https://pyblog.xyz/spatial-index-tessellation">&lt;p&gt;Brewing! This post is a continuation of &lt;a href=&quot;/spatial-index-grid-system&quot;&gt;Spatial Index: Grid Systems&lt;/a&gt;, where we will set the foundation for tessellation and delve into the details of &lt;a href=&quot;https://github.com/uber/h3&quot; target=&quot;_blank&quot;&gt;Uber H3&lt;/a&gt;.&lt;/p&gt;

&lt;details open=&quot;&quot;&gt;&lt;summary class=&quot;h3&quot;&gt;0. Foundation&lt;/summary&gt;
&lt;p&gt;Tessellation or tiling is the process of covering/dividing a space into smaller, non-overlapping shapes that fit together perfectly without gaps or overlaps. In spatial indexing, tessellation is used to break down the Earth&apos;s surface into manageable units for efficient data storage, querying, and analysis.&lt;/p&gt;

&lt;p&gt;The rationale behind why a geographical grid system (&lt;a href=&quot;cartograms-documentation#tessellation&quot; target=&quot;_blank&quot;&gt;Tessellation system&lt;/a&gt;) is necessary: The real world is cluttered with various geographical elements, both natural and man-made, none of which follow any consistent structure. To perform geographic algorithms or analyses on it, we need a more abstract form.&lt;/p&gt;

&lt;p&gt;Maps are a good start and are the most common abstraction, with which most people are familiar. However, maps still contain all sorts of inconsistencies. This calls for a grid system, which takes the cluttered geographic space and provides a more clean and structured mathematical space, making it much easier to perform computations and queries.&lt;/p&gt;

&lt;img class=&quot;center-image-0 center-image-100&quot; src=&quot;./assets/posts/spatial-index/h3-why-grids.png&quot; /&gt;
&lt;p class=&quot;figure-header&quot;&gt;Figure 0: Tessellated View of Halifax&lt;/p&gt;

&lt;p&gt;The primary principle of the grid is to break the space into uniform cells. These cells are the units of analysis used in geographic systems. Think of it as pixels in an image.&lt;/p&gt;

&lt;p&gt;A grid system adds a couple more layers on top of this, consisting of a series of nested grids, usually at increasingly fine resolutions. They include a way to uniquely identify any cell in the system. Other common grid systems include &lt;a href=&quot;https://en.wikipedia.org/wiki/Graticule_(cartography)&quot; target=&quot;_blank&quot;&gt;Graticule&lt;/a&gt; (latitude and longitude), &lt;a href=&quot;https://learn.microsoft.com/en-us/bingmaps/articles/bing-maps-tile-system#tile-coordinates-and-quadkeys&quot; target=&quot;_blank&quot;&gt;Quad Key&lt;/a&gt;  (Mercator projection), &lt;a href=&quot;/spatial-index-grid-system#3-geohash&quot; target=&quot;_blank&quot;&gt;Geohash&lt;/a&gt; (Equirectangular projection) and &lt;a href=&quot;/spatial-index-grid-system#4-google-s2&quot; target=&quot;_blank&quot;&gt;Google S2&lt;/a&gt; (Spherical projection).&lt;/p&gt;
&lt;/details&gt;

&lt;hr class=&quot;clear-hr&quot; /&gt;

&lt;details open=&quot;&quot;&gt;&lt;summary class=&quot;h3&quot;&gt;1. Uber H3 - Intuition&lt;/summary&gt;
&lt;p&gt;Most systems use four-sided cells (squares, rectangles, or other quadrilaterals). H3 is the grid system developed by Uber, which uses hexagonal cells as its base. It covers the space/world with hexagons and has different levels of resolution, with the smallest cells representing about &lt;code&gt;1 m²&lt;/code&gt; of space.&lt;/p&gt;

&lt;h3&gt;1.1. Why Hexagons?&lt;/h3&gt;

&lt;p&gt;Start by imposing requirements on the choice of tile, such as:&lt;/p&gt;
&lt;ul style=&quot;list-style-type:none;&quot;&gt;
&lt;li&gt;(a) Uniform shape&lt;/li&gt;
&lt;li&gt;(b) Uniform edge length&lt;/li&gt;
&lt;li&gt;(c) Uniform angles&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These requirements narrow down the options, with the most commonly used shapes being squares, equilateral triangles, and hexagons.&lt;/p&gt;

&lt;img class=&quot;center-image-0 center-image-70&quot; src=&quot;./assets/posts/spatial-index/h3-tile-options-2.svg&quot; /&gt;
&lt;p class=&quot;figure-header&quot;&gt;Figure 1: Triangle vs Square vs Hexagon (neighbors)&lt;/p&gt;

&lt;p&gt;Another important property of tiles is uniform adjacency, i.e., how unambiguous the neighbors are. For example, squares have 4 unambiguous neighbors but also have 4 ambiguous neighbors at the corners, which may not provide the best perception of neighbors if you consider a circular radius.&lt;/p&gt; 

&lt;p&gt;Equilateral triangles are much worse, with 3 unambiguous neighbors and 9 ambiguous neighbors, which is one of the reasons why triangles are not commonly used, along with the rotation of cells necessary for tessellation. Lastly, hexagons are the best, with 6 unambiguous neighbors and a structure very close to finding neighbors by radius.&lt;/p&gt;

&lt;img class=&quot;center-image-0 center-image&quot; src=&quot;./assets/posts/spatial-index/hex-square-tessellation.png&quot; /&gt;
&lt;p class=&quot;figure-header&quot;&gt;Figure 2: Square vs Hexagon (Optimal Space-Filling)&lt;/p&gt;

&lt;p&gt;Hexagons are more space-efficient and have optimal space-filling properties. This means that when filling a polygon with uniform cells, hexagons generally result in less over/under filling compared to squares.&lt;/p&gt;

&lt;img class=&quot;center-image-0 center-image-50&quot; src=&quot;./assets/posts/spatial-index/h3-tile-options-3.svg&quot; /&gt;
&lt;p class=&quot;figure-header&quot;&gt;Figure 3: Square vs Hexagon (Child Containment)&lt;/p&gt;

&lt;p&gt;Hierarchical relationships between resolutions are another important property. Squares have clean hierarchical relationships with perfect child containment and can use algorithms such as quad trees to navigate up and down the hierarchy and space-filling curves to traverse the grid. Hexagons, while not having perfect child containment, can still function effectively with a tolerable margin of error.&lt;/p&gt;

&lt;p&gt;Setting triangles aside, here is a summary of the comparison between squares and hexagons:&lt;/p&gt;

&lt;img class=&quot;center-image-0 center-image-50&quot; src=&quot;./assets/posts/spatial-index/h3-tile-options.svg&quot; /&gt;
&lt;p class=&quot;figure-header&quot;&gt;Figure 4: Squares vs Hexagons (Full Comparison)&lt;/p&gt;

&lt;p&gt;More on Hexagons vs Squares at &lt;a href=&quot;/cartograms-documentation#hexagonsvssquares&quot;&gt;Conceptualization of a Cartogram&lt;/a&gt;&lt;/p&gt;

&lt;hr class=&quot;sub-hr&quot; /&gt;

&lt;h3&gt;1.2. Why Icosahedron?&lt;/h3&gt;

&lt;p&gt;Lastly, low shape and area distortion depends more on the projection than on the shape of the tile. There are many types of projections; grid systems most commonly project onto polyhedra. An alternative is the &lt;a href=&quot;/spatial-index-grid-system#3-1-geohash-intuition&quot;&gt;cylindrical projection&lt;/a&gt;, used in &lt;a href=&quot;/spatial-index-grid-system#3-geohash&quot;&gt;Geohash&lt;/a&gt;, which works well for squares but suffers from distortion near the poles, making it hard to get cells of equal surface area across the projection.&lt;/p&gt;

&lt;img class=&quot;center-image-0 center-image-80&quot; src=&quot;./assets/posts/spatial-index/uniform-shape-polyhedrons.png&quot; /&gt;
&lt;p class=&quot;figure-header&quot;&gt;Figure 5: Uniform Shape Polyhedrons&lt;/p&gt;

&lt;p&gt;The smaller the face, the smaller the distortion. An icosahedron, with 20 faces, is the best option among the uniform-face polyhedra for fitting hexagons and triangles. Fitting squares on an icosahedron or even a tetrahedron is not ideal; squares are mostly suitable for cubes (as seen in &lt;a href=&quot;/spatial-index-grid-system#4-google-s2&quot;&gt;S2&lt;/a&gt;). Taking the best of both worlds, an icosahedron with hexagons is the way to go.&lt;/p&gt;

&lt;h3&gt;1.3. H3 Grid System&lt;/h3&gt;

&lt;p&gt;Putting it all together, we take the polyhedron, the &lt;a href=&quot;https://en.wikipedia.org/wiki/Icosahedron&quot; target=&quot;_blank&quot;&gt;icosahedron&lt;/a&gt;, project it onto the surface of the Earth, and split each face into hexagonal cells. More specifically, 4 full hexagon cells are completely contained by a face, 3 cells are half contained, and the 3 corners each contribute a slice of a vertex-centered pentagon.&lt;/p&gt;

&lt;img class=&quot;center-image-0 center-image-70&quot; src=&quot;./assets/posts/spatial-index/h3-tessellation.svg&quot; /&gt;
&lt;p&gt;Each hexagonal cell can be further subdivided into 7 hexagon cells with marginal error for containment. The number of levels decides the resolution.&lt;/p&gt;
&lt;img class=&quot;center-image-0 center-image&quot; src=&quot;./assets/posts/spatial-index/h3-tessellation-2.svg&quot; /&gt;
&lt;p class=&quot;figure-header&quot;&gt;Figure 6: H3 Projection and Tessellation&lt;/p&gt;

&lt;p&gt;The H3 grid system divides the surface of the Earth into &lt;code&gt;122&lt;/code&gt; (110 hexagons and 12 icosahedron vertex-centered pentagons) base cells (resolution 0), which are used as the foundation for higher resolution cells. Each base cell has a specific orientation relative to the face of the icosahedron it is on. This orientation determines how cells at higher resolutions are positioned and indexed.&lt;/p&gt;
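&lt;p&gt;Since each of the 122 base cells subdivides with aperture 7 (and each of the 12 pentagons has one fewer child than a hexagon), the total number of cells at resolution &lt;code&gt;r&lt;/code&gt; works out to the closed form &lt;code&gt;2 + 120·7^r&lt;/code&gt;. A minimal sketch to tabulate this (the class and method names are illustrative, not part of the H3 library):&lt;/p&gt;

```java
public class H3CellCounts {
    // Total number of H3 cells at a given resolution:
    // 122 base cells at resolution 0 (110 hexagons + 12 pentagons);
    // each refinement step yields the closed form 2 + 120 * 7^r.
    static long cellCount(int res) {
        long pow7 = 1;
        for (int i = 0; i < res; i++) pow7 *= 7;
        return 2 + 120 * pow7;
    }

    public static void main(String[] args) {
        for (int r = 0; r <= 15; r++) {
            System.out.println("res " + r + ": " + cellCount(r) + " cells");
        }
    }
}
```

&lt;p&gt;For resolution 0 this gives the 122 base cells; resolution 15 works out to roughly &lt;code&gt;5.7 × 10^14&lt;/code&gt; cells, comfortably within a 64-bit integer.&lt;/p&gt;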

&lt;h3&gt;1.4. Why Pentagons?&lt;/h3&gt;

&lt;p&gt;Looking at the icosahedron, 5 faces come together at every vertex, and truncating a vertex creates a pentagonal base cell. Pentagons are unavoidable at the vertices; however, there are only 12 of them at every resolution. And for most use cases dealing with spaces within a city, where the resolution is higher than 9, the pentagons sit far off in the water and are safe to ignore.&lt;/p&gt;

&lt;img class=&quot;center-image-0 center-image-80&quot; src=&quot;./assets/posts/spatial-index/dymaxion-layout.svg&quot; /&gt;
&lt;p class=&quot;figure-header&quot;&gt;Figure 7: Dymaxion layout (12 Vertices in Water)&lt;/p&gt;

&lt;p&gt;While the layout of the faces on the icosahedron can be done in any fashion, H3 uses the layout developed by Buckminster Fuller called the &lt;a href=&quot;https://en.wikipedia.org/wiki/Dymaxion_map&quot; target=&quot;_blank&quot;&gt;Dymaxion layout&lt;/a&gt;.&lt;/p&gt;

&lt;img class=&quot;center-image-0 center-image-30&quot; src=&quot;./assets/posts/spatial-index/h3-tessellation.gif&quot; /&gt;
&lt;p class=&quot;figure-header&quot;&gt;Figure 8: H3 Projection and Tessellation (Animated)&lt;/p&gt;

&lt;p&gt;The benefit is that all the vertices end up in the water. For most applications, land is more important than water, and since the vertices are in the water, it reduces the need to deal with pentagons.&lt;/p&gt;

&lt;h3&gt;1.5. Cell ID&lt;/h3&gt;
&lt;p&gt;A cell ID is a 64-bit integer that uniquely identifies a hexagonal cell at a particular resolution. The composition of an H3 cell ID is as follows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Mode (4 bits): Identifies the H3 mode, which indicates the type of the identifier. For cell IDs, this value is set to 1.&lt;/li&gt;
&lt;li&gt;Edge Mode (Reserved, 3 bits): Indicates the edge mode, which is 0 for cell IDs.&lt;/li&gt;
&lt;li&gt;Resolution (4 bits): Specifies the resolution of the cell. H3 supports resolutions from 0 (coarsest) to 15 (finest).&lt;/li&gt;
&lt;li&gt;Base Cell (7 bits): Identifies the base cell, which is one of the 122 base cells that form the foundation of the H3 grid.&lt;/li&gt;
&lt;li&gt;Cell Index (45 bits): Contains the specific index of the cell within the base cell and resolution.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This structure (&lt;a href=&quot;#2-5-faceijk-to-h3-index&quot;&gt;Figure 14&lt;/a&gt;) allows H3 to efficiently encode the hierarchical location and resolution of each hexagonal cell in a compact 64-bit integer.&lt;/p&gt;
&lt;/details&gt;

&lt;hr class=&quot;clear-hr&quot; /&gt;

&lt;details open=&quot;&quot;&gt;&lt;summary class=&quot;h3&quot;&gt;2. H3 - Implementation&lt;/summary&gt;

&lt;p&gt;The implementation below loosely follows the steps of the actual H3 index calculation, for demonstration purposes (to better understand the H3 index). Here&apos;s a step-by-step process with reasonable simplifications:&lt;/p&gt;

&lt;h3&gt;2.1. LatLong to Vec3D&lt;/h3&gt;
&lt;p&gt;Convert latitude and longitude to &lt;a href=&quot;https://en.wikipedia.org/wiki/Cartesian_coordinate_system&quot; target=&quot;_blank&quot;&gt;3D Cartesian coordinates&lt;/a&gt; using the formulas (similar to Section &lt;a href=&quot;/spatial-index-grid-system#4-2-1-lat-long-to-x-y-z-&quot;&gt;4.2.1 in S2&lt;/a&gt;):&lt;/p&gt;

&lt;img class=&quot;center-image-0 center-image-80&quot; src=&quot;./assets/posts/spatial-index/ecef.svg&quot; /&gt; 
&lt;p class=&quot;figure-header&quot;&gt;Figure 9: (lat, long) to (x, y, z) Transformation&lt;/p&gt;

&lt;details class=&quot;code-container&quot; open=&quot;&quot;&gt;&lt;summary class=&quot;p&quot;&gt;2.1a. LatLong to Vec3D - Snippet&lt;/summary&gt;
&lt;pre&gt;&lt;code&gt;private static double[] latLonToVec3D(double lat, double lon) {
    double r = Math.cos(Math.toRadians(lat));
    double x = r * Math.cos(Math.toRadians(lon));
    double y = r * Math.sin(Math.toRadians(lon));
    double z = Math.sin(Math.toRadians(lat));
    return new double[]{x, y, z};
}
&lt;/code&gt;&lt;/pre&gt;
&lt;/details&gt;

&lt;hr class=&quot;hr&quot; /&gt;

&lt;h3&gt;2.2. Icosahedron Properties&lt;/h3&gt;
&lt;p&gt;We can identify the &lt;code&gt;12&lt;/code&gt; vertices of the icosahedron using the &lt;a href=&quot;https://en.wikipedia.org/wiki/Golden_ratio&quot; target=&quot;_blank&quot;&gt;golden ratio&lt;/a&gt; &lt;code&gt;(ϕ)&lt;/code&gt;. It is a well-known property of a regular icosahedron that three mutually perpendicular rectangles of aspect ratio &lt;code&gt;(ϕ)&lt;/code&gt; can be arranged to share a common center, with the icosahedron&apos;s vertices at their corners.&lt;/p&gt;

&lt;p&gt;The icosahedron has 12 vertices, 20 faces, and 30 edges. The 12 vertices are given by: &lt;code&gt;(±1, ±ϕ, 0)&lt;/code&gt;, &lt;code&gt;(±ϕ, 0, ±1)&lt;/code&gt;, &lt;code&gt;(0, ±1, ±ϕ)&lt;/code&gt;. Lastly, the vertices need to be normalized to lie on the surface of a unit sphere.&lt;/p&gt;

&lt;img class=&quot;center-image-0 center-image&quot; src=&quot;./assets/posts/spatial-index/golden-ratio.svg&quot; /&gt; 
&lt;p class=&quot;figure-header&quot;&gt;Figure 10: Golden Ratio Rectangles&lt;/p&gt;

&lt;p&gt;To calculate the &lt;code&gt;20&lt;/code&gt; face centers of the icosahedron:&lt;/p&gt;
&lt;p&gt;For each face, average the coordinates of its three vertices and normalize the resulting vector to lie on the unit sphere. Use the formula:&lt;/p&gt;

&lt;img class=&quot;center-image-0 center-image-65&quot; src=&quot;./assets/posts/spatial-index/face-centers.svg&quot; /&gt; 
&lt;p class=&quot;figure-header&quot;&gt;Figure 11: Icosahedron Face Center&lt;/p&gt;

&lt;details class=&quot;code-container&quot;&gt;&lt;summary class=&quot;p&quot;&gt;2.2a. Icosahedron Vertices - Snippet&lt;/summary&gt;
&lt;pre&gt;&lt;code&gt;double PHI = (1.0 + Math.sqrt(5.0)) / 2.0;
double[][] vertices = {
        {-1, PHI, 0}, {1, PHI, 0}, {-1, -PHI, 0}, {1, -PHI, 0},
        {0, -1, PHI}, {0, 1, PHI}, {0, -1, -PHI}, {0, 1, -PHI},
        {PHI, 0, -1}, {PHI, 0, 1}, {-PHI, 0, -1}, {-PHI, 0, 1}
};

// Normalize the vertices to lie on the unit sphere
for (int i = 0; i &amp;lt; vertices.length; i++) {
    vertices[i] = normalize(vertices[i]);
}
&lt;/code&gt;&lt;/pre&gt;

&lt;pre&gt;&lt;code&gt;// Computes the center of a face defined by three vertices.
private static double[] computeFaceCenter(double[] a, double[] b, double[] c) {
    double[] center = new double[3];
    center[0] = (a[0] + b[0] + c[0]) / 3.0;
    center[1] = (a[1] + b[1] + c[1]) / 3.0;
    center[2] = (a[2] + b[2] + c[2]) / 3.0;
    return normalize(center);
}
&lt;/code&gt;&lt;/pre&gt;

&lt;pre&gt;&lt;code&gt;// Normalizes a vector to lie on the unit sphere.
private static double[] normalize(double[] v) {
    double length = Math.sqrt(v[0] * v[0] + v[1] * v[1] + v[2] * v[2]);
    return new double[]{v[0] / length, v[1] / length, v[2] / length};
}
&lt;/code&gt;&lt;/pre&gt;
&lt;/details&gt;

&lt;hr class=&quot;hr&quot; /&gt;

&lt;h3&gt;2.3. Vec3D to Vec2D&lt;/h3&gt;
&lt;p&gt;The &lt;code&gt;Vec2D&lt;/code&gt; represents the cartesian coordinates on the face of the icosahedron. It provides a 2D projection (&lt;a href=&quot;#1-4-why-pentagons-&quot;&gt;Figure 7&lt;/a&gt;) of a point on the spherical surface of the Earth onto one of the icosahedron&apos;s faces, used to map geographic coordinates (latitude and longitude) onto a planar hexagonal grid. The conversion involves &lt;a href=&quot;https://en.wikipedia.org/wiki/Gnomonic_projection&quot; target=&quot;_blank&quot;&gt;gnomonic projection&lt;/a&gt;, which translates 3D coordinates to a 2D plane by projecting from the center of the sphere to the plane tangent to the face of the icosahedron.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Calculate &lt;code&gt;r&lt;/code&gt; (Radial Distance): Convert the distance from the face center to an angle using the inverse cosine function.&lt;/li&gt;
&lt;li&gt;Gnomonic Scaling: Scale the angle &lt;code&gt;r&lt;/code&gt; for the hexagonal grid at the given resolution.&lt;/li&gt;
&lt;li&gt;Calculate θ (Azimuthal Angle): Determine the angle from the face center, adjusting for face orientation and resolution.&lt;/li&gt;
&lt;li&gt;Convert to local 2D Coordinates: Transform polar coordinates &lt;code&gt;(r, θ)&lt;/code&gt; into Cartesian coordinates &lt;code&gt;(x, y)&lt;/code&gt;.&lt;/li&gt;   
&lt;/ul&gt;

&lt;img class=&quot;center-image-0 center-image&quot; src=&quot;./assets/posts/spatial-index/h3-to-vec2d.svg&quot; /&gt; 
&lt;p class=&quot;figure-header&quot;&gt;Figure 12: Gnomonic Projection (XYZ to rθ)&lt;/p&gt;

&lt;details class=&quot;code-container&quot;&gt;&lt;summary class=&quot;p&quot;&gt;2.3a. Vec3D to Vec2D - Snippet&lt;/summary&gt;
&lt;pre&gt;&lt;code&gt;// faceAxesAzRadsCII: Icosahedron face `ijk` axes as azimuth in radians from face center to vertex
// faceCenterGeo: Icosahedron face centers in lat/lng radians.
// RES0_U_GNOMONIC: Scaling factor from `Vec2d` resolution 0 unit length (or distance between adjacent cell center points on the plane) to gnomonic unit length.
// SQRT7_POWERS: Power of √7 for each resolution.
// AP7_ROT_RADS: Rotation angle between Class II and Class III resolution axes: asin(sqrt(3/28))

public Vec2d toVec2d(int resolution, int face, double distance) {
    // cos(r) = 1 - 2 * sin^2(r/2) = 1 - 2 * (sqd / 4) = 1 - sqd/2
    double r = acos(1.0 - distance / 2.0);
    if (r &amp;lt; EPSILON) {
        return new Vec2d(0.0, 0.0);
    }
    
    // Perform gnomonic scaling of `r` (`tan(r)`) and scale for current
    r = (tan(r) / RES0_U_GNOMONIC) * SQRT7_POWERS[resolution];
    
    // Compute counter-clockwise `theta` from Class II i-axis.
    double theta = faceAxesAzRadsCII[face][0] - this.azimuth(faceCenterGeo[face]);
    
    // Adjust `theta` for Class III.
    if ((resolution % 2) != 0) {
        theta -= AP7_ROT_RADS;
    }
    
    // Convert to local x, y.
    return new Vec2d(r * cos(theta), r * sin(theta));
}
&lt;/code&gt;&lt;/pre&gt;
&lt;/details&gt;

&lt;p&gt;About &lt;code&gt;SQRT7_POWERS&lt;/code&gt;: each resolution beyond 0 is created using an aperture-7 resolution spacing, i.e. each cell is subdivided into 7 cells at the next finer resolution (&lt;a href=&quot;#1-uber-h3-intuition&quot;&gt;Figures 1 and 3&lt;/a&gt;). So, as resolution increases, the unit length is scaled by &lt;code&gt;sqrt(7)&lt;/code&gt;. H3 has 15 resolutions/levels beyond resolution 0, i.e. 16 in total.&lt;/p&gt;
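&lt;p&gt;As a quick sketch (assuming only that the table holds &lt;code&gt;√7&lt;/code&gt; raised to each resolution), &lt;code&gt;SQRT7_POWERS&lt;/code&gt; can be precomputed once for all 16 resolutions:&lt;/p&gt;

```java
public class Sqrt7Powers {
    // sqrt(7)^r for resolutions 0..15: the unit-length scaling factor
    // between resolution 0 and resolution r in an aperture-7 grid.
    static final double[] SQRT7_POWERS = new double[16];
    static {
        SQRT7_POWERS[0] = 1.0;
        for (int r = 1; r < SQRT7_POWERS.length; r++) {
            SQRT7_POWERS[r] = SQRT7_POWERS[r - 1] * Math.sqrt(7.0);
        }
    }

    public static void main(String[] args) {
        // every two resolutions, the scale grows by a factor of 7
        System.out.println(SQRT7_POWERS[2] / SQRT7_POWERS[0]);
    }
}
```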

&lt;hr class=&quot;hr&quot; /&gt;

&lt;h3&gt;2.4. Vec2D to FaceIJK&lt;/h3&gt;
&lt;p&gt;Hexagonal grids have three primary axes, unlike the two we have for square grids. In &lt;a href=&quot;https://www.redblobgames.com/grids/hexagons/#coordinates&quot; target=&quot;_blank&quot;&gt;Axial&lt;/a&gt; or Cube coordinates, the three coordinates (i, j, k) ensure that any point in the hexagonal grid can be described without ambiguity.&lt;/p&gt;

&lt;img class=&quot;center-image-0 center-image-70&quot; src=&quot;./assets/posts/spatial-index/h3-axial.png&quot; /&gt; 
&lt;p class=&quot;figure-header&quot;&gt;Figure 13: Axial Coordinates (Class II and Class III)&lt;/p&gt;

&lt;p&gt;There are several other hex coordinate systems; in this case, the constraint is &lt;code&gt;i + j + k = 0&lt;/code&gt;, with &lt;code&gt;120°&lt;/code&gt; separation between the axes.&lt;/p&gt;
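&lt;p&gt;One handy consequence of the &lt;code&gt;i + j + k = 0&lt;/code&gt; constraint (a standard cube-coordinate property, shown here purely for illustration, not part of H3&apos;s indexing pipeline): the grid distance between two hexagons is half the Manhattan distance between their coordinate triples.&lt;/p&gt;

```java
public class HexGridDistance {
    // Grid distance between two hexagons in cube coordinates,
    // where each triple satisfies i + j + k = 0:
    // half the L1 (Manhattan) distance between the triples.
    static int distance(int i1, int j1, int k1, int i2, int j2, int k2) {
        return (Math.abs(i1 - i2) + Math.abs(j1 - j2) + Math.abs(k1 - k2)) / 2;
    }

    public static void main(String[] args) {
        // an immediate neighbor along the i-axis is at distance 1
        System.out.println(distance(0, 0, 0, 1, -1, 0)); // 1
        // two steps away in the grid
        System.out.println(distance(0, 0, 0, 2, -1, -1)); // 2
    }
}
```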

&lt;p&gt;The &lt;code&gt;faceIJK&lt;/code&gt; represents the position/location of a hexagon within a face of the icosahedron using three coordinates &lt;code&gt;(i, j, k)&lt;/code&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Reverse Conversion: Translate Cartesian coordinates into the hexagonal coordinate system by aligning them with the hex grid&apos;s axes.&lt;/li&gt;
&lt;li&gt;Quantize and Round: Convert floating-point coordinates to integer grid positions, determining the closest hexagon center.&lt;/li&gt;
&lt;/ul&gt;
&lt;img class=&quot;center-image-0 center-image-70&quot; src=&quot;./assets/posts/spatial-index/h3-vec2d-facexyz.svg&quot; /&gt; 
&lt;ul&gt;
&lt;li&gt;Check Hex Center and Round: Use remainders to accurately determine which hexagon the point falls into by rounding to the nearest hex center.&lt;/li&gt;
&lt;pre&gt;&lt;code&gt;// Determine i and j based on r1 and r2
IF r1 &amp;lt; 0.5 THEN
    IF r1 &amp;lt; 1 / 3 THEN
        i = m1
        j = m2 + (r2 &amp;gt;= (1 + r1) / 2)
    ELSE
        i = m1 + ((1 - r1) &amp;lt;= r2 &amp;amp;&amp;amp; r2 &amp;lt; (2 * r1))
        j = m2 + (r2 &amp;gt;= (1 - r1))
ELSE IF r1 &amp;lt; 2 / 3 THEN
    j = m2 + (r2 &amp;gt;= (1 - r1))
    i = m1 + ((2 * r1 - 1) &amp;gt;= r2 || r2 &amp;gt;= (1 - r1))
ELSE
    i = m1 + 1
    j = m2 + (r2 &amp;gt;= (r1 / 2))
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;/p&gt;
&lt;li&gt;Fold Across Axes if Necessary: Correct the coordinates if they fall into negative regions, ensuring the coordinates remain within the valid grid.&lt;/li&gt;
&lt;pre&gt;&lt;code&gt;IF value.x &amp;lt; 0 THEN
    offset = j % 2
    axis_i = (j + offset) / 2
    diff = i - axis_i
    i = i - 2 * diff - offset

IF value.y &amp;lt; 0 THEN
    i = i - (2 * j + 1) / 2
    j = -j
&lt;/code&gt;&lt;/pre&gt;
&lt;li&gt;Normalize: Adjust the coordinates to maintain the properties of the hexagonal grid, ensuring &lt;code&gt;i + j + k = 0&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;details class=&quot;code-container&quot;&gt;&lt;summary class=&quot;p&quot;&gt;2.4a. Vec2D to FaceIJK - Snippet&lt;/summary&gt;
&lt;pre&gt;&lt;code&gt;public static CoordIJK fromVec2d(Vec2d value) {
    int k = 0;

    double a1 = Math.abs(value.x);
    double a2 = Math.abs(value.y);

    // Reverse conversion
    double x2 = a2 / SIN60;
    double x1 = a1 + x2 / 2.0;

    // Quantize and round
    int m1 = (int) x1;
    int m2 = (int) x2;

    double r1 = x1 - m1;
    double r2 = x2 - m2;

    int i, j;
    if (r1 &amp;lt; 0.5) {
        if (r1 &amp;lt; 1.0 / 3.0) {
            i = m1;
            j = m2 + (r2 &amp;gt;= (1.0 + r1) / 2.0 ? 1 : 0);
        } else {
            i = m1 + ((1.0 - r1) &amp;lt;= r2 &amp;amp;&amp;amp; r2 &amp;lt; (2.0 * r1) ? 1 : 0);
            j = m2 + (r2 &amp;gt;= (1.0 - r1) ? 1 : 0);
        }
    } else if (r1 &amp;lt; 2.0 / 3.0) {
        j = m2 + (r2 &amp;gt;= (1.0 - r1) ? 1 : 0);
        i = m1 + ((2.0 * r1 - 1.0) &amp;gt;= r2 || r2 &amp;gt;= (1.0 - r1) ? 1 : 0);
    } else {
        i = m1 + 1;
        j = m2 + (r2 &amp;gt;= (r1 / 2.0) ? 1 : 0);
    }

    // Fold Across Axes if Necessary
    if (value.x &amp;lt; 0) {
        int offset = j % 2;
        int axis_i = (j + offset) / 2;
        int diff = i - axis_i;
        i = i - 2 * diff - offset;
    }

    if (value.y &amp;lt; 0) {
        i = i - (2 * j + 1) / 2;
        j = -j;
    }

    return new CoordIJK(i, j, k).normalize();
}
&lt;/code&gt;&lt;/pre&gt;
&lt;/details&gt;

&lt;p&gt;Each grid resolution is rotated &lt;code&gt;~19.1°&lt;/code&gt; relative to the next coarser resolution. The rotation alternates between counterclockwise (CCW) and clockwise (CW) at each successive resolution, so that each resolution will have one of two possible orientations as shown in Figure 13: &lt;code&gt;Class II&lt;/code&gt; or &lt;code&gt;Class III&lt;/code&gt;. The base cells, which make up resolution 0, are &lt;code&gt;Class II&lt;/code&gt;.&lt;/p&gt;

&lt;hr class=&quot;hr&quot; /&gt;

&lt;h3&gt;2.5. FaceIJK to H3 Index&lt;/h3&gt;
&lt;p&gt;Lastly, the &lt;a href=&quot;https://h3geo.org/docs/core-library/latLngToCellDesc&quot; target=&quot;_blank&quot;&gt;face and face-centered ijk coordinates are converted to H3 Index&lt;/a&gt;.&lt;/p&gt; 

&lt;img class=&quot;center-image-0 center-image-100&quot; src=&quot;./assets/posts/spatial-index/h3-index-structure.svg&quot; /&gt; 
&lt;p class=&quot;figure-header&quot;&gt;Figure 14: H3 Index Structure&lt;/p&gt;

&lt;p&gt;If the cell&apos;s resolution is below 15, the remaining digit bits are set to 1s, for example: &lt;code&gt;83001dfffffffff&lt;/code&gt;. The binary representation is as below (Figure 15): &lt;code&gt;Index mode = 1&lt;/code&gt;, i.e. it indexes the regular hexagon type; Resolution = 3; Base Cell = 0; the digits at resolutions 1, 2 and 3 are 0, 3 and 5, and the rest are all 1s.&lt;/p&gt;
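&lt;p&gt;To make the example concrete, the fields can be unpacked with plain bit shifts following the layout above (45 digit bits, then 7 base-cell bits, 4 resolution bits, 3 reserved bits, and 4 mode bits). This is a small sketch, not the library&apos;s API:&lt;/p&gt;

```java
public class H3BitFields {
    // Field offsets per the H3 cell index layout:
    // bits 0-44: fifteen 3-bit resolution digits, bits 45-51: base cell,
    // bits 52-55: resolution, bits 56-58: reserved, bits 59-62: mode.
    static int mode(long h)       { return (int) ((h >>> 59) & 0xF); }
    static int resolution(long h) { return (int) ((h >>> 52) & 0xF); }
    static int baseCell(long h)   { return (int) ((h >>> 45) & 0x7F); }

    // 3-bit digit for resolution r (1-based); resolution 1 occupies
    // the highest digit slot, resolution 15 the lowest.
    static int digit(long h, int r) { return (int) ((h >>> (3 * (15 - r))) & 0x7); }

    public static void main(String[] args) {
        long h = 0x83001dfffffffffL; // the example index from Figure 15
        System.out.println(mode(h));       // 1 (cell index)
        System.out.println(resolution(h)); // 3
        System.out.println(baseCell(h));   // 0
        System.out.println(digit(h, 1) + " " + digit(h, 2) + " " + digit(h, 3)); // 0 3 5
        System.out.println(digit(h, 4));   // 7: unused digits are all 1s
    }
}
```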

&lt;img class=&quot;center-image-0 center-image-100&quot; src=&quot;./assets/posts/spatial-index/h3-index-structure-example.svg&quot; /&gt; 
&lt;p class=&quot;figure-header&quot;&gt;Figure 15: H3 Index Structure (Example: 83001dfffffffff)&lt;/p&gt;

&lt;p&gt;This primarily involves converting to Direction bits, representing the hierarchical path from a base cell to a specific cell at a given resolution. These bits encode the sequence of directional steps taken within the hexagonal grid to reach the target cell from the base cell.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Handle Base Cell: If the resolution is 0 (base cell), directly set the base cell in the index.&lt;/li&gt;
&lt;pre&gt;&lt;code&gt;// Convert IJK to Direction Bits
faceIJK.coord = directions_bits_from_ijk(faceIJK.coord, resolution)

// Set the Base Cell
base_cell = get_base_cell(faceIJK)
bits = set_base_cell(bits, base_cell)
&lt;/code&gt;&lt;/pre&gt;
&lt;li&gt;Build from Finest Resolution Up and Set Base Cell: Convert IJK coordinates to direction bits starting from the finest resolution (r), updating the index progressively. Identify and set the correct base cell for the given IJK coordinates.&lt;/li&gt;
&lt;pre&gt;&lt;code&gt;// Handle Pentagon Cells
IF base_cell.is_pentagon() THEN
    IF first_axe(bits) == Direction.K THEN
        // Check for a CW/CCW offset face (default is CCW).
        IF base_cell.is_cw_offset(faceIJK.face) THEN
            bits = rotate60(bits, 1, CW)
        ELSE
            bits = rotate60(bits, 1, CCW)
        END IF
    END IF
    FOR i = 0 TO rotation_count DO
        bits = pentagon_rotate60(bits, CCW)
    END FOR
ELSE
    bits = rotate60(bits, rotation_count, CCW)
END IF
&lt;/code&gt;&lt;/pre&gt;
&lt;li&gt;Handle Pentagon Cells: Apply necessary rotations if the base cell is a pentagon to ensure the correct orientation and avoid the missing k-axes subsequence (if the direction bits indicate a move along the &lt;code&gt;k-axis&lt;/code&gt;).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Since each base cell can be oriented differently (&lt;a href=&quot;#1-3-h3-grid-system&quot;&gt;Section 1.3&lt;/a&gt;) on the icosahedron&apos;s faces, rotations are needed to standardize these orientations. &lt;code&gt;rotation_count&lt;/code&gt; refers to the number of 60-degree rotations that need to be applied to the H3 cell index to align it with the canonical orientation of the base cell (also &lt;a href=&quot;https://h3geo.org/docs/core-library/latLngToCellDesc&quot; target=&quot;_blank&quot;&gt;refer&lt;/a&gt;).&lt;/p&gt;


&lt;hr class=&quot;hr&quot; /&gt;

&lt;h3&gt;2.6. Official H3 library&lt;/h3&gt;
&lt;p&gt;Here&apos;s a Java snippet using the official H3 library provided by Uber:&lt;/p&gt;
&lt;details open=&quot;&quot; class=&quot;code-container&quot;&gt;&lt;summary class=&quot;p&quot;&gt;2.6a. Official H3 - Snippet&lt;/summary&gt;
&lt;pre&gt;&lt;code&gt;import com.uber.h3core.H3Core;

public class H3Index {
    public static void main(String[] args) throws Exception {
        H3Core h3 = H3Core.newInstance();
        double lat = 37.7749;
        double lon = -122.4194;
        int resolution = 9;

        long h3Index = h3.geoToH3(lat, lon, resolution);
        System.out.println(Long.toHexString(h3Index));
    }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;/details&gt;


&lt;/details&gt;

&lt;hr class=&quot;clear-hr&quot; /&gt;

&lt;details open=&quot;&quot;&gt;&lt;summary class=&quot;h3&quot;&gt;3. H3 - Conclusion&lt;/summary&gt;
&lt;p&gt;So far, in the Spatial Index Series, we have seen the use of space-filling curves and their application in grid systems like Geohash and S2. Finally, we explored Uber&apos;s H3, which falls under grid systems and, more specifically, relies on tessellation. By now, it&apos;s likely clear that H3 indexes are not directly queryable on the database by ranges or prefixes; instead, they shine at accurately filling a polygon, nearby search by radius, high resolutions, and more.&lt;/p&gt;

&lt;img class=&quot;center-image-0 center-image-70&quot; src=&quot;./assets/posts/spatial-index/h3_level_0_1.png&quot; /&gt; 
&lt;p class=&quot;figure-header&quot;&gt;Figure 16: H3 grid segmentation (Level 0 and Level 1)&lt;/p&gt;

&lt;p&gt;If you missed the series, it starts with &lt;a href=&quot;/spatial-index-space-filling-curve&quot;&gt;Spatial Index: Space-Filling Curves&lt;/a&gt;, followed by &lt;a href=&quot;/spatial-index-grid-system&quot;&gt;Spatial Index: Grid Systems&lt;/a&gt;, and finally, the current post, &lt;a href=&quot;#spatial-index-tessellation&quot;&gt;Spatial Index: Tessellation&lt;/a&gt;.&lt;/p&gt;
&lt;/details&gt;

&lt;hr class=&quot;clear-hr&quot; /&gt;

&lt;details&gt;&lt;summary class=&quot;h3&quot;&gt;4. References&lt;/summary&gt;
&lt;pre style=&quot;max-height: 300px&quot;&gt;&lt;code&gt;1. Uber Technologies, Inc., &quot;H3: A Hexagonal Hierarchical Spatial Index,&quot; GitHub. [Online]. Available: https://github.com/uber/h3.
2. Wikipedia, &quot;Graticule,&quot; [Online]. Available: https://en.wikipedia.org/wiki/Graticule.
3. Microsoft, &quot;QuadKey,&quot; Microsoft Docs. [Online]. Available: https://learn.microsoft.com/en-us/bingmaps/articles/bing-maps-tile-system.
4. Wikipedia, &quot;Geohash,&quot; [Online]. Available: https://en.wikipedia.org/wiki/Geohash.
5. Google, &quot;Google S2 Geometry Library,&quot; [Online]. Available: https://s2geometry.io/.
6. Wikipedia, &quot;Icosahedron,&quot; [Online]. Available: https://en.wikipedia.org/wiki/Icosahedron.
7. Wikipedia, &quot;Dot product,&quot; [Online]. Available: https://en.wikipedia.org/wiki/Dot_product.
8. Wikipedia, &quot;Basis vectors,&quot; [Online]. Available: https://en.wikipedia.org/wiki/Basis_(linear_algebra).
9. Wikipedia, &quot;3D Cartesian coordinates,&quot; [Online]. Available: https://en.wikipedia.org/wiki/Cartesian_coordinate_system.
10. A. N. Adimurthy, &quot;Spatial Index: Tessellation,&quot; PyBlog. [Online]. Available: https://www.pyblog.xyz/spatial-index-tessellation.
11. Wikipedia, &quot;Conceptualization of a Cartogram,&quot; [Online]. Available: https://en.wikipedia.org/wiki/Cartogram.
12. Wikipedia, &quot;Golden ratio,&quot; [Online]. Available: https://en.wikipedia.org/wiki/Golden_ratio.
13. Wikipedia, &quot;Icosahedron vertices,&quot; [Online]. Available: https://en.wikipedia.org/wiki/Icosahedron#Vertices.
14. Wikipedia, &quot;H3: A Hexagonal Hierarchical Spatial Index,&quot; [Online]. Available: https://en.wikipedia.org/wiki/H3_(spatial_index).
15. Wikipedia, &quot;Dymaxion map,&quot; [Online]. Available: https://en.wikipedia.org/wiki/Dymaxion_map.
16. K. Sahr, &quot;Geodesic Discrete Global Grid Systems,&quot; Southern Oregon University. [Online]. Available: https://webpages.sou.edu/~sahrk/sqspc/pubs/gdggs03.pdf.
17. D. F. Marble, &quot;The Fundamental Data Structures for Implementing Digital Tessellation,&quot; University of Edinburgh. [Online]. Available: https://www.geos.ed.ac.uk/~gisteac/gis_book_abridged/files/ch36.pdf.
18. J. Castner, &quot;The Application of Tessellation in Geographic Data Handling,&quot; Semantic Scholar. [Online]. Available: https://pdfs.semanticscholar.org/feb2/3e69e19875817848ac8694b15f58d2ef52b0.pdf.
19. &quot;Hexagonal Tessellation and Its Application in Geographic Information Systems,&quot; YouTube. [Online]. Available: https://www.youtube.com/watch?v=wDuKeUkNLkQ&amp;amp;list=PL0HGds8aHQsAYm86RzQdZtFFeLpIOjk00.
20. Hydronium Labs. &quot;h3o: A safer, faster, and more flexible H3 library written in Rust.&quot; GitHub Repository. Available: https://github.com/HydroniumLabs/h3o/tree/master.
&lt;/code&gt;&lt;/pre&gt;
&lt;/details&gt;</content><author><name>Adesh Nalpet Adimurthy</name></author><category term="System Wisdom" /><category term="Database" /><category term="Spatial Index" /><summary type="html">Brewing! this post a continuation of Spatial Index: Grid Systems where we will set the foundation for tessellation and delve into the details of Uber H3</summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://pyblog.xyz/assets/featured/webp/space-tessellation.webp" /><media:content medium="image" url="https://pyblog.xyz/assets/featured/webp/space-tessellation.webp" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Spatial Index: Grid Systems</title><link href="https://pyblog.xyz/spatial-index-grid-system" rel="alternate" type="text/html" title="Spatial Index: Grid Systems" /><published>2024-06-12T00:00:00+00:00</published><updated>2024-06-12T00:00:00+00:00</updated><id>https://pyblog.xyz/spatial-index-grid-system</id><content type="html" xml:base="https://pyblog.xyz/spatial-index-grid-system">&lt;p&gt;This post is a continuation of &lt;a href=&quot;/spatial-index-space-filling-curve&quot;&gt;Stomping Grounds: Spatial Indexes&lt;/a&gt;, but don’t worry if you missed the first part—you’ll still find plenty of new insights right here.&lt;/p&gt;

&lt;h3&gt;3. Geohash&lt;/h3&gt;

&lt;p&gt;&lt;a href=&quot;https://en.wikipedia.org/wiki/Geohash&quot; target=&quot;_blank&quot;&gt;Geohash&lt;/a&gt;, invented in 2008 by Gustavo Niemeyer, encodes a geographic location into a short string of letters and digits. It&apos;s a hierarchical spatial data structure that subdivides space into grid-shaped buckets using a Z-order curve (&lt;a href=&quot;/spatial-index-space-filling-curve#2-space-filling-curves&quot;&gt;Section 2.&lt;/a&gt;).&lt;/p&gt;

&lt;details open=&quot;&quot; class=&quot;text-container&quot;&gt;&lt;summary class=&quot;h4&quot;&gt;3.1. Geohash - Intuition&lt;/summary&gt;

&lt;p&gt;The Earth is round, or more accurately, an ellipsoid. A map projection is a set of transformations to represent the globe on a plane: coordinates (latitude and longitude) of locations on the surface of the globe are transformed to coordinates on a plane. Geohash uses the &lt;a href=&quot;https://en.wikipedia.org/wiki/Equirectangular_projection&quot; target=&quot;_blank&quot;&gt;Equirectangular projection&lt;/a&gt;.&lt;/p&gt;

&lt;img class=&quot;center-image-0 center-image&quot; src=&quot;./assets/posts/spatial-index/projection.svg&quot; /&gt; 
&lt;p class=&quot;figure-header&quot;&gt;Figure 21: Equirectangular projection/ Equidistant Cylindrical Projection&lt;/p&gt;

&lt;p&gt;The core of GeoHash is just a clever use of Z-order curves. Split the map projection (a rectangle) into 2 equal rectangles, each identified by a unique bit string.&lt;/p&gt;

&lt;img class=&quot;center-image-0 center-image-90&quot; src=&quot;./assets/posts/spatial-index/geohash-level-0.svg&quot; /&gt; 
&lt;p class=&quot;figure-header&quot;&gt;Figure 22: GeoHash Level 1 - Computation&lt;/p&gt;

&lt;p&gt;Observation: the divisions along X and Y axes are interleaved between bit strings. For example: an arbitrary bit string &lt;code&gt;01110 01011 00000&lt;/code&gt;, follows:&lt;/p&gt;

&lt;img class=&quot;center-image-0 center-image&quot; src=&quot;./assets/posts/spatial-index/geohash-bit-interleave.svg&quot; /&gt;

&lt;p&gt;By further encoding this to Base32 (&lt;code&gt;0123456789bcdefghjkmnpqrstuvwxyz&lt;/code&gt;), we map a unique string to a quadrant in a grid, and quadrants that share the same prefix are closer to each other; e.g. &lt;code&gt;000000&lt;/code&gt; and &lt;code&gt;000001&lt;/code&gt;. By now we know that such interleaving traces out a Z-order curve.&lt;/p&gt;

&lt;img class=&quot;center-image-0 center-image-90&quot; src=&quot;./assets/posts/spatial-index/geohash-z-order.svg&quot; /&gt; 
&lt;p class=&quot;figure-header&quot;&gt;Figure 23: GeoHash Level 1 - Z-Order Curve&lt;/p&gt;

&lt;p&gt;Higher levels (higher-order Z-curves) lead to higher precision, and the geohash algorithm can be repeated iteratively to reach it. That&apos;s one cool property of geohash: adding more characters increases the precision of the location.&lt;/p&gt;

&lt;img class=&quot;center-image-0 center-image-80&quot; src=&quot;./assets/posts/spatial-index/geohash-level-1.svg&quot; /&gt; 
&lt;img class=&quot;center-image-0 center-image-80&quot; src=&quot;./assets/posts/spatial-index/geohash-level-2.svg&quot; /&gt; 
&lt;p class=&quot;figure-header&quot;&gt;Figure 24: GeoHash Level 2&lt;/p&gt;

&lt;p&gt;Despite the easy implementation and wide usage of geohash, it inherits the disadvantages of Z-order curves (&lt;a href=&quot;/spatial-index-space-filling-curve#2-5-z-order-curve-implementation&quot;&gt;Section 2.5&lt;/a&gt;): it weakly preserves latitude-longitude proximity and does not always guarantee that locations that are physically close are also close on the Z-curve.&lt;/p&gt;

&lt;p&gt;Adding to this is the use of the &lt;a href=&quot;https://en.wikipedia.org/wiki/Tissot%27s_indicatrix&quot; target=&quot;_blank&quot;&gt;equirectangular projection&lt;/a&gt;, where dividing the map into equal subspaces leads to unequal/disproportionate surface areas, especially near the poles. There are, however, alternatives such as &lt;a href=&quot;https://www.researchgate.net/publication/328727378_GEOHASH-EAS_-_A_MODIFIED_GEOHASH_GEOCODING_SYSTEM_WITH_EQUAL-AREA_SPACES&quot; target=&quot;_blank&quot;&gt;Geohash-EAS&lt;/a&gt; (Equal-Area Spaces).&lt;/p&gt;
&lt;/details&gt;
&lt;hr class=&quot;sub-hr&quot; /&gt;

&lt;details open=&quot;&quot; class=&quot;text-container&quot;&gt;&lt;summary class=&quot;h4&quot;&gt;3.2. Geohash - Implementation&lt;/summary&gt;
&lt;p&gt;To convert a geographical location (latitude, longitude) into a concise string of characters and vice versa:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Convert latitude and longitude to binary strings.&lt;/li&gt;
&lt;li&gt;Interleave the binary strings of latitude and longitude.&lt;/li&gt;
&lt;li&gt;Geohash: Convert the interleaved binary string into a base32 string.&lt;/li&gt;
&lt;/ul&gt;

&lt;hr class=&quot;sub-hr&quot; /&gt;

&lt;details class=&quot;code-container&quot;&gt;&lt;summary class=&quot;p&quot;&gt;3.2a. Geohash Encoder - Snippet&lt;/summary&gt;

&lt;pre&gt;&lt;code&gt;public class GeohashEncoder {

    public static String encodeGeohash(double latitude, double longitude, int precision) {
        // 1. Convert Lat and Long into a binary string based on the range.
        // Use (precision * 5 + 1) / 2 bits per coordinate so the interleaved string
        // always has at least precision * 5 bits (precision * 5 / 2 falls one bit
        // short for odd precision); extra bits are dropped by the final substring.
        String latBin = convertToBinary(latitude, -90, 90, (precision * 5 + 1) / 2);
        String lonBin = convertToBinary(longitude, -180, 180, (precision * 5 + 1) / 2);

        // 2. Interweave the binary strings.
        String interwovenBin = interweave(lonBin, latBin);

        // 3. Converts a binary string to a base32 geohash.
        String geohash = binaryToBase32(interwovenBin);

        return geohash.substring(0, precision);
    }

    private static String convertToBinary(double value, double min, double max, int precision) {
        StringBuilder binaryStr = new StringBuilder();
        for (int i = 0; i &amp;lt; precision; i++) {
            double mid = (min + max) / 2;
            if (value &amp;gt;= mid) {
                binaryStr.append(&apos;1&apos;);
                min = mid;
            } else {
                binaryStr.append(&apos;0&apos;);
                max = mid;
            }
        }
        return binaryStr.toString();
    }

    private static String interweave(String str1, String str2) {
        StringBuilder interwoven = new StringBuilder();
        for (int i = 0; i &amp;lt; str1.length(); i++) {
            interwoven.append(str1.charAt(i));
            interwoven.append(str2.charAt(i));
        }
        return interwoven.toString();
    }

    private static String binaryToBase32(String binaryStr) {
        String base32Alphabet = &quot;0123456789bcdefghjkmnpqrstuvwxyz&quot;;
        StringBuilder base32Str = new StringBuilder();
        for (int i = 0; i &amp;lt; binaryStr.length(); i += 5) {
            String chunk = binaryStr.substring(i, Math.min(i + 5, binaryStr.length()));
            int decimalVal = Integer.parseInt(chunk, 2);
            base32Str.append(base32Alphabet.charAt(decimalVal));
        }
        return base32Str.toString();
    }

    public static void main(String[] args) {
        double latitude = 37.7749;
        double longitude = -122.4194;
        int precision = 5;
        String geohash = encodeGeohash(latitude, longitude, precision);
        System.out.println(&quot;Geohash: &quot; + geohash);
    }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;/details&gt;
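&lt;p&gt;For the inverse direction, decoding walks the same bits in reverse: each Base32 character contributes 5 bits that alternately halve the longitude and latitude ranges, and the decoded point is the center of the resulting bounding box. A minimal sketch (class and method names are illustrative, not from any library):&lt;/p&gt;

```java
public class GeohashDecoder {
    private static final String BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz";

    // Decodes a geohash into the center of its bounding box: {lat, lon}.
    public static double[] decodeGeohash(String geohash) {
        double latMin = -90, latMax = 90, lonMin = -180, lonMax = 180;
        boolean isLonBit = true; // interleaving starts with longitude
        for (char c : geohash.toCharArray()) {
            int val = BASE32.indexOf(c);
            for (int bit = 4; bit >= 0; bit--) {
                boolean isSet = ((val >> bit) & 1) == 1;
                if (isLonBit) {
                    double mid = (lonMin + lonMax) / 2;
                    if (isSet) lonMin = mid; else lonMax = mid;
                } else {
                    double mid = (latMin + latMax) / 2;
                    if (isSet) latMin = mid; else latMax = mid;
                }
                isLonBit = !isLonBit;
            }
        }
        return new double[]{(latMin + latMax) / 2, (lonMin + lonMax) / 2};
    }

    public static void main(String[] args) {
        double[] center = decodeGeohash("9q8yy"); // San Francisco area
        System.out.println("lat: " + center[0] + ", lon: " + center[1]);
    }
}
```

&lt;p&gt;The error of the decoded center shrinks as the geohash gets longer, which is the same &quot;more characters, more precision&quot; property seen during encoding.&lt;/p&gt;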
&lt;/details&gt;

&lt;hr class=&quot;sub-hr&quot; /&gt;

&lt;details open=&quot;&quot; class=&quot;text-container&quot;&gt;&lt;summary class=&quot;h4&quot;&gt;3.3. Geohash - Conclusion&lt;/summary&gt;
&lt;p&gt;Similar to &lt;a href=&quot;/spatial-index-space-filling-curve#2-7-z-order-curve-and-hilbert-curve-conclusion&quot;&gt;Section 2.7&lt;/a&gt; (Indexing the Z-values), geohashes convert latitude and longitude into a single, sortable string, simplifying spatial data management. A &lt;a href=&quot;/b-tree&quot;&gt;B-tree&lt;/a&gt; or a search tree such as GiST/SP-GiST (Generalized Search Tree) is commonly used for geohash indexing in databases.&lt;/p&gt;

&lt;p&gt;Prefix Search: nearby locations share common geohash prefixes, enabling efficient filtering of locations by performing prefix searches on the geohash column.&lt;/p&gt;

&lt;p&gt;Neighbor Searches: generate geohashes for a target location and its neighbors to quickly retrieve nearby points. This also extends to Area Searches: calculate geohash ranges that cover a specific area and perform range queries to find all relevant points within that region.&lt;/p&gt;
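&lt;p&gt;Because geohashes sort lexicographically, a prefix search maps directly onto a contiguous range scan of an ordered index. A minimal in-memory sketch, using a &lt;code&gt;TreeSet&lt;/code&gt; as a stand-in for a B-tree-indexed geohash column (the sample geohashes are illustrative):&lt;/p&gt;

```java
import java.util.NavigableSet;
import java.util.TreeSet;

public class GeohashPrefixSearch {
    // Returns all geohashes that start with the given prefix, as a single
    // contiguous range scan [prefix, prefix + '\uffff') on the sorted structure.
    public static NavigableSet<String> prefixScan(TreeSet<String> index, String prefix) {
        return index.subSet(prefix, true, prefix + '\uffff', false);
    }

    public static void main(String[] args) {
        TreeSet<String> index = new TreeSet<>(); // stands in for a B-tree index
        index.add("9q8yyk"); // San Francisco area (illustrative values)
        index.add("9q8yym");
        index.add("9q9p1d");
        index.add("dr5reg"); // New York area

        // All entries under cell "9q8yy" come back in one range scan.
        System.out.println(prefixScan(index, "9q8yy"));
    }
}
```

&lt;p&gt;A database executes the same idea with a &lt;code&gt;LIKE &apos;prefix%&apos;&lt;/code&gt; predicate or an explicit range condition over the indexed geohash column.&lt;/p&gt;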

&lt;p&gt;Popular databases such as &lt;a href=&quot;https://clickhouse.com/docs/en/sql-reference/functions/geo/geohash&quot; target=&quot;_blank&quot;&gt;ClickHouse&lt;/a&gt;, &lt;a href=&quot;https://dev.mysql.com/doc/refman/8.4/en/spatial-geohash-functions.html&quot; target=&quot;_blank&quot;&gt;MySQL&lt;/a&gt;, &lt;a href=&quot;https://postgis.net/docs/ST_GeoHash.html&quot; target=&quot;_blank&quot;&gt;PostGIS&lt;/a&gt;, &lt;a href=&quot;https://cloud.google.com/bigquery/docs/reference/standard-sql/geography_functions#st_geohash&quot; target=&quot;_blank&quot;&gt;BigQuery&lt;/a&gt;, &lt;a href=&quot;https://docs.aws.amazon.com/redshift/latest/dg/ST_GeoHash-function.html&quot; target=&quot;_blank&quot;&gt;RedShift&lt;/a&gt; and many others offer built-in geohash functions. Many variations have been developed as well, such as the &lt;a href=&quot;https://github.com/yinqiwen/geohash-int&quot; target=&quot;_blank&quot;&gt;64-bit Geohash&lt;/a&gt; and &lt;a href=&quot;https://ntnuopen.ntnu.no/ntnu-xmlui/handle/11250/2404058&quot; target=&quot;_blank&quot;&gt;Hilbert-Geohash&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Interactive Geohash Visualization: &lt;a href=&quot;/geohash&quot; target=&quot;_blank&quot;&gt;/geohash&lt;/a&gt;&lt;/p&gt;
&lt;/details&gt;

&lt;hr class=&quot;clear-hr&quot; /&gt;

&lt;h3&gt;4. Google S2&lt;/h3&gt;
&lt;p&gt;&lt;/p&gt;

&lt;details open=&quot;&quot; class=&quot;text-container&quot;&gt;&lt;summary class=&quot;h4&quot;&gt;4.1. S2 - Intuition&lt;/summary&gt;

&lt;p&gt;Google&apos;s S2 library was released more than 10 years ago and didn&apos;t exactly get the attention it deserved; much later, in 2017, Google announced the release of the open-source C++ &lt;a href=&quot;https://github.com/google/s2geometry&quot; target=&quot;_blank&quot;&gt;s2geometry library&lt;/a&gt;. With the use of the Hilbert Curve (&lt;a href=&quot;/spatial-index-space-filling-curve#2-2-hilbert-curve-intuition&quot;&gt;Section 2.2&lt;/a&gt;) and cube face (spherical) projection instead of geohash&apos;s Z-order curve and equirectangular projection, S2 addresses (to an extent) the large-jumps problem (&lt;a href=&quot;/spatial-index-space-filling-curve#2-5-z-order-curve-implementation&quot;&gt;Section 2.5&lt;/a&gt;) of Z-order curves and the disproportionate surface areas associated with the equirectangular projection.&lt;/p&gt;

&lt;p&gt;The core of S2 is the hierarchical decomposition of the sphere into &quot;cells&quot;, done using a &lt;a href=&quot;/quadtree&quot; target=&quot;_blank&quot;&gt;Quad-tree&lt;/a&gt;, where a quadrant is recursively subdivided into four equal sub-cells. The Hilbert Curve goes hand-in-hand with this: it runs across the centers of the quad-tree&apos;s leaf nodes.&lt;/p&gt;
&lt;/details&gt;

&lt;hr class=&quot;sub-hr&quot; /&gt;

&lt;details open=&quot;&quot; class=&quot;text-container&quot;&gt;&lt;summary class=&quot;h4&quot;&gt;4.2. S2 - Implementation&lt;/summary&gt;

&lt;p&gt;An overview of the solution:&lt;/p&gt;
&lt;ul&gt;
    &lt;li&gt;Enclose sphere in cube&lt;/li&gt;
    &lt;li&gt;Project point(s) &lt;code&gt;p&lt;/code&gt; onto the cube&lt;/li&gt;
    &lt;li&gt;Build a quad-tree/hilbert-curve on each cube face (6 faces)&lt;/li&gt;
    &lt;li&gt;Assign ID to the quad-tree cell that contains the projection of point(s) &lt;code&gt;p&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Starting with the input &lt;a href=&quot;https://en.wikipedia.org/wiki/Geographic_coordinate_system#Latitude_and_longitude&quot; target=&quot;_blank&quot;&gt;co-ordinates&lt;/a&gt;: latitude (degrees: -90° to +90°; radians: -π/2 to π/2) and longitude (degrees: -180° to +180°; radians: -π to π). &lt;a href=&quot;https://en.wikipedia.org/wiki/World_Geodetic_System&quot; target=&quot;_blank&quot;&gt;WGS84&lt;/a&gt; is a commonly used standard in &lt;a href=&quot;https://en.wikipedia.org/wiki/Earth-centered,_Earth-fixed_coordinate_system&quot; target=&quot;_blank&quot;&gt;geocentric coordinate systems&lt;/a&gt;.&lt;/p&gt;

&lt;hr class=&quot;hr&quot; /&gt;

&lt;h3&gt;4.2.1. (Lat, Long) to (X,Y,Z)&lt;/h3&gt;

&lt;p&gt;Convert &lt;code&gt;p = (latitude, longitude) =&amp;gt; (x, y, z)&lt;/code&gt; in the XYZ coordinate system (&lt;code&gt;x = [-1.0, 1.0], y = [-1.0, 1.0], z = [-1.0, 1.0]&lt;/code&gt;), based on coordinates on the unit sphere (unit radius), which is similar to the &lt;a href=&quot;https://en.wikipedia.org/wiki/Earth-centered,_Earth-fixed_coordinate_system&quot; target=&quot;_blank&quot;&gt;Earth-centered, Earth-fixed coordinate system&lt;/a&gt;.&lt;/p&gt;

&lt;img class=&quot;center-image-0 center-image-80&quot; src=&quot;./assets/posts/spatial-index/ecef.svg&quot; /&gt; 
&lt;p class=&quot;figure-header&quot;&gt;Figure 25: (lat, long) to (x, y, z) Transformation with ECEF&lt;/p&gt;

&lt;p&gt;Where &lt;code&gt;(x, y, z)&lt;/code&gt;: X-axis at latitude 0°, longitude 0° (equator and prime meridian intersection); Y-axis at latitude 0°, longitude 90° (equator and 90°E meridian intersection); Z-axis at latitude 90° (North Pole); Altitude (&lt;code&gt;PM&lt;/code&gt; in Figure 25) = height above the reference ellipsoid/sphere (zero for a round-planet approximation).&lt;/p&gt;
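&lt;p&gt;For the round-planet approximation (unit radius, zero altitude), the transformation is a direct spherical-to-Cartesian conversion; a minimal sketch (class and method names are illustrative):&lt;/p&gt;

```java
public class LatLngToXyz {
    // Converts (lat, lng) in degrees to a point on the unit sphere.
    public static double[] latLngToXyz(double latDeg, double lngDeg) {
        double lat = Math.toRadians(latDeg);
        double lng = Math.toRadians(lngDeg);
        return new double[]{
            Math.cos(lat) * Math.cos(lng), // x: lat 0, lng 0  -> (1, 0, 0)
            Math.cos(lat) * Math.sin(lng), // y: lat 0, lng 90 -> (0, 1, 0)
            Math.sin(lat)                  // z: lat 90 (North Pole) -> (0, 0, 1)
        };
    }

    public static void main(String[] args) {
        double[] p = latLngToXyz(37.7749, -122.4194);
        System.out.println("x: " + p[0] + ", y: " + p[1] + ", z: " + p[2]);
    }
}
```

&lt;p&gt;The three comments mirror the axis definitions above: the equator/prime-meridian intersection maps to the X-axis, and the North Pole to the Z-axis.&lt;/p&gt;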

&lt;hr class=&quot;hr&quot; /&gt;

&lt;h3&gt;4.2.2. (X,Y,Z) to (Face,U,V)&lt;/h3&gt;

&lt;p&gt;To map &lt;code&gt;(x,y,z)&lt;/code&gt; to &lt;code&gt;(face, u, v)&lt;/code&gt;, each of the six faces of the cube is projected onto the sphere. The process is similar to &lt;a href=&quot;https://en.wikipedia.org/wiki/UV_mapping&quot; target=&quot;_blank&quot;&gt;UV Mapping&lt;/a&gt;: projecting a 3D model&apos;s surface onto a 2D coordinate space, where &lt;code&gt;u&lt;/code&gt; and &lt;code&gt;v&lt;/code&gt; denote the axes of the 2D plane. In this case, &lt;code&gt;(u, v)&lt;/code&gt; represents the location of a point on one face of the cube.&lt;/p&gt;

&lt;p&gt;The projection can be imagined as a unit sphere circumscribed by a cube: a ray emitted from the center of the sphere through a point on the sphere hits one of the 6 faces of the cube; that is, the sphere is projected onto the cube.&lt;/p&gt;

&lt;img class=&quot;center-image-0 center-image&quot; src=&quot;./assets/posts/spatial-index/s2-cell-step-1-2.svg&quot; /&gt; 
&lt;p class=&quot;figure-header&quot;&gt;Figure 26: (lat, long) to (x, y, z) and (x, y, z) to (face, u, v)&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;face&lt;/code&gt; denotes which of the 6 (0 to 5) cube faces a point on the sphere is mapped onto. Figure 27 shows the 6 faces of the cube (&lt;a href=&quot;https://en.wikipedia.org/wiki/Cube_mapping&quot; target=&quot;_blank&quot;&gt;cube mapping&lt;/a&gt;) after the projection. For a unit sphere, on each face, the point &lt;code&gt;u,v = (0,0)&lt;/code&gt; represents the center of the face.&lt;/p&gt;

&lt;img class=&quot;center-image-0 center-image-100&quot; src=&quot;./assets/posts/spatial-index/s2-globe.svg&quot; /&gt; 
&lt;p class=&quot;figure-header&quot;&gt;Figure 27: Cube Face (Spherical) Projection&lt;/p&gt;

&lt;p&gt;The evident problem here is that the linear projection leads to same-area cells on the cube having different sizes on the sphere (length and area distortion), with a highest-to-lowest area ratio of &lt;code&gt;5.2&lt;/code&gt; (cells of equal area on the cube can differ in area by up to a factor of 5.2 on the sphere).&lt;/p&gt;

&lt;details class=&quot;code-container&quot;&gt;&lt;summary class=&quot;p&quot;&gt;4.2.2a. S2 FaceXYZ to UV - Snippet&lt;/summary&gt;
&lt;pre&gt;&lt;code&gt;public static class Vector3 {
    public double x;
    public double y;
    public double z;

    public Vector3(double x, double y, double z) {
        this.x = x;
        this.y = y;
        this.z = z;
    }
}

public static int findFace(Vector3 r) {
    double absX = Math.abs(r.x);
    double absY = Math.abs(r.y);
    double absZ = Math.abs(r.z);

    if (absX &amp;gt;= absY &amp;amp;&amp;amp; absX &amp;gt;= absZ) {
        return r.x &amp;gt; 0 ? 0 : 3;
    } else if (absY &amp;gt;= absX &amp;amp;&amp;amp; absY &amp;gt;= absZ) {
        return r.y &amp;gt; 0 ? 1 : 4;
    } else {
        return r.z &amp;gt; 0 ? 2 : 5;
    }
}

public static double[] validFaceXYZToUV(int face, Vector3 r) {
    switch (face) {
        case 0:
            return new double[]{r.y / r.x, r.z / r.x};
        case 1:
            return new double[]{-r.x / r.y, r.z / r.y};
        case 2:
            return new double[]{-r.x / r.z, -r.y / r.z};
        case 3:
            return new double[]{r.z / r.x, r.y / r.x};
        case 4:
            return new double[]{r.z / r.y, -r.x / r.y};
        default:
            return new double[]{-r.y / r.z, -r.x / r.z};
    }
}

public static void main(String[] args) {
    Vector3 r = new Vector3(1.0, 2.0, 3.0);
    int face = findFace(r); // z has the largest magnitude here, so face = 2
    double[] uv = validFaceXYZToUV(face, r);
    System.out.println(&quot;u: &quot; + uv[0] + &quot;, v: &quot; + uv[1]);
}
&lt;/code&gt;&lt;/pre&gt;
&lt;/details&gt;

&lt;p&gt;The cube &lt;code&gt;Face&lt;/code&gt; is determined by the largest absolute X, Y, Z component; when that component is negative, the back faces (3, 4, 5) are used.&lt;/p&gt;
&lt;img class=&quot;center-image-0 center-image-60&quot; src=&quot;./assets/posts/spatial-index/s2-xyz-uv.svg&quot; /&gt; 
&lt;p&gt;Face and XYZ are mapped to UV by taking the other two components (the ones that did not pick the face) and dividing them by the largest component, giving values in &lt;code&gt;[-1, 1]&lt;/code&gt;. Additionally, some faces of the cube are negated (transposed) to produce a single continuous Hilbert curve across the cube.&lt;/p&gt;

&lt;hr class=&quot;hr&quot; /&gt;

&lt;h3&gt;4.2.3. (Face,U,V) to (Face,S,T)&lt;/h3&gt;

&lt;p&gt;The ST coordinate system is an extension of UV with an additional non-linear transformation layer to address the disproportionate sphere-surface-area to cube-cell mapping (area preservation). Without it, cells near the cube face edges would be smaller than those near the cube face centers.&lt;/p&gt;

&lt;img class=&quot;center-image-0 center-image-80&quot; src=&quot;./assets/posts/spatial-index/s2-cell-step-3.svg&quot; /&gt; 
&lt;p class=&quot;figure-header&quot;&gt;Figure 28: (u, v) to (s, t)&lt;/p&gt;

&lt;p&gt;S2 uses a quadratic projection for &lt;code&gt;(u,v)&lt;/code&gt; =&amp;gt; &lt;code&gt;(s,t)&lt;/code&gt;. Comparing the &lt;code&gt;tan&lt;/code&gt; and &lt;code&gt;quadratic&lt;/code&gt; projections: the tan projection has the least area/distance distortion; however, the quadratic projection, an approximation of the tan projection, is much faster and almost as good.&lt;/p&gt;
&lt;table&gt;
        &lt;tr&gt;
            &lt;td&gt;&lt;/td&gt;
            &lt;td&gt;Area Ratio&lt;/td&gt;
            &lt;td&gt;Cell → Point (µs)&lt;/td&gt;
            &lt;td&gt;Point → Cell (µs)&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;td&gt;Linear&lt;/td&gt;
            &lt;td&gt;5.20&lt;/td&gt;
            &lt;td&gt;0.087&lt;/td&gt;
            &lt;td&gt;0.085&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;td&gt;Tangent&lt;/td&gt;
            &lt;td&gt;1.41&lt;/td&gt;
            &lt;td&gt;0.299&lt;/td&gt;
            &lt;td&gt;0.258&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr style=&quot;background-color: rgb(213, 232, 212);&quot;&gt;
            &lt;td&gt;Quadratic&lt;/td&gt;
            &lt;td&gt;2.08&lt;/td&gt;
            &lt;td&gt;0.096&lt;/td&gt;
            &lt;td&gt;0.108&lt;/td&gt;
        &lt;/tr&gt;
    &lt;/table&gt;

&lt;p&gt;&lt;code&gt;Cell → Point&lt;/code&gt; and &lt;code&gt;Point → Cell&lt;/code&gt; represents the transformation from (U, V) to (S, T) coordinates and vice versa.&lt;/p&gt;

&lt;img class=&quot;center-image-0 center-image-90&quot; src=&quot;./assets/posts/spatial-index/s2-uv-st-face-0.svg&quot; /&gt; 
&lt;p class=&quot;figure-header&quot;&gt;Figure 29: (face, u, v) to (face, s, t); for face = 0&lt;/p&gt;

&lt;p&gt;For the quadratic transformation, apply a square-root correction, &lt;code&gt;0.5 * sqrt(1 + 3u)&lt;/code&gt; for &lt;code&gt;u &amp;gt;= 0&lt;/code&gt;, to maintain the uniformity of the grid cells.&lt;/p&gt;

&lt;details class=&quot;code-container&quot;&gt;&lt;summary class=&quot;p&quot;&gt;4.2.3a. S2 UV to ST - Snippet&lt;/summary&gt;
&lt;pre&gt;&lt;code&gt;public static double uvToST(double u) {
    if (u &amp;gt;= 0) {
        return 0.5 * Math.sqrt(1 + 3 * u);
    } else {
        return 1 - 0.5 * Math.sqrt(1 - 3 * u);
    }
}

public static void main(String[] args) {
    // (u, v) values in the range [-1, 1]
    double u1 = 0.5;
    double v1 = -0.5;
    
    // Convert (u, v) to (s, t)
    double s1 = uvToST(u1);
    double t1 = uvToST(v1);

    System.out.println(&quot;For (u, v) = (&quot; + u1 + &quot;, &quot; + v1 + &quot;):&quot;);
    System.out.println(&quot;s: &quot; + s1);
    System.out.println(&quot;t: &quot; + t1);
}
&lt;/code&gt;&lt;/pre&gt;
&lt;/details&gt;

&lt;hr class=&quot;hr&quot; /&gt;

&lt;h3&gt;4.2.4. (Face,S,T) to (Face,I,J)&lt;/h3&gt;

&lt;p&gt;The IJ coordinates are discretized ST coordinates, dividing the ST plane into a &lt;code&gt;2&lt;sup&gt;30&lt;/sup&gt; × 2&lt;sup&gt;30&lt;/sup&gt;&lt;/code&gt; grid; i.e. the i and j coordinates in S2 range from &lt;code&gt;0 to 2&lt;sup&gt;30&lt;/sup&gt; - 1&lt;/code&gt;. They represent the two dimensions of the leaf cells (lowest-level cells) on a cube face.&lt;/p&gt;

&lt;p&gt;Why 2&lt;sup&gt;30&lt;/sup&gt;? The i and j coordinates are each represented using 30 bits, giving &lt;code&gt;2&lt;sup&gt;30&lt;/sup&gt;&lt;/code&gt; distinct values per coordinate (roughly every cm² of the earth); this large range allows precise positioning within each face of the cube (high spatial resolution). The total number of unique leaf cells is &lt;code&gt;6 x (2&lt;sup&gt;30&lt;/sup&gt; × 2&lt;sup&gt;30&lt;/sup&gt;)&lt;/code&gt;.&lt;/p&gt;

&lt;img class=&quot;center-image-0 center-image-100&quot; src=&quot;./assets/posts/spatial-index/s2-st-ij.svg&quot; /&gt;
&lt;p class=&quot;figure-header&quot;&gt;Figure 30: (face, s, t) to (face, i, j); for face = 0&lt;/p&gt;
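&lt;p&gt;As a rough sanity check on the &quot;every cm²&quot; claim, dividing Earth&apos;s surface area by the total leaf-cell count gives the average leaf-cell size; the surface-area constant below is an approximation assumed for this sketch:&lt;/p&gt;

```java
public class S2LeafCellArea {
    // Average leaf-cell area in cm^2, assuming ~510.1 million km^2 of Earth surface.
    public static double averageLeafCellAreaCm2() {
        double earthAreaKm2 = 510_065_622.0;                        // approximation
        double leafCells = 6.0 * Math.pow(2, 30) * Math.pow(2, 30); // 6 faces of 2^30 x 2^30
        return (earthAreaKm2 * 1e10) / leafCells;                   // 1 km^2 = 1e10 cm^2
    }

    public static void main(String[] args) {
        System.out.println("Average leaf cell area: " + averageLeafCellAreaCm2() + " cm^2");
    }
}
```

&lt;p&gt;This works out to well under 1 cm² on average; individual leaf cells vary around that figure because of the projection distortion discussed earlier.&lt;/p&gt;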

&lt;details class=&quot;code-container&quot;&gt;&lt;summary class=&quot;p&quot;&gt;4.2.4a. S2 ST to IJ - Snippet&lt;/summary&gt;
&lt;pre&gt;&lt;code&gt;public static int stToIj(double s) {
  // Scale s in [0, 1] to a discrete grid index in [0, 2^30 - 1]; 1073741824 = 2^30
  return Math.max(
    0, Math.min(1073741824 - 1, (int) Math.round(1073741824 * s))
  );
}
&lt;/code&gt;&lt;/pre&gt;
&lt;/details&gt;

&lt;hr class=&quot;hr&quot; /&gt;

&lt;h3&gt;4.2.5. (Face,I,J) to S2 Cell ID&lt;/h3&gt;
&lt;p&gt;The hierarchical sub-division of each cube face into 4 equal quadrants calls for the Hilbert space-filling curve (&lt;a href=&quot;/spatial-index-space-filling-curve#2-2-hilbert-curve-intuition&quot;&gt;Section 2.2&lt;/a&gt;): cells are enumerated along a Hilbert curve.&lt;/p&gt;

&lt;img class=&quot;center-image-0 center-image-100&quot; src=&quot;./assets/posts/spatial-index/s2-ij-cell.svg&quot; /&gt;
&lt;p class=&quot;figure-header&quot;&gt;Figure 31: (face, i, j) to Hilbert Curve Position&lt;/p&gt;

&lt;p&gt;The Hilbert Curve preserves spatial locality: points that are close on the cube face/surface are numerically close in Hilbert curve position (illustration in Figure 31 - Level 3).&lt;/p&gt;

&lt;p&gt;Transformation: the Hilbert curve transforms the IJ coordinate position on the cube face from 2D to 1D, given by a &lt;code&gt;60 bit&lt;/code&gt; integer (&lt;code&gt;0 to 2&lt;sup&gt;60&lt;/sup&gt; - 1&lt;/code&gt;).&lt;/p&gt;

&lt;details class=&quot;code-container&quot;&gt;&lt;summary class=&quot;p&quot;&gt;4.2.5a. S2 IJ to S2 Cell ID - Snippet&lt;/summary&gt;
&lt;pre&gt;&lt;code&gt;public class S2CellId {
    private static final int MAX_LEVEL = 30;
    private static final int POS_BITS = 2 * MAX_LEVEL + 1;

    // Simplified: interleaves i and j bits directly (Z-order) for illustration;
    // the real S2 library also applies Hilbert-curve rotations via lookup tables.
    public static long faceIjToCellId(int face, int i, int j) {
        // Face encoding: top 3 bits
        long cellId = ((long) face) &amp;lt;&amp;lt; POS_BITS;
        // Loop from MAX_LEVEL - 1 down to 0
        for (int k = MAX_LEVEL - 1; k &amp;gt;= 0; --k) {
            // Hierarchical position encoding: two bits per level
            int mask = 1 &amp;lt;&amp;lt; k;
            long bits = ((((i &amp;amp; mask) != 0) ? 1 : 0) &amp;lt;&amp;lt; 1) | (((j &amp;amp; mask) != 0) ? 1 : 0);
            cellId |= bits &amp;lt;&amp;lt; (2 * k + 1); // bit 0 is reserved for the level marker
        }
        return cellId | 1; // trailing 1 marks a level-30 (leaf) cell
    }

    public static void main(String[] args) {
        int face = 2; 
        int i = 536870912;
        int j = 536870912;

        long cellId = faceIjToCellId(face, i, j);
        System.out.println(&quot;S2 Cell ID: &quot; + cellId);
    }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;/details&gt;

&lt;p&gt;The &lt;b&gt;S2 Cell ID&lt;/b&gt; is represented by a &lt;code&gt;64-bit&lt;/code&gt; integer,&lt;/p&gt; 
&lt;ul&gt;
&lt;img class=&quot;center-image-0 center-image-70&quot; src=&quot;./assets/posts/spatial-index/s2-cell-id.svg&quot; /&gt;
&lt;p class=&quot;figure-header&quot;&gt;Figure 32: (face, i, j) to S2 Cell ID&lt;/p&gt;
&lt;li&gt;the left &lt;code&gt;3 bits&lt;/code&gt; are used to represent the cube face &lt;code&gt;[0-5],&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;the following &lt;code&gt;60 bits&lt;/code&gt; represent the Hilbert curve position,&lt;/li&gt;
&lt;li&gt;with &lt;code&gt;[0-30]&lt;/code&gt; levels; two bits per level, followed by a trailing &lt;code&gt;1&lt;/code&gt; bit, a marker whose position identifies the level of the cell,&lt;/li&gt;
&lt;li&gt;and the remaining low bits are padded with 0s.&lt;/li&gt;
&lt;/ul&gt;

&lt;pre&gt;&lt;code&gt;fffpppp...pppppppp1  # Level 30 cell ID
fffpppp...pppppp100  # Level 29 cell ID
fffpppp...pppp10000  # Level 28 cell ID
...
...
...
fffpp10...000000000  # Level 1 cell ID
fff1000...000000000  # Level 0 cell ID
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Notice the positions of the trailing &lt;code&gt;1&lt;/code&gt; and the padded &lt;code&gt;0&lt;/code&gt;s, which correlate with the level.&lt;/p&gt;
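&lt;p&gt;One consequence of this layout: a cell&apos;s level can be recovered by counting the trailing zeros below the marker bit (two per level), and its parent by moving the marker up two bits. A sketch consistent with the bit layout above (class and method names are illustrative):&lt;/p&gt;

```java
public class S2CellLevel {
    private static final int MAX_LEVEL = 30;

    // Level = 30 minus half the number of trailing zeros below the 1-bit marker.
    public static int level(long cellId) {
        return MAX_LEVEL - (Long.numberOfTrailingZeros(cellId) >> 1);
    }

    // Parent at (level - 1): clear the two level bits and set the new marker.
    // Assumes the cell is at level >= 1.
    public static long parent(long cellId) {
        long newLsb = Long.lowestOneBit(cellId) << 2;
        return (cellId & -newLsb) | newLsb;
    }

    public static void main(String[] args) {
        long leaf = 1L; // face 0 leaf cell: marker at bit 0, so level 30
        System.out.println("level: " + level(leaf));
        System.out.println("parent level: " + level(parent(leaf)));
    }
}
```

&lt;p&gt;Because the parent is just a bit operation on the child&apos;s ID, containment checks between cells reduce to cheap integer arithmetic.&lt;/p&gt;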
&lt;hr class=&quot;hr&quot; /&gt;

&lt;p&gt;&lt;b&gt;S2 Tokens&lt;/b&gt; are a string representation of S2 Cell IDs (uint64), which can be more convenient for storage.&lt;/p&gt;

&lt;details class=&quot;code-container&quot;&gt;&lt;summary class=&quot;p&quot;&gt;4.2.5b. S2 Cell ID to S2 Token - Snippet&lt;/summary&gt;
&lt;pre&gt;&lt;code&gt;public static String cellIdToToken(long cellId) {
    // The zero token is encoded as &apos;X&apos; rather than as a zero-length string
    if (cellId == 0) {
        return &quot;X&quot;;
    }

    // Pad to 16 hex digits (leading zeros are significant), then strip trailing zeros
    String hexString = String.format(&quot;%016x&quot;, cellId).replaceAll(&quot;0*$&quot;, &quot;&quot;);
    return hexString;
}

public static void main(String[] args) {
    long cellId = 3383821801271328768L; // Given example value

    // Convert S2 Cell ID to S2 Token
    String token = cellIdToToken(cellId);

    System.out.println(&quot;S2 Cell ID: &quot; + cellId);
    System.out.println(&quot;S2 Token: &quot; + token);
}
&lt;/code&gt;&lt;/pre&gt;
&lt;/details&gt;
&lt;p&gt;An S2 Cell ID is converted to an S2 Token by encoding the ID into a base-16 (hexadecimal) string. It&apos;s similar to Geohash; however, truncating a high-level S2 token does not generally yield the parent lower-level token, because the trailing 1 bit of the S2 cell ID wouldn&apos;t be set correctly.&lt;/p&gt;
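&lt;p&gt;The inverse, token back to cell ID, must first restore the stripped trailing zeros by right-padding the hex string to 16 digits; a minimal sketch (the sample token value is illustrative):&lt;/p&gt;

```java
public class S2TokenDecoder {
    // Restores a 64-bit cell ID from a token by right-padding to 16 hex digits.
    public static long tokenToCellId(String token) {
        if ("X".equals(token)) {
            return 0L; // the zero cell ID is encoded as 'X'
        }
        StringBuilder hex = new StringBuilder(token);
        while (hex.length() < 16) {
            hex.append('0'); // re-append the trailing zeros stripped during encoding
        }
        return Long.parseUnsignedLong(hex.toString(), 16);
    }

    public static void main(String[] args) {
        System.out.println(tokenToCellId("2ef")); // illustrative token
    }
}
```

&lt;p&gt;Using &lt;code&gt;parseUnsignedLong&lt;/code&gt; keeps the full uint64 range intact, since cell IDs on faces 4 and 5 set the top bit.&lt;/p&gt;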
&lt;/details&gt;

&lt;hr class=&quot;sub-hr&quot; /&gt;

&lt;details open=&quot;&quot; class=&quot;text-container&quot;&gt;&lt;summary class=&quot;h4&quot;&gt;4.3. S2 - Conclusion&lt;/summary&gt;
&lt;p&gt;Google&apos;s S2 provides spatial indexing by using hierarchical decomposition of the sphere into cells through a combination of Hilbert curves and cube face (spherical) projection. This approach mitigates some of the spatial locality issues present in Z-order curves and offers more balanced surface area representations. S2&apos;s use of (face, u, v) coordinates, quadratic projection, and Hilbert space-filling curves ensures efficient and precise spatial indexing.&lt;/p&gt;

&lt;img class=&quot;center-image-0 center-image-100&quot; src=&quot;./assets/posts/spatial-index/s2-stats.svg&quot; /&gt;

&lt;p&gt;Closing with a strong pro and a con, S2 offers a high resolution of as low as &lt;code&gt;0.48 cm²&lt;/code&gt; cell size (level 30), but the number of cells required to cover a given polygon isn&apos;t the best. This makes it a good transition to talk about Uber&apos;s &lt;a href=&quot;https://www.uber.com/en-CA/blog/h3/&quot; target=&quot;_blank&quot;&gt;H3&lt;/a&gt;. The question is, &lt;a href=&quot;/cartograms-documentation#hexagonsvssquares&quot;&gt;Why Hexagons?&lt;/a&gt;&lt;/p&gt;
&lt;/details&gt;

&lt;hr class=&quot;clear-hr&quot; /&gt;

&lt;details&gt;&lt;summary class=&quot;h3&quot;&gt;3. References&lt;/summary&gt;

&lt;pre style=&quot;max-height: 300px&quot;&gt;&lt;code&gt;6. Christian S. Perone, &quot;Google’s S2, geometry on the sphere, cells and Hilbert curve,&quot; in Terra Incognita, 14/08/2015, https://blog.christianperone.com/2015/08/googles-s2-geometry-on-the-sphere-cells-and-hilbert-curve/. [Accessed: 12-Jun-2024].
7. B. Feifke, &quot;Geospatial Indexing Explained,&quot; Ben Feifke, Dec. 2022. [Online]. Available: https://benfeifke.com/posts/geospatial-indexing-explained/. [Accessed: 12-Jun-2024].
8. &quot;S2 Concepts,&quot; S2 Geometry Library Documentation, 2024. [Online]. Available: https://docs.s2cell.aliddell.com/en/stable/s2_concepts.html. [Accessed: 13-Jun-2024].
9. &quot;Geospatial Indexing: A Look at Google&apos;s S2 Library,&quot; CNIter Blog, Mar. 2023. [Online]. Available: https://cniter.github.io/posts/720275bd.html. [Accessed: 13-Jun-2024].
10. &quot;S2 Geometry Library,&quot; S2 Geometry, 2024. [Online]. Available: https://s2geometry.io/. [Accessed: 13-Jun-2024].
&lt;/code&gt;&lt;/pre&gt;
&lt;/details&gt;</content><author><name>Adesh Nalpet Adimurthy</name></author><category term="System Wisdom" /><category term="Database" /><category term="Spatial Index" /><summary type="html">This post is a continuation of Stomping Grounds: Spatial Indexes, but don’t worry if you missed the first part—you’ll still find plenty of new insights right here.</summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://pyblog.xyz/assets/featured/webp/space-grids.webp" /><media:content medium="image" url="https://pyblog.xyz/assets/featured/webp/space-grids.webp" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Spatial Index: Space-Filling Curves</title><link href="https://pyblog.xyz/spatial-index-space-filling-curve" rel="alternate" type="text/html" title="Spatial Index: Space-Filling Curves" /><published>2024-06-11T00:00:00+00:00</published><updated>2024-06-11T00:00:00+00:00</updated><id>https://pyblog.xyz/spatial-index-space-filling-curve</id><content type="html" xml:base="https://pyblog.xyz/spatial-index-space-filling-curve">&lt;details open=&quot;&quot;&gt;&lt;summary class=&quot;h3&quot;&gt;0. Overview&lt;/summary&gt;
&lt;p&gt;Spatial data has grown, and is still growing, rapidly thanks to web services tracking where and when users do things. Most applications add location tags and often allow users to check in at specific places and times. This surge is largely due to smartphones, which act as location sensors, making it easier than ever to capture and analyze this type of data.&lt;/p&gt;

&lt;p&gt;The goal of this post is to dive into the different spatial indexes that are widely used in both relational and non-relational databases. We&apos;ll look at the pros and cons of each type, and also discuss which indexes are the most popular today.&lt;/p&gt;

&lt;img class=&quot;center-image-0&quot; src=&quot;./assets/posts/spatial-index/spatial-index-types.svg&quot; /&gt; 
&lt;p class=&quot;figure-header&quot;&gt;Figure 0: Types of Spatial Indexes&lt;/p&gt;

&lt;p&gt;Spatial indexes fall into two main categories: space-driven and data-driven structures. Data-driven structures, like the R-tree family, are tailored to the distribution of the data itself. Space-driven structures include partitioning trees (kd-trees, quad-trees), space-filling curves (Z-order, Hilbert), and grid systems (H3, S2, Geohash), each partitioning space to optimize spatial queries. This classification isn&apos;t exhaustive, as many other methods cater to specific needs in spatial data management.&lt;/p&gt;

&lt;/details&gt;

&lt;hr class=&quot;clear-hr&quot; /&gt;

&lt;details open=&quot;&quot;&gt;&lt;summary class=&quot;h3&quot;&gt;1. Foundation&lt;/summary&gt;
&lt;p&gt;Let&apos;s start with why we need spatial indexes, or more generally, a way to index multi-dimensional data.&lt;/p&gt;
&lt;img class=&quot;center-image-40&quot; src=&quot;./assets/posts/spatial-index/no-sort-no-partition-table.svg&quot; /&gt; 
&lt;p class=&quot;figure-header&quot;&gt;Figure 1: Initial Table Structure&lt;/p&gt;
&lt;p&gt;Consider a table with the following fields: &lt;code&gt;device&lt;/code&gt;, &lt;code&gt;X&lt;/code&gt;, and &lt;code&gt;Y&lt;/code&gt;, all of which are integers ranging from 1 to 4. Data is inserted into this table randomly by an external application.&lt;/p&gt;

&lt;img class=&quot;center-image&quot; src=&quot;./assets/posts/spatial-index/no-sort-no-partition-full-scan.svg&quot; /&gt; 
&lt;p class=&quot;figure-header&quot;&gt;Figure 2: Unpartitioned and Unsorted Table&lt;/p&gt;
&lt;p&gt;Currently, the table is neither partitioned nor sorted. As a result, the data is distributed across all files (8 files), each containing a mix of all ranges. This means all files are similar in nature. Running a query like &lt;code&gt;Device = 1 and X = 2&lt;/code&gt; requires a full scan of all files, which is inefficient.&lt;/p&gt;

&lt;img class=&quot;center-image-90&quot; src=&quot;./assets/posts/spatial-index/no-sort-full-scan.svg&quot; /&gt; 
&lt;p class=&quot;figure-header&quot;&gt;Figure 3: Partitioning by Device&lt;/p&gt;
&lt;p&gt;To optimize this, we partition the table by the &lt;code&gt;device&lt;/code&gt; field into 4 partitions: &lt;code&gt;Device = 1&lt;/code&gt;, &lt;code&gt;Device = 2&lt;/code&gt;, &lt;code&gt;Device = 3&lt;/code&gt;, and &lt;code&gt;Device = 4&lt;/code&gt;. Now, the same query (&lt;code&gt;Device = 1 and X = 2&lt;/code&gt;) only needs to scan the relevant partition. This reduces the scan to just 2 files.&lt;/p&gt;

&lt;img class=&quot;center-image-90&quot; src=&quot;./assets/posts/spatial-index/partial-scan-x.svg&quot; /&gt; 
&lt;p class=&quot;figure-header&quot;&gt;Figure 4: Sorting Data Within Partitions&lt;/p&gt;
&lt;p&gt;Further optimization can be achieved by sorting the data within each partition by the &lt;code&gt;X&lt;/code&gt; field. With this setup, each file in a partition holds a specific range of &lt;code&gt;X&lt;/code&gt; values. For example, one file in the &lt;code&gt;Device = 1&lt;/code&gt; partition holds &lt;code&gt;X = 1 to 2&lt;/code&gt;. This makes the query &lt;code&gt;Device = 1 and X = 2&lt;/code&gt; even more efficient.&lt;/p&gt;

&lt;img class=&quot;center-image-90&quot; src=&quot;./assets/posts/spatial-index/no-sort-full-scan-y.svg&quot; /&gt; 
&lt;p class=&quot;figure-header&quot;&gt;Figure 5: Limitation with Sorting on a Single Field&lt;/p&gt;
&lt;p&gt;However, if the query changes to &lt;code&gt;Device = 1 and Y = 2&lt;/code&gt;, the optimization is lost because the sorting was done on &lt;code&gt;X&lt;/code&gt; and not &lt;code&gt;Y&lt;/code&gt;. This means the query will still require scanning the entire partition for &lt;code&gt;Device = 1&lt;/code&gt;, bringing us back to a less efficient state.&lt;/p&gt;

&lt;p&gt;At this point, there&apos;s a clear need for efficiently partitioning 2-dimensional data. Why not use a &lt;a href=&quot;/b-tree&quot;&gt;B-tree&lt;/a&gt; with a composite index? A composite index prioritizes the first column in the index, leading to inefficient querying on the second column. This leads us back to the same problem, particularly when both dimensions need to be considered simultaneously for efficient querying.&lt;/p&gt;
&lt;/details&gt;

&lt;hr class=&quot;clear-hr&quot; /&gt;

&lt;details open=&quot;&quot;&gt;&lt;summary class=&quot;h3&quot;&gt;2. Space-Filling Curves&lt;/summary&gt;

&lt;p&gt;Consider &lt;code&gt;X&lt;/code&gt; and &lt;code&gt;Y&lt;/code&gt; from 1 to 4 plotted on a 2D axis. The goal is to traverse the data points and number them along the way (the path), using Space-Filling Curves, AKA squiggly lines.&lt;/p&gt;

&lt;img class=&quot;center-image-0 center-image-90&quot; src=&quot;./assets/posts/spatial-index/space-filling-trivial-details.svg&quot; /&gt; 
&lt;p class=&quot;figure-header&quot;&gt;Figure 6: Exploring Space-Filling Curve and Traversing the X-Y Axis&lt;/p&gt;

&lt;p&gt;Starting from &lt;code&gt;Y = 1&lt;/code&gt; and &lt;code&gt;X = 1&lt;/code&gt;, as we traverse up to &lt;code&gt;X = 1&lt;/code&gt; and &lt;code&gt;Y = 4&lt;/code&gt;, it&apos;s evident that there is no locality preservation (lexicographical order). The distance between points &lt;code&gt;(1, 4)&lt;/code&gt; and &lt;code&gt;(1, 3)&lt;/code&gt; is 6, a significant gap for points that are quite close to each other. Grouping this data into files keeps unrelated data together and amounts to sorting by one column while ignoring the information in the other column (back to square one), i.e. &lt;code&gt;X = 2&lt;/code&gt; leads to a full scan.&lt;/p&gt;


&lt;details open=&quot;&quot; class=&quot;text-container&quot;&gt;&lt;summary class=&quot;h4&quot;&gt;2.1. Z-Order Curve - Intuition&lt;/summary&gt;
&lt;p&gt;A recursive Z pattern, also known as the Z-order curve, is an effective way to preserve locality in many cases.&lt;/p&gt;

&lt;img class=&quot;center-image-0 center-image-70&quot; src=&quot;./assets/posts/spatial-index/z-order-types.svg&quot; /&gt; 
&lt;p class=&quot;figure-header&quot;&gt;Figure 7: Z-Order Curve Types&lt;/p&gt;
&lt;p&gt;The Z-order curve can take many shapes, depending on which coordinate goes first. The typical Z-shape occurs when the Y-coordinate goes first (most significant bit), and the upper left corner is the base. A mirror image Z-shape occurs when the Y-coordinate goes first and the lower left corner is the base. An N-shape occurs when the X-coordinate goes first and the lower left corner is the base.&lt;/p&gt;

&lt;p&gt;The Z-order curve grows exponentially: the next size up is the second-order curve, with 2-bit dimensions. Duplicate the first-order curve four times and connect the copies to form a continuous curve.&lt;/p&gt;

&lt;img class=&quot;center-image-0 center-image-90&quot; src=&quot;./assets/posts/spatial-index/z-order.svg&quot; /&gt; 
&lt;p class=&quot;figure-header&quot;&gt;Figure 8: Z-Order Curve&lt;/p&gt;

&lt;p&gt;Points &lt;code&gt;(1, 4)&lt;/code&gt; and &lt;code&gt;(1, 3)&lt;/code&gt; are separated by a single square. With 4 files based on this curve, the data is not spread out along a single dimension. Instead, the 4 files are clustered across both dimensions, making the data selective on both &lt;code&gt;X&lt;/code&gt; and &lt;code&gt;Y&lt;/code&gt; dimensions.&lt;/p&gt;
&lt;/details&gt;

&lt;hr class=&quot;sub-hr&quot; /&gt;

&lt;details open=&quot;&quot; class=&quot;text-container&quot;&gt;&lt;summary class=&quot;h4&quot;&gt;2.2. Hilbert Curve - Intuition&lt;/summary&gt;

&lt;p&gt;The Hilbert curve is another type of space-filling curve that serves a similar purpose; rather than the Z-shaped pattern of the Z-order curve, it uses a gentler U-shaped pattern. When compared with the Z-order curve in Figure 9, it&apos;s quite clear that the Hilbert curve always maintains the same distance between adjacent data points.&lt;/p&gt;

&lt;img class=&quot;center-image-0 center-image-90&quot; src=&quot;./assets/posts/spatial-index/hilbert-second-order.svg&quot; /&gt;
&lt;p class=&quot;figure-header&quot;&gt;Figure 9: First Order and Second Order Hilbert Curve&lt;/p&gt;
&lt;p&gt;The Hilbert curve also grows exponentially: duplicate the first-order curve and connect the copies. Additionally, some of the first-order curves are rotated to ensure that the interconnections are never longer than 1 step.&lt;/p&gt;

&lt;img class=&quot;center-image-0 center-image-90&quot; src=&quot;./assets/posts/spatial-index/hilbert-exponent.svg&quot; /&gt; 
&lt;p&gt;Comparing with the Z-curves (from Figure 8, higher-order in Figure 18), the Z-order curve is longer than the Hilbert curve at all levels, for the same area.&lt;/p&gt;

&lt;img class=&quot;center-image-0 center-image-90&quot; src=&quot;./assets/posts/spatial-index/hilbert-types.svg&quot; /&gt; 
&lt;p class=&quot;figure-header&quot;&gt;Figure 10: Hilbert Curve Types&lt;/p&gt;
&lt;p&gt;Although there are quite a lot of variants of the Hilbert curve, the common pattern is to rotate by 90 degrees and repeat the pattern in the next higher order(s).&lt;/p&gt;

&lt;img class=&quot;center-image-0 center-image-90&quot; src=&quot;./assets/posts/spatial-index/hilbert-curve.svg&quot; /&gt; 
&lt;p class=&quot;figure-header&quot;&gt;Figure 11: Hilbert Curve&lt;/p&gt;
&lt;p&gt;Hilbert curves traverse through the data, ensuring that multi-dimensional data points that are close together in 2D space remain close together along the 1D line or curve, thus preserving locality and enhancing query efficiency across both dimensions.&lt;/p&gt;
&lt;/details&gt;

&lt;hr class=&quot;sub-hr&quot; /&gt;

&lt;details open=&quot;&quot; class=&quot;text-container&quot;&gt;&lt;summary class=&quot;h4&quot;&gt;2.3. Z-Order Curve and Hilbert Curve - Comparison&lt;/summary&gt;

&lt;p&gt;Taking an example, if we query for &lt;code&gt;X = 3&lt;/code&gt;, we only need to search 2 of the files. Similarly, for &lt;code&gt;Y = 3&lt;/code&gt;, the search is also limited to 2 files for both the Z-order and Hilbert curves.&lt;/p&gt;

&lt;img class=&quot;center-image-0 center-image-100&quot; src=&quot;./assets/posts/spatial-index/z-order-curve-example.svg&quot; /&gt; 
&lt;p class=&quot;figure-header&quot;&gt;Figure 12: Z-Order Curve - Example&lt;/p&gt;

&lt;p&gt;Unlike a hierarchical sort on only one dimension, the data is selective across both dimensions, making the multi-dimensional search more efficient.&lt;/p&gt;

&lt;img class=&quot;center-image-0 center-image-100&quot; src=&quot;./assets/posts/spatial-index/hilbert-curve-example.svg&quot; /&gt; 
&lt;p class=&quot;figure-header&quot;&gt;Figure 13: Hilbert Curve - Example&lt;/p&gt;

&lt;p&gt;Although both curves offer a similar advantage, the main shortcoming of the Z-order curve is that it fails to maintain perfect data locality across all the data points on the curve. In Figure 12, notice that the data points between index 8 and 9 are further apart. As the size of the Z-curve increases, so does the distance between such points that connect different parts of the curve.&lt;/p&gt;

&lt;p&gt;The Hilbert curve is preferred over the Z-order curve for ensuring better data locality, while the Z-order curve is still widely used because of its simplicity.&lt;/p&gt;
&lt;/details&gt;

&lt;hr class=&quot;sub-hr&quot; /&gt;

&lt;details open=&quot;&quot; class=&quot;text-container&quot;&gt;&lt;summary class=&quot;h4&quot;&gt;2.4. Optimizing with Z-Values&lt;/summary&gt;

&lt;p&gt;In the examples so far, we have presumed that the &lt;code&gt;X&lt;/code&gt; and &lt;code&gt;Y&lt;/code&gt; values are dense, meaning that there is a value for every combination of &lt;code&gt;X&lt;/code&gt; and &lt;code&gt;Y&lt;/code&gt;. However, in real-world scenarios, data can be sparse, with many &lt;code&gt;X, Y&lt;/code&gt; combinations missing.&lt;/p&gt;

&lt;img class=&quot;center-image-0 center-image-80&quot; src=&quot;./assets/posts/spatial-index/3-partition-curves.svg&quot; /&gt; 
&lt;p class=&quot;figure-header&quot;&gt;Figure 14: Flexibility in Number of Files&lt;/p&gt;
&lt;p&gt;The number of files (4 in the prior examples) isn&apos;t fixed either. Here&apos;s what 3 files would look like using both Z-order and Hilbert curves. The benefit still holds to an extent, because the space-filling curve efficiently clusters related data points.&lt;/p&gt;

&lt;img class=&quot;center-image-0 center-image-90&quot; src=&quot;./assets/posts/spatial-index/z-order-sparse.svg&quot; /&gt; 
&lt;p class=&quot;figure-header&quot;&gt;Figure 15: Optimizing with Z-Values&lt;/p&gt;
&lt;p&gt;To improve efficiency, we can use Z-values. If files are organized by Z-values, each file has a min-max Z-value range. Filters on &lt;code&gt;X&lt;/code&gt; and &lt;code&gt;Y&lt;/code&gt; can be transformed into Z-values, enabling efficient querying by limiting the search to relevant files based on their Z-value ranges.&lt;/p&gt;

&lt;img class=&quot;center-image-0&quot; src=&quot;./assets/posts/spatial-index/z-order-z-values.svg&quot; /&gt; 
&lt;p class=&quot;figure-header&quot;&gt;Figure 16: Efficient Querying with Min-Max Z-Values&lt;/p&gt;
&lt;p&gt;Consider a scenario where the min-max Z-values of 3 files are &lt;code&gt;1 to 5&lt;/code&gt;, &lt;code&gt;6 to 9&lt;/code&gt;, and &lt;code&gt;13 to 16&lt;/code&gt;. Querying by &lt;code&gt;2 ≤ X ≤ 3&lt;/code&gt; and &lt;code&gt;3 ≤ Y ≤ 4&lt;/code&gt; would initially require scanning 2 files. However, if we convert these ranges to their Z-value equivalent, which is &lt;code&gt;10 ≤ Z ≤ 15&lt;/code&gt;, we only need to scan one file, since the min-max Z-values are known.&lt;/p&gt;
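&lt;p&gt;The pruning step itself is simple; below is a minimal sketch (the file layout is hypothetical, not tied to any particular table format): a file is scanned only if its min-max Z-range overlaps the query&apos;s Z-range.&lt;/p&gt;

```java
import java.util.ArrayList;
import java.util.List;

public class ZValuePruning {
    // A file's min-max Z-value range, as tracked in file-level metadata
    record ZRange(long min, long max) {}

    // Keep only the files whose [min, max] Z-range overlaps the query's Z-range
    static List<ZRange> prune(List<ZRange> files, long queryMin, long queryMax) {
        List<ZRange> hits = new ArrayList<>();
        for (ZRange f : files) {
            if (f.max() >= queryMin && f.min() <= queryMax) {
                hits.add(f);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // The three files from Figure 16: Z-values 1-5, 6-9, and 13-16
        List<ZRange> files = List.of(new ZRange(1, 5), new ZRange(6, 9), new ZRange(13, 16));
        // 2 <= X <= 3 and 3 <= Y <= 4 translates to 10 <= Z <= 15:
        // only the file covering Z-values 13 to 16 survives the pruning
        System.out.println(prune(files, 10, 15));
    }
}
```

&lt;p&gt;This is the same min-max data-skipping idea that lakehouse table formats apply at the file level when data is Z-ordered.&lt;/p&gt;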
&lt;/details&gt;

&lt;hr class=&quot;sub-hr&quot; /&gt;

&lt;details open=&quot;&quot; class=&quot;text-container&quot;&gt;&lt;summary class=&quot;h4&quot;&gt;2.5. Z-Order Curve - Implementation&lt;/summary&gt;

&lt;p&gt;So far, we know that Z-ordering arranges the 2D pairs on a 1-dimensional line. More importantly, values that were close together in the 2D plane remain close to each other on the Z-order line. The implementation goal is to derive Z-values that preserve spatial locality from M-dimensional data points (Z-ordering is not limited to 2-dimensional space; it can be generalized to any number of dimensions).&lt;/p&gt;

&lt;p&gt;Z-order bit-interleaving is a technique that interleaves the bits of two or more values to create a 1-D value while preserving spatial locality:&lt;/p&gt;

&lt;img class=&quot;center-image-0 center-image-45&quot; src=&quot;./assets/posts/spatial-index/interleave.svg&quot; /&gt; 
&lt;p class=&quot;figure-header&quot;&gt;Figure 17: Bit Interleaving&lt;/p&gt;
&lt;p&gt;Example: 4-bit values &lt;code&gt;X = 10&lt;/code&gt; and &lt;code&gt;Y = 12&lt;/code&gt; on a 2D grid: &lt;code&gt;X = 1010&lt;/code&gt;, &lt;code&gt;Y = 1100&lt;/code&gt;, so the interleaved value is &lt;code&gt;Z = 1110 0100&lt;/code&gt; (&lt;code&gt;228&lt;/code&gt;)&lt;/p&gt;

&lt;details class=&quot;code-container&quot;&gt;&lt;summary class=&quot;p&quot;&gt;2.5a. Z-Order Curve - Snippet&lt;/summary&gt;

&lt;pre&gt;&lt;code&gt;public class ZOrderCurve {

    // Function to interleave bits of two integers x and y
    public static long interleaveBits(int x, int y) {
        long z = 0;
        for (int i = 0; i &amp;lt; 32; i++) {
            // Cast to long before shifting: x bit i goes to position 2i and
            // y bit i to position 2i + 1 (shifting in int overflows for i &amp;gt;= 16)
            z |= (((long) (x &amp;gt;&amp;gt;&amp;gt; i) &amp;amp; 1L) &amp;lt;&amp;lt; (2 * i)) | (((long) (y &amp;gt;&amp;gt;&amp;gt; i) &amp;amp; 1L) &amp;lt;&amp;lt; (2 * i + 1));
        }
        return z;
    }

    // Function to compute the Z-order curve values for a list of points
    public static long[] zOrderCurve(int[][] points) {
        long[] zValues = new long[points.length];
        for (int i = 0; i &amp;lt; points.length; i++) {
            int x = points[i][0];
            int y = points[i][1];
            zValues[i] = interleaveBits(x, y);
        }
        return zValues;
    }

    public static void main(String[] args) {
        int[][] points = { {1, 2}, {3, 4}, {5, 6} };
        long[] zValues = zOrderCurve(points);

        System.out.println(&quot;Z-order values:&quot;);
        for (long z : zValues) {
            System.out.println(z);
        }
    }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;/details&gt;

&lt;hr class=&quot;hr&quot; /&gt;

&lt;img class=&quot;center-image-0 center-image-70&quot; src=&quot;./assets/posts/spatial-index/z-order-2d-plane.svg&quot; /&gt; 
&lt;p class=&quot;figure-header&quot;&gt;Figure 18: 2-D Z-Order Curve Space&lt;/p&gt;

&lt;p&gt;From the above Z-order keys, we see that points that are close to each other in the original space have close Z-order keys. For instance, points sharing the prefix &lt;code&gt;000&lt;/code&gt; in their Z-order keys are close in 2D space, while a differing prefix, such as &lt;code&gt;110&lt;/code&gt;, indicates greater distance.&lt;/p&gt;

&lt;img class=&quot;center-image-0 center-image-70&quot; src=&quot;./assets/posts/spatial-index/z-order-success.svg&quot; /&gt; 
&lt;p class=&quot;figure-header&quot;&gt;Figure 19: 2-D Z-Order Curve Space and a Query Region&lt;/p&gt;
&lt;p&gt;Now that we know how to calculate the Z-order keys, we can use them to define a range of values to read (a range query); to do so, we have to find the lower and upper bounds. For example, for the query rectangle &lt;code&gt;2 ≤ X ≤ 3&lt;/code&gt; and &lt;code&gt;4 ≤ Y ≤ 5&lt;/code&gt;, the lower bound is &lt;code&gt;Z-Order(X = 2, Y = 4) = 100100&lt;/code&gt; and the upper bound is &lt;code&gt;Z-Order(X = 3, Y = 5) = 100111&lt;/code&gt;, which translates to Z-order values of &lt;code&gt;36&lt;/code&gt; and &lt;code&gt;39&lt;/code&gt;.&lt;/p&gt;
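&lt;p&gt;A minimal sketch of computing these bounds (assuming the interleaving order from the figures, with &lt;code&gt;Y&lt;/code&gt; taking the more significant bit of each pair): because interleaving is monotone in each coordinate, interleaving the two corners of the rectangle yields the lower and upper bounds of the Z-range.&lt;/p&gt;

```java
public class ZRangeBounds {
    // Interleave x and y bit by bit; y takes the more significant
    // bit of each pair, matching the ordering used in the figures
    static long zOrder(int x, int y) {
        long z = 0;
        for (int i = 0; i < 32; i++) {
            z |= ((long) (x >>> i) & 1L) << (2 * i);
            z |= ((long) (y >>> i) & 1L) << (2 * i + 1);
        }
        return z;
    }

    public static void main(String[] args) {
        // Query rectangle 2 <= X <= 3 and 4 <= Y <= 5: interleaving the
        // bottom-left and top-right corners gives the Z-range to read
        long lower = zOrder(2, 4); // 0b100100 = 36
        long upper = zOrder(3, 5); // 0b100111 = 39
        System.out.println(lower + " to " + upper);
    }
}
```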

&lt;img class=&quot;center-image-0 center-image-70&quot; src=&quot;./assets/posts/spatial-index/z-order-danger.svg&quot; /&gt; 
&lt;p class=&quot;figure-header&quot;&gt;Figure 20: 2-D Z-Order Curve Space and a Query Region (The Problem)&lt;/p&gt;
&lt;p&gt;However, the points matching a range query are not always contiguous along the Z path. For example, for the query rectangle &lt;code&gt;1 ≤ X ≤ 3&lt;/code&gt; and &lt;code&gt;3 ≤ Y ≤ 4&lt;/code&gt;, the lower bound is &lt;code&gt;Z-Order(X = 1, Y = 3) = 001011&lt;/code&gt; and the upper bound is &lt;code&gt;Z-Order(X = 3, Y = 4) = 100101&lt;/code&gt;, which translates to Z-order values of &lt;code&gt;11&lt;/code&gt; and &lt;code&gt;37&lt;/code&gt;: a range that also covers many points outside the rectangle, typically optimized by splitting it into subranges.&lt;/p&gt;

&lt;p&gt;The Z-order curve only weakly preserves latitude-longitude proximity, i.e. two locations that are close in physical distance are not guaranteed to be close along the Z-curve.&lt;/p&gt;
&lt;/details&gt;

&lt;hr class=&quot;sub-hr&quot; /&gt;

&lt;details open=&quot;&quot; class=&quot;text-container&quot;&gt;&lt;summary class=&quot;h4&quot;&gt;2.6. Hilbert Curve - Implementation&lt;/summary&gt;
&lt;p&gt;From &lt;a href=&quot;#2-2-hilbert-curve-intuition&quot;&gt;Section 2.2&lt;/a&gt;, we know that the Hilbert curve implementation converts 2D coordinates to a single scalar value that preserves spatial locality by recursively rotating and transforming the coordinate space.&lt;/p&gt;

&lt;p&gt;In the code snippet: The &lt;code&gt;xyToHilbert&lt;/code&gt; function computes this scalar value using bitwise operations, while the &lt;code&gt;hilbertToXy&lt;/code&gt; function reverses this process. This method ensures that points close in 2D space remain close in the 1D Hilbert curve index, making it useful for spatial indexing.&lt;/p&gt;

&lt;details class=&quot;code-container&quot;&gt;&lt;summary class=&quot;p&quot;&gt;2.6a. Hilbert Curve - Snippet&lt;/summary&gt;
&lt;pre&gt;&lt;code&gt;public class HilbertCurve {
    // Rotate/flip a quadrant appropriately
    private static void rot(int n, int[] x, int[] y, int rx, int ry) {
        if (ry == 0) {
            if (rx == 1) {
                x[0] = n - 1 - x[0];
                y[0] = n - 1 - y[0];
            }
            // Swap x and y
            int temp = x[0];
            x[0] = y[0];
            y[0] = temp;
        }
    }

    // Convert (x, y) to Hilbert curve distance
    public static int xyToHilbert(int n, int x, int y) {
        int d = 0;
        int[] ix = { x };
        int[] iy = { y };

        for (int s = n / 2; s &amp;gt; 0; s /= 2) {
            int rx = (ix[0] &amp;amp; s) &amp;gt; 0 ? 1 : 0;
            int ry = (iy[0] &amp;amp; s) &amp;gt; 0 ? 1 : 0;
            d += s * s * ((3 * rx) ^ ry);
            rot(s, ix, iy, rx, ry);
        }

        return d;
    }

    // Convert Hilbert curve distance to (x, y)
    public static void hilbertToXy(int n, int d, int[] x, int[] y) {
        int rx, ry, t = d;
        x[0] = y[0] = 0;
        for (int s = 1; s &amp;lt; n; s *= 2) {
            rx = (t / 2) % 2;
            ry = (t ^ rx) % 2;
            rot(s, x, y, rx, ry);
            x[0] += s * rx;
            y[0] += s * ry;
            t /= 4;
        }
    }

    public static void main(String[] args) {
        int n = 16; // size of the grid (must be a power of 2)
        int x = 5;
        int y = 10;
        int d = xyToHilbert(n, x, y);
        System.out.println(&quot;The Hilbert curve distance for (&quot; + x + &quot;, &quot; + y + &quot;) is: &quot; + d);

        // Use separate arrays for x and y; passing the same array for both
        // would make x[0] and y[0] alias the same element
        int[] px = new int[1];
        int[] py = new int[1];
        hilbertToXy(n, d, px, py);
        System.out.println(&quot;The coordinates for Hilbert curve distance &quot; + d + &quot; are: (&quot; + px[0] + &quot;, &quot; + py[0] + &quot;)&quot;);
    }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;/details&gt;
&lt;/details&gt;

&lt;hr class=&quot;sub-hr&quot; /&gt;

&lt;details open=&quot;&quot; class=&quot;text-container&quot;&gt;&lt;summary class=&quot;h4&quot;&gt;2.7. Z-Order Curve and Hilbert Curve - Conclusion&lt;/summary&gt;

&lt;p&gt;Usage: Insert data points and their Z-order/Hilbert keys (let&apos;s call them Z and H keys) into a one-dimensional hierarchical index structure, such as a &lt;a href=&quot;/b-tree&quot;&gt;B-Tree&lt;/a&gt; or Quad-Tree. For range or nearest-neighbor queries, convert the search criteria into Z/H keys or ranges of keys. After retrieval, filter the results as necessary to remove any false positives.&lt;/p&gt;
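&lt;p&gt;A minimal sketch of this usage pattern, with Java&apos;s &lt;code&gt;TreeMap&lt;/code&gt; (a red-black tree) standing in for the 1-D index, using the query rectangle from Figure 20:&lt;/p&gt;

```java
import java.util.Map;
import java.util.TreeMap;

public class ZKeyIndex {
    // Interleave x and y (y takes the more significant bit of each pair)
    static long zOrder(int x, int y) {
        long z = 0;
        for (int i = 0; i < 32; i++) {
            z |= ((long) (x >>> i) & 1L) << (2 * i);
            z |= ((long) (y >>> i) & 1L) << (2 * i + 1);
        }
        return z;
    }

    // Index an 8x8 grid by Z-key, run the Figure 20 query (1 <= X <= 3, 3 <= Y <= 4),
    // and return { matches after post-filtering, candidates in the Z-range }
    static int[] rangeQuery() {
        TreeMap<Long, int[]> index = new TreeMap<>(); // stand-in for a B-tree on Z-keys
        for (int x = 0; x < 8; x++)
            for (int y = 0; y < 8; y++)
                index.put(zOrder(x, y), new int[] { x, y });

        int matches = 0, candidates = 0;
        // Scan the Z-range between the interleaved corner points: 11 to 37
        for (Map.Entry<Long, int[]> e : index.subMap(zOrder(1, 3), true, zOrder(3, 4), true).entrySet()) {
            candidates++;
            int[] p = e.getValue();
            // Post-filter: the Z-range also contains points outside the rectangle
            if (p[0] >= 1 && p[0] <= 3 && p[1] >= 3 && p[1] <= 4) matches++;
        }
        return new int[] { matches, candidates };
    }

    public static void main(String[] args) {
        int[] r = rangeQuery();
        System.out.println(r[0] + " matches out of " + r[1] + " candidate Z-values");
    }
}
```

&lt;p&gt;Of the 27 candidate Z-values in the range, only 6 fall inside the rectangle; the rest are the false positives that the post-filter removes.&lt;/p&gt;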

&lt;p&gt;To conclude: space-filling curves such as the Z-order and Hilbert curves are a powerful technique for querying higher-dimensional data, especially as data volumes grow. By combining bits from multiple dimensions into a single value, space-filling curve indexing preserves spatial locality, enabling efficient data indexing and retrieval.&lt;/p&gt;

&lt;p&gt;However, as seen in &lt;a href=&quot;#2-5-z-order-curve-implementation&quot;&gt;Section 2.5&lt;/a&gt;, large jumps along the Z-Order curve can affect certain types of queries (better with Hilbert curves &lt;a href=&quot;#2-2-hilbert-curve-intuition&quot;&gt;Section 2.2&lt;/a&gt;). The success of Z-Order indexing relies on the data&apos;s distribution and cardinality. Therefore, it is essential to evaluate the nature of the data, query patterns, performance needs and limitation(s) of indexing strategies.&lt;/p&gt;
&lt;/details&gt;

&lt;/details&gt;

&lt;hr class=&quot;clear-hr&quot; /&gt;

&lt;details&gt;&lt;summary class=&quot;h3&quot;&gt;3. References&lt;/summary&gt;

&lt;pre style=&quot;max-height: 300px&quot;&gt;&lt;code&gt;1. &quot;Programming the Hilbert curve,&quot; American Institute of Physics (AIP) Conf. Proc. 707, 381 (2004).
2. Wikipedia. “Z-order curve,” [Online]. Available: https://en.wikipedia.org/wiki/Z-order_curve.
3. Amazon Web Services, “Z-order indexing for multifaceted queries in Amazon DynamoDB – Part 1,” [Online]. Available: https://aws.amazon.com/blogs/database/z-order-indexing-for-multifaceted-queries-in-amazon-dynamodb-part-1/. [Accessed: 10-Jun-2024].
4. N. Chandra, &quot;Z-order indexing for efficient queries in Data Lake,&quot; Medium, 20-Sep-2021. [Online]. Available: https://medium.com/@nishant.chandra/z-order-indexing-for-efficient-queries-in-data-lake-48eceaeb2320. [Accessed: 10-Jun-2024].
5. YouTube, “Z-order indexing for efficient queries in Data Lake,” [Online]. Available: https://www.youtube.com/watch?v=YLVkITvF6KU. [Accessed: 10-Jun-2024].
&lt;/code&gt;&lt;/pre&gt;

&lt;/details&gt;</content><author><name>Adesh Nalpet Adimurthy</name></author><category term="System Wisdom" /><category term="Database" /><category term="Spatial Index" /><summary type="html">0. Overview Spatial data has grown (/is growing) rapidly thanks to web services tracking where and when users do things. Most applications add location tags and often allow users check in specific places and times. This surge is largely due to smartphones, which act as location sensors, making it easier than ever to capture and analyze this type of data.</summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://pyblog.xyz/assets/featured/webp/spatio-temporal-index.webp" /><media:content medium="image" url="https://pyblog.xyz/assets/featured/webp/spatio-temporal-index.webp" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Real-time insights: Telemetry Pipeline</title><link href="https://pyblog.xyz/telemetry-pipeline" rel="alternate" type="text/html" title="Real-time insights: Telemetry Pipeline" /><published>2024-06-07T00:00:00+00:00</published><updated>2024-06-07T00:00:00+00:00</updated><id>https://pyblog.xyz/telemetry-pipeline</id><content type="html" xml:base="https://pyblog.xyz/telemetry-pipeline">&lt;details open=&quot;&quot;&gt;&lt;summary class=&quot;h3&quot;&gt;0. Overview&lt;/summary&gt;
&lt;p&gt;&lt;/p&gt;
&lt;details open=&quot;&quot; class=&quot;text-container&quot;&gt;&lt;summary class=&quot;h4&quot;&gt;0.1. Architecture&lt;/summary&gt;
&lt;p&gt;A &lt;a href=&quot;https://en.wikipedia.org/wiki/Telemetry&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;telemetry&lt;/a&gt; pipeline is a system that collects, ingests, processes, stores, and analyzes telemetry data (metrics, logs, traces) from various sources in real-time or near real-time to provide insights into the performance and health of applications and infrastructure.&lt;/p&gt;

&lt;img class=&quot;telemetry-barebone center-image-90&quot; src=&quot;./assets/posts/telemetry/telemetry-barebone.svg&quot; /&gt; 
&lt;p class=&quot;figure-header&quot;&gt;Figure 0: Barebone Telemetry Pipeline Architecture&lt;/p&gt;

&lt;p&gt;It typically involves tools like Telegraf for data collection, Kafka for ingestion, Flink for processing, and &lt;a href=&quot;https://cassandra.apache.org/&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;Cassandra&lt;/a&gt; and &lt;a href=&quot;https://victoriametrics.com/&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;VictoriaMetrics&lt;/a&gt; for storage and analysis.&lt;/p&gt;

&lt;img class=&quot;telemetry-architecture&quot; src=&quot;./assets/posts/telemetry/telemetry-architecture.svg&quot; /&gt; 
&lt;p class=&quot;figure-header&quot;&gt;Figure 1: Detailed Telemetry Pipeline Architecture&lt;/p&gt;
&lt;/details&gt;

&lt;hr class=&quot;sub-hr&quot; /&gt;

&lt;details open=&quot;&quot; class=&quot;text-container&quot;&gt;&lt;summary class=&quot;h4&quot;&gt;0.2. Stages&lt;/summary&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;b&gt;Collection&lt;/b&gt;: Telemetry data is collected from various sources using agents like Telegraf and &lt;a href=&quot;https://www.fluentd.org/&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;Fluentd&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;b&gt;Ingestion&lt;/b&gt;: Data is ingested through message brokers such as Apache Kafka or Kinesis to handle high throughput.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;b&gt;Processing&lt;/b&gt;: Real-time processing is done using stream processing frameworks like Apache Flink for filtering, aggregating, and enriching data.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;b&gt;Storage and Analysis&lt;/b&gt;: Processed data is stored in systems like Cassandra, &lt;a href=&quot;https://clickhouse.com/&quot; target=&quot;_blank&quot;&gt;ClickHouse&lt;/a&gt; and &lt;a href=&quot;https://www.elastic.co/downloads/elasticsearch&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;Elasticsearch&lt;/a&gt;, and analyzed using tools like &lt;a href=&quot;https://grafana.com/&quot; target=&quot;_blank&quot;&gt;Grafana&lt;/a&gt; and &lt;a href=&quot;https://www.elastic.co/kibana&quot; target=&quot;_blank&quot;&gt;Kibana&lt;/a&gt; for visualization and alerting.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/details&gt;

&lt;/details&gt;

&lt;hr class=&quot;clear-hr&quot; /&gt;

&lt;details open=&quot;&quot;&gt;&lt;summary class=&quot;h3&quot;&gt;1. Collection&lt;/summary&gt;
&lt;p&gt;&lt;/p&gt;
&lt;details open=&quot;&quot; class=&quot;text-container&quot;&gt;&lt;summary class=&quot;h4&quot;&gt;1.1. Collection Agent&lt;/summary&gt;

&lt;p&gt;To start, we&apos;ll use &lt;a href=&quot;https://www.influxdata.com/time-series-platform/telegraf/&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;Telegraf&lt;/a&gt;, a versatile open-source agent that collects metrics from various sources and writes them to different outputs. Telegraf supports a wide range of &lt;a href=&quot;https://docs.influxdata.com/telegraf/v1/plugins/#input-plugins&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;input&lt;/a&gt; and &lt;a href=&quot;https://docs.influxdata.com/telegraf/v1/plugins/#output-plugins&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;output plugins&lt;/a&gt;, making it easy to gather data from sensors, servers, GPS systems, and more.&lt;/p&gt;

&lt;p&gt;&lt;img class=&quot;center-image-0 center-image-80 telegraf-overview&quot; src=&quot;./assets/posts/telemetry/telegraf-overview.svg&quot; /&gt; &lt;/p&gt;
&lt;p class=&quot;figure-header&quot;&gt;Figure 2: Telegraf for collecting metrics &amp;amp; data&lt;/p&gt;

&lt;p&gt;For this example, we&apos;ll focus on collecting the CPU temperature and Fan speed from a macOS system using the &lt;a href=&quot;https://github.com/influxdata/telegraf/blob/release-1.30/plugins/inputs/exec/README.md&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;exec plugin&lt;/a&gt; in Telegraf. And leverage the &lt;a href=&quot;https://github.com/lavoiesl/osx-cpu-temp&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;osx-cpu-temp&lt;/a&gt; command line tool to fetch the CPU temperature.&lt;/p&gt;

&lt;p&gt;🌵 &lt;a href=&quot;https://github.com/inlets/inlets-pro&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;Inlets&lt;/a&gt; allows devices behind firewalls or NAT to securely expose local services to the public internet by tunneling traffic through a public-facing Inlets server&lt;/p&gt;
&lt;/details&gt;

&lt;hr class=&quot;sub-hr&quot; /&gt;

&lt;details class=&quot;code-container&quot;&gt;&lt;summary class=&quot;h4&quot;&gt;1.2. Dependencies&lt;/summary&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Using Homebrew: &lt;code&gt;brew install telegraf&lt;/code&gt;&lt;br /&gt;
For other OS, refer: &lt;a href=&quot;https://docs.influxdata.com/telegraf/v1/install/&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;docs.influxdata.com/telegraf/v1/install&lt;/a&gt;. &lt;br /&gt;
Optionally, download the latest telegraf release from: &lt;a href=&quot;https://www.influxdata.com/downloads&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;https://www.influxdata.com/downloads&lt;/a&gt;&lt;br /&gt;&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;Using Homebrew: &lt;code&gt;brew install osx-cpu-temp&lt;/code&gt;&lt;br /&gt;
Refer: &lt;a href=&quot;https://github.com/lavoiesl/osx-cpu-temp&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;github.com/lavoiesl/osx-cpu-temp&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/details&gt;
&lt;hr class=&quot;sub-hr&quot; /&gt;

&lt;details class=&quot;code-container&quot;&gt;&lt;summary class=&quot;h4&quot;&gt;1.3. Events&lt;/summary&gt;

&lt;p&gt;Here&apos;s a &lt;b&gt;custom script&lt;/b&gt; to get the CPU and Fan Speed:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;#!/bin/bash
timestamp=$(date +%s)000000000
hostname=$(hostname | tr &quot;[:upper:]&quot; &quot;[:lower:]&quot;)
cpu=$(osx-cpu-temp -c | sed -e &apos;s/\([0-9.]*\).*/\1/&apos;)
fans=$(osx-cpu-temp -f | grep &apos;^Fan&apos; | sed -e &apos;s/^Fan \([0-9]\) - \([a-zA-Z]*\) side *at \([0-9]*\) RPM (\([0-9]*\)%).*/\1,\2,\3,\4/&apos;)
echo &quot;cpu_temp,device_id=$hostname temp=$cpu $timestamp&quot;
for f in $fans; do
  side=$(echo &quot;$f&quot; | cut -d, -f2 | tr &quot;[:upper:]&quot; &quot;[:lower:]&quot;)
  rpm=$(echo &quot;$f&quot; | cut -d, -f3)
  pct=$(echo &quot;$f&quot; | cut -d, -f4)
  echo &quot;fan_speed,device_id=$hostname,side=$side rpm=$rpm,percent=$pct $timestamp&quot;
done
&lt;/code&gt;&lt;/pre&gt;

&lt;hr class=&quot;hr&quot; /&gt;

&lt;p&gt;&lt;b&gt;Output Format&lt;/b&gt;: &lt;code&gt;measurement,host=foo,tag=measure val1=5,val2=3234.34 1609459200000000000&lt;/code&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;The output is of &lt;a href=&quot;https://docs.influxdata.com/influxdb/v1/write_protocols/line_protocol_reference/&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;Line protocol syntax&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Where &lt;code&gt;measurement&lt;/code&gt; is the &quot;table&quot; (&quot;measurement&quot; in InfluxDB terms) to which the metrics are written.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;host=foo,tag=measure&lt;/code&gt; are tags you can group and filter by.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;val1=5,val2=3234.34&lt;/code&gt; are values, to display in graphs.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;1716425990000000000&lt;/code&gt; is the current unix timestamp in seconds followed by 9 x &quot;0&quot;, representing a nanosecond-precision timestamp.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;b&gt;Sample Output&lt;/b&gt;: &lt;code&gt;cpu_temp,device_id=adeshs-mbp temp=0.0 1716425990000000000&lt;/code&gt;&lt;/p&gt;
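&lt;p&gt;The same point can be assembled programmatically; a minimal sketch (the helper below is illustrative, not part of Telegraf, and mirrors the fields emitted by the custom script):&lt;/p&gt;

```java
public class LineProtocolPoint {
    // Build one line-protocol point: measurement,tagKey=tagValue fieldKey=fieldValue <ns-timestamp>
    static String cpuTemp(String deviceId, double temp, long epochSeconds) {
        // Line protocol expects a nanosecond timestamp: epoch seconds plus nine zeros
        return "cpu_temp,device_id=" + deviceId + " temp=" + temp + " " + epochSeconds + "000000000";
    }

    public static void main(String[] args) {
        System.out.println(cpuTemp("adeshs-mbp", 0.0, 1716425990L));
    }
}
```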
&lt;/details&gt;

&lt;hr class=&quot;sub-hr&quot; /&gt;

&lt;details class=&quot;code-container&quot;&gt;&lt;summary class=&quot;h4&quot;&gt;1.4. Configuration&lt;/summary&gt;
&lt;p&gt;The location of &lt;code&gt;telegraf.conf&lt;/code&gt; installed using homebrew: &lt;code&gt;/opt/homebrew/etc/telegraf.conf&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Telegraf&apos;s configuration file is written using &lt;a href=&quot;https://github.com/toml-lang/toml#toml&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;TOML&lt;/a&gt; and is composed of three sections: &lt;a href=&quot;https://github.com/influxdata/telegraf/blob/master/docs/CONFIGURATION.md#global-tags&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;global tags&lt;/a&gt;, &lt;a href=&quot;https://github.com/influxdata/telegraf/blob/master/docs/CONFIGURATION.md#agent&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;agent&lt;/a&gt; settings, and &lt;a href=&quot;https://github.com/influxdata/telegraf/blob/master/docs/CONFIGURATION.md#plugins&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;plugins&lt;/a&gt; (inputs, outputs, processors, and aggregators).&lt;/p&gt;

&lt;p&gt;Once Telegraf collects the data, we need to transmit it to a designated endpoint for further processing. For this, we&apos;ll use the &lt;a href=&quot;https://github.com/influxdata/telegraf/blob/release-1.30/plugins/outputs/http/README.md&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;HTTP output plugin&lt;/a&gt; in Telegraf to send the data in JSON format to a Flask application (covered in the next section).&lt;/p&gt;

&lt;p&gt;Below is what the &lt;code&gt;telegraf.conf&lt;/code&gt; file looks like, with the &lt;code&gt;exec&lt;/code&gt; input plugin (format: &lt;code&gt;influx&lt;/code&gt;) and the &lt;code&gt;HTTP&lt;/code&gt; output plugin (format: &lt;code&gt;JSON&lt;/code&gt;).&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;[agent]
  interval = &quot;10s&quot;
  round_interval = true
  metric_buffer_limit = 10000
  flush_buffer_when_full = true
  collection_jitter = &quot;0s&quot;
  flush_interval = &quot;10s&quot;
  flush_jitter = &quot;0s&quot;
  precision = &quot;&quot;
  debug = false
  quiet = false
  logfile = &quot;/path to telegraf log/telegraf.log&quot;
  hostname = &quot;host&quot;
  omit_hostname = false

[[inputs.exec]]
  commands = [&quot;/path to custom script/osx_metrics.sh&quot;]
  timeout = &quot;5s&quot;
  name_suffix = &quot;_custom&quot;
  data_format = &quot;influx&quot;
  interval = &quot;10s&quot;

[[outputs.http]]
  url = &quot;http://127.0.0.1:5000/metrics&quot;
  method = &quot;POST&quot;
  timeout = &quot;5s&quot;
  data_format = &quot;json&quot;
  [outputs.http.headers]
    Content-Type = &quot;application/json&quot;
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Edit &lt;code&gt;telegraf.conf&lt;/code&gt; (use above config):&lt;br /&gt; &lt;code&gt;vi /opt/homebrew/etc/telegraf.conf&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;🚧: Don&apos;t forget to explore the many other input and output plugins: &lt;a href=&quot;https://docs.influxdata.com/telegraf/v1/plugins/&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;docs.influxdata.com/telegraf/v1/plugins&lt;/a&gt;&lt;/p&gt;
&lt;/details&gt;

&lt;hr class=&quot;sub-hr&quot; /&gt;

&lt;details class=&quot;code-container&quot;&gt;&lt;summary class=&quot;h4&quot; id=&quot;telemetry-1-5&quot;&gt;1.5. Start Capture&lt;/summary&gt;
&lt;p&gt;Run &lt;code&gt;telegraf&lt;/code&gt; (when installed from Homebrew):&lt;br /&gt; &lt;code&gt;/opt/homebrew/opt/telegraf/bin/telegraf -config /opt/homebrew/etc/telegraf.conf&lt;/code&gt;&lt;/p&gt;
&lt;/details&gt;

&lt;/details&gt;

&lt;hr class=&quot;clear-hr&quot; /&gt;

&lt;details open=&quot;&quot;&gt;&lt;summary class=&quot;h3&quot;&gt;2. Ingestion&lt;/summary&gt;
&lt;p&gt;&lt;/p&gt;
&lt;details open=&quot;&quot; class=&quot;text-container&quot;&gt;&lt;summary class=&quot;h4&quot;&gt;2.1. Telemetry Server&lt;/summary&gt;

&lt;p&gt;The telemetry server layer is designed to be &lt;u&gt;lightweight&lt;/u&gt;. Its primary function is to authenticate incoming requests and publish raw events directly to Message Broker/Kafka. Further processing of these events will be carried out by the stream processing framework.&lt;/p&gt;

&lt;p&gt;For our example, the &lt;a href=&quot;https://flask.palletsprojects.com/en/3.0.x/&quot; target=&quot;_blank&quot;&gt;Flask&lt;/a&gt; application serves as the telemetry server, acting as the entry point (via load-balancer) for the requests. It receives the data from a POST request, validates it, and publishes the messages to a &lt;a href=&quot;https://kafka.apache.org/&quot; target=&quot;_blank&quot;&gt;Kafka&lt;/a&gt; topic.&lt;/p&gt;

&lt;p&gt;Topic partition is the unit of parallelism in Kafka. Choose a partition key (ex: client_id) that distributes records evenly to avoid hotspots, and a &lt;a href=&quot;https://www.confluent.io/blog/how-choose-number-topics-partitions-kafka-cluster&quot; target=&quot;_blank&quot;&gt;number of partitions&lt;/a&gt; that achieves good throughput.&lt;/p&gt;
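&lt;p&gt;To see why an even key distribution matters, here is a toy illustration of key-based partition assignment (Kafka&apos;s default partitioner actually uses murmur2 on the key bytes; the MD5-based hash and the partition count below are stand-ins for the idea, not Kafka&apos;s code):&lt;/p&gt;

```python
# Toy partitioner: hash the key and take it modulo the partition count,
# then check how evenly 10,000 distinct client_ids spread across partitions.
import hashlib
from collections import Counter

NUM_PARTITIONS = 6  # assumption for the illustration

def partition_for(key: str) -> int:
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PARTITIONS

counts = Counter(partition_for(f"client-{i}") for i in range(10_000))
print(sorted(counts.values()))  # roughly equal counts per partition
```

A skewed key (say, a constant tenant id shared by most traffic) would pile records onto one partition, capping consumer parallelism at one.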

&lt;p&gt;🚧 Message Broker Alternatives: &lt;a href=&quot;https://aws.amazon.com/kinesis/&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;Amazon Kinesis&lt;/a&gt;, &lt;a href=&quot;https://redpanda.com/&quot; target=&quot;_blank&quot;&gt;Redpanda&lt;/a&gt;&lt;/p&gt;
&lt;/details&gt;

&lt;hr class=&quot;sub-hr&quot; /&gt;

&lt;details class=&quot;code-container&quot;&gt;&lt;summary class=&quot;h4&quot;&gt;2.2. Dependencies&lt;/summary&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Using PIP: &lt;code&gt;pip3 install Flask flask-cors kafka-python&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;
&lt;b&gt;For Local Kafka Set-up&lt;/b&gt; (Or use Docker from next sub-section):
&lt;li&gt;&lt;p&gt;Using Homebrew: &lt;code&gt;brew install kafka&lt;/code&gt; &lt;br /&gt;Refer: &lt;a href=&quot;https://formulae.brew.sh/formula/kafka&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;Homebrew Kafka&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Start Zookeeper: &lt;code&gt;zookeeper-server-start /opt/homebrew/etc/kafka/zookeeper.properties&lt;/code&gt;&lt;br /&gt;
Start Kafka: &lt;code&gt;brew services restart kafka&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;Create Topic: &lt;code&gt;kafka-topics --create --bootstrap-server localhost:9092 --replication-factor 1 --partitions 1 --topic learn&lt;/code&gt; &lt;br /&gt;Usage: &lt;a href=&quot;https://kafka.apache.org/documentation/#topicconfigs&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;Kafka CLI&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/details&gt;

&lt;hr class=&quot;sub-hr&quot; /&gt;

&lt;details class=&quot;code-container&quot;&gt;&lt;summary class=&quot;h4&quot; id=&quot;telemetry-2-3&quot;&gt;2.3. Docker Compose&lt;/summary&gt;

&lt;p&gt;To set up Kafka using Docker Compose, ensure Docker is installed on your machine by following the instructions on the &lt;a href=&quot;https://docs.docker.com/get-docker/&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;Docker installation&lt;/a&gt; page. Once Docker is installed, create a &lt;code&gt;docker-compose.yml&lt;/code&gt; for &lt;code&gt;Kafka&lt;/code&gt; and &lt;code&gt;Zookeeper&lt;/code&gt;:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;version: &apos;3.7&apos;

services:
  zookeeper:
    image: confluentinc/cp-zookeeper:7.3.5
    environment:
      ZOOKEEPER_CLIENT_PORT: 2181
    ports:
      - &quot;2181:2181&quot;

  kafka:
    image: confluentinc/cp-kafka:7.3.5
    ports:
      - &quot;9092:9092&quot;  # Internal port
      - &quot;9094:9094&quot;  # External port
    environment:
      KAFKA_BROKER_ID: 1
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: INTERNAL:PLAINTEXT,OUTSIDE:PLAINTEXT
      KAFKA_ADVERTISED_LISTENERS: INTERNAL://kafka:9092,OUTSIDE://localhost:9094
      KAFKA_LISTENERS: INTERNAL://0.0.0.0:9092,OUTSIDE://0.0.0.0:9094
      KAFKA_INTER_BROKER_LISTENER_NAME: INTERNAL
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
      CONFLUENT_SUPPORT_METRICS_ENABLE: &quot;false&quot;
    depends_on:
      - zookeeper

  kafka-topics-creator:
    image: confluentinc/cp-kafka:7.3.5
    depends_on:
      - kafka
    entrypoint: [&quot;/bin/sh&quot;, &quot;-c&quot;]
    command: |
      &quot;
      # blocks until kafka is reachable
      kafka-topics --bootstrap-server kafka:9092 --list

      echo -e &apos;Creating kafka topics&apos;
      kafka-topics --bootstrap-server kafka:9092 --create --if-not-exists --topic raw-events --replication-factor 1 --partitions 1

      echo -e &apos;Successfully created the following topics:&apos;
      kafka-topics --bootstrap-server kafka:9092 --list
      &quot;

  schema-registry:
    image: confluentinc/cp-schema-registry:7.3.5
    environment:
      - SCHEMA_REGISTRY_KAFKASTORE_CONNECTION_URL=zookeeper:2181
      - SCHEMA_REGISTRY_HOST_NAME=schema-registry
      - SCHEMA_REGISTRY_LISTENERS=http://schema-registry:8085,http://localhost:8085
    ports:
      - 8085:8085
    depends_on: [zookeeper, kafka]
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Run &lt;code&gt;docker-compose up&lt;/code&gt; to start the services (Kafka + Zookeeper).&lt;/p&gt;
&lt;/details&gt;

&lt;hr class=&quot;sub-hr&quot; /&gt;

&lt;details class=&quot;code-container&quot;&gt;&lt;summary class=&quot;h4&quot; id=&quot;telemetry-2-4&quot;&gt;2.4. Start Server&lt;/summary&gt;

&lt;p&gt;The Flask application includes a &lt;code&gt;/metrics&lt;/code&gt; endpoint, matching the output URL configured in &lt;code&gt;telegraf.conf&lt;/code&gt;. When data is sent to this endpoint, the Flask app receives it and publishes the message to &lt;code&gt;Kafka&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;New to Flask? Refer: &lt;a href=&quot;https://flask.palletsprojects.com/en/3.0.x/quickstart/&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;Flask Quickstart&lt;/a&gt;&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import os
from flask_cors import CORS
from flask import Flask, jsonify, request
from dotenv import load_dotenv
from kafka import KafkaProducer
import json


app = Flask(__name__)
cors = CORS(app)
load_dotenv()

producer = KafkaProducer(bootstrap_servers=&apos;localhost:9094&apos;, 
                         value_serializer=lambda v: json.dumps(v).encode(&apos;utf-8&apos;))

@app.route(&apos;/metrics&apos;, methods=[&apos;POST&apos;])
def process_metrics():
    data = request.get_json()
    print(data)
    producer.send(&apos;raw-events&apos;, data)
    return jsonify({&apos;status&apos;: &apos;success&apos;}), 200


if __name__ == &quot;__main__&quot;:
    app.run(debug=True, host=&quot;0.0.0.0&quot;, port=int(os.environ.get(&quot;PORT&quot;, 8080)))
&lt;/code&gt;&lt;/pre&gt;
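&lt;p&gt;A quick way to smoke-test the endpoint without waiting for a Telegraf flush is to POST a hand-built payload. The payload shape below mimics Telegraf&apos;s JSON serializer (metric name, tags, fields, timestamp); the exact field values are assumptions for illustration, and nothing is sent until &lt;code&gt;send_metrics()&lt;/code&gt; is called against a running Flask app:&lt;/p&gt;

```python
# Build a Telegraf-style JSON payload and (optionally) POST it to /metrics.
import json
import urllib.request

def build_payload(temp: float, device_id: str) -> dict:
    # Shaped like Telegraf's JSON output: a list of metrics with
    # name/tags/fields/timestamp (values here are illustrative).
    return {
        "metrics": [{
            "name": "cpu_temp_custom",
            "tags": {"device_id": device_id},
            "fields": {"temp": temp},
            "timestamp": 1716425990,
        }]
    }

def send_metrics(url: str, payload: dict) -> int:
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:  # requires the Flask app running
        return resp.status

print(json.dumps(build_payload(42.0, "adeshs-mbp")))
```

With the server up, `send_metrics("http://127.0.0.1:5000/metrics", build_payload(42.0, "adeshs-mbp"))` should return 200 and the message should land on the `raw-events` topic.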

&lt;p&gt;Start all services 🚀:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Run Flask App (Telemetry Server):&lt;br /&gt; &lt;code&gt;flask run&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Ensure &lt;code&gt;telegraf&lt;/code&gt; is running (Refer: &lt;a href=&quot;#telemetry-1-5&quot;&gt;Section 1.5&lt;/a&gt;)&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/details&gt;

&lt;/details&gt;

&lt;hr class=&quot;clear-hr&quot; /&gt;

&lt;details open=&quot;&quot;&gt;&lt;summary class=&quot;h3&quot;&gt;3. Processing&lt;/summary&gt;
&lt;p&gt;&lt;/p&gt;
&lt;details open=&quot;&quot; class=&quot;text-container&quot;&gt;&lt;summary class=&quot;h4&quot;&gt;3.1. Stream Processor&lt;/summary&gt;
&lt;p&gt;The Stream Processor is responsible for data transformation, enrichment, and stateful computations/updates over unbounded (push-model) and bounded (pull-model) data streams, and for sinking the enriched, transformed data to various data stores or applications. Key features to look for in a stream processing framework:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;b&gt;Scalability and Performance&lt;/b&gt;: Scale by adding nodes, efficiently use resources, process data with minimal delay, and handle large volumes.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;b&gt;Fault Tolerance and Data Consistency&lt;/b&gt;: Ensure fault tolerance with state saving for failure recovery and exactly-once processing.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;b&gt;Ease of Use and Community Support&lt;/b&gt;: Provide user-friendly APIs in multiple languages, comprehensive documentation, and active community support.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;img src=&quot;./assets/posts/telemetry/stateful-stream-processing.svg&quot; /&gt;
&lt;p class=&quot;figure-header&quot;&gt;Figure 3: Stateful Stream Processing&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;b&gt;Integration and Compatibility&lt;/b&gt;: Seamlessly integrate with various data sources and sinks, and be compatible with other tools in your tech stack.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;b&gt;Windowing and Event Time Processing&lt;/b&gt;: Support various &lt;a href=&quot;https://nightlies.apache.org/flink/flink-docs-release-1.19/docs/dev/datastream/operators/windows/&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;windowing strategies&lt;/a&gt; (tumbling, sliding, session) and manage late-arriving data based on event timestamps.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;b&gt;Security and Monitoring&lt;/b&gt;: Include security features like data encryption and robust access controls, and provide tools for monitoring performance and logging.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
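&lt;p&gt;To make the windowing idea above concrete, here is a toy tumbling-window aggregation in plain Python (not Flink&apos;s API; the 10-second window and the averaging are arbitrary choices for illustration):&lt;/p&gt;

```python
# Toy tumbling window: bucket events by event time into fixed 10-second
# windows and compute the average value per window.
from collections import defaultdict

WINDOW_SEC = 10

def tumbling_avg(events):
    """events: iterable of (epoch_seconds, value) pairs."""
    buckets = defaultdict(list)
    for ts, value in events:
        window_start = (ts // WINDOW_SEC) * WINDOW_SEC  # window the event falls in
        buckets[window_start].append(value)
    return {w: sum(vals) / len(vals) for w, vals in sorted(buckets.items())}

events = [(100, 40.0), (104, 42.0), (111, 50.0), (119, 54.0)]
print(tumbling_avg(events))  # {100: 41.0, 110: 52.0}
```

A real stream processor does the same bucketing continuously and in event time, which is where watermarks and late-data handling come in.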
&lt;p&gt;Although I have set the context to use Flink for this example:&lt;br /&gt;
☢️ Note: While &lt;a href=&quot;https://flink.apache.org/&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;Apache Flink&lt;/a&gt; is a powerful choice for stream processing due to its rich feature set, scalability, and advanced capabilities, it can be overkill for many use cases, particularly those with simpler requirements and/or lower data volumes.&lt;/p&gt;

&lt;p&gt;🚧 Open Source Alternatives: &lt;a href=&quot;https://kafka.apache.org/documentation/streams/&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;Apache Kafka Streams&lt;/a&gt;, &lt;a href=&quot;https://storm.apache.org/&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;Apache Storm&lt;/a&gt;, &lt;a href=&quot;https://samza.apache.org/&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;Apache Samza&lt;/a&gt;&lt;/p&gt;
&lt;/details&gt;

&lt;hr class=&quot;sub-hr&quot; /&gt;

&lt;details class=&quot;code-container&quot;&gt;&lt;summary class=&quot;h4&quot;&gt;3.2. Dependencies&lt;/summary&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Install PyFlink Using PIP: &lt;code&gt;pip3 install apache-flink==1.18.1&lt;/code&gt;&lt;br /&gt;Usage examples: &lt;a href=&quot;https://github.com/apache/flink/tree/release-1.19/flink-python/pyflink/examples&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;flink-python/pyflink/examples&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;

&lt;b&gt;For Local Flink Set-up:&lt;/b&gt; (Or use Docker from next sub-section)
&lt;li&gt;&lt;p&gt;Download Flink and extract the archive: &lt;a href=&quot;https://www.apache.org/dyn/closer.lua/flink/flink-1.18.1/flink-1.18.1-bin-scala_2.12.tgz&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;www.apache.org/dyn/closer.lua/flink/flink-1.18.1/flink-1.18.1-bin-scala_2.12.tgz&lt;/a&gt;&lt;br /&gt;☢️ At the time of writing this post &lt;code&gt;Flink 1.18.1&lt;/code&gt; is the latest stable version that supports &lt;a href=&quot;https://www.apache.org/dyn/closer.lua/flink/flink-connector-kafka-3.1.0/flink-connector-kafka-3.1.0-src.tgz&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;kafka connector plugin&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Download Kafka Connector and extract the archive: &lt;a href=&quot;https://www.apache.org/dyn/closer.lua/flink/flink-connector-kafka-3.1.0/flink-connector-kafka-3.1.0-src.tgz&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;www.apache.org/dyn/closer.lua/flink/flink-connector-kafka-3.1.0/flink-connector-kafka-3.1.0-src.tgz&lt;/a&gt;&lt;br /&gt;Copy/Move the &lt;code&gt;flink-connector-kafka-3.1.0-1.18.jar&lt;/code&gt; to &lt;code&gt;flink-1.18.1/lib&lt;/code&gt; (&lt;code&gt;$FLINK_HOME/lib&lt;/code&gt;)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Ensure Flink Path is set &lt;code&gt;export FLINK_HOME=/full-path/flink-1.18.1&lt;/code&gt; (add to &lt;code&gt;.bashrc&lt;/code&gt;/&lt;code&gt;.zshrc&lt;/code&gt;)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Start Flink Cluster: &lt;code&gt;cd flink-1.18.1 &amp;amp;&amp;amp; ./bin/start-cluster.sh&lt;/code&gt;
&lt;br /&gt;Flink dashboard at: &lt;a href=&quot;http://localhost:8081&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;localhost:8081&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;To Stop Flink Cluster: &lt;code&gt;./bin/stop-cluster.sh&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/details&gt;

&lt;hr class=&quot;sub-hr&quot; /&gt;

&lt;details class=&quot;code-container&quot;&gt;&lt;summary class=&quot;h4&quot; id=&quot;telemetry-3-3&quot;&gt;3.3. Docker Compose&lt;/summary&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Create &lt;code&gt;flink_init/Dockerfile&lt;/code&gt; file for Flink and Kafka Connector:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;FROM flink:1.18.1-scala_2.12

RUN wget -P /opt/flink/lib https://repo.maven.apache.org/maven2/org/apache/flink/flink-connector-kafka/3.1.0-1.18/flink-connector-kafka-3.1.0-1.18.jar

RUN chown -R flink:flink /opt/flink/lib
&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;

&lt;li&gt;&lt;p&gt;Add Flink to &lt;code&gt;docker-compose.yml&lt;/code&gt; (in-addition to Kafka, from &lt;a href=&quot;#telemetry-2-3&quot;&gt;Section 2.3&lt;/a&gt;)&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;version: &apos;3.8&apos;
services:
  jobmanager:
    build: flink_init/.
    ports:
      - &quot;8081:8081&quot;
    command: jobmanager
    environment:
      - |
        FLINK_PROPERTIES=
        jobmanager.rpc.address: jobmanager

  taskmanager:
    build: flink_init/.
    depends_on:
      - jobmanager
    command: taskmanager
    environment:
      - |
        FLINK_PROPERTIES=
        jobmanager.rpc.address: jobmanager
        taskmanager.numberOfTaskSlots: 2
&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;p&gt;Run &lt;code&gt;docker-compose up&lt;/code&gt; to start the services (Kafka + Zookeeper, Flink).&lt;/p&gt;
&lt;/ul&gt;
&lt;/details&gt;

&lt;hr class=&quot;sub-hr&quot; /&gt;

&lt;details class=&quot;code-container&quot;&gt;&lt;summary class=&quot;h4&quot;&gt;3.4. Start Cluster&lt;/summary&gt;
&lt;p&gt;⚠️ PyFlink Job (a minimal sketch: it consumes the &lt;code&gt;raw-events&lt;/code&gt; topic from the earlier sections, parses the JSON, and prints it; the group id and job name are placeholders, and a real job would replace &lt;code&gt;print()&lt;/code&gt; with transformations and a sink):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import json

from pyflink.common.serialization import SimpleStringSchema
from pyflink.common.watermark_strategy import WatermarkStrategy
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.datastream.connectors.kafka import KafkaOffsetsInitializer, KafkaSource

env = StreamExecutionEnvironment.get_execution_environment()

# Kafka source for the &apos;raw-events&apos; topic (external listener from Section 2.3)
source = KafkaSource.builder() \
    .set_bootstrap_servers(&apos;localhost:9094&apos;) \
    .set_topics(&apos;raw-events&apos;) \
    .set_group_id(&apos;telemetry-processor&apos;) \
    .set_starting_offsets(KafkaOffsetsInitializer.latest()) \
    .set_value_only_deserializer(SimpleStringSchema()) \
    .build()

stream = env.from_source(source, WatermarkStrategy.no_watermarks(), &apos;kafka-source&apos;)

# Parse the raw JSON and print; swap print() for real transformations + a sink
stream.map(lambda raw: json.loads(raw)).print()

env.execute(&apos;telemetry-job&apos;)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Start all services 🚀:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Ensure all the services are running (Refer: Section &lt;a href=&quot;#telemetry-1-5&quot;&gt;1.5&lt;/a&gt;, &lt;a href=&quot;#telemetry-2-4&quot;&gt;2.4&lt;/a&gt;, &lt;a href=&quot;#telemetry-3-3&quot;&gt;3.3&lt;/a&gt;)&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/details&gt;

&lt;/details&gt;

&lt;hr class=&quot;clear-hr&quot; /&gt;

&lt;details open=&quot;&quot;&gt;&lt;summary class=&quot;h3&quot;&gt;4. Storage and Analysis &lt;/summary&gt;
&lt;p&gt;The code snippets stop here! The rest of the post covers key conventions, strategies, and factors for selecting the right data store, performing real-time analytics, and setting up alerts.&lt;/p&gt;
&lt;details open=&quot;&quot; class=&quot;text-container&quot;&gt;&lt;summary class=&quot;h4&quot;&gt;4.1. Datastore &lt;/summary&gt;
&lt;p&gt;When choosing the right database for telemetry data, it&apos;s crucial to consider several factors:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;b&gt;Read and Write Patterns&lt;/b&gt;: Understanding the frequency and volume of read and write operations is key. High write and read throughput require different database optimizations and consistencies.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;b&gt;Data Amplification&lt;/b&gt;: Be mindful of how the data volume might grow over time (+&lt;a href=&quot;https://en.wikipedia.org/wiki/Write_amplification&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;Write Amplification&lt;/a&gt;) and how the database handles this increase without significant performance degradation.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;b&gt;Cost&lt;/b&gt;: Evaluate the cost implications, including storage, processing, and any associated services.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;b&gt;Analytics Use Cases&lt;/b&gt;: Determine whether the primary need is for real-time analytics, historical data analysis, or both.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;b&gt;Transactions&lt;/b&gt;: Consider the nature and complexity of transactions that will be performed. For example: Batch write transactions&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;b&gt;Read and Write Consistency&lt;/b&gt;: Decide on the level of consistency required for the application. For example, OLTP (Online Transaction Processing) systems prioritize consistency and transaction integrity, while OLAP (Online Analytical Processing) systems are optimized for complex queries and read-heavy workloads.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;🌵 &lt;a href=&quot;https://tikv.github.io/deep-dive-tikv/key-value-engine/B-Tree-vs-Log-Structured-Merge-Tree.html&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;LSM-Tree&lt;/a&gt; favors write-intensive applications.&lt;/p&gt;

&lt;hr class=&quot;hr&quot; /&gt;

&lt;p&gt;For example, to decide between row-based and columnar storage, or between OLTP (Online Transaction Processing), OLAP (Online Analytical Processing), and a hybrid approach:&lt;/p&gt;

&lt;img class=&quot;center-image-90&quot; src=&quot;./assets/posts/telemetry/storage-scan-direction.png&quot; /&gt;
&lt;p class=&quot;figure-header&quot;&gt;Figure 4: Row vs Columnar Storage&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;b&gt;Transactional and High Throughput Needs&lt;/b&gt;: For high write throughput and transactional batches (all or nothing), with queries needing wide-column family fetches and indexed queries within the partition, Cassandra/&lt;a href=&quot;https://www.scylladb.com/&quot; target=&quot;_blank&quot;&gt;ScyllaDB&lt;/a&gt; is better suited.&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;&lt;b&gt;Complex Analytical Queries&lt;/b&gt;: For more complex analytical queries, aggregations on specific columns, and machine learning models, data stores such as &lt;a href=&quot;https://clickhouse.com/&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;ClickHouse&lt;/a&gt; or &lt;a href=&quot;https://druid.apache.org/&quot; target=&quot;_blank&quot;&gt;Druid&lt;/a&gt; are more appropriate. Their optimized columnar storage and powerful query capabilities make them ideal for handling large-scale analytical tasks. Several others include VictoriaMetrics and InfluxDB (with an emphasis on time-series); closed-source: &lt;a href=&quot;https://www.snowflake.com/&quot; target=&quot;_blank&quot;&gt;Snowflake&lt;/a&gt;, &lt;a href=&quot;https://cloud.google.com/bigquery&quot; target=&quot;_blank&quot;&gt;BigQuery&lt;/a&gt;, and &lt;a href=&quot;https://aws.amazon.com/redshift/&quot; target=&quot;_blank&quot;&gt;Redshift&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;&lt;b&gt;Hybrid Approach&lt;/b&gt;: In scenarios requiring both fast write-heavy transactional processing and complex analytics, a common approach is to use Cassandra for real-time data ingestion and storage, and periodically perform ETL (Extract, Transform, Load) or CDC (Change Data Capture) processes to batch insert data into an OLAP DB for analytical processing. This leverages the strengths of both databases, ensuring efficient data handling and comprehensive analytical capabilities. Proper indexing and data modeling go without saying 🧐&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;🌵 &lt;a href=&quot;https://debezium.io/&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;Debezium&lt;/a&gt;: Distributed platform for change data capture (more on &lt;a href=&quot;/debezium-postgres-cdc&quot;&gt;previous post&lt;/a&gt;).&lt;/p&gt;

&lt;hr class=&quot;hr&quot; /&gt;

&lt;p&gt;Using an HTAP (Hybrid Transactional/Analytical Processing) database that&apos;s suitable for both transactional and analytical workloads is worth considering. Examples: &lt;a href=&quot;https://github.com/pingcap/tidb&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;TiDB&lt;/a&gt;, &lt;a href=&quot;https://www.timescale.com/&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;TimescaleDB&lt;/a&gt; (kind of).&lt;/p&gt;

&lt;p&gt;While you get some of the best from both worlds 🌎, you also inherit a few of the worst from each! &lt;br /&gt;Lucky for you, I have first-hand experience with it 🤭:&lt;/p&gt;
&lt;img src=&quot;./assets/posts/telemetry/of-both-worlds-h.png&quot; /&gt;
&lt;p class=&quot;figure-header&quot;&gt;Figure 5: Detailed comparison of OLTP, OLAP and HTAP&lt;/p&gt;

&lt;p&gt;&lt;b&gt;Analogy&lt;/b&gt;: Choosing the right database is like picking the perfect ride. Need pay-as-you-go flexibility? Grab a taxi. Tackling heavy-duty tasks? 🚜 Bring in the bulldozer. For everyday use, 🚗 a Toyota fits. Bringing a war tank to a community center is overkill. Sometimes, you need a fleet—a car for daily use, and a truck for heavy loads.&lt;/p&gt;

&lt;p&gt;☢️ &lt;a&gt;InfluxDB&lt;/a&gt;: Stagnant &lt;a href=&quot;https://github.com/influxdata/influxdb/graphs/contributors&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;contribution&lt;/a&gt; graph, &lt;a href=&quot;https://community.influxdata.com/t/is-flux-being-deprecated-with-influxdb-3-0/30992/4&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;Flux&lt;/a&gt; deprecation, but new &lt;a href=&quot;https://www.influxdata.com/benchmarks/&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;benchmarks&lt;/a&gt;!&lt;/p&gt;
&lt;/details&gt;

&lt;hr class=&quot;sub-hr&quot; /&gt;

&lt;details open=&quot;&quot; class=&quot;text-container&quot;&gt;&lt;summary class=&quot;h4&quot;&gt;4.2. Partition and Indexes&lt;/summary&gt;

&lt;p&gt;Without getting into too much detail, it&apos;s crucial to choose the right partitioning strategy (ex: range, list, hash) to ensure partitions don&apos;t bloat and to effectively support the primary read patterns (in this context, for example: client_id + region + first day of the month).&lt;/p&gt;
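&lt;p&gt;The composite key just mentioned can be sketched as a small helper (the &lt;code&gt;#&lt;/code&gt; separator and function name are my own conventions for illustration):&lt;/p&gt;

```python
# Build a composite partition key: client_id + region + first day of the
# event's month, so each partition holds one client/region/month slice.
from datetime import date, datetime

def partition_key(client_id: str, region: str, event_time: datetime) -> str:
    month_bucket = date(event_time.year, event_time.month, 1).isoformat()
    return f"{client_id}#{region}#{month_bucket}"

print(partition_key("client-42", "us-east-1", datetime(2024, 6, 5, 12, 30)))
# client-42#us-east-1#2024-06-01
```

Bucketing by month keeps any one partition from growing without bound while still letting the common "this client, this region, this month" reads hit a single partition.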

&lt;img class=&quot;center-image-0 center-image-70&quot; src=&quot;./assets/posts/telemetry/index-types.svg&quot; /&gt;
&lt;p class=&quot;figure-header&quot;&gt;Figure 6: Types of Indexes and Materialized view&lt;/p&gt;

&lt;p&gt;Following this, clustering columns and indexes help organize data within partitions to optimize range queries and sorting. Secondary indexes (local, within the partition, or global, across partitions) are valuable for query patterns where partition or primary keys don&apos;t apply. Materialized views precompute and store complex query results, speeding up reads for frequently accessed data.&lt;/p&gt;

&lt;img src=&quot;./assets/posts/telemetry/partition-view.svg&quot; /&gt;
&lt;p class=&quot;figure-header&quot;&gt;Figure 7: Partition Key, Clustering Keys, Local/Global Secondary Indexes and Materialized views&lt;/p&gt;

&lt;p&gt;&lt;b&gt;Multi-dimensional Index (Spatial/Spatio-temporal)&lt;/b&gt;: Indexes such as B+ trees and LSM trees are not designed to directly store higher-dimensional data. Spatial indexing uses structures like R-trees and &lt;a href=&quot;/hybrid-spatial-index-conclusion&quot; target=&quot;_blank&quot;&gt;Quad-trees&lt;/a&gt; and techniques like &lt;a href=&quot;/geohash&quot; target=&quot;_blank&quot;&gt;geohash&lt;/a&gt;. Space-filling curves like Z-order (Morton) and Hilbert curves interleave spatial and temporal dimensions, preserving locality and enabling efficient queries.&lt;/p&gt;

&lt;img class=&quot;center-image-0&quot; src=&quot;./assets/posts/spatial-index/spatial-index-types.svg&quot; /&gt; 
&lt;p class=&quot;figure-header&quot;&gt;Figure 8: Commonly Used: Types of Spatial Indexes&lt;/p&gt;

&lt;p&gt;🌵 &lt;a href=&quot;https://www.geomesa.org/documentation/stable/index.html&quot; target=&quot;_blank&quot;&gt;GeoMesa&lt;/a&gt;: spatio-temporal indexing on top of Accumulo, HBase, Redis, Kafka, PostGIS, and Cassandra. &lt;a href=&quot;https://www.geomesa.org/documentation/stable/user/datastores/index_overview.html&quot; target=&quot;_blank&quot;&gt;XZ-Ordering&lt;/a&gt;: Customizing Index Creation.&lt;/p&gt;

&lt;p&gt;&lt;img class=&quot;twemoji&quot; src=&quot;../assets/img/emoji/rocket.svg&quot; alt=&quot;&quot; /&gt; &lt;a href=&quot;/spatial-index-space-filling-curve&quot;&gt;Next blog&lt;/a&gt; post is all about spatial indexes!&lt;/p&gt;

&lt;/details&gt;

&lt;hr class=&quot;sub-hr&quot; /&gt;

&lt;details open=&quot;&quot; class=&quot;text-container&quot;&gt;&lt;summary class=&quot;h4&quot;&gt;4.3. Analytics and Alerts&lt;/summary&gt;

&lt;p&gt;Typically, analytics are performed as batch queries on bounded datasets of recorded events, requiring reruns to incorporate new data.&lt;/p&gt;

&lt;img class=&quot;center-image-65&quot; src=&quot;./assets/posts/telemetry/telemetry-analytics.svg&quot; /&gt;
&lt;p class=&quot;figure-header&quot;&gt;Figure 9: Analytics on Static, Relative and In-Motion Data&lt;/p&gt;

&lt;p&gt;In contrast, streaming queries ingest real-time event streams, continuously updating results as events are consumed, with outputs either written to an external database or maintained as internal state.&lt;/p&gt;

&lt;img src=&quot;./assets/posts/telemetry/usecases-analytics.svg&quot; /&gt;
&lt;p class=&quot;figure-header&quot;&gt;Figure 10: Batch Analytics vs Stream Analytics&lt;/p&gt;
&lt;div class=&quot;table-container&quot;&gt;
&lt;table style=&quot;width: 800px;&quot;&gt;
    &lt;tr&gt;
        &lt;td&gt;Feature&lt;/td&gt;
        &lt;td&gt;Batch Analytics&lt;/td&gt;
        &lt;td&gt;Stream Analytics&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;td&gt;Data Processing&lt;/td&gt;
        &lt;td&gt;Processes large volumes of stored data&lt;/td&gt;
        &lt;td&gt;Processes data in real-time as it arrives&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;td&gt;Result Latency&lt;/td&gt;
        &lt;td&gt;Produces results with some delay; near real-time results with frequent query runs&lt;/td&gt;
        &lt;td&gt;Provides immediate insights and actions&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;td&gt;Resource Efficiency&lt;/td&gt;
        &lt;td&gt;Requires querying the database often for necessary data&lt;/td&gt;
        &lt;td&gt;Continuously updates results in transient data stores without re-querying the database&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;td&gt;Typical Use&lt;/td&gt;
        &lt;td&gt;Ideal for historical analysis and periodic reporting&lt;/td&gt;
        &lt;td&gt;Best for real-time monitoring, alerting, and dynamic applications&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;td&gt;Complexity Handling&lt;/td&gt;
        &lt;td&gt;Can handle complex queries and computations&lt;/td&gt;
        &lt;td&gt;Less effective for highly complex queries&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;td&gt;Backfill&lt;/td&gt;
        &lt;td&gt;Easy to backfill historical data and re-run queries&lt;/td&gt;
        &lt;td&gt;Backfill can potentially introduce complexity&lt;/td&gt;
    &lt;/tr&gt;
&lt;/table&gt;
&lt;/div&gt;

&lt;p&gt;🌵 &lt;a href=&quot;/anomaly-detection-and-remediation&quot; target=&quot;_blank&quot;&gt;Anomaly Detection and Remediation&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;🌵 &lt;a href=&quot;https://docs.mindsdb.com/what-is-mindsdb&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;MindsDB&lt;/a&gt;: Connect Data Source, Configure AI Engine, Create AI Tables, Query for predictions and Automate workflows.&lt;/p&gt;
&lt;/details&gt;

&lt;/details&gt;

&lt;hr class=&quot;clear-hr&quot; /&gt;

&lt;details&gt;&lt;summary class=&quot;h3&quot;&gt;5. References&lt;/summary&gt;

&lt;pre style=&quot;height: 300px&quot;&gt;&lt;code&gt;1. Wikipedia, &quot;Telemetry,&quot; available: https://en.wikipedia.org/wiki/Telemetry. [Accessed: June 5, 2024].
2. Apache Cassandra, &quot;Cassandra,&quot; available: https://cassandra.apache.org. [Accessed: June 5, 2024].
3. VictoriaMetrics, &quot;VictoriaMetrics,&quot; available: https://victoriametrics.com. [Accessed: June 6, 2024].
4. Fluentd, &quot;Fluentd,&quot; available: https://www.fluentd.org. [Accessed: June 5, 2024].
5. Elasticsearch, &quot;Elasticsearch,&quot; available: https://www.elastic.co. [Accessed: June 5, 2024].
6. InfluxData, &quot;Telegraf,&quot; available: https://www.influxdata.com. [Accessed: June 5, 2024].
7. InfluxData, &quot;Telegraf Plugins,&quot; available: https://docs.influxdata.com. [Accessed: June 5, 2024].
8. GitHub, &quot;osx-cpu-temp,&quot; available: https://github.com/lavoiesl/osx-cpu-temp. [Accessed: June 5, 2024].
9. GitHub, &quot;Inlets,&quot; available: https://github.com/inlets/inlets. [Accessed: June 5, 2024].
10. InfluxData, &quot;Telegraf Installation,&quot; available: https://docs.influxdata.com/telegraf/v1. [Accessed: June 5, 2024].
11. InfluxData, &quot;InfluxDB Line Protocol,&quot; available: https://docs.influxdata.com/influxdb/v1.8/write_protocols/line_protocol. [Accessed: June 5, 2024].
12. GitHub, &quot;Telegraf Exec Plugin,&quot; available: https://github.com/influxdata/telegraf/tree/master/plugins/inputs/exec. [Accessed: June 5, 2024].
13. GitHub, &quot;Telegraf Output Plugins,&quot; available: https://github.com/influxdata/telegraf/tree/master/plugins/outputs. [Accessed: June 5, 2024].
14. Pallets Projects, &quot;Flask,&quot; available: https://flask.palletsprojects.com. [Accessed: June 5, 2024].
15. Apache Kafka, &quot;Kafka,&quot; available: https://kafka.apache.org. [Accessed: June 5, 2024].
16. Confluent, &quot;Kafka Partitions,&quot; available: https://www.confluent.io. [Accessed: June 5, 2024].
17. AWS, &quot;Amazon Kinesis,&quot; available: https://aws.amazon.com/kinesis. [Accessed: June 5, 2024].
18. Redpanda, &quot;Redpanda,&quot; available: https://redpanda.com. [Accessed: June 5, 2024].
19. Apache, &quot;Apache Flink,&quot; available: https://flink.apache.org. [Accessed: June 6, 2024].
20. GitHub, &quot;flink-python/pyflink/examples,&quot; available: https://github.com/apache/flink/tree/master/flink-python/pyflink/examples. [Accessed: June 6, 2024].
21. Apache, &quot;Flink Download,&quot; available: https://www.apache.org/dyn/closer.lua/flink. [Accessed: June 6, 2024].
22. Apache, &quot;Flink Kafka Connector,&quot; available: https://www.apache.org/dyn/closer.lua/flink/flink-connector-kafka-3.1.0. [Accessed: June 6, 2024].
23. Docker, &quot;Docker Installation,&quot; available: https://docs.docker.com. [Accessed: June 6, 2024].
24. Apache Kafka, &quot;Kafka CLI,&quot; available: https://kafka.apache.org/quickstart. [Accessed: June 6, 2024].
25. Homebrew, &quot;Kafka Installation,&quot; available: https://formulae.brew.sh/formula/kafka. [Accessed: June 6, 2024].
26. Apache, &quot;Apache Storm,&quot; available: https://storm.apache.org. [Accessed: June 6, 2024].
27. Apache, &quot;Apache Samza,&quot; available: https://samza.apache.org. [Accessed: June 6, 2024].
28. ClickHouse, &quot;ClickHouse,&quot; available: https://clickhouse.com. [Accessed: June 6, 2024].
29. InfluxData, &quot;InfluxDB Benchmarks,&quot; available: https://www.influxdata.com/benchmarks. [Accessed: June 6, 2024].
30. TiDB, &quot;TiDB,&quot; available: https://github.com/pingcap/tidb. [Accessed: June 6, 2024].
31. Timescale, &quot;TimescaleDB,&quot; available: https://www.timescale.com. [Accessed: June 6, 2024].
32. MindsDB, &quot;MindsDB,&quot; available: https://docs.mindsdb.com. [Accessed: June 6, 2024].
33. Wikipedia, &quot;Write Amplification,&quot; available: https://en.wikipedia.org/wiki/Write_amplification. [Accessed: June 6, 2024].
34. GitHub, &quot;LSM-Tree,&quot; available: https://tikv.github.io/deep-dive/introduction/theory/lsm-tree.html. [Accessed: June 6, 2024].
&lt;/code&gt;&lt;/pre&gt;

&lt;/details&gt;
&lt;p&gt;&lt;/p&gt;</content><author><name>Adesh Nalpet Adimurthy</name></author><category term="System Wisdom" /><category term="Realtime" /><category term="Database" /><summary type="html">0. Overview 0.1. Architecture A telemetry pipeline is a system that collects, ingests, processes, stores, and analyzes telemetry data (metrics, logs, traces) from various sources in real-time or near real-time to provide insights into the performance and health of applications and infrastructure.</summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://pyblog.xyz/assets/featured/webp/telemetry-pipeline.webp" /><media:content medium="image" url="https://pyblog.xyz/assets/featured/webp/telemetry-pipeline.webp" xmlns:media="http://search.yahoo.com/mrss/" /></entry></feed>