Creating custom indexes to data with Swift Sets

Creating custom indexes to data with Swift Sets

Introduction & Background

I have been working on a flow visualization tool for a while now. It is part of a project where a datacenter has evolved over a long time and has grown into a complex myriad of connections, applications and “solutions” to make things work. Goal of the project is to visualize certain environments and how they communicate with other services. Surely we’ve looked at Tetration as a Service and other visualization tools. The business case wasn’t viable enough for commercial solutions, so decided to go with Elastiflow as a basis.

Personal note: It’s been way overdue for a blog post. Even though Tom (@networkingnerd) mentioned that many bloggers have found it hard to find the time and drive to write blog posts during COVID19, WFH and everything that comes with it, I have been frustrated over myself I haven’t found the drive to write for way too long. 

It works well for forensics but for the original purpose (visualization, research & analysis) it was lacking a few bits. It was thus decided that a custom-made application that takes the samples from ElastiFlow and work from there was the most viable road. And so I started writing a Mac Application, using Core Data and GraphViz to fetch samples, process them (dedup), and were able to visualize flows. 

Fast forward to just before summer break; the flows in the database have grown to millions of records, which led to huge delays (300s) for the visualization of the flows on a larger dataset of server IPs. The reason for that is twofold, database queries are always more expensive than in-memory searches and the way I organized the data (in-memory) was inefficient. I’ve solved the latter by optimizing the code structure and algorithm and reduced the visualization time from 300s to 25s per visualization iteration. But the first cause of delays (roundtrips to the database) isn’t solved with that optimization and it bit me in ingesting new samples (as that also checks the database for duplicate samples). The solution to that is to use an in-memory database (this is quite common; Cisco ISE has an in-memory database (I think Cisco Prime Infrastructure does too), and big data platforms like SAP/HANA). After a quick research for in-memory Swift based databases, I decided to build my own Swift based memory manager for just the flows.

Sets in Swift, Hashable, and my idea

Swift has a number of ways to create collections of data elements (classes or structures). The most common approach is an array, which is an ordered collection. It is very easy to use an array, as the example below shows.

Swift Arrays are not fixed in size, and it’s easy to add or remove elements to that array. Ideal for data manipulation and search. But.. the search is linear and takes, on average, half of the iterations to find the entry, but worst case it will take n-iterations (where n is the size of the array) to find the record you’re looking for. 

So to optimize the search speed, we need to reduce the number of iterations needed to find a specific flow. This is where the Set class comes into play. The Swift Set class is used to store unique data elements. The uniqueness is determined by the hash of a unique identifier (the key). Hashing is a mathematical method that creates a unique consistent value for a piece of data. It’s a common principle in not only encryption (integrity validation), but also in database search optimizations.

Instead of comparing all attributes in a record (let’s say a flow), the Set class checks the unique identifier if the element is inside the collection:

Using the Set swift class allowed me to do that optimization from 300seconds to 25seconds for visualization. So why not use the Set Class to store the flows. So I can definitely use that in the in-memory database of flows. In my application I use four categories of queries / searches on those flows:

  1. Full flows: Search for all attributes of a flow (protocol, source IP and Port, destination IP and Port) 
  2. Similar flows: Search for all flows for the same protocol, same source IP and to a specific destination IP and Port
  3. IP conversations: Search for flows only based on source and destination IP
  4. Server applications: Search for all flows matching a specifc destination IP and Port

Effectively these four types of searches are similar to four indices in a database, and I should be able to implement them inside Swift. And this is where the power of class inheritance and protocols come into play inside the Swift language. The Set class requires that the element inside the collection conforms to the Hashable and Equatable protocol (Hashable inherits from Equatable). It means that every element you want to store in a Set needs to minimally have the following functions:

func hash(into: inout Hasher)
static func  == (Self,Self) -> Bool

All common types in Swift (String, Int, Date) have a default implementation of these methods, so you can compare types ( == keyword is actually a function ) and use the Set class in your code. The default hash and == function implementations take all attributes and use them for comparison. 

But it is possible to override that behavior by specifically implementing the Hashable protocol and write your own implementation (and thus specify which properties are used for uniqueness)

The problem

So I wrote the following code to create the indices:

// Base class for storing network flows in memory
class FMMFlowEntry : Hashable, CustomStringConvertible {
    var proto : String = ""
    var srcIP : String = ""
    var srcPort : Int = 0
    var dstIP : String = ""
    var dstPort : Int = 0
    var firstSeen : Date?
    var lastSeen : Date?
    
    static func == (lhs: FMMFlowEntry, rhs: FMMFlowEntry) -> Bool {
        let response = dstIP == rhs.dstIP && proto == proto && dstPort == dstPort && srcIP == srcIP && srcPort == rhs.srcPort
        return response
    }
        
    public func hash(into hasher: inout Hasher) {
        hasher.combine(proto)
        hasher.combine(srcIP)
        hasher.combine(srcPort )
        hasher.combine(dstIP)
        hasher.combine(dstPort)
    }
    
    var description : String {
        let response : String = "\(proto) \(srcIP):\(srcPort) -> \(dstIP):\(dstPort)"
        return response
    }
    
    init(proto: String, srcIP: String, srcPort: Int, dstIP : String, dstPort: Int) {
        self.proto = proto
        self.srcIP = srcIP
        self.srcPort = srcPort
        self.dstIP = dstIP
        self.dstPort = dstPort
    }
    
    
    init() {
        
    }
}

// Class that represents a flow record based on source IP address and destination IP address
class FMMIPFlow  : FMMFlowEntry  {
    var refCount : Int = 1
    init(from: FMMFlowEntry) {
        super.init(proto : from.proto, srcIP : from.srcIP, srcPort : from.srcPort, dstIP : from.dstIP, dstPort : from.dstPort)

    }

    override public func hash(into hasher: inout Hasher) {
        hasher.combine(srcIP)
        hasher.combine(dstIP)
    }
    
    static func == (lhs: FMMIPFlow, rhs: FMMIPFlow) -> Bool {
        let response = lhs.dstIP == rhs.dstIP && lhs.srcIP == rhs.srcIP
        return response
    }
    
    override var description : String {
        let response : String = "\(srcIP) -> \(dstIP) [\(refCount)]"
        return response
    }
}

In the above code there is the base class (FMMFlowEntry) that has all necessary attributes. The hash function specifies which properties of the flow are used to determine if it is unique). The == function also uses those attributes to check for equality.

The class FMMIPFlow inherits every method and property from FMMFlowEntry. It then overrides the hash function that only the srcIP and dstIP attributes are used for hashing and comparison. 

This should mean that if I’d compile and run the code below, the variable uniqueIPFlows would only have one entry (as the srcIP and dstIP are the same for the two added flows). When I run it in Playground, this is the result.

That means that both flow1 and flow4 are added to the Set. I was not expecting that, I created a custom hash function and so it should not add it, right? For debugging I ran it again, printing the hash value too:

So even while they have the same hash value (which changes every time I run the code), both flows are still added to the index. That is not the purpose, but why?

Cause of the problem

The answer to that took me quite a while and would like to share it here. The problem lies in the comparison function. It is defined as a static function (conform to the protocol). The static keyword also means final and cannot be overridden. So even if I have a new static == function in the subclass, it will never be called because static implies final (and the compiler will just ignore that other method). We can validate that by adding a print in our code:

static func == (lhs: FMMFlowEntry, rhs: FMMFlowEntry) -> Bool {
    print("FMMFlowEntry == func called")
    let response = lhs.dstIP == rhs.dstIP && lhs.proto == rhs.proto &&
         lhs.dstPort == rhs.dstPort && lhs.srcIP == rhs.srcIP &&
         lhs.srcPort == rhs.srcPort
    return response
}

And rerun the code

So the hash is used for searching, but to determine if an element needs to be added to the Set, there is an equality check. And because the two flows are different (on source port), both are added. 

The solution

Now that we know the cause, we can create a solution to work around that blockade. Instead of performing the comparison check inside the static function, I can create an instance function that is called on the lhs parameter and validates against the rhs, so that I am able to override that function in a subclass. The code in FMMFlowEntry is changed:

static func == (lhs: FMMFlowEntry, rhs: FMMFlowEntry) -> Bool {
    return lhs.equals(to: rhs)
}
    
func equals(to rhs: FMMFlowEntry) -> Bool {
    let response = dstIP == rhs.dstIP && proto == rhs.proto && 
        dstPort == rhs.dstPort && srcIP == rhs.srcIP && 
        srcPort == rhs.srcPort
    return response
}

and the code inside the FMMIPFlow Class is changed to:

override func equals(to rhs: FMMFlowEntry) -> Bool {
    guard let rhs = rhs as? FMMIPFlow else { return false }
    let response = self.dstIP == rhs.dstIP && self.srcIP == rhs.srcIP
    return response
}

Now when we try to run the following code, the static compares in the base class should call the overriden equals function of the instance (FMMIPFlow in this case) and will only check for the srcIP and dstIP property, which should lead to only one entry in the Set:

Succes! Now I can create indexes / search types very quickly by just subclassing the base record and override the equals function for that specific index. Which allows me to to code the indexes I need with only a few lines of code and still be able to work with the Set class and have speed in search queries. One step closer to my in-memory database that is capable of storing millions of flows.

 

The key lesson here is that the keyword static in Swift means that it is a final method and cannot be overridden in a subclass. The method around it is to call an Instance method in the base class and override this in the subclass.

The Swift Playground used in this post is available on github

Creating custom indexes to data with Swift Sets

Swift & Network programmability, a good combo? An introduction.

Swift is commonly known by iOS and MacOSX Software developers as Apple introduced the language in 2014 for MacOSX, iOS and Linux application development.
In my role as software engineer I’ve used different programming languages to build small tools, solutions or prototypes. For network programmability I’ve used Java as my primary language. I have my reasons, which I might share later in another post.

Network Programmability on the web pretty much evolves around Python. Is Swift mature enough and powerfull enough to be used for programming the network? Time to write up my experiences in a blog series. The first post is an introduction to Swift. (more…)