We briefly introduced IP Similarity previously, but now we want to dive deep and show how we made this idea a reality. 

The Goal

The first goal of IP Similarity is to encode a GreyNoise record as a numerical feature vector. This is just an array of numbers that represents all of the data we have in a GreyNoise record.

Figure 1: Record to Feature Vector

This representation is extremely useful for machine learning and any numerical analysis. From this point we can quantitatively measure how far away two records are, cluster groups of records together, and build all sorts of classifiers. This is the ground floor basis for applying machine learning to GreyNoise data.

The Nitty Gritty

But getting there is hard. Our records contain a vast amount of unstructured and semi-structured textual data. User-Agents can be nearly anything you want, from

Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko)

to

Anarchy99

Web paths can be as simple as

/

or complicated like

/${(#a=@org.apache.commons.io.IOUtils@toString(@java.lang.Runtime@getRuntime().exec("whoami").getInputStream(),"utf-8")).(@com.opensymphony.webwork.ServletActionContext@getResponse().setHeader("X-Cmd-Response",#a))}/

Ports can be any or all of the 65,535 available values. The list goes on.

In order to turn this complex multi-modal data into a fixed-size numerical feature vector, we employ a few tricks, primarily tokenization and “the hashing trick”.

Books could be (and have been) written on tokenization, but for our purposes we can implement a simple regex.

tokens = re.sub(r'[^\w\s]', ' ', text)

This matches every character that is not a word character (letters, digits, underscore) or whitespace and replaces it with a space. We can then lowercase the string and split it on whitespace. This turns our

Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko)

into the list of items

['mozilla', '5', '0', 'x11', 'linux', 'x86_64', 'applewebkit', '537', '36', 'khtml', 'like', 'gecko']
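As a minimal sketch, the tokenization step can be wrapped in a small function. The name `tokenize` is chosen here for illustration; the regex is the one shown above, followed by lowercasing and a whitespace split.

```python
import re


def tokenize(text: str) -> list[str]:
    """Replace non-word, non-whitespace characters with spaces, lowercase, and split."""
    cleaned = re.sub(r'[^\w\s]', ' ', text)
    return cleaned.lower().split()


user_agent = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko)"
print(tokenize(user_agent))
```

Because `\w` includes the underscore, `x86_64` survives as a single token, while a bare `/` web path produces an empty token list.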

Once we have more consistent tokens, we can put them all into a fixed bucket with the hashing trick. This works as follows:

  1. Create a zero vector of the size you want. E.g. a size 16 vector would be [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0].
  2. Take your text, hash it, and take the result modulo the size of your vector:

import hashlib

bucket_size = 16
text = 'mozilla'
hash_index = int(hashlib.sha1(text.encode("utf-8")).hexdigest(), 16) % bucket_size

  3. Insert a 1 (or some other value as you choose, perhaps scaled based on the number of items you’re indexing) into the hash_index position. So ‘mozilla’ would get inserted into position 9 (counting from zero) of the vector, resulting in [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0].
  4. Continue with all of the items you want to hash into that vector.
  5. Note: For our use case, we scale the value inserted into the vector based on the number of items we are indexing. If two items land in the same position, their values are added together.
  6. Finally, the list of tokens ['mozilla', '5', '0', 'x11', 'linux', 'x86_64', 'applewebkit', '537', '36', 'khtml', 'like', 'gecko'] would get hashed to [0, 0.0833, 0.0833, 0.1667, 0.1667, 0.0833, 0, 0.0833, 0, 0.0833, 0.0833, 0.0833, 0.0833, 0, 0, 0]. Each of the 12 tokens contributes 1/12 ≈ 0.0833, and the two 0.1667 entries are positions where two tokens collided.
  7. For higher fidelity, you can increase the bucket size from 16 to a larger number.
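The steps above can be sketched as a single function. This is an illustrative version, not the production code: the name `hash_tokens` and the choice to scale each token by 1 / (number of tokens) are assumptions based on the description in the list.

```python
import hashlib


def hash_tokens(tokens: list[str], bucket_size: int = 16) -> list[float]:
    """Hash tokens into a fixed-size vector, scaling by the number of tokens."""
    vector = [0.0] * bucket_size
    if not tokens:
        return vector  # nothing to index, keep the zero vector
    weight = 1.0 / len(tokens)  # assumed scaling: each token contributes equally
    for token in tokens:
        digest = hashlib.sha1(token.encode("utf-8")).hexdigest()
        index = int(digest, 16) % bucket_size
        vector[index] += weight  # collisions are added together
    return vector


tokens = ['mozilla', '5', '0', 'x11', 'linux', 'x86_64',
          'applewebkit', '537', '36', 'khtml', 'like', 'gecko']
vector = hash_tokens(tokens)
print([round(value, 4) for value in vector])
```

With this scaling the entries of a non-empty vector always sum to 1, so a record with many tokens does not automatically outweigh a record with few.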

Now that we have a method to turn unbounded text into a fixed numerical vector, we can do this with many more of our fields and concatenate the results, along with boolean variables (e.g. is this IP coming from a VPN? T/F), to create one long feature vector to represent each record. Success!
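A rough sketch of that concatenation follows. The record layout and the field names (`user_agents`, `web_paths`, `vpn`, `tor`) are hypothetical and chosen only for illustration; they are not the actual GreyNoise schema, and the bucket size of 16 is likewise an assumption.

```python
import hashlib
import re

BUCKET_SIZE = 16  # illustrative size, not the production value


def hash_text_field(values, bucket_size=BUCKET_SIZE):
    """Tokenize every string in a field and hash the tokens into one vector."""
    tokens = []
    for value in values:
        tokens.extend(re.sub(r'[^\w\s]', ' ', value).lower().split())
    vector = [0.0] * bucket_size
    for token in tokens:
        digest = hashlib.sha1(token.encode("utf-8")).hexdigest()
        vector[int(digest, 16) % bucket_size] += 1.0 / len(tokens)
    return vector


def build_feature_vector(record):
    """Concatenate hashed text fields and boolean flags (hypothetical field names)."""
    features = []
    features += hash_text_field(record.get("user_agents", []))
    features += hash_text_field(record.get("web_paths", []))
    features.append(1.0 if record.get("vpn") else 0.0)
    features.append(1.0 if record.get("tor") else 0.0)
    return features


record = {"user_agents": ["Anarchy99"], "web_paths": ["/"], "vpn": True, "tor": False}
vector = build_feature_vector(record)
print(len(vector))  # 16 + 16 + 1 + 1 = 34
```

Every record produces a vector of the same length regardless of how much text it contains, which is what makes records directly comparable.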

Figure 2: Base Feature Vector

But weight, there’s more!

Not all features have equal importance, so we need to create weights so some features have more significance than others in the analysis. 

For IP Similarity we use a combination of two kinds of features: relatively static IP-centric features (things we can derive just from knowing which IP the traffic is coming from, or from its connection metadata) and more dynamic behavioral features (things we see inside the traffic from that IP). These features are:

IP Centric

  • VPN
  • Tor
  • rDNS
  • OS
  • JA3 Hash
  • HASSH

Behavioral

  • Bot
  • Spoofable
  • Web Paths
  • User-Agents
  • Mass scanner
  • Ports

Features like JA3 can be less important, while features like Web Paths can be a strong indicator of similarity between records.

We are curating an ever growing collection of pairs of GreyNoise records that we think are good matches and bad matches. With these, we can randomly go through our collection, compare the feature vectors for the records and adjust the weights to make those matches (or non-matches) better and better. This creates a weight vector that we can use to adjust our base feature vector.
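The exact update rule is not described here, so the following is only one possible way to sketch the idea, under stated assumptions: each feature gets a non-negative weight, matching pairs pull weights down on the dimensions where they differ, and non-matching pairs push weights up on the dimensions where they differ. The function name `adjust_weights`, the learning rate, and the toy data are all hypothetical.

```python
import random


def adjust_weights(pairs, n_features, learning_rate=0.1, epochs=20, seed=0):
    """Toy weight fitting from labeled pairs of (vector_a, vector_b, is_match).

    Uses the gradient of the squared weighted distance,
    d^2 = sum((w_i * (a_i - b_i))^2), stepping down for matches
    and up for non-matches. An assumed rule, not the production one.
    """
    rng = random.Random(seed)
    weights = [1.0] * n_features
    pairs = list(pairs)
    for _ in range(epochs):
        rng.shuffle(pairs)  # go through the collection in random order
        for a, b, is_match in pairs:
            direction = -1.0 if is_match else 1.0
            for i in range(n_features):
                gradient = 2.0 * weights[i] * (a[i] - b[i]) ** 2
                weights[i] = max(0.0, weights[i] + direction * learning_rate * gradient)
    return weights


labeled_pairs = [
    ([0.0, 1.0], [1.0, 1.0], True),   # good match that differs only in feature 0
    ([0.0, 0.0], [0.0, 1.0], False),  # bad match that differs only in feature 1
]
weights = adjust_weights(labeled_pairs, n_features=2)
print(weights)
```

In this toy run feature 0 ends up with a small weight and feature 1 with a large one. A real system would also need a margin or normalization so that weights do not grow without bound.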

Figure 3: Weight Vector

The Final Vector

We take our GreyNoise record, extract the features we want to use, apply the hashing trick or other numerical logic, apply our weights, and we are left with a final vector that is ready to be used in comparison and machine learning. For example:

Figure 4: Final Feature Vector Calculation
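As a minimal sketch, and assuming the weights are applied by element-wise multiplication (the text above does not spell out the operation), the final step could look like this. The function name and the numbers are made up for illustration.

```python
def apply_weights(base_vector, weight_vector):
    """Scale each feature by its weight (assumed element-wise multiplication)."""
    if len(base_vector) != len(weight_vector):
        raise ValueError("base and weight vectors must have the same length")
    return [value * weight for value, weight in zip(base_vector, weight_vector)]


base_vector = [0.5, 0.5, 1.0, 0.0]    # e.g. two hashed buckets and two boolean flags
weight_vector = [2.0, 2.0, 0.5, 0.5]  # hypothetical weights
final_vector = apply_weights(base_vector, weight_vector)
print(final_vector)  # [1.0, 1.0, 0.5, 0.0]
```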

The Results

With this new representation we can do a lot of ML, but our first use case is IP Similarity, which answers the following question:

Given an IP address and all that GreyNoise knows about it, show me all other IPs GreyNoise has seen that have similar characteristics and behaviors.

To do this we compare two feature vectors by calculating the L2 norm of their difference, also known as the Euclidean distance. Just like in geometry where you use the Pythagorean theorem, a² + b² = c² or c = sqrt(a² + b²), the L2 norm extends that to a higher-dimensional space, so it is simply a measure of how far two points/vectors are from each other. If the distance is small, the feature vectors are close and thus very similar. If it is large, the feature vectors are far from each other and thus dissimilar.
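A small sketch of the distance calculation and a brute-force similarity lookup follows. The helper names and the example IPs (taken from the reserved documentation range) and vectors are invented for illustration; a production system would use an index rather than a linear scan.

```python
import math


def l2_distance(a, b):
    """Euclidean distance: the L2 norm of the difference between two vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def most_similar(query_ip, vectors, top_n=3):
    """Return up to top_n (distance, ip) tuples, closest first."""
    query = vectors[query_ip]
    scored = [(l2_distance(query, vector), ip)
              for ip, vector in vectors.items() if ip != query_ip]
    return sorted(scored)[:top_n]


vectors = {
    "192.0.2.1": [0.0, 0.0],
    "192.0.2.2": [3.0, 4.0],
    "192.0.2.3": [0.0, 1.0],
}
print(l2_distance(vectors["192.0.2.1"], vectors["192.0.2.2"]))  # 5.0
print(most_similar("192.0.2.1", vectors))
```

The first call is the familiar 3-4-5 right triangle; the second ranks the remaining IPs from most to least similar.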

We put all of this feature vector information into Elasticsearch alongside our GreyNoise records and voilà, we can now find any GreyNoise records that are similar to any other. Some of the use cases are:

We can take a single IP from our friends at Shodan.io, https://viz.greynoise.io/ip-similarity/89.248.172.16, and return 21 (at the time of writing) other IPs from Shodan.

Figure 5: IP Similarity of 89.248.172.16 as shown in GreyNoise.

And we can compare the IPs side by side to find out why they were scored as similar.

Figure 6: IP Similarity Details 

While we have an Actor tag for Shodan which allows us to see that all of these are correct, IP Similarity would have picked these out even if they were not tagged by GreyNoise.

We can take an IP tagged with NETGEAR DGN COMMAND EXECUTION, https://viz.greynoise.io/ip-similarity/182.126.118.174, and return many other IPs that could be part of that attack.

Figure 7: IP Similarity of 182.126.118.174 as shown in GreyNoise. 

We can see they share OS, Ports, Web Paths, and rDNS.

We can take an IP from another prolific scanner like ReCyber, https://viz.greynoise.io/ip-similarity/89.248.165.64, and return a large number of IPs, many from ReCyber, but others that simply act like ReCyber.

Figure 8: IP Similarity of 89.248.165.64 as shown in GreyNoise. 

The End

Ultimately, we hope this tool is insanely useful to you and that you’ve developed a better understanding of how it works under the hood. Be on the lookout for more features, machine learning applications, and explanations! To try IP Similarity for yourself, sign up for a free trial or request a demo to learn more.

(*Create a free GreyNoise account to begin your enterprise trial. Activation button is on your Account Plan Details page.)

This article is a summary of the full, in-depth version on the GreyNoise Labs blog.