# Auto-Discovery of Communities from Network Data

## Prologue

The Data Science team at JASK got its start well before I had even heard of JASK; Joseph Zadeh (Director of Data Science) and I met in San Francisco when we worked together on papers for a User Behavior Analytics (UBA) workshop for SIG KDD 2016. The experience of working together on ideas around UBA, led to many other discussions on related and not-so-related topics such as scaling algorithms to run in near-real time on distributed systems, defining a minimal “cyber game” for training a deep reinforcement learner, and what the next steps would be in machine learning for cyber security. For both Joseph and I, the next step was to join the extremely talented team at JASK and contribute to building next-generation machine learning technology to automate triage in the SOC.

This post kicks off a blog series by our Data Science team. We hope it will foster a conversation in the community and capture some of the excitement we have found through whiteboarding ideas and prototyping algorithms; this is truly an exciting time during which machine learning is finally starting to make an real impact for the network security analyst. We will highlight algorithms and analytics we are using at JASK in production, interesting ideas from customer and team analysts in hopes of achieving this bigger goal.

## Introduction

One of the most challenging aspects of developing reliable and accurate machine learning algorithms for network security applications is the availability of training data, particularly labeled training data. One can still find recent academic research papers citing the use of these ancient data sets such as the Enron email corpus and the SIG-KDD Cup Challenge from 1999. Even when current data is available, it is rarely labeled. Labels can be seen as providing answers to question within the data such as: Which IP’s are servers? What are normal traffic patterns? Which devices are BYOD? Labeled data is critical for training and testing so that models will be able to learn structure implied by the labels. Also, basic infrastructure information and nominal network activity patterns are critical for providing context and structure to both automated and human analyses of events.

One way to obtain this type of information that is both low-cost and low-impact is to auto-discover network communities and prominent nodes from flow data. We looked at two algorithms: PageRank (https://en.wikipedia.org/wiki/PageRank) and a probabilistic matrix factorization model described here (http://ideal.ece.utexas.edu/pubs/pdf/2015/kdd%20wksp%202015%20acsz15.pdf).

## PageRank

Briefly, PageRank is a variant of the Eigenvector centrality method of analyzing linked networks. It was originally, and famously, formulated to find popular nodes in referential networks such as the web (see https://en.wikipedia.org/wiki/PageRank for an overview including early use in search results at Google.) Rather than using a simple metric, such as number of links or centrality, PageRank factors the reputation of nodes so that, for instance, fewer links from more reputable nodes will count more than many links from low-ranking ones. Thus, PageRank is calculated iteratively, although there are algebraic methods.

Here, we’ve used PageRank on a network of connected computer systems, using flows to define “links.” Highly-ranked devices will be “popular” systems in the network and are likely to be traffic aggregation points, such as proxies.

## Matrix Factorization

Networks are often represented, for the purposes of mathematical modeling, as an association matrix.

*You can see from an actual association matrix that since most nodes do not communicate with each other, the data is highly sparse, or most cells in the matrix are zero. This means that typical methods for working with matrices (that often require inversion) do not work at all.*

This is a square matrix (can be thought of as a data table) with the same set of row indices and column indices. Each row corresponds to a source IP, each column to a destination IP, and cell values correspond to the number of connections (or packets, or bytes) between source and destination. The diagonal is typically all “0”s since self-connections are not considered, nor observed. We noted that PageRank is a variant of Eigenvector centrality, so although implementation-dependent, PageRank also uses an association matrix input format. Association matrices often make three simplifying assumptions:

- Binary connection: there is either a connection (communication) between two nodes, or there is not
- Symmetric connections: the direction of communication is not important, so the upper-right half of the matrix is identical to the lower-left half
- Homogeneous edges or connections: communication or connections are the same in all cases, or at least irrelevant to the analysis

These assumptions are made for abstracting the problem into a numerical model, and effectively to simplify the mathematical analysis. However, when working with network flow data, none of these assumptions hold.

- Connections between hosts are measured based on the number of sessions, the number of packets, the number of bytes, etc.
- Most connections are highly asymmetric, due to client-server configuration, or other topologies
- At a minimum, level 4 and 7 information (port and application) provide valuable labels to the actual heterogeneous network

The simplest matrix factorization model is non-negative matrix factorization, which assumes that we can factor the observable association matrix, Y, which is typically highly sparse, into a more dense factor matrix, . Note that the factor matrix, is multiplied by its transpose to compose the original observation matrix, as shown in the figure below. The reason is called a factor matrix is that each column will represent a “group” in the network, and the vector of ips in that group with the highest weight will be linked by their connectivity in the original association matrix.

Such matrix factorization models work satisfactorily, however, they require an initial specification of the anticipated number of groups (factors), K. In addition, once trained, the model is rigid with respect to new data. The straightforward way to relax these conditions is to turn this into a probabilistic model. The most widely used probabilistic model of this type is the mixed membership stochastic blockmodel (MMSB) (http://www.jmlr.org/papers/volume9/airoldi08a/airoldi08a.pdf). However, our experiments with using MMSB to model communities of network activity yield poor results. One reason may be that the generative probabilistic model assumed in MMSB, which holds that community membership, is drawn from a multinomial distribution (under the assumption that each node in the network has a probability of belonging to any of the active communities in the network). Although this sounds superficially plausible, it requires that each node be a member of all communities at some non-zero level, which does not match intuition nor descriptive statistics. Instead, we can take a different approach and recognize that since the association matrix is composed of *counts* of interactions between nodes, we can model this data using a Poisson process (a statistical definition we sometimes use for modeling arrival times or discontinuous behaviors) , but this makes both the mathematical model and the inverse inference of the model parameters from data much more challenging. The below figure illustrates a model we helped to develop that extends matrix factorization into a Bayesian framework, where factors are drawn from a Poisson process, using a Gamma (conjugate) prior. The mathematical details get deep here; if you’re interested beyond this point, I recommend you check out one of the academic references published on this work (http://ideal.ece.utexas.edu/pubs/pdf/2015/actz15.pdf, http://ideal.ece.utexas.edu/pubs/pdf/2015/kdd%20wksp%202015%20acsz15.pdf)

Although this looks similar to the basic matrix factorization model, inferring the factors, vectors in , is far more complicated. See the references for some of the details. This model has a couple of advantages:

- Dynamic discovery of group structures: K does not need to be specified in advance, the model will automatically find the number of communities in the network based on the data. This is called a non-parametric model.
- Both binary association matrices and count-based association matrices can be modeled. This addressed frequent problematic assumption #1 in network models (mentioned above).

## Results

The following graphs show node weights for PageRank and color-coded communities inferred by the factorization model, respectively. Recall that a node is just a system in our model and so the most prominent nodes detected by PageRank were also found to be key community nodes by the factor model. The factor model identified 5 prominent interaction communities. We investigated the activity of key nodes in the two strongest communities, one of which overlapped strongly with PageRank strength. Interestingly, all these nodes were acting as http proxies. It is unclear, on the basis of this initial investigation, if the other communities on this network also exhibit this pattern.

Fig 1. Nodes are sized in this figure based on PageRank ranking. Node are colored based on the community detected model. Node the two prominent PageRank nodes in the “orange” and “blue” communities. (These PageRank results were produced by Scott Woods from our Engineering team, during our last Hackathon.)

Fig 2. The primary communities inferred by the matrix factorization model are color coded. These figures were produced with the Python networkx and matplotlib libraries..

As mentioned at the outset, focus is often placed on training machine learning algorithms for detection and classification, but the first step is to orient algorithms by preparing data for training. Skipping this step risks under-conditioning, which results in poor metrics (high false-positives, etc.). Auto-detecting baseline network activity can provide one way to label and condition collected data so that it is not just dumped into a single giant “bin”, but rather subdivided into bins that are useful in a conditional probability model and meaningful to analysts. It’s this last aspect that is perhaps most valuable, for providing real context to alerts and investigations.

Many promising data mining and machine learning models fail to be useful in practice in part because of a failure to solve real-world problems in a reliable way. Consider the problem of the SOC analyst that must investigate every alert in the context of substantial local configuration and domain knowledge. Training and tuning detection and alert analytics and providing context to analysts using a data-driven model of network topology and nominal activity, is but one piece of the puzzle.

## Conclusion

Since JASK can run computations on live and historical data in the cloud, it’s straightforward to prototype algorithm and analytic ideas, whether as a customer or a company analyst. In this case, I was motivated by our recent hackathon to see if community detection methods could learn nominal activity patterns on a customer network. PageRank was able to identify a small number of key nodes. A more sophisticated Poisson matrix factorization model not only identified these same nodes but also several other similar nodes and the related activity networks. Extensions to the factorization model, like including protocol information (link heterogeneity) and allowing for directed communications (an asymmetric association matrix) are likely to further improve the quality and reliability of the inferred network communities.