### Keeping the “Science” in “Data Science”: Calibrating Algorithms for Threat Detection

As attack payloads and methods have become more easily adaptable and customizable to individual campaigns and targets (e.g. polymorphic malware, customized payloads, credential theft, etc.), threat detection systems have migrated from using static, predefined rules (e.g. snort, regex) to more data-driven detectors (UBA, honey net salting, etc.). Although often more effective than older methods at detecting complex and subtle attacks, these newer techniques can be accompanied by a greater level of uncertainty. Fortunately, there are statistical methods we can use to improve performance and to clarify results for both product developers and users.

For many years, threat detection was based on *signatures* which can be thought of as *assertions* about data, based on previously-seen attacks or on rule violations. We often think of hash matching or regex rules as being the main type of signature detection, but in fact, many types of behavioral detection are also signature detection. For example, detecting outbound traffic volumes over some predefined threshold, or detecting any use of particular protocols on an internal network are common unusual behaviors that are flagged, but they are also *assertions* about the data. Signatures in general can be circumvented by a knowledgeable attacker, particularly an insider, and behavior-based signatures are often brittle or require manual tuning by IT staff. (For example, just how much external volume is unusual?)

**Adaptive Detection**

The advantage of data science-driven detection is the ability to adapt detection for each installation environment, and in some cases, each system. However, developers and customers need to become comfortable with the inherently *statistical* nature of many of these methods. Unlike *assertions*, model-based *deductions* are not binary (threat or non-threat) but rather yield a *probability* (between 0 and 1) that a particular event or datum is a threat. Depending on the model, there may also be a *confidence* in the probability estimate. These details are often hidden by using thresholds. For instance, if the threat probability is > 0.95 then it’s a threat, otherwise (<= 0.95) non-threat.

Of course, signature-based detections produce false alerts, but given the use of thresholding and the desire to not miss true alerts, false alerts can be more common with probabilistic detection. One common method to modulate the true alert / false alert set is to maintain a set of whitelist or blacklist items related to the detection method. For instance, a blacklist of known malicious domains can be maintained to help improve the true alert set and a set of known safe domains can be maintained to help improve the false alert set. Unfortunately, these lists are necessarily incomplete and rarely maintained over time. Also, as with other assertion based methods, they are inherently *reactive*, and do not accommodate unknown domains in advance.

**Model Metrics**

In order to discuss more data-driven methods to improve performance, let’s consider our metrics.

Figure 1. AUC examples. Black is no better than random; blue is better than red.

Other metrics like F1 and Precision-Recall can be used, and at JASK we normally calculate both AUC and F1, but when we can calculate AUC, it is the most useful for model comparison and for selecting an *operating point*, or a trade off between TPR and FPR.

**Model Training and Calibration**

When we develop a detection model, we will normally try to gather data from as many environments as possible for training. Then, as we move a model into production, not only for a new model, but each time we do an installation, we want to make sure the model is working as well as it can be. However, instead of using an assertion-based tuning method, we want to continue to use a model-driven method. Depending on the model being used, we have several options.

The most straightforward method for model calibration is to adjust the training set, based on local data characteristics, and re-train the model. For batch training classification, this is a reliable approach. Models that are inherently streaming and maintain a latent or feature-space “state,” such as topic and community models, can be operated in a dynamic mode such that over time, as new data is ingested, the model will drift adaptively toward more relevant solutions. Active learning models facilitate direct feedback either from users (e.g. this was a good detection, that was not a good detection) or from an internal quality metric. Such “guided learning” or “reinforcement learning” models often learn more adaptive and robust representations of data, but require considerable more effort to develop.

**Example Calibration: Domain Generation Algorithm Detector**

One detection model improvement that does not fit well into the above categories, but instead requires a model extension, is n-gram subtypes for DGA or Domain Generation Algorithm detection. Attackers often use DGAs to create domains to use for command and control, beaconing, and other communication infrastructure use cases. We developed an artificial neural network (ANN) using Long-Short Term Memory (LSTM) for detecting algorithmically-generated domain names. Figure 2 shows a few typical examples from non-threat and threat cases. DGAs are often used for hosting malicious command and control servers, and other threat infrastructure.

Figure 2. Examples of threat and non-threat domains based on a Domain Generation Algorithm (DGA)

We found during some customer installations that a customer’s external traffic might involve a particular set of domains that for some reason were flagged as DGA by our baseline detector. As mentioned above, this sort of behavior is relatively common for probabilistic models. Instead of gathering a list of domains we happened to observe and whitelisting these in an adjunct process, external to the model, we gathered these domains (see Figure 3) and noted that many of them were characterized by dictionary-word n-grams. The most straightforward way to address this was to extend the model to check for dictionary words in n-grams in the candidate domains. This enabled the model to be robust to potential FPs from all domains in this category, not just the domains in the observed set.

Figure 3. False Positive DGA examples, with some egregious examples highlighted. Many in this category are composed of dictionary word n-grams.

The following figure shows an example of the DGA detection algorithm running in production on the JASK system.

Figure 4. Screen Capture from DGA detection in JASK Trident.

**Ongoing Monitoring and Metrics**

In order to maintain ongoing situational awareness of detection performance across all JASK installations, we have instrumented the product to include various logging and metrics into the usual system performance dashboards*. See Figure 4 below for an example of one such internal dashboard.* This provides data scientists, engineering, and DevOps with views over time, per installation, aggregate in time (hours, days) and aggregate in category and in customer segment (financial, data center, etc.) as attacks often cluster along these axes. This allows us to get ahead of model trends during operation, not just during initial calibration.

Figure 5. Internal performance monitoring dashboard, showing detection rates by category over time.

**Conclusion**

We encourage data science teams to look for principled, model-driven ways to measure and control the performance of analytics, rather than ad hoc external methods such as lists or separate processing steps. Retraining, dynamic / streaming models, and guided / reinforcement learning are all viable options. A little effort up front can lead to improved and more robust performance over the lifecycle of an engagement.

### Cueing Threat Hunters with Change Detection

Artificial Intelligence (AI) and various component tools such as machine learning (ML) are not intended to fully-automate threat mitigation and response, at least not in the current generation of technologies. Instead, AI and ML are beginning to provide a much greater degree of organization and prioritization for existing workflows. For example, the Threat Hunter ideally keys off a highly-targeted alert, perhaps one that is indicative of a specific threat actor or use of a particular tool. However, such targeted alerts are notoriously brittle, and this model fails for novel threat vectors.

In lieu of specially-crafted signatures for known or previously-seen attacks, or alerts that might fail to reliably fire in a particular network, how can we key an investigation into potentially malicious activity in which an attacker has already gained access, before it’s too late?

Nearly all “generic” or non-signature detections involve detecting a change in an observable field or a change in the level of activity on some observable field. This is called “changepoint detection” and is an area of probabilistic or statistical time series analysis. Some methods are more reliable than others. Some methods are “fooled” by regular changes in periodic activity - that is, they detect changes due to the normal ebb-and-flow of the work day or week.

In our last blog, we discussed methods on the highly-sophisticated end (neural networks) for learning nominal activity patterns in data. These and similar models require a lot of training data and can be challenging to deploy in production (we will have a follow-up on this subject.) Here we discuss how to apply much simpler statistical models that are relatively straightforward to deploy in production. However, as with most data science based models, the key is framing the problem and what might be called data logistics.

The following chart shows hourly samples of SMB file access from a real network. If you squint you can see a pattern of daily activity compressed near the x-axis. However, the spike one morning really stands out, as this was multiple orders of magnitude greater than the usual level of activity for this activity type for this network.

What goes through a threat hunter’s mind when they see a graph like this? It could represent something benign like a file backup, a policy change, or some other routine high-volume file server activity. Or it could indicate a hostile scenario such as a recon, ransomware, or a mass file delete. This triage process can be hectic and stressful until the cause of the unusual activity is run to ground. In order to determine threat or not-threat, the hunter will need to dig further, but first, the unusual activity must be flagged.

To reduce the burden on hunters and the security workload in general, JASK’s Trident product combines multiple behaviors, like change points, signals, and threat intelligence into a greatly reduced number of Smart Alerts that merit greater attention. But any analyst or hunter can benefit from a better understanding of how to better detect changes on the network.

One good initial step in a statistical analysis of data is to determine the distribution of data. It can be tempting to start off by calculating high-level or coarse statistical measures like mean (average) and variance (square of standard deviation). But these measures are only valid if the data really does fit a standard normal distribution, the oft-cited “bell curve.” A quick look at the above time series shows that this data does not quite meet this assumption. In fact, research has shown that network traffic does not generally follow a normal distribution, but tends to be heavy-tailed, like the data example here.

This graph is zoomed into the area around the bulk of the data, so we can see the shape of the distribution more clearly. There are a small number of outliers, in particular, the dramatic activity peak on Wednesday morning.

Going back to analyzing some of the descriptive statistics associated to these values we see that the histogram is known as a skewed (or heavy-tailed) distribution. Network counts for instance are skewed, and in this case mean is a poor indicator of data centrality. Similarly, variance is a poor estimate for how much spread is typical in the skewed data sets. A commonly-used indicator of change in simple anomaly detection applications is building a threshold rule, such as “three sigma,” which is three times the standard deviation above and below the average trend of values in the timeseries.

In practice when we use these type of threshold rules on skewed data we sometimes run into noisy situations, where the model raises too many outliers or not any at all. For example In the output below we have overlaid some useful descriptive statistics. In the example below we end up computing a standard deviation that is less than zero, which is hard to interpret for counts. Furthermore looking at the distribution of values in relation to the outlier, we see that the mean and standard deviation are concentrated towards the skewed part of the distribution.

An alternate approach is to use median instead of mean. As shown in the same figure, median is a much more reliable measure of data centrality. Median is also less prone to pull from small numbers of outliers and other extreme data. A common change point detection metric used based on median is median absolute deviation, or MAD, defined like:

In the example above we see if we use MAD for flagging change points or outlier the values we get a much less noisy threshold. As can be seen from the chart above, this would only trigger on the one unusual hour in the observed data. This might seem to be what we want, however, what if the data is collected or processed in 10 min or 1 min bins instead of hourly bins? How does this impact the distribution?

The standard deviation is often a poor measure of data “dispersion” or spread on skewed data sets commonly used in cyber security modeling. The same chart shows the alternate measure of dispersion based on median, or MAD. Again, this appears to fit our visual intuition. Also, using a threshold of 3 or 4 times MAD would ignore most of the data, but flag a small number of high-end data points in addition to the extreme outlier - which is way off the right end of of this zoomed in chart.

From a scalability perspective we have to answer one key question: are we concerned about performing these calculations on the vast amount of network data collected on real large-scale networks? The answer is “no” because streaming implementation of the median exist, based on a mini-heap computation, which scales as O(n log n), principally because as new data arrives, a sort must be performed so that the mid-point can be identified, or the P^{2 }algorithm, which is more memory efficient than using heaps.

As mentioned already, barring a clear signature of a known attack tool or use of malicious code, which for a wide variety of reasons is becoming more rare, the threat hunter requires some starting-point for investigations. Changes in the network are one of those key behaviors, and using a metric that suits the heavy-tailed distribution of network activity is essential to keep false alerts in check while still being sensitive to moderate changes. Correlating these and other types of changes in context can be performed using an manual correlations and SIEM. More and more, through additional machine learning automation. This next level of contextual automation is where JASK is focusing, and we will discuss some of the ways this can be reliably done is subsequent posts in this series.

### Auto-Discovery of Communities from Network Data

## Prologue

The Data Science team at JASK got its start well before I had even heard of JASK; Joseph Zadeh (Director of Data Science) and I met in San Francisco when we worked together on papers for a User Behavior Analytics (UBA) workshop for SIG KDD 2016. The experience of working together on ideas around UBA, led to many other discussions on related and not-so-related topics such as scaling algorithms to run in near-real time on distributed systems, defining a minimal “cyber game” for training a deep reinforcement learner, and what the next steps would be in machine learning for cyber security. For both Joseph and I, the next step was to join the extremely talented team at JASK and contribute to building next-generation machine learning technology to automate triage in the SOC.

This post kicks off a blog series by our Data Science team. We hope it will foster a conversation in the community and capture some of the excitement we have found through whiteboarding ideas and prototyping algorithms; this is truly an exciting time during which machine learning is finally starting to make an real impact for the network security analyst. We will highlight algorithms and analytics we are using at JASK in production, interesting ideas from customer and team analysts in hopes of achieving this bigger goal.

## Introduction

One of the most challenging aspects of developing reliable and accurate machine learning algorithms for network security applications is the availability of training data, particularly labeled training data. One can still find recent academic research papers citing the use of these ancient data sets such as the Enron email corpus and the SIG-KDD Cup Challenge from 1999. Even when current data is available, it is rarely labeled. Labels can be seen as providing answers to question within the data such as: Which IP’s are servers? What are normal traffic patterns? Which devices are BYOD? Labeled data is critical for training and testing so that models will be able to learn structure implied by the labels. Also, basic infrastructure information and nominal network activity patterns are critical for providing context and structure to both automated and human analyses of events.

One way to obtain this type of information that is both low-cost and low-impact is to auto-discover network communities and prominent nodes from flow data. We looked at two algorithms: PageRank (https://en.wikipedia.org/wiki/PageRank) and a probabilistic matrix factorization model described here (http://ideal.ece.utexas.edu/pubs/pdf/2015/kdd%20wksp%202015%20acsz15.pdf).

## PageRank

Briefly, PageRank is a variant of the Eigenvector centrality method of analyzing linked networks. It was originally, and famously, formulated to find popular nodes in referential networks such as the web (see https://en.wikipedia.org/wiki/PageRank for an overview including early use in search results at Google.) Rather than using a simple metric, such as number of links or centrality, PageRank factors the reputation of nodes so that, for instance, fewer links from more reputable nodes will count more than many links from low-ranking ones. Thus, PageRank is calculated iteratively, although there are algebraic methods.

Here, we’ve used PageRank on a network of connected computer systems, using flows to define “links.” Highly-ranked devices will be “popular” systems in the network and are likely to be traffic aggregation points, such as proxies.

## Matrix Factorization

Networks are often represented, for the purposes of mathematical modeling, as an association matrix.

*You can see from an actual association matrix that since most nodes do not communicate with each other, the data is highly sparse, or most cells in the matrix are zero. This means that typical methods for working with matrices (that often require inversion) do not work at all.*

This is a square matrix (can be thought of as a data table) with the same set of row indices and column indices. Each row corresponds to a source IP, each column to a destination IP, and cell values correspond to the number of connections (or packets, or bytes) between source and destination. The diagonal is typically all “0”s since self-connections are not considered, nor observed. We noted that PageRank is a variant of Eigenvector centrality, so although implementation-dependent, PageRank also uses an association matrix input format. Association matrices often make three simplifying assumptions:

- Binary connection: there is either a connection (communication) between two nodes, or there is not
- Symmetric connections: the direction of communication is not important, so the upper-right half of the matrix is identical to the lower-left half
- Homogeneous edges or connections: communication or connections are the same in all cases, or at least irrelevant to the analysis

These assumptions are made for abstracting the problem into a numerical model, and effectively to simplify the mathematical analysis. However, when working with network flow data, none of these assumptions hold.

- Connections between hosts are measured based on the number of sessions, the number of packets, the number of bytes, etc.
- Most connections are highly asymmetric, due to client-server configuration, or other topologies
- At a minimum, level 4 and 7 information (port and application) provide valuable labels to the actual heterogeneous network

The simplest matrix factorization model is non-negative matrix factorization, which assumes that we can factor the observable association matrix, Y, which is typically highly sparse, into a more dense factor matrix, . Note that the factor matrix, is multiplied by its transpose to compose the original observation matrix, as shown in the figure below. The reason is called a factor matrix is that each column will represent a “group” in the network, and the vector of ips in that group with the highest weight will be linked by their connectivity in the original association matrix.

Such matrix factorization models work satisfactorily, however, they require an initial specification of the anticipated number of groups (factors), K. In addition, once trained, the model is rigid with respect to new data. The straightforward way to relax these conditions is to turn this into a probabilistic model. The most widely used probabilistic model of this type is the mixed membership stochastic blockmodel (MMSB) (http://www.jmlr.org/papers/volume9/airoldi08a/airoldi08a.pdf). However, our experiments with using MMSB to model communities of network activity yield poor results. One reason may be that the generative probabilistic model assumed in MMSB, which holds that community membership, is drawn from a multinomial distribution (under the assumption that each node in the network has a probability of belonging to any of the active communities in the network). Although this sounds superficially plausible, it requires that each node be a member of all communities at some non-zero level, which does not match intuition nor descriptive statistics. Instead, we can take a different approach and recognize that since the association matrix is composed of *counts* of interactions between nodes, we can model this data using a Poisson process (a statistical definition we sometimes use for modeling arrival times or discontinuous behaviors) , but this makes both the mathematical model and the inverse inference of the model parameters from data much more challenging. The below figure illustrates a model we helped to develop that extends matrix factorization into a Bayesian framework, where factors are drawn from a Poisson process, using a Gamma (conjugate) prior. The mathematical details get deep here; if you’re interested beyond this point, I recommend you check out one of the academic references published on this work (http://ideal.ece.utexas.edu/pubs/pdf/2015/actz15.pdf, http://ideal.ece.utexas.edu/pubs/pdf/2015/kdd%20wksp%202015%20acsz15.pdf)

Although this looks similar to the basic matrix factorization model, inferring the factors, vectors in , is far more complicated. See the references for some of the details. This model has a couple of advantages:

- Dynamic discovery of group structures: K does not need to be specified in advance, the model will automatically find the number of communities in the network based on the data. This is called a non-parametric model.
- Both binary association matrices and count-based association matrices can be modeled. This addressed frequent problematic assumption #1 in network models (mentioned above).

## Results

The following graphs show node weights for PageRank and color-coded communities inferred by the factorization model, respectively. Recall that a node is just a system in our model and so the most prominent nodes detected by PageRank were also found to be key community nodes by the factor model. The factor model identified 5 prominent interaction communities. We investigated the activity of key nodes in the two strongest communities, one of which overlapped strongly with PageRank strength. Interestingly, all these nodes were acting as http proxies. It is unclear, on the basis of this initial investigation, if the other communities on this network also exhibit this pattern.

Fig 1. Nodes are sized in this figure based on PageRank ranking. Node are colored based on the community detected model. Node the two prominent PageRank nodes in the “orange” and “blue” communities. (These PageRank results were produced by Scott Woods from our Engineering team, during our last Hackathon.)

Fig 2. The primary communities inferred by the matrix factorization model are color coded. These figures were produced with the Python networkx and matplotlib libraries..

As mentioned at the outset, focus is often placed on training machine learning algorithms for detection and classification, but the first step is to orient algorithms by preparing data for training. Skipping this step risks under-conditioning, which results in poor metrics (high false-positives, etc.). Auto-detecting baseline network activity can provide one way to label and condition collected data so that it is not just dumped into a single giant “bin”, but rather subdivided into bins that are useful in a conditional probability model and meaningful to analysts. It’s this last aspect that is perhaps most valuable, for providing real context to alerts and investigations.

Many promising data mining and machine learning models fail to be useful in practice in part because of a failure to solve real-world problems in a reliable way. Consider the problem of the SOC analyst that must investigate every alert in the context of substantial local configuration and domain knowledge. Training and tuning detection and alert analytics and providing context to analysts using a data-driven model of network topology and nominal activity, is but one piece of the puzzle.

## Conclusion

Since JASK can run computations on live and historical data in the cloud, it’s straightforward to prototype algorithm and analytic ideas, whether as a customer or a company analyst. In this case, I was motivated by our recent hackathon to see if community detection methods could learn nominal activity patterns on a customer network. PageRank was able to identify a small number of key nodes. A more sophisticated Poisson matrix factorization model not only identified these same nodes but also several other similar nodes and the related activity networks. Extensions to the factorization model, like including protocol information (link heterogeneity) and allowing for directed communications (an asymmetric association matrix) are likely to further improve the quality and reliability of the inferred network communities.